When you want to get to know and love your data

Archive for January, 2013

Modeling Short-term Glucose Effects on Primary Metabolism


Power Calculations – relationship between test power, effect size and sample size

I was interested in modeling the relationship between power and sample size, while holding the significance level constant (alpha = 0.05), for the common two-sample t-test. Luckily R has great support for power analysis: base R's stats package provides power.t.test(), and the pwr package offers the analogous pwr.t.test().

To calculate the power for the two-sample t-test at different effect and sample sizes I needed to wrap the basic function power.t.test().

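# power.t.test() is in the stats package (loaded by default); pwr::pwr.t.test() is the pwr analogue

# t-TEST

d<-seq(.1,2,by=.1) # effect sizes (equal to Cohen's d because sd defaults to 1)
n<-1:150 # sample sizes (per group)

# wrap power.t.test(): one row per sample size, one column per effect size
t.test.power.effect<-as.data.frame(sapply(d,function(di){
	sapply(n,function(nj){
		power.t.test(n=nj,delta=di,sig.level=0.05,power=NULL,type="two.sample")$power
	})
}))

t.test.power.effect[is.na(t.test.power.effect)]<-0 # some powers couldn't be calculated, set these to zero
colnames(t.test.power.effect)<-paste(d,"effect size")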
The object t.test.power.effect is a 150 x 20 data frame which lists the power for 1 to 150 samples per group and effect sizes from 0.1 to 2 in steps of 0.1. While this is useful as a lookup table, we would ideally like to see a visualization of it. Here is some example code to plot this data using the base and ggplot2 packages.

#plot results using base
obj<-t.test.power.effect # object to plot
cols<-1:ncol(obj) # columns (effect sizes) to plot
color<-rainbow(length(cols), alpha=.5) # colors
lwd=5 # line thickness
lty<-rep(1,length(cols)) # line types

#highlight important effect sizes
imp<-c(2,5,8) # cuts (d = 0.2, 0.5 and 0.8)
cuts<-c("small","medium","large") # based on cohen 1988
color[imp]<-"black" # draw the cutoffs in black
lty[imp]<-2:(length(imp)+1) # and with distinct line types
wording<-d # legend labels
wording[imp]<-cuts

#initialize plot
plot(1,type="n",frame.plot=FALSE,xlab="sample size",ylab="power",xlim=c(1,150),ylim=c(0,1),main="t-Test", axes = FALSE)
#add custom axis and grid
abline(v=seq(0,150,by=10),col = "lightgray", lty = "dotted")
abline(h=seq(0,1,by=.05),col = "lightgray", lty = "dotted")
axis(1,seq(0,150,by=10))
axis(2,seq(0,1,by=.05))
#plot lines
for(i in 1:length(cols)){lines(1:150,obj[,cols[i]],col=color[i],lwd=lwd,lty=lty[i])}
legend("top",legend=wording,col=color,lwd=3,lty=lty,title="Effect Size",bty="n")

Which makes the following graph.
power calculation for t-test

Based on this graph, we can see the relationship between power, effect size and sample number. I've marked the cutoffs suggested by Cohen (1988) delineating small, medium and large effect sizes. If we are designing an experiment and trying to select a sample size for which our test will be powered at 0.8, we need to consider the expected effect of our experimental treatment. If we think that our treatment should have a moderate effect, we should consider somewhere around 60 samples per group. However, an even better analysis would be to directly calculate the sample number needed to achieve some power and significance level given experimentally derived effect sizes based on preliminary data!
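The direct calculation mentioned above can be sketched with the same power.t.test() function by leaving n unspecified and supplying the target power instead. This is only a minimal sketch: the effect sizes are Cohen's conventional cutoffs rather than effects estimated from preliminary data, sd is assumed to be 1 so that delta equals Cohen's d, and the object names are mine.

```r
# Sketch: solve for the per-group sample size instead of reading it off a curve.
# Assumptions: two-sided, two-sample t-test; sd = 1 so delta equals Cohen's d;
# the effect sizes are Cohen's conventional cutoffs, not measured effects.
effects <- c(small = 0.2, medium = 0.5, large = 0.8)

n.needed <- sapply(effects, function(d) {
  # leaving n = NULL asks power.t.test() to solve for n
  res <- power.t.test(n = NULL, delta = d, sd = 1, sig.level = 0.05,
                      power = 0.8, type = "two.sample")
  ceiling(res$n) # round up to whole samples per group
})

print(n.needed)
```

For the medium effect this comes out at 64 samples per group, in line with the "around 60" read off the graph. With real preliminary data, delta and sd would be replaced by the observed mean difference and pooled standard deviation.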

And just for kicks here is the same data plotted using ggplot2.

#plot results using ggplot2
library(ggplot2) # plotting
library(reshape2) # melt()
dev.new() # open a new graphics device (portable alternative to x11())
obj<-cbind(size=1:150,t.test.power.effect) #flip object for melting
melted<-cbind(melt(obj, id.vars="size"),effect=rep(d,each=150)) # melt and bind with effect for mapping
ggplot(data=melted, aes(x=size, y=value, color=as.factor(effect))) + geom_line(linewidth=2,alpha=.5) + # linewidth is called size in ggplot2 < 3.4.0
ylab("power") + xlab("sample size") + ggtitle("t-Test")+theme_minimal()

# wow ggplot2 is amazing in its brevity
# need to tweak legend and lty, but otherwise very similar
power calculation for t-test ggplot2
With a little tweaking these graphs are basically the same. Wow, I really need to stop using base for my plots and fully embrace learning ggplot2!

Covariate Adjustment for PLS-DA Models

A typical experiment may involve the testing of a wide variety of factors. For instance, here is an example of an experiment aimed at determining metabolic differences between two plant cultivars at three different ontogenetic (developmental) stages and in two different tissues. Exploratory principal components analysis (PCA) can be used to evaluate the major modes of variance in the data prior to conducting any univariate tests.

PCA complete data set

Based on the PCA (autoscaled data) we can see that the majority of the variance is driven by differences between tissues. This is evident from the scores separation in (a) between leaf and fruit tissues, which is driven by metabolites with large positive or negative loadings on the first dimension (x-axis) in (b). A lesser mode of variance is captured in the second dimension, and particularly in fruit we can see that there is some separation in scores between the two cultivars and their different ontogenetic stages. Based on this we decided to carry out tests in leaf and fruit tissue separately. Additionally, in order to identify the effects of cultivar on the metabolomic profiles which are independent of stage, and vice versa, linear covariate adjustments were applied to the data.
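As a rough illustration of this kind of exploratory PCA, here is a minimal base R sketch using prcomp(). The data are simulated stand-ins (the real metabolite table is not shown here), and the tissue labels, dimensions and effect size are assumptions chosen only to mimic a dominant tissue effect.

```r
# Sketch: PCA on autoscaled (mean-centered, unit-variance) data with base R.
# Assumption: simulated data standing in for the metabolite table; 'tissue'
# and the size of the tissue effect are made up for illustration.
set.seed(1)
n.samples <- 40
n.metab <- 10
tissue <- factor(rep(c("leaf", "fruit"), each = n.samples / 2))

X <- matrix(rnorm(n.samples * n.metab), nrow = n.samples,
            dimnames = list(NULL, paste0("metabolite", 1:n.metab)))
# give the first five metabolites a strong tissue effect
X[tissue == "fruit", 1:5] <- X[tissue == "fruit", 1:5] + 4

pca <- prcomp(X, center = TRUE, scale. = TRUE) # autoscaling
scores <- pca$x          # sample coordinates (scores)
loadings <- pca$rotation # variable contributions (loadings)
explained <- pca$sdev^2 / sum(pca$sdev^2) # fraction of variance per component

# mean score on the first component for each tissue
pc1.by.tissue <- tapply(scores[, 1], tissue, mean)
print(pc1.by.tissue)
print(round(explained[1:2], 2))
```

Plotting scores[, 1] against scores[, 2] colored by tissue gives a scores plot like (a), and the first two columns of loadings correspond to a loadings plot like (b).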

covar adjusted data

Again using PCA and focusing on fruit tissue, we can evaluate the variance in the data given our hypotheses (differences between cultivars or stages). Looking at (a) we can see that there is no clear separation in scores between cultivars or stages in any single dimension. However, there is separation in two dimensions. This is problematic because it suggests that there is an interaction between cultivar and stage, which will complicate any univariate tests for these factors. We can see that carrying out linear covariate adjustment for either cultivar (b) or stage (c) translates the variance for the target hypothesis into one dimension, which simplifies its testing. Note that this is exactly what is done in an analysis of covariance (ANCOVA). However, if we want to use this same favorable variance environment for multivariate modeling, for example partial least squares projection to latent structures discriminant analysis (PLS-DA), we need to covariate adjust the data, which in this case is achieved by taking the residuals from a linear model for the covariate we want to adjust for.
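Taking residuals from a linear model can be sketched in a few lines of base R. This is a hedged illustration rather than the code used for the figures: the data are simulated, and the factor names, group sizes and effect sizes are assumptions.

```r
# Sketch: adjust every metabolite for a covariate by keeping lm() residuals.
# Assumptions: simulated fruit data; 'cultivar', 'stage' and the effect sizes
# are invented for illustration and are not the experiment's values.
set.seed(2)
cultivar <- factor(rep(c("A", "B"), each = 30))
stage <- factor(rep(rep(c("s1", "s2", "s3"), each = 10), times = 2))

n.metab <- 8
X <- sapply(1:n.metab, function(j) {
  # each metabolite responds to both cultivar and stage, plus noise
  1.5 * (cultivar == "B") + 1.0 * as.numeric(stage) + rnorm(length(stage))
})
colnames(X) <- paste0("metabolite", 1:n.metab)

# adjust for stage: regress each column on stage and keep the residuals
# (lm() accepts a matrix response and fits one model per column)
X.adj.stage <- residuals(lm(X ~ stage))

# the stage means are removed, the cultivar difference remains
stage.means <- tapply(X.adj.stage[, 1], stage, mean)
cultivar.means <- tapply(X.adj.stage[, 1], cultivar, mean)
print(round(stage.means, 10))
print(cultivar.means)
```

The adjusted matrix X.adj.stage can then be passed to PCA or PLS-DA to model cultivar differences independent of stage; swapping the roles of stage and cultivar in the lm() call gives the stage-focused version.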

PLS-DA of covariate adjusted data

Now that we have adjusted the data for covariate effects we can test the primary hypotheses (differences between cultivars, stages and tissues) using PLS-DA. Quick visual inspection of model scores can be used to get a feel for the quality of the models. Ideally we would like to see a scores separation between the various levels of our hypotheses in one dimension. We can see that both fruit models are of higher quality than that for leaf. However, to fully validate these models we need to carry out permutation testing or something similar. The benefit of PLS-DA is that we can use the information about the variables' contributions to the scores separation (the loadings) to identify metabolomic differences between cultivars or with increasing maturity or stage.

net cult

Here is an example where PLS-DA variable loadings are mapped into a biochemical context using a chemical similarity network. This network represents differences in metabolites due to cultivar, wherein significant differences in metabolite means (ANCOVA, FDR adjusted p-value < 0.05) between cultivars are represented by nodes or vertices which are colored based on the sign of the loading, with their size used to encode the magnitude of their loading in the model.
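The significance filter described above (ANCOVA followed by FDR adjustment) can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the data are simulated, stage is entered as a numeric covariate, only the first three metabolites are given a true cultivar effect, and Benjamini-Hochberg is assumed as the FDR method.

```r
# Sketch: per-metabolite test of cultivar adjusted for stage, then FDR control.
# Assumptions: simulated data; only the first three metabolites truly differ
# between cultivars; stage is entered as a numeric covariate.
set.seed(3)
cultivar <- factor(rep(c("A", "B"), each = 30))
stage <- rep(rep(1:3, each = 10), times = 2)

n.metab <- 20
X <- sapply(1:n.metab, function(j) {
  effect <- if (j <= 3) 2 else 0 # cultivar effect for metabolites 1 to 3 only
  effect * (cultivar == "B") + 0.5 * stage + rnorm(length(stage))
})
colnames(X) <- paste0("metabolite", 1:n.metab)

# p-value for the cultivar term from the model y ~ stage + cultivar
# (cultivar is entered last, so it is tested after accounting for stage)
p.raw <- apply(X, 2, function(y) {
  fit <- lm(y ~ stage + cultivar)
  anova(fit)["cultivar", "Pr(>F)"]
})

p.fdr <- p.adjust(p.raw, method = "BH") # Benjamini-Hochberg FDR adjustment
significant <- names(p.fdr)[p.fdr < 0.05]
print(significant)
```

Only the metabolites listed in significant would be drawn as nodes, with their color and size then taken from the PLS-DA loadings.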

net 2

We can now compare the two networks representing metabolomic differences due to cultivar (far top) or to stage (above) to identify biochemical changes due to these factors which are independent of each other's effects (or interaction).