When you want to get to know and love your data

Posts tagged “covariate adjustment

Tutorials Covering Biological Data Analysis Strategies

I’ve posted two new tutorials focused on intermediate and advanced strategies for biological, and specifically metabolomic data analysis (click titles for pdfs).




Multivariate Modeling Strategy

The following is an example of a clinical study aimed at identification of circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).

The challenge is to make sense of 300 metabolic measurements for 300 patients.

The goal is to identify metabolites related to disease, while accounting covariate meta data such as gender and smoking.

The steps

  1. Exploratory Data Analysis – principal components analysis (PCA)
  2. Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
  3. Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)

Data exploration is useful for getting an idea of the data structure and to identify unusual or unexpected trends.

PCA raw

PCA above conducted on autoscaled data (300 samples and 300 measurements) was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately the goal of the study, disease severity, could not explain this pattern (top center). An  unknown covariate was identified causing the observed clustering of samples (top right).

Next various covariate adjustment strategies were applied to the data and evaluated using the unsupervised PCA (bottom left) and the supervised O-PLS-DA.

feture selection O-PLS-DA

Even after the initial covariate adjustment for the 2-cluster effect there remained a newly visible covariate (top ;left), the source of which could not me linked to the meta data.

After data pre-treatment and evaluation of testing strategies (top right) the next challenge is to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify its performance.

feture selection O-PLS-DA

Variable correlation with O-PLS-DA sample scores and magnitude of variable loading in the model were used to select from the the full feature set (~300)   only 64 (21%) top features which explained most of the models classification performance.

Feature Selection

In conclusion preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal to identify metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal data covariate adjustment and hypothesis testing strategy. Finally O-PLS-DA modeling including feature selection, training/testing validations (n=100) and permutation testing (n=100) were used to identify the top features (21%) which were most predictive of patients classifications as displaying or not displaying the disease phenotype.

Visualization of Multivariate Biological Models (PLS-DA and O-PLS-DA)

Its not uncommon to be faced by multiple questions at the same time. For instance imagine the following experimental design. You have one MAIN question: what is different between groups A  and B, but among groups A and B are subgroups 1 and 2. This complicates things because now the answer to the MAIN question (what is different between A and B) may be slightly different for the two sub groups A|1, A|2 and B|1, B|2.



In statistics we can account for these types of experimental designs by choosing different tests. For instance in the case outlined above we could use a two-way analysis of variance (2-way ANOVA) to identify differences between A|B which are independent of differences between 1|2 (and interaction between A|B and 1|2). In the case of multivariate modeling we can achieve a similar effect by using covariate adjustments. For example we can use the residuals from a simple linear model for  differences between 1|2 as the 1|2-effect adjusted data to be used to test for differences between A|B. Here is a visual example of this approach using:

2) 1|2–adjusted PLS-DA model for A|B
1) PCA to evaluate the data variance between A and B (GREEN and RED) and 1 and 2 (SMALL or LARGE)

3) 1|2–adjusted O-PLS-DA model for A|B

Based on the PCA we see that the differences between A|B are also affected by 1|2. This is evident in distribution of scores based on LARGE|SMALL among A ( A|1 (GREEN|SMALL) is more different (further right) from all B than A|2 (GREEN|LARGE). The same can be said for B, and in particular the greatest differences between all groups is between those which have the greatest separation in the X-axis (1st principal component) which are RED|LARGE and GREEN|SMALL. 

To identify the greatest difference between RED|GREEN which is independent of differences due to SMALL|LARGE, we can use a SMALL|LARGE -adjusted data to create a PLS-DA model to discriminate between RED|GREEN.

This projection of the differences between A|B is the same for SMALL|LARGE groups. Ideally we want the two groups scores to be maximally separated in the X-axis or 1st LV. We see that this is not the case above, and instead the explanation of how the variables  contribute to differences between  GREEN|RED needs to be answered by explaining scores variance in X and Y axes or  two dimensions.

Next we try the O-PLS-DA algorithm, which aims to rotate the projection of the data to maximize the separation between GREEN|RED on the X-axis and capture unrelated or orthogonal variance on the Y-axis.
The O-PLS-DA model loadings for the 1st LV provide information regarding differences in variable magnitudes between the two groups (GREEN|RED).

We can use network mapping to visualize these weights within a domain specific context. In the case of metabolomics data this is best achieved using biochemical/chemical similarity networks.

We can create these networks by assigning edges between vertices (representing metabolites) based on biochemical relationships (KEGG RPAIRs ) or chemical similarities (Tanimoto coefficient >0.7). We can then map the O-PLS-DA model loadings to this network’s visual properties (vertex: size, color, border, and inset graphic).


For example we can map vertex size to the matabolite’s importance in the explained discrimination between groups (loading on O-PLS-DA LV 1) and color the direction of change (blue, decrease; red, increase). Metabolites displaying significant differences between RED and GREEN groups (two-way ANOVA, p < 0.05 adjusting for 1|2) are shown at maximum size, with a black border and contain a box-plot  visualization.

Here is  network mapping the O-PLS-DA model loadings into a biological context and displaying graphs for import parameters means among groups stratified by A|B and 1|2 (left to right: A|1, A|2,B|1,B|2).


Here is  another network with the same edge and vertex properties as above, except the inset graphs show differences between groups A|B adjusted for the effect of 1|2.


Covariate Adjustement for PLS-DA Models

A typical experiment may involve the testing of a wide variety of factors. For instance, here is an example of an experiment aimed at determining metabolic differences between two plant cultivars at three different ontological stages and in two different tissues. Exploratory principal components analysis (PCA) can be used to evaluate the major modes of variance in the data prior to conducting any univariate tests.

PCA complete data set

Based on the PCA (autoscaled data) we can see that the majority of the differences are driven by differences between tissues. This is evident from the scores separation in (a) between leaf and fruit tissues, which is driven by metabolites with large positive/negative loadings  on the first dimension or x-axis in (b). A lesser mode of variance is captured in the second dimension, and particularly in fruit we can see that there is some separation in scores between the two cultivars and their different ontological stages. Based on this it was concluded to carry out test in leaf and fruit tissue separately. Additionally in order to identify the effects of cultivar on the metabolomic profiles which are independent of stage and vice versa, a linear covariate adjustments were applied to the data.  

covar adjusted data

Again using PCA and focusing on fruit tissue, we can evaluate the variance in the data given our hypotheses (differences between cultivars or stages). Looking at (a) we can see that there is not a clear separation in scores in any one dimension between cultivars or stages. However there is separation in two dimensions. This is problematic in that this suggest that there is an interaction between cultivar and stage, which will complicate any univariate tests for these factors. We can see that carrying out linear covariate adjustment either for  cultivar (b) or stage (c) translate the variance for the target hypothesis into one dimension, which therefore simplifies its testing. Note, this is exactly what is done when doing an analysis of covariance or ANCOVA. However if we want to use this same favorable variance environment for multivariate modeling like for example partial least squares projection to latent structures discriminant analysis (PLS-DA) we need to covariate adjust the data which in this case is achieved by taking the residuals from linear model for the covariate we want to adjust for.

PLS-DA of covariate adjusted data

Now that we have adjusted the data for covariate effects we can test the primary hypotheses (differences between cultivars, stages and tissues) using PLS-DA. Quick  visual inspection of model scores can be used to get a feel for the quality of the models. Ideally we would like to see a scores separation between the various levels of our hypotheses in one dimension.  We can see that  both fruit models  are of higher quality than that for leaf.  However to fully validate these models we  need to carry out some permutation testing or something similar. The benefit of PLS-DA is that   we can use  the information about the variables contribution to the scores separation or loadings to identify metabolomic differences between cultivars or with increasing maturity or stage.

net cultHere is an example where PLS-DA variable loadings are mapped into a biochemical context using a chemical similarity network.  This network represents differences in metabolites due to cultivar, wherein significant differences  in metabolite means (ANCOVA, FDR adjusted p-value < 0.05)  between cultivars are represented by nodes or vertices which are colored  based on the sign of the loading and their size  used to encode the magnitude of their loading in the model.

net 2

We can now compare the two networks  representing metabolomic differences due to cultivar (far top) or to stage (above) to identify biochemical changes due to these factors which are independent of each others effects (or interaction).