When you want to get to know and love your data

Archive for May, 2013

Tutorials Covering Biological Data Analysis Strategies

I’ve posted two new tutorials focused on intermediate and advanced strategies for biological, and specifically metabolomic, data analysis (click titles for PDFs).



Multivariate Modeling Strategy

The following is an example of a clinical study aimed at identifying circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).

The challenge is to make sense of 300 metabolic measurements for 300 patients.

The goal is to identify metabolites related to disease, while accounting for covariate metadata such as gender and smoking.

The steps

  1. Exploratory Data Analysis – principal components analysis (PCA)
  2. Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
  3. Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)

Data exploration is useful for getting an idea of the data structure and for identifying unusual or unexpected trends.

PCA raw

The PCA above, conducted on autoscaled data (300 samples and 300 measurements), was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately, disease severity, the focus of the study, could not explain this pattern (top center). An unknown covariate was identified as the cause of the observed clustering of samples (top right).
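To make the autoscaling and PCA step concrete, here is a minimal sketch using NumPy. The data are synthetic stand-ins (a hidden two-level covariate is planted on purpose), and the function names `autoscale` and `pca_scores` are my own for illustration, not taken from the tutorial.

```python
import numpy as np

def autoscale(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std

def pca_scores(X, n_components=2):
    """Return sample scores and the fraction of variance explained, via SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic stand-in: 300 samples x 300 measurements (assumption, not the study data)
rng = np.random.default_rng(0)
hidden = np.repeat([0, 1], 150)          # hypothetical unknown covariate
X = rng.normal(size=(300, 300))
X[:, :100] += 2.0 * hidden[:, None]      # shift 100 variables in half the samples

scores, explained = pca_scores(autoscale(X))
print("variance explained by PC1 and PC2:", explained)
```

Plotting `scores[:, 0]` against `scores[:, 1]` and coloring the points by different metadata columns is one way to check which variable, if any, explains a cluster structure like the one described above.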

Next, various covariate adjustment strategies were applied to the data and evaluated using unsupervised PCA (bottom left) and supervised O-PLS-DA.
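One common adjustment strategy (an assumption on my part, the tutorial evaluates several and this is only one candidate) is to regress each variable on the covariate and keep the residuals. A minimal sketch on synthetic data, with the hypothetical helper `adjust_for_covariates`:

```python
import numpy as np

def adjust_for_covariates(X, covariates):
    """Regress every column of X on the covariates and return the residuals,
    shifted back to the original column means."""
    n_samples = X.shape[0]
    # Design matrix: intercept plus the covariate column(s)
    design = np.column_stack([np.ones(n_samples), covariates])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    residuals = X - design @ coef
    return residuals + X.mean(axis=0)

# Synthetic example: 20 samples, 5 variables, one two-level covariate
rng = np.random.default_rng(1)
batch = np.repeat([0.0, 1.0], 10)        # hypothetical 2-cluster covariate
X = rng.normal(size=(20, 5))
X[batch == 1] += 3.0                     # covariate effect to be removed
X_adj = adjust_for_covariates(X, batch)
```

After adjustment the two covariate levels share the same mean for every variable, so rerunning PCA on `X_adj` should no longer separate samples along that covariate.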

feature selection O-PLS-DA

Even after the initial covariate adjustment for the 2-cluster effect, there remained a newly visible covariate (top left), the source of which could not be linked to the metadata.

After data pre-treatment and evaluation of testing strategies (top right), the next challenge was to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify the model.

feature selection O-PLS-DA

Variable correlation with the O-PLS-DA sample scores and the magnitude of each variable's loading in the model were used to select, from the full feature set (~300), only the top 64 features (21%), which explained most of the model's classification performance.
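Here is a rough sketch of how a selection based on score correlation and loading magnitude could look. The score vector is synthetic rather than coming from a fitted O-PLS-DA model, the loading is computed PLS-style as a regression of each variable on the scores, and combining the two criteria by summing their ranks is my own assumption, since the exact rule is not spelled out here.

```python
import numpy as np

def select_features(X, scores, n_keep):
    """Rank variables by |correlation with scores| and |loading|, keep the top n_keep."""
    Xc = X - X.mean(axis=0)
    t = scores - scores.mean()
    # Pearson correlation of each variable with the score vector
    corr = (Xc.T @ t) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(t))
    # PLS-style loading: slope of each variable regressed on the scores
    loading = (Xc.T @ t) / (t @ t)
    # Rank 0 = strongest; summing the two ranks is an illustrative choice
    corr_rank = np.argsort(np.argsort(-np.abs(corr)))
    load_rank = np.argsort(np.argsort(-np.abs(loading)))
    combined = corr_rank + load_rank
    keep = np.argsort(combined, kind="stable")[:n_keep]
    return np.sort(keep), corr, loading

# Synthetic stand-in: 300 samples x 300 variables, 64 of them tied to the scores
rng = np.random.default_rng(2)
t = rng.normal(size=300)                 # hypothetical model scores
X = rng.normal(size=(300, 300))
X[:, :64] += t[:, None]
selected, corr, loading = select_features(X, t, n_keep=64)
print(f"kept {len(selected)} of {X.shape[1]} features ({len(selected) / X.shape[1]:.0%})")
```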

Feature Selection

In conclusion, preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal of identifying metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal covariate adjustment and hypothesis testing strategy. Finally, O-PLS-DA modeling, including feature selection, training/testing validations (n=100) and permutation testing (n=100), was used to identify the top features (21%) which were most predictive of patients' classification as displaying or not displaying the disease phenotype.
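For readers new to permutation testing, the idea can be sketched with a much simpler statistic than O-PLS-DA performance. Below, the absolute difference in group means stands in for the model statistic (an assumption made purely for illustration), the values are made up, and the labels are shuffled 100 times to match the n=100 used above.

```python
import random
from statistics import mean

def permutation_p_value(values, labels, n_permutations=100, seed=0):
    """Estimate how often shuffled labels give a statistic at least as large as observed."""
    rng = random.Random(seed)

    def stat(lbls):
        cases = [v for v, l in zip(values, lbls) if l == 1]
        controls = [v for v, l in zip(values, lbls) if l == 0]
        return abs(mean(cases) - mean(controls))

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)            # break the link between values and labels
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# Made-up values: 10 controls near 0-1 and 10 cases near 3-4
values = [0.1 * i for i in range(10)] + [3 + 0.1 * i for i in range(10)]
labels = [0] * 10 + [1] * 10
p_value = permutation_p_value(values, labels)
```

With a real model, `stat` would refit the classifier on the shuffled labels and return its performance, which is far more expensive but follows the same logic.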

Viewing Time-Dimensional Data (in multivariate space)

The idea is that we have collected information about 30 samples at 4 intervals for 200 variables. This makes 30 * 4 * 200 = 24,000 data points!

That is a lot to keep track of if we want to start the data analysis by looking at sample-wise (30) differences in variables (200) which are also dependent on time (4).

One idea is to use orthogonal signal correction partial least squares (O-PLS) to ask the question:

1) what is the most conserved linear ordering of my data, based on

2) a description of my data: 3 groups of samples at 4 points in time plus the starting point, t = 0 (so a total of 5 points in time).

Here is an example O-PLS scores plot for the samples (30 * 5 = 150) with polygons around the boundaries of each unique sample-group classification (3 * 5 = 15).

group polygon
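One common way to draw a boundary polygon around a set of score points is the convex hull. I am assuming that here for illustration, the plot above may have been produced differently. A minimal pure-Python sketch using Andrew's monotone chain algorithm:

```python
def convex_hull(points):
    """Return the convex hull of 2-D points in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counterclockwise turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]

# Made-up (LV1, LV2) scores for one group at one time point
group_scores = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
polygon = convex_hull(group_scores)
```

Running this once per group and time point gives the 15 polygons, each of which can be passed to a plotting library as a closed path.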

We can try to summarize the position of each group in this multivariate space (15 * 200) by plotting each group's median score and standard error for the first two O-PLS latent variables (LVs).
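A minimal sketch of that summary step, assuming the scores are available as `(group, time, lv1, lv2)` records. The record layout and the function name are hypothetical, and the standard error is computed as the sample standard deviation divided by the square root of n.

```python
from collections import defaultdict
from math import sqrt
from statistics import median, stdev

def summarize_groups(records):
    """Map (group, time) -> ((median_lv1, se_lv1), (median_lv2, se_lv2))."""
    grouped = defaultdict(list)
    for group, time, lv1, lv2 in records:
        grouped[(group, time)].append((lv1, lv2))

    summary = {}
    for key, points in grouped.items():
        n = len(points)
        # zip(*points) yields the LV1 values, then the LV2 values
        summary[key] = tuple(
            (median(axis), stdev(axis) / sqrt(n)) for axis in zip(*points)
        )
    return summary

# Made-up scores for one group at t = 0
records = [("A", 0, 1.0, 0.0), ("A", 0, 2.0, 0.0), ("A", 0, 3.0, 3.0)]
summary = summarize_groups(records)
```

Connecting the medians of each group in time order then gives one trajectory line per group.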

scores and lines

Above is an enticing representation of the time-course differences between 3 groups of samples for 5 time measurement points (t = 0, 30, 60, 90 and 120 minutes). Now that we have established how our samples look based on 200 measurements or variables, we can examine the variable loadings for this model.

group loadings

Above, the loadings, or relative contribution of each variable to the description of the samples, are plotted for O-PLS LV1 and LV2. Based on the position of the variables on the x-axis (LV1) we can say something about their relative changes in time (because the O-PLS sample scores are also distributed along the x-axis with respect to time). The variable LV2 loadings (y-axis) can be used to describe differences between the groups (note the sample group classification pattern on the y-axis (LV2), which is independent of the change in time on the x-axis (LV1)).

scores loadings

Above we can visualize how the sample and variable descriptions are related. For instance, variables at the far left of the loadings (FA) start out relatively increased and then decrease as sample position moves to the right. Analogously, as time increases there is an increase in the majority of variables (note the large cloud of loadings on LV1 (x-axis above)).

Another interesting thing to try is to visualize the changes in group scores that are independent of the t = 0 baseline (subtract the t = 0 abundance for all 200 variables from the t = 30, 60, 90 and 120 minute time-points on a sample-wise basis).

baseline group

Above are the baseline (t = 0) normalized changes over time (above left, point color) for three groups of samples (above left, point shape). As before, we can study the relationship between samples and variables on a multivariate basis by comparing the sample scores (position in LV1 and LV2) to the variable loadings.
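The baseline adjustment itself is simple arithmetic. Here is a minimal sketch for a single sample, assuming its measurements are stored as a dictionary from time (in minutes) to a list of variable values. That layout is my own choice for illustration.

```python
def subtract_baseline(sample):
    """Subtract the t = 0 values from every later time point of one sample."""
    baseline = sample[0]
    return {
        t: [value - base for value, base in zip(values, baseline)]
        for t, values in sample.items()
        if t != 0
    }

# Made-up sample with 2 variables instead of 200
sample = {0: [1.0, 2.0], 30: [1.5, 1.0], 120: [3.0, 2.0]}
adjusted = subtract_baseline(sample)
```

Applying this to every sample before modeling leaves 4 adjusted time points per sample, with t = 0 itself dropped because it would be all zeros.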

This process (O-PLS) can be helpful for ranking the original 200 variables in two dimensions (2 lists):

1) with respect to change with time (x-axis)

2) difference between groups (y-axis).
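The two rankings above can be sketched as a pair of sorts on the absolute loadings. The variable names other than FA and all loading values below are made up for illustration.

```python
def rank_by_loading(loadings):
    """Return two lists of variable names, sorted by |LV1| and by |LV2| loading."""
    by_time = sorted(loadings, key=lambda name: abs(loadings[name][0]), reverse=True)
    by_group = sorted(loadings, key=lambda name: abs(loadings[name][1]), reverse=True)
    return by_time, by_group

# Hypothetical (LV1, LV2) loadings
loadings = {
    "FA": (-0.9, 0.1),
    "var_b": (0.2, 0.7),
    "var_c": (0.4, -0.3),
}
by_time, by_group = rank_by_loading(loadings)
```

Keeping the sign of the loading alongside each name is worthwhile in practice, since it indicates whether a variable increases or decreases along that axis.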

It is interesting to note that, without baseline adjustment, the group young NGT has the lowest starting FA (group scores at t = 0 are to the right of the other two groups). The relative differences between group positions at t = 0 and t = 120 can be used to visualize the change in FA over time (a decrease, note the negative loading in LV1).

Finally we can try to connect our multivariate observations with the easily interpretable visualization of a single variable (FA, baseline adjusted) as a box plot representing the medians (horizontal line at the center of each box) and the 25th and 75th percentiles (bottom and top boundaries of the rectangle) for the 3 groups over 4 time points.

group box plots

The box plot visualization above captures a similar trend in the relative positions of the groups as the one we previously described using all 200 variables. This makes sense given the extreme loading observed for FA, and therefore the implied contribution (influence) of this variable to the observed distribution of the sample scores.
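The numbers behind each box (median and 25th/75th percentiles) can be computed with Python's standard library. This is a minimal sketch on made-up values. The `inclusive` method is one of several quantile conventions, and plotting software may use a slightly different one.

```python
from statistics import quantiles

def box_stats(values):
    """Return the quartiles that define the box of a box plot."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": med, "q3": q3}

# Made-up baseline-adjusted values for one group at one time point
stats = box_stats([1, 2, 3, 4, 5])
```

Calling `box_stats` once per group and time point yields the 12 boxes (3 groups over 4 time points) described above.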


In an effort to spread the word on how easy it is to make amazing data visualizations and harness the power of the internet to do science I’ve started a new repository for Biological and Multivariate Data Analysis Tutorials:  TeachingDemos.


Database Query and Translation

Check out an application of my two new R packages, CTSgetR and CIRgetR, for translation between chemical identifiers in R using the Chemical Translation Service (CTS) and the Chemical Identifier Resolver (CIR).


Multivariate Analysis

Network Visualizations