When you want to get to know and love your data

Posts tagged “O-PLS-DA

High Dimensional Biological Data Analysis and Visualization

High dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experiential design and made up of complex relationships driven by both biological and analytical sources of variance. Luckily the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action or view a more indepth presentation of the techniques used below.

Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.

Tissue network cancer

Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Cancer tissue network

Hierarchical clusters (color) mapped to a biochemical and structural similarity network displaying difference before and after drug administration.

cough syrup network

Partial correlation network displaying changes in metabolite relationships in response to drug treatment.

Treatment response network

Partial correlation network displaying changes in disease and response to drug treatment.

Treatment effects network

Check out the full presentation below.

Creative Commons License

Tutorials- Statistical and Multivariate Analysis for Metabolomics

2014 winter LC-MS stats courseI recently had the pleasure in participating in the 2014 WCMC Statistics for Metabolomics Short Course. The course was hosted by the NIH West Coast Metabolomics Center and focused on statistical and multivariate strategies for metabolomic data analysis. A variety of topics were covered using 8 hands on tutorials which focused on:

  • data quality overview
  • statistical and power analysis
  • clustering
  • principal components analysis (PCA)
  • partial least squares (O-/PLS/-DA)
  • metabolite enrichment analysis
  • biochemical and structural similarity network construction
  • network mapping

I am happy to have taught the course using all open source software, including: R, and Cytoscape. The data analysis and visualization were done using Shiny-based apps:  DeviumWeb and MetaMapR. Check out some of the slides below or download all the class material and try it out for yourself.

Creative Commons License
2014 WCMC LC-MS Data Processing and Statistics for Metabolomics by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Special thanks to the developers of Shiny and Radiant by Vincent Nijs.

Classification with O-PLS-DA

unnamed-chunk-5Partial least squares (PLS) is a versatile algorithm which can be used to predict either continuous or discrete/categorical variables. Classification with PLS is termed PLS-DA, where the DA stands for discriminant analysis.  The PLS-DA algorithm has many favorable properties for dealing with multivariate data; one of the most important of which is how variable collinearity is dealt with, and the model’s ability to rank variables’ predictive capacities within a multivariate context. Orthogonal signal correction PLS-DA or O-PLS-DA is an extension of PLS-DA which seeks to maximize the explained variance between groups in a single dimension or the first latent variable (LV), and separate the within group variance (orthogonal to classification goal) into orthogonal LVs. The variable loadings and/or coefficient weights from a validated O-PLS-DA model can be used to rank all variables with respect to their performance for discriminating between groups. This can be used part of a dimensional reduction or feature selection task which seek to identify the top predictors for a given model.

Like with most predictive modeling or forecasting tasks, model validation is a critical requirement. Otherwise the produced models maybe overfit or perform no better than coin flips. Model validation is the process of defining the models performance, and thus ensuring that the model’s internal variable rankings are actually informative.

Below is a demonstration of the development and validation of an O-PLS-DA multivariate classification model for the famous Iris data set.

O-PLS-DA model validation  Tutorial

The Iris data only contains 4 variables, but the sample sizes are favorable for demonstrating a two tiered testing and training scheme (internal and external cross-validation). However O-PLS really shines when building models with many correlated variables (coming soon).

Sessions in Metabolomics 2013

The international summer sessions in metabolomics 2013 came to a happy conclusion this past Friday Sept 6th 2013.  I had the pleasure of teaching the topics covering metabolomic data analysis. The class was split into lecture and lab sections. The lab section consisted of a hands on data analysis of:

  • fresh vs. lyophilized treatment comparison for tomatillo  leaf primary metabolomics
  • tomatillo vs. pumpkin leaf primary metabolites

The majority of the data analyses were implemented using the open source software imDEV and Devium-web.

Download the FULL LAB. Take a look at the goals folder for each lesson.  You can follow along with the lesson plans by looking at each subsections respective excel file (.xlsx). When you are done with a section unhide all the worksheets (right click on a tab at the bottom) to view the solutions .

The lectures, preceding the lab, covered the basics of metabolomic data analysis  including:

  • Data Quality Overview and Statistical Analysis
  • Multivariete Data analysis
  • Metabolomic Case Studies

Orthogonal Signal Correction Partial Least Squares (O-PLS) in R


I often need to analyze and model very wide data (variables >>>samples), and because of this I gravitate to robust yet relatively simple methods. In my opinion partial least squares (PLS) is a particular useful algorithm. Simply put, PLS is an extension of principal components analysis (PCA), a non-supervised  method to maximizing  variance explained in X, which instead maximizes the covariance between X and Y(s). Orthogonal signal correction partial least squares (O-PLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between X and Y on the first latent variable, and components >1 capture variance in X which is orthogonal (or unrelated) to Y.

Because R does not have a simple interface for O-PLS, I am in the process of writing a package, which depends on the existing package pls.

Today I wanted to make a small example of conducting O-PLS in R, and  at the same time take a moment to try out the R package knitr and RStudio for markdown generation.

You can take a look at the O-PLS/O-PLS-DA tutorials.

I was extremely impressed with ease of using knitr and generating markdown from code using RStudio. A big thank you to Yihui Xie and the RStudio developers (Joe Cheng). This is an amazing capability which I will make much more use of in the future!

Network Mapping Video

Here are a video and slides for a presentation of mine about my favorite topic :

Multivariate Modeling Strategy

The following is an example of a clinical study aimed at identification of circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).

The challenge is to make sense of 300 metabolic measurements for 300 patients.

The goal is to identify metabolites related to disease, while accounting covariate meta data such as gender and smoking.

The steps

  1. Exploratory Data Analysis – principal components analysis (PCA)
  2. Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
  3. Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)

Data exploration is useful for getting an idea of the data structure and to identify unusual or unexpected trends.

PCA raw

PCA above conducted on autoscaled data (300 samples and 300 measurements) was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately the goal of the study, disease severity, could not explain this pattern (top center). An  unknown covariate was identified causing the observed clustering of samples (top right).

Next various covariate adjustment strategies were applied to the data and evaluated using the unsupervised PCA (bottom left) and the supervised O-PLS-DA.

feture selection O-PLS-DA

Even after the initial covariate adjustment for the 2-cluster effect there remained a newly visible covariate (top ;left), the source of which could not me linked to the meta data.

After data pre-treatment and evaluation of testing strategies (top right) the next challenge is to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify its performance.

feture selection O-PLS-DA

Variable correlation with O-PLS-DA sample scores and magnitude of variable loading in the model were used to select from the the full feature set (~300)   only 64 (21%) top features which explained most of the models classification performance.

Feature Selection

In conclusion preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal to identify metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal data covariate adjustment and hypothesis testing strategy. Finally O-PLS-DA modeling including feature selection, training/testing validations (n=100) and permutation testing (n=100) were used to identify the top features (21%) which were most predictive of patients classifications as displaying or not displaying the disease phenotype.