When you want to get to know and love your data

Archive for September, 2013

Classification with O-PLS-DA

Partial least squares (PLS) is a versatile algorithm that can be used to predict either continuous or discrete/categorical variables. Classification with PLS is termed PLS-DA, where DA stands for discriminant analysis. The PLS-DA algorithm has many favorable properties for dealing with multivariate data; among the most important are how it handles variable collinearity and the model’s ability to rank variables’ predictive capacities within a multivariate context. Orthogonal signal correction PLS-DA, or O-PLS-DA, is an extension of PLS-DA that seeks to maximize the explained variance between groups in a single dimension, the first latent variable (LV), and to separate the within-group variance (orthogonal to the classification goal) into orthogonal LVs. The variable loadings and/or coefficient weights from a validated O-PLS-DA model can be used to rank all variables with respect to their performance for discriminating between groups. This ranking can be used as part of a dimension reduction or feature selection task that seeks to identify the top predictors for a given model.
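To make the idea of a predictive LV plus an orthogonal LV a little more concrete, here is a minimal sketch in plain Python of one O-PLS step for a two-class problem. It is not the code used for the tutorial. The data are synthetic, the helper names (`opls_one_orthogonal`, `center_columns`, and so on) are made up for illustration, and the classes are coded as -1/+1, which is an assumption of this sketch.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def t_mat_vec(X, v):
    """Return X^T v (one value per variable)."""
    return [dot([row[j] for row in X], v) for j in range(len(X[0]))]


def mat_vec(X, w):
    """Return X w (one score per sample)."""
    return [dot(row, w) for row in X]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def center_columns(X):
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def opls_one_orthogonal(X, y):
    """One predictive and one orthogonal component for a single response y.

    Assumes X is column-centered, y is centered, and the X loadings are not
    exactly parallel to the predictive weights (otherwise there is no
    orthogonal variation to remove).
    """
    w = unit(t_mat_vec(X, y))                      # predictive weights, proportional to X^T y
    t = mat_vec(X, w)                              # predictive scores
    p = [v / dot(t, t) for v in t_mat_vec(X, t)]   # X loadings for the predictive scores
    wp = dot(w, p)
    w_o = unit([pj - wp * wj for pj, wj in zip(p, w)])  # part of p not aligned with w
    t_o = mat_vec(X, w_o)                          # orthogonal scores
    p_o = [v / dot(t_o, t_o) for v in t_mat_vec(X, t_o)]
    # Remove the orthogonal (within-group) variation from X.
    X_f = [[x - ti * pj for x, pj in zip(row, p_o)] for row, ti in zip(X, t_o)]
    return w, w_o, t_o, X_f


# Synthetic two-group data (NOT real measurements): variable 0 separates the
# groups, variable 1 is collinear with variable 0, variable 2 is unrelated noise.
rng = random.Random(0)
y = [-1.0] * 20 + [1.0] * 20          # balanced -1/+1 coding, so y is already centered
X_raw = []
for g in y:
    signal = 3.0 * g + rng.gauss(0, 0.5)
    correlated = 0.8 * signal + rng.gauss(0, 0.3)
    nuisance = rng.gauss(0, 1.0)
    X_raw.append([signal, correlated, nuisance])

X = center_columns(X_raw)
w, w_o, t_o, X_f = opls_one_orthogonal(X, y)

# Rank variables by the magnitude of their predictive weight.
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
print("predictive weights:", [round(v, 3) for v in w])
print("variables ranked by |weight|:", ranking)
```

The orthogonal scores `t_o` are uncorrelated with the class labels by construction, which is what allows them to be removed without touching the between-group signal. The weights should only be trusted for ranking once the model has been validated.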

As with most predictive modeling or forecasting tasks, model validation is a critical requirement. Otherwise, the resulting models may be overfit or perform no better than coin flips. Model validation is the process of quantifying the model’s performance, and thus ensuring that the model’s internal variable rankings are actually informative.
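One common way to check whether a classifier beats coin flips is a permutation test: compare the cross-validated accuracy on the real labels with the accuracy obtained after shuffling the labels many times. The sketch below is a simplified stand-in rather than the validation used in the tutorial. It assumes a one-component PLS classifier, -1/+1 class coding, 5 folds, 200 permutations, and synthetic data. All function names are hypothetical.

```python
import math
import random


def fit_pls1(X, y):
    """Fit a one-component PLS model and return what is needed to predict."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(k)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(k)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0   # guard against a zero vector
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    tt = sum(v * v for v in t) or 1.0
    b = sum(ti * yi for ti, yi in zip(t, yc)) / tt   # regression of y on the scores
    return x_mean, y_mean, w, b


def predict_class(model, row):
    x_mean, y_mean, w, b = model
    score = sum((x - m) * wj for x, m, wj in zip(row, x_mean, w))
    return 1.0 if y_mean + b * score > 0 else -1.0


def cv_accuracy(X, y, folds=5):
    """k-fold cross-validated accuracy using interleaved folds."""
    n = len(X)
    correct = 0
    for f in range(folds):
        test_idx = [i for i in range(n) if i % folds == f]
        train_idx = [i for i in range(n) if i % folds != f]
        model = fit_pls1([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_class(model, X[i]) == y[i] for i in test_idx)
    return correct / n


# Synthetic data (NOT real measurements): one informative variable, two noise variables.
rng = random.Random(1)
y = [-1.0] * 20 + [1.0] * 20
X = [[3.0 * g + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for g in y]

observed = cv_accuracy(X, y)

# Null distribution: same procedure with the class labels shuffled.
n_perm = 200
null = []
for _ in range(n_perm):
    y_perm = y[:]
    rng.shuffle(y_perm)
    null.append(cv_accuracy(X, y_perm))

# Fraction of permuted models doing at least as well as the real one.
p_value = (1 + sum(a >= observed for a in null)) / (n_perm + 1)
print("cross-validated accuracy:", observed)
print("mean permuted accuracy:", sum(null) / n_perm)
print("permutation p-value:", p_value)
```

If the real model is informative, its accuracy should sit well outside the range of the permuted models; if it falls inside that range, the variable rankings should not be interpreted.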

Below is a demonstration of the development and validation of an O-PLS-DA multivariate classification model for the famous Iris data set.

O-PLS-DA model validation tutorial

The Iris data contains only 4 variables, but the sample sizes are favorable for demonstrating a two-tiered testing and training scheme (internal and external cross-validation). However, O-PLS really shines when building models with many correlated variables (coming soon).
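The two-tiered scheme can be sketched at the level of sample indices alone, independent of the model. In the sketch below, a stratified external test set is first held out, and the remaining training samples are then divided into folds for internal cross-validation. The one-third test fraction, the 5 folds, the random seed, and the function names are assumptions made for illustration, not the settings used in the tutorial. The only fact taken from the Iris data is its layout of 50 samples for each of 3 species.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Hold out the same fraction of every class as an external test set."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for lab in sorted(by_class):
        idx = by_class[lab][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k, rng):
    """Deal the training samples of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    for lab in sorted(by_class):
        idx = by_class[lab][:]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Iris layout: 50 samples per species, 150 in total.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
rng = random.Random(2013)

# Tier 1 (external): these samples are set aside and only used for the final test.
train_idx, test_idx = stratified_split(labels, 1 / 3, rng)

# Tier 2 (internal): cross-validation folds drawn from the training samples only.
folds = stratified_folds(train_idx, labels, 5, rng)

for f, fold in enumerate(folds):
    fit_on = [i for i in train_idx if i not in fold]
    print("fold", f, "fit on", len(fit_on), "validate on", len(fold))
print("external test samples:", len(test_idx))
```

The important property is that the external test samples never appear in any internal fold, so model tuning and variable selection cannot leak into the final performance estimate.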

Sessions in Metabolomics 2013

The international summer sessions in metabolomics 2013 came to a happy conclusion this past Friday, Sept 6th, 2013. I had the pleasure of teaching the topics covering metabolomic data analysis. The class was split into lecture and lab sections. The lab section consisted of hands-on data analysis of:

  • fresh vs. lyophilized treatment comparison for tomatillo leaf primary metabolomics
  • tomatillo vs. pumpkin leaf primary metabolites

The majority of the data analyses were implemented using the open source software imDEV and Devium-web.

Download the FULL LAB. Take a look at the goals folder for each lesson. You can follow along with the lesson plans by looking at each subsection’s respective Excel file (.xlsx). When you are done with a section, unhide all the worksheets (right click on a tab at the bottom) to view the solutions.

The lectures, preceding the lab, covered the basics of metabolomic data analysis, including:

  • Data Quality Overview and Statistical Analysis
  • Multivariate Data Analysis
  • Metabolomic Case Studies