Partial least squares (PLS) is a versatile algorithm which can be used to predict either continuous or discrete/categorical variables. Classification with PLS is termed PLS-DA, where the DA stands for discriminant analysis. The PLS-DA algorithm has many favorable properties for dealing with multivariate data; one of the most important of which is how variable collinearity is dealt with, and the model’s ability to rank variables’ predictive capacities within a multivariate context. Orthogonal signal correction PLS-DA or O-PLS-DA is an extension of PLS-DA which seeks to maximize the explained variance between groups in a single dimension or the first latent variable (LV), and separate the within group variance (orthogonal to classification goal) into orthogonal LVs. The variable loadings and/or coefficient weights from a validated O-PLS-DA model can be used to rank all variables with respect to their performance for discriminating between groups. This can be used part of a dimensional reduction or feature selection task which seek to identify the top predictors for a given model.
Like with most predictive modeling or forecasting tasks, model validation is a critical requirement. Otherwise the produced models maybe overfit or perform no better than coin flips. Model validation is the process of defining the models performance, and thus ensuring that the model’s internal variable rankings are actually informative.
Below is a demonstration of the development and validation of an O-PLS-DA multivariate classification model for the famous Iris data set.
- Data pretreatment and preparation
- Model optimization
- Permutation testing
- Internal cross-validation
- External cross-validation
The Iris data only contains 4 variables, but the sample sizes are favorable for demonstrating a two tiered testing and training scheme (internal and external cross-validation). However O-PLS really shines when building models with many correlated variables (coming soon).
The following is an example of a clinical study aimed at identification of circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).
The challenge is to make sense of 300 metabolic measurements for 300 patients.
The goal is to identify metabolites related to disease, while accounting covariate meta data such as gender and smoking.
- Exploratory Data Analysis – principal components analysis (PCA)
- Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
- Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)
Data exploration is useful for getting an idea of the data structure and to identify unusual or unexpected trends.
PCA above conducted on autoscaled data (300 samples and 300 measurements) was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately the goal of the study, disease severity, could not explain this pattern (top center). An unknown covariate was identified causing the observed clustering of samples (top right).
Next various covariate adjustment strategies were applied to the data and evaluated using the unsupervised PCA (bottom left) and the supervised O-PLS-DA.
Even after the initial covariate adjustment for the 2-cluster effect there remained a newly visible covariate (top ;left), the source of which could not me linked to the meta data.
After data pre-treatment and evaluation of testing strategies (top right) the next challenge is to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify its performance.
Variable correlation with O-PLS-DA sample scores and magnitude of variable loading in the model were used to select from the the full feature set (~300) only 64 (21%) top features which explained most of the models classification performance.
In conclusion preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal to identify metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal data covariate adjustment and hypothesis testing strategy. Finally O-PLS-DA modeling including feature selection, training/testing validations (n=100) and permutation testing (n=100) were used to identify the top features (21%) which were most predictive of patients classifications as displaying or not displaying the disease phenotype.
Working with wide data is already hard enough, add to this row outliers and things can get murky fast.
Here is an example of an anlysis of a wide data set, 24 rows x 84 columns.
Using imDEV, written in R, to calculate and visualize a principal components analysis (PCA) on this data set. We find that 7 components capture >80% of the variance in the data or X. We can also clearly see that the first dimension, capturing 35% of the variance in X, is skewed towards one outlier, larger black point in the plots in the lower left triangle of the figure below, representing PCA scores.
In this plot representing the results from a PCA:
- Bottom left triangle = PCA SCORES, red and cyan ellipses display 2 groups, outlier is marked by a larger black point
- Diagonal = PCA Eigen values or variance in X explained by the component
- Top right triangle = PCA Loadings, representing linear combinations variable weights to reconstruct Sample scores
In actuality the large variance in the outlier is due to a single value imputation used to back fill missing values in 40% of the columns (variable) for this row (sample).
Removing this sample (and another more moderate outlier) and using partial least squares projection to latent structures discriminant analysis or PLS-DA to generate a projection of X which maximizes its covariance with Y which here is sample membership among the two groups noted by point and ellipse color (red = group 1, cyan = group 2).
The PLS scores separation for the two groups is largely captured in the first latent variable (LV1, x-axis). However we can’t be sure that this separation is better than random chance. To test this we can generate permuted NULL models by using our original X data to discriminate between randomly permuted sample group assignments (Y). When doing the random assignments we can optionally conserve the proportion of cases in each group or maintain this at 1:1.
Comparing our models (vertical hashed line) cross-validated fit to the training data , Q^2, to 100 randomly permuted models (TOP PANEL above, cyan distribution), we see that generally our “one” model is better fit than that achieved for the random data. Another interesting parameter is the comparison of our model’s root mean error of prediction, RMSEP, or out of sample error to the permuted models’ (BOTTOM PANEL above, dark green distribution).
To have more confidence in the assessment of our model we can conduct training and testing validations. We can do this by randomly splitting our original X data into 2/3 training and 1/3 test sets. By using the training data to fit the model, and then using this to predict group memberships for the test data we can get an idea of the model’s out of sample classification error, visualized below using an receiver operator characteristic curve (ROC, RIGHT PANEL).
Wow this one model looks “perfect”, based on its assessment using one training/testing evaluation (ROC curve above). However it is important to repeat this procedure to evaluate its performance for other random splits of the data into training and test sets.
After permuting the samples training/test assignments 100 times; now we see that our original “one” model (cyan line in ROC curve in the TOP PANEL) was overly optimistic compared to the average performance of 100 models (green lines and distribution above).
Now that we have more confidence in our models performance we can compare the distributions for its performance metrics to those we calculated for the permuted NULL models above. In the case of the “one” model we can use a single sample t-Test or for the “100” model a two-sample t-Test to determine the probability of achieving a similar performance to our model by random chance.
Now by looking at our models RMSEP compared to random chance (BOTTOM LEFT PANEL, out of sample error for our model, green, compared to random chance, yellow) we can be confident that our model is worth exploring further.
For example, we can now investigate the variables’ weights and loadings in the PLS model to understand key differences in these parameters which are driving the above models discrimination performance of our two groups of interest.