Multivariate Modeling Strategy
The following is an example of a clinical study aimed at identification of circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).
The challenge is to make sense of 300 metabolic measurements for 300 patients.
The goal is to identify metabolites related to disease, while accounting covariate meta data such as gender and smoking.
- Exploratory Data Analysis – principal components analysis (PCA)
- Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
- Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)
Data exploration is useful for getting an idea of the data structure and to identify unusual or unexpected trends.
PCA above conducted on autoscaled data (300 samples and 300 measurements) was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately the goal of the study, disease severity, could not explain this pattern (top center). An unknown covariate was identified causing the observed clustering of samples (top right).
Next various covariate adjustment strategies were applied to the data and evaluated using the unsupervised PCA (bottom left) and the supervised O-PLS-DA.
Even after the initial covariate adjustment for the 2-cluster effect there remained a newly visible covariate (top ;left), the source of which could not me linked to the meta data.
After data pre-treatment and evaluation of testing strategies (top right) the next challenge is to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify its performance.
Variable correlation with O-PLS-DA sample scores and magnitude of variable loading in the model were used to select from the the full feature set (~300) only 64 (21%) top features which explained most of the models classification performance.
In conclusion preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal to identify metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal data covariate adjustment and hypothesis testing strategy. Finally O-PLS-DA modeling including feature selection, training/testing validations (n=100) and permutation testing (n=100) were used to identify the top features (21%) which were most predictive of patients classifications as displaying or not displaying the disease phenotype.