Working with wide data is already hard enough; add row outliers and things can get murky fast.
Here is an example of an analysis of a wide data set, 24 rows x 84 columns.
imDEV, written in R, was used to calculate and visualize a principal components analysis (PCA) on this data set. We find that 7 components capture >80% of the variance in the data (X). We can also clearly see that the first dimension, capturing 35% of the variance in X, is skewed towards one outlier, shown as a larger black point in the PCA scores plots in the lower left triangle of the figure below.
In this plot representing the results from a PCA:
- Bottom left triangle = PCA scores; red and cyan ellipses display the 2 groups, and the outlier is marked by a larger black point
- Diagonal = PCA eigenvalues, or the variance in X explained by each component
- Top right triangle = PCA loadings, the variable weights used in the linear combinations that give the sample scores
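The scores, eigenvalues and loadings listed above can be sketched with a plain singular value decomposition. This is a minimal illustration, not the imDEV implementation: the function name `pca_svd`, the unit-variance scaling and the synthetic 24 x 84 matrix (with one row filled with a constant in about 40% of its columns) are all assumptions made for the example.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the centered, unit-variance scaled data (scaling is an assumption)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    sd = centered.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                    # leave constant columns unscaled
    Z = centered / sd
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                       # sample scores, one column per component
    loadings = Vt.T                      # variable weights, one column per component
    explained = s**2 / np.sum(s**2)      # fraction of variance per component
    return scores, loadings, explained

# Synthetic stand-in for a 24 x 84 data set (not the data from this post).
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 84))
X[0, :34] = 10.0                         # one row with a single value in ~40% of columns

scores, loadings, explained = pca_svd(X)
cumulative = np.cumsum(explained)
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)   # components needed for >=80%
outlier_row = int(np.argmax(np.abs(scores[:, 0])))      # most extreme sample on PC1

print("components for 80% variance:", n_for_80)
print("most extreme row on PC1:", outlier_row)
```

On the synthetic matrix, the row carrying the constant fill dominates the first component, which mirrors the kind of skew described for the outlier above.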
In actuality, the large variance in the outlier is due to single-value imputation, which was used to backfill missing values in 40% of the columns (variables) for this row (sample).
After removing this sample (and another, more moderate outlier), we can use partial least squares projection to latent structures discriminant analysis (PLS-DA) to generate a projection of X that maximizes its covariance with Y, which here is sample membership in the two groups noted by point and ellipse color (red = group 1, cyan = group 2).
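To make the idea of a covariance-maximizing projection concrete, here is a rough sketch of a NIPALS-style PLS fit for a single 0/1 group label. It is not the routine used to produce the figures: the function name `pls1_scores`, the 0/1 coding of Y and the synthetic data are assumptions, and the sketch does not guard against the degenerate case where Y is already fit exactly.

```python
import numpy as np

def pls1_scores(X, y, n_components=2):
    """NIPALS-style PLS for one response; returns sample scores T and weights W."""
    Xr = np.asarray(X, dtype=float)
    Xr = Xr - Xr.mean(axis=0)            # center X
    yr = np.asarray(y, dtype=float)
    yr = yr - yr.mean()                  # center the 0/1 group label
    n, p = Xr.shape
    T = np.zeros((n, n_components))
    W = np.zeros((p, n_components))
    for a in range(n_components):
        w = Xr.T @ yr                    # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                       # latent variable scores
        tt = t @ t
        p_load = Xr.T @ t / tt           # X loadings
        q = (yr @ t) / tt                # y loading
        Xr = Xr - np.outer(t, p_load)    # deflate X
        yr = yr - q * t                  # deflate y
        T[:, a] = t
        W[:, a] = w
    return T, W

# Synthetic two-group data (an assumption, not the data from this post).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 11)
X = rng.normal(size=(22, 84))
X[y == 1, :10] += 1.5                    # group difference in 10 variables

T, W = pls1_scores(X, y)
print("LV1 group means:", T[y == 0, 0].mean(), T[y == 1, 0].mean())
```

Because the first weight vector is proportional to the covariance between each variable and the label, the two group means are always pushed apart along LV1, even when the labels are meaningless. That is why the validation steps that follow matter.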
The PLS scores separation for the two groups is largely captured in the first latent variable (LV1, x-axis). However, we can’t be sure that this separation is better than random chance. To test this, we can generate permuted NULL models by using our original X data to discriminate between randomly permuted sample group assignments (Y). When doing the random assignments, we can optionally conserve the proportion of cases in each group or set this to 1:1.
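The permutation scheme just described can be sketched as follows. The helper names (`permute_labels`, `lv1_r2`) are hypothetical, the data are synthetic, and the statistic is a simple stand-in (squared correlation between the labels and a first latent variable), not the cross-validated Q^2 of a full PLS-DA model.

```python
import numpy as np

def permute_labels(y, rng, balanced=False):
    """Shuffle group labels, keeping the original proportions or forcing them to 1:1."""
    y = np.asarray(y)
    if balanced:
        classes = np.unique(y)
        reps = int(np.ceil(len(y) / len(classes)))
        pool = np.tile(classes, reps)[:len(y)]   # equal share of each class
        return rng.permutation(pool)
    return rng.permutation(y)                    # conserves the class counts

def lv1_r2(X, y):
    """Stand-in fit statistic: squared correlation of y with the first PLS score."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    t = Xc @ (Xc.T @ yc)
    return float(np.corrcoef(t, yc)[0, 1] ** 2)

# Synthetic, unbalanced two-group data (an assumption for illustration).
rng = np.random.default_rng(0)
y = np.array([0] * 8 + [1] * 14)
X = rng.normal(size=(22, 84))
X[y == 1, :10] += 1.5

observed = lv1_r2(X, y)
n_perm = 100
null = np.array([lv1_r2(X, permute_labels(y, rng)) for _ in range(n_perm)])
# Empirical p-value: share of permuted models that fit at least as well.
p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
print("observed:", observed, "null mean:", null.mean(), "p:", p_value)
```

With more variables than samples, even the permuted models can fit the training data well. This is one reason to compare cross-validated or out-of-sample metrics rather than raw fit.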
Comparing our model's (vertical hashed line) cross-validated fit to the training data, Q^2, to that of 100 randomly permuted models (TOP PANEL above, cyan distribution), we see that our “one” model generally fits better than the models built on the random data. Another interesting parameter is the comparison of our model’s root mean squared error of prediction (RMSEP), or out-of-sample error, to that of the permuted models (BOTTOM PANEL above, dark green distribution).
To have more confidence in the assessment of our model, we can conduct training and testing validations. We can do this by randomly splitting our original X data into 2/3 training and 1/3 test sets. By using the training data to fit the model, and then using this model to predict group memberships for the test data, we can get an idea of the model’s out-of-sample classification error, visualized below using a receiver operating characteristic (ROC) curve (RIGHT PANEL).
Wow, this one model looks “perfect”, based on its assessment using one training/testing evaluation (ROC curve above). However, it is important to repeat this procedure to evaluate its performance for other random splits of the data into training and test sets.
After permuting the samples' training/test assignments 100 times, we now see that our original “one” model (cyan line in the ROC curve in the TOP PANEL) was overly optimistic compared to the average performance of 100 models (green lines and distribution above).
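A single 2/3 training, 1/3 test evaluation and its 100-fold repetition can be sketched as below. Several details are assumptions made for the example: the split is stratified by group so that both classes appear in every test set, the classifier is a single covariance-weighted direction rather than a full PLS-DA model, and the data are synthetic. The area under the ROC curve (AUC) is computed as the fraction of correctly ranked (positive, negative) pairs.

```python
import numpy as np

def stratified_split(y, rng, test_fraction=1/3):
    """Boolean train/test masks with about test_fraction of each class held out."""
    y = np.asarray(y)
    test_mask = np.zeros(len(y), dtype=bool)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test_mask[idx[:n_test]] = True
    return ~test_mask, test_mask

def fit_direction(X_train, y_train):
    """Training mean and unit weight vector proportional to cov(X, y)."""
    mean = X_train.mean(axis=0)
    w = (X_train - mean).T @ (y_train - y_train.mean())
    return mean, w / np.linalg.norm(w)

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def split_auc(X, y, rng):
    """Fit on a random training set, then score the held-out test set."""
    train, test = stratified_split(y, rng)
    mean, w = fit_direction(X[train], y[train])
    return roc_auc(y[test], (X[test] - mean) @ w)

# Synthetic two-group data (an assumption for illustration).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 12)
X = rng.normal(size=(24, 84))
X[y == 1, :10] += 1.0

one_auc = split_auc(X, y, rng)                                # the "one" model
aucs = np.array([split_auc(X, y, rng) for _ in range(100)])   # the "100" models
print("single split AUC:", one_auc)
print("mean AUC over 100 splits:", aucs.mean(), "sd:", aucs.std(ddof=1))
```

Comparing the single-split AUC with the spread over 100 splits shows how much one lucky or unlucky partition can move the estimate.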
Now that we have more confidence in our model's performance, we can compare the distributions of its performance metrics to those we calculated for the permuted NULL models above. In the case of the “one” model we can use a one-sample t-test, or for the “100” model a two-sample t-test, to determine the probability of achieving a similar performance to our model by random chance.
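The two comparisons can be sketched with the standard library alone. The function names are hypothetical, the two-sample version shown is Welch's unequal-variance variant (an assumption, since the post does not say which variant was used), and the numbers are toy placeholders rather than values from this analysis. The sketch returns the t statistic and degrees of freedom; the p-value would then be read from the t distribution with a statistics package.

```python
import math
import statistics

def one_sample_t(null_values, observed):
    """t statistic for H0: the mean of the permuted metrics equals the observed value."""
    n = len(null_values)
    mean = statistics.fmean(null_values)
    sd = statistics.stdev(null_values)
    return (mean - observed) / (sd / math.sqrt(n)), n - 1

def welch_t(a, b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Toy placeholder values, purely illustrative.
permuted_metric = [0.1, 0.2, 0.3, 0.4, 0.5]    # metric from permuted NULL models
single_model_metric = 0.9                      # metric from the "one" model
repeated_metric = [0.7, 0.8, 0.75, 0.85, 0.9]  # metric from repeated train/test models

t_one, df_one = one_sample_t(permuted_metric, single_model_metric)
t_two, df_two = welch_t(permuted_metric, repeated_metric)
print("one-sample:", t_one, df_one)
print("two-sample (Welch):", t_two, df_two)
```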
Now, by looking at our model's RMSEP compared to random chance (BOTTOM LEFT PANEL, out-of-sample error for our model, green, compared to random chance, yellow), we can be confident that our model is worth exploring further.
For example, we can now investigate the variables’ weights and loadings in the PLS model to understand the key differences in these parameters that drive the above model's discrimination of our two groups of interest.
The Iris data set is famous for its use in comparing unsupervised classifiers.
The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that no single variable nor bivariate combination can achieve this.
One approach to improve the separation between the two closely related Iris species, I. versicolor (blue) and I. virginica (green), is to use a combination of all 4 measurements by constructing principal components (PCs).
Using the singular value decomposition to calculate PCs, we see that the sample scores above are not resolved for the two species of interest.
Another approach is to use a supervised projection method like partial least squares (PLS) to identify latent variables (LVs), which are data projections similar to those of PCA but which are also correlated with the species label. Interestingly, this approach leads to a projection which changes the orientation of I. versicolor and I. virginica relative to I. setosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.
Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores, we can see that this approach is the most successful at resolving the two closely related species. However, the loadings from this method, which help relate how the variables are combined to achieve the classification, are impossible to interpret. In the case of the function used above (nlPca, pcaMethods R package), the loadings are literally NA.