Andrews Encoding of Multivariate Data (looks informative)
Recently I came across an interesting visualization for multivariate data: the Andrews curve (plot) (original post here). This is a clever trigonometric transformation of a multivariate data set into x and y coordinate encodings. After a quick check I was happy to see there is an R package for making Andrews plots, andrews. Here is an example of an Andrews plot for a data set describing various features of automobiles, mtcars, colored according to the number of cylinders in each vehicle (4, 6 or 8).
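The plot above was made with R's andrews package; for readers outside R, pandas ships an equivalent function. Here is a minimal sketch using a tiny hand-typed stand-in for mtcars (the real data set lives in R's datasets package):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd
from pandas.plotting import andrews_curves

# A few rows typed in as a stand-in for mtcars
df = pd.DataFrame({
    "mpg":  [21.0, 22.8, 18.7, 14.3],
    "disp": [160.0, 108.0, 360.0, 360.0],
    "hp":   [110, 93, 175, 245],
    "cyl":  [6, 4, 8, 8],
})

# One curve per car, colored by the class column (here, cylinder count)
ax = andrews_curves(df, class_column="cyl")
```

In pandas each row becomes one curve over a fixed t grid (y = f(t) against t), which is the classic one-dimensional Andrews form rather than the x/y parameterization the andrews package can draw.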
This is an interesting perspective on 11 measurements for 32 cars (it shares similarities with a parallel coordinates plot). Based on this visualization, the 8 cylinder cars appear the most similar to one another with regard to the other parameters, judging from the "tightness" of their patterns (yellow lines), while the 4 and 6 cylinder cars seem more similar to each other.
It's hard to compare the Andrews-encoded and original data directly, but we can try with a scatterplot visualization.
This visualization supports the previous observation: the number of cylinders has a large effect on continuous variables like miles per gallon (mpg). The effect of the other potential covariates (discrete variables like vs, am and gear) is less obvious but may also be present. This would be important to include or account for when conducting predictive modeling.
To try to identify further covariates we can take a look at the principal component analysis (PCA) scores, another method for multivariate visualization, which in this case is limited to the two largest modes of variance in the data (the principal plane, components 1 and 2).
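The autoscale-then-PCA step can be sketched in a few lines; this uses a random matrix as a stand-in for the 32 × 11 mtcars data, so only the shapes match the post:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 11))  # stand-in for the 32 x 11 mtcars matrix

# Autoscale: center each column to mean 0 and scale to sd 1
Xs = StandardScaler().fit_transform(X)

# Project onto the principal plane (PC1, PC2) for plotting
scores = PCA(n_components=2).fit_transform(Xs)
```

Coloring the resulting 32 × 2 scores by cyl (or any other candidate covariate) reproduces the kind of scores plot discussed here.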
Based on the scores, it is evident that sample clustering is fairly well explained by the number of cylinders and other correlated parameters. We can also see that PC1 (x-axis) explains the number of cylinders fairly well, but something else is causing a separation in y.
Instead of autoscaling the data (mean = 0, sd = 1, as was done prior to the PCA above) we can instead make an Andrews encoding of the data. This applies a trigonometric transformation to each variable to produce 101 x and 101 y values for each of the 32 cars. We can combine these into a new matrix (32 by 202), with rows representing samples (n = 32) and columns the x (n = 101) and y (n = 101) encodings. This effectively increases our number of variables from 11 to 202, but hopefully also gives deeper insight into any class structure.
Interestingly, this encoding highlights the previously noted and as yet unexplained factor (evident in the scores as a difference in y between vehicles with the same number of cylinders). Next, we can check the other discrete variables in the data to see if any of them help explain the clustering pattern observed above.
After a quick check it is evident that the type of transmission (am; manual (1) or automatic (0)) nicely explains the second mode of scores variance, which is not captured by cylinders.
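One way to make this "quick check" systematic is to score how well each discrete variable separates the encoded PCA scores along each component. The sketch below plants an am-like effect in a synthetic encoded matrix and uses a crude hypothetical effect-size helper (difference of group means over the component's sd), which is my own construction, not the author's code:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
am = rng.integers(0, 2, size=32)  # stand-in transmission labels (0/1)
# Synthetic 32 x 202 encoded matrix with a planted "am" shift
E = rng.normal(size=(32, 202)) + np.outer(am, np.linspace(0.0, 1.0, 202))

scores = PCA(n_components=2).fit_transform(E)

def separation(scores, labels):
    # |difference of group means| / per-component sd, one value per PC
    a, b = scores[labels == 0], scores[labels == 1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / scores.std(axis=0)

sep = separation(scores, am)  # larger value -> that PC separates the groups
```

Looping this over cyl, vs, am, gear and carb would rank which discrete variable best explains each mode of scores variance.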
This is less obvious in the autoscaled PCA.
Further inspection of the Andrews-encoded PCA also suggests that there is yet another potential covariate, as evident from the two clusters of 8 cylinder, automatic transmission vehicles (8|0).
At first blush the Andrews method coupled with a dimensionality reduction technique seems like a very interesting approach for identifying covariate contributions to patterns in the data. It would be interesting to compare variable loadings from PCA of autoscaled and Andrews-encoded data, but it is not obvious how to do this…
If you want to replicate the analyses above, or just want to apply these visualizations to your own data, get all the necessary code from the example found HERE.