When you want to get to know and love your data

Andrews encoding of multivariate data (looks informative)

Recently I came across an interesting visualization for multivariate data named the Andrews curve (plot) (original post here). This is a very interesting trigonometric transformation of a multivariate data set to x and y coordinate encodings. After a quick check I was happy to see there is an R package for making Andrews plots, andrews. Here is an example of an Andrews plot for a data set describing various features of automobiles, “mtcars”, which is also colored according to the number of cylinders in each vehicle (4, 6 or 8).
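The classical Andrews transform (Andrews, 1972) maps each observation x = (x1, x2, …, xp) to a single curve f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + … over t in [−π, π]. Here is a minimal sketch of that transform, written in Python rather than the R andrews package used in this post:

```python
import numpy as np

def andrews_curve(row, t):
    """Classical Andrews transform of one observation:
    f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, x in enumerate(row[1:], start=1):
        k = (i + 1) // 2  # harmonic index: 1, 1, 2, 2, 3, 3, ...
        result += x * (np.sin(k * t) if i % 2 == 1 else np.cos(k * t))
    return result

# evaluate one (made-up) observation on a 101-point grid, as in the post
t = np.linspace(-np.pi, np.pi, 101)
curve = andrews_curve([1.0, 0.5, -0.5, 2.0], t)
```

Plotting one such curve per row of the data (and coloring by cylinder count) reproduces the kind of figure shown below.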

andrews plot

This is an interesting perspective on 11 measurements for 32 cars (it shares similarity with a parallel coordinates plot). Based on this data visualization, the 8 cylinder cars seem the most similar to each other with regard to the other parameters, judging from the “tightness” of their patterns (yellow lines), while the 4 and 6 cylinder cars seem more similar to each other.

Here is my modified visualization of the Andrews plot using ggplot2 (get the code HERE).

andrews-lines

modified andrews plot

It’s hard to compare the Andrews-encoded and original data, but we can try with a scatterplot visualization.

pairs

This visualization supports the previous observation: the number of cylinders has a large effect on the continuous variables like miles per gallon (mpg). The effect of the other potential covariates (discrete variables like vs, am and gear) is less obvious but may also be present. This would be important to include or account for when conducting predictive modeling.

To try to identify further covariates we can take a look at the principal component analysis (PCA) scores, another method for multivariate visualization, which in this case is limited to the two largest modes of variance in the data (the principal plane, component 1 and component 2).

PCA

Based on the scores, it is evident that sample clustering is fairly well explained by the number of cylinders and other correlated parameters. We can also see that the scores on PC1 (x-axis) separate cylinder number fairly well, but there is something else causing a separation in y.
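For reference, autoscaling followed by PCA can be sketched in a few lines (again in Python rather than R; the random matrix below is a stand-in for the 32 × 11 mtcars table, not the real data):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Autoscale each column (mean = 0, sd = 1, matching R's scale()),
    then project onto the leading principal components via SVD."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T  # scores on PC1..PCn

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 11))  # placeholder for the 32 x 11 data matrix
scores = pca_scores(X)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by cylinder count, gives the scores view discussed above.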

Instead of autoscaling the data (mean = 0, sd = 1; as done prior to the PCA above) we can instead make an Andrews encoding of the data. This applies a trigonometric transformation to each of the variables to produce 101 x and y values for each of the 32 cars. We can combine these to create a new matrix (32 by 202) with rows representing samples (n = 32) and columns the x (n = 101) and y (n = 101) encodings. This effectively increases our number of variables from 11 to 202, but hopefully also gives deeper insight into any class structure.

andrews encoded PCA
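One way to build such a 32 × 202 matrix is to split the trigonometric terms between an x(t) and a y(t) curve. Note that this particular split is my assumption for illustration; the exact x/y parameterisation used by the andrews package (and the gist linked in this post) may differ:

```python
import numpy as np

def andrews_xy(row, t):
    """One possible x/y Andrews encoding (an assumed construction,
    not the R package's code): cosine terms go into x(t), sine terms
    into y(t)."""
    row = np.asarray(row, dtype=float)
    x = np.full_like(t, row[0] / np.sqrt(2.0))
    y = np.zeros_like(t)
    for i, v in enumerate(row[1:], start=1):
        k = (i + 1) // 2  # harmonic index: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            y += v * np.sin(k * t)
        else:
            x += v * np.cos(k * t)
    return x, y

t = np.linspace(-np.pi, np.pi, 101)
rng = np.random.default_rng(1)
data = rng.normal(size=(32, 11))  # placeholder for the 32 x 11 data matrix
# stack the 101 x-values and 101 y-values per car into one 32 x 202 matrix
encoded = np.array([np.concatenate(andrews_xy(r, t)) for r in data])
```

The `encoded` matrix can then be fed to the same PCA routine as the autoscaled data.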

Interestingly, this encoding highlights the previously noted and yet unexplained factor (evident in the scores as a difference in y between vehicles with the same number of cylinders). Next, we can check the other discrete variables in the data to see if any of them help explain the clustering pattern observed above.

After a quick check it is evident that the type of transmission (am; manual (1) or automatic (0)) nicely explains the second mode of scores variance, which is not captured by cylinders.
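A quick numeric version of this check (an ad-hoc sketch, not code from the post): compare the mean PC2 score of the manual and automatic groups; a large gap suggests the factor tracks the second mode of variance.

```python
import numpy as np

def group_gap(scores, factor):
    """Absolute difference between the two groups' mean PC2 scores,
    for a binary factor such as am (0 = automatic, 1 = manual)."""
    factor = np.asarray(factor)
    pc2 = scores[:, 1]
    return abs(pc2[factor == 1].mean() - pc2[factor == 0].mean())

# toy example: two groups clearly shifted apart along PC2
scores = np.array([[0.0, 1.0], [0.1, 1.2], [0.0, -1.0], [0.2, -0.9]])
am = np.array([1, 1, 0, 0])
gap = group_gap(scores, am)
```

With real scores this is only a rough screen; coloring the scores plot by each candidate factor, as done below, is the more informative check.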

andrews PCA cyl and am

This is less obvious in the autoscaled PCA.

PCA cyl and am

Further inspection of the Andrews-encoded PCA also suggests that there is yet another potential covariate, as evident from the two clusters of 8 cylinder, automatic transmission vehicles (8|0).

At first blush the Andrews method coupled with a dimensionality reduction technique seems like a very interesting approach for identifying covariate contributions to patterns in the data. It would be interesting to compare variable loadings from PCA of autoscaled and Andrews-encoded data, but it is not obvious how to do this…

If you want to replicate the analyses above, or just want to apply these visualizations to your own data, get all the necessary code from the example found HERE.

3 responses

  1. sarah

    For this paragraph and the figure below it….
    “Instead of autoscaling the data (mean=0, sd=1; as previously done prior to the PCA above) we can instead make an andrews encoding of the data. This will apply a trigonometric transformation to each of the variables to produce 101 x and y values for each of the 32 cars. We can combine these to create a new matrix (32 by 202) with rows representing sample (n=32) and columns the x (n=101) and y (n=101) encodings. This effectively increase our number of variables from 12 to 202, but hopefully also gives a deeper insight into any class structure”

    I’m using Matlab, I’m very new to multivariate analysis and PCA. Instead of determining PCA for my raw data set, can you explain how I would apply andrews encoding PCA to my data to get the plot below the quoted paragraph above..

    Sorry for such a simple question, but I’m learning!

    S

    May 11, 2016 at 4:34 pm

    • Hi,

      Take a look in the gist: https://gist.github.com/dgrapov/5384152
      lines 202-213. This shows how to calculate the encoding followed by PCA.

      You should be able to run the function plot.data with kind=’andrews-PCA’ to recreate the exact plot (see code at the bottom).

      -Dmitry

      May 19, 2016 at 2:12 am

  2. sarah

    Great post by the way – really informative compared to what else I’ve read!

    May 11, 2016 at 4:34 pm
