When you want to get to know and love your data

Posts tagged “box plot”

Viewing Time-Dimensional Data (in multivariate space)

The idea is that we have collected information about 30 samples at 4 intervals for 200 variables. This makes 30 * 4 * 200 = 24,000 data points!

That is a lot to keep track of if we want to start the data analysis by looking at sample-wise (30) differences in variables (200) which are also dependent on time (4).

One idea is to use orthogonal signal correction partial least squares (O-PLS) to ask the question:

1) what is the most conserved linear ordering of my data, based on

2) a description of my data as 3 groups of samples at 4 points in time plus the starting point, t = 0 (a total of 5 points in time)?
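To make the O-PLS idea a little more concrete, here is a minimal Python sketch of the single-response O-PLS step that removes one response-orthogonal component. It is an illustration under assumptions, not the code used for the plots in this post: the response `y` is taken to be centered time only (the model described above also uses group membership), the data values are made up, and the helper names (`center`, `opls_one_orthogonal`, and so on) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each column mean so every variable has mean zero."""
    means = [sum(col) / len(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def x_times(X, w):
    """X w: one score per sample (row)."""
    return [dot(row, w) for row in X]

def xt_times(X, v):
    """X' v: one value per variable (column)."""
    return [dot(col, v) for col in zip(*X)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def opls_one_orthogonal(X, y):
    """Remove one y-orthogonal component from column-centered X.

    Returns predictive scores, orthogonal scores and the filtered X.
    """
    w = unit(xt_times(X, y))                  # predictive weights
    t = x_times(X, w)                         # predictive scores
    tt = dot(t, t)
    p = [v / tt for v in xt_times(X, t)]      # predictive loadings
    wp = dot(w, p)
    # The part of the loadings not explained by w defines the orthogonal weights.
    w_o = unit([pj - wp * wj for pj, wj in zip(p, w)])
    t_o = x_times(X, w_o)                     # orthogonal scores
    tt_o = dot(t_o, t_o)
    p_o = [v / tt_o for v in xt_times(X, t_o)]
    # Deflate X by the orthogonal component.
    X_f = [[x - ti * pj for x, pj in zip(row, p_o)]
           for row, ti in zip(X, t_o)]
    return x_times(X_f, w), t_o, X_f

# Hypothetical data: 5 time points (rows) by 3 variables (columns).
raw = [
    [1.0, 5.0, 2.0],
    [2.0, 4.0, 6.0],
    [3.5, 3.5, 1.0],
    [4.0, 2.0, 5.0],
    [5.5, 1.0, 3.0],
]
times = [0, 30, 60, 90, 120]
X = center(raw)
y = [t - sum(times) / len(times) for t in times]  # assumption: y is centered time

t_pred, t_orth, X_filtered = opls_one_orthogonal(X, y)
```

The useful property here is that the orthogonal scores carry no covariance with `y`, so variation unrelated to the response ends up on its own axis instead of blurring the predictive one.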

Here is an example O-PLS scores plot for the samples (30 * 5 = 150), with polygons drawn around the boundaries of each unique group and time-point classification (3 * 5 = 15).

group polygon
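One way to draw a boundary polygon like the ones above is to take the convex hull of each group's scores. The sketch below assumes that is what the polygons are (the plot may use a different boundary), uses made-up scores, and the function names are hypothetical.

```python
from collections import defaultdict

def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hulls_by_label(scores, labels):
    """Group (LV1, LV2) scores by label and compute one polygon per label."""
    grouped = defaultdict(list)
    for point, label in zip(scores, labels):
        grouped[label].append(point)
    return {label: convex_hull(pts) for label, pts in grouped.items()}

# Hypothetical (LV1, LV2) scores for one group at one time point.
scores = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
labels = [("group 1", 0)] * 5
polygons = hulls_by_label(scores, labels)
# polygons[("group 1", 0)] is [(0, 0), (2, 0), (2, 2), (0, 2)]
```

The interior point (1, 1) is dropped, which is the behavior we want for an outline around a cloud of sample scores.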

We can try to summarize the position of each group in this multivariate space (15 * 200) by plotting each group's median score and standard error for the first two O-PLS latent variables (LVs).
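As a rough sketch of that summary step in Python: group the scores for one latent variable by (group, time) label, then take the median and a standard error for each. The standard error here is computed as the sample standard deviation divided by the square root of n, which is an assumption about how it was calculated; the scores and the `summarize_scores` name are made up for illustration.

```python
import math
import statistics
from collections import defaultdict

def summarize_scores(scores, labels):
    """Return {label: (median, standard error)} for one latent variable."""
    grouped = defaultdict(list)
    for score, label in zip(scores, labels):
        grouped[label].append(score)
    summary = {}
    for label, values in grouped.items():
        # Assumption: standard error = sample SD / sqrt(n).
        se = statistics.stdev(values) / math.sqrt(len(values))
        summary[label] = (statistics.median(values), se)
    return summary

# Hypothetical LV1 scores for one group at two time points.
lv1_scores = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
labels = [("A", 0)] * 3 + [("A", 30)] * 3
lv1_summary = summarize_scores(lv1_scores, labels)
```

Running the same function on the LV2 scores gives the second coordinate and its error bar for each of the 15 group and time combinations.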

scores and lines

Above is an enticing representation of the time-course differences between 3 groups of samples for 5 time measurement points (t = 0, 30, 60, 90 and 120 minutes). Now that we have established how our samples look based on 200 measurements or variables, we can examine the variable loadings for this model.

group loadings

Above, the loadings, or relative contribution of each variable to the description of the samples, are plotted for O-PLS LV1 and LV2. Based on the position of the variables on the x-axis (LV1), we can say something about their relative changes in time, because the O-PLS sample scores are also distributed along the x-axis with respect to time. The LV2 loading (y-axis) can be used to describe changes/differences between the groups; note that the sample group classification pattern on the y-axis (LV2) is independent of the change in time (x-axis, LV1).

scores loadings

Above we can visualize how the sample and variable descriptions are related. For instance, variables at the far left of the loadings plot (FA) start out relatively increased and then decrease as the samples' position moves to the right. Analogously, as time increases there is an increase in the majority of variables (note the large cloud of loadings on LV1, the x-axis above).

Another interesting thing to try is to visualize the changes in group scores that are independent of t = 0 (subtract the t = 0 abundance of each of the 200 variables from the t = 30, 60, 90 and 120 minute time points on a sample-wise basis).

baseline group

Above are the baseline (t = 0) normalized changes in time (above left, point color) for three groups of samples (above left, point shape). As before, we can study the relationship between samples and variables on a multivariate basis by comparing the sample scores (position in LV1 and LV2) to the variable loadings.
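The baseline adjustment itself is simple: for every sample, subtract its own t = 0 values from each later time point, variable by variable. Here is a small Python sketch of that step; the dictionary layout, the sample names and the numbers are all made up for illustration.

```python
def baseline_adjust(measurements, baseline_time=0):
    """Subtract each sample's baseline values from its later time points.

    measurements maps (sample, time) -> list of variable values.
    The baseline rows themselves are dropped from the result.
    """
    adjusted = {}
    for (sample, time), values in measurements.items():
        if time == baseline_time:
            continue
        base = measurements[(sample, baseline_time)]
        adjusted[(sample, time)] = [v - b for v, b in zip(values, base)]
    return adjusted

# Hypothetical data: 2 samples, 2 variables.
measurements = {
    ("s1", 0): [10.0, 5.0],
    ("s1", 30): [12.0, 4.0],
    ("s1", 60): [15.0, 2.5],
    ("s2", 0): [8.0, 6.0],
    ("s2", 30): [8.5, 6.0],
}
adjusted = baseline_adjust(measurements)
# adjusted[("s1", 30)] is [2.0, -1.0]
```

Because the subtraction is done per sample rather than per group, differences in starting level between individuals are removed before the model is fit.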

This process (O-PLS) can be helpful for ranking the original 200 variables in two dimensions (2 lists):

1) with respect to change with time (x-axis)

2) with respect to difference between groups (y-axis).
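The two ranked lists described above could be produced along these lines in Python. This sketch assumes variables are ranked by the absolute size of their loading (the sign is kept so the direction of change is still visible); the loading values and the variable names other than FA are hypothetical.

```python
def rank_by_loading(names, loadings):
    """Sort variables by absolute loading, largest first, keeping the sign."""
    return sorted(zip(names, loadings), key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical loadings for three variables.
names = ["FA", "variable 2", "variable 3"]
lv1_loadings = [-0.9, 0.3, 0.5]   # x-axis: change with time
lv2_loadings = [0.2, -0.7, 0.1]   # y-axis: difference between groups

time_ranking = rank_by_loading(names, lv1_loadings)
group_ranking = rank_by_loading(names, lv2_loadings)
# time_ranking starts with ("FA", -0.9); group_ranking starts with ("variable 2", -0.7)
```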

It is interesting to note that, without baseline adjustment, the group young NGT has the lowest starting FA (its group scores at t = 0 are to the right of the other two groups). The relative difference between a group's t = 0 and t = 120 positions can be used to visualize the change in FA over time (a decrease; note the negative loading on LV1).

Finally, we can try to connect our multivariate observations with the easily interpretable visualization of a single variable (FA, baseline adjusted) as a box plot representing the medians (horizontal line at the center of each box) and the 25th and 75th percentiles (bottom and top boundaries of each box) for the 3 groups over 4 time points.

group box plots

The box plot visualization above captures a trend in the relative positions of the groups similar to the one we previously described using all 200 variables. This makes sense given the extreme loading observed for FA, and therefore the implied contribution (influence) of this variable to the observed distribution of the sample scores.
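For reference, the three numbers each box encodes (25th percentile, median, 75th percentile) can be computed per group and time point with a few lines of Python. The values below are made up, and the quartile method (linear interpolation that includes the end points) is an assumption; plotting software may use a slightly different rule.

```python
import statistics

def box_stats(values):
    """Return the 25th percentile, median and 75th percentile of values."""
    q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q25": q25, "median": median, "q75": q75}

# Hypothetical baseline-adjusted FA values, keyed by (group, time in minutes).
fa_adjusted = {
    ("young NGT", 30): [-1.0, -2.0, -3.0, -4.0, -5.0],
    ("young NGT", 60): [-2.0, -4.0, -6.0, -8.0, -10.0],
}
fa_boxes = {key: box_stats(values) for key, values in fa_adjusted.items()}
# fa_boxes[("young NGT", 30)] is {"q25": -4.0, "median": -3.0, "q75": -2.0}
```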


Visualizing the Iris Data

I’ve been working on additional scatter plot matrix plotting capabilities for the imCorrelations module.

Here is a little preview of a modified gpairs function from the YaleToolkit R package which is used to visualize the Iris data set. This scatterplot matrix allows for many interesting combinations of plots, which can be annotated with colors based on categorical variable(s).

The upper and lower matrix triangles can be modified with a variety of inputs:

  • scatterplots: points, best-fit-line, loess, qqplot for linear model residuals, best-fit-line confidence interval, correlation statistics
  • conditional plots: boxplot, stripplot, barcode

    Scatterplot matrix for overview of correlations and regressions, displaying box plots for Iris data species, variable histograms, correlation statistics, stripcharts and best fit lines.

This can be easily modified to rapidly visualize and get an overview of variable dependencies.

Displaying Iris data, confidence intervals for best fit lines, residual quantile-quantile plots and variable barcode plots.
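The correlation statistics shown in the matrix panels are computed pairwise between variables. As a rough illustration in Python (not the modified gpairs code, which is R), here is a hand-rolled Pearson correlation applied to every pair of columns. The column names follow the Iris data set, but the numbers are made up, and the post does not say which correlation statistic is displayed, so Pearson is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values; only the column names come from the Iris data set.
columns = {
    "Sepal.Length": [5.0, 6.0, 7.0, 6.5],
    "Sepal.Width": [3.5, 2.8, 3.0, 3.1],
    "Petal.Length": [1.5, 4.0, 6.0, 5.0],
}

# One entry per cell of the upper triangle of the scatterplot matrix.
names = list(columns)
pairwise_r = {
    (a, b): pearson_r(columns[a], columns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```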