When you want to get to know and love your data

Posts tagged “histogram”

Dynamic Data Visualizations in the Browser Using Shiny


After two busy weeks of teaching and attending academic conferences, I finally found some time to do what I love: programming data visualizations in R. Having been interested in Shiny for a while, I decided to pull the trigger and build my first Shiny app!

I wanted to make a proof-of-concept app containing the following two dynamics, which are basic to any UI design:

1) dynamic UI options

2) dynamically updated plot based on UI inputs

Here is what I came up with.

[Screenshot: boxplot view of the app]

Check out the app for yourself or the R code HERE.

library(shiny)
runGist('5792778')

The app consists of a user interface (UI) for selecting the data, the variable to plot, a grouping factor for colors, and one of four plotting options: boxplot (above), histogram, density plot and bar graph. As an added bonus, the user can choose to show or hide jittered points in the boxplot visualization.

Generally, #2 above was well documented and easy to implement, but it took a lot of trial and error to figure out how to implement #1. Basically, to generate dynamic UI objects, declare a placeholder for each one with shiny::uiOutput() in the ui.R file, then build the actual control and set its arguments in the server.R file with shiny::renderUI(). After getting this to work, everything else fell into place.
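The uiOutput()/renderUI() pairing can be sketched roughly as follows. This is not the code from the gist: it is a minimal single-file version (using shinyApp() instead of separate ui.R and server.R files), and the input names, the two example data sets, and the histogram are assumptions chosen only for illustration.

```r
library(shiny)

# Hypothetical minimal app: the choices offered by the second control
# depend on the data set picked in the first one.
ui <- fluidPage(
  selectInput("dataset", "Data set", choices = c("iris", "mtcars")),
  uiOutput("variable_ui"),  # placeholder only; the server fills it in
  plotOutput("plot")
)

server <- function(input, output) {
  # Look up the selected data frame by name in the datasets package
  selected_data <- reactive({
    get(input$dataset, "package:datasets")
  })

  # Build the variable selector on the server so its choices track the data
  output$variable_ui <- renderUI({
    numeric_cols <- names(Filter(is.numeric, selected_data()))
    selectInput("variable", "Variable to plot", choices = numeric_cols)
  })

  # Redraw whenever the data set or the variable changes
  output$plot <- renderPlot({
    req(input$variable)  # wait until the dynamic input exists
    req(input$variable %in% names(selected_data()))
    hist(selected_data()[[input$variable]],
         main = input$variable, xlab = input$variable)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app in the browser
```

The two req() calls keep the plot from erroring during the brief moment after a data set switch when the old variable name is still selected.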

Having some experience making UIs in VBA (Visual Basic for Applications) and gWidgets, I find Shiny a joy to work with once you understand some of its inner workings. One aspect that made the learning experience frustrating was the lack of informative errors from Shiny functions. Even with all the R debugging tools at hand, having Shiny constantly tell me that something was not correctly called from a reactive environment, or that the error was in runApp(), did not really help. My advice to anyone learning Shiny is to take a look at the tutorials, particularly the section on Dynamic UI, and then pick a small example to reverse engineer. Don’t start off too complicated, or you will have a hard time understanding which sections of code are not working as expected.

Finally, here are some screenshots. Keep an eye out for more advanced Shiny apps in the near future.

[Screenshots: density plot, histogram, bar graph]


Discriminating Between Iris Species

The Iris data set is famous for its use in comparing classification methods, both supervised and unsupervised.

The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that neither any single variable nor any bivariate combination can achieve this.
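A quick way to reproduce this kind of overview is base R's pairs() on the built-in iris data. This is only a sketch: the blue and green colors follow the post, the color for I. setosa is an assumption, and the petal-length check is just one example of how the two similar species overlap on a single variable.

```r
# Scatter plot matrix of the four measurements, colored by species.
# Blue/green follow the post; red for setosa is an assumption.
species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green3")

pairs(iris[, 1:4],
      col = species_cols[as.character(iris$Species)],
      pch = 19,
      main = "Iris measurements by species")

# Per-species range of a single variable: the versicolor and virginica
# ranges overlap, so this variable alone cannot separate them.
petal_ranges <- tapply(iris$Petal.Length, iris$Species, range)
petal_ranges
```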

One approach to improving the separation between the two closely related Iris species, I. versicolor (blue) and I. virginica (green), is to use a combination of all 4 measurements by constructing principal components (PCs).

Using the singular value decomposition to calculate the PCs, we see that the sample scores (above) still do not resolve the two species of interest.
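The SVD-based calculation can be sketched in a few lines of base R. Whether the original analysis scaled the variables to unit variance is not stated, so the autoscaling here is an assumption, as are the variable names and the setosa color.

```r
# Center and scale the four measurements (scaling is an assumption)
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = TRUE)

# X = U D V^T: the scores are U D and the loadings are the columns of V
s <- svd(X)
pc_scores <- s$u %*% diag(s$d)
pc_loadings <- s$v
var_explained <- s$d^2 / sum(s$d^2)  # fraction of variance per PC

species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green3")
plot(pc_scores[, 1], pc_scores[, 2],
     col = species_cols[as.character(iris$Species)], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA scores (SVD)")
```

Up to the sign of each component, these scores should match what prcomp() returns for the same scaled matrix.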

Another approach is to use a supervised projection method like partial least squares (PLS) to identify latent variables (LVs), which are data projections similar to those of PCA but which are also correlated with the species label. Interestingly, this approach leads to a projection that changes the orientation of I. versicolor and I. virginica relative to I. setosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.
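The post does not say which PLS implementation was used. One common way to set up such a supervised projection is to regress a species indicator matrix on the measurements (often called PLS-DA). The sketch below assumes the pls package is installed and uses its plsr() function; the choice of two latent variables and the autoscaling are also assumptions.

```r
library(pls)  # assumption: not necessarily the package used in the post

X <- scale(as.matrix(iris[, 1:4]))
# 150 x 3 indicator (dummy) matrix, one column per species
Y <- model.matrix(~ Species - 1, data = iris)

# Fit PLS with two latent variables and extract the sample scores
fit <- plsr(Y ~ X, ncomp = 2)
lv_scores <- unclass(scores(fit))

species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green3")
plot(lv_scores[, 1], lv_scores[, 2],
     col = species_cols[as.character(iris$Species)], pch = 19,
     xlab = "LV1", ylab = "LV2", main = "PLS scores")
```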

Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores, we can see that this approach is the most successful at resolving the two closely related species. However, the loadings from this method, which would normally show how the variables are combined to achieve the classification, are impossible to interpret. In the case of the function used above (nlpca, from the pcaMethods R package), the loadings are literally NA.


Visualizing the Iris Data

I’ve been working on additional scatter plot matrix plotting capabilities for the imCorrelations module.

Here is a little preview of a modified gpairs function from the YaleToolkit R package which is used to visualize the Iris data set. This scatterplot matrix allows for many interesting combinations of plots, which can be annotated with colors based on categorical variable(s).

The upper and lower matrix triangles can be modified with a variety of inputs:

  • scatterplots: points, best-fit-line, loess, qqplot for linear model residuals, best-fit-line confidence interval, correlation statistics
  • conditional plots: boxplot, stripplot, barcode

    Scatterplot matrix for overview of correlations and regressions, displaying box plots for Iris data species, variable histograms, correlation statistics, stripcharts and best fit lines.

This can be easily modified to give a rapid overview of variable dependencies.

Displaying Iris data, confidence intervals for best fit lines, residual quantile-quantile plots and variable barcode plots.
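The modified gpairs function itself is not shown here. As a rough stand-in, the same idea of filling the upper triangle, lower triangle and diagonal with different content can be sketched with custom panel functions in base R's pairs(). The panel function names and the color choices below are made up for this illustration.

```r
# Upper triangle: correlation statistic for each variable pair
panel_cor <- function(x, y, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, sprintf("r = %.2f", cor(x, y)), cex = 1.2)
}

# Lower triangle: points colored by group plus a best-fit line
panel_fit <- function(x, y, col = "black", ...) {
  points(x, y, col = col, pch = 19, cex = 0.6)
  abline(lm(y ~ x), lty = 2)
}

# Diagonal: histogram of each variable
panel_hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))
  h <- hist(x, plot = FALSE)
  par(usr = c(usr[1:2], 0, 1.5 * max(h$counts)))
  rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1], h$counts, col = "grey80")
}

species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green3")
pairs(iris[, 1:4],
      upper.panel = panel_cor,
      lower.panel = panel_fit,
      diag.panel = panel_hist,
      col = species_cols[as.character(iris$Species)])
```

Swapping a panel function for another one (for example a loess smoother in place of the straight line) changes one triangle without touching the rest of the matrix.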