When you want to get to know and love your data

Posts tagged “r-bloggers”

Push it to the limit: SOM + Clustering + Networks

What is the highest dimensional visualization you can think of? Now imagine it being interactive. The following details a Frankenstein visualization packing a smorgasbord of multivariate goodness.

First, enter self-organizing maps (SOM). I fell in love with SOMs after using the kohonen package. The wines data set example is a beautiful display of information.


Elegantly, making the visualization above is relatively easy. SOM is used to organize the data into related groups on a grid. Hierarchical cluster analysis (HCA) is then used to classify the SOM codes into three groups.
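The two steps above can be sketched with the kohonen package roughly as follows. This is a minimal sketch, not the exact code behind the figure: the 5 x 4 grid size, the seed, the colors and the default hclust() linkage are all assumptions chosen for illustration.

```r
library(kohonen)

data(wines)   # loads the `wines` matrix shipped with kohonen
set.seed(7)   # SOM initialization is random; the seed value is arbitrary

# Train a SOM on the scaled wine measurements (grid size is an assumption)
som_wines <- som(scale(wines), grid = somgrid(5, 4, "hexagonal"))

# One code (codebook) vector per grid unit
codes <- getCodes(som_wines)

# Hierarchical clustering of the SOM codes, cut into three groups
som_hc <- cutree(hclust(dist(codes)), k = 3)

# Codes plot with cluster membership as the unit background color
cluster_colors <- c("steelblue", "tomato", "palegreen")   # arbitrary colors
plot(som_wines, type = "codes", bgcol = cluster_colors[som_hc])
add.cluster.boundaries(som_wines, som_hc)
```

Each sample's cluster can then be looked up through its winning unit, for example with som_hc[som_wines$unit.classif].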


HCA cluster information is mapped to the SOM grid using hexagon background colors. The radial bar plots show the variable (wine compounds’) patterns for samples (wines).



The goal for this project was to reproduce the kohonen.plot using ggplot2 and make it interactive using shiny.


The main idea was to use SOM to calculate the grid coordinates, geom_hexagon for the grid packing and any ggplot for the hexagon-inset subplots. Some basic inset plots could be bar or line plots.

Part of the beauty is the organization of any ggplot you can think of (optionally grouping the input data or SOM codes) based on the SOM unit classification.

A Pavlovian response might be: does it network?


Yes we can (network). Above is an example of different correlation patterns between wine components in related groups of wines. For example, the green grid points identify wines showing a correlation between phenols and flavonoids (probably reds?). Their distance from each other could be explained (?) by the small grid size (see below).
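One simple way to turn correlation patterns into a network is to keep only the strong pairwise correlations as edges. The sketch below does this for the whole wines data set in base R. The 0.6 cutoff and the column names of the edge list are assumptions for illustration, and the per-group networks in the figure would apply the same idea to each subset of wines.

```r
library(kohonen)

data(wines)   # wine compound measurements shipped with kohonen

# Pairwise correlations between wine components (variables)
cors <- cor(wines)

# Keep each variable pair once (upper triangle) when |r| passes the cutoff
cutoff <- 0.6   # hypothetical threshold
idx <- which(upper.tri(cors) & abs(cors) >= cutoff, arr.ind = TRUE)

# Edge list: one row per retained correlation
edges <- data.frame(
  source = rownames(cors)[idx[, "row"]],
  target = colnames(cors)[idx[, "col"]],
  r      = cors[idx]
)

edges[order(-abs(edges$r)), ]
```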

The next question might be: does it scale?



There is potential. The 4 x 4 grid shows radial bar plot patterns for 16 subgroups among the 3 larger sample groups. The next 6 x 6 plot shows wine compound profiles for 36 ~related subsets of wines.

A useful side effect is that we can use SOM quality metrics to give us an extra-dimensional view into tuning the visualization. For example we can visualize the number of samples per grid point or distances between grid points (dissimilarity in patterns).
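Both of these metrics are available from a fitted kohonen object. The following is a minimal sketch assuming a 4 x 4 hexagonal grid and an arbitrary seed. It counts the samples mapped to each grid unit and draws the two built-in diagnostic plots.

```r
library(kohonen)

data(wines)
set.seed(7)   # arbitrary seed
som_wines <- som(scale(wines), grid = somgrid(4, 4, "hexagonal"))

# Number of samples mapped to each grid unit (empty units are kept as zero)
n_units <- nrow(som_wines$grid$pts)
counts <- table(factor(som_wines$unit.classif, levels = seq_len(n_units)))

# Built-in diagnostic views
plot(som_wines, type = "counts")            # samples per grid point
plot(som_wines, type = "dist.neighbours")   # distance to neighbouring units
```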

This helps identify which parts of the somClustPlot hold the most mapped samples and which show the greatest differences between grid points.

One problem I experienced was getting the hexagon packing just right. I ended up making controls to move the hexagons ~up/down and zoom in/out on the plot. It is not perfect but shows potential (?) for scaffolding highly multivariate visualizations. Some of my other concerns include the stochastic nature of SOM and its need for random initialization of the embedding. Make sure to call set.seed() before fitting to make the result reproducible, and you might want to try a few seeds. Maybe someone out there knows how to make this aspect of SOM more robust?
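A small sketch of the seed advice, using kohonen directly: fit the same SOM under a few seeds and compare a simple fit summary. The helper fit_som(), the seed values and the use of the mean sample-to-unit distance as a quality score are assumptions for illustration, not part of the app.

```r
library(kohonen)

data(wines)
X <- scale(wines)

# Hypothetical helper: fit a 4 x 4 SOM after fixing the random seed
fit_som <- function(seed) {
  set.seed(seed)
  som(X, grid = somgrid(4, 4, "hexagonal"))
}

seeds <- c(1, 7, 42)   # arbitrary seeds to compare

# Mean distance of each sample to its winning unit (lower is a closer fit)
fit_quality <- sapply(seeds, function(s) mean(fit_som(s)$distances))
names(fit_quality) <- seeds
fit_quality

# Refitting with the same seed should give the same sample-to-unit mapping
same_mapping <- identical(fit_som(7)$unit.classif, fit_som(7)$unit.classif)
```

Comparing the unit assignments or cluster memberships across seeds gives a rough feel for how stable the embedding is.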

Tryin’ to 3D network: Quest (shiny + plotly)

I have an unnatural obsession with 4-dimensional networks. It might have started with a dream, but VR might make it a reality one day. For now I will settle for 3D networks in Plotly.


Presentation: R users group (more)

More: networkly

Network Visualization with Plotly and Shiny

R users: networkly: network visualization in R using Plotly

In addition to their more common uses, networks can be used as powerful multivariate data visualization and exploration tools. Networks not only provide mathematical representations of data but are also one of the few data visualization methods capable of easily displaying multivariate variable relationships. The process of network mapping involves using the network manifold to display a variety of other information, e.g. statistical, machine learning or functional analysis results (see more mapped network examples).


The combination of Plotly and Shiny is awesome for creating your very own network mapping tools. Networkly is an R package which can be used to create 2-D and 3-D interactive networks which are rendered with plotly and can be easily integrated into shiny apps or markdown documents. All you need to get started is an edge list and node attributes which can then be used to generate interactive 2-D and 3-D networks with customizable edge (color, width, hover, etc) and node (color, size, hover, label, etc) properties.
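To make the edge list plus node attributes idea concrete, here is a rough sketch that draws a tiny 3-D network with plotly directly. It does not use networkly's own functions, whose interface is not shown here. The node coordinates, groups and edges are made-up toy values, and seg() is a hypothetical helper.

```r
library(plotly)

# Toy node attributes (all values are made up for illustration)
nodes <- data.frame(
  id    = c("a", "b", "c", "d"),
  x     = c(0, 1, 0, 1),
  y     = c(0, 0, 1, 1),
  z     = c(0, 1, 1, 0),
  group = c("g1", "g1", "g2", "g2")
)

# Toy edge list referring to node ids
edges <- data.frame(
  source = c("a", "a", "b"),
  target = c("b", "c", "d")
)

# Hypothetical helper: start and end coordinates of every edge along one
# axis, with NA after each pair so each edge is drawn as its own segment
seg <- function(coord) {
  from <- nodes[[coord]][match(edges$source, nodes$id)]
  to   <- nodes[[coord]][match(edges$target, nodes$id)]
  as.vector(rbind(from, to, NA))
}

p <- plot_ly()
# Edges: one line trace made of NA-separated segments
p <- add_trace(p, x = seg("x"), y = seg("y"), z = seg("z"),
               type = "scatter3d", mode = "lines",
               hoverinfo = "none", showlegend = FALSE)
# Nodes: markers colored by group, with the node id as hover text
p <- add_trace(p, data = nodes, x = ~x, y = ~y, z = ~z,
               type = "scatter3d", mode = "markers",
               color = ~group, text = ~id, hoverinfo = "text")
p
```

In a shiny app, the same object could be returned from renderPlotly() and displayed with plotlyOutput().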

2-Dimensional Network (interactive version)

3-Dimensional Network (interactive version)


View all code used to generate the networks above.


Data Analysis Workflow: ‘Omics style

Follow along with the presentation and recreate all the analysis results for yourself.


Metabolomics and Beyond: Challenges and Strategies for Next-gen Omic Analyses

Recently I had the pleasure of giving a lecture for the Metabolomics Society on Challenges and Strategies for Next-gen Omic Analyses. You can check out all of my slides and a video of the lecture below.

dplyr Tutorial: verbs + split-apply-combine

At a recent Saint Louis R users meeting I had the pleasure of giving a basic introduction to the awesome dplyr R package. For me, data analysis ubiquitously involves splitting the data based on a grouping variable and then applying some function to the subsets, or what is termed split-apply-combine. Having recently incorporated dplyr into my data wrangling workflows, I’ve found this package’s syntax and performance a joy to work with. My feelings about dplyr are as follows.

Data wrangling without dplyr.

Data wrangling with dplyr.

This tutorial features an introduction to common dplyr verbs and an overview of implementing split-apply-combine in dplyr.
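As a small taste of split-apply-combine in dplyr, the sketch below computes a per-group median with group_by() and summarise(), next to a base R equivalent. It uses the built-in iris data as a stand-in and is not taken from the tutorial itself.

```r
library(dplyr)

# Split by Species, apply median() and n(), combine into one row per group
by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    median_sepal_length = median(Sepal.Length),
    n = n()
  )

# The same per-group median in base R, for comparison
base_version <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)

by_species
```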


Some of my conclusions were that not only does dplyr make writing data wrangling code clearer and far faster, the package’s calculation speed is also very high (non-sophisticated comparison to base).

The plot above shows the calculation time in seconds for 10 replications (y-axis) when calculating the median for a varying number of groups (x-axis), rows (y-facet) and columns (x-facet) with (green line) and without (red line) dplyr.
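A comparison along these lines can be sketched as follows. This is not the benchmark behind the plot: the simulated data, its size, the number of groups and the use of aggregate() as the base R baseline are all assumptions, and timings will vary by machine.

```r
library(dplyr)

set.seed(1)
n_rows   <- 1e5   # assumed data size
n_groups <- 100   # assumed number of groups

# Simulated data: one grouping column and one numeric column
dat <- data.frame(
  group = sample(n_groups, n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

# Elapsed time for 10 replications of the grouped median in base R
time_base <- system.time(
  for (i in 1:10) res_base <- aggregate(value ~ group, data = dat, FUN = median)
)["elapsed"]

# Elapsed time for 10 replications of the same calculation in dplyr
time_dplyr <- system.time(
  for (i in 1:10) {
    res_dplyr <- dat %>%
      group_by(group) %>%
      summarise(value = median(value))
  }
)["elapsed"]

c(base = unname(time_base), dplyr = unname(time_dplyr))
```

Repeating this over a grid of group, row and column counts would give the kind of faceted comparison described above.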

2014 Metabolomic Data Analysis and Visualization Workshop and Tutorials

Recently I had the pleasure of teaching statistical and multivariate data analysis and visualization at the annual Summer Sessions in Metabolomics 2014, organized by the NIH West Coast Metabolomics Center.

Similar to last year, I’ve posted all the content (lectures, labs and software) for anyone to follow along with at their own pace. I also plan to release videos for all the lectures and labs, including use cases for the freely available data analysis software listed below.

You can check out the introduction lecture to the covered material below.

New additions to the course include a lecture and lab on data normalization, plus updated and improved software.


Stay tuned for videos of all of the material!

2014 Metabolomics Data Analysis and Visualization Tutorials by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.