When you want to get to know and love your data

Posts tagged “multivariate”

Push it to the limit: SOM + Clustering + Networks

What is the highest dimensional visualization you can think of? Now imagine it being interactive. The following details a Frankenstein visualization packing a smorgasbord of multivariate goodness.

First, enter self-organizing maps (SOM). I first fell into a love dream with SOMs after using the kohonen package for R. The wines data set example is a beautiful display of information.


Making the visualization above is relatively easy. First, a SOM organizes the data into related groups on a grid. Then hierarchical cluster analysis (HCA) classifies the SOM codes (the representative pattern learned for each grid unit) into three groups.
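The two steps can be sketched in a few lines. This is a minimal, standard-library Python illustration rather than the kohonen-based R code behind the plots. The toy data, the rectangular grid, the linear decay schedules and the average-linkage rule are all assumptions made for the example, and `train_som` and `hca_cut` are hypothetical helper names.

```python
import math
import random


def train_som(data, xdim, ydim, n_iter=2000, seed=1):
    """Online SOM on a rectangular grid; returns (codes, grid points)."""
    rng = random.Random(seed)
    grid = [(x, y) for y in range(ydim) for x in range(xdim)]
    # Initialise each code vector from a randomly drawn sample.
    codes = [list(rng.choice(data)) for _ in grid]
    radius0 = max(xdim, ydim) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        alpha = 0.05 * (1.0 - frac) + 0.01 * frac  # learning rate: 0.05 -> 0.01
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        sample = rng.choice(data)
        # Best matching unit: the code vector closest to the sample.
        bmu = min(range(len(codes)), key=lambda i: math.dist(codes[i], sample))
        for i, point in enumerate(grid):
            if math.dist(point, grid[bmu]) <= radius:
                codes[i] = [c + alpha * (s - c) for c, s in zip(codes[i], sample)]
    return codes, grid


def hca_cut(vectors, k):
    """Average-linkage agglomerative clustering, cut into k groups."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the two clusters with the smallest average distance.
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Made-up data: three noisy groups in three dimensions.
rng = random.Random(0)
centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 5.0)]
data = [[c + rng.gauss(0, 0.3) for c in center]
        for center in centers for _ in range(20)]

codes, grid = train_som(data, xdim=3, ydim=3)
groups = hca_cut(codes, k=3)  # one HCA group label per SOM unit
```

Each entry of `groups` is what would be mapped to a hexagon background color in the plot.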


HCA cluster information is mapped to the SOM grid using hexagon background colors. The radial bar plots show the variable (wine compounds’) patterns for samples (wines).



The goal for this project was to reproduce the kohonen.plot using ggplot2 and make it interactive using shiny.


The main idea was to use SOM to calculate the grid coordinates, geom_hexagon for the grid packing and any ggplot for the hexagon-inset sub plots. Some basic inset plots could be bar or line plots.
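To make the grid packing concrete, here is a rough sketch of how hexagon centers and outlines can be computed so that neighboring hexagons share edges. It is plain Python, not the ggplot2 code used for the figures. The unit spacing, the pointy-top orientation, the choice of which rows get shifted, and the function names `hex_grid` and `hexagon_vertices` are assumptions for illustration.

```python
import math


def hex_grid(xdim, ydim):
    """Centre coordinates of an xdim x ydim hexagonal grid with unit spacing."""
    points = []
    for row in range(ydim):
        offset = 0.5 if row % 2 == 1 else 0.0  # shift every other row by half a cell
        for col in range(xdim):
            # Rows sit sqrt(3)/2 apart so all six neighbours are at distance 1.
            points.append((col + 1 + offset, (row + 1) * math.sqrt(3) / 2))
    return points


def hexagon_vertices(cx, cy, spacing=1.0):
    """Corner points of a pointy-top hexagon centred on (cx, cy)."""
    r = spacing / math.sqrt(3)  # circumradius that makes adjacent hexagons touch
    return [(cx + r * math.cos(math.radians(60 * k + 30)),
             cy + r * math.sin(math.radians(60 * k + 30)))
            for k in range(6)]


centers = hex_grid(4, 4)
outline = hexagon_vertices(*centers[0])
```

Each outline could be drawn as a filled polygon, with the inset plot placed at the matching center.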

Part of the beauty is the organization of any ggplot you can think of (optionally grouping the input data or SOM codes) based on the SOM unit classification.

A Pavlovian response might be: does it network?


Yes we can (network). Above is an example of different correlation patterns between wine components in related groups of wines. For example, the green grid points identify wines showing a correlation between phenols and flavanoids (probably reds?). Their distance from each other could be explained (?) by the small grid size (see below).
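One way to get such per-group networks is to compute correlations between variables separately within each group and keep the strong ones as edges. The sketch below is plain Python and only illustrates the idea. The numbers are invented, the 0.8 cutoff is an assumption, and `group_correlation_edges` is a hypothetical helper, not the function used for the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_correlation_edges(rows, groups, names, cutoff=0.8):
    """For each group label, list the variable pairs with |r| >= cutoff."""
    edges = {}
    for g in sorted(set(groups)):
        members = [row for row, label in zip(rows, groups) if label == g]
        cols = list(zip(*members))  # one tuple of values per variable
        found = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                r = pearson(cols[i], cols[j])
                if abs(r) >= cutoff:
                    found.append((names[i], names[j], round(r, 3)))
        edges[g] = found
    return edges


# Invented values for three variables measured on six samples.
names = ["phenols", "flavanoids", "ash"]
rows = [[1, 2, 5], [2, 4, 1], [3, 6, 4],      # samples in group "green"
        [1, 3, 1], [2, 1, 2], [3, 2, 3.5]]    # samples in group "blue"
groups = ["green"] * 3 + ["blue"] * 3

edges = group_correlation_edges(rows, groups, names)
```

In this toy input only the "green" group links phenols and flavanoids, so the two groups end up with different edge lists.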

The next question might be, does it scale?


more lines

There is potential. The 4 x 4 grid shows radial bar plot patterns for 16 sub groups among the 3 larger sample groups. The next 6 x 6 plot shows wine compound profiles for 36 ~related subsets of wines.

A useful side effect is that we can use SOM quality metrics to give us an extra-dimensional view into tuning the visualization. For example we can visualize the number of samples per grid point or distances between grid points (dissimilarity in patterns).

This is useful to identify parts of the somClustPlot showing the number of mapped samples and greatest differences.
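Both metrics are simple to compute once the SOM is fitted. The following plain-Python sketch assumes three inputs that are not named in the post: `unit_classif` (the grid unit each sample mapped to), `codes` (one code vector per unit) and `grid` (unit coordinates). The neighbor rule (grid distance of about one cell) is also an assumption.

```python
import math
from collections import Counter


def samples_per_unit(unit_classif, n_units):
    """Number of samples mapped to each grid unit (0 for empty units)."""
    counts = Counter(unit_classif)
    return [counts.get(u, 0) for u in range(n_units)]


def neighbour_distances(codes, grid, max_step=1.05):
    """Mean code-vector distance from each unit to its immediate grid neighbours."""
    result = []
    for i, p in enumerate(grid):
        nb = [j for j, q in enumerate(grid)
              if j != i and math.dist(p, q) <= max_step]
        if nb:
            result.append(sum(math.dist(codes[i], codes[j]) for j in nb) / len(nb))
        else:
            result.append(0.0)  # isolated unit: no neighbours to compare with
    return result


# Tiny hand-made 2 x 2 example.
grid = [(0, 0), (1, 0), (0, 1), (1, 1)]
codes = [[0, 0], [0, 1], [3, 0], [3, 1]]
unit_classif = [0, 0, 1, 3, 3, 3]

counts = samples_per_unit(unit_classif, len(grid))  # [2, 1, 0, 3]
dists = neighbour_distances(codes, grid)            # [2.0, 2.0, 2.0, 2.0]
```

Units with a count of zero or a large neighbor distance are the ones worth a second look when tuning the grid size.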

One problem I experienced was getting the hexagon packing just right. I ended up making controls to move the hexagons ~up/down and zoom in/out on the plot. It is not perfect, but it shows potential (?) for scaffolding highly multivariate visualizations. Some of my other concerns include the stochastic nature of SOM and its need for random initialization of the embedding. Make sure to call set.seed() before fitting the SOM to make the result reproducible, and you might want to try a few seeds. Maybe someone out there knows how to make this aspect of SOM more robust?
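A simple way to "try a few seeds" is to fit the map once per seed and keep the run with the lowest quantization error (the mean distance from each sample to its closest code). The sketch below uses a deliberately tiny one-dimensional SOM in plain Python to show the pattern. The data, the update schedule and the choice of quantization error as the selection criterion are all assumptions, and `fit_chain_som` is a hypothetical stand-in for the real SOM fit.

```python
import random


def fit_chain_som(data, n_units, seed, n_iter=500):
    """Tiny 1-D SOM (units on a line) for scalar data; returns the code values."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    codes = [rng.choice(data) for _ in range(n_units)]
    for t in range(n_iter):
        alpha = 0.5 * (1 - t / n_iter)  # learning rate decays to zero
        x = rng.choice(data)
        bmu = min(range(n_units), key=lambda i: abs(codes[i] - x))
        for i in range(n_units):
            if abs(i - bmu) <= 1:  # update the winner and its chain neighbours
                codes[i] += alpha * (x - codes[i])
    return codes


def quantization_error(data, codes):
    """Mean distance from each sample to its closest code."""
    return sum(min(abs(x - c) for c in codes) for x in data) / len(data)


data = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
runs = {seed: fit_chain_som(data, n_units=3, seed=seed) for seed in (1, 2, 3, 4, 5)}
errors = {seed: quantization_error(data, codes) for seed, codes in runs.items()}
best_seed = min(errors, key=errors.get)
```

Recording `best_seed` alongside the figure keeps the chosen layout reproducible, and the spread of `errors` gives a feel for how sensitive the map is to initialization.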

2014 UC Davis Proteomics Workshop

Recently I had the pleasure of teaching data analysis at the 2014 UC Davis Proteomics Workshop. This included a hands-on lab for making gene ontology enrichment networks. You can check out my lecture and tutorial below or download all the material.



Creative Commons License
2014 UC Davis Proteomics Workshop by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Multivariate Data Analysis and Visualization Through Network Mapping

Recently I had the pleasure of speaking about one of my favorite topics, Network Mapping. This is a continuation of a general theme I’ve previously discussed and involves the merger of statistical and multivariate data analysis results with a network.

Over the past year I’ve been working on two major tools, DeviumWeb and MetaMapR, which aid the process of biological data (metabolomic) network mapping.


DeviumWeb is a shiny-based GUI written in R which is useful for:

  • data manipulation, transformation and visualization
  • statistical analysis (hypothesis testing, FDR, power analysis, correlations, etc)
  • clustering (hierarchical, TODO: k-means, SOM, distribution)
  • principal components analysis (PCA)
  • orthogonal partial least squares multivariate modeling (O-/PLS/-DA)


MetaMapR is also a shiny-based GUI written in R which is useful for calculation and visualization of various networks including:

  • biochemical
  • structural similarity
  • mass spectral similarity
  • correlation

Both of these projects are under development, and my ultimate goal is to design a one-stop-shop ecosystem for network mapping.

In addition to network mapping, the video above and presentation below also discuss normalization schemes for longitudinal data and genomic, proteomic and metabolomic functional analysis on both a pathway and a global level.

As always, happy network mapping!


High Dimensional Biological Data Analysis and Visualization

High dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experimental design, and made up of complex relationships driven by both biological and analytical sources of variance. Luckily the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action or view a more in-depth presentation of the techniques used below.

Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.

Tissue network cancer

Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Cancer tissue network

Hierarchical clusters (color) mapped to a biochemical and structural similarity network displaying differences before and after drug administration.

cough syrup network

Partial correlation network displaying changes in metabolite relationships in response to drug treatment.

Treatment response network

Partial correlation network displaying changes in disease and response to drug treatment.

Treatment effects network

Check out the full presentation below.


Connecting Data with Context: Metabolomic Examples

I recently gave a presentation of some of my work in network mapping to my research lab. The following covers my progress in the development of my metabolomic network mapping tool MetaMapR, and its application to a variety of data sets including a comparison of normal and malignant lung tissue from the same patient.

Network Mapping Video

Here are a video and slides for a presentation of mine about my favorite topic:

American Society for Mass Spectrometry 2013

I am getting ready to present at the upcoming American Society for Mass Spectrometry (ASMS) conference in Minneapolis, Minnesota (don’tcha know).

If you are around, check out my talk in the section Oral: ThOB am – Informatics: Metabolomics on Thursday (06/14) at 8:30 am in room L100. Here is a teaser:

WCMC network

Above is a network representation of biochemical (red edges, KEGG RPAIRS) and structural similarities (gray edges, Tanimoto coefficient > 0.7) of > 1100 biological molecules (see here for some of their descriptions). Keep an eye out for all the R code used to generate this network as well as all the slides from my talk.
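For readers unfamiliar with the structural similarity edges: the Tanimoto coefficient of two molecules is the number of fingerprint bits they share divided by the number of bits set in either. A minimal plain-Python sketch of the thresholding step follows. The fingerprints here are invented sets of bit positions and the molecule names are placeholders; real fingerprints would come from a cheminformatics source, which is not shown.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as collections of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_edges(fingerprints, threshold=0.7):
    """Edge list (u, v, score) for every pair scoring above the threshold."""
    names = sorted(fingerprints)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            score = tanimoto(fingerprints[u], fingerprints[v])
            if score > threshold:
                edges.append((u, v, score))
    return edges


# Invented fingerprints: sets of bit positions that are switched on.
fingerprints = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 4, 5},
    "mol_c": {1, 9},
}

edges = similarity_edges(fingerprints)  # only mol_a and mol_b are linked (4/5 = 0.8)
```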

Here is my talk abstract.

Multivariate and network tools for analysis and visualization of metabolomic data
Dmitry Grapov1, 2; Oliver Fiehn1, 2
1West Coast Metabolomics Center, Davis, CA; 2University of California Davis, Davis, California
NOVEL ASPECT: A software tool for calculation and mapping of statistical and multivariate results from metabolomic experiments into biologically relevant contexts.
INTRODUCTION: While a variety of tools capable of producing network representations of metabolomic data exist, none are fully integrated with statistical and multivariate methods necessary to analyze, visualize and summarize the high dimensional data. We have developed an open source toolset for the analysis of high dimensional biological data which combines the computational capabilities of the R statistical programming environment with the network mapping and visualization features of Cytoscape. A graphical user interface is used to seamlessly integrate calculation and interpretation of statistical and multivariate results in the context of network graphs which are constructed based on biological relationships, chemical similarities or empirical variable dependencies.
METHODS: An R based GUI utilizing RCytoscape and CytoscapeRPC is used to connect R and Cytoscape. Data import, manipulation and export are achieved through an interface to MS Excel and Google Docs. R packages provide a variety of analysis methods including: parametric and non-parametric multiple hypothesis testing, false discovery rate correction, exploratory principal and independent components analyses, hierarchical and model based clustering, and multivariate predictive modeling such as partial least squares and support vector machines. Relationships between biological parameters can be represented in the form of networks which are connected based on user defined edge lists or from PubChem chemical identifiers which are used to construct biochemical and chemical similarity networks based on the KEGG reactant pairs and Tanimoto distances, or Gaussian Markov networks based on partial correlations.
ABSTRACT: Comparisons of plasma primary metabolite excursion patterns during an oral glucose tolerance test (OGTT) were used to model changes in metabolism associated with a diet and exercise intervention. Plasma aliquots, taken at 30 minute intervals (0-120 minutes), were analyzed by GC/TOF and used to compare metabolite levels (n=323) in a cohort of overweight women before and after a 14 week dietary and exercise regimen. Mixed effects models, partial least squares and partial least squares discriminant analysis (PLS-DA) were used to study OGTT and intervention-associated changes in metabolite baselines, area under the curve for OGTT-associated excursions, and metabolite time course patterns. Metabolic changes due to the oral infusion of glucose were visualized by mapping statistical test p-values and variable coefficient weights, taken from an intervention-adjusted PLS model for time during the OGTT, into a network connected based on KEGG reactant pairs and Tanimoto distances > 70. Vertices, representing metabolites, were sized and colored based on the absolute PLS coefficient magnitude and sign respectively. Metabolites showing significant perturbations during the OGTT (false discovery rate (q = 0.05) adjusted p-value < 0.05) were highlighted with node-inset graphs displaying means and confidence intervals during the time course for before and after intervention comparisons. This network was useful for identifying OGTT-associated interactions between the major biochemical domains (lipids, amino acids, organic acids, and carbohydrates). In a follow-up analysis a Gaussian Markov partial correlation network was used to investigate intervention-associated changes in metabolite-metabolite and metabolite-clinical parameter (insulin, hormones) dependency relationships.
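As a footnote on the false discovery rate step: the abstract does not say which procedure was used, so the sketch below assumes the common Benjamini-Hochberg adjustment. It is plain Python with made-up p-values, and `bh_adjust` is a hypothetical helper name.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four metabolites.
pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = bh_adjust(pvalues)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```

With these numbers only the first metabolite stays below 0.05 after adjustment, so it would be the only node highlighted with an inset graph.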