When you want to get to know and love your data


Using Repeated Measures to Remove Artifacts from Longitudinal Data

Recently I was tasked with evaluating and, most importantly, removing analytical variance from a longitudinal metabolomic analysis carried out over a few years and comprising >25,000 measurements for >5,000 patients. Even with state-of-the-art analytical instruments and techniques, long-term biological studies are plagued by unwanted trends which are unrelated to the original experimental design and stem from analytical sources of variance (noise added by the process of measurement). Below is an example of a metabolomic measurement with and without analytical variance.


The noise pattern can be estimated from replicated measurements of quality control samples embedded at a ratio of 1:10 within the larger experimental design. Data normalization is then used to separate analytical noise from biological signal on a variable-specific basis. At the bottom of this post you can find an in-depth presentation of how data quality can be estimated, along with a comparison of many common data normalization approaches. From my analysis I concluded that a relatively simple LOESS normalization is a very powerful method for removing analytical variance. While LOESS (or LOWESS), locally weighted scatterplot smoothing, is a relatively simple approach to implement, great care has to be taken when optimizing each variable-specific model.

In particular, the span parameter (alpha) controls the degree of smoothing and is a major determinant of whether the model (calculated from repeated measures) is underfit, just right, or overfit with regard to correcting analytical noise in samples. Below is a visualization of the effect of the span parameter on the model fit.
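To make the effect concrete, here is a small synthetic sketch (simulated drift and noise, not the study data) of how the span changes a loess fit: a small span chases the noise, while a large span smooths away real structure.

```r
# Simulate a slow analytical drift plus measurement noise, then fit
# loess models with increasing span values (illustrative values only).
set.seed(1)
run.order <- 1:100
drift <- sin(run.order / 8)                  # slow analytical drift
y <- drift + rnorm(100, sd = 0.2)            # measurement = drift + noise

spans <- c(0.1, 0.5, 0.9)
fits <- lapply(spans, function(a) loess(y ~ run.order, span = a))

# residual standard error grows as the span increases and the fit
# becomes less flexible (eventually underfitting the drift)
sapply(fits, function(f) f$s)
```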



One way to estimate an appropriate span parameter is cross-validation using the quality control samples. Having identified an appropriate span, a LOESS model can be generated from the repeated-measures data (black points) and used to remove the analytical noise from all samples (red points).
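The whole procedure can be sketched in a few lines of base R on simulated data; the QC spacing, drift shape, and span value below are illustrative assumptions, not the study's settings.

```r
# Simulate one metabolite with multiplicative drift; QCs are replicate
# injections of one pooled sample, embedded roughly 1:10 in the run.
set.seed(2)
n <- 101
run.order <- 1:n
is.qc <- run.order %% 10 == 1                       # QCs at injections 1, 11, 21, ...
drift <- 1 + run.order / 200                        # analytical drift over the run
true.signal <- ifelse(is.qc, 1, rnorm(n, 1, 0.15))  # QCs share one true value
y <- true.signal * drift

# fit the drift model on QCs only (span chosen e.g. by cross-validation)
qc <- data.frame(run = run.order[is.qc], y = y[is.qc])
fit <- loess(y ~ run, data = qc, span = 0.75)

# predict the drift for every injection and divide it out
drift.hat <- predict(fit, newdata = data.frame(run = run.order))
normalized <- y / (drift.hat / median(drift.hat))
```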


Having done this, we can now evaluate the effect of removing analytical noise from quality control samples (QCs, training data, black points above) and experimental samples (test data, red points) by calculating the relative standard deviation of the measured variable (standard deviation / mean * 100). In the case of a single analyte, ornithine, we can see (above) that the LOESS normalization reduces the overall analytical noise to a large degree. However, we cannot expect the performance on the training data (noise only) to converge with that of the test set, which contains both noise and true biological signal.
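The RSD metric itself is a one-liner:

```r
# relative standard deviation (coefficient of variation, %) as defined above
rsd <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

rsd(c(9, 10, 11))  # sd = 1, mean = 10  ->  10 (%)
```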

In addition to evaluating the normalization-specific removal of analytical noise on a univariate level, we can also use principal components analysis (PCA) to evaluate it for all variables simultaneously. Below is an example of the PCA scores for non-normalized and LOESS-normalized data.

PCA normalizations

We can clearly see that the two largest modes of variance in the raw data separate samples by when they were analyzed; these are termed batch effects. Batch effects can mask true biological variability, and one goal of normalization is to remove them, which we can see is accomplished in the LOESS-normalized data (above right).
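This check is easy to sketch with simulated data (an artificial batch offset, not the study data): if the leading principal component separates the batches, a batch effect dominates the variance.

```r
# Simulate two analysis batches with a systematic offset, then inspect
# whether PC1 of the scores separates the batches.
set.seed(3)
batch <- rep(1:2, each = 25)
X <- matrix(rnorm(50 * 20), nrow = 50)        # 50 samples x 20 metabolites
X[batch == 2, ] <- X[batch == 2, ] + 2        # simulated batch effect

scores <- prcomp(X, center = TRUE, scale. = TRUE)$x

# a large PC1 separation between batch means indicates a batch effect
abs(mean(scores[batch == 1, 1]) - mean(scores[batch == 2, 1]))
```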

However, be forewarned: proper model validation is critical for avoiding over-fitting and producing complete nonsense.

bad norm

In case you are interested, the full analysis and presentation can be found below, as well as the majority of the R code used for the analysis and visualizations.


High Dimensional Biological Data Analysis and Visualization

High-dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experimental design, and made up of complex relationships driven by both biological and analytical sources of variance. Luckily the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high-dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action or view a more in-depth presentation of the techniques used below.

Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.

Tissue network cancer

Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Cancer tissue network

Hierarchical clusters (color) mapped to a biochemical and structural similarity network displaying difference before and after drug administration.

cough syrup network

Partial correlation network displaying changes in metabolite relationships in response to drug treatment.

Treatment response network

Partial correlation network displaying changes in disease and response to drug treatment.

Treatment effects network

Check out the full presentation below.


Tutorials- Statistical and Multivariate Analysis for Metabolomics

I recently had the pleasure of participating in the 2014 WCMC Statistics for Metabolomics Short Course. The course was hosted by the NIH West Coast Metabolomics Center and focused on statistical and multivariate strategies for metabolomic data analysis. A variety of topics were covered in eight hands-on tutorials, which focused on:

  • data quality overview
  • statistical and power analysis
  • clustering
  • principal components analysis (PCA)
  • partial least squares (O-/PLS/-DA)
  • metabolite enrichment analysis
  • biochemical and structural similarity network construction
  • network mapping

I am happy to have taught the course using all open-source software, including R and Cytoscape. The data analysis and visualization were done using the Shiny-based apps DeviumWeb and MetaMapR. Check out some of the slides below or download all the class material and try it out for yourself.

Creative Commons License
2014 WCMC LC-MS Data Processing and Statistics for Metabolomics by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Special thanks to the developers of Shiny and Radiant by Vincent Nijs.

Orthogonal Signal Correction Partial Least Squares (O-PLS) in R


I often need to analyze and model very wide data (variables >> samples), and because of this I gravitate toward robust yet relatively simple methods. In my opinion partial least squares (PLS) is a particularly useful algorithm. Simply put, PLS is an extension of principal components analysis (PCA), an unsupervised method that maximizes the variance explained in X; PLS instead maximizes the covariance between X and Y(s). Orthogonal signal correction partial least squares (O-PLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between X and Y on the first latent variable, while components >1 capture variance in X which is orthogonal (unrelated) to Y.
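For illustration, the orthogonal correction step can be sketched in base R, roughly following Trygg and Wold's O-PLS algorithm for a single y and one orthogonal component. This is a simplified sketch, not the package implementation; the function name and interface are invented for the example.

```r
# One round of orthogonal signal correction (O-PLS style).
# X (n x p) and y (length n) are assumed to be column-centered.
osc.pls <- function(X, y) {
  w <- crossprod(X, y)                  # covariance-based weight vector
  w <- w / sqrt(sum(w^2))
  t <- X %*% w                          # predictive scores
  p <- crossprod(X, t) / sum(t^2)       # X loadings for t
  w.o <- p - c(crossprod(w, p)) * w     # part of p orthogonal to w
  w.o <- w.o / sqrt(sum(w.o^2))
  t.o <- X %*% w.o                      # orthogonal scores
  p.o <- crossprod(X, t.o) / sum(t.o^2)
  X.corr <- X - t.o %*% t(p.o)          # remove y-orthogonal variation from X
  list(X.corr = X.corr, scores = t, scores.orth = t.o, w = w, w.orth = w.o)
}
```

A standard PLS model (for example via the pls package) would then be fit on X.corr.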

Because R does not have a simple interface for O-PLS, I am in the process of writing a package, which depends on the existing package pls.

Today I wanted to make a small example of conducting O-PLS in R, and at the same time take a moment to try out the R package knitr and RStudio for markdown generation.

You can take a look at the O-PLS/O-PLS-DA tutorials.

I was extremely impressed with the ease of using knitr and generating markdown from code within RStudio. A big thank-you to Yihui Xie and the RStudio developers (Joe Cheng). This is an amazing capability which I will make much more use of in the future!

Interactive Heatmaps (and Dendrograms) – A Shiny App

Heatmaps are a great way to visualize data matrices. Heatmap color and organization can be used to encode information about the data and metadata, helping you learn about the data at hand. Examples include looking at the raw data, or hierarchically clustering samples and variables based on their similarities or differences. There are a variety of packages and functions in R for creating heatmaps, including heatmap.2. I find pheatmap particularly useful for the relative ease of annotating the top of the heatmap with an arbitrary number of items (the legend needs to be controlled for best effect; not implemented here).
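A base-R approximation of the same idea looks like this (pheatmap adds the top annotation panel via its annotation_col argument; the sketch below uses only base functions):

```r
# Z-scale the variables, hierarchically cluster rows and columns,
# and draw the clustered heatmap with base R.
mat <- scale(as.matrix(mtcars))                  # Z-scores per variable
hc.row <- hclust(dist(mat))                      # cluster samples
hc.col <- hclust(dist(t(mat)))                   # cluster variables
heatmap(mat, Rowv = as.dendrogram(hc.row), Colv = as.dendrogram(hc.col),
        scale = "none")
```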

Heatmaps are also fun to use to interact with data!

Here is an example of a Heatmap and Dendrogram Visualizer built using the Shiny framework (and link to the code).

To run locally use the following code.

library(shiny)
runGitHub("Devium", username = "dgrapov", ref = "master", subdir = "Shiny/Heatmap", port = 8100)

It was interesting to debug this app using the variety of data sets available in the R datasets package (limiting the options to data.frames).


My goals were to make an interface to:

  • transform data and visualize using Z-scales, Spearman, Pearson and biweight correlations
  • rotate the data (transpose dimensions) to view row or column space separately
  • visualize data/relationships presented as heatmaps or dendrograms
  • use hierarchical clustering to organize data
  • add a top panel of annotation to display variables independent of the internal heatmap scales
  • use a slider to visually select number(s) of sample or variable clusters (dendrogram cut height)

There are a few other options, such as changing heatmap color scales or adding borders and names, that you can experiment with. I've preloaded many famous data sets found in the R datasets package; a few of my favorites are iris and mtcars. There are other datasets, some of which were useful for incorporating into the build to facilitate debugging and testing. The aspect of dimension switching was probably the most difficult to keep straight (never mind legends; these may be the hardest of all). What remain are (I hope) informative errors, usually coming from stats and data-dimension mismatches. Try taking a look at the data structure on the "Data" tab, or switching the UI options for Data, Dimension or Transformation until issues resolve. A final note before mentioning a few points about working with Shiny: missing data is set to zero, and factors are omitted when making the internal heatmap but allowed in the top row annotations.
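The cluster slider corresponds to cutting the dendrogram at a chosen point, which in R is cutree(); a minimal sketch using mtcars:

```r
# Cut the sample dendrogram into k clusters; in the app, k (or the
# cut height h) would come from the slider.
hc <- hclust(dist(scale(as.matrix(mtcars))))     # sample dendrogram
clusters <- cutree(hc, k = 3)                    # cutree(hc, h = ...) also works
table(clusters)                                  # cluster sizes
```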

Building with R and Shiny

This was my third try at building web applications with R using Shiny.

Here are some other examples:

Basic plotting with ggplot2

Principal Components Analysis (I suggest loading a simple .csv with headers)

It has definitely gotten easier to build UIs and deploy them to the web using the excellent RStudio and Shiny tools. Unfortunately this leaves me more time to be confused by "server side" issues.

My overall thoughts (so far):

  • I have a lot to learn and the possibilities are immense
  • when things work as expected it is a stupendous joy! (thank you to Shiny, R and everyone who helped!)
  • when tracking down unexpected behavior, I found it helpful to print the app state at different levels to the browser using some simple mechanism, for instance:
# partial example

# create a reactive object to "listen" for changes in states or R objects of interest
ui.opts <- reactiveValues()
ui.opts$data <- tmp.data

# print objects/info to the browser
output$any.name <- renderPrint({
  ui.opts$data
})

Overall, two thumbs up.

Tutorial- Building Biological Networks

I love networks! Nothing is better for visualizing complex multivariate relationships, be they social, virtual or biological.

I recently gave a hands-on tutorial on using R and Cytoscape to build large biological networks. In these networks nodes represent metabolites, and edges can be many things; I specifically focused on biochemical relationships and chemical similarities. Your imagination is the limit.

genotype network


network DM

If you are interested check out the presentation below.

Here is all the R code, plus links to the relevant data, that you will need to follow along with the tutorial.

#load needed functions: R package in progress - "devium", which is stored on github
# get sample chemical identifiers here:https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1
#Pubchem CIDs = cids
cids # overview
nrow(cids) # how many
str(cids) # structure; want numeric 
cids<-as.numeric(as.character(unlist(cids))) # hack to break factor levels

#making an edge list based on CIDs from KEGG reactant pairs
dim(KEGG.edge.list) # a two-column list with CID-to-CID connections based on KEGG RPAIRS
# how did I get this?
#1) convert from CID to KEGG using get.CID.KEGG.pairs(), which is a table stored:https://gist.github.com/dgrapov/4964546
#2) get KEGG RPAIRS using get.KEGG.pairs() which is a table stored:https://gist.github.com/dgrapov/4964564
#3) return CID pairs

#get EDGES based on chemical similarity (Tanimoto similarity > 0.7)
tanimoto.edges<-CID.to.tanimoto(cids = cids, cut.off = 0.7, parallel = FALSE)
# how did I get this?
#1) Use the R package ChemmineR to query PubChem PUG for molecular fingerprints
#2) calculate similarity coefficients
#3) return edges with similarity above cut.off

#after a little bit of formatting make combined KEGG + tanimoto edge list
# https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=2

#now upload this and a sample node attribute table (https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1)
#to Cytoscape 

You can also download all the necessary materials HERE, which include:

  1. tutorial in powerpoint
  2. R script
  3. Network edge list and node attributes table
  4. Cytoscape file

Happy network making!


Data analysis approaches to modeling changes in primary metabolism