When you want to get to know and love your data

Archive for June, 2014

Multivariate Data Analysis and Visualization Through Network Mapping


Recently I had the pleasure of speaking about one of my favorite topics, Network Mapping. This is a continuation of a general theme I’ve previously discussed, and it involves merging statistical and multivariate data analysis results with a network.



Over the past year I’ve been working on two major tools, DeviumWeb and MetaMapR, which aid the process of network mapping for biological (metabolomic) data.

DeviumWeb

DeviumWeb is a Shiny-based GUI written in R that is useful for:

  • data manipulation, transformation and visualization
  • statistical analysis (hypothesis testing, FDR, power analysis, correlations, etc.)
  • clustering (hierarchical; TODO: k-means, SOM, distribution)
  • principal components analysis (PCA)
  • orthogonal partial least squares multivariate modeling (O-/PLS/-DA)

MetaMapR

MetaMapR is also a Shiny-based GUI written in R that is useful for calculating and visualizing various types of networks, including:

  • biochemical
  • structural similarity
  • mass spectral similarity
  • correlation (see the sketch just below)
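
To give a flavor of that last item, a correlation network can be sketched by thresholding a correlation matrix into an adjacency matrix and handing it to igraph. This is only a minimal illustration with made-up data and an arbitrary cutoff, not MetaMapR’s actual implementation:

```r
# Minimal correlation-network sketch (illustrative; not MetaMapR's code).
library(igraph)

# hypothetical samples x metabolites matrix
metabolites <- matrix(rnorm(100 * 20), ncol = 20,
                      dimnames = list(NULL, paste0("met_", 1:20)))

cor_mat <- cor(metabolites, method = "spearman")  # pairwise correlations
adj <- 1 * (abs(cor_mat) >= 0.6)                  # arbitrary edge threshold
diag(adj) <- 0                                    # no self-loops

net <- graph_from_adjacency_matrix(adj, mode = "undirected")
plot(net, vertex.size = 5, vertex.label.cex = 0.7)
```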


Both of these projects are under development, and my ultimate goal is to design a one-stop-shop ecosystem for network mapping.


In addition to network mapping, the video above and the presentation below also discuss normalization schemes for longitudinal data, as well as genomic, proteomic and metabolomic functional analysis at both the pathway and global levels.


As always, happy network mapping!

Creative Commons License


ASMS 2014


I recently participated in the American Society for Mass Spectrometry (ASMS) conference and had a great time. I met some great people and came away with a few new ideas for future projects, specifically giving self-organizing maps (SOM) and the R package mcclust a try as alternatives to hierarchical and k-means clustering.


I had the pleasure of speaking at the conference in the Informatics-Metabolomics section, and was also a co-author on a project detailing a multi-metabolomics strategy (primary metabolites, lipids, and oxylipins) for the study of type 1 diabetes in an animal model. Keep an eye out for my full talk in an upcoming post.

[Slides: ASMS 2014]


Using Repeated Measures to Remove Artifacts from Longitudinal Data


Recently I was tasked with evaluating and, most importantly, removing analytical variance from a longitudinal metabolomic analysis carried out over a few years and including >25,000 measurements for >5,000 patients. Even with state-of-the-art analytical instruments and techniques, long-term biological studies are plagued with unwanted trends that are unrelated to the original experimental design and stem from analytical sources of variance (noise added by the process of measurement). Below is an example of a metabolomic measurement with and without analytical variance.

[Figure: metabolomic measurement with and without analytical variance]


The noise pattern can be estimated from replicated measurements of quality control samples embedded at a ratio of 1:10 within the larger experimental design. Data normalization is then used to separate analytical noise from biological signal on a variable-specific basis. At the bottom of this post you can find an in-depth presentation of how data quality can be estimated, along with a comparison of many common data normalization approaches. From my analysis I concluded that a relatively simple LOESS normalization is a very powerful method for removing analytical variance. While LOESS (or LOWESS, locally weighted scatterplot smoothing) is a relatively simple approach to implement, great care has to be taken when optimizing each variable-specific model.
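
As a concrete (and deliberately simplified) illustration of the idea, here is a minimal sketch of a variable-specific LOESS correction using base R’s loess(); the data, column names, and drift pattern are all invented, and this is not the exact code from the analysis:

```r
# Minimal LOESS normalization sketch for a single metabolite (illustrative only).
set.seed(1)
n <- 500
dat <- data.frame(
  run_order = seq_len(n),                               # injection/acquisition order
  qc        = seq_len(n) %% 10 == 0,                    # ~1:10 embedded QC samples
  signal    = 100 + 0.05 * seq_len(n) + rnorm(n, 0, 5)  # measurement with drift
)

# fit the analytical drift using the repeated QC measurements only
fit <- loess(signal ~ run_order, data = dat[dat$qc, ], span = 0.75,
             control = loess.control(surface = "direct"))  # allow prediction outside QC range

# predict the drift for every sample and remove it
drift <- predict(fit, newdata = dat)
dat$normalized <- dat$signal / drift * median(dat$signal[dat$qc])
```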

In particular, the span parameter (alpha) controls the degree of smoothing and is a major determinant of whether the model (calculated from the repeated measures) is underfit, well fit, or overfit with regard to correcting analytical noise in the samples. Below is a visualization of the effect of the span parameter on the model fit.

[Figure: effect of the LOESS span parameter on model fit]



One method for estimating the appropriate span parameter is cross-validation with the quality control samples. Having identified an appropriate span, a LOESS model can be generated from the repeated-measures data (black points) and used to remove the analytical noise from all samples (red points).
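
One way to sketch that span search is a leave-one-out cross-validation over the QC samples, scoring each candidate span by its squared prediction error. Again, this is an illustrative outline (reusing the hypothetical dat from the sketch above), not the code used in the actual analysis:

```r
# Illustrative span selection by leave-one-out cross-validation on the QC samples.
qc_dat <- dat[dat$qc, ]
spans  <- seq(0.2, 1, by = 0.1)

cv_error <- sapply(spans, function(s) {
  errs <- sapply(seq_len(nrow(qc_dat)), function(i) {
    fit  <- loess(signal ~ run_order, data = qc_dat[-i, ], span = s,
                  control = loess.control(surface = "direct"))
    pred <- predict(fit, newdata = qc_dat[i, , drop = FALSE])
    (qc_dat$signal[i] - pred)^2
  })
  mean(errs)  # mean squared prediction error for this span
})

best_span <- spans[which.min(cv_error)]
```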

[Figure: LOESS model fit to QC samples (black points) and applied to all samples (red points)]

Having done this, we can evaluate the effect of removing analytical noise from the quality control samples (QCs, training data, black points above) and the experimental samples (test data, red points) by calculating the relative standard deviation of the measured variable (standard deviation / mean * 100). In the case of the single analyte ornithine, we can see (above) that the LOESS normalization reduces the overall analytical noise to a large degree. However, we cannot expect the performance for the training data (noise only) to converge with that of the test set, which contains both noise and true biological signal.
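
For reference, the %RSD comparison described above amounts to a couple of lines; here is a hypothetical continuation of the earlier sketch:

```r
# Relative standard deviation (%RSD) before and after normalization (illustrative).
rsd <- function(x) sd(x) / mean(x) * 100

with(dat[dat$qc, ],  c(raw = rsd(signal), normalized = rsd(normalized)))  # QCs (training)
with(dat[!dat$qc, ], c(raw = rsd(signal), normalized = rsd(normalized)))  # samples (test)
```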

In addition to evaluating the normalization-specific removal of analytical noise at the univariate level, we can also use principal components analysis (PCA) to evaluate it for all variables simultaneously. Below is an example of the PCA scores for non-normalized and LOESS-normalized data.

[Figure: PCA scores for non-normalized and LOESS-normalized data]


We can clearly see that the two largest modes of variance in the raw data capture differences in when the samples were analyzed, termed batch effects. Batch effects can mask true biological variability, and one goal of normalization is to remove them, which we can see is accomplished in the LOESS-normalized data (above right).
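
A minimal sketch of this multivariate check with prcomp(), assuming hypothetical raw and normalized samples × variables matrices and a batch label (none of these objects come from the actual analysis):

```r
# Illustrative PCA check for batch effects (all objects are hypothetical stand-ins).
raw_data        <- matrix(rnorm(200 * 30), nrow = 200)
normalized_data <- matrix(rnorm(200 * 30), nrow = 200)
batch           <- factor(rep(1:4, each = 50))  # analysis batch per sample

pca_raw  <- prcomp(raw_data,        center = TRUE, scale. = TRUE)
pca_norm <- prcomp(normalized_data, center = TRUE, scale. = TRUE)

# score plots colored by batch: clustering by color indicates batch effects
par(mfrow = c(1, 2))
plot(pca_raw$x[, 1:2],  col = batch, pch = 19, main = "raw")
plot(pca_norm$x[, 1:2], col = batch, pch = 19, main = "LOESS normalized")
```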


However, be forewarned: proper model validation is critical to avoid over-fitting and producing complete nonsense.

[Figure: example of an over-fit normalization]

In case you are interested, the full analysis and presentation can be found below, as well as the majority of the R code used for the analysis and visualizations.

Creative Commons License