Metabolomic network analysis can be used to interpret experimental results within a variety of contexts, including biochemical relationships, structural and spectral similarity, and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression-based predictive modeling. The combination of metabolomic networks and machine learning-based predictive models offers a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on the data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at www.createdatasol.com

The following presentation also shows a sneak peek of new data analysis and visualization software, DAVe: Data Analysis and Visualization engine. Check out some early features. DAVe is built in R and seeks to support a seamless environment for advanced data analysis and machine learning tasks, as well as biological functional and network analysis.

As an aside, building the main site (in progress) was a fun opportunity to experiment with Jekyll, Ruby and embedding slick interactive canvas elements into websites. You can check out all the code here: https://github.com/dgrapov/CDS_jekyll_site.

Metabolomics and the greater sphere of ‘Omic analyses are a burgeoning set of tools for investigating environmental and organismal mechanisms and interactions. Carrying out data analyses within the context of complex biological systems is rewarding but also difficult. The following presentation considers the components involved in conducting multivariate data analysis, modeling and visualization within biological contexts.

High dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experimental design and made up of complex relationships driven by both biological and analytical sources of variance. Luckily the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action or view a more in-depth presentation of the techniques used below.

Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.
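For readers new to partial correlations, here is a minimal sketch of the idea behind such a network: an edge between two metabolites reflects the correlation that remains after accounting for a third variable. It uses the first-order formula for three variables only. The data values and function names are made up for illustration, and this is not the code used to build the network above (that work was done in R).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: z drives both x and y, so x and y look strongly related
# until z is accounted for.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [zi + e for zi, e in zip(z, [0.5, -0.5, 0.3, -0.3, 0.2, -0.2])]
y = [2 * zi + e for zi, e in zip(z, [-0.4, 0.4, 0.1, -0.1, -0.3, 0.3])]

r_marginal = pearson(x, y)
r_partial = partial_correlation(x, y, z)
print(round(r_marginal, 3), round(r_partial, 3))
```

In a full network the same idea is applied across all variables at once, typically through the inverse of the correlation matrix, which is why partial correlation networks tend to be sparser than plain correlation networks.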

Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Oh oh, there seem to be some outliers: serum samples looking like urine and vice versa. Fix these and evaluate using PCA and hierarchical clustering on rank correlations.

Now things look more believable. Next let us test the effects of data pre-treatment on PLS-DA model scores for a 3 group comparison in serum. Ideally group scores would be maximally resolved in the dimension of the first latent variable (x), and within-group variance would be orthogonal to it, i.e. along the y-axis.

Compared to the raw data (TOP), where ~3 top variables (glucose, urea and mannitol) dominate the variance structure, the autoscaled model displays a more balanced contribution of variables to the scores variance, due to variable-wise mean subtraction and division by the standard deviation. The larger separation between WHITE and RED class scores along the x-axis suggests improved classifier performance over the raw data model. An overview of samples with scores outside their respective group’s Hotelling’s T² ellipse (95%) might point to a sample outlier to investigate further or potentially exclude from the current test.
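To make the autoscaling step concrete, here is a minimal sketch of variable-wise mean subtraction and division by the standard deviation. The metabolite values are hypothetical, the `autoscale` helper is a name chosen for this example, and the sample (n-1) standard deviation is an assumption; the original models were built in R.

```python
import statistics

def autoscale(columns):
    """Center each variable on its mean and divide by its sample standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled

# Hypothetical intensities: glucose is orders of magnitude larger than tryptophan,
# so it would dominate the variance of an unscaled model.
raw = {
    "glucose": [5000.0, 5200.0, 4800.0, 5100.0],
    "tryptophan": [1.0, 1.2, 0.8, 1.1],
}
scaled = autoscale(raw)

for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, every variable has a mean of 0 and a standard deviation of 1, so high abundance metabolites no longer dominate the model simply because of their magnitude.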

Figure 1. The type 2 diabetes-associated lipidomic changes projected in context of their biological relationships in obese African-American women.

Metabolites are represented by circular “nodes” linked by “edges” with arrows designating the direction of the biosynthetic gradient (i.e. substrate to product). Some metabolites are linked by more than one enzymatic step. Node sizes represent magnitudes of differences in plasma metabolite geometric means (ΔGM). Arrow widths represent magnitudes of changes in product over substrate ratios (ΔP:S). Colors of node borders and arrows represent the significance and direction of changes relative to non-diabetics as per the figure legend. Differences are significant at p<0.05 by Mann-Whitney U test adjusted for FDR (q = 0.1).
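The caption does not say which FDR procedure was used. As a sketch only, and assuming the widely used Benjamini-Hochberg adjustment, the calculation for a handful of p-values could look like this. The p-values are invented for illustration and `benjamini_hochberg` is a name chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, e.g. from per-metabolite Mann-Whitney U tests.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
significant = [qi <= 0.1 for qi in q]
print([round(qi, 4) for qi in q], significant)
```

With these made-up values, the first three tests pass a q = 0.1 threshold and the fourth does not.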

Scatterplot matrix for overview of correlations and regressions, displaying box plots for Iris data species, variable histograms, correlation statistics, stripcharts and best fit lines.

Spearman’s correlations were used to generate multi-dimensionally scaled parameter connectivity networks for variable intercorrelations. Networks were oriented with fasting glucose at the origin and SFA in the lower right quadrant. Colored ellipses represent the 95% probability locations of metabolite classes (Hotelling’s T², p<0.05). Nodes indicate clinical parameters (diamonds), <20-carbon fatty acid metabolites (circles) and ≥20-carbon fatty acid metabolites (triangles), with discriminant model variables and glucose enlarged. Significant correlations between species are designated by orange (positive) or blue (negative) connecting lines (p<0.05, non-diabetic; p<0.01, diabetic participants).
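As a reminder of what Spearman’s correlation measures, here is a minimal sketch: rank each variable (ties get the average rank) and compute the Pearson correlation of the ranks. The function names and example values are chosen for this illustration and are not taken from the study.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranked values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# A monotonic but non-linear relationship still gives a perfect rank correlation.
rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
print(rho)
```

Because only the ordering of values matters, the statistic is less sensitive to skewed distributions and extreme values than Pearson’s correlation on raw concentrations.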

Horizontal scatter plots of the log-transformed concentrations for each model variable are shown. The horizontal arrangement of metabolite scatter plots is scaled to their loading in the discriminant model. A given species’ importance in the classification increases with increasing displacement from the origin (broken line). The direction of the displacement designates whether the species was decreased (left) or increased (right) in the diabetic relative to the non-diabetic patients. The overall model discrimination performance is presented as a scatter plot of subject model scores (inset).