## Principal Components Analysis Shiny App

I’ve recently started experimenting with making Shiny apps, and today I wanted to make a basic app for calculating and visualizing principal components analysis (PCA). Here is the basic interface I came up with. Test drive the app for yourself or check out the the R code HERE.

library(shiny) runGist("5846650")

Above is an example of the user interface which consists of data upload (from.csv for now), and options for conducting PCA using the pcaMethods package. The various outputs include visualization of the eigenvalues and cross-validated eigenvalues (q2), which are helpful for selecting the optimal number of model components.The PCA scores plot can be used to evaluate extreme (leverage) or moderate (DmodX) outliers. A Hotelling’s T-squared confidence intervals as an ellipse would also be a good addition for this.

The variable loadings can be used to evaluate the effects of data scaling and other pre-treatments.

The next step is to interface the calculation of PCA to a dynamic plot which can be used to map meta data to plotting characteristics.

## Biochemical, Chemical and Mass Spectral Similarity Network

Here is an example of a network leveraging three dominant aspects of metabolomic experiments (biochemical, chemical and mass spectral knowledge) to connect measured variables. This is a network for a blinded data set (sample ids are not known), which I’ve made for a member of my lab presenting their work at the Metabolomics Society Conference in Glasgow, Scotland.

With out knowing the experimental design we can still analyze our data for analytical effects. For example below is a principal components analysis of ~400 samples and 600 variables, where I’ve annotated the sample scores to show data aquisition date (color) and experimental samples or laboratory quality controls (shape). One thing to look for are trends or scores grouping in the PCA scores which are correlated to analytical conditions like, batch, date, technician, etc.

Finally we can take a look at the PCA variable loadings which highlights a major bottleneck in metabolomics experiments, the large amount of structurally unknown molecular features.

Even using feature rich electron impact mass spectra (GC/TOF) only 40% of the unknown variables could be connected to known species based on a cosine correlation >0.75. To give you an idea the cosine correlation or dot product between the mass spectra of two structurally very similar molecules xylose and xylitol is ~ 0.8.

## American Society for Mass Spectrometry 2013

I am getting ready to present at the upcoming American Society for Mass Spectrometry (ASMS) conference in Minneapolis, Minnesota (dont’cha know).

If you are around check out my talk in the section Oral: ThOB am – Informatics: Metabolomics on Thursday (06/14) at 8:30 am in room L100. Here is teaser

Above is a network representation of biochemical (red edges, KEGG RPAIRS) and structural similarities (gray edges, Tanimoto coefficient> 0.7) of > 1100 biological molecules (see here for some of their descriptions). Keep an eye out for all the R code used to generate this network as well as all the slides from my talk.

Here is my talk abstract.

**Multivariate and network tools for analysis and visualization of metabolomic data**

^{1, 2}; Oliver Fiehn

^{1, 2}

^{1}West Coast Metabolomics Center, Davis, CA;

^{2}University of California Davis, Davis, California

**NOVEL ASPECT:**A software tool for calculation and mapping of statistical and multivariate results from metabolomic experiments into biologically relevant contexts.

**INTRODUCTION:**While a variety of tools capable of producing network representations of metabolomic data exist, none are fully integrated with statistical and multivariate methods necessary to analyze, visualize and summarize the high dimensional data. We have developed an open source toolset for the analysis of high dimensional biological data which combines the computational capabilities of the R statistical programming environment with the network mapping and visualization features of Cytoscape. A graphical user interface is used to seamlessly integrate calculation and interpretation of statistical and multivariate results in the context of network graphs which are constructed based on biological relationships, chemical similarities or empirical variable dependencies.

**METHODS:**An R based GUI utilizing RCytoscape and CytoscapeRPC is used to connect R and Cytoscape. Data import, manipulation and export are achieved through an interface to MS Excel and Google Docs. R packages provide a variety of analyses methods including: parametric and non-parametric multiple hypotheses testing, false discovery rate correction, exploratory principal and independent components analyses, hierarchical and model based clustering, and multivariate predictive modeling such as partial least squares and support vector machines. Relationships between biological parameters can be represented in the form of networks which are connected based on user defined edge lists or from pubchem chemical identifiers which are used to construct biochemical and chemical similarity networks based on the KEGG reactant pairs and Tanimoto distances, or Gaussian Markov networks based partial correlations.

**ABSTRACT:**Comparisons of plasma primary metabolite excursion patterns during an oral glucose tolerance test (OGTT) were used to model changes in metabolism associated with a diet and exercise intervention. Plasma aliquots, taken at 30 minute intervals (0-120 minutes) were analyzed by GC/TOF and used to compare metabolite levels (n=323) in a cohort of overweight women before and after a 14 week dietary and exercise regimen. Mixed effects models, partial least squares and partial least squares discriminant analysis (PLS-DA) were used to study OGTT and intervention-associated changes in metabolite baselines, area under the curve for OGTT-associated excursions , and metabolite time course patterns. Metabolic changes due to the oral infusion of glucose were visualized by mapping statistical test p-values and intervention-adjusted PLS model for time during the OGTT variable coefficient weights into a network connected based on KEGG reactant pairs and Tanimoto distances > 70. Vertices, representing metabolites were sized and colored based on the absolute PLS coefficient magnitude and sign respectively. Metabolites showing significant perturbations during the OGTT (false discovery rate (q = 0.05) adjusted p-value < 0.05) were highlighted with node-inset graphs displaying means and confidence intervals during the time course for before and after intervention comparisons. This network was useful for identifying OGTT-associated interactions between the major biochemical domains (lipids, amino acids, organic acids, and carbohydrates). In a follow-up analysis a Gaussian Markov partial correlation network was used to investigate intervention-associated changes in metabolite-metabolite and metabolite-clinical parameter (insulin, hormones) dependency relationships.