## Anaerobic Stress in Seeds – A Chemical Similarity Network Story

The chemical similarity network (CSN) is a great tool for organizing biological data based on known biochemistry or chemical structural similarity. Here is an example CSN for visualizing metabolomic changes (measured via GC/TOF) due to anaerobic stress in germinating seeds.

In this network, edges are formed for chemical similarity scores > 75. Node color describes a significant (adjusted p-value < 0.05, q-value = 0.05, paired t-test) increase (red), decrease (blue) or no change (gray) in anaerobic relative to aerobic treatments. Node size is inversely proportional to the test's p-value.

This CSN was not hard to construct and minimally requires knowledge of analyte PubChem chemical identifiers (CIDs). CIDs can be used to calculate the chemical similarity matrix using online tools provided by PubChem. This symmetric matrix can be easily formatted to create an edge list containing the basic information: source, target and similarity score.

Here is a function for converting square symmetric matrices to edge lists using the R statistical programming environment.

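```r
mat.to.edge.list <- function(mat) {
  # accessory function: index pairs for all off-diagonal combinations
  # (type = "one" gives first < second, i.e. the upper triangle)
  all.pairs <- function(r, type = "one") {
    switch(type,
      one = list(first  = rep(1:r, rep(r, r))[lower.tri(diag(r))],
                 second = rep(1:r, r)[lower.tri(diag(r))]),
      two = list(first  = rep(1:r, r)[lower.tri(diag(r))],
                 second = rep(1:r, rep(r, r))[lower.tri(diag(r))]))
  }

  ids <- all.pairs(ncol(mat))

  # one row per pair: source name, target name, matrix value
  tmp <- as.data.frame(do.call("rbind", lapply(1:length(ids$first), function(i) {
    value <- mat[ids$first[i], ids$second[i]]
    name  <- c(colnames(mat)[ids$first[i]], colnames(mat)[ids$second[i]])
    c(name, value)
  })))
  colnames(tmp) <- c("source", "target", "value")
  # note: c(name, value) coerces to character, so "value" is not numeric here;
  # convert with as.numeric(as.character(tmp$value)) before filtering on score
  return(tmp)
}
```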

The function mat.to.edge.list will convert a square symmetric matrix to an edge list through the extraction of the upper triangle excluding the diagonal or self edges.
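As a hedged alternative, the same upper-triangle extraction can be sketched in vectorized form with base R's `upper.tri` and `which(..., arr.ind = TRUE)`. The helper name `upper.to.edge.list` and the toy matrix are assumptions made for illustration; they are not part of the original analysis.

```r
# Toy symmetric similarity matrix; names and scores are made up for illustration.
sim <- matrix(c(100,  80,  40,
                 80, 100,  90,
                 40,  90, 100),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Hypothetical helper: vectorized conversion of the upper triangle to an edge list.
upper.to.edge.list <- function(mat) {
  # row/column index pairs with row < col (diagonal and self edges excluded)
  idx <- which(upper.tri(mat), arr.ind = TRUE)
  data.frame(source = rownames(mat)[idx[, "row"]],
             target = colnames(mat)[idx[, "col"]],
             value  = mat[idx],          # stays numeric
             stringsAsFactors = FALSE)
}

edges <- upper.to.edge.list(sim)

# Keep only edges with a similarity score above 75, as in the network above.
csn.edges <- edges[edges$value > 75, ]
```

Because `value` stays numeric in this version, the score filter can be applied directly.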

This edge list can now be visualized as a CSN in network visualization software (see brief instructions here). I prefer to use Cytoscape for this. The edge list simply specifies which vertices, or nodes, representing metabolites should be connected.

An additional node annotation or attribute table can also be imported into Cytoscape and used to alter the node properties based on statistical results.
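Here is a minimal sketch of how such a node attribute table might be built in R. The data are simulated, the metabolite names are placeholders, and the Benjamini-Hochberg adjustment and the `-log10(p)` size mapping are assumptions; the post does not state which adjustment or size scaling was used.

```r
set.seed(1)
# Simulated data for illustration only: 6 paired samples x 4 hypothetical metabolites.
aerobic   <- matrix(rnorm(24, mean = 10), nrow = 6,
                    dimnames = list(NULL, paste0("met", 1:4)))
anaerobic <- aerobic + matrix(rnorm(24, sd = 0.2), nrow = 6)
anaerobic[, "met1"] <- anaerobic[, "met1"] + 3   # simulated increase
anaerobic[, "met2"] <- anaerobic[, "met2"] - 3   # simulated decrease

# Paired t-test per metabolite, then multiple-testing adjustment (BH is assumed).
p.raw <- sapply(colnames(aerobic), function(m)
  t.test(anaerobic[, m], aerobic[, m], paired = TRUE)$p.value)
p.adj  <- p.adjust(p.raw, method = "BH")
change <- colMeans(anaerobic) - colMeans(aerobic)

node.attr <- data.frame(
  name  = colnames(aerobic),
  p.adj = p.adj,
  # red = significant increase, blue = significant decrease, gray = no change
  color = ifelse(p.adj >= 0.05, "gray", ifelse(change > 0, "red", "blue")),
  # one possible inverse mapping: smaller p-value gives a larger node
  size  = -log10(p.raw),
  stringsAsFactors = FALSE
)
```

A table like `node.attr` can be saved with `write.csv` and imported into Cytoscape as a node attribute table, then mapped to node color and size.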

December 31, 2012 | Categories: Uncategorized | Tags: chemical similarity network, Cytoscape, ExCytR, metabolomics, network, R

## Making Chemical Similarity Networks

Chemical similarity networks (CSNs) can be used to explore multivariate metabolomic data within a biological context. In a CSN, nodes represent metabolites and edges are formed between metabolite product-to-precursor pairs or structurally similar chemical species.

Here is an example of a chemical similarity network generated from a GC/TOF metabolomic experiment on serum.

This was done following the steps outlined below.

A) Get similarity matrix from PubChem: http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi

1) paste in CIDs (PubChem IDs) in “IDs List”

2) hit submit button on top

3) copy the results (they are pasted in step B, #3 below)

B) Use MetaMapp to generate edge attribute files

1) select chemical and biochemical map option

2) paste in 2 column matrix with CIDs and KEGG ids in field: “Enter CID KEGG Id Pair”

3) paste results from PubChem similarity score in field “Enter Similarity Matrix Data”

C) use KEGG reactant pairs or network “edge attribute files” to connect metabolites

1) optionally filter connections based on score to select top hits (>75)

2) optionally convert CIDs to metabolite names (need to replace spaces in name with some character, “_”)

3) save as txt or csv file
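The filtering, renaming and saving steps in C) can be sketched in R as follows. The edge list, metabolite names and scores are invented for illustration, and the output file name is hypothetical.

```r
# Toy edge list; names and scores are invented for illustration.
edges <- data.frame(
  source = c("citric acid", "citric acid", "malic acid"),
  target = c("isocitric acid", "urea", "fumaric acid"),
  value  = c(95, 20, 80),
  stringsAsFactors = FALSE
)

# Step 1: keep only the top hits (similarity score > 75).
top <- edges[edges$value > 75, ]

# Step 2: replace spaces in metabolite names with "_".
top$source <- gsub(" ", "_", top$source, fixed = TRUE)
top$target <- gsub(" ", "_", top$target, fixed = TRUE)

# Step 3: save as a csv file (written to a temporary directory here).
out.file <- file.path(tempdir(), "csn_edges.csv")
write.csv(top, out.file, row.names = FALSE, quote = FALSE)
```

The resulting file has one row per edge and can be imported as a table in step D.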

D) visualize in Cytoscape

1) import table using setting “Network from table (Text/Ms Excel)”

2) select the three columns as 1) source 2) interaction type 3) interaction target

The next step is to:

E) annotate node attributes based on statistical test results or biochemical domain knowledge

**This is where ExCytR will be very helpful…(to be continued…)**

December 27, 2012 | Categories: Uncategorized | Tags: chemical similarity network, Cytoscape, ExCytR, KEGG, Metamapp

## Comparison of Serum vs Urine metabolites

Primary metabolites in human serum or urine.

Uh oh, there seem to be some outliers: serum samples looking like urine and vice versa. Fix these and evaluate using PCA and hierarchical clustering on rank correlations.
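A rough sketch of this kind of overview in base R is shown below. It uses simulated data with two invented sample profiles standing in for serum and urine, and it assumes Spearman correlation as the rank correlation and `1 - rho` as the distance; the post does not specify either choice.

```r
set.seed(42)
# Simulated data: 10 samples x 5 hypothetical metabolites with two opposite profiles.
profile.a <- c(5, 4, 3, 2, 1)   # stand-in for one biofluid
profile.b <- c(1, 2, 3, 4, 5)   # stand-in for the other
x <- rbind(matrix(rep(profile.a, each = 5), nrow = 5),
           matrix(rep(profile.b, each = 5), nrow = 5)) +
  matrix(rnorm(50, sd = 0.3), nrow = 10)
dimnames(x) <- list(paste0("s", 1:10), paste0("met", 1:5))

# PCA on autoscaled variables; the first two score columns are usually plotted.
pca    <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Hierarchical clustering of samples on Spearman (rank) correlations.
rho    <- cor(t(x), method = "spearman")   # sample-by-sample correlation
hc     <- hclust(as.dist(1 - rho))         # distance assumed to be 1 - rho
groups <- cutree(hc, k = 2)
```

Samples that land in the cluster of the other biofluid are candidates for mislabeling.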

Now things look more believable. Next let us test the effects of data pre-treatment on PLS-DA model scores for a 3 group comparison in serum. Ideally group scores would be maximally resolved in the dimension of the first latent variable (x-axis) and within-group variance would be orthogonal, along the y-axis.

Compared to the raw data (top), where about three variables (glucose, urea and mannitol) dominate the variance structure, the autoscaled model shows a more balanced contribution of variables to the scores variance, a result of variable-wise mean subtraction and division by the standard deviation. The larger separation between WHITE and RED class scores along the x-axis suggests improved classifier performance over the raw data model. Samples with scores outside their respective group’s Hotelling’s T² ellipse (95%) might be outliers to investigate further or potentially exclude from the current test.
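To illustrate what autoscaling does to the variance structure, here is a small sketch using base R's `scale`. The concentrations are invented and only meant to mimic variables on very different scales.

```r
# Toy data (invented values): variables on very different scales.
x <- cbind(glucose = c(5000, 5200, 4800, 5100),
           urea    = c(300, 280, 350, 310),
           alanine = c(2.0, 2.5, 1.8, 2.2))

# Autoscaling: subtract each variable's mean and divide by its standard deviation.
x.auto <- scale(x, center = TRUE, scale = TRUE)

# Share of total variance per variable before and after autoscaling.
raw.share  <- apply(x, 2, var) / sum(apply(x, 2, var))
auto.share <- apply(x.auto, 2, var) / sum(apply(x.auto, 2, var))
```

In the raw data nearly all variance comes from the largest variable, while after autoscaling every variable has unit variance and contributes equally.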

December 16, 2012 | Categories: Uncategorized | Tags: autoscaling, clustering, imDEV, metabolomics, normalizations, outliers, PCA, PLS-DA

## NodeBox: an intuitive programmatic drawing application

Today I played around with NodeBox, a fun and infinitely customizable data visualization application written in Java and Python. The GUI was easy to navigate and offered many interesting options to play with.

Looking at the well written tutorial and a few minutes of experimentation is all it takes to make some very interesting data art objects.

This is actually based on a simple grid of augmented rectangles copied and translated many times.

December 8, 2012 | Categories: Uncategorized | Tags: data art, NodeBox, programmatic drawing

## ExCytR Concept

The concept is to make a GUI to provide a static and dynamic linking between data and its network representations.

Static access will involve making networks based on data and metadata stored in some table or spreadsheet.

Dynamic control will provide interactive access to network construction and annotation properties.

Together, these will provide rapid generation of information-rich networks, based on tests of internal data properties or on exogenous semantic knowledge. Here is an example of a network representation of a time course metabolomic experiment. This network is used to encode dependence between top parameters of a PLS-DA model discriminating between pre- and post-experimental interventions. Larger nodes show variables meeting the 5% significance cutoff (p < 0.05) for a mixed effects model to identify intervention-related differences between unbalanced baseline and area under the curve for metabolite excursion measurements during an oral glucose tolerance test (OGTT). Node color signifies increase (red) or decrease (blue) in post- relative to pre-intervention average values. Node shape and outline display metabolite classification and presence in a PLS-DA model respectively. Node graphs, created in ggplot2, show box plots for pre- (red) and post-intervention (green) class distribution medians, upper and lower quartiles, and outliers.

The interactions between model parameters which exist only in pre-intervention samples are shown in the network below.

Connections are made between metabolites which have a non-zero partial correlation, extracted from a qpnetwork trimmed at a threshold where the numbers of nodes and edges are approximately equal. In this network all edges meet the 5% significance level based on tests of Pearson's correlations.
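The general idea of connecting variables by partial correlation can be sketched in base R. This is a simplified stand-in and not the qpgraph procedure used for the network above: it uses full-order partial correlations from the inverse covariance matrix, simulated data, and an arbitrary cutoff of 0.3, all of which are assumptions for illustration.

```r
set.seed(7)
n  <- 200
# Simulated chain of dependence: m1 -> m2 -> m3 (hypothetical metabolites).
m1 <- rnorm(n)
m2 <- m1 + rnorm(n, sd = 0.5)
m3 <- m2 + rnorm(n, sd = 0.5)
x  <- cbind(m1 = m1, m2 = m2, m3 = m3)

# Full-order partial correlations from the precision (inverse covariance) matrix.
prec <- solve(cov(x))
pcor <- -cov2cor(prec)
diag(pcor) <- 1

# Edge list from the upper triangle, with a Pearson correlation test per pair.
idx   <- which(upper.tri(pcor), arr.ind = TRUE)
edges <- data.frame(source = colnames(x)[idx[, "row"]],
                    target = colnames(x)[idx[, "col"]],
                    pcor   = pcor[idx],
                    stringsAsFactors = FALSE)
edges$pearson.p <- mapply(function(u, v) cor.test(x[, u], x[, v])$p.value,
                          edges$source, edges$target)

# Keep edges above the (arbitrary) partial correlation cutoff that also pass p < 0.05.
net <- edges[abs(edges$pcor) > 0.3 & edges$pearson.p < 0.05, ]
```

In this toy example `m1` and `m3` are strongly correlated overall, but their partial correlation given `m2` is close to zero, so only the direct links remain in `net`.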

December 1, 2012 | Categories: Uncategorized | Tags: Cytoscape, ExCytR, metabolomics, network, qpgraph, R