When you want to get to know and love your data

Archive for December, 2012

Anaerobic Stress in Seeds – A Chemical Similarity Network Story

The chemical similarity network (CSN) is a great tool for organizing biological data based on known biochemistry or chemical structural similarity. Here is an example CSN for visualizing metabolomic changes (measured via GC/TOF) due to anaerobic stress in germinating seeds.

seed anaerobic stress - Chemical Similarity Network

In this network edges are formed for chemical similarity scores > 75. Node color describes a significant (adjusted p-value < 0.05, q-value = 0.05, paired t-test) increase (red), decrease (blue) or no change (gray) in anaerobic relative to aerobic treatments. Node size is inversely proportional to the test's p-value.

This CSN was not hard to construct and minimally requires knowledge of analyte PubChem chemical identifiers (CIDs). CIDs can be used to calculate the chemical similarity matrix using online tools provided by PubChem. This symmetric matrix can be easily formatted to create an edge list containing the basic information: source, target and similarity score.

square symmetric matrix vs. edge list
Here is a function for converting square symmetric matrices to edge lists using the R statistical programming environment.


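mat.to.edge.list<-function(mat){
    #accessory function (wrapper, helper name and value lookup are reconstructed assumptions)
    all.pairs<-function(r,type=c("one","two")) switch(match.arg(type),
    one = list(first = rep(1:r,rep(r,r))[lower.tri(diag(r))],
    second = rep(1:r, r)[lower.tri(diag(r))]),
    two = list(first = rep(1:r, r)[lower.tri(diag(r))],
    second = rep(1:r,rep(r,r))[lower.tri(diag(r))]))
    ids<-all.pairs(ncol(mat),type="one")
    #one row per pair: source name, target name, similarity score
    tmp<-as.data.frame(do.call("rbind",lapply(seq_along(ids$first) ,function(i){
        name<-c(colnames(mat)[ids$first[i]],colnames(mat)[ids$second[i]])
        c(name,mat[ids$first[i],ids$second[i]])})),stringsAsFactors=FALSE)
    transform(setNames(tmp,c("source","target","value")),value=as.numeric(value))
}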

The function mat.to.edge.list will convert a square symmetric matrix to an edge list through the extraction of the upper triangle excluding the diagonal or self edges.
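
As a cross-check on the idea, the same upper-triangle extraction can be sketched with base R's upper.tri and which(..., arr.ind = TRUE). This is a minimal sketch and not the mat.to.edge.list function itself; the matrix, its labels (cid_1 to cid_3) and its scores are made up for illustration.

```r
# Toy 3 x 3 similarity matrix; labels and scores are made up for illustration.
ids <- c("cid_1", "cid_2", "cid_3")
sim <- matrix(c(100,  80,  40,
                 80, 100,  90,
                 40,  90, 100),
              nrow = 3, byrow = TRUE,
              dimnames = list(ids, ids))

# Row/column positions of the upper triangle, diagonal (self edges) excluded.
idx <- which(upper.tri(sim), arr.ind = TRUE)

# One row per metabolite pair: source, target and similarity score.
edge.list <- data.frame(source = rownames(sim)[idx[, "row"]],
                        target = colnames(sim)[idx[, "col"]],
                        value  = sim[idx],
                        stringsAsFactors = FALSE)
edge.list
```

A symmetric matrix with n metabolites yields n(n-1)/2 rows, so the three toy metabolites give three candidate edges.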

This edge list can now be visualized as a CSN in network visualization software (see brief instructions here). I prefer to use Cytoscape for this. The edge list merely specifies which vertices (nodes), each representing a metabolite, should be connected.

node attribute table

An additional node annotation or attribute table can also be imported into Cytoscape and used to alter the node properties based on statistical results.
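
Here is a rough sketch of how such a node attribute table could be put together in R, following the color and size rules described for the seed network above. The metabolite names, p-values and fold changes are made up, the Benjamini-Hochberg adjustment is an assumption, and the -log10 size mapping is just one common way to make nodes grow as p-values shrink, not necessarily the mapping used for the figure.

```r
# Hypothetical per-metabolite test results; names and numbers are made up.
stats <- data.frame(
  name        = c("alanine", "lactate", "sucrose", "citrate"),
  p.value     = c(0.001, 0.004, 0.200, 0.030),
  fold.change = c(2.5, 3.1, 1.1, 0.4),   # anaerobic / aerobic
  stringsAsFactors = FALSE
)

# Adjust p-values for multiple testing (Benjamini-Hochberg is an assumption).
stats$adj.p <- p.adjust(stats$p.value, method = "BH")

# Color: red = significant increase, blue = significant decrease, gray = no change.
stats$color <- ifelse(stats$adj.p >= 0.05, "gray",
                      ifelse(stats$fold.change > 1, "red", "blue"))

# Size: larger nodes for smaller p-values.
stats$size <- 10 - 10 * log10(stats$p.value)

# Write a table that can be imported as node attributes.
out.file <- file.path(tempdir(), "node_attributes.csv")
write.csv(stats, out.file, row.names = FALSE)
```

The name column has to match the node identifiers used in the edge list so that the attributes attach to the right nodes on import.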

Making Chemical Similarity Networks

Chemical similarity networks (CSNs) can be used to explore multivariate metabolomic data within a biological context. In a CSN, nodes represent metabolites and edges are formed between metabolite product-to-precursor pairs or structurally similar chemical species.

Here is an example of a chemical similarity network generated from a GC/TOF metabolomic experiment on serum.

Chemical Similarity Network

This was done following the steps outlined below.

A) Get the similarity matrix from PubChem: http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi

1) paste in CIDs (PubChem IDs) in “IDs List”

2) hit submit button on top

3) copy the results (they are pasted into the “Enter Similarity Matrix Data” field in step B)


B) Use MetaMapp to generate edge attribute files

1) select chemical and biochemical map option

2) paste in a 2 column matrix with CIDs and KEGG ids in field: “Enter CID KEGG Id Pair”

3) paste results from the PubChem similarity score matrix in field “Enter Similarity Matrix Data”


C) use KEGG reactant pairs or network “edge attribute files” to connect metabolites

1) optionally filter connections based on score to select top hits (>75)

2) optionally convert CIDs to metabolite names (need to replace spaces in names with some character, e.g. “_”)

3) save as txt or csv file
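
The three optional steps in C can be sketched in R roughly as follows. The edge list, the identifiers (placeholders, not real CIDs), the names and the "similarity" interaction label are all made up for illustration; only the > 75 cutoff and the space-to-underscore replacement come from the steps above.

```r
# Hypothetical edge list; identifiers are placeholders, not real CIDs.
edges <- data.frame(source = c("1001", "1001", "1002"),
                    target = c("1002", "1003", "1003"),
                    score  = c(82, 40, 91),
                    stringsAsFactors = FALSE)

# Hypothetical identifier-to-name lookup.
cid.names <- c("1001" = "alanine", "1002" = "lactic acid", "1003" = "sucrose")

# 1) keep only the top hits
keep <- edges[edges$score > 75, ]

# 2) convert identifiers to names, replacing spaces with "_"
to.name <- function(cid) unname(gsub(" ", "_", cid.names[cid], fixed = TRUE))

# Columns in the order source, interaction type, target, plus the score.
out <- data.frame(source      = to.name(keep$source),
                  interaction = "similarity",
                  target      = to.name(keep$target),
                  score       = keep$score,
                  stringsAsFactors = FALSE)

# 3) save as a tab-delimited text file
out.file <- file.path(tempdir(), "csn_edges.txt")
write.table(out, out.file, sep = "\t", quote = FALSE, row.names = FALSE)
```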


D) visualize in Cytoscape

1) import table using setting “Network from table (Text/Ms Excel)”

2) select the three columns as 1) source 2) interaction type 3) interaction target


The next thing to do is to

E) annotate node attributes based on statistical test results or biochemical domain knowledge


This is where ExCytR will be very helpful…(to be continued…)

Comparison of Serum vs Urine metabolites

Primary metabolites in human serum or urine.

serum urine id

Oh oh, there seem to be some outliers: serum samples looking like urine and vice versa. Fix these and evaluate using PCA and hierarchical clustering on rank correlations.

fix assignments
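
For anyone who wants to try this kind of check, here is a minimal sketch of PCA plus hierarchical clustering on rank (Spearman) correlations in base R. The data are simulated, and the group sizes, the shifted metabolites and the average linkage are assumptions made only for illustration, not the settings behind the figures.

```r
set.seed(1)
# Simulated data: 10 "serum" and 10 "urine" samples x 20 metabolites (made up).
serum <- matrix(rnorm(200, mean = 10), nrow = 10)
urine <- matrix(rnorm(200, mean = 10), nrow = 10)
urine[, 1:10] <- urine[, 1:10] + 5   # shift half the metabolites so groups differ
x <- rbind(serum, urine)
rownames(x) <- paste0(rep(c("serum", "urine"), each = 10), "_", 1:10)
colnames(x) <- paste0("metabolite_", 1:20)

# PCA on autoscaled data; a sample scoring with the wrong group stands out.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Hierarchical clustering of samples on Spearman (rank) correlations.
rank.cor <- cor(t(x), method = "spearman")
hc <- hclust(as.dist(1 - rank.cor), method = "average")
clusters <- cutree(hc, k = 2)

# Cross-tabulate the labeled sample type against the cluster assignment.
table(sub("_.*", "", rownames(x)), clusters)
```

Samples whose label disagrees with both their PCA neighborhood and their cluster are the candidates for a swapped assignment.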

Now things look more believable. Next let us test the effects of data pre-treatment on PLS-DA model scores for a 3 group comparison in serum. Ideally group scores would be maximally resolved along the first latent variable (x-axis), and within-group variance would be orthogonal to it, along the y-axis.

scaling vs normalization

Compared to the raw data (top), where roughly three top variables (glucose, urea and mannitol) dominate the variance structure, the autoscaled model displays a more balanced contribution of variables to the scores variance, because each variable is mean-centered and divided by its standard deviation. The larger separation between WHITE and RED class scores along the x-axis suggests improved classifier performance over the raw data model. Samples with scores outside their respective group's Hotelling's T² ellipse (95%) might be outliers to investigate further or potentially exclude from the current test.
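
The effect of autoscaling on the variance structure is easy to see in a small base R sketch. The data below are simulated; glucose stands in for a high-abundance variable and met_a and met_b are hypothetical low-abundance ones.

```r
set.seed(1)
# Simulated data: one high-abundance variable and two low-abundance ones.
x <- cbind(glucose = rnorm(30, mean = 5000, sd = 800),
           met_a   = rnorm(30, mean = 50,   sd = 8),
           met_b   = rnorm(30, mean = 5,    sd = 1))

# Autoscaling: subtract each column mean, divide by each column standard deviation.
x.auto <- scale(x, center = TRUE, scale = TRUE)

# Share of the total variance carried by each variable, before and after.
raw.share  <- apply(x, 2, var) / sum(apply(x, 2, var))
auto.share <- apply(x.auto, 2, var) / sum(apply(x.auto, 2, var))
round(rbind(raw = raw.share, autoscaled = auto.share), 3)
```

In the raw data nearly all of the variance belongs to the high-abundance variable, while after autoscaling every variable carries an equal share.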

NodeBox: an intuitive programmatic drawing application

Today I played around with NodeBox, a fun and infinitely customizable data visualization application written in Java and Python. The GUI was easy to navigate and offered many interesting options to play with.
nodebox gui

Looking at the well-written tutorial and a few minutes of experimentation is all it takes to make some very interesting data art objects.

nodebox test

This is actually based on a simple grid of augmented rectangles copied and translated many times.

ExCytR Concept

The concept is to make a GUI to provide a static and dynamic linking between data and its network representations.

Static access will involve making networks based on data and metadata stored in some table or spreadsheet.

Dynamic control will provide interactive access to network construction and annotation properties.

Together, these will provide rapid generation of information-rich networks, based on tests of internal data properties or on exogenous semantic knowledge. Here is an example of a network representation of a time course metabolomic experiment. This network is used to encode dependence between the top parameters of a PLS-DA model discriminating between pre- and post-experimental interventions. Larger nodes show variables meeting the 5% significance cutoff (p < 0.05) for a mixed effects model to identify intervention-related differences between unbalanced baseline and area under the curve for metabolite excursion measurements during an oral glucose tolerance test (OGTT). Node color signifies increase (red) or decrease (blue) in post- relative to pre-intervention average values. Node shape and outline display metabolite classification and presence in a PLS-DA model respectively. Node graphs, created in ggplot2, show box plots for pre- (red) and post-intervention (green) class distribution medians, upper and lower quartiles, and outliers.



The interactions between model parameters which exist only in pre-intervention samples are shown in the network below.

Connections are made between metabolites which have a non-zero partial correlation, extracted from a qpnetwork trimmed at a threshold where the numbers of nodes and edges are approximately equal. In this network all edges meet the 5% significance level based on tests of Pearson's correlations.
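
To illustrate why partial correlations make sparser networks than plain correlations, here is a minimal base R sketch. It is not the qpnetwork procedure used for the figure: it swaps in full-order partial correlations computed from the inverse covariance matrix, and the three variables (m1 to m3) are simulated.

```r
set.seed(1)
n <- 100
# Simulated chain m1 -> m2 -> m3: m1 and m3 are related only through m2.
m1 <- rnorm(n)
m2 <- m1 + rnorm(n, sd = 0.5)
m3 <- m2 + rnorm(n, sd = 0.5)
x <- cbind(m1 = m1, m2 = m2, m3 = m3)

# Pearson correlations: all three pairs look connected.
pearson <- cor(x, method = "pearson")

# Full-order partial correlations from the scaled inverse covariance matrix.
prec <- solve(cov(x))
pcor <- -cov2cor(prec)
diag(pcor) <- 1

# Significance test of the Pearson correlation for one candidate edge.
p.m1.m3 <- cor.test(x[, "m1"], x[, "m3"], method = "pearson")$p.value

round(pearson, 2)
round(pcor, 2)
```

The m1 to m3 pair has a strong, significant Pearson correlation but a partial correlation near zero, so it would not be drawn as a direct edge.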