When you want to get to know and love your data

Archive for April, 2013

Biological Circuit

OPLS-DA network

Network of relationships between protein and glycan components of human milk. Edges show the strength (line width) and direction (color) of correlations (Spearman's rho, p<0.0001) between biological molecules. The molecules are represented by vertices, which display the importance (size) and direction of change (color) of milk components between two experimental groups.


Connecting Lipids

network

As the title states, this is too easy. The hard part is deciding when enough is enough. Here 1194 lipids tested for changes due to genotype are visualized, connected based on their Spearman's correlations (p<1e-3).

Here is the legend, based on a test between two groups.
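The edge-building step for a correlation network like the two above can be sketched in base R. This is a minimal sketch, not the code used for these figures: the function name `spearman_edges`, the column names, and the use of `mtcars` as stand-in data are all assumptions made for illustration.

```r
# Build an edge list from pairwise Spearman correlations (base R only).
# 'x' is a samples-by-variables numeric table; 'p_cut' is the p-value threshold.
# spearman_edges() is a hypothetical helper, not part of any package.
spearman_edges <- function(x, p_cut = 1e-3) {
  x <- as.matrix(x)
  pairs <- t(combn(ncol(x), 2))            # all unique variable pairs
  res <- apply(pairs, 1, function(ij) {
    ct <- suppressWarnings(                # ties prevent exact p-values
      cor.test(x[, ij[1]], x[, ij[2]], method = "spearman")
    )
    c(rho = unname(ct$estimate), p = ct$p.value)
  })
  edges <- data.frame(
    source = colnames(x)[pairs[, 1]],
    target = colnames(x)[pairs[, 2]],
    rho    = res["rho", ],
    p      = res["p", ],
    stringsAsFactors = FALSE
  )
  edges <- edges[edges$p < p_cut, ]        # keep only significant correlations
  edges$direction <- ifelse(edges$rho > 0, "positive", "negative")  # edge color
  edges$weight    <- abs(edges$rho)                                 # edge width
  rownames(edges) <- NULL
  edges
}

# Stand-in data: mtcars ships with R; swap in your own lipid or protein table
edges <- spearman_edges(mtcars, p_cut = 1e-3)
head(edges)
```

The `direction` and `weight` columns map onto edge color and line width in a network viewer. With over a thousand variables the number of tests grows quickly, so some adjustment for multiple testing is worth considering before settling on a threshold.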


Translating between identifiers: R interface to the Chemical Translation Service (CTS)


To enhance inference using domain knowledge, it is necessary to match your query to a database containing that knowledge.

The Chemical Translation Service (CTS) can be used to translate between molecular identifiers for many (~400K) naturally occurring biological small molecules or metabolites.

CTSgetR is an easy-to-use R interface to CTS, which enables translation between the following identifiers and repositories of biological domain knowledge:

  • “Chemical Name”
  • “InChIKey”
  • “InChI Code”
  • “PubChem CID”
  • “PubChem SID”
  • “ChemDB”
  • “ZINC”
  • “Southern Research Institute”
  • “Specs”
  • “MolPort”
  • “ASINEX”
  • “ChemBank”
  • “MLSMR”
  • “Emory University Molecular Libraries Screening Center”
  • “ChemSpider”
  • “DiscoveryGate”
  • “Ambinter”
  • “Vitas-M Laboratory”
  • “ChemBlock”

Check out an example translation from the universal molecular identifier, InChIKey, to the well-referenced PubChem Chemical Identifier (CID).
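To give a feel for what such a translation request looks like, here is a rough sketch that only builds a request URL. The base URL and the `from/to/id` path layout are assumptions about the CTS web service and should be checked against its current documentation. `cts_convert_url` is a hypothetical helper, and the sketch deliberately does not call CTSgetR, since its function signature is not shown in this post.

```r
# Build a CTS-style conversion URL (ASSUMED endpoint layout: base/from/to/id).
# cts_convert_url() is a hypothetical helper for illustration only.
cts_convert_url <- function(id, from, to,
                            base = "https://cts.fiehnlab.ucdavis.edu/rest/convert") {
  # Percent-encode each path segment so names with spaces ("PubChem CID") are valid
  parts <- vapply(c(from, to, id), URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)
  paste(c(base, parts), collapse = "/")
}

# InChIKey of water, translated to a PubChem CID
url <- cts_convert_url("XLYOFNOQVPJJNP-UHFFFAOYSA-N",
                       from = "InChIKey", to = "PubChem CID")
url
```

Fetching the URL (for example with `readLines(url)`) and parsing the response is left out, because that needs network access and depends on the service's response format.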


Andrews encoding of Multivariate Data (looks informative)

Recently I came across an interesting visualization for multivariate data named the Andrews curve (plot) (original post here). This is a trigonometric transformation of a multivariate data set to an x and y coordinate encoding. After a quick check I was happy to see there is a package in R for making Andrews plots, andrews. Here is an example of an Andrews plot for a data set describing various features of automobiles, “mtcars”, colored according to the number of cylinders in each vehicle (4, 6 or 8).

andrews plot

This is an interesting perspective on 11 measurements for 32 cars (it shares similarity with a parallel coordinates plot). Based on this data visualization, the 8 cylinder cars seem the most similar to one another with regard to the other parameters, judging from the “tightness” of their patterns (yellow lines), while the 4 and 6 cylinder cars seem more similar to each other.
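The transformation behind the plot can be sketched in a few lines of base R. An Andrews curve maps one observation x = (x1, x2, x3, ...) to the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -pi and pi. The sketch below is not the andrews package's implementation: `andrews_encode` is a hypothetical helper, and range-scaling each variable to [0, 1] beforehand is an assumption made so that no single measurement dominates.

```r
# Andrews curve for each row of x:
# f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
# andrews_encode() is a hypothetical helper, not the andrews package's code.
andrews_encode <- function(x, n_points = 101) {
  x <- as.matrix(x)
  t_grid <- seq(-pi, pi, length.out = n_points)
  p <- ncol(x)
  # Basis matrix: one row per variable, one column per value of t
  basis <- matrix(0, nrow = p, ncol = n_points)
  basis[1, ] <- 1 / sqrt(2)
  if (p > 1) {
    for (j in 2:p) {
      k <- j %/% 2                         # frequency: 1, 1, 2, 2, 3, 3, ...
      basis[j, ] <- if (j %% 2 == 0) sin(k * t_grid) else cos(k * t_grid)
    }
  }
  list(t = t_grid, y = x %*% basis)        # y is samples x n_points
}

# Assumption: range-scale each variable to [0, 1] before encoding
range01 <- function(v) (v - min(v)) / (max(v) - min(v))
z <- apply(as.matrix(mtcars), 2, range01)

enc <- andrews_encode(z)

# One curve per car, colored by number of cylinders
cyl_col <- c("4" = "red", "6" = "blue", "8" = "orange")[as.character(mtcars$cyl)]
matplot(enc$t, t(enc$y), type = "l", lty = 1, col = cyl_col,
        xlab = "t", ylab = "f(t)")
```

Note that the order of the columns matters: variables placed first get the low-frequency terms, which tend to dominate the overall shape of each curve.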

Here is my modified visualization of the Andrews plot using ggplot2 (get code HERE).


modified andrews plot

It's hard to compare the Andrews-encoded and original data, but we can try with a scatterplot visualization.

pairs

This visualization supports the previous observation: the number of cylinders has a large effect on continuous variables like miles per gallon (mpg). The effect of the other potential covariates (discrete variables like vs, am, gear) is less obvious but may also be present. This would be important to include or account for when conducting predictive modeling.

To try to identify further covariates we can take a look at the principal component analysis (PCA) scores. This is another method for multivariate visualization, but in this case it is limited to the two largest modes of variance in the data (the principal plane, or component 1 and component 2).


Based on the scores, it is evident that sample clustering is fairly well explained by the number of cylinders  and other correlated parameters. We can also see that loadings for PC1 (x-axis) can be used to explain cylinder # fairly well, but there is something else  causing a separation in y.
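For reference, an autoscaled PCA of this kind can be sketched with base R's `prcomp`. This is a minimal sketch rather than the code behind the figure, and it assumes all 11 columns of `mtcars` (including `cyl`) go into the model.

```r
# Autoscaled PCA (each variable centered to mean 0 and scaled to sd 1)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Scores on the principal plane, labelled by number of cylinders
scores <- as.data.frame(pca$x[, 1:2])
scores$cyl <- factor(mtcars$cyl)

# Fraction of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)

# Loadings: how each variable contributes to PC1 and PC2
round(pca$rotation[, 1:2], 2)

plot(scores$PC1, scores$PC2, col = as.integer(scores$cyl), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(scores$cyl),
       col = seq_along(levels(scores$cyl)), pch = 19, title = "cyl")
```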

Instead of autoscaling the data (mean=0, sd=1, as done prior to the PCA above), we can make an Andrews encoding of the data. This applies a trigonometric transformation to the variables to produce 101 x and y values for each of the 32 cars. We can combine these to create a new matrix (32 by 202) with rows representing samples (n=32) and columns the x (n=101) and y (n=101) encodings. This effectively increases our number of variables from 11 to 202, but hopefully also gives a deeper insight into any class structure.

andrews encoded PCA
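A rough sketch of PCA on Andrews-encoded data follows. It differs from the description above in one respect, which is an assumption on my part: it evaluates every curve on the same grid of 101 t values, so the x encodings are identical for all cars and carry no variance, and only the 101 curve values (a 32 by 101 matrix) go into the PCA. The helper `andrews_matrix` and the range-scaling step are also illustrative assumptions, not the code used for the figure.

```r
# Encode each row of x as an Andrews curve sampled at n_points values of t.
# andrews_matrix() is a hypothetical helper written for this sketch.
andrews_matrix <- function(x, n_points = 101) {
  x <- as.matrix(x)
  t_grid <- seq(-pi, pi, length.out = n_points)
  basis <- sapply(seq_len(ncol(x)), function(j) {
    if (j == 1) return(rep(1 / sqrt(2), n_points))
    k <- j %/% 2                           # frequency: 1, 1, 2, 2, ...
    if (j %% 2 == 0) sin(k * t_grid) else cos(k * t_grid)
  })                                       # n_points x variables
  x %*% t(basis)                           # samples x n_points
}

# Assumption: range-scale variables to [0, 1] instead of autoscaling
range01 <- function(v) (v - min(v)) / (max(v) - min(v))
encoded <- andrews_matrix(apply(as.matrix(mtcars), 2, range01))

# Center only: the encoded columns already share a common scale
pca_enc <- prcomp(encoded, center = TRUE, scale. = FALSE)
scores_enc <- data.frame(pca_enc$x[, 1:2],
                         cyl = factor(mtcars$cyl),
                         am  = factor(mtcars$am))

# Color by cylinders, point shape by transmission type
plot(scores_enc$PC1, scores_enc$PC2,
     col = as.integer(scores_enc$cyl),
     pch = ifelse(mtcars$am == 1, 17, 19),
     xlab = "PC1", ylab = "PC2")
```

Because the encoding is a linear map of the 11 original variables, the encoded matrix has at most 11 independent directions, so the extra columns reweight the original information rather than adding any.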

Interestingly, this encoding highlights the previously noted and as yet unexplained factor (evident in the difference in y scores between vehicles with the same number of cylinders). Next, we can check the other discrete variables in the data to see if any of them can help explain the clustering pattern observed above.

After a quick check it is evident that the type of transmission (am; manual (1) or automatic (0)) nicely explains the second mode of scores variance, which is not captured by cylinders.
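One hedged way to make this kind of check less visual is to regress each component's scores on each discrete variable and compare the R-squared values. This is a swapped-in screening approach, not what was done for the plots in this post; `screen_covariates` is a hypothetical helper, and the demo uses the autoscaled PCA for brevity, though the same function accepts scores from any PCA.

```r
# For each candidate covariate, fit score ~ factor(covariate) per component
# and return the R-squared. screen_covariates() is a hypothetical helper.
screen_covariates <- function(scores, covariates) {
  sapply(covariates, function(f) {
    apply(scores, 2, function(s) summary(lm(s ~ factor(f)))$r.squared)
  })
}

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
discrete <- mtcars[, c("cyl", "vs", "am", "gear", "carb")]

# Rows are components, columns are covariates
r2 <- screen_covariates(pca$x[, 1:2], discrete)
round(r2, 2)

# Combined cylinder|transmission label (e.g. "8|0"), useful for coloring scores
grp <- interaction(mtcars$cyl, mtcars$am, sep = "|")
table(grp)
```

A covariate with a high R-squared on PC2 but a low one on PC1 is a candidate for explaining a separation that the cylinder count does not capture.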

andrews PCA cyl and am

This is less obvious in the autoscaled PCA.

PCA cyl and am

Further inspection of the Andrews-encoded PCA also suggests that there is yet another potential covariate, as evident from the two clusters of 8 cylinder, automatic transmission vehicles (8|0).

At first blush, the Andrews method coupled with a dimensionality reduction technique seems like a very interesting approach for identifying covariate contributions to patterns in the data. It would be interesting to compare variable loadings from PCA of autoscaled and Andrews-encoded data, but it is not obvious how to do this…

If you want to replicate the analyses above or just want to apply these visualizations to your own data get all the necessary code from the example found HERE.

Tutorial- Building Biological Networks

I love networks! Nothing is better for visualizing complex multivariate relationships, be they social, virtual or biological.

Bionetwork1

I recently gave a hands-on tutorial on using R and Cytoscape to build large biological networks. In these networks, nodes represent metabolites and edges can be many things, but I specifically focused on biochemical relationships and chemical similarities. Your imagination is the limit.

genotype network


network DM

If you are interested check out the presentation below.

Here is all the R code and links to the relevant data you will need to follow along with the tutorial.

# load needed functions: R package in progress - "devium", which is stored on GitHub
# get sample chemical identifiers here: https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1
# PubChem CIDs = cids
cids # overview
nrow(cids) # how many
str(cids) # structure; we want numeric
cids <- as.numeric(as.character(unlist(cids))) # go through character so factor labels, not internal codes, become numbers

# making an edge list based on CIDs from KEGG reactant pairs
dim(KEGG.edge.list) # a two-column list with CID-to-CID connections based on KEGG RPAIRS
# how did I get this?
# 1) convert from CID to KEGG using get.CID.KEGG.pairs(), which is a table stored at: https://gist.github.com/dgrapov/4964546
# 2) get KEGG RPAIRS using get.KEGG.pairs(), which is a table stored at: https://gist.github.com/dgrapov/4964564
# 3) return CID pairs

# get EDGES based on chemical similarity (Tanimoto similarity > 0.7)
tanimoto.edges <- CID.to.tanimoto(cids = cids, cut.off = 0.7, parallel = FALSE)
# how did I get this?
# 1) use the R package ChemmineR to query PubChem PUG to get molecular fingerprints
# 2) calculate the similarity coefficient
# 3) return edges with similarity above cut.off

# after a little bit of formatting, make a combined KEGG + Tanimoto edge list
# https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=2

# now upload this and a sample node attribute table (https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1)
# to Cytoscape
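If you want to see what the chemical similarity step boils down to, here is a minimal base R sketch. It is not the devium `CID.to.tanimoto()` implementation: the helper `tanimoto_edges`, the toy fingerprints, and the `cid_*` identifiers are made up for illustration. The Tanimoto coefficient between two binary fingerprints is the number of bits set in both, divided by the number of bits set in either.

```r
# Tanimoto similarity between binary fingerprints:
# T(a, b) = |bits set in a AND b| / |bits set in a OR b|
# tanimoto_edges() is a hypothetical helper, not part of devium or ChemmineR.
tanimoto_edges <- function(fp, cut.off = 0.7) {
  fp <- (as.matrix(fp) > 0) * 1            # force a 0/1 numeric matrix
  shared  <- tcrossprod(fp)                # bits set in both molecules
  n_bits  <- rowSums(fp)
  in_any  <- outer(n_bits, n_bits, "+") - shared   # bits set in either
  sim <- shared / in_any                   # NaN when both fingerprints are empty
  idx <- which(upper.tri(sim) & !is.na(sim) & sim > cut.off, arr.ind = TRUE)
  data.frame(source   = rownames(fp)[idx[, 1]],
             target   = rownames(fp)[idx[, 2]],
             tanimoto = sim[idx],
             stringsAsFactors = FALSE)
}

# Toy 8-bit fingerprints for three made-up molecules
fp <- rbind(cid_1 = c(1, 1, 1, 1, 0, 0, 0, 0),
            cid_2 = c(1, 1, 1, 1, 1, 0, 0, 0),
            cid_3 = c(0, 0, 0, 0, 0, 1, 1, 1))
edges <- tanimoto_edges(fp, cut.off = 0.7)
edges

# Combine with a (made-up) KEGG reactant pair edge and write a table for Cytoscape
kegg <- data.frame(source = "cid_2", target = "cid_3", type = "KEGG",
                   stringsAsFactors = FALSE)
combined <- rbind(kegg, cbind(edges[, c("source", "target")], type = "tanimoto"))
out_file <- tempfile(fileext = ".csv")     # replace with a real path
write.csv(combined, out_file, row.names = FALSE)
```

Keeping a `type` column on the combined edge list lets you style biochemical and structural-similarity edges differently once the table is imported into Cytoscape.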

You can also download all the necessary materials HERE, which include:

  1. tutorial in PowerPoint
  2. R script
  3. Network edge list and node attributes table
  4. Cytoscape file
Happy network making!