Network of relationships between protein and glycan components of human milk. Edges show the strength (line width) and direction (color) of correlations (Spearman's rho, p<0.0001) between biological molecules. The molecules are represented by vertices, which display the importance (size) and direction of change (color) of milk components between two experimental groups.
As the title states, this is too easy. The hard part is deciding when enough is enough. Here, 1194 lipids tested for changes due to genotype are visualized and connected based on their Spearman's correlations (p<1e-3).
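For anyone who wants to try this kind of correlation network, here is a minimal sketch in base R of turning pairwise Spearman correlations into an edge list filtered by p-value. The data are simulated and the names (`lipid1`, `spearman_edges`, `p.cutoff`) are made up for illustration; this is not the code used for the figure.

```r
# Simulated data: 20 samples (rows) by 6 variables (columns); all names are hypothetical
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(NULL, paste0("lipid", 1:6)))
x[, 2] <- x[, 1] + rnorm(20, sd = 0.1)   # force one strong positive correlation
x[, 4] <- -x[, 3] + rnorm(20, sd = 0.1)  # and one strong negative correlation

spearman_edges <- function(x, p.cutoff = 1e-3) {
  pairs <- t(combn(colnames(x), 2))  # all unordered pairs of variables
  res <- apply(pairs, 1, function(p) {
    ct <- suppressWarnings(cor.test(x[, p[1]], x[, p[2]], method = "spearman"))
    c(rho = unname(ct$estimate), p.value = ct$p.value)
  })
  edges <- data.frame(source = pairs[, 1], target = pairs[, 2],
                      rho = res["rho", ], p.value = res["p.value", ],
                      stringsAsFactors = FALSE)
  edges[edges$p.value < p.cutoff, ]  # keep only edges passing the cutoff
}

edges <- spearman_edges(x)
# edge width can be mapped to abs(rho) and edge color to the sign of rho
edges$direction <- ifelse(edges$rho > 0, "positive", "negative")
edges
```

With over a thousand variables the number of pairs grows quickly, so the p-value cutoff (and possibly a multiple testing adjustment) is what keeps the network readable.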
Here is the legend, based on a test between two groups.
To enhance inference using domain knowledge, it is necessary to match your query to a database containing that knowledge.
The Chemical Translation Service (CTS) can be used to translate between molecular identifiers for many (~400K) naturally occurring biological small molecules or metabolites, which enables
CTSgetR is an easy-to-use R interface to CTS, which enables translation between the following repositories of biological domain knowledge:
- “Chemical Name”
- “InChI Code”
- “PubChem CID”
- “Pubchem SID”
- “Southern Research Institute”
- “Emory University Molecular Libraries Screening Center”
- “Vitas-M Laboratory”
Recently I came across an interesting visualization for multivariate data called the Andrews curve, or Andrews plot (original post here). It is a trigonometric transformation that encodes each multivariate observation as a curve of x and y coordinates. After a quick check I was happy to see there is a package in R for making Andrews plots, andrews. Here is an example of an Andrews plot for a data set describing various features of automobiles, “mtcars”, colored according to the number of cylinders in each vehicle (4, 6 or 8).
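To make the transformation concrete, here is a minimal base R sketch of the standard Andrews curve, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated on a grid of t from -pi to pi. It does not use the andrews package; the function name `andrews_curve` and the 0 to 1 scaling of each variable are my own assumptions for illustration.

```r
# Andrews curve for one observation x, evaluated at angles t (hypothetical helper)
andrews_curve <- function(x, t) {
  y <- rep(x[1] / sqrt(2), length(t))
  for (j in seq_along(x)[-1]) {
    k <- j %/% 2  # harmonic: 1, 1, 2, 2, 3, 3, ...
    y <- y + x[j] * (if (j %% 2 == 0) sin(k * t) else cos(k * t))
  }
  y
}

theta <- seq(-pi, pi, length.out = 101)

# Assumption: put every variable on a 0 to 1 scale so none dominates the curve
scaled <- apply(mtcars, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# One curve per car: 32 rows (cars) by 101 columns (values of theta)
curves <- t(apply(scaled, 1, andrews_curve, t = theta))

# Color the curves by number of cylinders (4, 6 or 8)
cyl_col <- c("red", "green3", "blue")[as.integer(factor(mtcars$cyl))]
matplot(theta, t(curves), type = "l", lty = 1, col = cyl_col,
        xlab = "t", ylab = "f(t)")
```

Note that the order of the variables matters: the first columns are assigned to the low-frequency terms, so reordering the columns changes the shape of the curves.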
This is an interesting perspective on 11 measurements for 32 cars (it shares similarities with a parallel coordinates plot). Based on this visualization, the 8 cylinder cars seem the most similar to each other with regard to the other parameters, judging from the “tightness” of their patterns (yellow lines), while the 4 and 6 cylinder cars seem more similar to each other.
It's hard to compare the Andrews-encoded and original data, but we can try with a scatterplot visualization.
This visualization supports the previous observation: the number of cylinders has a large effect on continuous variables like miles per gallon (mpg). The effect of the other potential covariates (discrete variables like vs, am, gear) is less obvious but may also be present. This would be important to include or account for when conducting predictive modeling.
To try to identify further covariates, we can take a look at the principal component analysis (PCA) scores. This is another method for multivariate visualization, but in this case it is limited to the two largest modes of variance in the data (the principal plane: component 1 and component 2).
Based on the scores, it is evident that sample clustering is fairly well explained by the number of cylinders and other correlated parameters. We can also see that the loadings for PC1 (x-axis) explain the number of cylinders fairly well, but something else is causing a separation in y.
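A scores plot like this can be sketched with base R's `prcomp`. This is a minimal example, assuming all 11 mtcars columns (including cyl itself) go into the PCA after autoscaling; the object names are mine, and the linked example code remains the reference for the exact analysis.

```r
# Autoscaled PCA: center each variable to mean 0 and scale to sd 1
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Scores on the principal plane, annotated with the number of cylinders
scores <- data.frame(pca$x[, 1:2], cyl = factor(mtcars$cyl))

# Fraction of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)

# Scores plot colored by cylinders
plot(scores$PC1, scores$PC2, col = as.integer(scores$cyl) + 1, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(scores$cyl),
       col = seq_along(levels(scores$cyl)) + 1, pch = 19, title = "cyl")

# Loadings show which variables drive each component
round(pca$rotation[, 1:2], 2)
```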
Instead of autoscaling the data (mean=0, sd=1; as done prior to the PCA above), we can make an Andrews encoding of the data. This applies a trigonometric transformation to the variables to produce 101 x and y values for each of the 32 cars. We can combine these to create a new matrix (32 by 202), with rows representing samples (n=32) and columns the x (n=101) and y (n=101) encodings. This effectively increases our number of variables from 11 to 202, but hopefully also gives a deeper insight into any class structure.
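Here is a rough sketch of the encoding step followed by PCA, written in base R rather than with the andrews package. It departs from the description above in one respect, which is my own choice: it keeps only the 101 y values per car, because the x values (the grid of angles) are identical for every car and so contribute no variance. The 0 to 1 scaling and all object names are also assumptions.

```r
theta <- seq(-pi, pi, length.out = 101)
p <- ncol(mtcars)

# Trigonometric basis: one column per variable, one row per angle (101 x 11)
basis <- sapply(seq_len(p), function(j) {
  if (j == 1) return(rep(1 / sqrt(2), length(theta)))
  k <- j %/% 2
  if (j %% 2 == 0) sin(k * theta) else cos(k * theta)
})

# Assumption: scale each variable to 0 to 1 before encoding
scaled <- apply(mtcars, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# Andrews encoding as a matrix product: 32 cars by 101 curve values
encoded <- scaled %*% t(basis)

# PCA on the encoded data; curve values share one scale, so only center
pca_andrews <- prcomp(encoded, center = TRUE, scale. = FALSE)

plot(pca_andrews$x[, 1], pca_andrews$x[, 2],
     col = as.integer(factor(mtcars$cyl)) + 1, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Because the encoding is a linear map of 11 variables, the encoded matrix has at most 11 non-trivial components; the extra columns reweight the original variables rather than adding new information.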
Interestingly, this encoding highlights the previously noted and as yet unexplained factor (evident in the difference in y scores between vehicles with the same number of cylinders). Next, we can check the other discrete variables in the data to see if any of them help explain the clustering pattern observed above.
After a quick check it is evident that the type of transmission (am; manual (1) or automatic (0)) nicely explains the second mode of scores variance, which is not captured by cylinders.
This is less obvious in the autoscaled PCA.
Further inspection of the Andrews-encoded PCA also suggests that there is yet another potential covariate, as evident from the two clusters of 8 cylinder, automatic transmission vehicles (8|0).
At first blush, the Andrews method coupled with a dimensionality reduction technique seems like a very interesting approach for identifying covariate contributions to patterns in the data. It would be interesting to compare variable loadings from PCA of autoscaled and Andrews-encoded data, but it is not obvious how to do this…
If you want to replicate the analyses above, or just want to apply these visualizations to your own data, get all the necessary code from the example found HERE.
I recently gave a hands-on tutorial on using R and Cytoscape to build large biological networks. In these networks, nodes represent metabolites and edges can be many things, but I specifically focused on biochemical relationships and chemical similarities. Your imagination is the limit.
If you are interested check out the presentation below.
Here is all the R code and links to the relevant data you will need to follow along with the tutorial.
<pre>
# load needed functions: R package in progress - "devium", which is stored on github
source("http://pastebin.com/raw.php?i=Y0YYEBia")

# get sample chemical identifiers here:
# https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1
# Pubchem CIDs = cids
cids       # overview
nrow(cids) # how many
str(cids)  # structure, want numeric
cids <- as.numeric(as.character(unlist(cids))) # hack to break factor

# get KEGG RPAIRS
# making an edge list based on CIDs from KEGG reactant pairs
KEGG.edge.list <- CID.to.KEGG.pairs(cid = cids, database = get.KEGG.pairs(), lookup = get.CID.KEGG.pairs())
head(KEGG.edge.list)
dim(KEGG.edge.list) # a two column list with CID to CID connections based on KEGG RPAIRS

# how did I get this?
# 1) convert from CID to KEGG using get.CID.KEGG.pairs(), which is a table stored: https://gist.github.com/dgrapov/4964546
# 2) get KEGG RPAIRS using get.KEGG.pairs(), which is a table stored: https://gist.github.com/dgrapov/4964564
# 3) return CID pairs

# get EDGES based on chemical similarity (Tanimoto similarity > 0.7)
tanimoto.edges <- CID.to.tanimoto(cids = cids, cut.off = .7, parallel = FALSE)
head(tanimoto.edges)

# how did I get this?
# 1) use R package ChemmineR to query Pubchem PUG to get molecular fingerprints
# 2) calculate similarity coefficient
# 3) return edges with similarity above cut.off

# after a little bit of formatting make combined KEGG + tanimoto edge list
# https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=2
# now upload this and a sample node attribute table
# (https://docs.google.com/spreadsheet/ccc?key=0Ap1AEMfo-fh9dFZSSm5WSHlqMC1QdkNMWFZCeWdVbEE#gid=1)
# to Cytoscape
</pre>
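If you are curious what the chemical similarity step boils down to, here is a small self-contained sketch of the Tanimoto coefficient on binary fingerprints, followed by the cut-off that turns similarities into edges. The fingerprints are toy vectors I made up, and `tanimoto` and `tanimoto_edges` are hypothetical helpers for illustration, not the devium or ChemmineR functions used above.

```r
# Toy binary fingerprints: rows are compounds, columns are structural bits (made up)
fp <- rbind(
  cid1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  cid2 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  cid3 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Tanimoto coefficient: shared "on" bits divided by bits "on" in either compound
tanimoto <- function(a, b) sum(a & b) / sum(a | b)

tanimoto_edges <- function(fp, cut.off = 0.7) {
  pairs <- t(combn(rownames(fp), 2))  # all unordered pairs of compounds
  sim <- apply(pairs, 1, function(p) tanimoto(fp[p[1], ], fp[p[2], ]))
  edges <- data.frame(source = pairs[, 1], target = pairs[, 2],
                      similarity = sim, stringsAsFactors = FALSE)
  edges[edges$similarity >= cut.off, ]  # keep only similar pairs
}

edges <- tanimoto_edges(fp, cut.off = 0.7)
edges
```

The resulting two-column edge list (plus the similarity as an edge attribute) has the same shape as the KEGG edge list, which is what makes it easy to combine the two before loading them into Cytoscape.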
You can also download all the necessary materials HERE, which include:
- tutorial in powerpoint
- R script
- Network edge list and node attributes table
- Cytoscape file
Happy network making!