When you want to get to know and love your data

Posts tagged “clustering”

High Dimensional Biological Data Analysis and Visualization


High dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experimental design, and made up of complex relationships driven by both biological and analytical sources of variance. Luckily, the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action, or view a more in-depth presentation of the techniques used below.


Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.

Tissue network cancer


Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Cancer tissue network


Hierarchical clusters (color) mapped to a biochemical and structural similarity network displaying difference before and after drug administration.

cough syrup network


Partial correlation network displaying changes in metabolite relationships in response to drug treatment.

Treatment response network


Partial correlation network displaying changes in disease and response to drug treatment.

Treatment effects network
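The partial correlation networks above draw an edge between two variables only when they stay correlated after accounting for other variables. Here is a minimal sketch of that idea, assuming just three variables and a first-order partial correlation. The figures themselves were made with R and RCytoscape; this is plain Python on hypothetical toy data (the names A, B, C and the 0.5 cutoff are illustrative assumptions, not values from the analyses shown).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy data: A and B both track C, each with its own small noise.
c = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_a = [0.3, -0.2, 0.1, -0.4, 0.4, -0.1, 0.2, -0.3]
noise_b = [-0.1, 0.3, -0.3, 0.2, 0.1, -0.4, 0.4, -0.2]
a = [ci + e for ci, e in zip(c, noise_a)]
b = [ci + e for ci, e in zip(c, noise_b)]

data = {"A": a, "B": b, "C": c}
names = sorted(data)
cutoff = 0.5  # assumed edge threshold on |partial r|

edges = []
for i, u in enumerate(names):
    for v in names[i + 1:]:
        # With three variables, the remaining one is the conditioning variable.
        (w,) = [n for n in names if n not in (u, v)]
        r = partial_corr(data[u], data[v], data[w])
        if abs(r) >= cutoff:
            edges.append((u, v, round(r, 2)))

print("plain r(A, B):", round(pearson(a, b), 2))
print("edges kept:", edges)
```

A and B are strongly correlated in the ordinary sense, but only because both follow C. Once C is controlled for, the A to B edge falls below the cutoff, while A to C and B to C remain. With many variables you would condition on more than one variable at a time, which this sketch does not attempt.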


Check out the full presentation below.

Creative Commons License


Network Mapping Video

Here are a video and slides from a presentation of mine about my favorite topic:


Comparison of Serum vs Urine metabolites

Primary metabolites in human serum or urine.

serum urine id

Oh oh, there seem to be some outliers: serum samples that look like urine and vice versa. Fix these assignments, then evaluate the result using PCA and hierarchical clustering on rank correlations.

fix assignments
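To make the "hierarchical clustering on rank correlations" step concrete, here is a minimal sketch in plain Python. It assumes Spearman correlation as the rank correlation, 1 - rho as the distance, and average linkage; the post does not say which linkage was used, and the sample names and values below are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def average_linkage(samples, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    names = list(samples)
    dist = {(u, v): 1 - spearman(samples[u], samples[v])
            for u in names for v in names}
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(u, v) for u in clusters[i] for v in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles over five metabolites; "serum_3" carries a serum
# label but has a urine-like profile.
samples = {
    "serum_1": [90, 30, 1, 15, 2],
    "serum_2": [85, 28, 1.2, 12, 3],
    "serum_3": [2, 300, 120, 1, 40],
    "urine_1": [1, 320, 110, 2, 35],
    "urine_2": [3, 280, 130, 1.5, 45],
}

clusters = average_linkage(samples, n_clusters=2)
print(clusters)
```

In this toy case the sample labeled serum_3 lands in the urine cluster, which is the kind of mismatch between label and profile that flags a likely mix-up. Working on ranks keeps a few very abundant metabolites from dominating the distances.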

Now things look more believable. Next, let us test the effects of data pre-treatment on PLS-DA model scores for a 3-group comparison in serum. Ideally, group scores would be maximally resolved along the first latent variable (x-axis), and within-group variance would be orthogonal to it, along the y-axis.

scaling vs normalization

Compared to the raw data model (TOP), where about 3 top variables (glucose, urea and mannitol) dominate the variance structure, the autoscaled model displays a more balanced contribution of variables to scores variance, due to variable-wise mean subtraction and division by the standard deviation. The larger separation between WHITE and RED class scores along the x-axis suggests improved classifier performance over the raw data model. Samples with scores outside their respective group's Hotelling's T² ellipse (95%) might be outliers to investigate further or potentially exclude from the current test.
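Autoscaling itself is simple arithmetic, so here is a minimal sketch of it in plain Python, together with a quick look at how much each variable contributes to the total variance before and after. The intensities are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption; the PLS-DA model itself is not reproduced here.

```python
import statistics

def autoscale(columns):
    """Mean-center each variable and divide by its sample standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # fails on constant variables (sd == 0)
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled

def variance_share(columns):
    """Fraction of the total variance contributed by each variable."""
    var = {name: statistics.variance(v) for name, v in columns.items()}
    total = sum(var.values())
    return {name: v / total for name, v in var.items()}

# Hypothetical intensities for four samples.
raw = {
    "glucose": [9000.0, 12000.0, 7000.0, 15000.0],
    "urea": [4000.0, 5500.0, 3000.0, 6000.0],
    "citrate": [12.0, 9.0, 15.0, 10.0],
}

scaled = autoscale(raw)
raw_share = variance_share(raw)
scaled_share = variance_share(scaled)

for name in raw:
    print(name, round(raw_share[name], 3), round(scaled_share[name], 3))
```

On the raw scale the most abundant variable accounts for most of the variance, so it would also dominate any variance-driven model. After autoscaling every variable has unit variance and contributes an equal share, which is the "more balanced contribution" described above.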