Discriminating Between Iris Species | Creative Data Solutions

The Iris data set is famous for its use in comparing classification methods.

The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. Looking at scatter plots of the 4 variables in the data set, we can see that neither a single variable nor any bivariate combination achieves this.
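As a rough illustration of that overview, here is a minimal sketch using base R and the built-in `iris` data frame. The blue and green colours follow the post; red for I. setosa and the range-overlap check are assumptions added for illustration, not the original analysis.

```r
# Colour per species: blue/green follow the post, red for setosa is an assumption
species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green")

# Scatter plot matrix of the 4 measurements
pairs(iris[, 1:4],
      col = species_cols[as.character(iris$Species)],
      pch = 19,
      main = "Iris measurements by species")

# For each variable, do the observed ranges of versicolor and virginica overlap?
overlap <- sapply(iris[, 1:4], function(v) {
  r <- tapply(v, iris$Species, range)
  vers <- r[["versicolor"]]
  virg <- r[["virginica"]]
  max(vers) >= min(virg) && max(virg) >= min(vers)
})
print(overlap)
```

An overlap in every variable means no single threshold on one measurement separates the two species cleanly.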

One approach to improving the separation between the two closely related Iris species, I. versicolor (blue) and I. virginica (green), is to combine all 4 measurements by constructing principal components (PCs).

Calculating the PCs with the singular value decomposition, we see that the sample scores still do not resolve the two species of interest.
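The calculation can be sketched in a few lines of base R. This assumes the variables are centred and scaled to unit variance before the decomposition; the post does not say which preprocessing was used, and red for I. setosa is again an assumed colour.

```r
# Measurements only, centred and scaled to unit variance (an assumption)
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = TRUE)

# Singular value decomposition: X = U D V'
s <- svd(X)

pc_scores   <- s$u %*% diag(s$d)   # sample scores, equal to X %*% s$v
pc_loadings <- s$v                 # variable loadings
rownames(pc_loadings) <- colnames(X)

# Share of total variance captured by each PC
var_explained <- s$d^2 / sum(s$d^2)

# Scores plot for the first two PCs, coloured by species
plot(pc_scores[, 1], pc_scores[, 2],
     col = c("red", "blue", "green")[iris$Species],
     pch = 19, xlab = "PC1", ylab = "PC2")
```

The same scores, up to sign, come from `prcomp(iris[, 1:4], scale. = TRUE)$x`, which is a convenient cross-check.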

Another approach is to use a supervised projection method like partial least squares (PLS) to identify latent variables (LVs), which are data projections similar to those of PCA but which are also correlated with the species label. Interestingly, this approach leads to a projection which changes the orientation of I. versicolor and I. virginica relative to I. setosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.
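One way to sketch a PLS projection of this kind is with the `pls` package, regressing a species indicator matrix on the scaled measurements (a PLS-DA style setup). The choice of package, the indicator coding, the scaling and the two LVs are all assumptions; the post does not state how its PLS model was fitted.

```r
library(pls)  # assumed to be installed, e.g. install.packages("pls")

# Predictors: the 4 measurements, centred and scaled (an assumption)
X <- scale(as.matrix(iris[, 1:4]))

# Response: 150 x 3 indicator (dummy) matrix, one column per species
Y <- model.matrix(~ Species - 1, data = iris)

# Matrix columns are protected with I() so the formula sees them as blocks
pls_data <- data.frame(Y = I(Y), X = I(X))
fit <- plsr(Y ~ X, ncomp = 2, data = pls_data)

# Sample scores on the 2 latent variables
lv_scores <- pls::scores(fit)

plot(lv_scores[, 1], lv_scores[, 2],
     col = c("red", "blue", "green")[iris$Species],
     pch = 19, xlab = "LV1", ylab = "LV2")
```

Because the species labels enter the model through `Y`, the LVs are pulled toward directions that covary with class membership, which is what distinguishes them from the unsupervised PCs.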

Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores, we can see that this approach is the most successful at resolving the two closely related species. However, the loadings from this method, which help explain how the variables are combined to achieve the classification, are impossible to interpret. In the case of the function used above (nlpca, pcaMethods R package), the loadings are literally NA.
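For readers who want to try this, a minimal sketch with pcaMethods follows. The preprocessing, the number of components, the `maxSteps` value and the seed are assumptions, not the settings behind the figure, and because the network starts from random weights the scores will vary from run to run.

```r
library(pcaMethods)  # Bioconductor package, e.g. BiocManager::install("pcaMethods")

set.seed(1)  # arbitrary seed; training starts from random weights

# Centred and scaled measurements (an assumption)
X <- scale(as.matrix(iris[, 1:4]))

# Non-linear PCA through the generic pca() interface
res <- pca(X, method = "nlpca", nPcs = 2, center = FALSE, maxSteps = 500)

nl_scores <- scores(res)

plot(nl_scores[, 1], nl_scores[, 2],
     col = c("red", "blue", "green")[iris$Species],
     pch = 19, xlab = "NLPC1", ylab = "NLPC2")

# The post reports that the loadings for this method are NA
print(loadings(res))
```

Larger `maxSteps` values give the network more training iterations at the cost of run time.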

This entry was posted on August 4, 2012 by dgrapov. It was filed under Uncategorized and was tagged with classification, data analysis, histogram, imDEV, Iris Data, neural networks, non-linear PCA, PCA, pcaMethods, PLS-DA, R, scatterplot, scatterplot matrix.
