
Monday, May 2, 2011

Splitting MDL datasets into subsets of informative and robust clusters

Yesterday I ran the Mclust R package on the entire MDL dataset to split it into 5 partially intersecting subsets (see legend; each row represents a single participant, identified by the row number, e.g. V+1=V1; ...; V+164=V164). The 5 clusters were obtained with 4 dimensions retained.

From Mclust's page:

MCLUST is an R package for normal mixture modeling via EM, model-based clustering, discriminant analysis and density estimation.

The method, as explained by Dienekes Pontikos:

In short, this method exploits the clusteredness of individuals along different dimensions of the MDS representation of dense genotypic data. It uses a powerful model-based clustering algorithm (MCLUST) that can infer the existence of clusters of different size, shape, and orientation in the MDS space, and which automatically optimizes for the Bayes Information Criterion, balancing off detail with parsimony.
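To make the "detail versus parsimony" trade-off concrete, here is a minimal sketch of a BIC comparison in plain Python. It is not Mclust's code. It assumes the sign convention I understand mclust to use (BIC = 2 * log-likelihood - parameters * log(n), larger is better) and an unconstrained full-covariance Gaussian mixture, which is only one of the covariance shapes Mclust considers. The function names and the log-likelihood values in the usage note are made up for illustration.

```python
import math


def n_params_full_gmm(n_clusters, n_dims):
    """Free parameters of a Gaussian mixture with one unconstrained
    covariance matrix per cluster (an assumed model shape)."""
    weights = n_clusters - 1                               # mixing proportions sum to 1
    means = n_clusters * n_dims                            # one mean vector per cluster
    covariances = n_clusters * n_dims * (n_dims + 1) // 2  # symmetric matrices
    return weights + means + covariances


def bic(log_likelihood, n_params, n_samples):
    """BIC with the 'larger is better' sign convention assumed above."""
    return 2.0 * log_likelihood - n_params * math.log(n_samples)


def best_model(candidates, n_samples):
    """candidates: list of (label, log_likelihood, n_params) tuples.
    Returns the label of the candidate with the highest BIC."""
    return max(candidates, key=lambda c: bic(c[1], c[2], n_samples))[0]
```

For example, with 5 clusters in 4 dimensions `n_params_full_gmm(5, 4)` gives 74 free parameters, so with 164 participants a 5-cluster model has to improve the log-likelihood a lot to beat a simpler one. With invented values, `best_model([("1 cluster", -500.0, 14), ("5 clusters", -495.0, 74)], 164)` picks the 1-cluster model, because the small gain in fit does not pay for the 60 extra parameters.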

The only parameter that I need to specify to MCLUST is the number of MDS dimensions to retain (for a more detailed analysis, see here), as extra dimensions may add "clusteredness" but also noise. In order to decide on how many dimensions to retain, I empirically run MCLUST with a different number of dimensions (from 2 to 50).
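The sweep over retained dimensions can be sketched roughly as follows, again in plain Python. `sweep_dimensions` and `cluster_fn` are hypothetical names: `cluster_fn` stands in for whatever clustering routine is used (Mclust in the original workflow) and is assumed to return a summary for the truncated coordinates, such as the number of clusters found. The quoted text does not say which rule picks the final number of dimensions, so the sketch only tabulates the results.

```python
def sweep_dimensions(coords, cluster_fn, dims=range(2, 51)):
    """Run cluster_fn on the first d columns of coords for each d in dims.

    coords     : list of rows, one per participant, one column per dimension
    cluster_fn : hypothetical callable taking the truncated rows and
                 returning a summary (e.g. number of clusters found)
    Returns a dict mapping d to cluster_fn's result.
    """
    n_available = len(coords[0]) if coords else 0
    results = {}
    for d in dims:
        if d > n_available:
            continue  # cannot retain more dimensions than were computed
        truncated = [row[:d] for row in coords]
        results[d] = cluster_fn(truncated)
    return results
```

The returned dictionary can then be inspected by hand, which matches the "empirical" spirit of the description above.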


