Saturday, January 21, 2012

Comparing fineSTRUCTURE clustering to Mclust clusterings

The first weeks of 2012 marked a milestone in BGA blogging, bringing a substantial shift in methodology.
First of all, the Eurogenes blog applied the recently published ChromoPainter tools to its intra-North-European cluster analysis (more than 400 samples and 270K SNPs, in linkage mode, with 200K burn-in and sampling iterations).

The Dodecad Project has also adapted its Cluster Galore method for use with linked haplotype data (this refined method, however, is designed to work not with fineSTRUCTURE but with different software, fastIBD).

Although the two methods differ in technical performance and design, they pursue the same goal: inferring population structure. This type of structure inference consists of two parts: deriving a matrix of relationships between the sampled individuals, and clustering those relationships. As Dan Lawson noted, given a distance-like matrix such as the number of SNPs IBS or the ChromoPainter coancestry matrix, it is possible to apply a wide variety of clustering algorithms.
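
To make the "matrix first, clustering second" idea concrete, here is a minimal sketch of one of the cheaper alternatives: greedy average-linkage clustering applied directly to a similarity matrix. It is neither fineSTRUCTURE's MCMC model nor Mclust's Gaussian mixture model; the function name and the toy matrix are made up for illustration and are not project data.

```python
from itertools import combinations

def average_linkage_clusters(similarity, n_clusters):
    """Greedy agglomerative clustering on a symmetric similarity matrix.

    Repeatedly merges the two clusters with the highest mean pairwise
    similarity until n_clusters remain. Returns sorted lists of indices.
    """
    clusters = [[i] for i in range(len(similarity))]

    def mean_similarity(a, b):
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters that is most similar on average.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: mean_similarity(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)

# Toy, made-up similarity matrix for five individuals (NOT real data):
# individuals 0-2 share a lot with each other, as do individuals 3-4.
toy = [
    [0, 9, 8, 1, 2],
    [9, 0, 7, 2, 1],
    [8, 7, 0, 1, 1],
    [1, 2, 1, 0, 9],
    [2, 1, 1, 9, 0],
]
print(average_linkage_clusters(toy, 2))  # [[0, 1, 2], [3, 4]]
```

Note that the number of clusters has to be supplied by hand here, which is exactly the kind of tuning parameter a greedy method like this leaves open.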

Still, fineSTRUCTURE has two advantages. Firstly, because fineSTRUCTURE performs MCMC, it is less likely to get stuck in local optima than computationally cheaper methods that climb gradients. Secondly, and most importantly, fineSTRUCTURE (when used correctly) is well calibrated, as it has no unknown and hard-to-estimate tuning parameters.


Experiment
To evaluate the performance of different clustering methods on a real-world dataset, I used sampled individuals from my project. To make the test even harder, I applied ChromoPainter's algorithm (in linkage mode) to a homogeneous subset of very similar Baltic populations: Lithuanians, Belorussians and Ukrainians (90 samples, plus a couple of reference Belorussian, Lithuanian and Ukrainian samples, with 90K SNPs). The same dataset was used in Shellfish to carry out a principal component analysis of the genome-wide SNP data, and in PLINK for MDS calculations. The resulting PCA and MDS coordinates were then clustered with the general-purpose clustering software Mclust.


I ran ChromoPainter separately on each of the 22 chromosomes and merged the output files into a single file. I then applied fineSTRUCTURE's MCMC algorithm to infer the population structure, PCA components and cluster trees. Below are plots showing the individual coancestry matrix, the individual-level and population-level (aggregated) plots, and the PCA plots.
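
The post does not say which tool did the merging, so the following is only a sketch of the underlying idea. It assumes each chromosome yields a square matrix of pairwise counts with the individuals in the same order, and that the genome-wide matrix is their element-wise sum. The function name and the tiny matrices are hypothetical, and no real ChromoPainter file format is parsed here.

```python
def sum_matrices(matrices):
    """Element-wise sum of equally sized square matrices (lists of lists)."""
    n = len(matrices[0])
    total = [[0.0] * n for _ in range(n)]
    for m in matrices:
        # Refuse to combine matrices that do not cover the same individuals.
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must share the same shape")
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total

# Two made-up per-chromosome matrices for three individuals (NOT real data).
chrom_a = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
chrom_b = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
print(sum_matrices([chrom_a, chrom_b]))
# [[0.0, 3.0, 2.0], [3.0, 0.0, 5.0], [2.0, 5.0, 0.0]]
```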








It appears that fineSTRUCTURE inferred 5 clusters (starting with the smallest): (1) Ashkenazi, (2) individuals of West European origin with moderate to minor East European "admixture", (3) individuals from Eastern Europe, (4) Lithuanians, and (5) the largest cluster, comprising Poles, Belorussians, Russians and Ukrainians.
For our project's purposes, it is important to note that this is the first time such a clear split between Belorussians and Lithuanians has appeared. Although our earlier experiments hinted at this division, the statistical signal of separation was weak enough to be ignored, because (as shown in the plot below) some regions of the plots are highly uncertain.






The fineSTRUCTURE cluster assignments for the project's individuals can be seen here.
I have also published the corresponding cluster assignments produced by Mclust (PLINK+Mclust and Shellfish+Mclust). Comparing these two Mclust solutions with each other gives a direct agreement of 0 of 4 pairs; after permutation matching (24 iterations), 34.58 % of cases fall in matched pairs. The cross-tabulation of the two solutions, with columns in matched order, is:

     2  3  1  4
  1  0  2  3  0
  2  2  8  1  3
  3 13 11 16 18
  4  4  4  9 13
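
The 34.58 % figure can be checked by hand from the table above. The sketch below tries every one-to-one pairing of the 4 row clusters with the 4 column clusters (4! = 24 permutations, which matches the reported iteration count) and keeps the pairing that puts the most individuals on the diagonal. The wording of the report resembles what the `matchClasses` function in the R package e1071 prints, but the post does not name the routine, so treat that link as an assumption. The function name `best_matching` is made up.

```python
from itertools import permutations

# Cross-tabulation exactly as printed: rows are clusters 1-4 of one Mclust
# solution, columns are the other solution's clusters in printed order.
column_labels = [2, 3, 1, 4]
table = [
    [0, 2, 3, 0],
    [2, 8, 1, 3],
    [13, 11, 16, 18],
    [4, 4, 9, 13],
]

def best_matching(table):
    """Brute-force search for the row-to-column pairing with most shared cases."""
    n = len(table)
    best_perm, best_hits, tried = None, -1, 0
    for perm in permutations(range(n)):
        tried += 1
        hits = sum(table[row][perm[row]] for row in range(n))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits, tried

perm, hits, tried = best_matching(table)
total = sum(map(sum, table))
print(tried, hits, total, round(100 * hits / total, 2))  # 24 37 107 34.58
```

In other words, even under the most favourable relabelling only 37 of the 107 individuals land in corresponding clusters, so the two Mclust solutions agree rather poorly.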