Earlier this month I experimented with the ChromoPainter and fineSTRUCTURE software. For
the sake of comparison with the ADMIXTURE/LAMP results, I used a PLINK
linkage file with previously extracted SNPs on Chr.6. The PLINK file
included a thinned set of LD-pruned SNPs (9155 SNPs in total). I also
limited the dataset to include the project's participants only. I
streamlined the phasing of this dataset by using Gusev's phasing_pipeline (which is indispensable for converting between PLINK and BEAGLE formats and for phasing genotypes with BEAGLE).
After that I ran plink2chromopainter.pl, a conversion script for going from PLINK linkage-style PED and MAP files to ChromoPainter's PHASE and MAP files. The converted MAP file was merged with HapMap's pre-compiled recombination file for Chr.6 (to accomplish this task I used a simple, buggy but efficient AWK script). This trick allowed me to skip the painstaking exercise of running ChromoPainter's UNLINKED model.
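For illustration, here is a minimal Python sketch of the kind of merge described above. It is not the AWK script itself. It assumes the genetic map is available as sorted base-pair positions with cumulative cM values, and that the painter wants a per-base-pair rate in Morgans between consecutive SNPs; both assumptions, and the function names, are mine and should be checked against the ChromoPainter manual.

```python
from bisect import bisect_right

def interpolate_cm(map_pos, map_cm, snp_pos):
    """Linearly interpolate the genetic position (cM) at each SNP position (bp).

    map_pos must be strictly increasing; SNPs outside the map are clamped
    to the first or last map value (an assumption of this sketch).
    """
    out = []
    for p in snp_pos:
        if p <= map_pos[0]:
            out.append(map_cm[0])
        elif p >= map_pos[-1]:
            out.append(map_cm[-1])
        else:
            i = bisect_right(map_pos, p)  # map_pos[i-1] <= p < map_pos[i]
            x0, x1 = map_pos[i - 1], map_pos[i]
            y0, y1 = map_cm[i - 1], map_cm[i]
            out.append(y0 + (y1 - y0) * (p - x0) / (x1 - x0))
    return out

def per_bp_rates(snp_pos, snp_cm):
    """Rate in Morgans per bp between consecutive SNPs; the last SNP gets 0."""
    rates = []
    for i in range(len(snp_pos) - 1):
        d_bp = snp_pos[i + 1] - snp_pos[i]
        d_morgan = (snp_cm[i + 1] - snp_cm[i]) / 100.0  # cM -> Morgans
        rates.append(d_morgan / d_bp)
    rates.append(0.0)
    return rates
```

With SNP positions taken from the converted MAP file, `per_bp_rates(snp_pos, interpolate_cm(map_pos, map_cm, snp_pos))` gives one rate per SNP, which can then be written out next to the positions.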
As a next step, I changed the default -k parameter to 200 (to give approximately 200 random samples to estimate the local variance). It is important to mention that ChromoPainter can be run in two modes: "donor mode" (which assumes the prior existence of "donor" and "recipient" haplotypic populations) and "all against all" mode. I used the latter approach, in which a single haplotype within an individual is reconstructed using the haplotypes from all other individuals in the sample as potential donors. This process is repeated for every haplotype in turn, so every individual is ultimately reconstructed in terms of all the other individuals, i.e. every individual in my sample is conditioned on all the other individuals.
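The "all against all" bookkeeping can be sketched in a few lines of Python. This only shows which haplotypes act as donors for which recipient; it does not implement the painting itself. The function name and the assumption that each individual's haplotypes are stored consecutively are mine.

```python
def all_against_all(n_haplotypes, haps_per_individual=2):
    """Yield (recipient, donors) index pairs.

    Donors are all haplotypes from *other* individuals; haplotypes of
    individual i are assumed to sit at consecutive indices.
    """
    for r in range(n_haplotypes):
        own = r // haps_per_individual
        donors = [d for d in range(n_haplotypes)
                  if d // haps_per_individual != own]
        yield r, donors
```

For two diploid individuals (four haplotypes), haplotypes 0 and 1 are each painted from donors 2 and 3, and vice versa.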
It took ChromoPainter approximately 1.45 hours to finish the job. After that I combined the ChromoPainter output files, creating the file "MDLP.Chr6.chunkcounts.out", which is the coancestry matrix that fineSTRUCTURE requires as input.
After that I fed fineSTRUCTURE the required input (the coancestry matrix) and let it find the appropriate number of "splits" (clusters) in the sample. Surprisingly, even in such a limited dataset (tiny and LD-pruned), fineSTRUCTURE was able to detect four population-like clusters (2 East European, 1 West European and outliers). In comparison, ADMIXTURE had some trouble fleshing out the sample's K clusters when I ran it on both the phased and unphased versions of the same Chr.6 dataset (you can see them in this Excel spreadsheet).
From what I gather, the authors' claim holds water: fineSTRUCTURE indeed outperforms the ADMIXTURE/STRUCTURE approaches: "We next turn to our fineSTRUCTURE model-based analysis, again
considering the unlinked coancestry matrix even though strong and
variable LD exists in the dataset. We first compared performance
of our unlinked model to the popular ADMIXTURE [15] software (Figure 3B
and D, details in Section S8). Encouragingly, as the number of 5Mb
regions increased from 5 to 200 we saw a monotonic performance increase
for the no-linkage model, separating all groups with 200 markers.
Further, our approach outperformed ADMIXTURE, with the ADMIXTURE
performance levelling at around 60% correlation with the truth.
In practice, we observed ADMIXTURE successfully splitting groups A, B
and C and mostly splitting C1 and C2, but not B1 and B2, as detailed in
Figures S6-11. ADMIXTURE performs inference under a model where markers
are treated as unlinked, and where individuals may have genomes made up
of mixtures of inferred source populations, while our simulation
incorporated drift between populations, but not admixture. To examine
whether violations of both these modelling assumptions explain the
different results, we simulated a new dataset with the same underlying
population structure of 5 populations as before, but no linkage (i.e.
independence) between markers within each population. We analysed these
data with STRUCTURE, which uses a similar underlying model to that of
ADMIXTURE, but includes a no-admixture model (Section S7). For
small datasets, STRUCTURE slightly improved performance relative to our
unlinked fineSTRUCTURE model, but for larger SNP numbers, fineSTRUCTURE
was able to identify all population splits (K = 5) while again,
STRUCTURE was able to split only populations A, B and C (K = 3).
Thus, even when LD information is not used (or even present),
fineSTRUCTURE can offer advantages in some settings over these existing
approaches."
With this gist of the fineSTRUCTURE methodology, it is now safe to conclude that on the combined set of all 22 chromosomes the linkage model can do even better.
The painting process gives three square matrices of size NxN (N = number of individuals): 1) the frequency with which individuals copy chunks from each other, 2) the average length of those chunks, and 3) the mutation rate. These capture different aspects of the ancestry history and almost completely summarize population ancestry.
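As a small worked example of how the first two matrices relate, the Python sketch below derives an average chunk length per donor-recipient pair. It assumes you have a matrix of chunk counts and a matrix of total copied length; that input layout and the function name are assumptions of this sketch, not something the software is guaranteed to provide in this form.

```python
def average_chunk_length(counts, total_lengths):
    """Element-wise total_lengths / counts for two NxN matrices.

    Cells with a zero count (e.g. the diagonal, since nobody copies
    from themselves) are left at 0.0 rather than dividing by zero.
    """
    n = len(counts)
    avg = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if counts[i][j] > 0:
                avg[i][j] = total_lengths[i][j] / counts[i][j]
    return avg
```

Few but long chunks between two individuals point to more recent shared ancestry than many short ones, which is why the count and length matrices are not redundant.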
The origin of some individual haplotypes could be ascertained using the more comprehensive GERMLINE matching algorithm. Indeed, after applying the GERMLINE algorithm to BEAGLE-phased chromosomes
(1..22), I can see that pairwise haplotype matches located far
away from the chromosomal centromere and telomeres tend to cluster into
"ethnic" groups. However, haplotypes observed near the
centromere are rather omnipresent, as would be expected if these
regions had low rates of recombination because of their proximity to
centromeres. The IBD segments detected by GERMLINE are listed (per chromosome) below:
With the focus on the xMHC (see our previous posting), of special interest here are, perhaps, the haplotype matches on Chromosome 6, which fit the pattern of massive, extensive sharing of linked SNPs in the xMHC region - Chrom6-cM
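To make the earlier distinction between matches near the centromere and matches on the chromosome arms concrete, here is a minimal Python sketch that splits IBD segments by whether they overlap a window around the centromere. The segment tuple layout, the function names, and the centromere coordinates and flank size are all hypothetical inputs; real centromere positions have to come from the genome build you use.

```python
def near_centromere(seg_start, seg_end, cen_start, cen_end, flank_bp):
    """True if the segment overlaps the centromere widened by flank_bp."""
    return (seg_start <= cen_end + flank_bp
            and seg_end >= cen_start - flank_bp)

def split_segments(segments, cen_start, cen_end, flank_bp):
    """Split (pair_id, start_bp, end_bp) tuples into (near, far) lists."""
    near, far = [], []
    for seg in segments:
        _, start, end = seg
        if near_centromere(start, end, cen_start, cen_end, flank_bp):
            near.append(seg)
        else:
            far.append(seg)
    return near, far
```

Segments in the `far` list are the ones I would look at first for "ethnic" clustering, since the `near` list is expected to be dominated by widely shared low-recombination haplotypes.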