Thursday, December 29, 2011

Last posting in the year 2011

Earlier this month I experimented with the ChromoPainter and fineSTRUCTURE software. For the sake of comparison with the ADMIXTURE/LAMP results, I used a PLINK linkage file with previously extracted SNPs on Chr. 6. The PLINK file included a thinned set of LD-pruned SNPs (9155 SNPs in total). I also limited the dataset to the project's participants only. I streamlined the phasing of this dataset by using Gusev's phasing_pipeline (which is indispensable for converting between the PLINK and BEAGLE formats and for phasing genotypes with BEAGLE).

After that I used a conversion script to go from PLINK linkage-style PED and MAP files to ChromoPainter's PHASE and MAP files. The converted MAP file was merged with HapMap's pre-compiled recombination file for Chr. 6 (to accomplish this I used a simple, buggy but efficient AWK script). This trick allowed me to skip the painstaking exercise of running ChromoPainter's UNLINKED model.
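
The merge step above can be sketched in Python. This is not the AWK script I actually used, only a minimal illustration under stated assumptions: the genetic map is given as sorted physical positions with cumulative cM values, the SNP positions are sorted and distinct, and the target is a per-SNP recombination rate in Morgans per base pair up to the next SNP (my understanding of what ChromoPainter expects; check the manual before relying on it). The function names are hypothetical.

```python
from bisect import bisect_left


def interpolate_cm(map_pos, map_cm, pos):
    """Linearly interpolate the genetic position (cM) of a physical position.

    Positions outside the map are clamped to the first or last map value.
    """
    if pos <= map_pos[0]:
        return map_cm[0]
    if pos >= map_pos[-1]:
        return map_cm[-1]
    i = bisect_left(map_pos, pos)
    if map_pos[i] == pos:
        return map_cm[i]
    x0, x1 = map_pos[i - 1], map_pos[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def recombination_rows(snp_pos, map_pos, map_cm):
    """Return (position, rate) pairs, rate in Morgans per bp to the next SNP.

    Assumes snp_pos is sorted with no duplicates; the last SNP gets rate 0.
    """
    cm = [interpolate_cm(map_pos, map_cm, p) for p in snp_pos]
    rows = []
    for k in range(len(snp_pos) - 1):
        dist_bp = snp_pos[k + 1] - snp_pos[k]
        rate = (cm[k + 1] - cm[k]) / 100.0 / dist_bp  # cM -> Morgans
        rows.append((snp_pos[k], rate))
    rows.append((snp_pos[-1], 0.0))
    return rows
```

Linear interpolation between map points is the simplest choice here; SNPs falling outside the mapped range get a zero rate towards their neighbours, which may or may not be what you want.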

As a next step, I changed the default -k parameter to 200 (to give approximately 200 random samples for estimating the local variance). It is important to mention that ChromoPainter can be run in two modes: "donor mode" (which assumes the prior existence of "donor" and "recipient" haplotypic populations) and "all against all" mode. I used the latter approach, in which a single haplotype within an individual is reconstructed using the haplotypes from all other individuals in the sample as potential donors. This process is repeated for every haplotype in turn, so every individual is ultimately reconstructed in terms of all the other individuals, i.e. every individual in my sample is conditioned on all the other individuals.
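
To make the "all against all" idea concrete, here is a small Python sketch of how the donor set for each recipient haplotype is formed. It only illustrates the bookkeeping described above, not ChromoPainter's actual code; the function name and the indexing of haplotypes (consecutive haplotypes belong to the same individual) are my own assumptions.

```python
def donor_sets(n_individuals, ploidy=2):
    """Yield (recipient_haplotype, donor_haplotypes) for all-against-all painting.

    Haplotypes are numbered 0..n_individuals*ploidy-1, and haplotype h is
    assumed to belong to individual h // ploidy. Each recipient haplotype may
    copy from every haplotype except those of its own individual.
    """
    n_haps = n_individuals * ploidy
    for recipient in range(n_haps):
        owner = recipient // ploidy
        donors = [h for h in range(n_haps) if h // ploidy != owner]
        yield recipient, donors


# Toy usage: 3 diploid individuals give 6 haplotypes, each with 4 donors.
for recipient, donors in donor_sets(3):
    print(recipient, donors)
```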

It took ChromoPainter approximately 1,45 hours to finish the job. After that I combined the ChromoPainter output files, creating the file "MDLP.Chr6.chunkcounts.out", which is the coancestry matrix that fineSTRUCTURE requires as input.
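
For readers wondering what "combining" could mean in practice, here is a rough Python sketch. It assumes that combining amounts to element-wise addition of chunk-count matrices from separate runs over the same individuals in the same order; that is an assumption on my part, and the function name is hypothetical. It is not a replacement for the tools shipped with fineSTRUCTURE.

```python
def combine_chunkcounts(matrices):
    """Element-wise sum of square chunk-count matrices (lists of lists).

    Every matrix must be N x N with rows and columns in the same
    individual order.
    """
    if not matrices:
        raise ValueError("need at least one matrix")
    n = len(matrices[0])
    total = [[0.0] * n for _ in range(n)]
    for m in matrices:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be N x N with the same N")
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total
```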

After that I fed fineSTRUCTURE the required input (the coancestry matrix) and let it find the correct number of "splits" (clusters) in the sample. Surprisingly, even in such a limited (LD-pruned and tiny) dataset of SNPs, fineSTRUCTURE was able to detect four population-like clusters (2 East-European, 1 West-European and outliers). In comparison, ADMIXTURE had some trouble fleshing out the sample's K clusters when I ran it on both the phased and unphased versions of the same Chr. 6 dataset (you can see them in this Excel spreadsheet).
From what I gather, the authors' claim holds water, and fineSTRUCTURE does indeed outperform the ADMIXTURE/STRUCTURE approaches: "We next turn to our fineSTRUCTURE model-based analysis, again considering the unlinked coancestry matrix even though strong and variable LD exists in the dataset. We first compared performance of our unlinked model to the popular ADMIXTURE [15] software (Figure 3B and D, details in Section S8). Encouragingly, as the number of 5Mb regions increased from 5 to 200 we saw a monotonic performance increase for the no-linkage model, separating all groups with 200 markers. Further, our approach outperformed ADMIXTURE, with the ADMIXTURE performance levelling at around 60% correlation with the truth. In practice, we observed ADMIXTURE successfully splitting groups A, B and C and mostly splitting C1 and C2, but not B1 and B2, as detailed in Figures S6-11. ADMIXTURE performs inference under a model where markers are treated as unlinked, and where individuals may have genomes made up of mixtures of inferred source populations, while our simulation incorporated drift between populations, but not admixture. To examine whether violations of both these modelling assumptions explain the different results, we simulated a new dataset with the same underlying population structure of 5 populations as before, but no linkage (i.e. independence) between markers within each population. We analysed these data with STRUCTURE, which uses a similar underlying model to that of ADMIXTURE, but includes a no-admixture model (Section S7). For small datasets, STRUCTURE slightly improved performance relative to our unlinked fineSTRUCTURE model, but for larger SNP numbers, fineSTRUCTURE was able to identify all population splits (K = 5) while again, STRUCTURE was able to split only populations A, B and C (K = 3). Thus, even when LD information is not used (or even present), fineSTRUCTURE can offer advantages in some settings over these existing approaches."

With this gist of the fineSTRUCTURE methodology, it is now reasonable to expect that, on the combined set of all 22 chromosomes, the linkage model can do even better.

The painting process gives three square matrices of size NxN (N = number of individuals): 1) the frequency with which individuals copy chunks from each other, 2) the average length of those chunks, and 3) the mutation rate. These capture different aspects of the ancestral history, and almost completely summarize population ancestry.

The origin of some individual haplotypes could be ascertained using the more comprehensive GERMLINE matching algorithm. Indeed, after applying the GERMLINE algorithm to the BEAGLE-phased chromosomes (1..22), I can see that pairwise haplotype matches located far from the centromere and telomeres tend to cluster into "ethnic" groups. However, haplotypes observed near the centromere are rather omnipresent, as would be expected if these regions had low rates of recombination because of their proximity to the centromere. The IBD segments detected by GERMLINE are listed (per chromosome) below:

With the focus on the xMHC (see our previous posting), of special interest here are, perhaps, the haplotype matches on Chromosome 6, which fit the pattern of massive, extensive sharing of linked SNPs in the xMHC region - Chrom6-cM
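
For anyone who wants to tally the GERMLINE segments per chromosome themselves, a minimal Python sketch follows. It assumes a whitespace-separated match file in which the chromosome is the fifth field (index 4); that column position is my assumption about the output layout, so it is exposed as a parameter, and the function name is hypothetical.

```python
from collections import Counter


def segments_per_chromosome(lines, chrom_col=4):
    """Count match-file rows per chromosome label.

    chrom_col is the 0-based index of the chromosome field (assumed to be 4).
    Rows with too few fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= chrom_col:
            continue
        counts[fields[chrom_col]] += 1
    return counts
```

Passing an open file handle as `lines` works as well, since the function only iterates over it line by line.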


  1. Why are the numbers of IBD segments for chromosome 11 orders of magnitude higher than for the others? Also, the link above for chromosome 9 takes me to the chromosome 20 file.

  2. Thetick,

    Thank you for reporting the problem with the Chr 9 link.
    I have fixed it.

  3. I think that the high numbers of IBD segments are explained by "the increase in IBD sharing on chromosome 11q13 (lod=1.9) that seemed to be supported by parametric analysis, because D11S4076, located in this region, gave a lod of 1.55 under a dominant transmission A cautious interpretation of these signals is warranted, because several methods of analysis under two phenotype models were employed; therefore, signals could have arisen from random statistical fluctuations." (A high-density genome scan detects evidence for a bipolar-disorder susceptibility locus on 13q32 and other potential loci on 1q32 and 18p11.2)

  4. "When we adjusted the data for the locus on chromosome 15q and performed a new genomescan
    (data not shown), the next largest test statistics were LODs of 1.7 and 1.5, on chromosomes 12 and
    11, respectively, which were slightly higher LOD scores than seen in the scan unadjusted for the 15q
    locus. Although these were not significant at the genome-wide level, the chromosome 11 peak is
    interesting as it is in a region that contains the tyrosinase gene (TYR, aka OCA1A on 11q14-11q21)
    which plays a role in melanine formation in the eye (Frudakis et al., 2003; Oetting and King, 1993)."

  5. I have bacterial SNP haplotype data for 200 samples (1000 SNP positions). How do I define donor and recipient for this data? I would appreciate it if somebody could guide me on how to format my dataset for input into chromopainter. I want to run it in all-against-all mode.