Magnus Ducatus Lituaniae Project: 2012-09-09

Saturday, September 15, 2012

The reliability of 'ancestral components' in MDLP World-22 calculator: trying TreeMix on 22 ancestral components

In the recent paper of Pickrell and Pritchard outlined the most challenging problems in the analysis of relationships between populations:

Many aspects of historical relationships between populations are reflected in genetic data. Inferring these relationships from dense genetic data remains a challenging and very difficult task.

One such aspect is the (re)construction of ancestral population from precomputed allele frequencies.In the recent supervised Admixture analysis of the MDLP analysis (see previous post), i have carried the analysis of dataset which consists of both extant populations and 'simulated un-admixed populations'. However, one can always question the accuracy of the results of such an experiment by posing a very simple question: is it possible to carry out the simultaneous analysis of 22 putative ancestral populations while being independent of prior demographic information ( involving population splits, gene flow, and changes in population size). Accounting for demographic information is especially important in cases when we have only limited genetic data for the few ancestral populations that are known with greater certainty.

To the ultimate relief of genome bloggers, Joe Pickrell and Jonathan Pritchard have released a companion program to Structure/Admixture programs called TreeMix. TreeMix uses large SNP data sets to estimate the relationships among populations including both population splits and admixture events. According to Pickrell and Pritchard, the new method provides a better representation of population histories than do standard tree-building methods when they are applied to the worldwide populations. In my opinion, the real advantage of TreeMix over traditional admixture programs is that it accounts for unknown ancestral allele frequencies. Indeed, in most cases we do not know the ancestral values of allele frequencies, but instead only the values in sampled descendant populations.

I decided to give TreeMix a try on 22 putative ancestral ADMIXTURE components from the latest run. For this purpose, I used a conversion script, which was kindly provided by Dienekes Pontikos. This script takes ADMIXTURE P file and converts it into a plink.treemix.gz file, which is ready for input into TreeMix.

I ran TreeMix using following settings > treemix -i MDLPworld22.treemix.gz -root South-African -k 500 -o MDLP22world (consult TreeMix manual for explanation of program parameters).

After careful examination of the output tree, i was really surprised by how remarkably similar the tree is to that one in the original paper of Pickrell & Pritchard:

In the next of experiment i have introduced to the tree a rudimentary model of migrations/gene flows by including -m 5 parameter. The model is, however, very rudimentary, because Pritchard and Pickrell have modeled "migration between populations as occurring at single, instantaneous time points. This is, of course, a dramatic simplification of the migration process. This model will work best when gene flow between populations is restricted to a relatively short time period. Situations of continuous migration violate this assumption and lead to unclear results" (see the cited paper for the further discussion).

Nevertheless, we may, though with all due caution, speculate about the hypothetical gene flow directions:

a) from East-South-Asian ancestral component -> to Tibetan ancestral component

b) from the split_point between Austronesian and Melanesian ancestral components -> to East-South-Asian ancestral component

c) from the split_point between East-Siberian and East-South-Asian -> to the split_point between Indian and ((Austronesian;Melanesian) Tibetian)* ancestral component

d) and finally, from North-European-Mesolithic component to the split_point between West-Asian and (North-East-European; Atlantic-Mediterranean-Neolithic)* component.

The last point d) calls for a further explanation. As one can see from the attached population tree,North-European-Mesolithic component is unexpectedly shifted towards East-Asian populations. Despite of all simplicity of test, i think that the result in consideration could probably support a recent statement made by David Reich in the recent Reich et al. (2012) paper:

we took advantage of the fact that east/central Asian admixture has affected northern Europeans to a greater extent than Sardinians (in our separate manuscript in submission, we show that this is a result of the different amounts of central/east Asian-related gene flow into these groups).

Cf. also Dienekes' comments

If our finding is congruent Reich's hypothesis, then one could assume that he Mesolithic Europeans were Asian-shifted themselves, i.e the earliest episode of admixture could have occurred in the Mesolithic period.

UPDATE:

The same scenario seems to be supported in (Patterson et al. 2012).

Another important question that could be raised in regard of the reliability of the inferred components is the question of 'component purity' (the notion of 'purity' is used here in rather speculative sense of Kant's Reinheit, and has nothing to do with 'racial purity'). Indeed, a population tree (such as one presented above) could represent with the same degree of reliability both pure nested morphology of components variable components of nested and non-nested morphology.

Fortunately for genome bloggers, we can try to tackle this problem by using three- and four- population tests for treeness from Reich et al. 2009. The 3-population test (Reich et al. 2009) allows one to detect the presence of admixture in a population X from two other populations A and B. The value

f3(X; A, B)

is negative when X does not appear to form a simple tree with A and B but appears to be a mixture of A and B (in Dienekes' laconic phrasing). In order to calculate f3 statistics i used the implementation of three-pop TreeMix's program threepop:

X                      A,B
Samoedic    North-Siberean,Atlantic_Mediterranean_Neolithic    0.00206352    0.000146163    14.118
North-Amerind    South-America_Amerind,South-African    0.00564354    0.00034763    16.2344
Samoedic    North-Siberean,North-East-European    0.00230317    0.000134738    17.0936
North-Amerind    South-America_Amerind,Melanesian    0.0063216    0.000345733    18.2846
Sub-Saharian    South-America_Amerind,South-African    0.0058646    0.000302046    19.4162
Sub-Saharian    Pygmy,South-America_Amerind    0.00420139    0.000216014    19.4496
Sub-Saharian    Pygmy,Paleo-Siberean    0.00426537    0.000217229    19.6354
Sub-Saharian    Pygmy,North-Siberean    0.00438367    0.000221767    19.767
North-Amerind    Pygmy,South-America_Amerind    0.0058802    0.000295087    19.927
Sub-Saharian    Pygmy,Mesomerican    0.00439921    0.000219885    20.0068
Sub-Saharian    Pygmy,Arctic-Amerind    0.00427798    0.000212131    20.1667
Samoedic    South-America_Amerind,South-African    0.00679588    0.000332084    20.4643
North-Amerind    South-America_Amerind,Austronesian    0.00658824    0.000314661    20.9375
Samoedic    North-Siberean,South-African    0.00504898    0.000240205    21.0195
North-Amerind    South-America_Amerind,Paleo-Siberean    0.00580508    0.000275451    21.0748

The full f3statistics file could be downloaded here.

After the careful examination of f3statistics for 22 putative ancestral components, i haven't found negative values (negative value is a strong unambiguous signal of admixture). Thus, we could safely reject the hypothesis of mixture in ancestral components, and assume that each X components forms a simple tree with A and B components.

Behind the Curtains: MDLP World 22 showcase

Preliminary remarks

As you all may know, the MDLP blog hasn't been updated since February 2012.

Half of year ago i promised myself that i would stop writing new posts on the MDLP blog before i'll finally get my scientific report on blog written. Since I had to prioritize the completion of a scientific paper over the routine of blog posting,I was unable to continue updating the blog on a regular basis due to a lack of time, and had to make a change in how I conducted my research. So i decided to abstain from posting on the MDLP blog for a couple of month, being focused on more important matters. Despite of all limitations, i kept working secretly on the MDLP project, collecting necessary data and performing different 'genomic' experiments in order to achieve my final goal (publishing of paper). The results of secret experiments with new genomic samples and tools eventually leaked to the curious public, spawning immense interest in my project. After releasing a new version of my own modification of DIYDodecad calculator on Gedmatch.com, i was literally flooded by emails from Gedmatch.com users asking me questions they wanted me to answer.

I understood the strategical mistake of releasing poorly documented data/analysis on Internet and felt obliged to explain details. Obviously, i will start new series of the blog posts by covering the project feature the people most interested in, i.e the MDLP World22 calculator.

The population dataset of MDLP World22 calculator.

The reference population dataset of the calculator was assembled in PLINK by intersecting and thinning the samples from different data sources: HapMap 3 (the filtered dataset CEU,YRI,JPT,CHB), 1000genomes, Rasmussen et al. (2010), HGDP (Stanford) (all populations), Metspalu et al. (2011),Yunusbayev et al. (2011), Chaubey et al. (2010) etc. Furthermore i handpicked random 10 individuals from each European country panel in POPRES dataset, or the maximum number of individuals available otherwise, to select the POPRES European individuals to be included in our study. Finally, in order to evaluate the correlation between the modern and the ancient genetic diversity, i have also included ancient DNA genomic samples of Ötzi,(Keller et al.(2012)) Swedish Neolithic samples Gök4, Ajv52, Ajv70, Ire8, Ste7 (Skoglund et al. (2012)) and 2 La Braña individuals from the Mesolithic sites of the Iberian Peninsula (Sánchez-Quinto et al.(2012)). Then i added 90 samples of individuals-participants of our MDLP project. After merging the aforementioned datasets and thinning the SNP set with PLINK command to exclude SNPs with missing rates greater than 1% and minor alleles, i filtered out duplicates, the individuals with high pairwise IBD-sharing (estimated in Plink as as the average fraction of alleles shared between two individuals over all loci) and the individuals with kinship coefficient suggesting relatedness (kinship coefficients were estimated in KING software). Also i had to filter out individuals with more more tham 3 standard deviations from the population averages. Since kinship coefficient is robustly estimated by HWE (Hary-Weinberg expectations) among SNPs with the same underlying allele frequencies, SNPs showing strong deviation (p < 5.5 x10⁻⁸) from Hardy-Weinberg expectations were removed from the merged and filtered dataset. After that I filtered to keep the list of common SNPs present in Illumina/Affymetrix chips and performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and r^2 threshold of 0.3.

This complex sequence of consequent operations with the initial reference and project datasets yielded a final dataset which included 80751 SNPs in 2516 individuals from 225 populations.

ADMIXTURE analysis

As always, the final dataset in PLINK linked format was further processed in ADMIXTURE software. Sketching the plan for the design of ADMIXTURE test, i had to face the difficult problem: as it has been shown in (Patterson et al.2006) the number of markers needed to resolve populations in ADMIXTURE analysis is inversely proportional to the genetic distance (Fst ) betweeen the populations. According to ADMIXTURE best practice, it is believed that 10,000 markers are suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European populations FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01).

To increase the accuracy of ADMIXTURE results i decided to use a method proposed by Dienekes' for converting allele frequencies into 'synthetic individuals'(see also Zack's example). The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; whenever you want to estimate admixture proportions in new samples run supervised ADMIXTURE analysis using the zombie populations. Like any genome blogger engaged in the task of evaluating admixtures in samples, i must grapple with obvious question of the reliability of this approach. Although i am aware of methodological controversies in using simulated individuals, i would rather concur with Dienekes who considered "synthetic individuals" the best abstract proxies for the ancient ancestral populations. But my purpose is served if i can use the approach used by Dienekes and Zack to obtain meaningful results. To begin with, i routinely ran unsupervised ADMIXTURE K=22 analysis (assuming 22 ancestral populations) which yielded the admixture proportions of individuals from these K populations, as well as the allele frequencies for all SNPs for each of 22 ancestral populations (below are conventional names for each of inferred components in order of appearance):

Pygmy
West-Asian
North-European-Mesolithic
Tibetan
Mesomerican
Arctic-Amerind
South-America_Amerind
Indian
North-Siberean
Atlantic_Mediterranean_Neolithic
Samoedic
Proto-Indo-Iranian
East-Siberean
North-East-European
South-African
North-Amerind
Sub-Saharian
East-South-Asian
Near_East
Melanesian
Paleo-Siberean
Austronesian

Therefore i took the allele frequencies which were computed earlier in unsupervised Admixture K=22 for the merged dataset, pooled them into PLINK and generated 10 "synthetic individuals per ancestral component) using PLINK command --simulate. When the simulation had been finished, i visualized the distance between simulated individuals using multi-dimensional scaling:

As a next step,i included simulated individuals you have as part of a new reference population (including 220 simulated individuals in 22 simulated populations).Then, I ran ADMIXTURE anew, this time in “supervised” mode for K = 22 (with simulated individuals being 'reference' individuals). The Admixture K=22 converged in 31 iterations (37773.1 sec) with final loglikelihood:-188032005.430318 (below are Fst divergences between estimated 'ancestral' populations):

The Fst distance/divergence matrix was used for inferring a most probable NJ-based topology of component distance tree (outgroup: South-African):

The individual 'supervised' ADMIXTURE results (in Excel spreadsheet) for the project participants have been uploaded to GoogleDocs (please note that the average results for reference populations is also available on special request).

MDLP World22 DIYcalculator

The output files of Admixture K=22 supervised run (average values of admixture coefficients in reference populations and FsT values) were used for designing a new version of the MDLP DIYcalculator, which is better known by its codename "World22" (online version is available in AdMix-Utilities section of Gedmatch under MDLP project). MDLP DIYcalculator itself is based on the code of Dodecad DIY calculator (c)ourtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project. In its Gedmatch implementation MDLP 'World22' DIYcalculator is paired by MDLP 'World22' Oracle, also based on Dienekes' and Zack's code (Harappa/DodecadOracle). The 'Oracle' is designed to find in a single population mode your closest (closest in terms of similarity) population from MDLP ''Word22' admixture results. In a mixed mode, Oracle considers all pairs of populations, and for each one of them calculates the minimum Fst-weighted distance to the sample in consideration, and the admixture proportions that produce it.

Please notice: 'ancestral' populations (i.e 'simulated populations' from the previous step - see above) are labeled in Oracle results as (anc), while the 'real world' modern and ancient populations are marked as "derived".

If you have troubles with understanding/interpreting the results of Oracle and DIYcalculcator, please consult the corresponding topics on Dodecad and HarappaWorld blogs. It is not of avail to repeat in this blog everything they wrote in their own blogs.

What the heck are MDLP Word-22 components?

One of those questions that i usually keep getting in emails is what do the various reference populations and ancestral components for my World K=12 and World-22 analyses mean. I've already provided hints to the answer earlier, but - as old Chinese proverb says - one picture is worth ten thousand words. That's why i decided to display the admixture coefficients spatially on the globe surface. Following Francois Olivier, who proposed to use the graphical library of the statistical software R to display spatial interpolates of the admixture coefficients (Q matrix) in two dimensions (where spatial coordinates are recorded as longitude and latitude), i created 2 contour maps per component.

Pygmy (modal in Biaka and Mbuti population)

West-Asian (bimodal component with peaks in Caucasian populations and south-western part of Iran, equal to Dienekes' Caucasian/Gedrosia component)