Saturday, September 15, 2012

The reliability of 'ancestral components' in MDLP World-22 calculator: trying TreeMix on 22 ancestral components

In their recent paper, Pickrell and Pritchard outlined the most challenging problems in the analysis of relationships between populations:

Many aspects of historical relationships between populations  are reflected in genetic data. Inferring these relationships from dense genetic data remains a challenging and very difficult task. 

One such aspect is the (re)construction of ancestral populations from precomputed allele frequencies. In the recent supervised ADMIXTURE analysis for MDLP (see previous post), I analysed a dataset consisting of both extant populations and 'simulated un-admixed populations'. However, one can always question the accuracy of the results of such an experiment by posing a very simple question: is it possible to carry out a simultaneous analysis of 22 putative ancestral populations without relying on prior demographic information (population splits, gene flow, and changes in population size)? Accounting for demographic information is especially important when we have only limited genetic data for the few ancestral populations that are known with greater certainty.

To the ultimate relief of genome bloggers, Joe Pickrell and Jonathan Pritchard have released TreeMix, a companion program to Structure/Admixture. TreeMix uses large SNP data sets to estimate the relationships among populations, including both population splits and admixture events. According to Pickrell and Pritchard, the new method provides a better representation of population histories than standard tree-building methods do when applied to worldwide populations. In my opinion, the real advantage of TreeMix over traditional admixture programs is that it accounts for unknown ancestral allele frequencies. Indeed, in most cases we do not know the ancestral allele frequencies, only the values in sampled descendant populations.

I decided to give TreeMix a try on the 22 putative ancestral ADMIXTURE components from the latest run. For this purpose, I used a conversion script kindly provided by Dienekes Pontikos. The script takes an ADMIXTURE P file and converts it into a plink.treemix.gz file, which is ready for input into TreeMix.
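For readers who want to see what such a conversion involves, here is a minimal sketch in Python. It is not Dienekes' script, which is not reproduced in this post. It assumes that the P file holds one row per SNP and one column per ancestral component, and that each frequency is turned into allele counts for a fixed number of pseudo-haplotypes. The names p_to_treemix_lines, write_treemix and n_haplotypes are hypothetical, and the value 100 is an arbitrary illustration.

```python
import gzip


def p_to_treemix_lines(p_rows, names, n_haplotypes=100):
    """Turn rows of allele frequencies into TreeMix-style text lines.

    p_rows: iterable of strings, one per SNP, whitespace-separated
            frequencies (assumed layout of an ADMIXTURE P file).
    names:  one label per column (component).
    n_haplotypes: hypothetical pseudo-sample size per component.
    """
    lines = [" ".join(names)]  # header line: population names
    for row in p_rows:
        freqs = [float(value) for value in row.split()]
        if len(freqs) != len(names):
            raise ValueError("row width does not match number of names")
        cells = []
        for freq in freqs:
            # Convert the frequency to a count and clamp it to a valid range.
            allele1 = min(max(round(freq * n_haplotypes), 0), n_haplotypes)
            cells.append(f"{allele1},{n_haplotypes - allele1}")
        lines.append(" ".join(cells))
    return lines


def write_treemix(lines, path):
    """Write the lines to a gzip-compressed text file."""
    with gzip.open(path, "wt") as handle:
        handle.write("\n".join(lines) + "\n")


example = p_to_treemix_lines(["0.25 0.9"], ["A", "B"])
print(example)
```

The choice of pseudo-sample size matters, because TreeMix reads the counts as real sample sizes; the value used in the actual conversion is not stated here.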

I ran TreeMix with the following settings: treemix -i MDLPworld22.treemix.gz -root South-African -k 500 -o MDLP22world (consult the TreeMix manual for an explanation of the program parameters).
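As a reading aid, the sketch below assembles the same command line from named arguments, so that each flag is spelled out: -i is the input file, -root the population used to root the tree, -k the number of SNPs per block (used to account for linkage disequilibrium), -o the output file stem, and -m the number of migration edges. The function treemix_command and its parameter names are hypothetical helpers, not part of TreeMix, and the sketch only builds the command without running it.

```python
import shlex


def treemix_command(input_gz, out_stem, root=None, block_size=None, migrations=None):
    """Build a TreeMix argument list (hypothetical helper; does not run TreeMix)."""
    args = ["treemix", "-i", input_gz]
    if root is not None:
        args += ["-root", root]            # population placed at the root
    if block_size is not None:
        args += ["-k", str(block_size)]    # SNPs per block
    if migrations is not None:
        args += ["-m", str(migrations)]    # number of migration edges
    args += ["-o", out_stem]               # stem for output files
    return args


cmd = treemix_command(
    "MDLPworld22.treemix.gz", "MDLP22world", root="South-African", block_size=500
)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where TreeMix is installed.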


After careful examination of the output tree, I was really surprised by how remarkably similar it is to the tree in the original paper of Pickrell & Pritchard:





In the next experiment, I added migration/gene flow to the tree by including the -m 5 parameter. The model is, however, very rudimentary, because Pritchard and Pickrell have modeled "migration between populations as occurring at single, instantaneous time points. This is, of course, a dramatic simplification of the migration process. This model will work best when gene flow between populations is restricted to a relatively short time period. Situations of continuous migration violate this assumption and lead to unclear results" (see the cited paper for further discussion).


 Nevertheless, we may, though with all due caution, speculate about the hypothetical gene flow directions:

a) from East-South-Asian ancestral component -> to Tibetan ancestral component
b) from the split_point between Austronesian and Melanesian ancestral components -> to East-South-Asian ancestral component
c) from the split_point between East-Siberian and East-South-Asian -> to the split_point between Indian and ((Austronesian;Melanesian) Tibetan)* ancestral component
d) and finally, from North-European-Mesolithic component to the split_point between West-Asian and (North-East-European; Atlantic-Mediterranean-Neolithic)* component.

The last point, d), calls for further explanation. As one can see from the attached population tree, the North-European-Mesolithic component is unexpectedly shifted towards East-Asian populations. Despite the simplicity of the test, I think that this result could support a statement made by David Reich in the recent Reich et al. (2012) paper:
we took advantage of the fact that east/central Asian admixture  has affected northern Europeans to a greater extent than Sardinians (in our separate  manuscript in submission, we show that this is a result of the different amounts of central/east  Asian-related gene flow into these groups).
Cf. also Dienekes' comments 

If our finding is congruent with Reich's hypothesis, then one could assume that the Mesolithic Europeans were Asian-shifted themselves, i.e. the earliest episode of admixture could have occurred in the Mesolithic period.

UPDATE: 
The same scenario seems to be supported in (Patterson et al. 2012).


Another important question that could be raised with regard to the reliability of the inferred components is that of 'component purity' (the notion of 'purity' is used here in the rather speculative sense of Kant's Reinheit, and has nothing to do with 'racial purity'). Indeed, a population tree (such as the one presented above) could represent, with the same degree of reliability, both pure components of nested morphology and variable components of nested and non-nested morphology.

Fortunately for genome bloggers, we can try to tackle this problem by using the three- and four-population tests for treeness from Reich et al. 2009. The 3-population test (Reich et al. 2009) allows one to detect the presence of admixture in a population X from two other populations A and B. The value
f3(X; A, B)

is negative when X does not appear to form a simple tree with A and B but appears to be a mixture of A and B (in Dienekes' laconic phrasing). To calculate the f3 statistics, I used threepop, the three-population test program bundled with TreeMix:


X    A,B    f3    SE    Z-score
Samoedic    North-Siberean,Atlantic_Mediterranean_Neolithic    0.00206352    0.000146163    14.118
North-Amerind    South-America_Amerind,South-African    0.00564354    0.00034763    16.2344
Samoedic    North-Siberean,North-East-European    0.00230317    0.000134738    17.0936
North-Amerind    South-America_Amerind,Melanesian    0.0063216    0.000345733    18.2846
Sub-Saharian    South-America_Amerind,South-African    0.0058646    0.000302046    19.4162
Sub-Saharian    Pygmy,South-America_Amerind    0.00420139    0.000216014    19.4496
Sub-Saharian    Pygmy,Paleo-Siberean    0.00426537    0.000217229    19.6354
Sub-Saharian    Pygmy,North-Siberean    0.00438367    0.000221767    19.767
North-Amerind    Pygmy,South-America_Amerind    0.0058802    0.000295087    19.927
Sub-Saharian    Pygmy,Mesomerican    0.00439921    0.000219885    20.0068
Sub-Saharian    Pygmy,Arctic-Amerind    0.00427798    0.000212131    20.1667
Samoedic    South-America_Amerind,South-African    0.00679588    0.000332084    20.4643
North-Amerind    South-America_Amerind,Austronesian    0.00658824    0.000314661    20.9375
Samoedic    North-Siberean,South-African    0.00504898    0.000240205    21.0195
North-Amerind    South-America_Amerind,Paleo-Siberean    0.00580508    0.000275451    21.0748
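The rows above can be checked mechanically. The sketch below reads the three numeric columns as the f3 value, its standard error and the Z-score, an interpretation consistent with Z = f3/SE in every row shown. It parses rows in the whitespace-separated layout used in this post (not necessarily the raw threepop output format) and collects any rows with a negative f3. The names F3Row, parse_f3_row and SAMPLE are hypothetical; the two sample rows are copied from the table.

```python
from typing import NamedTuple, Tuple


class F3Row(NamedTuple):
    target: str                 # population X
    sources: Tuple[str, str]    # populations A and B
    f3: float
    se: float
    z: float


def parse_f3_row(line):
    """Parse one 'X  A,B  f3  SE  Z' row as laid out in this post."""
    target, pair, f3, se, z = line.split()
    source_a, source_b = pair.split(",")
    return F3Row(target, (source_a, source_b), float(f3), float(se), float(z))


SAMPLE = """
Samoedic    North-Siberean,Atlantic_Mediterranean_Neolithic    0.00206352    0.000146163    14.118
North-Amerind    South-America_Amerind,South-African    0.00564354    0.00034763    16.2344
"""

rows = [parse_f3_row(line) for line in SAMPLE.splitlines() if line.strip()]
negative = [row for row in rows if row.f3 < 0]

for row in rows:
    # The reported Z-score should match f3 / SE up to rounding.
    assert abs(row.f3 / row.se - row.z) < 0.01

print(len(rows), "rows parsed;", len(negative), "with negative f3")
```

Run over the full f3 statistics file, the same filter would list every triple with a negative value, if there were any.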
 



The full f3 statistics file can be downloaded here.


After careful examination of the f3 statistics for the 22 putative ancestral components, I have not found any negative values (a negative value is a strong, unambiguous signal of admixture, although a positive value does not by itself rule admixture out). Thus the f3 test gives no evidence of mixture in the ancestral components, and we may assume that each component X forms a simple tree with components A and B.
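To see why a negative value points to admixture, recall that f3(X; A, B) is the expected value of (x - a)(x - b), where x, a and b are the allele frequencies of a SNP in X, A and B. If x is a mixture alpha*a + (1 - alpha)*b, the product equals -alpha*(1 - alpha)*(a - b)^2, which is never positive. The sketch below illustrates this with a naive estimator on made-up frequencies. It omits the finite-sample correction and the block-based standard error that a real program such as threepop would supply, and the name naive_f3 and all numbers are hypothetical.

```python
def naive_f3(x, a, b):
    """Mean of (x - a) * (x - b) over SNPs, with no bias correction."""
    if not x or not (len(x) == len(a) == len(b)):
        raise ValueError("need three non-empty lists of equal length")
    total = sum((xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b))
    return total / len(x)


# Made-up allele frequencies for two source populations.
a = [0.10, 0.80, 0.30, 0.60]
b = [0.50, 0.20, 0.70, 0.40]

# A 50/50 mixture of A and B lies between them at every SNP.
mixed = [0.5 * ai + 0.5 * bi for ai, bi in zip(a, b)]

# A made-up population that has drifted away from A rather than mixing.
drifted = [0.05, 0.90, 0.20, 0.75]

print("mixed:  ", naive_f3(mixed, a, b))    # negative
print("drifted:", naive_f3(drifted, a, b))  # positive
```

Drift in X after the mixture adds a positive term to f3, which is why a positive value alone cannot prove the absence of admixture.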










