Saturday, September 15, 2012

Behind the Curtains: MDLP World 22 showcase

Preliminary remarks

As you may know, the MDLP blog hasn't been updated since February 2012.
Half a year ago I promised myself that I would stop writing new posts on the MDLP blog until I had finally written my scientific report on the project. Because I had to prioritize the completion of a scientific paper over routine blog posting, I lacked the time to keep updating the blog regularly and had to change how I conducted my research. So I decided to abstain from posting on the MDLP blog for a couple of months and to focus on more important matters. Despite all these limitations, I kept working quietly on the MDLP project, collecting the necessary data and performing various 'genomic' experiments in pursuit of my final goal (publishing the paper). The results of these experiments with new genomic samples and tools eventually leaked to the curious public and sparked immense interest in my project. After I released a new version of my own modification of the DIYDodecad calculator, I was literally flooded with emails from users asking me questions.

I recognized the strategic mistake of releasing poorly documented data and analyses on the Internet and felt obliged to explain the details. Naturally, I will start the new series of blog posts by covering the project feature that people are most interested in, i.e. the MDLP World22 calculator.

The population dataset of the MDLP World22 calculator

The reference population dataset of the calculator was assembled in PLINK by intersecting and thinning the samples from different data sources: HapMap 3 (the filtered dataset: CEU, YRI, JPT, CHB), 1000 Genomes, Rasmussen et al. (2010), HGDP (Stanford) (all populations), Metspalu et al. (2011), Yunusbayev et al. (2011), Chaubey et al. (2010), etc. Furthermore, to select the POPRES European individuals to be included in the study, I picked 10 random individuals from each European country panel in the POPRES dataset, or all available individuals where a panel contained fewer than 10. Finally, in order to evaluate the correlation between modern and ancient genetic diversity, I also included the ancient DNA genomic samples of Ötzi (Keller et al. (2012)), the Swedish Neolithic samples Gök4, Ajv52, Ajv70, Ire8 and Ste7 (Skoglund et al. (2012)), and 2 La Braña individuals from Mesolithic sites of the Iberian Peninsula (Sánchez-Quinto et al. (2012)). Then I added 90 samples from participants of our MDLP project.

After merging the aforementioned datasets and thinning the SNP set with PLINK to exclude SNPs with missing rates greater than 1% and SNPs with rare minor alleles, I filtered out duplicates, individuals with high pairwise IBD sharing (estimated in PLINK as the average fraction of alleles shared between two individuals over all loci) and individuals whose kinship coefficients suggested relatedness (kinship coefficients were estimated with the KING software). I also had to filter out outliers, i.e. individuals lying more than 3 standard deviations from their population averages. Since the kinship coefficient is robustly estimated only under Hardy-Weinberg equilibrium (HWE) among SNPs with the same underlying allele frequencies, SNPs showing strong deviation (p < 5.5 × 10^-8) from Hardy-Weinberg expectations were removed from the merged and filtered dataset.
After that I kept only the common SNPs present on Illumina/Affymetrix chips and performed linkage-disequilibrium-based pruning using a window size of 50, a step of 5 and an r^2 threshold of 0.3.

This complex sequence of consecutive operations on the initial reference and project datasets yielded a final dataset of 80,751 SNPs in 2,516 individuals from 225 populations.
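
The SNP-level filtering steps described above can be sketched as a sequence of PLINK calls. This is only an illustration, not the exact commands used for World22: the file names (`merged`, `chip_snps.txt`) and the helper `build_qc_commands` are hypothetical, the post does not state the minor allele frequency cutoff (the `maf` default below is a placeholder assumption), and the sample-level filters (duplicates, IBD sharing, KING kinship, outliers) are left out.

```python
def build_qc_commands(prefix, chip_snp_list, maf=0.01):
    """Build PLINK argument lists for the SNP filters described in the post.

    prefix        -- hypothetical name of the merged binary fileset
    chip_snp_list -- hypothetical text file listing SNPs common to the chips
    maf           -- placeholder cutoff; the post gives no value
    """
    step1, step2, step3 = prefix + "_geno", prefix + "_hwe", prefix + "_chip"
    return [
        # 1. Drop SNPs with >1% missing calls and rare minor alleles.
        ["plink", "--bfile", prefix, "--geno", "0.01", "--maf", str(maf),
         "--make-bed", "--out", step1],
        # 2. Drop SNPs deviating from Hardy-Weinberg expectations.
        ["plink", "--bfile", step1, "--hwe", "5.5e-8",
         "--make-bed", "--out", step2],
        # 3. Keep only SNPs present on the genotyping chips.
        ["plink", "--bfile", step2, "--extract", chip_snp_list,
         "--make-bed", "--out", step3],
        # 4. LD pruning: window 50, step 5, r^2 threshold 0.3.
        ["plink", "--bfile", step3, "--indep-pairwise", "50", "5", "0.3",
         "--out", step3],
        # 5. Keep the SNPs that survived pruning (listed in *.prune.in).
        ["plink", "--bfile", step3, "--extract", step3 + ".prune.in",
         "--make-bed", "--out", prefix + "_final"],
    ]


for command in build_qc_commands("merged", "chip_snps.txt"):
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` on a machine where PLINK is installed; the sketch only prints the commands so that the order of the filters is easy to follow.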

ADMIXTURE analysis

 As always, the final dataset in PLINK format was further processed in the ADMIXTURE software. While sketching the design of the ADMIXTURE test, I had to face a difficult problem: as shown in Patterson et al. (2006), the number of markers needed to resolve populations in an ADMIXTURE analysis is inversely proportional to the genetic distance (Fst) between the populations. According to ADMIXTURE best practice, about 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian and European populations, Fst > 0.05), while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, Fst < 0.01).
To increase the accuracy of the ADMIXTURE results I decided to use the method proposed by Dienekes for converting allele frequencies into 'synthetic individuals' (see also Zack's example). The idea is fairly simple: run an unsupervised ADMIXTURE analysis once to generate allele frequencies for your K ancestral components; then generate zombie populations using these allele frequencies; and whenever you want to estimate admixture proportions in new samples, run a supervised ADMIXTURE analysis using the zombie populations as references. Like any genome blogger engaged in the task of evaluating admixture in samples, I must grapple with the obvious question of how reliable this approach is. Although I am aware of the methodological controversies over using simulated individuals, I would rather concur with Dienekes, who considered "synthetic individuals" the best abstract proxies for the ancient ancestral populations. My purpose is served if I can use the approach of Dienekes and Zack to obtain meaningful results. To begin with, I ran a routine unsupervised ADMIXTURE analysis at K=22 (assuming 22 ancestral populations), which yielded the admixture proportions of each individual from these K populations, as well as the allele frequencies of all SNPs in each of the 22 ancestral populations (below are the conventional names for each of the inferred components, in order of appearance):

I then took the allele frequencies computed earlier in the unsupervised ADMIXTURE K=22 run for the merged dataset, fed them into PLINK and generated 10 "synthetic individuals" per ancestral component using the PLINK command --simulate. When the simulation had finished, I visualized the distances between the simulated individuals using multi-dimensional scaling:
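
As an aside on the simulation step: the idea of turning allele frequencies into "synthetic individuals" can be illustrated without PLINK. The sketch below is not what `--simulate` does internally; it simply assumes that each genotype is the sum of two independent allele draws at the component's frequency. The function names are hypothetical, and the input layout (one row per SNP, one column per component) follows the ADMIXTURE .P output file.

```python
import random


def simulate_individuals(allele_freqs, n_individuals=10, seed=None):
    """Draw genotypes (0, 1 or 2 allele copies) for each SNP frequency."""
    rng = random.Random(seed)
    return [
        # Two independent allele draws per SNP give the genotype count.
        [(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
        for _ in range(n_individuals)
    ]


def simulate_components(p_rows, n_individuals=10, seed=None):
    """Simulate individuals for every component.

    p_rows -- one row per SNP, one column per ancestral component
    Returns a list with one group of simulated individuals per component.
    """
    n_components = len(p_rows[0])
    groups = []
    for k in range(n_components):
        freqs = [row[k] for row in p_rows]
        # Offset the seed so that components do not share random draws.
        component_seed = None if seed is None else seed + k
        groups.append(simulate_individuals(freqs, n_individuals, component_seed))
    return groups


# Toy frequencies for 3 SNPs and 2 components (made-up numbers).
toy_p = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
groups = simulate_components(toy_p, n_individuals=10, seed=0)
print(len(groups), "components,", len(groups[0]), "individuals each")
```

With 22 components and 10 individuals per component, this would give the 220 simulated individuals mentioned in the post.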


As a next step, I included the simulated individuals in a new reference dataset (220 simulated individuals in 22 simulated populations). Then I ran ADMIXTURE anew, this time in “supervised” mode for K=22 (with the simulated individuals serving as 'reference' individuals). The ADMIXTURE K=22 run converged in 31 iterations (37773.1 sec) with a final log-likelihood of -188032005.430318 (below are the Fst divergences between the estimated 'ancestral' populations):
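
A note on how a supervised run is set up: ADMIXTURE's `--supervised` mode reads a .pop file with one line per individual, in the same order as the .fam file, holding a population label for each reference individual and "-" for each individual whose ancestry should be estimated. The sketch below builds those labels; the helper name, the family and individual IDs and the toy .fam lines are all hypothetical.

```python
def make_pop_lines(fam_lines, reference_population):
    """Return one .pop label per .fam line, in the same order.

    fam_lines            -- lines of a PLINK .fam file (FID IID ... per line)
    reference_population -- dict mapping (FID, IID) of reference (here:
                            simulated) individuals to a population label
    """
    labels = []
    for line in fam_lines:
        fid, iid = line.split()[:2]
        # "-" marks an individual whose ancestry should be estimated.
        labels.append(reference_population.get((fid, iid), "-"))
    return labels


# Toy example: two simulated reference individuals and one participant.
fam = [
    "West-Asian sim1 0 0 0 -9",
    "Pygmy sim1 0 0 0 -9",
    "MDLP participant1 0 0 0 -9",
]
refs = {("West-Asian", "sim1"): "West-Asian", ("Pygmy", "sim1"): "Pygmy"}
print("\n".join(make_pop_lines(fam, refs)))
```

The resulting lines would be written to a file with the same prefix as the .bed file and the extension .pop before starting the supervised run.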

The Fst distance/divergence matrix was used to infer the most probable NJ-based topology of the component distance tree (outgroup: South-African):
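
For readers unfamiliar with neighbor joining (NJ), here is a minimal sketch of the textbook algorithm applied to a small distance matrix. It is not the software used to build the World22 tree, it returns only the topology (no branch lengths, no outgroup rooting), and the five-taxon distance matrix is a made-up toy example rather than real Fst values.

```python
def neighbor_joining(names, dist):
    """Return an unrooted NJ topology as a Newick-like string.

    names -- list of taxon labels
    dist  -- symmetric matrix (list of lists) of pairwise distances
    """
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                # NJ criterion: join the pair with the smallest Q value.
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = "(" + a + "," + b + ")"
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                # Distance from the new internal node to every other node.
                d_new = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[new][c] = d_new
                d[c][new] = d_new
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return "(" + nodes[0] + "," + nodes[1] + ");"


toy_names = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(toy_names, toy_dist))
```

In the toy matrix, a and b are close to each other, as are d and e, so both pairs end up as sister groups in the resulting topology.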

 The individual 'supervised' ADMIXTURE results (in an Excel spreadsheet) for the project participants have been uploaded to GoogleDocs (please note that the average results for the reference populations are also available on special request).

MDLP World22 DIYcalculator

The output files of the supervised ADMIXTURE K=22 run (average values of admixture coefficients in the reference populations and Fst values) were used to design a new version of the MDLP DIYcalculator, better known by its codename "World22" (an online version is available in the AdMix-Utilities section of Gedmatch under the MDLP project). The MDLP DIYcalculator itself is based on the code of the Dodecad DIY calculator (c)ourtesy of Dienekes Pontikos, which was developed as part of the Dodecad Ancestry Project. In its Gedmatch implementation, the MDLP 'World22' DIYcalculator is paired with the MDLP 'World22' Oracle, also based on Dienekes' and Zack's code (Harappa/DodecadOracle). In single population mode, the Oracle finds the population from the MDLP 'World22' admixture results that is closest (in terms of similarity) to you. In mixed mode, the Oracle considers all pairs of populations, and for each pair calculates the minimum Fst-weighted distance to the sample under consideration and the admixture proportions that produce it.
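
The two Oracle modes can be illustrated with a simplified sketch. It is not the Oracle code itself: it uses a plain Euclidean distance between admixture vectors and leaves out the Fst weighting, and the function names and the three toy populations are hypothetical. For a pair of populations A and B, the mixing proportion that minimizes the distance to the sample has a closed form (projection of the sample onto the segment between A and B).

```python
from itertools import combinations
from math import sqrt


def distance(u, v):
    """Plain Euclidean distance (the real Oracle uses Fst weighting)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_mode(sample, populations):
    """Rank populations by their distance to the sample."""
    return sorted((distance(sample, q), name) for name, q in populations.items())


def best_mix(sample, pa, pb):
    """Proportion of pa (rest is pb) that brings the mix closest to sample."""
    diff = [a - b for a, b in zip(pa, pb)]
    denom = sum(x * x for x in diff)
    if denom == 0:
        alpha = 1.0  # identical populations: any proportion gives the same mix
    else:
        alpha = sum((s - b) * x for s, b, x in zip(sample, pb, diff)) / denom
        alpha = min(1.0, max(0.0, alpha))  # keep the proportion within 0..1
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(pa, pb)]
    return alpha, distance(sample, mix)


def mixed_mode(sample, populations):
    """Rank all pairs of populations by the distance of their best mix."""
    results = []
    for (na, pa), (nb, pb) in combinations(populations.items(), 2):
        alpha, dist = best_mix(sample, pa, pb)
        results.append((dist, alpha, na, nb))
    return sorted(results)


# Toy admixture averages (percent) for three made-up populations.
pops = {"A": [100, 0, 0], "B": [0, 100, 0], "C": [0, 0, 100]}
sample = [60, 40, 0]
print(single_mode(sample, pops)[0])
print(mixed_mode(sample, pops)[0])
```

In the toy example the sample is closest to population A in single mode, while in mixed mode a blend of A and B reproduces it with a distance of essentially zero.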

Please note: 'ancestral' populations (i.e. the 'simulated populations' from the previous step, see above) are labeled in the Oracle results as (anc), while the 'real world' modern and ancient populations are marked as "derived".

If you have trouble understanding or interpreting the results of the Oracle and the DIYcalculator, please consult the corresponding topics on the Dodecad and HarappaWorld blogs. There is no point in repeating here everything they have written on their own blogs.

What the heck are the MDLP World-22 components?

 One of the questions I keep getting in emails is what the various reference populations and ancestral components in my World K=12 and World-22 analyses mean. I've already provided hints to the answer earlier, but, as an old Chinese proverb says, one picture is worth ten thousand words. That's why I decided to display the admixture coefficients spatially on the surface of the globe. Following Olivier François, who proposed using the graphics libraries of the statistical software R to display spatial interpolations of the admixture coefficients (Q matrix) in two dimensions (where spatial coordinates are recorded as longitude and latitude), I created 2 contour maps per component.

Pygmy (modal in Biaka and Mbuti populations)

West-Asian (bimodal component with peaks in Caucasian populations and the south-western part of Iran, equivalent to Dienekes' Caucasian/Gedrosia component)

North-European-Mesolithic (local component with peaks in the European Mesolithic samples of La_Brana and the modern North-European Saami population)

Tibetan (Indo-Burmese) component (Himalayas, Tibet)

Mesoamerican (major genetic component in Native Americans from Mesoamerica)

North-Amerind (the 'native' component in North American Natives)

South-Amerind (the 'native' component in South American Natives)

Atlantic-Mediterranean-Neolithic (the main genetic component in Western and South-Western Europe)

The remaining contour maps for all components can be downloaded here.



  1. Vadim,

    Fantastic work - I can't wait to see my family's results!!!

    Are you only providing the World22 and Oracle on Gedmatch, or is there a link to download them also?

    I have one account on Gedmatch - my own - but have 9 other accounts, many of them children. I would like to be able to download the files and run them directly in R, is that possible?

  2. I'm Irish - all 8 G-Grandparents are Irish, and here's what I get:

    Admix Results (sorted):

    # Population Percent
    1 North-East-European 48.89
    2 Atlantic_Mediterranean_Neolithic 33.55
    3 West-Asian 7.29
    4 North-European-Mesolithic 6.41
    5 Indo-Iranian 2.74
    6 Indian 0.55
    7 South-America_Amerind 0.27
    8 South-African 0.14
    9 Indo-Tibetan 0.09
    10 Pygmy 0.04
    11 Near_East 0.02

    Single Population Sharing:

    # Population (source) Distance
    1 CEU_V (derived) 3.39
    2 German_V (derived) 3.67
    3 Welsh (derived) 4.41
    4 CEU (derived) 4.92
    5 Austrian (derived) 5.16
    6 Norwegian_V (derived) 5.23
    7 British (derived) 5.92
    8 Swedish (derived) 6.11
    9 German-North (derived) 6.34
    10 Orcadian (derived) 6.84
    11 Hungarian (derived) 7.03
    12 German (derived) 7.26
    13 Slovenian (derived) 7.96
    14 Swedish_V (derived) 8.61
    15 German-South (derived) 8.61
    16 Croatian (derived) 9.19
    17 Bosnian (derived) 9.99
    18 Serbian (derived) 10.43
    19 Czech (derived) 10.76
    20 Croatian_V (derived) 12.2

    Mixed Mode Population Sharing:

    # Primary Population (source) Secondary Population (source) Distance
    1 61% German_V (derived) + 39% Norwegian_V (derived) @ 1.98
    2 81% CEU (derived) + 19% Latvian_V (derived) @ 2.19
    3 61.4% CEU (derived) + 38.6% German (derived) @ 2.38
    4 55.9% British (derived) + 44.1% German (derived) @ 2.41
    5 81.5% CEU (derived) + 18.5% Ukrainian-Center (derived) @ 2.43
    6 81.1% CEU (derived) + 18.9% Ukrainian_V (derived) @ 2.47
    7 77.7% British (derived) + 22.3% Latvian_V (derived) @ 2.5
    8 84.6% CEU (derived) + 15.4% Mordovian (derived) @ 2.51
    9 93.7% Welsh (derived) + 6.3% Avar (derived) @ 2.52
    10 96% CEU_V (derived) + 4% Lak (derived) @ 2.58
    11 96% CEU_V (derived) + 4% Tabassaran (derived) @ 2.6
    12 82.3% CEU (derived) + 17.7% Mordovian_V (derived) @ 2.6
    13 96.3% CEU_V (derived) + 3.7% Pashtun (derived) @ 2.6
    14 84% CEU (derived) + 16% Ukrainian-East (derived) @ 2.6
    15 93.6% Welsh (derived) + 6.4% Tabassaran (derived) @ 2.6
    16 96.3% CEU_V (derived) + 3.7% Ossetian (derived) @ 2.61
    17 97.8% CEU_V (derived) + 2.2% West-Asian (ancestral) @ 2.63
    18 68.1% Norwegian_V (derived) + 31.9% Bosnian (derived) @ 2.63
    19 96.2% CEU_V (derived) + 3.8% Avar (derived) @ 2.63
    20 81.4% British (derived) + 18.6% Mordovian (derived) @ 2.63

    Question: Do you have Irish, Scottish or French Reference populations?

  3. One other thing, it would be interesting to see on what Chromosome segments I am:

    North-European-Mesolithic 6.41

    Also, what is the rank order of North-European-Mesolithic across the various populations? I guess, based on your map, Saami would be #1, maybe Finns, Estonians and Latvians #2, and then maybe Scandinavians in general #3?

  4. @pconroy

    Thank you for your comments.

    As of today, I am going to release the off-line Oracle version to the wider audience.
    So you'll be able to use it for your other accounts.

    As for your question regarding Irish, Scottish or French reference populations: no, I don't have any Irish or Scottish references in my dataset. On the other hand, I do have a large set of French populations.

    North-European-Mesolithic is a local "component" centered on the ancient Mesolithic La Brana samples and on Saami, Baltic and Finnic populations. That's why it has the double label "North-European-Mesolithic".

  5. Is there a URL that I can download the MDLP World-22 DIY Oracle from?


    1. Yes, I'll let you know when I publish the MDLP World-22 DIY Oracle online.

  6. How come Yemen Jews are only 2.4% SSA according to your spreadsheet? I expected them to be much more SSA (they have the darkest skin tone among the Jews in Israel).

  7. As Gedmatch has suspended the uploading of new raw data until August I'm also searching for the calculator files of World-22. Thank you in advance for sharing them.

  8. This comment has been removed by the author.

  9. What does this mean??? Please help!!!

    # Population Percent
    1 North-East-European 23.65
    2 Atlantic_Mediterranean_Neolithic 22.23
    3 Mesoamerican 17.48
    4 North-Amerind 12.22
    5 West-Asian 7.41
    6 South-America_Amerind 5.58
    7 Near_East 4.38
    8 Sub-Saharian 2.89
    9 Indo-Iranian 1.45
    10 Samoedic 0.79
    11 North-Siberean 0.62
    12 Austronesian 0.34
    13 Pygmy 0.3
    14 Indian 0.23
    15 East-Siberean 0.18
    16 North-European-Mesolithic 0.18
    17 Melanesian 0.09

    Single Population Sharing:

    # Population (source) Distance
    1 Miwok (derived) 7.1
    2 Mexican (derived) 12.72
    3 Serrano (derived) 21.27
    4 Puerto-Rican (derived) 21.91
    5 Cochimi (derived) 22.73
    6 Colville (derived) 24.25
    7 Costanoan (derived) 24.3
    8 Tsimsian (derived) 24.57
    9 Romania (derived) 28
    10 Ashkenazim_V (derived) 28.47
    11 Gagauz (derived) 28.84
    12 Bulgarian (derived) 28.87
    13 Macedonian (derived) 29.57
    14 Greek_South (derived) 29.7
    15 Swiss (derived) 29.95
    16 Greek_North (derived) 30.01
    17 Montenegrin (derived) 30.09
    18 Tatar_Crim (derived) 30.22
    19 Aleut (derived) 30.48
    20 Italian_North (derived) 30.6

    Mixed Mode Population Sharing:

    # Primary Population (source) Secondary Population (source) Distance
    1 86% Miwok (derived) + 14% Latvian_V (derived) @ 3.49
    2 86.7% Miwok (derived) + 13.3% Mordovian_V (derived) @ 3.77
    3 87.6% Miwok (derived) + 12.4% Mordovian (derived) @ 3.88
    4 85.1% Miwok (derived) + 14.9% Tatar_Kryashen (derived) @ 3.97
    5 88.4% Miwok (derived) + 11.6% Russian_South (derived) @ 3.97
    6 85.9% Miwok (derived) + 14.1% Tartar_Mishar (derived) @ 3.97
    7 88.3% Miwok (derived) + 11.7% Ukrainian (derived) @ 3.98
    8 87.4% Miwok (derived) + 12.6% Ukrainian_V (derived) @ 3.99
    9 88.3% Miwok (derived) + 11.7% Ukrainian-East (derived) @ 4
    10 87.7% Miwok (derived) + 12.3% Ukrainian-Center (derived) @ 4.04
    11 89.9% Miwok (derived) + 10.1% Belarusian (derived) @ 4.04
    12 88.3% Miwok (derived) + 11.7% Russian_cossack (derived) @ 4.05
    13 87.3% Miwok (derived) + 12.7% Ukrainian-West (derived) @ 4.07
    14 89.5% Miwok (derived) + 10.5% Russian (derived) @ 4.1
    15 88.3% Miwok (derived) + 11.7% Russian_V (derived) @ 4.11
    16 90.7% Miwok (derived) + 9.3% Lithuanian (derived) @ 4.13
    17 92.7% Miwok (derived) + 7.3% North-East-European (ancestral) @ 4.13
    18 88.4% Miwok (derived) + 11.6% Moldavian (derived) @ 4.14
    19 89% Miwok (derived) + 11% Russian_Center (derived) @ 4.14
    20 84.5% Miwok (derived) + 15.5% Tatar (derived) @ 4.16