Tuesday, August 2, 2011

K=6 STRUCTURE run: results for participants V157-V222

I've recently finished the STRUCTURE analysis for participants V157-V222. Getting STRUCTURE to work on the whole set of genomic SNP markers proved somewhat problematic, so I decided to run a series of complementary STRUCTURE analyses on each chromosome separately (after splitting the initial PLINK dataset into chromosomal groups). Since in BGA analysis it is important not to overlook the effects of SNP genotyping rate and linkage disequilibrium (LD), the dataset was pruned of SNPs with a low genotyping rate and of SNP markers in high LD with each other. The question of how far back one has to look to find unadmixed ancestors is relevant for determining ancestry admixture because it relates to whether, and to what extent, markers are in LD with one another. Assuming that the markers are unlinked often produces admixture estimates that differ subtly from those obtained when assuming the markers may be linked. Without knowing an individual's genealogy before testing, we cannot predict the extent to which markers will be in LD, so we must be able to accommodate uncertainty in the linkage between markers if we hope to improve the precision of admixture estimates. The best way to accommodate this uncertainty is to perform LD-based pruning in PLINK.
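
For readers who want to try a similar preparation, here is a minimal sketch of the pruning and per-chromosome split described above. It only builds the PLINK command lines and prints them; it does not run anything. The file prefix `mdlp_merged` and all thresholds (5% missingness, a 50-SNP window shifted by 5 SNPs, r² of 0.2) are assumptions chosen for illustration, not the settings actually used for this run.

```python
# Sketch only: builds PLINK command lines for genotyping-rate filtering,
# LD-based pruning and a per-chromosome split. Nothing is executed.

# Hypothetical file prefix and thresholds, NOT the values used in the post.
BFILE = "mdlp_merged"
GENO_MAX_MISSING = 0.05                   # drop SNPs missing in >5% of samples
LD_WINDOW, LD_STEP, LD_R2 = 50, 5, 0.2    # window (SNPs), step (SNPs), r^2 cutoff


def pruning_commands(bfile=BFILE):
    """Return two PLINK commands: find an LD-pruned SNP list, then extract it."""
    pruned = f"{bfile}_pruned"
    find_snps = [
        "plink", "--bfile", bfile,
        "--geno", str(GENO_MAX_MISSING),
        "--indep-pairwise", str(LD_WINDOW), str(LD_STEP), str(LD_R2),
        "--out", pruned,
    ]
    # --indep-pairwise writes the SNPs to keep into <out>.prune.in
    extract = [
        "plink", "--bfile", bfile,
        "--extract", f"{pruned}.prune.in",
        "--make-bed", "--out", pruned,
    ]
    return [find_snps, extract]


def per_chromosome_commands(bfile, chromosomes=range(1, 23)):
    """Return one PLINK command per autosome, each writing its own fileset."""
    return [
        ["plink", "--bfile", bfile, "--chr", str(c),
         "--make-bed", "--out", f"{bfile}_chr{c}"]
        for c in chromosomes
    ]


for cmd in pruning_commands() + per_chromosome_commands(f"{BFILE}_pruned"):
    print(" ".join(cmd))
```

Each per-chromosome fileset would then be converted to STRUCTURE input and run separately; that conversion step is left out here.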

The STRUCTURE method (Pritchard et al. 2000) is called a model-based clustering method because it assumes a model by which it classifies samples into groups (either discretely or proportionally). However, there is a very important problem with discrete population models. To measure the bias of the method and make real sense of it, one should first define absolute 100% ancestry, but how do we define absolute ancestry? The design of an "amateurish" BGA project (like ours) usually involves the selection of arbitrary reference populations that are impossible to justify. And merely selecting a "reference" is not sufficient to guarantee the success of a BGA project. Some projects (Dodecad and Harappa, most notably) have tried to circumvent this problem by simulating allele frequencies of "ancestral" populations (Dodecad's "zombies") meant to serve as a surrogate/proxy for "real" ancestry. As long as such a definition is not biased toward, say, one type of Eurasian over another (which could occur if we sampled one part of the Eurasian continent but not others), it would be reasonable for our purposes. Had the simulation experiment with "zombies" succeeded, it would have been a perfect surrogate for absolute ancestry. Once the definition of absolute ancestry had been established, it would then be possible to define the error of an admixture method relative to this reference point (as in the supervised mode of ADMIXTURE). However, one should keep in mind that this method has its own limitations. For example, even if we can estimate the parental allele frequencies from those of modern-day descendants, we are still left with the problem of how to define which modern-day groups are bona fide descendants of the parental groups and which are not.

Perhaps a skeptic could argue that it is safer to use assumption-free, objective approaches, such as the clustering method implemented by Pritchard et al. (2000). The difference, the skeptic's claim would go, lies in methodology: running STRUCTURE with a given set of markers and a given set of samples, we define the best way to subdivide the sample given the genotype data, and so by definition, if our sample is a good one, we establish the most appropriate population model for the markers. Because this particular method does not require an a priori definition of structure in order to infer population structure and assign individual samples to groups, its use obviates the need to define beforehand who belongs to a parental group, or even how many parental groups there are.

Another obvious reason for running STRUCTURE on each chromosome separately is the distribution of SNPs. Clearly, given that mixed ancestry can be unequally distributed throughout the genome, the denser the marker spacing, the more precise the estimates of ancestry and the less likely it is that low levels of recent admixture (which might be concentrated on a small number of chromosomes) would be missed.

Some final considerations: the results (coefficients of admixture) presented below were processed with CLUMPP software (CLUMPP was used to resolve the label switching and compute the average admixture coefficients for 22 subsequent runs) using 1000 permutations. Since permutation of clusters was involved, it is more appropriate to think of the results in terms of combined ancestry (e.g. N-European + East-European, mixed (Balto-)Slavs, West-European + East-European, etc.).
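
To illustrate what "resolving label switching" means, here is a minimal sketch. It is not CLUMPP's actual algorithm: CLUMPP maximizes a similarity measure across all pairs of runs, whereas this simplified version (an assumption made for brevity) aligns every run to the first one by brute-force search over column permutations and then averages. The function names and the toy numbers are hypothetical.

```python
from itertools import permutations


def align_run(reference, run):
    """Reorder the cluster columns of `run` so they best match `reference`.

    Both arguments are lists of rows, one row of K membership
    coefficients per individual. Brute force over K! permutations,
    so only practical for small K (K=6 gives 720 permutations).
    """
    k = len(reference[0])

    def cost(perm):
        # Sum of squared differences after relabelling the clusters of `run`.
        return sum(
            (ref_row[j] - row[perm[j]]) ** 2
            for ref_row, row in zip(reference, run)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=cost)
    return [[row[best[j]] for j in range(k)] for row in run]


def average_runs(runs):
    """Align every run to the first one, then average the coefficients."""
    reference = runs[0]
    aligned = [reference] + [align_run(reference, r) for r in runs[1:]]
    n, k = len(reference), len(reference[0])
    return [
        [sum(a[i][j] for a in aligned) / len(aligned) for j in range(k)]
        for i in range(n)
    ]


# Toy data: two runs with K=2 that found the same clusters but swapped labels.
run_a = [[0.9, 0.1], [0.2, 0.8]]
run_b = [[0.1, 0.9], [0.8, 0.2]]
print(average_runs([run_a, run_b]))
```

Without the alignment step, naively averaging the two toy runs would give 0.5 everywhere, which is why label switching has to be resolved before averaging.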

You can check the individual coefficients of ancestry in the spreadsheet.


  1. This comment has been removed by the author.

  2. 2nd try at same post.

    It seems that the methodology used sets a new standard going forward. Interesting results! Thank you and Leon for such a well-thought-out process.

    I have three questions.

    1) Will you publish the numbers for the reference individuals in each group that were averaged?

    2) If it can be discussed, how do the individuals who participated compare to their known reported paper history, or is there a list maintained and available to review?

    3) You titled each K population, and I assume, taking Mixed Western + Eastern Europe as an example, that 50% would generally be in the center of the two areas. Could you elaborate in more detail?


    MJost V173


  3. I have posted a Root Mean Square comparison spreadsheet (Excel 2007 macro-enabled XLSM). It was originally created by LarryS for the Eurogenes BGA project and was modified for the MDL Project data by MJost and LarryS. If you use it in any other version, please check your original column cells and verify that your alternate software is handling sorting and reporting correctly. MJost

  4. @Mark

    Thank you for the questions and the Root Mean Square comparison Excel 2007 macro. Could you be so kind as to modify that macro for ADMIXTURE data?

    Some answers to your questions:

    1) Yes, I will publish the averaged numbers for each "group".
    2) There is an open-access spreadsheet, "MDLP participants", for those who decide to reveal their identity -
    3) Yes, you are right in your assumption. Mixed "components" are, indeed, the intersection products of two (or more) "pure" components.

  5. Thank you for the response. I have revised the MDL project K6 spreadsheet creating a sort arrow button for each Admixture column.

    Please note there is a second worksheet, the 'First 200 Matches' tab, which contains a sort order chart that can be viewed with any specific sort selected.

    Thanks for the MDL participant document. I think everyone will enjoy the information provided.

    If this wasn't exactly what you were wanting to see with the Admixture Macro, please let me know.
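
For readers curious about the Root Mean Square comparison mentioned in the comments, here is a minimal sketch of the general idea. The spreadsheet's actual formulas are not shown in this post, so this assumes the simplest reading: the RMS difference between two individuals' admixture coefficient vectors, used to rank the closest matches. The IDs and numbers below are made up for illustration.

```python
from math import sqrt


def rms_difference(a, b):
    """Root mean square difference between two admixture coefficient vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def closest_matches(target, others, top=3):
    """Return the names in `others` sorted by RMS difference to `target`."""
    ranked = sorted(others.items(),
                    key=lambda item: rms_difference(target, item[1]))
    return [name for name, _ in ranked[:top]]


# Hypothetical K=3 coefficients, not real participant data.
me = [0.60, 0.30, 0.10]
others = {
    "X1": [0.58, 0.32, 0.10],
    "X2": [0.10, 0.20, 0.70],
    "X3": [0.50, 0.40, 0.10],
}
print(closest_matches(me, others, top=2))
```

A smaller RMS value means two sets of coefficients are more alike, so sorting by it in ascending order puts the nearest matches first.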