Tuesday, August 2, 2011

K=6 STRUCTURE run: results for participants V157-V222

I've recently finished the STRUCTURE analysis for participants V157-V222. Getting STRUCTURE to run on the whole set of genomic SNP markers proved problematic, so I decided to run a series of complementary STRUCTURE analyses on each chromosome separately (after splitting the initial PLINK dataset into chromosomal groups). Since in BGA analysis it is important not to overlook the effects of SNP genotyping rate and linkage disequilibrium (LD), the dataset was pruned of SNPs with a low genotyping rate and of SNP markers in high LD with each other. The question of how far back one has to look to find unadmixed ancestors is relevant for determining ancestry admixture because it relates to whether or not, and to what extent, markers are in LD with one another. Assuming that the markers are unlinked often produces admixture estimates that differ subtly from those obtained when assuming the markers may be linked. Without knowledge of an individual's genealogy before testing, we cannot predict the extent to which markers will be in LD, so we must be able to accommodate uncertainty in the linkage between markers if we hope to improve the precision of admixture estimates. In practice, the best way to accommodate this uncertainty is to perform LD-based pruning in PLINK.
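The splitting and pruning steps described above can be sketched roughly as follows. This is only an illustration, not the exact commands used for this run: the helper name plink_commands, the dataset name, and the thresholds (--geno 0.05 and --indep-pairwise 50 5 0.2) are assumptions, since the actual values are not given here. The sketch only builds the PLINK command lines; it does not execute them.

```python
def plink_commands(bfile, chromosomes=range(1, 23), geno=0.05,
                   window=50, step=5, r2=0.2):
    """Build PLINK command lines for per-chromosome filtering and LD pruning.

    All thresholds are illustrative assumptions, not the values used in the post.
    """
    commands = []
    for chrom in chromosomes:
        out = f"{bfile}_chr{chrom}"
        # Step 1: restrict to one chromosome, drop SNPs with a low genotyping
        # rate, and let PLINK write the list of SNPs to keep (<out>.prune.in).
        commands.append([
            "plink", "--bfile", bfile, "--chr", str(chrom),
            "--geno", str(geno),
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", out,
        ])
        # Step 2: write a new per-chromosome dataset with only the kept SNPs.
        commands.append([
            "plink", "--bfile", bfile, "--chr", str(chrom),
            "--geno", str(geno),
            "--extract", f"{out}.prune.in",
            "--make-bed", "--out", f"{out}_pruned",
        ])
    return commands


if __name__ == "__main__":
    # "mydata" is a hypothetical PLINK binary fileset (mydata.bed/.bim/.fam).
    for cmd in plink_commands("mydata", chromosomes=[1, 2]):
        print(" ".join(cmd))
```

Each command list could be passed to subprocess.run once PLINK is installed. The pruned per-chromosome files would still need to be converted to STRUCTURE's input format, which is not shown here.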

The STRUCTURE method (Pritchard et al. 2000) is called a model-based clustering method because it assumes a model by which it classifies samples into groups (either discretely or proportionally). However, there is a very important problem with discrete population models. To measure the bias of the method and make real sense of it, one should first define absolute 100% ancestry, but how do we define absolute ancestry? The design of an "amateurish" BGA project (like ours) usually involves selecting arbitrary reference populations that are impossible to justify, and merely selecting a "reference" is not sufficient to guarantee the success of a BGA project. Some projects (most notably Dodecad and Harappa) have tried to circumvent this problem by simulating the allele frequencies of "ancestral" populations (Dodecad's "zombies"). If such simulated populations represented a perfect surrogate/proxy for "real" ancestry, then as long as the definition does not bias toward, say, one type of Eurasian over another (which could occur if we sampled one part of the Eurasian continent but not others), the definition would be reasonable for our purposes. Had the simulation experiment with "zombies" succeeded, it would have provided a perfect surrogate for absolute ancestry. Once the definition of absolute ancestry had been established, it would then be possible to define the error of an admixture method relative to this reference point (as in the supervised mode of ADMIXTURE). However, one should keep in mind that this method has its own limitations. For example, even granting that we can estimate the parental allele frequencies from those of modern-day descendants, we are still left with the problem of deciding which modern-day groups are bona fide descendants of the parental groups and which are not.
Perhaps a skeptic would argue that it is safer to use assumption-free, objective approaches, such as the clustering method implemented by Pritchard et al. (2000). The difference, the skeptic's claim would go, lies in methodology: by running STRUCTURE with a given set of markers and a given set of samples, we determine the best way to subdivide the sample given the genotype data, and so by definition, if our sample is a good one, we establish the most appropriate population model for the markers. Because this method does not require structure to be defined a priori in order to infer population structure and assign individual samples to groups, its use obviates the need to define beforehand who belongs to a parental group, or even how many parental groups there are.

Another obvious reason for running STRUCTURE on each chromosome separately is the distribution of SNPs. Clearly, given that mixed ancestry can be unequally distributed throughout the genome, the denser the marker spacing, the more precise the estimates of ancestry and the less likely it is that low levels of recent admixture (which might be concentrated on a small number of chromosomes) would be missed.

Some final considerations: the results (coefficients of admixture) presented below were processed with the CLUMPP software (CLUMPP was used to resolve label switching and to compute the average admixture coefficients across the 22 runs), using 1000 permutations. Since permutation of clusters was involved, it is more appropriate to think of the results in terms of combined ancestry (i.e. N-European + East-European, mixed (Balto-)Slavs, West-European + East-European, etc.).
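To make the label-switching problem concrete, here is a minimal sketch of what aligning and averaging runs means. It is not CLUMPP's actual algorithm: it simply tries every column permutation of each run against the first run, keeps the best match, and then averages. The function names align_to_reference and average_runs and the toy matrices are hypothetical. An exhaustive search like this is only feasible for small K (K=6 means 720 permutations per run).

```python
from itertools import permutations


def align_to_reference(reference, run):
    """Reorder the columns of `run` so that they best match `reference`.

    Both arguments are Q matrices: one row per individual, one column per
    cluster. Similarity is the sum of element-wise products.
    """
    k = len(reference[0])
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(k)):
        score = sum(
            ref_row[j] * row[perm[j]]
            for ref_row, row in zip(reference, run)
            for j in range(k)
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return [[row[best_perm[j]] for j in range(k)] for row in run]


def average_runs(runs):
    """Align every run to the first one, then average the coefficients."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    n, k = len(reference), len(reference[0])
    return [
        [sum(q[i][j] for q in aligned) / len(aligned) for j in range(k)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy example: two runs that agree, but with the cluster labels swapped.
    run_a = [[0.9, 0.1], [0.2, 0.8]]
    run_b = [[0.1, 0.9], [0.8, 0.2]]
    print(average_runs([run_a, run_b]))
```

Without the alignment step, averaging the two toy runs directly would give 0.5 everywhere, which is why the labels have to be matched before the coefficients are combined.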

You can check the individual coefficients of ancestry in the spreadsheet.