
Tuesday, December 6, 2011

Experimental test: running Admixture on phased samples and estimating ancestries at each locus in a population of admixed individuals (LAMP)

As a methodological tool, there is nothing wrong with the combined use of Plink/Admixture/Lamp for assessing population stratification and the levels of individual admixture. One of the goals of our project is to expose both the advantages and the limitations of particular methodologies in BGA analyses, and this experimental test therefore seeks to expose the grip of the unified Plink/Admixture/Lamp approach as a methodological contrivance.
Thus the goal of the experiment described herein is to lay bare some of the practical and theoretical obstacles that have hitherto obscured the legibility of our previous biogeographic analysis of population substructure in Europe.

As usual, we began our analysis in Plink by filtering the combined dataset to include only SNPs on the 22 autosomal chromosomes with minor allele frequency >1% and genotyping success >99%.
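The two per-SNP thresholds can be illustrated with a minimal sketch in plain Python. This is not the Plink implementation; the toy genotype encoding (0/1/2 copies of one allele, None for a missing call) and the SNP names are assumptions made only for illustration.

```python
from typing import Dict, List, Optional

# Assumed toy encoding: 0, 1 or 2 copies of one allele; None marks a missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype at one SNP."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Frequency of the rarer allele among the non-missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    counted_allele_freq = sum(observed) / (2 * len(observed))
    return min(counted_allele_freq, 1.0 - counted_allele_freq)


def passes_qc(calls: List[Genotype],
              min_maf: float = 0.01,
              min_call_rate: float = 0.99) -> bool:
    """Keep a SNP only if MAF > 1% and genotyping success > 99%."""
    return (call_rate(calls) > min_call_rate
            and minor_allele_frequency(calls) > min_maf)


# Hypothetical SNPs typed on ten samples.
snps: Dict[str, List[Genotype]] = {
    "rs_common": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],
    "rs_monomorphic": [0] * 10,
    "rs_missing": [0, 1, None, 1, 0, 1, 1, 0, 2, 1],
}

kept = [name for name, calls in snps.items() if passes_qc(calls)]
print(kept)
```

Only the first SNP survives here: the second has no minor allele at all, and the third has a call rate of 90%.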
Because background linkage disequilibrium (LD) can affect both principal component and structure-like analyses, we thinned the marker set by excluding SNPs in strong LD (pairwise genotypic correlation r² > 0.4) in a window of 100 SNPs (sliding the window by 10 SNPs at a time). We also used the following Plink techniques to obtain a homogeneous sample: pairwise clustering based on IBS, which detects pairs of individuals who look more different from each other than you would expect in a random, homogeneous sample; evaluating 

Reference individuals who looked more different from the rest of the data set (by more than 3 SD) and/or had high PIHAT values (greater than 0.05) and a higher degree of inbreeding (homozygosity)* were removed from our training set.
-
* Note that the degree of inbreeding is defined as the probability that the two alleles at a locus are identical by descent, which makes the locus homozygous.

The stratification of samples according to their levels of homozygosity. X axis: total amount of homozygous segments in kb (kilobases); Y axis: average size of homozygous segments in kb.


The levels of individual homozygosity in the data set: NSEG (number of segments) on the X axis is plotted against the total length of homozygous segments in kb (Y axis).


The pairwise clustering based on IBD. The total length of IBD segments (X axis) is plotted against PIHAT values (PIHAT = P(IBD=2) + 0.5*P(IBD=1), the proportion IBD).
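As a rough illustration of the relatedness cutoff, the sketch below computes PIHAT from the formula in the caption above and flags pairs above the 0.05 threshold. The sample names and IBD probabilities are invented for the example; in practice these values would come from Plink's pairwise IBD output.

```python
from typing import List, Tuple


def pi_hat(p_ibd1: float, p_ibd2: float) -> float:
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return p_ibd2 + 0.5 * p_ibd1


# Hypothetical pairs: (sample A, sample B, P(IBD=1), P(IBD=2)).
pairs: List[Tuple[str, str, float, float]] = [
    ("S1", "S2", 0.02, 0.0),  # essentially unrelated
    ("S1", "S3", 0.50, 0.0),  # PIHAT = 0.25, well above the cutoff
    ("S2", "S3", 0.00, 0.0),
]

PIHAT_CUTOFF = 0.05  # threshold quoted in the text

flagged = [(a, b) for a, b, z1, z2 in pairs if pi_hat(z1, z2) > PIHAT_CUTOFF]
print(flagged)
```

Which member of a flagged pair is then dropped is a separate decision that this sketch does not make.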



After frequency and genotyping pruning and removal of the outliers, the final data set consisted of 90455 SNPs and 371 individuals (289 males, 82 females), which were used in subsequent analyses.

First of all, we used ADMIXTURE (Alexander, Novembre, Lange 2009), which implements a structure-like (Pritchard, Stephens, Donnelly 2000) model-based maximum likelihood (ML) clustering algorithm, to assess population structure in the whole data set.

To maintain compatibility with the MDLP calculator, we chose K=7 as a sensible modeling choice. The Q estimates obtained from the Admixture run were plotted in R.



Please note that only participants of the MDLP project are plotted on this barplot. The full list of Q estimates (including "reference" populations) is available in this  spreadsheet.

The seven ancestral components (K) inferred at this level of resolution are:

Trans-Caucasian - red
Balkanic/Mediterranean - yellow
North-Caucasian - green
West-European
Altaic - light blue
Balto-Slavic - dark blue
Balto-Finnic/North-European - purple

As usual, we labeled these components for mnemonic purposes only:  researchers should thus be cautious in interpreting  the inferred components in terms of conventional population history.
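For readers who want to work with the Q estimates directly, here is a small sketch of reading an ADMIXTURE-style Q table (one row per individual, K whitespace-separated proportions) and reporting the largest component. The column order used below is an assumption: ADMIXTURE does not name or order its components, so the mapping of columns to labels has to be checked against the actual run. The two rows are made-up values.

```python
from typing import List

# Assumed column order; the real order depends on the particular ADMIXTURE run.
LABELS = [
    "Trans-Caucasian",
    "Balkanic/Mediterranean",
    "North-Caucasian",
    "West-European",
    "Altaic",
    "Balto-Slavic",
    "Balto-Finnic/North-European",
]


def parse_q(text: str, k: int) -> List[List[float]]:
    """Parse Q-file text and check that every row has k values summing to ~1."""
    rows = []
    for line_no, line in enumerate(text.strip().splitlines(), start=1):
        values = [float(x) for x in line.split()]
        if len(values) != k:
            raise ValueError(f"row {line_no}: expected {k} values, got {len(values)}")
        if abs(sum(values) - 1.0) > 1e-3:
            raise ValueError(f"row {line_no}: proportions do not sum to 1")
        rows.append(values)
    return rows


def dominant_component(row: List[float], labels: List[str]) -> str:
    """Label of the component with the largest proportion."""
    best = max(range(len(row)), key=lambda i: row[i])
    return labels[best]


# Invented example rows, not project results.
q_text = """
0.05 0.10 0.05 0.20 0.05 0.45 0.10
0.60 0.15 0.10 0.05 0.02 0.03 0.05
"""

q_rows = parse_q(q_text, k=7)
dominant = [dominant_component(row, LABELS) for row in q_rows]
print(dominant)
```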

A neighbor-joining tree based on inter-population Fst distances
As a next step, we split the whole-genome PLINK binary files (371 samples) into 22 separate chromosomal chunks and subsequently used the Admixture software to evaluate population structure on each of the 22 chromosomes. After that, we used a pipeline for phasing PLINK-format data with BEAGLE and converting the phased data back to PLINK format.
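The per-chromosome split and the per-chromosome Admixture runs could be scripted along the following lines. This is only a sketch that builds the command lines without executing them; the file prefix "mdlp" is a hypothetical name, and the BEAGLE phasing and conversion steps are not covered here.

```python
from typing import List, Tuple

Command = List[str]


def per_chromosome_commands(bfile_prefix: str,
                            k: int = 7,
                            chromosomes=range(1, 23)) -> List[Tuple[Command, Command]]:
    """Return (plink split command, admixture command) for each chromosome."""
    commands = []
    for chrom in chromosomes:
        out_prefix = f"{bfile_prefix}_chr{chrom}"
        # Extract one chromosome into its own binary fileset.
        split_cmd = ["plink", "--bfile", bfile_prefix,
                     "--chr", str(chrom),
                     "--make-bed", "--out", out_prefix]
        # Run ADMIXTURE on that chromosome with K ancestral components.
        admixture_cmd = ["admixture", f"{out_prefix}.bed", str(k)]
        commands.append((split_cmd, admixture_cmd))
    return commands


commands = per_chromosome_commands("mdlp")  # "mdlp" is a hypothetical prefix
for split_cmd, admixture_cmd in commands[:2]:
    print(" ".join(split_cmd))
    print(" ".join(admixture_cmd))
```

Each command list can be passed to subprocess.run once the binaries and input files are in place.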

We assumed that the admixed population being analyzed in our project (represented by the VID samples) has arisen from the admixture of 7 separate ancestral populations, and that phased data are available from unadmixed reference populations that are closely related to the true ancestral populations. Under this assumption, we again used the ADMIXTURE software to infer ancestral components in the phased data set for each separate chromosome. Please note that we intentionally left those components unlabeled.

Finally, we used LAMP (Local Ancestry in adMixed Populations) (Sankararaman et al. 2008) for inferring individual admixture. This simply involves applying either of the two locus-specific procedures (inferring locus-specific ancestries when the ancestral populations are not known, or when the ancestral populations are known), followed by averaging these locus-specific ancestries to obtain the individual admixture. We estimated allele frequencies stratified by a categorical cluster variable in Plink (using the plink --file data --freq --within option). We fixed the HapMap recombination rate (estimated separately for each of the 22 chromosomes) according to the units of distance specified in the chromosome posfile. We also tried different numbers of generations, assuming that there are K=7 ancestral populations A1, …, AK that have been mixing for G=5, 10 and 25 generations. The ancestries generated by LAMP were visualized with the Generategraph package.
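The averaging step mentioned above is simple enough to show in a few lines. The sketch below assumes that the locus-specific ancestries for one individual are available as one vector of K proportions per locus; this in-memory layout is an assumption for illustration and is not LAMP's actual output format. Three components and four loci are used to keep the example short.

```python
from typing import List, Sequence


def individual_admixture(locus_ancestries: Sequence[Sequence[float]]) -> List[float]:
    """Average per-locus ancestry vectors into one individual-level estimate."""
    if not locus_ancestries:
        raise ValueError("need at least one locus")
    k = len(locus_ancestries[0])
    totals = [0.0] * k
    for locus in locus_ancestries:
        if len(locus) != k:
            raise ValueError("every locus must have the same number of components")
        for i, share in enumerate(locus):
            totals[i] += share
    n_loci = len(locus_ancestries)
    return [total / n_loci for total in totals]


# Invented locus-specific ancestries for one individual (K=3, four loci).
loci = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

admixture = individual_admixture(loci)
print(admixture)
```

Because every locus vector sums to 1, the averaged vector sums to 1 as well, so it can be compared directly with the Q estimates from Admixture.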

We organized the output into separate Excel spreadsheets to facilitate analysis and interpretation of the results. Each of the provided Excel spreadsheets includes the following sections:

1) ADMIXTURE output for phased chromosomal data (Chr*-phased)
2) ADMIXTURE output for unphased chromosomal data (Chr*-unphased)
3) LAMP results for G=5 (Chr*-lamp-gen5)
4) LAMP results for G=10 (Chr*-lamp-gen10)
5) LAMP results for G=25 (Chr*-lamp-gen25)


Links

Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8

The rest of the chromosomal data will be added as soon as possible.











  
