As a methodological tool, there is nothing wrong with the combined use of Plink/Admixture/Lamp for assessing population stratification and the levels of individual admixture. One of the goals of our project is to expose both the advantages and the limitations of particular methodologies in BGA analyses, and this experimental test therefore seeks to examine the unified Plink/Admixture/Lamp approach as a methodological construct.
Thus the goal of the experiment described here is to lay bare some of the practical and theoretical obstacles that have so far obscured the interpretation of our previous biogeographic analysis of population substructure in Europe.
After frequency and genotyping pruning, and after removing the outliers, the final data set consisted of 90455 SNPs and 371 individuals (289 males, 82 females), which were used in subsequent analyses.
First, we used ADMIXTURE (Alexander, Novembre, Lange 2009), which implements a structure-like (Pritchard, Stephens, Donnelly 2000) model-based maximum likelihood (ML) clustering algorithm, to assess population structure in the whole data set.
To maintain compatibility with the MDLP calculator, we chose K=7 as a sensible modeling choice. The Q estimates obtained from the ADMIXTURE run were plotted in R.
Please note that only participants of the MDLP project are plotted on this barplot. The full list of Q estimates (including "reference" populations) is available in this spreadsheet.
The seven ancestral components (K) inferred at this level of resolution are:
Trans-Caucasian - red
Balkanic/Mediterranean - yellow
North-Caucasian - green
West-European
Altaic - light blue
Balto-Slavic - dark blue
Balto-Finnic/North-European - purple
As usual, we labeled these components for mnemonic purposes only: researchers should thus be cautious in interpreting the inferred components in terms of conventional population history.
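For readers who want to inspect the Q estimates outside R, the following is a minimal Python sketch of reading an ADMIXTURE-style Q matrix (one whitespace-separated row per individual, K columns that sum to 1) and reporting each individual's largest component. The toy values are invented for illustration, and the mapping of column order to the component labels is an assumption: ADMIXTURE does not order its columns in any meaningful way, so the order has to be established by inspection.

```python
from typing import List

# Component labels from the text; the column order used here is an ASSUMPTION.
LABELS = [
    "Trans-Caucasian",
    "Balkanic/Mediterranean",
    "North-Caucasian",
    "West-European",
    "Altaic",
    "Balto-Slavic",
    "Balto-Finnic/North-European",
]


def parse_q(text: str) -> List[List[float]]:
    """Parse a Q matrix: one row per individual, one column per component."""
    return [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]


def dominant_component(row: List[float], labels: List[str]) -> str:
    """Return the label of the component with the largest proportion."""
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best]


# Hypothetical toy Q matrix for two individuals (not real project data).
TOY_Q = """\
0.10 0.05 0.05 0.20 0.00 0.50 0.10
0.60 0.20 0.10 0.05 0.00 0.03 0.02
"""

q_rows = parse_q(TOY_Q)
dominant = [dominant_component(row, LABELS) for row in q_rows]
```

The largest component is only a convenience summary; as noted above, the labels themselves are mnemonic and should not be read as statements about population history.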
As a next step, we split the whole-genome PLINK binary files (371 samples) into 22 separate chromosomal chunks and subsequently used ADMIXTURE to evaluate population structure on each of the 22 chromosomes. After that, we used a pipeline for phasing PLINK-format data with BEAGLE and converting the phased data back to PLINK format.
We assumed that the admixed population being analyzed in our project (represented by the VID samples) has arisen from the admixture of 7 separate ancestral populations, and that phased data are available from unadmixed reference populations that are closely related to the true ancestral populations. Under this assumption we again used ADMIXTURE to infer ancestral components in the phased data set of separate chromosomes. Please note that we intentionally left those components unlabeled.
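The per-chromosome split and ADMIXTURE runs could be scripted roughly as follows. This is only a sketch: the file prefix "mdlp" is a hypothetical name, the commands are built as argument lists rather than executed, and the BEAGLE phasing and format-conversion steps are deliberately left out because their exact invocation is not given in the text.

```python
from typing import Iterable, List, Tuple

Command = List[str]


def per_chromosome_commands(
    bfile_prefix: str,
    k: int = 7,
    chromosomes: Iterable[int] = range(1, 23),
) -> List[Tuple[Command, Command]]:
    """Build (plink split, admixture run) command pairs, one per autosome."""
    commands = []
    for chrom in chromosomes:
        out_prefix = f"{bfile_prefix}.chr{chrom}"
        # Extract a single chromosome into its own binary file set.
        split = ["plink", "--bfile", bfile_prefix, "--chr", str(chrom),
                 "--make-bed", "--out", out_prefix]
        # Run ADMIXTURE on that chromosome with K ancestral components.
        admix = ["admixture", f"{out_prefix}.bed", str(k)]
        commands.append((split, admix))
    return commands


# "mdlp" is a hypothetical file prefix used only for illustration.
cmds = per_chromosome_commands("mdlp")
```

Each argument list can be handed to subprocess.run once the tools are installed and the input files exist.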
As usual, we began our analysis in Plink software, filtering the combined data set to include only SNPs on the 22 autosomal chromosomes with minor allele frequency >1% and genotyping success >99%.
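To make the two thresholds concrete, here is a small Python sketch of the same filter applied to a single SNP. It is not the Plink implementation: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with None for a missing call, and the function names are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0/1/2 allele copies; None marks a missing call


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at this SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rarer allele among the called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    allele_freq = sum(called) / (2 * len(called))
    return min(allele_freq, 1 - allele_freq)


def passes_filters(genotypes: List[Genotype],
                   min_maf: float = 0.01,
                   min_call_rate: float = 0.99) -> bool:
    """Keep a SNP only if MAF > 1% and genotyping success > 99%."""
    return (call_rate(genotypes) > min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)


# Toy SNPs for 100 individuals (invented values).
common_snp = [0] * 90 + [1] * 10          # MAF = 0.05, fully genotyped
monomorphic_snp = [0] * 100               # MAF = 0
poorly_typed_snp = [None] * 5 + [1] * 95  # call rate = 0.95
```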
Because background linkage disequilibrium (LD) can affect both principal component and structure-like analyses, we thinned the marker set by excluding SNPs in strong LD (pairwise genotypic correlation r2>0.4) in a window of 100 SNPs (sliding the window by 10 SNPs at a time). We also used the following Plink techniques to obtain a homogeneous sample: pairwise clustering based on IBS, which detects pairs of individuals who look more different from each other than you would expect in a random, homogeneous sample, and evaluation of pairwise IBD sharing (PIHAT) and of individual homozygosity. Those reference individuals who looked more different (by more than 3 SD) from the rest of the data set and/or had high PIHAT values (greater than 0.05) and a higher degree of inbreeding (homozygosity)* were removed from our training set.
-
* Note that the degree of inbreeding is defined as the probability that the two alleles at a locus are identical by descent.
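The LD-thinning step described above can be illustrated with a simplified Python sketch. It computes the squared genotypic correlation between two SNPs and then greedily drops the later SNP of any pair within a sliding window whose r2 exceeds the threshold. This is a simplification for illustration only: Plink's own pruning decides which SNP of a pair to drop differently, and the function names here are hypothetical.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_thin(snps: List[Sequence[float]], window: int = 100,
            step: int = 10, max_r2: float = 0.4) -> List[int]:
    """Return indices of SNPs kept after greedy windowed LD thinning."""
    keep = [True] * len(snps)
    start = 0
    while start < len(snps):
        end = min(start + window, len(snps))
        for i in range(start, end):
            if not keep[i]:
                continue
            for j in range(i + 1, end):
                # Simplification: always drop the later SNP of a correlated pair.
                if keep[j] and r_squared(snps[i], snps[j]) > max_r2:
                    keep[j] = False
        if end == len(snps):
            break
        start += step  # slide the window forward
    return [i for i, kept in enumerate(keep) if kept]


# Toy genotypes (rows = SNPs, columns = individuals); invented values.
toy_snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # perfectly correlated with the first SNP
    [2, 0, 1, 1, 0, 2],  # uncorrelated with the first SNP
]
kept_indices = ld_thin(toy_snps)
```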
The stratification of samples according to the levels of homozygosity. X axis: total amount of homozygous segments in kb (kilobases); Y axis: average size of homozygous segments in kb.
The levels of individual homozygosity in the data set: NSEG (number of segments, X axis) is plotted against the total length of homozygous segments in kb (Y axis).
The pairwise clustering based on IBD. The total length of IBD segments (X axis) is plotted against PIHAT values (proportion IBD, P(IBD=2)+0.5*P(IBD=1)).
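The PIHAT formula in the caption above and the two exclusion rules (PIHAT greater than 0.05, more than 3 SD from the rest of the data set) can be sketched in Python as follows. The per-sample score used for the 3 SD rule is left abstract, since the text does not say which statistic was standardized; the sample identifiers and values are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def pi_hat(z1: float, z2: float) -> float:
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def flag_related(pairs: List[Tuple[str, str, float, float]],
                 max_pi_hat: float = 0.05) -> List[Tuple[str, str]]:
    """Return sample pairs whose PIHAT exceeds the threshold.

    Each entry of `pairs` is (id1, id2, P(IBD=1), P(IBD=2)).
    """
    return [(a, b) for a, b, z1, z2 in pairs if pi_hat(z1, z2) > max_pi_hat]


def flag_outliers(scores: Dict[str, float], max_sd: float = 3.0) -> List[str]:
    """Return samples whose score lies more than max_sd SDs from the mean."""
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return sorted(s for s, v in scores.items() if abs(v - mu) / sd > max_sd)


# Invented toy data: one related pair and one extreme sample.
toy_pairs = [("S1", "S2", 0.50, 0.25), ("S1", "S3", 0.04, 0.00)]
toy_scores = {f"S{i}": 1.0 for i in range(1, 21)}
toy_scores["S21"] = 10.0

related = flag_related(toy_pairs)
outliers = flag_outliers(toy_scores)
```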
A neighbor-joining tree based on inter-population Fst distances.
Finally, we used LAMP (Local Ancestry in adMixed Populations) (Sankararaman et al. 2008) to infer individual admixture. This simply involves applying either of the above procedures (inferring locus-specific ancestries when the ancestral populations are not known, or when they are known) and then averaging these locus-specific ancestries to obtain the individual admixture. We estimated allele frequencies in Plink software, stratified by a categorical cluster variable (using the plink --file data --freq --within option). We fixed the HapMap recombination rate (estimated separately for each of the 22 chromosomes) according to the units of distance specified in the chromosome posfile. We also varied the number of generations G, assuming that there are K=7 ancestral populations A1, …, AK that have been mixing for G=5, 10 or 25 generations. The ancestries generated by LAMP were visualized in the Generategraph package.
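The averaging step mentioned above is simple enough to show directly. The Python sketch below assumes the locus-specific ancestries have already been read into a list with one entry per locus, each entry holding the ancestry proportions for the K populations at that locus; this in-memory layout is an assumption and is not LAMP's output format. The toy example uses K=3 for brevity rather than the K=7 used in the analysis.

```python
from typing import List, Sequence


def individual_admixture(local_ancestry: Sequence[Sequence[float]]) -> List[float]:
    """Average locus-specific ancestry proportions over all loci.

    `local_ancestry[i][p]` is the ancestry proportion from population p
    at locus i (each locus is assumed to sum to 1 across populations).
    """
    n_loci = len(local_ancestry)
    k = len(local_ancestry[0])
    return [sum(locus[p] for locus in local_ancestry) / n_loci
            for p in range(k)]


# Invented toy data: 4 loci, K=3 ancestral populations.
toy_local = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
toy_admixture = individual_admixture(toy_local)
```

An unweighted mean treats every locus equally; weighting loci by the genomic distance they cover would be a reasonable alternative, but the text does not say which was used.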
We organized the output into separate Excel spreadsheets to facilitate analysis and interpretation of the results. Each of the provided Excel spreadsheets includes the following sections:
1) ADMIXTURE output for phased chromosomal data (Chr*-phased)
2) ADMIXTURE output for unphased chromosomal data (Chr*-unphased)
3) LAMP results for G=5 (Chr*-lamp-gen5)
4) LAMP results for G=10 (Chr*-lamp-gen5)
5) LAMP results for G=25 (Chr*-lamp-gen5)