
Saturday, May 28, 2011

Experimental test III: (De)constructing ancestry with LAMP

In my previous attempt to find statistical criteria by which Lithuanians and Belarusians might be separated from the rest of the dataset, I outlined some of the most important incongruities and ambiguities between the output of LAMP and the results of the supervised ADMIXTURE run. It might be argued that these ambiguities are at least partly explained by the evident differences between the population-stratification algorithms implemented in LAMP and ADMIXTURE.

1) ADMIXTURE implements a model-based approach that estimates ancestry coefficients as the parameters of a statistical model. It is also important to add that this approach is based on the global ancestry paradigm (i.e., the goal of an ADMIXTURE/STRUCTURE analysis is to estimate the proportion of ancestry from each contributing population, considered as an average over the individual's entire genome).

2) LAMP is built upon WINPOP, an efficient dynamic-programming algorithm that infers locus-specific ancestries. The genome is partitioned into chromosome segments of definite ancestral origin (overlapping, contiguous windows of SNPs), and a likelihood model is optimized over each window. The goal then is to find the segment boundaries and assign each segment's origin.
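The link between the two paradigms can be illustrated with a minimal sketch: once every SNP carries a local ancestry label, a global ancestry estimate is simply the share of each label along the genome. The population labels and the toy calls below are hypothetical, each SNP is given a single label (a haploid simplification), and the average is weighted by SNP count rather than by genetic length; neither ADMIXTURE nor LAMP computes its output this way.

```python
from collections import Counter


def global_ancestry(local_calls):
    """Average per-SNP local ancestry labels into overall proportions."""
    counts = Counter(local_calls)
    total = len(local_calls)
    return {population: n / total for population, n in counts.items()}


# Hypothetical local-ancestry calls along one chromosome (one label per SNP).
calls = ["LT"] * 6 + ["RO"] * 3 + ["LT"] * 1
proportions = global_ancestry(calls)
print(proportions)
```

The "average ancestry" charts further down summarise segment calls in the same spirit.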

As we have already seen, for the particular purposes of our project (i.e., estimating definite ancestry proportions in very closely related populations), LAMP's model has a number of advantages over ADMIXTURE's. I fully share Andres Palsen's confidence in LAMP's ability to separate the closely related populations of Scandinavia (Norwegians, Swedes, Danes).
Another advantage of LAMP's model is that it takes into account the physical positions of SNPs, the genetic distance in cM (centimorgans) between SNPs, and recombination events. Moreover, it allows the ancestral origin of segments to be estimated for different numbers of generations since the admixture events.
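To get a feeling for why the number of generations matters, here is a rough back-of-the-envelope sketch. It assumes about one crossover per Morgan per generation, so that after G generations the recombination breakpoints inherited since admixture are spaced roughly 100/G cM apart. This is a textbook approximation that ignores admixture proportions, and it is not LAMP's internal model; the function name is my own.

```python
def expected_tract_length_cm(generations):
    """Rough mean spacing (in cM) between breakpoints after G generations."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    # One crossover per Morgan (100 cM) per generation is assumed.
    return 100.0 / generations


for g in (5, 10, 25):
    print(g, expected_tract_length_cm(g))
```

Under this approximation, a larger G implies shorter expected segments, so a run with a larger G should be expected to produce a more fragmented picture.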

I used HapMap's list of recombination rates for Chromosome 22 and estimated allele frequencies with PLINK on a training set of 6 reference populations (Orcadians, Romanians, Lithuanians, Belarusians and Russians). Then I performed the analysis in LAMP for G=5, 10 and 25 (the number of generations since the admixture events) and visualised the output (click on the attached pictures to obtain the original images).
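For readers who want to see what "estimating allele frequencies per reference population" amounts to, here is a minimal sketch. It assumes genotypes coded as 0/1/2 copies of the counted allele, with -1 for missing calls; the population names and genotype values are made up. In the actual analysis this step was done with PLINK, and the code below is not PLINK's implementation.

```python
def allele_frequencies(genotypes):
    """Per-SNP frequency of the counted allele.

    genotypes: list of individuals, each a list of 0/1/2 counts (-1 = missing).
    Returns None for a SNP with no observed genotypes.
    """
    n_snps = len(genotypes[0])
    freqs = []
    for j in range(n_snps):
        observed = [ind[j] for ind in genotypes if ind[j] != -1]
        # Each observed diploid genotype contributes two alleles.
        freqs.append(sum(observed) / (2 * len(observed)) if observed else None)
    return freqs


# Hypothetical toy data: two populations, two individuals each, three SNPs.
toy = {
    "pop_A": [[0, 1, 2], [1, -1, 2]],
    "pop_B": [[2, 2, 0], [2, 1, 0]],
}
by_population = {name: allele_frequencies(g) for name, g in toy.items()}
print(by_population)
```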

I would suggest checking the results of the LAMP analysis of Chromosome 22 for all project participants. It would be interesting to compare these results with AncestryFinder's visualizations or with the ADMIXTURE results of the Dodecad, Eurogenes and Diogenes projects.

As always, I am open to comments and criticism.



G=5 (5 generations from the admixture events)
Ancestral segments on Chromosome 22

Average ancestry for G=5


G=10 (10 generations from the admixture events)

Ancestral segments on Chromosome 22
Average ancestry for G=10


G=25 (25 generations from the admixture events)

Ancestral segments on Chromosome 22 
Average ancestry for G=25


Thursday, May 26, 2011

Some comments on experimental test

Some participants of the project have requested the MDS source files to produce their own plots.
I have uploaded the PLINK MDS, CLUSTER and NEAREST files to Google Docs, so please feel free to download them (all files are in text format, so you can open them in a text editor or import them into Excel or RGui). NEAREST is essentially one of the files (the other is the MIBS matrix) used to produce the IBS Rdata object files distributed by Dienekes or Zack. I haven't had time to create such an object file myself, but you can import NEAREST into Excel and filter for the participants who are closest to you.

MDLP-adm.cluster0
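If you prefer a script to Excel, the filtering step could be sketched as follows. This is only an illustration: it assumes a whitespace-delimited text file with a header line, and the column names (IID, IID2 and so on) and every value in the sample are made up, so check them against the header of the file you actually download.

```python
def rows_for_participant(table_text, participant_id, id_column="IID"):
    """Return the rows of a whitespace-delimited table that match one ID.

    The first non-empty line is treated as the header, so column names
    come from the file itself; only `id_column` has to exist.
    """
    lines = [ln.split() for ln in table_text.splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]
    rows = [dict(zip(header, fields)) for fields in body]
    return [row for row in rows if row.get(id_column) == participant_id]


# Hypothetical excerpt; the header and all values are invented for illustration.
sample = """\
FID IID NN MIN_DST FID2 IID2
MDLP V158 1 0.9 MDLP V160
MDLP V158 2 0.8 MDLP V157
MDLP V160 1 0.9 MDLP V158
"""
matches = rows_for_participant(sample, "V158")
closest = [row["IID2"] for row in matches]
print(closest)
```

To use it on a downloaded file, read the file into a string first and pass your own project ID.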

Several blog readers pointed out that the Belarusian and Lithuanian components were represented incorrectly in the ADMIXTURE plot I published yesterday. I have fixed the plot by swapping the Belarusian and Lithuanian components so that the visualization is now correct.

A couple of the project's participants have expressed concern about the high percentage of the Hungarian component in their results. As indicated earlier, this component is not associated exclusively with the Hungarian population, and having a high percentage of it doesn't necessarily imply distant Hungarian roots. Here, the Hungarian component is used as a suitable proxy for genetic Central European (or, more specifically, Circum-Carpathian) ancestry. Personally, I am inclined to believe that this "Hungarian" component is high in individuals with Ukrainian ancestry or with ancestry from Slovakia or southern Poland (when I have enough Ukrainian and Polish samples, I'll test this scenario).

As for the LAMP/STRUCTURE analysis, I would warn against reading too much into it. Since the analysis was performed on Chromosome 22 alone (without thinning out SNPs in linkage disequilibrium, etc.), the results were shown for evaluation purposes only. The estimated components seem to pick up a more distant signal of Neolithic migration events (it makes sense to compare those results with Davidski's or Dienekes' estimates of African/Asian admixture levels in European populations).







Wednesday, May 25, 2011

MDL plots for selected participants/populations






Experimental analysis Part II: STRUCTURE+LAMP

In Part II of my experimental analysis, we are going to discuss a promising alternative to the standard ADMIXTURE analysis: a combination of STRUCTURE and LAMP. LAMP (Local Ancestry in adMixed Populations) is a software package for inferring locus-specific ancestry in recently admixed populations (Sriram Sankararaman, Srinath Sridhar, Gad Kimmel and Eran Halperin, Estimating Local Ancestry in Admixed Populations, The American Journal of Human Genetics, Volume 82, Issue 2, 290-303, 2008). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Basically, LAMP makes it possible to reconstruct fine-scale chromosomal patterns of admixture and visualize them in a manner similar to 23andMe's Ancestry Painting.
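The "majority vote over overlapping windows" idea can be sketched in a few lines. This is not LAMP's code: the per-window inference is replaced by ready-made labels, and the window coordinates and the labels "A" and "B" are hypothetical. Each SNP simply takes the label that most of the windows covering it agree on.

```python
from collections import Counter


def majority_vote(n_snps, window_calls):
    """Combine per-window ancestry labels into one label per SNP.

    window_calls: list of (start, end, label) with `end` exclusive.
    Returns None for SNPs not covered by any window.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in window_calls:
        for i in range(start, min(end, n_snps)):
            votes[i][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Hypothetical overlapping windows over 10 SNPs.
windows = [(0, 4, "A"), (2, 6, "A"), (4, 8, "B"), (6, 10, "B"), (3, 7, "B")]
calls = majority_vote(10, windows)
print(calls)
```

In this toy example the dissenting window at SNP 3 is outvoted, so the inferred boundary falls between SNPs 3 and 4.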
The input genotype file for LAMP can easily be converted from PLINK format by running the plink --recodeA command, removing the first line and the first six columns of the output file, and replacing NA with "-1". For my quick-and-dirty experiment with LAMP I used 8271 SNPs from Chr. 22, which I converted to the LAMP and STRUCTURE input formats. Since the LAMP algorithm for P>2 requires estimated allele frequencies for all populations in the dataset, I first ran the Bayesian clustering algorithm STRUCTURE on my dataset to estimate ancestral allele frequencies and assess admixture proportions within and among its populations (Russians, Orcadians, Romanians, Hungarians, Belarusians and Lithuanians). I also included the following participants of my project:


V158
V157
V160

V161
V162
V163
V164
V165
V201
V202

V166
V167
V168
V169
V170
V171
V172
V173
V174
V175
V176
V177
V178
V179
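The PLINK-to-LAMP conversion described above (drop the header line and the first six columns of the --recodeA output, replace NA with -1) can be sketched like this. The sample rows are invented, and writing the genotypes separated by single spaces is my assumption, so check the delimiter against the LAMP documentation before using the output.

```python
def raw_to_lamp(raw_text):
    """Convert the text of a PLINK --recodeA .raw file to a genotype matrix.

    Drops the header line and the six leading columns
    (FID IID PAT MAT SEX PHENOTYPE) and replaces NA with -1.
    """
    out_lines = []
    for line in raw_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if not fields:
            continue
        genotypes = fields[6:]  # keep only the per-SNP allele counts
        out_lines.append(" ".join("-1" if g == "NA" else g for g in genotypes))
    return "\n".join(out_lines)


# Hypothetical two-individual, three-SNP example.
sample_raw = """\
FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G rs3_T
F1 V158 0 0 1 -9 0 1 NA
F2 V160 0 0 2 -9 2 NA 0
"""
lamp_text = raw_to_lamp(sample_raw)
print(lamp_text)
```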

The results of this simplified STRUCTURE run (the same as an unsupervised ADMIXTURE analysis) for Chr. 22 are included in a Google spreadsheet:






After estimating the ancestral allele frequencies (gprob1, gprob2, gprob3, gprob4, etc.), I entered the recombination rate, the number of generations and individuals to analyze, the allele frequencies, and the estimated alpha values (the remaining options were left unchanged). Then I used LAMP to reconstruct, for each individual in our dataset, segments of different ancestry across Chr. 22 in all admixed and putative source-population individuals.



The visualization of ancestral segments


Experimental analysis Part I: Supervised ADMIXTURE (K=5) analysis

While I keep receiving 23andMe and FF raw data files, I decided to perform a quick supervised ADMIXTURE analysis and compare the results with the LAMP/STRUCTURE output.
To make the presentation clear and easy to follow, I will start with the simplest part of my analysis, the supervised ADMIXTURE (K=5) analysis. A supervised analysis allows more accurate estimation of individuals' ancestries by specifying the ancestries of the reference individuals.
I ran the supervised ADMIXTURE analysis with 6 reference populations: Orcadians and Russians (Vologda) from the HGDP project; and Romanians, Hungarians, Russians (Tver), Lithuanians and Belarusians from a public dataset (Behar DM, Yunusbayev B, Metspalu M, Metspalu E et al. The genome-wide structure of the Jewish people. Nature 2010 Jul 8;466(7303):238-42.). In our particular case, Orcadians serve as an abstract proxy for the whole NW European component, and Romanians and Hungarians as proxies for the Central European component (while Hungarians represent a more specific Subcarpathian component, we consider Romanians to have more genetic affinity to the SE (Balkan) component). Russians from Vologda define the North European component here, and finally, Lithuanians, Belarusians and Russians from Tver are included to represent the main genetic component of North-Eastern Europe.

Before manipulating the reference data in PLINK, I removed 2 pairs of close relatives (2 Orcadians and 2 Hungarians) and 2 Romanians with Roma admixture. Then I excluded SNPs with missing rates greater than 1% and performed SNP pruning based on the variance inflation factor and pairwise genotypic correlation. After the pruning, I kept SNPs with MAF >= 0.05 and with at most 1 missing allele per person. Then I performed linkage-disequilibrium-based pruning using a window size of 50, a step of 5 and an r^2 threshold of 0.3. That left me with a dataset of 121 individuals and circa 140K SNPs. 27 participants of the MDL project were included in the ADMIXTURE run:


V158
V157
V160


V161
V162
V163
V164
V165
V201
V202

V166
V167
V168
V169
V170
V171
V172
V173
V174
V175
V176
V177
V178
V179
V180
V181
V182

The results are in a Google spreadsheet.
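For readers curious about the per-SNP filters mentioned above, here is a minimal sketch of the two simplest ones: the missing-rate cutoff (1%) and the minor allele frequency cutoff (MAF >= 0.05). It assumes genotypes coded as 0/1/2 with -1 for missing, and the function name and toy data are my own. The real filtering was done in PLINK, this is not PLINK's implementation, and LD-based pruning is not covered.

```python
def passes_qc(snp_genotypes, max_missing=0.01, min_maf=0.05):
    """Check one SNP against missing-rate and MAF thresholds.

    snp_genotypes: list of 0/1/2 allele counts across individuals (-1 = missing).
    """
    n = len(snp_genotypes)
    observed = [g for g in snp_genotypes if g != -1]
    if n == 0 or not observed:
        return False
    missing_rate = 1 - len(observed) / n
    freq = sum(observed) / (2 * len(observed))
    maf = min(freq, 1 - freq)  # frequency of the rarer allele
    return missing_rate <= max_missing and maf >= min_maf


# Hypothetical SNPs: polymorphic, monomorphic, and one with a missing call.
snps = {
    "snp_ok": [0, 1, 2, 1],
    "snp_monomorphic": [0, 0, 0, 0],
    "snp_missing": [0, 1, -1, 2],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
print(kept)
```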