Magnus Ducatus Lituaniae Project: 2011-05-22

Saturday, May 28, 2011

Experimental test III: (De)constructing ancestry with LAMP

In my previous attempt to find statistical criteria by which Lithuanians and Belarusians might be separated from the rest of the dataset, I outlined some of the most important incongruities/ambiguities between the output of LAMP and the results of the supervised ADMIXTURE run. It might be argued that these ambiguities are partly explained by the evident differences between the population stratification algorithms implemented in LAMP and ADMIXTURE.

1) ADMIXTURE software implements a model-based approach that estimates ancestry coefficients as the parameters of a statistical model. It is also important to add that the model-based approach in ADMIXTURE is based on the global ancestry paradigm (i.e., the goal of an ADMIXTURE/STRUCTURE analysis is to estimate the proportion of ancestry from each contributing population, considered as an average over the individual's entire genome).

2) LAMP software is built upon an efficient dynamic-programming algorithm, WINPOP, that infers locus-specific ancestries. The genome is partitioned into chromosome segments of definite ancestral origin (overlapping, contiguous windows of SNPs), and a likelihood model is optimized over each window. The goal then is to find the segment boundaries and assign each segment's origin.
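To make the difference between the two paradigms concrete, here is a minimal sketch (not the actual code of either program) showing how a global ancestry estimate can be recovered from local ancestry calls by length-weighted averaging. The function name, the segment coordinates and the labels are all hypothetical and serve only as an illustration.

```python
from collections import Counter

def global_ancestry(segments):
    """Average local ancestry calls into global proportions.

    segments: list of (start_bp, end_bp, ancestry_label) tuples for one
    chromosome copy (hypothetical input format, not LAMP's own output).
    """
    lengths = Counter()
    for start, end, label in segments:
        lengths[label] += end - start
    total = sum(lengths.values())
    return {label: length / total for label, length in lengths.items()}

# Made-up local ancestry calls along a 10 Mb stretch.
segments = [
    (0, 4_000_000, "Lithuanian"),
    (4_000_000, 5_000_000, "Orcadian"),
    (5_000_000, 10_000_000, "Lithuanian"),
]
proportions = global_ancestry(segments)
print(proportions)
```

The reverse is not possible: the global proportions alone say nothing about where the segment boundaries lie, which is exactly the extra information a local ancestry method tries to provide.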

As we have already seen, for our project's particular purposes (i.e., the estimation of definite ancestry proportions in very closely related populations) LAMP's model has a number of advantages over ADMIXTURE's. I fully share Andres Palsen's confidence in LAMP's ability to separate the closely related populations of Scandinavia (Norwegians, Swedes, Danes).
Another advantage of LAMP's model is that it takes into account the physical positions of SNPs, the genetic distance in cM (centimorgans) between SNPs, and recombination events. Moreover, it allows one to estimate the ancestral origin of segments for different numbers of generations elapsed since the admixture events.

I used HapMap's list of recombination rates for Chromosome 22 and estimated allele frequencies with PLINK on a training set of 6 reference populations (Orcadians, Romanians, Lithuanians, Belarussians and Russians). Then I performed the analysis in LAMP for G=5, 10 and 25 (number of generations from the admixture events) and visualised the output (click on the attached pictures to obtain the original images).
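To give a feeling for why the choice of G matters, here is a rough sketch based on a textbook approximation rather than on LAMP's internal model. It assumes a single pulse of admixture G generations ago, so that breakpoints accumulate at roughly G per Morgan and the mean ancestry tract length is about 100/G cM. The function name is hypothetical.

```python
def expected_tract_length_cm(generations):
    """Approximate mean ancestry tract length in centimorgans.

    Assumes a single-pulse admixture model in which breakpoints accumulate
    at about one per Morgan per generation (a simplification).
    """
    return 100.0 / generations

# The three values of G used in this post.
for g in (5, 10, 25):
    print(f"G={g}: ~{expected_tract_length_cm(g):.1f} cM per tract")
```

Under this approximation a larger G means shorter expected segments, so the same chromosome should look more finely fragmented at G=25 than at G=5.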

I would suggest checking the results of the LAMP analysis of Chromosome 22 for all of the project's participants. It would be interesting to compare these results to AncestryFinder's visualizations or to the ADMIXTURE results of the Dodecad, Eurogenes and Diogenes projects.

As always, I am open to comments and criticism.

G=5 (5 generations from the admixture events)
Ancestral segments on Chromosome 22

Average ancestry for G=5

G=10 (10 generations from the admixture events)

Ancestral segments on Chromosome 22

Average ancestry for G=10

G=25 (25 generations from the admixture events)

Ancestral segments on Chromosome 22

Average ancestry for G=25

Thursday, May 26, 2011

Some comments on experimental test

Some participants of the project have requested the MDS source file in order to produce their own plots.

I have uploaded the PLINK MDS, CLUSTER and NEAREST files to GoogleDocs, so please feel free to download them (all files are in text format, so you can open them with a text editor or import them into Excel/Rgui). NEAREST is essentially one of the files (the other one is the MIBS matrix) that are used to produce the IBS Rdata object files distributed by Dienekes or Zack. I haven't had time to create the same object file, but you can import NEAREST into Excel and filter out the participants who are closest to you.
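For those who prefer a script to Excel, here is a minimal sketch of that filtering step. It assumes that the NEAREST file is whitespace-delimited with a header line containing the columns IID, NN and IID2 (as I understand PLINK's nearest-neighbour output); please check the header of the downloaded file before relying on it. The sample rows and the function name are made up for illustration.

```python
def closest_neighbours(nearest_text, my_iid, top=3):
    """Return the IDs of the `top` nearest neighbours of one participant.

    Assumes columns named IID (the individual), NN (neighbour rank,
    1 = nearest) and IID2 (the neighbour); these names are assumptions.
    """
    lines = nearest_text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    mine = [row for row in rows if row["IID"] == my_iid]
    mine.sort(key=lambda row: int(row["NN"]))
    return [row["IID2"] for row in mine[:top]]

# Made-up example content, not real project data.
sample = """
FID IID NN MIN_DST Z FID2 IID2 PROP_DIFF
F1 V158 2 0.74 0.10 F3 V160 0.0
F1 V158 1 0.75 0.30 F2 V157 0.0
F2 V157 1 0.75 0.30 F1 V158 0.0
"""
neighbours = closest_neighbours(sample, "V158", top=2)
print(neighbours)
```

In practice you would read the text from the downloaded file, for example with open(path).read(), instead of the inline sample.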

Several blog readers pointed out the incorrect representation of the Belarusian/Lithuanian components in the ADMIXTURE plot I published yesterday. I have fixed the plot by swapping the Belarusian/Lithuanian components so that the visualization is now correct.

A couple of the project's participants have expressed concern about the high % of the Hungarian component in their results. As indicated earlier, this component is not associated exclusively with the Hungarian population, and having a high % of this component doesn't necessarily imply distant Hungarian roots. Here, the Hungarian component is used as a suitable proxy for genetic Central-European (or, more specifically, Circum-Carpathian) ancestry. Personally, I am inclined to believe that this "Hungarian" component is high in individuals with Ukrainian ancestry or ancestry from Slovakia or southern Poland (when I have enough Ukrainian/Polish samples, I'll test this scenario).

As for the LAMP/STRUCTURE analysis, I would warn against reading too much into it. Since the analysis was performed on a single chromosome, Chromosome 22 (without thinning out the set of SNPs in linkage disequilibrium etc.), the results were shown for evaluation purposes only. The estimated components seem to pick up the more distant signal of Neolithic migration events (it makes sense to compare those results to Davidski's or Dienekes' estimates of African/Asian admixture levels in the European populations).

Wednesday, May 25, 2011

MDL plots for selected participants/populations

Experimental analysis Part II: STRUCTURE+LAMP

In Part II of my experimental analysis, we are going to discuss a promising alternative to the standard ADMIXTURE analysis. This alternative is the combination of STRUCTURE and LAMP analysis. LAMP (Local Ancestry in adMixed Populations) is software for the inference of locus-specific ancestry in recently admixed populations (Sriram Sankararaman, Srinath Sridhar, Gad Kimmel and Eran Halperin, Estimating Local Ancestry in Admixed Populations, The American Journal of Human Genetics, Volume 82, Issue 2, 290-303, 2008). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Basically, LAMP allows one to reconstruct fine-scale chromosomal patterns of admixture and visualize them in a manner similar to 23andMe's Ancestry Painting.
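The majority-vote idea can be illustrated with a short sketch. This is a simplification and not LAMP's actual implementation: it assumes that each window has already been assigned a single ancestry label, and that each SNP simply takes the label chosen by most of the windows covering it. The function name, window coordinates and labels are hypothetical.

```python
from collections import Counter

def majority_vote(n_snps, window_calls):
    """Combine per-window ancestry calls into one call per SNP.

    window_calls: list of (start, end, label) with `end` exclusive;
    windows may overlap. SNPs covered by no window get None.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, label in window_calls:
        for i in range(start, end):
            votes[i][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Four overlapping windows over 7 SNPs (made-up calls).
calls = [(0, 4, "A"), (1, 5, "A"), (3, 7, "B"), (4, 7, "B")]
per_snp = majority_vote(7, calls)
print(per_snp)
```

Overlapping the windows means that a single noisy window is outvoted by its neighbours, which is why the resulting segments look smoother than independent per-window calls would.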

The input genotype file for LAMP can be easily converted from PLINK by running the plink --recodeA command, removing the first line and the first six columns of the output file, and replacing NA with "-1". For my quick and dirty experiment with LAMP I used 8271 SNPs from Chr.22, which I converted to LAMP and STRUCTURE input formats. Since the LAMP algorithm for P>2 requires estimated allele frequencies for all populations in the dataset, I first ran the Bayesian clustering algorithm STRUCTURE on my dataset to estimate ancestral allele frequencies and to assess admixture proportions within and among its populations (Russians, Orcadians, Romanians, Hungarians, Belorussians and Lithuanians). I also included the following participants of my project:

V158
V157
V160
V161
V162
V163
V164
V165
V201
V202
V166
V167
V168
V169
V170
V171
V172
V173
V174
V175
V176
V177
V178
V179
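The PLINK-to-LAMP conversion described above (drop the header line, drop the first six columns, replace NA with -1) can be sketched as follows. This is only an illustration of those three steps: the function name and the sample rows are made up, and the space-separated output is an assumption, so check the delimiter against the LAMP documentation before using it.

```python
def raw_to_lamp(raw_text):
    """Convert the text of a PLINK --recodeA (.raw) file to genotype rows.

    Steps, as described in the post: skip the header line, drop the six
    leading columns (FID IID PAT MAT SEX PHENOTYPE), replace NA by -1.
    """
    converted = []
    for line in raw_text.strip().splitlines()[1:]:
        genotypes = line.split()[6:]
        converted.append(" ".join("-1" if g == "NA" else g for g in genotypes))
    return "\n".join(converted)

# Made-up two-individual, three-SNP example.
raw = """
FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G rs3_T
F1 V158 0 0 1 -9 0 1 NA
F2 V157 0 0 2 -9 2 NA 1
"""
lamp_text = raw_to_lamp(raw)
print(lamp_text)
```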

The results of this simplified STRUCTURE run (comparable to an unsupervised ADMIXTURE analysis) for Chr.22 are included in a Google spreadsheet:

After estimating the ancestral allele frequencies (gprob1, gprob2, gprob3, gprob4 etc.), I entered the recombination rate, the number of generations/individuals to analyze, the allele frequencies, and the estimated alpha values (the rest of the options remained unchanged). Then I used LAMP to reconstruct, for each individual in our dataset, the segments of different ancestry across Chr.22, in all admixed and putative source population individuals.

The visualization of ancestral segments

Experimental analysis Part I: Supervised ADMIXTURE (K=5) analysis

While I keep receiving 23andMe and FF raw data files, I decided to perform a quick supervised ADMIXTURE analysis and compare the results to the LAMP/STRUCTURE output.

In order to make the presentation clear and easy to follow, I will start with the simplest part of my analysis, which is the supervised ADMIXTURE (K=5) analysis. Supervised analysis allows a more accurate estimation of the ancestries of the individuals by specifying the ancestries of the reference individuals.

I ran the supervised ADMIXTURE analysis by selecting 6 reference populations: Orcadians and Russians (Vologda) from the HGDP project; Romanians, Hungarians, Russians (Tver), Lithuanians and Belorussians from a public dataset (Behar DM, Yunusbayev B, Metspalu M, Metspalu E et al. The genome-wide structure of the Jewish people. Nature 2010 Jul 8;466(7303):238-42). In our particular case, Orcadians serve as an abstract proxy for the whole NW European component, and Romanians and Hungarians as proxies for the Central-European component (while Hungarians represent a more specific Subcarpathian component, we consider Romanians to have more genetic affinity to the SE (Balkan) component). Russians from Vologda define the North-European component here, and finally, Lithuanians, Belorussians and Russians from Tver are included to represent the main genetic component in North-Eastern Europe.

Before manipulating the reference data in PLINK, I removed 2 pairs of close relatives (2 Orcadians and 2 Hungarians) and 2 Romanians with Roma admixture. Then I excluded SNPs with missing rates greater than 1% and performed SNP pruning based on the variance inflation factor and pairwise genotypic correlation. After the pruning, I kept SNPs with MAF >= 0.05 and with at most 1 missing allele per person. Then I performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and an r^2 threshold of 0.3. After that I had at my disposal a dataset with 121 individuals and circa 140K SNPs. 27 participants of the MDL project were included in the ADMIXTURE run:

V158

V157

V160

V161

V162

V163

V164

V165

V201

V202

V166

V167

V168

V169

V170

V171

V172

V173

V174

V175

V176

V177

V178

V179

V180

V181

V182

The results are in a Google spreadsheet.
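For readers who want to see what the per-SNP quality filters from the preprocessing step amount to, here is a minimal sketch of the missingness (at most 1%) and MAF (at least 0.05) checks. It is not a replacement for PLINK, which did the actual filtering. The genotype coding (0/1/2 copies of one allele, None for missing) and the function name are assumptions made for illustration.

```python
def snp_passes(genotypes, max_missing=0.01, min_maf=0.05):
    """Check one SNP against the missingness and MAF thresholds.

    genotypes: one entry per individual, 0/1/2 allele copies or None
    when the call is missing (hypothetical coding).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    freq = sum(called) / (2 * len(called))   # frequency of the counted allele
    maf = min(freq, 1 - freq)                # minor allele frequency
    return missing_rate <= max_missing and maf >= min_maf

# Three made-up SNPs typed in 100 individuals.
common_snp = [0, 1, 2, 1] * 25            # frequency 0.5, no missing calls
rare_snp = [0] * 99 + [1]                 # frequency 0.005, too rare
patchy_snp = [1] * 98 + [None] * 2        # 2% missing calls
print(snp_passes(common_snp), snp_passes(rare_snp), snp_passes(patchy_snp))
```

The LD-based pruning step is not covered by this sketch, since it depends on pairwise correlations between neighbouring SNPs rather than on each SNP in isolation.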
