Magnus Ducatus Lituaniae Project: 2011-10-02

Thursday, October 6, 2011

Analyzing admixture in phased vs. unphased datasets

Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because of its many applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation in different populations. Haplotype phase can be determined through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing method that is available in the BEAGLE software, focusing in particular on using its output in ADMIXTURE analysis.

For simplicity's sake we selected individuals from the "Balto-Slavic" cluster (the cluster attribution of individuals was inferred with Dienekes's Mclust approach using 11 MDS dimensions), which is the major cluster of our project. Here is the full list of IDs of the selected project participants:

V158
V157
V160
V202
V169
V170
V171
V174
V176
V177
V180
V181
V188
V189
V196
V205
V208
V211
V215
V218
V220
V221
V222
V228
V225
V232
V236
V237
V235
V231
V244
V246
V238

We thinned the genotype data of the selected individuals to c. 100,000 SNPs, removing SNPs in strong LD and low-quality SNPs. After that we used the GERMLINE pipeline for phasing PLINK-format data with BEAGLE and processing it in GERMLINE (phasing was performed within a homogeneous population). Then, in order to assess possible discrepancies between phased and unphased data, we performed ADMIXTURE analysis (with K=4 assumed clusters) separately for the original unphased dataset and the BEAGLE-phased dataset.

To our surprise, we were not able to find the expected significant differences between phased and unphased multi-SNP marker genotypes (the differences are in the range of c. 1-5%).

Unphased data:

Phased data:

Spreadsheet with ADMIXTURE results can be found here.
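For readers who want to repeat the phased vs. unphased comparison on their own runs, here is a minimal Python sketch of how the per-individual differences could be computed. It assumes two ADMIXTURE ancestry-proportion matrices (one row per individual, one column per cluster) have already been loaded into lists; the function names and the example numbers are hypothetical and are not taken from the spreadsheet. Because the order of clusters can differ between two ADMIXTURE runs, the sketch first tries every column permutation (cheap at K=4) and keeps the one that matches best.

```python
from itertools import permutations


def align_components(q_ref, q_other):
    """Reorder the columns of q_other so that they best match q_ref.

    All K! column permutations are tried, which is fine for small K.
    """
    k = len(q_ref[0])

    def cost(perm):
        # Total absolute difference if column perm[j] of q_other
        # is matched to column j of q_ref.
        return sum(
            abs(ref_row[j] - other_row[perm[j]])
            for ref_row, other_row in zip(q_ref, q_other)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=cost)
    return [[row[best[j]] for j in range(k)] for row in q_other]


def max_abs_difference(q_unphased, q_phased):
    """Largest per-component difference for each individual."""
    aligned = align_components(q_unphased, q_phased)
    return [
        max(abs(u - p) for u, p in zip(u_row, p_row))
        for u_row, p_row in zip(q_unphased, aligned)
    ]


# Hypothetical proportions for two individuals at K=4 (not real MDLP data).
q_unphased = [[0.70, 0.20, 0.05, 0.05],
              [0.10, 0.60, 0.20, 0.10]]
# Same individuals, with the clusters reported in a different order.
q_phased = [[0.22, 0.68, 0.05, 0.05],
            [0.58, 0.12, 0.10, 0.20]]

differences = max_abs_difference(q_unphased, q_phased)
print(differences)
```

A value of 0.02 in the resulting list would correspond to a 2% shift in the most affected component of that individual.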

Tuesday, October 4, 2011

Simulated SNP-populations of MDLP

Yesterday I set out to repeat the "simulation" experiments with my project's SNP dataset, using the PLINK simulation technique first described (in a population-genetics context) by Dienekes (analogous experiments were performed by the Harappa DNA BGA project and the Eurogenes BGA project).
Synthetic "ancestral" populations (Altaic, Anatolian-Balkanian, Balto-Slavic, North-Atlantic, Scandinavian, Volga-Uralic and Celto-Germanic) were simulated using PLINK's standard simulation routine, with each "synthetic" population comprising 5 generated synthetic individuals:

plink --simulate wgas1.sim --make-bed --out sim11

plink --simulate wgas2.sim --make-bed --out sim12

etc.

In the data simulation, we assumed that each of the 7 clusters, defined by a specific combination of allele frequencies at c. 100,000 SNPs (obtained from an unsupervised ADMIXTURE run at K=7), represents one ancestral population.
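As an illustration of how per-cluster allele frequencies might be turned into a simulation parameter file, here is a small Python sketch. It rests on my reading of the PLINK 1.x --simulate file layout (number of SNPs, label, lower frequency, upper frequency, heterozygote odds ratio, homozygote odds ratio), so the column order and the allele orientation should be checked against the PLINK documentation before use. The helper name sim_lines and the example frequencies are hypothetical. Setting the lower and upper frequency bounds to the same value pins each SNP to the frequency estimated for that cluster, and odds ratios of 1 mean no disease effect is simulated.

```python
def sim_lines(freqs, label_prefix="snp"):
    """Build one parameter line per SNP with a fixed allele frequency.

    ASSUMPTION: six whitespace-separated columns per line, as I understand
    the PLINK 1.x --simulate format:
    count, label, lower freq, upper freq, het odds ratio, hom odds ratio.
    """
    lines = []
    for i, p in enumerate(freqs):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"frequency out of range at SNP {i}: {p}")
        # Lower bound == upper bound fixes the frequency;
        # odds ratios of 1.00 leave the SNP unassociated with any phenotype.
        lines.append(f"1 {label_prefix}{i} {p:.4f} {p:.4f} 1.00 1.00")
    return lines


# Hypothetical frequencies for three SNPs in one "ancestral" cluster.
example = sim_lines([0.25, 0.5, 0.9], label_prefix="bs")
print("\n".join(example))
```

Writing the returned lines to a file, one file per cluster, would give inputs of the same kind as wgas1.sim and wgas2.sim above.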

Since I was interested in the PCA loadings of the "ancestral" populations, I used Eigensoft to explicitly model the differences between components along continuous axes of variation. The calculated PCA loadings were then visualized as an interactive biplot in the R package BiplotGUI using the following command:

> Biplots(Data = PCA[, -1], groups = PCA[, 1])

Afterwards I performed three statistical tests on the imported PCA loadings: linear regression, circular regression and Procrustes analysis.
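To give an idea of what the Procrustes step does, here is a simplified Python sketch for two principal components. It is not the routine I ran: it only centers both point sets and finds the best rotation (no scaling or reflection), using the closed-form angle that exists in two dimensions. The coordinates are hypothetical; the target configuration is simply the source rotated by 30 degrees and shifted, so the function should recover that angle with a residual near zero.

```python
import math


def center(points):
    """Shift a list of (x, y) points so that their centroid is the origin."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return [(x - mean_x, y - mean_y) for x, y in points]


def procrustes_rotation(source, target):
    """Rotation-only 2D Procrustes fit of source onto target.

    Returns (angle in radians, residual sum of squares after rotation).
    """
    a = center(source)
    b = center(target)
    # Closed-form least-squares rotation angle in two dimensions.
    num = sum(v * x - u * y for (x, y), (u, v) in zip(a, b))
    den = sum(u * x + v * y for (x, y), (u, v) in zip(a, b))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(c * x - s * y, s * x + c * y) for x, y in a]
    residual = sum(
        (rx - u) ** 2 + (ry - v) ** 2
        for (rx, ry), (u, v) in zip(rotated, b)
    )
    return theta, residual


# Hypothetical PC1/PC2 coordinates of five synthetic individuals.
source = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (2.0, 1.0)]
angle = math.radians(30)
# Target = source rotated by 30 degrees and translated.
target = [
    (math.cos(angle) * x - math.sin(angle) * y + 5.0,
     math.sin(angle) * x + math.cos(angle) * y - 2.0)
    for x, y in source
]

theta, residual = procrustes_rotation(source, target)
print(math.degrees(theta), residual)
```

A large residual after the fit would indicate that two PCA configurations differ by more than a rotation.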

MDLP modification of DIYDodecad calculator: additional instructions/ideas

I have come up with another idea of how to use the estimated allele frequencies of my project for inferring the origin of shared HIR (half-identical region) segments. Suppose, for example, that a Lithuanian/Belorussian and a Norwegian share an HIR segment. This could reflect:

1) Balto-Slavic-like ancestry in the Norwegian individual
2) Scandinavian-like ancestry in the Lithuanian individual
3) third-party ancestry in both individuals

Using the byseg 500 50 mode in DIYDodecad, the AncestryFinder file, and the MDLP allele frequencies, I was able to predict the origin of an HIR segment by picking one of the three scenarios:

1) if the Lithuanian sees an excess of Scandinavian, then he should pick the second scenario
2) if he sees nothing remarkable, the first one
3) if he sees an excess of some component that is (relatively) low in both the Lithuanian and the Norwegian (e.g., Caucasian-Anatolian), the third.
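The three rules above can be written down as a small decision function. The Python sketch below is only an illustration: it assumes that the admixture proportions for the shared segment (for example, from a byseg window) and the genome-wide proportions of the Balto-Slavic individual are already available as dictionaries, and the 0.15 excess threshold, the function name and all the numbers are hypothetical. The returned scenario numbers follow the numbered list of three scenarios given earlier.

```python
def classify_shared_segment(segment, baseline, partner_component,
                            rare_components, min_excess=0.15):
    """Pick scenario 1, 2 or 3 for a shared segment.

    segment / baseline: component name -> proportion, both seen from the
    Balto-Slavic individual's side (segment-level vs. genome-wide).
    partner_component: the component typical of the matching person.
    rare_components: components that are low in both individuals.
    min_excess: hypothetical cut-off for calling an excess.
    """
    excess = {
        name: segment.get(name, 0.0) - baseline.get(name, 0.0)
        for name in segment
    }
    # Rule 1: excess of the partner's component -> second scenario.
    if excess.get(partner_component, 0.0) >= min_excess:
        return 2
    # Rule 3: excess of a component that is low in both -> third scenario.
    if any(excess.get(name, 0.0) >= min_excess for name in rare_components):
        return 3
    # Rule 2: nothing remarkable -> first scenario.
    return 1


# Hypothetical genome-wide proportions of a Balto-Slavic individual.
baseline = {"Balto-Slavic": 0.70, "Scandinavian": 0.10,
            "Caucasian-Anatolian": 0.02, "Other": 0.18}

# Three hypothetical segments.
scandinavian_rich = {"Balto-Slavic": 0.40, "Scandinavian": 0.45,
                     "Caucasian-Anatolian": 0.02, "Other": 0.13}
third_party = {"Balto-Slavic": 0.50, "Scandinavian": 0.10,
               "Caucasian-Anatolian": 0.30, "Other": 0.10}
unremarkable = dict(baseline)

results = [
    classify_shared_segment(seg, baseline, "Scandinavian",
                            ["Caucasian-Anatolian"])
    for seg in (scandinavian_rich, third_party, unremarkable)
]
print(results)
```

The threshold is the weak point of such a rule: short segments give noisy proportions, so in practice a graded label (such as the percentages attached to some of the matches) may be more honest than a hard cut-off.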

After analyzing my "Scandinavian" matches from the AF file, I can conclude that the DIYDodecad analysis (in byseg mode) reveals the presence of all three possible scenarios:

Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
X Sweden Sweden United States United States 1 156.4 160 3.6 5.6
60%-Scenario nr.2 (Scandinavian-like ancestry in the Balto-Slavic individual)
X Denmark Denmark Denmark Denmark 1 87.9 94.3 6.4 6.3
Scenario nr.2 (Scandinavian-like ancestry in the Balto-Slavic individual)
X Norway Norway Norway Norway 1 62.1 65.4 3.3 5
Scenario nr.2 (Scandinavian-like ancestry in the Balto-Slavic individual)
Anonymous0454 Finland Finland Finland Finland 1 104.8 110.3 5.5 6.4
Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
X Sweden Not Provided Not Provided Not Provided 2 82.2 88.1 5.9 5
60%-Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
X Sweden Sweden United States United States 3 174.9 178.1 3.2 5
80%-Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
Anonymous0353 Denmark Denmark Denmark Denmark 3 68.4 73.3 4.9 8.1
50%-Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
Anonymous0245 Denmark Denmark Denmark Denmark 4 70.5 77.3 6.8 5.3

HLA-MHC group - Scenario nr.3 (third-party (Celto-Germanic) ancestry in both individuals)
X Sweden Sweden United States United States 6 25.8 34.2 8.4 5.4
X Sweden Denmark United States United States 6 27.6 34.5 6.9 5.1
Anonymous0207 Sweden Sweden Not Provided Not Provided 6 25.6 34.1 8.5 5.3
X Norway Not Provided Belgium Not Provided 6 29.7 36.1 6.4 5.3
Anonymous0370 Sweden Sweden Not Provided Sweden 6 30.7 36.6 5.9 5.4
X Denmark Denmark Denmark Denmark 6 25.6 35.5 9.9 5.9
Anonymous0439 Denmark Not Provided United States Hungary 6 26.3 36 9.7 6
X Denmark Denmark Denmark Denmark 6 26 34 8 5.1
X Norway Norway Norway Norway 6 25.6 34.2 8.6 5.6

Chr. 14 group - 70%-Scenario nr.1 (Balto-Slavic-like ancestry in the Scandinavian individual)
Anonymous0001 Norway Norway Norway Norway 14 38 48.8 10.8 6.5
Anonymous0037 Sweden Norway United States United States 14 39.2 48 8.8 5.1

Scenario nr.2 (Scandinavian-like ancestry in the Balto-Slavic individual)
Anonymous0002 Denmark Denmark United States United States 15 43.9 51.3 7.4 6