Magnus Ducatus Lituaniae Project

Thursday, November 15, 2012

New add-on tool to MDLP World-22 calculator

Alex_Axe from the Molgen forum announced the release of a new tool, Oracle_AdMix4.

Basically, you can think of this tool as a useful add-on that extends the Oracle two-population results into a four-population format.

Currently the tool supports only the DOS/Windows platform, and its usage is quite straightforward. The current distribution of Oracle_AdMix4 includes the required file (data.txt) with the frequencies of the 22 components in each of the reference populations. The second file, input.txt, is the one you need to modify in order to use the tool on your own World-22 calculator results:

30     {Quantity of output results}
1,5    {Threshold of components to ignore noise}
0,5    {Threshold of method to ignore noise}
2      {Power parameter for weighted method or zero for least-squares method}
0      Pygmy
15     West-Asian
0      North-European-Mesolithic
0      Indo-Tibetan
0      Mesoamerican
0      Arctic-Amerind
0      South-America_Amerind
5      Indian
0      North-Siberean
25     Atlantic_Mediterranean_Neolithic
5      Samoedic
0      Indo-Iranian
0      East-Siberean
40     North-East-European
0      South-African
0      North-Amerind
0      Sub-Saharian
0      East-South-Asian
10     Near_East
0      Melanesian
0      Paleo-Siberian
0      Austronesian

All you need to do is replace the default component values with your own World-22 calculator results.
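For those who prefer not to edit the file by hand, here is a minimal Python sketch that builds the text of input.txt from a dictionary of results. It assumes the layout shown above (four parameter lines followed by the 22 components in this fixed order) and that fractional values take a decimal comma like the thresholds do. The helper names `build_input` and `format_value` are hypothetical and not part of Oracle_AdMix4.

```python
# Component order copied from the sample input.txt shown above.
COMPONENTS = [
    "Pygmy", "West-Asian", "North-European-Mesolithic", "Indo-Tibetan",
    "Mesoamerican", "Arctic-Amerind", "South-America_Amerind", "Indian",
    "North-Siberean", "Atlantic_Mediterranean_Neolithic", "Samoedic",
    "Indo-Iranian", "East-Siberean", "North-East-European", "South-African",
    "North-Amerind", "Sub-Saharian", "East-South-Asian", "Near_East",
    "Melanesian", "Paleo-Siberian", "Austronesian",
]

# Parameter lines copied verbatim from the sample file.
HEADER = [
    "30     {Quantity of output results}",
    "1,5    {Threshold of components to ignore noise}",
    "0,5    {Threshold of method to ignore noise}",
    "2      {Power parameter for weighted method or zero for least-squares method}",
]


def format_value(value):
    """Format a percentage with a decimal comma (assumed from the thresholds)."""
    text = f"{value:.2f}".rstrip("0").rstrip(".")
    return text.replace(".", ",")


def build_input(results):
    """Return the text of input.txt for a dict {component name: percent}."""
    unknown = set(results) - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    lines = list(HEADER)
    for name in COMPONENTS:
        # Components missing from the dict are written as 0.
        lines.append(f"{format_value(results.get(name, 0)):<7}{name}")
    return "\n".join(lines) + "\n"


# Example: the default values from the sample file.
sample = {
    "West-Asian": 15, "Indian": 5, "Atlantic_Mediterranean_Neolithic": 25,
    "Samoedic": 5, "North-East-European": 40, "Near_East": 10,
}
input_text = build_input(sample)
```

Writing `input_text` to input.txt next to the executable should then be all that is needed, provided the assumptions about the layout hold.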

The output should appear in the following format:

1 Mordovian+Russian_South+Swedish+Swedish @ 20,749639
2 Estonian+Mordovian+Swedish+Swedish @ 23,267935
3 Finnish-North+Mordovian+Polish+Swedish @ 23,656645
4 Finnish-North+Mordovian+Russian_South+Swedish @ 23,83736
5 Estonian+Finnish-South+Mordovian+Swedish @ 24,192354
6 Mordovian+Polish+Swedish+Swedish @ 24,846391
7 Estonian+German-North+Mordovian+Swedish @ 25,176036
8 Lithuanian+Mordovian+Swedish+Swedish @ 25,511204
9 Finnish-North+Lithuanian+Mordovian+Swedish @ 25,523722
10 Finnish-South+Lithuanian_V+Mordovian+Swedish @ 25,539837
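If you want to post-process the results, the output lines can be parsed with a few lines of Python. This is only a sketch based on the ten lines shown above (rank, four population names joined by "+", then "@" and a distance with a decimal comma); the function name `parse_oracle_line` is hypothetical.

```python
from collections import Counter


def parse_oracle_line(line):
    """Parse '1 A+B+C+D @ 20,749639' into (rank, populations, distance)."""
    left, _, right = line.partition("@")
    rank_text, _, mix = left.strip().partition(" ")
    populations = mix.strip().split("+")
    # The tool prints a decimal comma, so convert it before calling float().
    distance = float(right.strip().replace(",", "."))
    return int(rank_text), populations, distance


rank, populations, distance = parse_oracle_line(
    "1 Mordovian+Russian_South+Swedish+Swedish @ 20,749639"
)
# Count how many of the four slots each population occupies.
slots = Counter(populations)
```

A population that appears twice (Swedish in the first result) simply occupies two of the four slots.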

Sunday, November 11, 2012

The levels and dating of admixture in Belarusians

Following Dienekes's suggestion to use Pathan and Lithuanian samples as references for ROLLOFF analysis, I decided to make a second attempt at a formal analysis of admixture and the dating of admixture events in the Belarusian samples available to me: the reference dataset of Belarusians from Behar et al. 2011 and the Belarusian samples collected by our project.

Below are the results of the experiment, which I consider less noisy than my previous attempt.

valid snps: 746877
group 0 Lithuanian
group 1 Pathan
number admixed: 13 number of references: 2
numsnps: 746877 numindivs: 55
starting main loop. numsnps: 158101

Summary of fit:

Formula: wcorr ~ (C + A * exp(-m * dist/100))

Parameters:
   Estimate Std. Error t value Pr(>|t|)
C 2.332e-04 3.029e-04   0.770 0.44165
A 3.306e-02 1.227e-02   2.695 0.00728 **
m 1.169e+02 3.851e+01   3.037 0.00252 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.006508 on 493 degrees of freedom

Number of iterations to convergence: 0
Achieved convergence tolerance: 9.103e-06

mean (generations): 116.9416

jackknife (generations) 105.086+-52.591

The date of the admixture event in the Belarusian_V sample, with Lithuanian and Pathan as reference populations, appears to be very close to the date which was estimated by Dienekes for Lithuanian [Lithuanian_D;Pathan].
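To get a feeling for what the fitted curve means, the formula from the summary can be evaluated directly. The sketch below plugs in the point estimates of C, A and m printed above; it assumes that dist is measured in centimorgans (so that dist/100 is in Morgans) and that m is the number of generations. The function names are hypothetical.

```python
import math

# Point estimates copied from the fit summary above.
C = 2.332e-04
A = 3.306e-02
M = 1.169e+02  # decay rate, read here as generations since admixture


def wcorr(dist_cm, c=C, a=A, m=M):
    """Evaluate the fitted curve C + A * exp(-m * dist / 100)."""
    return c + a * math.exp(-m * dist_cm / 100.0)


def half_distance_cm(m=M):
    """Distance at which the exponential term drops to half its value at 0."""
    return 100.0 * math.log(2) / m


d_half = half_distance_cm()
```

With m close to 117, `d_half` comes out at roughly 0.6 cM, which illustrates why an old admixture signal is only visible at very short genetic distances.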

Inference of Admixture Parameters in Belarusians using Weighted Linkage Disequilibrium

On 1 November 2012 Po-Ru Loh, Mark Lipson, Nick Patterson, Priya Moorjani, Joseph K Pickrell, David Reich, Bonnie Berger announced and published their new paper, in which they introduced a new approach that harnesses the exponential decay of admixture-induced linkage disequilibrium (LD) as a function of genetic distance. They proposed a new weighted LD statistic that can be used to infer mixture proportions as well as dates with fewer constraints on reference populations than previous methods.

I haven't had enough time to investigate this method in full, but I used the ALDER software package, which implements the weighted LD statistic, for a quick and dirty experiment in dating the admixture events in the Belarusian sample:

Sunday, October 14, 2012

ROLLOFF analysis of Poles, Belarusians and Russians from central regions of Russia

A month ago the renowned Reich lab released an alpha version of ADMIXTOOLS version 1.0. The alpha version of the package was developed for in-house use, so the operating routine is not always self-explanatory. The good thing, however, is that the ADMIXTOOLS package maintains full format compatibility with EIGENSOFT, another very well-known software package developed by the same lab. This makes the learning curve for ADMIXTOOLS much gentler.

The package features 6 cool programs, of which I find qp3Pop and rolloff the most useful. Given the limits of this post, I am not going to discuss qp3Pop in detail; for the purposes of my presentation it suffices to say that this program implements the three-population (f_3) test for treeness of populations from Reich et al. 2009.

Instead, I'd suggest reading the ADMIXTOOLS supporting material, Dienekes' posts and Reich's paper to get an idea of what the f_3 test is about.

The ROLLOFF method, however, needs a closer look. This is a method that measures time since admixture. It does so by looking at the linkage disequilibrium between SNPs due to admixture. Recall the standard definition: linkage disequilibrium (hereafter LD) is the nonrandom association between two alleles such that certain combinations are more likely to occur together. As two SNPs get farther apart, we expect there to be less admixture LD. The rate of decline of admixture LD is directly related to the number of generations since admixture, since that indicates how many recombinations have occurred between any two SNPs. In short: ROLLOFF fits an exponential curve to a plot of admixture LD vs. distance, and uses the rate of exponential decline to calculate the generations since admixture. Given that one generation is roughly equal to 29 years, one can convert the number of generations since admixture into years.
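The conversion from generations to years is a simple multiplication, applied to both the estimate and its standard error. A minimal sketch, using the 29 years per generation mentioned above and the Lithuanian estimate from Dienekes that is quoted later in this post (the function name is hypothetical):

```python
YEARS_PER_GENERATION = 29  # generation length used in this post


def generations_to_years(generations, std_error,
                         years_per_generation=YEARS_PER_GENERATION):
    """Scale an admixture date and its standard error from generations to years."""
    return (generations * years_per_generation,
            std_error * years_per_generation)


# Dienekes' estimate for Lithuanians: 200.350 +/- 61.608 generations.
years, years_error = generations_to_years(200.350, 61.608)
```

This gives about 5,810 years with an error of about 1,790 years, matching the figures in the quote.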

Dienekes has already tested the ADMIXTOOLS programs on various worldwide populations; among others, he has carried out f_3 and rolloff analyses of Poles, Lithuanians, and Ukrainians.

Below are words from the man himself:

Using the aforementioned idea, I set out to see whether Lithuanians, who occupy the European end of the Europe-South Asia cline present such a signal of admixture LD. I used the Lithuanian_D sample from the Dodecad Project and the Balochi HGDP sample as reference populations (to calculate allele frequency differences), and the Behar et al. (2010) Lithuanians for admixture LD. There were only ~300k SNPs usuable in this set, but sufficient to detect the signal of admixture LD:

The admixture time estimate is 200.350 +/- 61.608 generations, or 5,810 +/- 1790 years. This is not very precise, probably because of the small number of SNPs and individuals used, but it certainly points to the Neolithic-to-Bronze Age for the occurrence of this admixture. The date is certainly reminiscent of the expansion of the Kurgan culture out of eastern Europe, or, the later Corded Ware culture of northern Europe.

So, it may well appear that at least some of the people participating in these groups of cultures, were indeed influenced by the Indo-Europeans as they expanded from their West Asian homeland. These intruders mixed with eastern Europeans who vacillated during the late Neolithic between a northern Europeoid pole akin to Mesolithic hunter gatherers from Gotland and Iberia, and a widely dispersed Sardinian-like population that is in evidence at least in the Sweden-Italian Alps-Bulgaria triangle. The gradual appearance of non-mtDNA U related lineages in Siberia and Ukraine is most likely related to this phenomenon.

......

I have carried out rolloff analysis of my 25-strong Polish_D sample using Lithuanians and Pathans as references:

The signal is fairly distinct, and corresponds to 149.296 +/- 38.783 generations or 4330 +/- 1120 years. I am guessing that either the different reference population (Pathans vs. Balochi), or, more likely the increased number of target individuals (25 vs. 10) have contributed to the narrowing down of the uncertainty. It will be interesting to explore this signal further with more population pairs.

...

I have used the Yunusbayev et al. sample of Ukrainians, and estimated its admixture time using Lithuanians and Balochi as reference populations: The admixture time estimate is 191.078 +/- 35.079 generations, or 5,540 +/- 1,020 years. It seems very similar to that in Lithuanians, with a smaller standard error, perhaps on account of either the larger number of SNPs or larger number of individuals.

It is tempting to associate this admixture signal with the Maikop culture which appeared at around this time. Assuming that North_European/West_Asian (or Lithuanian-like and Balochi-like) gene pools existed north and south of the Pontic-Caspian-Caucasus set of geographical barriers, then the Maikop culture which shows links to both the early Transcaucasian culture and those of Eastern Europe would have been an ideal candidate region for the admixture picked up by rolloff to have taken place. There are, of course, other possibilities.

As always, Dienekes' analysis drew a lot of criticism from another genome blogger, Davidski from Eurogenes. In his latest post he argued that it’s difficult to say what this experiment was testing exactly because Pathans aren’t pure West Asians and Lithuanians aren’t pure Mesolithic Europeans. He also claimed that Dienekes' interpretations are wrong, because f3-statistics and rolloff tests are basically picking up (belated) signals of the Mesolithic and Neolithic peopling of Europe.

Since the aforementioned populations have the strongest presence in my dataset, and there is no consensus among genome bloggers on how to interpret the ADMIXTOOLS results, I've decided to add my 5 cents and test ADMIXTOOLS myself.

For the purposes of this analysis, I created an ad-hoc dataset which includes 750 000 SNPs in samples from 250 worldwide populations. Next, I made 3*62 000 trios of the form (X,Y; Z), where X and Y are two paired reference populations, and Z is one of three populations: central Russians, Poles and Belarusians. After that I carried out a qp3Pop analysis of those trios.
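The list of trios can be generated mechanically. The sketch below is an assumption about how such a list might be built, not a record of the script actually used: it pairs every two distinct populations other than the target, in both orders. With 250 populations this gives 249*248 = 61,752 trios per target, which is consistent with the roughly 62 000 quoted above and would also explain why every result appears twice with X and Y swapped. The one-trio-per-line layout (source 1, source 2, target) follows my understanding of the qp3Pop population list file; please check it against the ADMIXTOOLS documentation.

```python
from itertools import permutations


def make_trios(populations, targets):
    """Return (X, Y, Z) trios: all ordered pairs of sources for each target Z."""
    trios = []
    for target in targets:
        sources = [p for p in populations if p != target]
        for x, y in permutations(sources, 2):
            trios.append((x, y, target))
    return trios


def trios_to_text(trios):
    """One trio per line, whitespace separated."""
    return "\n".join(f"{x} {y} {z}" for x, y, z in trios) + "\n"


# Tiny illustration with five population labels.
example = make_trios(
    ["Estonian", "Latvian", "Italian-North", "Polish", "Belarusian"],
    ["Polish", "Belarusian"],
)
```

Using `itertools.combinations` instead of `permutations` would halve the list, since the statistic does not depend on the order of X and Y.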

From the obtained results I have picked only those with a significantly negative Z-score:

Poles:

X    Y    Z (target)    f_3    std. err    Z-score

Estonian    Jew-Iraqi    Polish    -0.002039    0.000179    -11.368
Jew-Iraqi    Estonian    Polish    -0.002039    0.000179    -11.368
Italian-North    Latvian    Polish    -0.001211    0.000109    -11.098
Latvian    Italian-North    Polish    -0.001211    0.000109    -11.098
Estonian    Italian-North    Polish    -0.001023    0.000093    -11.037
Italian-North    Estonian    Polish    -0.001023    0.000093    -11.037
Estonian    Jew-Iran    Polish    -0.001861    0.000172    -10.831
Jew-Iran    Estonian    Polish    -0.001861    0.000172    -10.831
Armenian    Estonian    Polish    -0.001425    0.000136    -10.505
Estonian    Armenian    Polish    -0.001425    0.000136    -10.505
Italian-South    Latvian    Polish    -0.001344    0.000129    -10.458
Latvian    Italian-South    Polish    -0.001344    0.000129    -10.458
Cypriot    Estonian    Polish    -0.001626    0.000161    -10.113
Estonian    Cypriot    Polish    -0.001626    0.000161    -10.113

Russians_Central

North_Amerind    Sardinian   Russian_Center   -0.004202   0.000479   -8.779
Sardinian    North_Amerind Russian_Center   -0.004202   0.000479   -8.779
Basque   Ket   Russian_Center   -0.003771   0.000444   -8.493
Ket   Basque   Russian_Center   -0.003771   0.000444   -8.493
Karitiana   Sardinian   Russian_Center   -0.005947   0.000704   -8.453
Sardinian   Karitiana   Russian_Center   -0.005947   0.000704   -8.453
Pima   Sardinian   Russian_Center   -0.004908   0.000605   -8.117
Sardinian   Pima   Russian_Center   -0.004908   0.000605   -8.117
Ket   Sardinian   Russian_Center   -0.004295   0.000552   -7.786
Sardinian   Ket   Russian_Center   -0.004295   0.000552   -7.786
Lithuanian   Oroqen   Russian_Center   -0.00344   0.000445   -7.731
Oroqen   Lithuanian   Russian_Center   -0.00344   0.000445   -7.731
Basque   Karitiana   Russian_Center   -0.004629   0.000609   -7.596
Karitiana   Basque   Russian_Center   -0.004629   0.000609   -7.596
Basque   Pima   Russian_Center   -0.003711   0.000492   -7.545
Pima   Basque   Russian_Center   -0.003711   0.000492   -7.545

North_Amerind    Sardinian   Russian_Center   -0.003465   0.000462   -7.505
Sardinian    North_Amerind Russian_Center   -0.003465   0.000462   -7.505
Basque   Nganassan   Russian_Center   -0.003574   0.000478   -7.471

Belarusians

Indian    Polish    Belarusian    -0.000736    0.000251    -2.935
Polish    Indian    Belarusian    -0.000736    0.000251    -2.935
Karitiana    Sardinian    Belarusian    -0.001278    0.000517    -2.471
Sardinian    Karitiana    Belarusian    -0.001278    0.000517    -2.471
Otzi    North_Amerind    Belarusian    -0.002556    0.001126    -2.271
Cirkassian    Polish    Belarusian    -0.000488    0.000231    -2.113
Polish    Cirkassian    Belarusian    -0.000488    0.000231    -2.113
Pima    Otzi    Belarusian    -0.002727    0.00137    -1.99
Pima    Sardinian    Belarusian    -0.000794    0.000431    -1.843
Sardinian    Pima    Belarusian    -0.000794    0.000431    -1.843
Otzi    Surui    Belarusian    -0.002938    0.001931    -1.522
Surui    Otzi    Belarusian    -0.002938    0.001931    -1.522
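Picking rows out of tables like the ones above is easy to script. The following sketch parses whitespace-separated rows in the column order shown (X, Y, target, f_3, standard error, Z-score), collapses the mirrored X/Y duplicates and keeps the rows below a Z-score cutoff. The default cutoff of -3 is an assumption on my side (a commonly used convention), not the threshold used for the tables; the function names are hypothetical.

```python
def parse_f3_row(line):
    """Split a row into (x, y, target, f3, stderr, z)."""
    x, y, target, f3, stderr, z = line.split()
    return x, y, target, float(f3), float(stderr), float(z)


def significant_trios(rows, z_threshold=-3.0):
    """Keep one row per unordered {X, Y} pair with Z below the threshold."""
    seen = set()
    kept = []
    for x, y, target, f3, stderr, z in rows:
        # Mirrored rows share the same pair, target and Z-score.
        key = (frozenset((x, y)), target, z)
        if z < z_threshold and key not in seen:
            seen.add(key)
            kept.append((x, y, target, f3, stderr, z))
    # Most negative Z-score first.
    return sorted(kept, key=lambda row: row[-1])


rows = [parse_f3_row(line) for line in [
    "Estonian    Jew-Iraqi    Polish    -0.002039    0.000179    -11.368",
    "Jew-Iraqi    Estonian    Polish    -0.002039    0.000179    -11.368",
    "Italian-North    Latvian    Polish    -0.001211    0.000109    -11.098",
]]
kept = significant_trios(rows)
```

The cutoff is a parameter precisely because the choice matters: a stricter value leaves far fewer trios for the populations with weaker signals.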
Discussion

At first glance, the results of my ad-hoc experiment with qp3Pop seem to be consistent with the findings in the Patterson et al. 2012 paper: "the most striking finding is a clear signal of admixture into northern Europe, with one ancestral population related to present day Basques and Sardinians, and the other related to present day populations of northeast Asia and the Americas. This likely reflects a history of admixture between Neolithic migrants and the indigenous Mesolithic population of Europe, consistent with recent analyses of ancient bones from Sweden and the sequencing of the genome of the Tyrolean ‘Iceman’".

Indeed, the admixture in Poles can be shown as admixture between the Neolithic and Mesolithic populations of Europe, while Russians and Belarusians can be represented as admixture between the ancestral population of the modern populations of NE-Asia/Amerinds and the Neolithic populations of Europe.
However, a more careful examination of the results reveals additional signals of admixture in two of the three target populations, Poles and Belarusians.
Although it is perfectly possible to treat Estonians and Latvians as modern-day proxies for the NE populations of Mesolithic Europe, it is also obvious that these populations could have (at least in theory) a significant genetic legacy related to the Baltic branch of the Indo-European Corded Ware culture. On the other hand, the second component of admixture in Poles is itself a product of admixture between Near-East/Anatolian-like Neolithic populations and a more recent genetic stratum, which is probably related to the massive migration of R1b people (the ancestors of the 'Bell Beakers') from NE-Asia to Western Europe.

Given that, I'd suggest rewriting the components of admixture in Poles in the following manner:

Pole = ((Neolithic_populations of Europe) + "Bell Beakerish-like") + ((Mesolithic_populations) + "Corded_Ware" component) [1]

In Belarusians, the sources of the additional admixture signals are less clear.

As was shown earlier, in terms of formal admixture analysis (f3 statistics), Belarusians can be represented as admixture between Poles and Indians/Circassians. The first component of admixture is already known (see [1] above); the second one, according to the results, must resemble a component common to both Indians and Circassians. From history textbooks I've learned that the territory of modern Karachay-Cherkessia was occupied in the 1st millennium AD by the Alans, or Alani, a group of Sarmatian tribes, nomadic pastoralists who spoke an Eastern Iranian language that derived from Scytho-Sarmatian and in turn evolved into modern Ossetian. The only currently known most recent population ancestral to both the modern descendants of the Alans and modern Indians is the Scytho-Sarmatian metapopulation.

Thus, we can rewrite the admixture formula for Belarusians in the following manner:

Belarusian = (((Neolithic_populations of Europe) + "Bell Beakerish-like") + ((Mesolithic_populations) + "Corded_Ware" component)) + Scytho-Sarmatian-like

Now, after long discussion, it is time for the admixture_dating fun!

Admixture dating with ROLLOFF

To estimate the admixture date in the Polish population, I used Latvian and North-Italian as reference populations.

The admixture date is 119.670+-37.145 generations ago, which corresponds to 3470 +-1077 years before present, or 1510 +-1077 BC. The upper limits of our dating for the admixture event seem to overlap with the timescale of the Unetice culture. The Bronze Age in Poland, as well as elsewhere in central Europe, begins with the innovative Unetice culture, in existence in Silesia and a part of Greater Poland during the first period of this era, that is from before 2200 to 1600 BC. This settled agricultural society's origins consisted of the conservative traditions inherited from the Corded Ware populations and dynamic elements of the Bell-Beaker people. Significantly, the Unetice people cultivated contacts with the highly developed cultures of the Carpathian Basin, through whom they had trade links with the cultures of early Greece. Their culture also echoed the inspiring influence coming all the way from the most highly developed civilizations of the Middle East at that time.

To estimate the admixture date in the Belarusian population, I used Polish and Indian as reference populations (note: I also lowered the genetic distance threshold in the ROLLOFF parameters to reduce noise from more recent admixtures).

As you can see, the signal of admixture is weaker, and as a result the margins of error in the admixture dating are significantly higher than in the previous example: 154.158+-87.024 generations ago (or 4470 +-2523 years before present / 2510 +-2523 BC).

Saturday, October 13, 2012

World-22 dataset: fastIBD analysis

Various methods for detecting IBD, including those implemented in the software programs fastIBD, GERMLINE and Chromopainter, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, including identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease-causing loci that were previously undetectable with sparser genetic data.

fastIBD is a fast and computationally efficient method for detecting identity by descent. The fastIBD algorithm starts by sampling a fixed number of haplotype pairs (four pairs by default) for each individual from the posterior haplotype distribution. Each sampled haplotype corresponds to a sequence of hidden Markov model (HMM) states. The fastIBD algorithm searches for pairs of sampled haplotypes sharing the same sequence of HMM states for a set of consecutive markers. If the pair of sampled haplotypes belongs to two distinct individuals, the shared haplotype tract is recorded. For each pair of individuals, overlapping shared haplotype tracts are merged, and the merged shared haplotype tract is a mosaic of pairs of sampled haplotypes. The method has been implemented in BEAGLE, a very popular genetic analysis software package.
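The last step, merging overlapping shared tracts for a pair of individuals, is a classic interval-merging problem. The sketch below only illustrates that one step on (start, end) marker positions; it is not BEAGLE's implementation, and the function name is hypothetical.

```python
def merge_tracts(tracts):
    """Merge overlapping (start, end) intervals into disjoint tracts."""
    merged = []
    for start, end in sorted(tracts):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous tract: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Tracts found between different sampled haplotypes of the same two people.
shared = merge_tracts([(40, 90), (10, 50), (120, 150)])
```

Here the first two tracts overlap and are merged into one spanning markers 10 to 90, while the third stays separate.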

Similar IBD analyses have already been carried out by other genome bloggers.

Dienekes Pontikos has performed different fastIBD analyses for detecting the degree of sharing between various Eurasian and African groups. Davidski of the Eurogenes project has also used the fastIBD method in his Intra-European chromosome paintings.

Inspired by those analyses, I decided to run an IBD sharing analysis of the World-22 calculator dataset using the robust and powerful fastIBD software (with an increased IBD detection threshold) and the ibd2segment script.

I have created an ad-hoc subset of various East-European populations by including samples from the following populations:

Mordovian

Sorb

Hungarian

Belarusian

Tatar

Lithuanian

Polish

Bosnian

Ukrainian

Slovakian

Nogai

Serbian

Estonian

German

Swedish

Macedonian

Latvian

Moldavian

Montenegrin

Bulgarian

NorthOssetian

Kazakh

Slovenian

Uzbek

Adygei

Armenian

British

Czech

Orcadian

Russian

Turk

I have calculated the sum of IBD shared segments (measured in cM - centimorgans). The obtained matrix of pairwise sharing has been visualized in the following heat maps. The populations in the first heat map are clustered according to z-score values: high values indicate a high degree of IBD sharing, while low values indicate a low degree of IBD sharing.

In the second heat map, a tree-like hierarchical grouping of populations is tied to the total value of cM in segments shared by two populations in pairwise sharing.

I've also made some visualizations of IBD sharing for selected East-European populations.
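For readers who want to build a similar matrix from their own segment list, here is a minimal sketch of the bookkeeping. It sums segment lengths per unordered pair of populations and then standardizes the totals. The input layout (two individual IDs and a length in cM per segment) and the choice to compute z-scores over all pairs at once are assumptions for illustration, since the heat maps may have been scaled differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pairwise_sharing(segments, population_of):
    """Sum shared cM per unordered population pair.

    segments: iterable of (individual_a, individual_b, length_cm)
    population_of: dict mapping individual ID to population label
    """
    totals = defaultdict(float)
    for ind_a, ind_b, length_cm in segments:
        pair = tuple(sorted((population_of[ind_a], population_of[ind_b])))
        totals[pair] += length_cm
    return dict(totals)


def z_scores(totals):
    """Standardize the pairwise totals (global mean and standard deviation)."""
    values = list(totals.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {pair: 0.0 for pair in totals}
    return {pair: (value - mu) / sigma for pair, value in totals.items()}


# Made-up individuals and segment lengths, for illustration only.
population_of = {"a1": "Polish", "a2": "Polish",
                 "b1": "Lithuanian", "b2": "Lithuanian", "c1": "Tatar"}
segments = [("a1", "b1", 5.0), ("a1", "b2", 3.0), ("a2", "c1", 2.0)]
totals = pairwise_sharing(segments, population_of)
scores = z_scores(totals)
```

Note that summing raw totals favours populations with more samples; dividing by the number of individual pairs would be one way to correct for that.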

UPDATE I: I've uploaded a spreadsheet with IBD pairwise-sharing values to GoogleDrive.

Sunday, October 7, 2012

Spatial Analysis of Ancestry (UPDATE)

One member of my project (known as "linkus") has created a map (on Google Maps) with the inferred coordinates of 'ancestral locations of the MDLP participants' being superimposed on the geographic map of Europe.

Last but not least, I feel obliged to mention a couple of dendrographs, courtesy of another enthusiastic member of our project (this guy goes by the codename "Wojewoda").
These 'dendrographs' aim to visualize the geneographical distance between the MDLP participants, and for that reason they are worth noting.

Saturday, September 29, 2012

A quick update to SupportMix's Chromosome Painting

A thoughtful reader of our blog has noticed that some of the chromosome paintings (Chr 5 set 9; Chr 7 sets 4, 5; Chr 9 set 7; Chr 11 set 4) were missing from the original tar.gz bundle distributed via Google Drive. I've re-uploaded the archive (with the missing files); the new location of the archive is here.

Additional experiment.

In addition to that quick fix, I decided to test the accuracy of SupportMix's chromosome paintings by juxtaposing them with the MDLP-World22 chromosome graphs. Due to time limitations, I used only the first 7 chromosomes of my own SNP data. First, I ran the MDLP-World22 modification of DIYDodecad v2.1 in byseg mode on "windows" of 500 contiguous SNPs along a chromosome, slid in increments of 50. After that I cut out the chromosome paintings of each chromosome from SupportMix's graphic output and aligned them to the scale of the corresponding DIYDodecad chromosome graphs:

After a preliminary evaluation of the results, I noticed an approximate correlation between the byseg output of the MDLP World22 DIYcalculator and SupportMix for the two major "components" in my genome (North-East-European and Atlantic-Mediterranean). Moreover, the "Near-Eastern segments" assigned by SupportMix partially overlap with the peaks of "Near-East segments" in the DIYDodecad output. However, the situation with the minor components is much less certain. The lack of correlation for the minor components could be explained by different factors:

1) DIYDodecad operates on the unphased raw data of genotypes

2) DIYDodecad program doesn't take into consideration genetic distance/recombination

3) last, but not least: small segments may appear noisier than they are, because there may not be any informative SNPs in a particular region to distinguish between some of the minor ancestral components (Dienekes Pontikos' observation)

UPDATE: At first I thought that it would be a great idea to calculate the correlation between the 'byseg' output and SupportMix's Tprobs output file. But it seems that the results are not directly comparable: the assignment of segments in 'byseg' is measured in frequencies, while the assignment of segments in SupportMix is expressed as a probability of assignment. If someone has a solution to this problem, please let me know.

Friday, September 28, 2012

Geography of Ancestry: the SPA analysis of the MDLP participants

A team of researchers (Wen-Yun Yang, John Novembre, Eleazar Eskin, Eran Halperin) from Tel Aviv University (TAU) and University of California, Los Angeles (UCLA) have created a method for more precisely pinpointing the geographic origin of a person's ancestry by developing an understanding of the spatial diversity of genes. The analysis of diversity of genes within and between populations has broad applications in studies of human disease and human migrations. The aforementioned team of researchers proposed a new approach, spatial ancestry analysis, explicitly modeling the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space.

Although the authors were more concerned with detecting the signals of selective sweeps in human genome, the SPA software implements some interesting features that could be immediately applied to the analysis of genetic data collected in open genome projects.

The most important one is that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone.

From the original paper on model-based approach for analysis of spatial structure in genetic data:

If the geographic origins of the individuals are known, one can use this information to infer their allele frequency functions at each SNP. However, if locations are not known, our model can infer geographic origins for individuals using only their genetic data, in a manner similar in spirit to PCA-based approaches for spatial assignment.

The experiment

Since the authors have made their software publicly available, I decided to give the SPA software a try. The learning curve was gentle, because three of the five supported formats are Plink formats (with which I am familiar). Actually, the hardest part of the experiment with SPA analysis was deciding what to do with the unknown geographic origins of the MDLP participants. Following a hint found in another interesting paper (A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations), I divided the experiment into the following steps:

1) first of all, I obtained the geographic coordinates (lats/longs) of each population included in the run.

2) then I carried out the SPA analysis with 3 specified dimensions

3) after the SPA analysis was finished, I applied Procrustes analysis to compare the individual-level coordinates of the first two components (1 and 2) in the SPA performed on the SNP data (1440447 SNPs) to the geographic coordinates

4) using Procrustes analysis, I identified an optimal alignment of the genetic coordinates to the (Gilbert-projected) geographic coordinates that involved a rotation of the longitudes and latitudes by 16° counterclockwise.

5) finally, I projected the individual coordinates (which had previously been corrected for the optimal Procrustes alignment) onto the geographic map of Eurasia.
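The rotation part of steps 3 and 4 can be sketched in plain Python. For two matched sets of 2D points, the least-squares rotation angle has a closed form once both sets are centered. This is only a simplified illustration: a full Procrustes analysis also fits translation and scaling, and the function names below are hypothetical rather than taken from the software I used.

```python
import math


def center(points):
    """Shift points so that their centroid is at the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]


def rotate(points, angle_deg):
    """Rotate points counterclockwise about the origin."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def best_rotation_deg(genetic, geographic):
    """Counterclockwise angle that best maps genetic onto geographic points."""
    a = center(genetic)
    b = center(geographic)
    cross = sum(x * v - y * u for (x, y), (u, v) in zip(a, b))
    dot = sum(x * u + y * v for (x, y), (u, v) in zip(a, b))
    return math.degrees(math.atan2(cross, dot))


# Synthetic check: rotate some made-up points by 16 degrees and recover it.
genetic = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (-3.0, 1.0)]
geographic = rotate(genetic, 16.0)
angle = best_rotation_deg(genetic, geographic)
```

On real data the recovered angle is the one that minimizes the summed squared distances between matched points, so it will not reproduce the geography exactly.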

The MDLP participants can find their final geographic coordinates in the corresponding spreadsheet.

The allele frequency gradients and signals of recent positive selection

Another cool feature of the SPA software is that it is able to identify loci showing extreme frequency gradients (i.e. loci under selection) without requiring individuals to be grouped into populations. These are SNPs that show steep slopes of allele frequency change, with the consideration that some of these might show extreme gradients because of the impact of recent positive selection.

The analysis of selective sweeps (as well as their possible implications) belongs to the domain of molecular biology and medical genetics, and given the scope of the project I am not going to discuss them in detail. I'll limit my discussion to the following observations: the direction of the allele frequency gradients resembles the presumed gene flow from East Eurasia to West Eurasia, and from South Europe to North Europe. The first two dimensions of SPA capture the main features of variation on the well-known East-West Eurasian cline, while the second and third dimensions represent the gene flow from South Europe to North Europe.

I've sorted the SNPs according to the value of the slope function, and it appears that the most extreme individual value is found at rs7568419, a SNP which is believed to be linked to a genetically inherited trait, the chin dimple. Researchers at 23andMe have identified two genetic variants associated with the trait in people of European ancestry. The C version of rs10953183 is associated with a more pronounced chin dimple and the C version of rs7568419 is associated with less of a chin dimple.
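The sorting itself is straightforward to reproduce. The sketch below assumes a whitespace-separated text file with the slope value in the last column and the SNP identifier in a column given by `snp_column`; the default of 0 is a guess that should be adjusted to the actual file, and the function name is hypothetical.

```python
def top_snps_by_slope(lines, n=10, snp_column=0):
    """Return the n (snp_id, slope) pairs with the largest slope values."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            slope = float(fields[-1])
        except ValueError:
            continue  # skip a header or a malformed row
        records.append((fields[snp_column], slope))
    records.sort(key=lambda record: record[1], reverse=True)
    return records[:n]


# Placeholder rows with made-up identifiers and values, for illustration only.
demo_lines = [
    "snp_id slope",
    "snpA 0.12",
    "snpB 0.87",
    "snpC 0.45",
]
top = top_snps_by_slope(demo_lines, n=2)
```

For a real file, pass an open file object instead of the demo list, for example `top_snps_by_slope(open(path))` with your own path.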

A couple of factoids about a cleft chin from Wikipedia:

"This is an inherited trait in humans, where the dominant gene causes the cleft chin while the recessive genotype presents without a cleft. However, it is also a classic example for variable penetrance^[5] with environmental factors or a modifier gene possibly affecting the phenotypical expression of the actual genotype. Although cleft chins are seen throughout the world, they are most predominate among people of Germanic and West Slavic (i.e., Polish) ethnicity. It is very common in that part of the world and among descendants of people originating in that part of Europe.^[6]It seems particularly prevalent among people living in the former Prussian areas of northern Poland bordering the Baltic Sea."

Those who are interested in a more detailed analysis of loci under selection can find the SPA output file in the corresponding spreadsheet (note: the value of the slope function is in the last column). If you find an interesting SNP association with a particular trait, please report your finding to me.