Geography of Ancestry: the SPA analysis of the MDLP participants
A team of researchers (Wen-Yun Yang, John Novembre, Eleazar Eskin, Eran Halperin) from Tel Aviv University (TAU) and the University of California, Los Angeles (UCLA) has created a method for pinpointing the geographic origin of a person's ancestry more precisely by modeling the spatial diversity of genes. The analysis of genetic diversity within and between populations has broad applications in studies of human disease and human migrations. The team proposed a new approach, spatial ancestry analysis (SPA), which explicitly models the spatial distribution of each SNP by treating its allele frequency as a continuous function in geographic space.
Although the authors were more concerned with detecting the signals of selective sweeps in the human genome, the SPA software implements some interesting features that can be applied immediately to the analysis of genetic data collected in open genome projects.
The most important one is that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone.
From the original paper on a model-based approach for the analysis of spatial structure in genetic data:
If the geographic origins of the individuals are known, one can use this information to infer their allele frequency functions at each SNP. However, if locations are not known, our model can infer geographic origins for individuals using only their genetic data, in a manner similar in spirit to PCA-based approaches for spatial assignment.
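To make the idea of "allele frequency as a continuous function in geographic space" more concrete, here is a minimal sketch of how an individual could be localized from genotypes alone. It is not the SPA implementation. It assumes that each SNP's allele frequency is a logistic function of position (which is how I read the paper), and the names allele_freq, log_likelihood and localize, as well as the simple grid search, are my own illustration.

```python
import numpy as np

def allele_freq(x, a, b):
    """Assumed logistic allele frequency of every SNP at position x.

    x: (2,) position; a: (m, 2) per-SNP slope vectors; b: (m,) intercepts.
    """
    return 1.0 / (1.0 + np.exp(-(a @ x + b)))

def log_likelihood(x, genotypes, a, b):
    """Binomial log-likelihood (up to a constant) of 0/1/2 allele counts at x."""
    f = np.clip(allele_freq(x, a, b), 1e-9, 1.0 - 1e-9)
    return np.sum(genotypes * np.log(f) + (2 - genotypes) * np.log(1.0 - f))

def localize(genotypes, a, b, grid):
    """Return the grid point (row of grid) with the highest likelihood."""
    scores = [log_likelihood(p, genotypes, a, b) for p in grid]
    return grid[int(np.argmax(scores))]
```

When the locations are known, the same likelihood can be maximized over a and b instead, which corresponds to the first case described in the quote. A real tool would use a proper optimizer rather than a grid.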
Since the authors have made their software publicly available, I decided to give SPA a try. The learning curve was very smooth, because three of the five supported input formats are Plink formats (with which I am familiar). Actually, the hardest part of the experiment was deciding what to do with the unknown geographic origins of the MDLP participants. Following a hint found in another interesting paper (A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations), I divided the experiment into the following steps:
1) first of all, I obtained the geographic coordinates (latitudes/longitudes) of each population included in the run.
2) then I carried out the SPA analysis with 3 specified dimensions.
3) after the SPA analysis was finished, I applied Procrustes analysis to compare the individual-level coordinates of the first two components (1 and 2) of the SPA performed on the SNP data (1440447 SNPs) with the geographic coordinates.
4) using Procrustes analysis, I identified an optimal alignment of the genetic coordinates to the (Gilbert-projected) geographic coordinates, which involved rotating the longitudes and latitudes by 16 degrees counterclockwise.
5) finally, I projected the individual coordinates (previously corrected for the optimal Procrustes alignment) onto the geographic map of Eurasia.
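The Procrustes steps (3 to 5) can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions, not the exact script used for the run: the function name procrustes_align is hypothetical, both inputs are assumed to be (n, 2) arrays with one row per individual, and the geographic coordinates are assumed to be already projected.

```python
import numpy as np

def procrustes_align(genetic, geographic):
    """Fit translation, scale and rotation mapping genetic onto geographic.

    Returns the aligned genetic coordinates, the counterclockwise rotation
    angle in degrees, and a similarity score between 0 and 1.
    """
    G = genetic - genetic.mean(axis=0)         # centered genetic coordinates
    T = geographic - geographic.mean(axis=0)   # centered geographic coordinates
    U, s, Vt = np.linalg.svd(G.T @ T)
    R = U @ Vt                                 # orthogonal matrix minimizing ||scale * G R - T||
    scale = s.sum() / (G ** 2).sum()
    aligned = scale * G @ R + geographic.mean(axis=0)
    # For row vectors, a counterclockwise rotation by theta is
    # [[cos, sin], [-sin, cos]]; the angle is only meaningful if det(R) > 0
    # (det(R) < 0 means the best fit includes a reflection).
    angle = np.degrees(np.arctan2(R[0, 1], R[0, 0]))
    similarity = s.sum() / np.sqrt((G ** 2).sum() * (T ** 2).sum())
    return aligned, angle, similarity
```

The aligned coordinates are expressed in the same units as the projected geographic coordinates, so they can be drawn directly on the map, and the returned angle corresponds to the rotation mentioned in step 4.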
The MDLP participants can find their final geographic coordinates in the corresponding spreadsheet.
The allele frequency gradients and signals of recent positive selection
Another cool feature of the SPA software is that it can identify loci showing extreme frequency gradients (i.e., candidate loci under selection) without requiring individuals to be grouped into populations. These are SNPs that show steep slopes of allele frequency change, and some of them may show extreme gradients because of the impact of recent positive selection.
The analysis of selective sweeps (as well as their possible implications) belongs to the domain of molecular biology and medical genetics, and given the scope of this project I am not going to discuss them in detail. I'll limit my discussion to the following observations: the directions of the allele frequency gradients resemble the presumed gene flow from East Eurasia to West Eurasia, and from Southern Europe to Northern Europe. The first two dimensions of SPA capture the main features of variation along the well-known East-West Eurasian cline, while the second and third dimensions represent the gene flow from Southern Europe to Northern Europe.
I've sorted the SNPs by the value of the slope function, and the most extreme individual value is found at rs7568419, a SNP that is believed to be linked to a genetically inherited trait, the chin dimple. Researchers at 23andMe have identified two genetic variants associated with the trait in people of European ancestry. The C version of rs10953183 is associated with a more pronounced chin dimple, and the C version of rs7568419 is associated with less of a chin dimple.
A couple of factoids about a cleft chin from Wikipedia:
"This is an inherited trait in humans, where the dominant gene causes the cleft chin while the recessive genotype presents without a cleft. However, it is also a classic example for variable penetrance with environmental factors or a modifier gene possibly affecting the phenotypical expression of the actual genotype. Although cleft chins are seen throughout the world, they are most predominate among people of Germanic and West Slavic (i.e., Polish) ethnicity. It is very common in that part of the world and among descendants of people originating in that part of Europe. It seems particularly prevalent among people living in the former Prussian areas of northern Poland bordering the Baltic Sea."
Those who are interested in a more detailed analysis of loci under selection can find the SPA output file in the corresponding spreadsheet (note: the value of the slope function is in the last column). If you find an interesting SNP association with a particular trait, please report your finding to me.
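For anyone who prefers a script to a spreadsheet, ranking the SNPs by the slope value could look roughly like the sketch below. The only thing taken from the actual output is that the slope value sits in the last column. The whitespace-delimited layout, the position of the SNP identifier (second column, as in a Plink .bim file) and the function name top_slope_snps are assumptions made for illustration.

```python
def top_slope_snps(lines, k=10, id_col=1):
    """Return the k (snp_id, slope) pairs with the largest last-column value.

    lines: iterable of whitespace-delimited text lines (e.g. an open file).
    id_col: assumed zero-based column index of the SNP identifier.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) <= id_col:
            continue                      # skip blank or truncated lines
        try:
            slope = float(fields[-1])     # slope value assumed in last column
        except ValueError:
            continue                      # skip a header or malformed line
        rows.append((fields[id_col], slope))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:k]
```

With a real file this would be called as top_slope_snps(open(path)), where path points to the exported output; the SNPs at the top of the returned list are the ones worth looking up first.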