Friday, September 28, 2012

Geography of Ancestry: the SPA analysis of the MDLP participants




A team of researchers (Wen-Yun Yang, John Novembre, Eleazar Eskin, Eran Halperin) from Tel Aviv University (TAU) and the University of California, Los Angeles (UCLA) has created a method for pinpointing the geographic origin of a person's ancestry more precisely by modeling the spatial diversity of genes. The analysis of genetic diversity within and between populations has broad applications in studies of human disease and human migrations. The team proposed a new approach, spatial ancestry analysis (SPA), which explicitly models the spatial distribution of each SNP by treating its allele frequency as a continuous function in geographic space.
 
Although the authors were more concerned with detecting signals of selective sweeps in the human genome, the SPA software implements some interesting features that can be applied immediately to the analysis of genetic data collected in open genome projects.
The most important one is that explicit modeling of allele frequencies allows individuals to be localized on the map on the basis of their genetic information alone.

From the original paper, "A model-based approach for analysis of spatial structure in genetic data":

If the geographic origins of the individuals are known, one can use this information to infer their allele frequency functions at each SNP. However, if locations are not known, our model can infer geographic origins for individuals using only their genetic data, in a manner similar in spirit to PCA-based approaches for spatial assignment.
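
To make the localization idea concrete, here is a minimal Python sketch. It assumes the logistic form of the allele frequency function, f_j(x) = 1 / (1 + exp(-(a_j · x) - b_j)), and it assumes that the per-SNP coefficients a_j and b_j are already known. The function names, the plain gradient ascent and the simulated genotypes are illustrative assumptions; this is not the SPA software itself.

```python
import numpy as np

def allele_frequency(a, b, x):
    """Logistic allele-frequency surface evaluated at location x for every SNP."""
    return 1.0 / (1.0 + np.exp(-(a @ x + b)))

def localize(genotypes, a, b, steps=300, lr=1.0):
    """Maximum-likelihood location of one individual, found by gradient ascent.

    genotypes: allele counts (0, 1 or 2), shape (n_snps,)
    a: per-SNP slope vectors, shape (n_snps, n_dims)
    b: per-SNP intercepts, shape (n_snps,)
    The default step size assumes slopes on roughly a unit scale (an assumption).
    """
    x = np.zeros(a.shape[1])
    for _ in range(steps):
        f = allele_frequency(a, b, x)
        # Gradient of sum_j [g_j log f_j + (2 - g_j) log(1 - f_j)] with respect to x
        grad = a.T @ (genotypes - 2.0 * f)
        x += lr * grad / len(genotypes)
    return x

def simulate_individual(x_true, a, b, rng):
    """Draw genotypes (two allele copies per SNP) for an individual located at x_true."""
    return rng.binomial(2, allele_frequency(a, b, x_true))
```

With a few thousand simulated SNPs, the location returned by `localize` should land close to the location passed to `simulate_individual`; with real data, the coefficients would first have to be estimated, which this sketch does not attempt.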

 The experiment


Since the authors have made their software publicly available, I decided to give SPA a try. The learning curve was gentle, because three of the five supported input formats are PLINK formats (with which I am familiar). In fact, the hardest part of the experiment was deciding what to do about the unknown geographic origins of the MDLP participants. Following a hint found in another interesting paper (A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations), I divided the experiment into five steps:
1) First, I obtained the geographic coordinates (latitude and longitude) of each population included in the run.
2) Then I ran the SPA analysis with 3 specified dimensions.
3) After the SPA analysis finished, I applied Procrustes analysis to compare the individual-level coordinates on the first two dimensions (1 and 2) of the SPA performed on the SNP data (1,440,447 SNPs) with the geographic coordinates.
4) Using Procrustes analysis, I identified an optimal alignment of the genetic coordinates to the (Gilbert-projected) geographic coordinates, which involved rotating the longitudes and latitudes by 16° counterclockwise.
5) Finally, I projected the individual coordinates (already corrected for the optimal Procrustes alignment) onto a geographic map of Eurasia.
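
The Procrustes steps above can be sketched in a few lines of Python. This is a minimal orthogonal Procrustes fit written with NumPy, not necessarily the exact procedure used for this post. The function name is hypothetical, both inputs are assumed to be (n, 2) arrays with one row per individual, and the geographic coordinates are assumed to be already projected onto a plane. The fit is restricted to pure rotations, so a mirrored genetic map would need to be flipped beforehand.

```python
import numpy as np

def procrustes_rotation(genetic, geographic):
    """Fit translation, scale and rotation mapping genetic onto geographic coordinates.

    Returns (angle_deg, similarity, aligned): the counterclockwise rotation angle in
    degrees, a similarity score in [0, 1] (1 means a perfect fit), and the genetic
    points expressed in the units of the geographic coordinates.
    """
    genetic = np.asarray(genetic, dtype=float)
    geographic = np.asarray(geographic, dtype=float)
    geo_mean = geographic.mean(axis=0)
    G = genetic - genetic.mean(axis=0)        # centre both configurations
    Y = geographic - geo_mean
    geo_norm = np.linalg.norm(Y)              # Frobenius norm of the centred map
    G = G / np.linalg.norm(G)                 # scale both to unit norm
    Y = Y / geo_norm
    U, S, Vt = np.linalg.svd(G.T @ Y)
    if np.linalg.det(U @ Vt) < 0:             # keep a proper rotation, no mirror image
        U[:, -1] *= -1
        S[-1] *= -1
    R = U @ Vt                                # row vectors are rotated as G @ R
    similarity = S.sum()                      # also the optimal scale for unit-norm G
    angle_deg = np.degrees(np.arctan2(R[0, 1], R[0, 0]))
    aligned = similarity * (G @ R) * geo_norm + geo_mean
    return angle_deg, similarity, aligned
```

In this sign convention a positive angle means the genetic coordinates are rotated counterclockwise to match the map, and the `aligned` array is what would be plotted on the map in the final step.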



The MDLP participants can find their final geographic coordinates in the corresponding spreadsheet.


The allele frequency gradients and signals of recent positive selection

Another cool feature of the SPA software is its ability to identify loci showing extreme frequency gradients (i.e., candidate loci under selection) without requiring individuals to be grouped into populations. These are SNPs with steep slopes of allele frequency change; some of them may show extreme gradients because of recent positive selection.

The analysis of selective sweeps (and their possible implications) belongs to the domain of molecular biology and medical genetics, and given the scope of this project I am not going to discuss them in full detail. I'll limit my discussion to the following observations: the directions of the allele frequency gradients resemble the presumed gene flow from East Eurasia to West Eurasia, and from Southern Europe to Northern Europe. The first two dimensions of SPA capture the main features of variation along the well-known East-West Eurasian cline, while the second and third dimensions represent the gene flow from Southern Europe to Northern Europe.


I sorted the SNPs by the value of the slope function, and the most extreme value belongs to rs7568419, a SNP believed to be linked to a genetically inherited trait, the chin dimple (cleft chin). Researchers at 23andMe have identified two genetic variants associated with this trait in people of European ancestry: the C version of rs10953183 is associated with a more pronounced chin dimple, and the C version of rs7568419 with less of a chin dimple.

A couple of factoids about a cleft chin from Wikipedia:
"This is an inherited trait in humans, where the dominant gene causes the cleft chin while the recessive genotype presents without a cleft. However, it is also a classic example for variable penetrance[5] with environmental factors or a modifier gene possibly affecting the phenotypical expression of the actual genotype. Although cleft chins are seen throughout the world, they are most predominate among people of Germanic and West Slavic (i.e., Polish) ethnicity. It is very common in that part of the world and among descendants of people originating in that part of Europe.[6]It seems particularly prevalent among people living in the former Prussian areas of northern Poland bordering the Baltic Sea."

Those interested in a more detailed analysis of loci under selection can find the SPA output file in the corresponding spreadsheet (note: the value of the slope function is in the last column). If you find an interesting association between a SNP and a particular trait, please report your finding to me.
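
For readers who want to repeat the ranking on the output file, a minimal Python sketch follows. The only detail taken from this post is that the slope value sits in the last column; the whitespace-delimited layout, the position of the SNP identifier (`id_col`) and any example rows are assumptions made for illustration.

```python
def rank_by_slope(lines, id_col=1, top=10):
    """Return the `top` (snp_id, slope) pairs with the largest slope values.

    lines: iterable of whitespace-delimited text rows with the slope in the last column.
    id_col: zero-based column holding the SNP identifier (assumed, not confirmed).
    """
    ranked = []
    for line in lines:
        fields = line.split()
        if len(fields) <= id_col:
            continue                      # skip blank or truncated rows
        try:
            slope = float(fields[-1])
        except ValueError:
            continue                      # skip header rows and non-numeric values
        ranked.append((fields[id_col], slope))
    # Largest slope first, i.e. the steepest allele frequency gradients on top
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top]
```

The function accepts any iterable of text lines, so an open file handle can be passed directly. If the identifier is stored in a different column in your copy of the file, adjust `id_col` accordingly.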

Comments:

  1. you probably won't contact me, but i was on GEDmatch and saw your page. my kit is A716396 and i would like to know if there is any American Indian genetic matches. So i would love for you to tell me anything and everything you can based on my DNA test. Thank you. Janet

    Replies
    1. Hi Janet, I am not the original poster of the blog, just a reader, but I saw this comment is recent and wanted to help ... I don't think the blog poster has personal control over your file on GEDMatch. Have you tried running your file through any of the analysis tools on GEDMatch yourself yet? This one is called MDLP World or something like that. If you have trouble finding how to use the tool on GEDMatch, you could try submitting your .txt file on a website called DNASolves. They are run by a third party group that helps Law Enforcement identify possible distant relatives of unidentified deceased individuals, in the hope of identifying them. When you volunteer your DNA on that site, they give you an analysis based on this tool in return. That site is a little more user-friendly. Hope this helps!