Saturday, September 29, 2012

A quick update to SupportMix's Chromosome Painting

A thoughtful reader of our blog noticed that some of the chromosome paintings (Chr 5 set 9; Chr 7 sets 4 and 5; Chr 9 set 7; Chr 11 set 4) were missing from the original tar.gz bundle distributed via Google Data Drive. I have re-uploaded the archive with the missing files included; the new location of the archive is here.

Additional experiment.

In addition to that quick fix, I decided to test the accuracy of SupportMix's chromosome paintings by juxtaposing them with the MDLP-World22 chromosome graphs. Due to time limitations, I used only the first 7 chromosomes of my own SNP data. First, I ran the MDLP-World22 modification of DIYDodecad v2.1 in byseg mode on "windows" of 500 contiguous SNPs along each chromosome, sliding the window in increments of 50 SNPs. Then I cut the painting of each chromosome out of SupportMix's graphic output and aligned it to the scale of the corresponding DIYDodecad chromosome graph.
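To make the windowing scheme concrete, here is a minimal Python sketch of how 500-SNP windows advancing by 50 SNPs can be enumerated. It is not how DIYDodecad implements byseg mode internally; the function name `sliding_windows` and the example count of 1000 SNPs are my own assumptions for illustration.

```python
def sliding_windows(n_snps, size=500, step=50):
    """Yield (start, end) index pairs for windows of `size` contiguous SNPs.

    `end` is exclusive. Windows that would run past the last SNP are dropped.
    The name and signature are illustrative, not part of DIYDodecad.
    """
    start = 0
    while start + size <= n_snps:
        yield start, start + size
        start += step


# Hypothetical chromosome with 1000 genotyped SNPs.
windows = list(sliding_windows(1000))
print(len(windows), windows[0], windows[-1])
```

Because consecutive windows share 450 of their 500 SNPs, neighbouring estimates are strongly correlated, which is why the resulting chromosome graphs look smooth.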

After a preliminary evaluation of the results, I noticed an approximate correlation between the byseg output of the MDLP World22 DIYDodecad calculator and SupportMix for the two major "components" in my genome (North-East-European and Atlantic-Mediterranean). Moreover, the "Near-Eastern" segments assigned by SupportMix partially overlap with the peaks of the "Near-East" segments in the DIYDodecad output. However, the situation with the minor components is much less certain. The lack of correlation for the minor components could be explained by several factors:

1) DIYDodecad operates on unphased raw genotype data
2) DIYDodecad doesn't take genetic distance/recombination into consideration
3) last, but not least: small segments may appear noisier than they are, because a particular region may not contain any SNPs informative enough to distinguish between some of the minor ancestral components (Dienekes Pontikos' observation)

UPDATE: At first I thought it would be a great idea to calculate a correlation coefficient between the 'byseg' output and SupportMix's Tprobs output file. But it seems that the results are not directly comparable: the assignment of segments in 'byseg' is measured in frequencies, while the assignment of segments in SupportMix is expressed as a probability of assignment. If someone has a solution to this problem, please let me know.
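One possible workaround, offered only as a sketch and not as a settled answer: a rank correlation (Spearman's rho) compares only the ordering of the values, so it does not care that one series is a frequency and the other a probability. The code below assumes that both outputs have already been reduced to one value per window for a single component, aligned to the same genomic positions; the numbers are made-up placeholders, not values from my data.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values for one component across five aligned windows.
byseg_freq = [0.10, 0.40, 0.35, 0.80, 0.05]
tprobs_prob = [0.20, 0.90, 0.70, 0.99, 0.01]
rho = spearman(byseg_freq, tprobs_prob)
print(rho)
```

This sidesteps the difference in scale but not the difference in meaning, so a high rho would only say that the two methods agree on where a component is strongest.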

Friday, September 28, 2012

Geography of Ancestry: the SPA analysis of the MDLP participants

A team of researchers (Wen-Yun Yang, John Novembre, Eleazar Eskin, Eran Halperin) from Tel Aviv University (TAU) and the University of California, Los Angeles (UCLA) has created a method for pinpointing the geographic origin of a person's ancestry more precisely by modeling the spatial diversity of genes. The analysis of genetic diversity within and between populations has broad applications in studies of human disease and human migrations. The team proposed a new approach, spatial ancestry analysis (SPA), which explicitly models the spatial distribution of each SNP by describing its allele frequency as a continuous function in geographic space.
Although the authors were more concerned with detecting the signals of selective sweeps in the human genome, the SPA software implements some interesting features that can be applied immediately to the analysis of genetic data collected in open genome projects.
The most important one is that the explicit modeling of the allele frequency allows individuals to be localized on the map on the basis of their genetic information alone.
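A minimal sketch of what such a continuous allele frequency function can look like, assuming (as I understand the paper) a logistic function of location, where each SNP has a direction-and-steepness vector `a` and an offset `b`. The function names and parameter values below are mine and purely illustrative; this is not code from the SPA package.

```python
import math


def allele_frequency(x, y, a, b):
    """Allele frequency at map position (x, y) under a logistic gradient.

    a: (a_x, a_y) vector giving the direction and steepness of the gradient.
    b: offset that shifts where the frequency crosses 0.5.
    """
    z = a[0] * x + a[1] * y + b
    return 1.0 / (1.0 + math.exp(-z))


def slope_score(a):
    """Steepness of the gradient: the length of the vector a."""
    return math.hypot(a[0], a[1])


# Illustrative SNP whose allele becomes more common towards the east (x axis).
a, b = (3.0, 4.0), 0.0
print(allele_frequency(0.0, 0.0, a, b))  # frequency at the origin
print(slope_score(a))
```

The longer the vector `a`, the more abruptly the frequency changes across the map, which is the quantity used later to rank SNPs.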

From the original paper on model-based approach for analysis of spatial structure in genetic data:

If the geographic origins of the individuals are known, one can use this information to infer their allele frequency functions at each SNP. However, if locations are not known, our model can infer geographic origins for individuals using only their genetic data, in a manner similar in spirit to PCA-based approaches for spatial assignment.

 The experiment

Since the authors have made their software publicly available, I decided to give SPA a try. The learning curve was gentle, because three of the five supported input formats are Plink formats (with which I am familiar). Actually, the hardest part of the experiment was deciding what to do with the unknown geographic origins of the MDLP participants. Following a hint found in another interesting paper (A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations), I divided the experiment into five steps:
1) first of all, I obtained the geographic coordinates (latitudes/longitudes) of each population included in the run.
2) then I carried out the SPA analysis with 3 specified dimensions.
3) after the SPA analysis was finished, I applied Procrustes analysis to compare the individual-level coordinates on the first two SPA dimensions (1 and 2), computed from the SNP data (1,440,447 SNPs), to the geographic coordinates.
4) using Procrustes analysis, I identified an optimal alignment of the genetic coordinates to the (Gilbert-projected) geographic coordinates, which involved a rotation of the longitudes and latitudes by 16 degrees counterclockwise.
5) finally, I projected the individual coordinates (previously corrected for the optimal Procrustes alignment) onto the geographic map of Eurasia.
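The rotation part of the Procrustes step can be sketched in a few lines of Python. This is a simplified illustration, not the exact procedure I ran: it handles only centering and rotation in two dimensions (a full Procrustes analysis also fits a scaling factor), and the coordinates are toy values chosen so that the known answer is 16 degrees.

```python
import math


def center(points):
    """Shift points so that their centroid sits at the origin."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return [(x - mean_x, y - mean_y) for x, y in points]


def rotate(points, theta):
    """Rotate points counterclockwise by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def best_rotation(source, target):
    """Angle (radians, counterclockwise) that best rotates source onto target.

    Closed-form least-squares solution for paired 2D points after centering.
    """
    src, tgt = center(source), center(target)
    num = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src, tgt))
    den = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src, tgt))
    return math.atan2(num, den)


# Toy "genetic" coordinates and a copy rotated by 16 degrees as "geography".
genetic = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
geographic = rotate(genetic, math.radians(16))

angle_deg = math.degrees(best_rotation(genetic, geographic))
aligned = rotate(center(genetic), math.radians(angle_deg))
print(round(angle_deg, 6))
```

With real data the two point sets never match exactly, so the recovered angle is the one that minimizes the summed squared distances between each individual's genetic and geographic positions.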

The MDLP participants can find their final geographic coordinates in the corresponding spreadsheet.

The allele frequency gradients and signals of recent positive selection

Another cool feature of the SPA software is its ability to identify loci showing extreme frequency gradients (i.e., candidate loci under selection) without grouping individuals into populations. These are SNPs with steep slopes of allele frequency change, some of which may show extreme gradients because of recent positive selection.

The analysis of selective sweeps (as well as their possible implications) belongs to the domain of molecular biology and medical genetics, and due to the limitations of the project I am not going to discuss them in detail. I'll limit my discussion to the following observations: the directions of the allele frequency gradients resemble the presumed gene flow from East Eurasia to West Eurasia, and from South Europe to North Europe. The first two dimensions of SPA capture the main features of variation along the well-known East-West Eurasian cline, while the second and third dimensions represent the gene flow from South Europe to North Europe.


I've sorted the SNPs by the value of the slope function, and it appears that the most extreme value belongs to rs7568419, a SNP believed to be linked to a genetically inherited trait. Researchers at 23andMe have identified two genetic variants associated with the trait in people of European ancestry. The C version of rs10953183 is associated with a more pronounced chin dimple and the C version of rs7568419 is associated with less of a chin dimple.

A couple of factoids about a cleft chin from Wikipedia:
"This is an inherited trait in humans, where the dominant gene causes the cleft chin while the recessive genotype presents without a cleft. However, it is also a classic example for variable penetrance[5] with environmental factors or a modifier gene possibly affecting the phenotypical expression of the actual genotype. Although cleft chins are seen throughout the world, they are most predominate among people of Germanic and West Slavic (i.e., Polish) ethnicity. It is very common in that part of the world and among descendants of people originating in that part of Europe.[6]It seems particularly prevalent among people living in the former Prussian areas of northern Poland bordering the Baltic Sea."

Those who are interested in a more detailed analysis of loci under selection can find the SPA output file in the corresponding spreadsheet (note: the value of the slope function is in the last column). If you find an interesting SNP association with a particular trait, please report your finding to me.
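For those who prefer to rank the SNPs themselves, here is a small Python sketch of the sorting step. It assumes a whitespace-delimited text export with the SNP identifier in the second column and the slope value in the last column; only the "last column" part comes from the note above, while the identifier column, the function name and the sample rows are hypothetical placeholders.

```python
def rank_by_slope(lines, snp_col=1):
    """Return (snp_id, slope) pairs sorted from steepest to flattest.

    lines: iterable of whitespace-delimited rows, slope in the last column.
    snp_col: zero-based index of the SNP identifier column (an assumption).
    """
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append((fields[snp_col], float(fields[-1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder rows, not real SPA output: chromosome, SNP id, position, slope.
sample = [
    "1 snpA 1000 0.42",
    "2 snpB 2000 3.10",
    "",
    "7 snpC 3000 1.75",
]
ranked = rank_by_slope(sample)
for snp_id, slope in ranked:
    print(snp_id, slope)
```

To run it on a real file, pass an open file object instead of the sample list and adjust `snp_col` to match the actual layout.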



Sunday, September 23, 2012

Free SNP test offer from DNATribes

I encourage all MDLP participants who have not yet had an opportunity to take up the DNATribes Free SNP offer to send their raw data to Luke Martin for a free-of-charge analysis.

Free SNP Offer for Magnus Ducatus Lithuaniae Participants

"Good afternoon,

We have followed with interest your Magnus Ducatus Lituaniae Project. In case you or any of your participants are interested, we would like to offer a free DNA Tribes SNP analysis for unrelated participants with four grandparents from the same country. (Grandparent birth places of U.S. and Canada are not included in this offer.) This would help us include less sampled populations in our geographical analysis.

For free DNA Tribes SNP analysis, interested project participants can fill out the grandparent form ( and email their zipped genome file to between now and October 1, 2012.

If you have any questions, please let me know at

Best regards,
Lucas Martin"