
Thursday, September 20, 2012

The component maps of MDLP-World22 calculator

I would like to express my gratitude to fellow ABF members Loxias and Wojewoda for creating amazing maps of the components and plots showing the interrelationships between the 22 components.

Since readers of my blog might be interested in inspecting the components visually, I have decided to upload them online.

The first batch of maps includes "the component portraits", created by Loxias:


The second batch includes the PCA scatter plots of components, created by Wojewoda. Each PCA scatter plot shows one principal component against another, computed from the ancestry coefficients collected for each of the MDLP World-22 components; the points are connected over time to produce a trajectory for each component.
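
For readers who want to reproduce this kind of plot, here is a minimal sketch of how the coordinates of such a scatter plot could be obtained. It is not Wojewoda's actual procedure: the function name `pca_scores` and the toy matrix are made up for illustration, and the only assumption is that each row holds the ancestry coefficients collected for one component.

```python
import numpy as np

def pca_scores(matrix, n_components=2):
    """Project the rows of `matrix` onto its leading principal components."""
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)  # centre every column
    # SVD of the centred matrix: the rows of vt are the principal axes
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data (invented): 4 components (rows) x 3 populations (columns)
toy = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.30, 0.40],
])
scores = pca_scores(toy)  # column 0 = PC1, column 1 = PC2
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives one point per row, which is the "one PCA component vs. another" view described above.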

Wednesday, September 19, 2012

Paint Me a Rainbow: Painting World 22 ancestral components

This update is concerned with the inter-related concepts of "chromosome painting" and admixture. Modern genetics and personal genomics, particularly in the last 2-5 years, have paid considerable attention to them, sometimes under the rubric of "determining the ancestral origin of genomic segments".

Although my earlier experiments met with moderate success, I wasn't satisfied with the results and decided to postpone further experiments with chromosome painting to a future day. In so doing, I came to appreciate more and more the distinction between the population-stratification algorithms implemented in LAMP and ADMIXTURE.

I have already discussed the differences between LAMP and ADMIXTURE, but to illustrate the idea of the experiment I need to return to my previous explanation:

1) The ADMIXTURE software implements a model-based approach that estimates ancestry coefficients as the parameters of a statistical model. It is also important to add that the model-based approach in ADMIXTURE rests on the global ancestry paradigm (i.e. the goal of an ADMIXTURE/STRUCTURE analysis is to estimate the proportion of ancestry from each contributing population, considered as an average over the individual's entire genome).
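
To make the "parameters of a statistical model" idea concrete, here is a minimal sketch of the binomial log-likelihood that this kind of global-ancestry model maximizes over the ancestry fractions Q and the allele frequencies F (constant terms dropped). The function name and the toy numbers are my own illustration, not ADMIXTURE's code, and the optimization step itself is not shown.

```python
import numpy as np

def admixture_loglik(G, Q, F):
    """Log-likelihood of genotypes under a global-ancestry model.

    G: (n_individuals, n_snps) counts of the reference allele (0, 1 or 2)
    Q: (n_individuals, K) ancestry fractions, each row summing to 1
    F: (K, n_snps) reference-allele frequencies in the K ancestral populations
    """
    G, Q, F = (np.asarray(a, dtype=float) for a in (G, Q, F))
    p = Q @ F  # expected allele frequency for each individual and SNP
    return float(np.sum(G * np.log(p) + (2 - G) * np.log(1 - p)))

# Toy example (invented numbers): 2 individuals, 3 SNPs, K = 2
G = [[0, 1, 2], [2, 1, 0]]
F = [[0.1, 0.5, 0.9], [0.8, 0.5, 0.2]]
Q_good = [[0.9, 0.1], [0.2, 0.8]]
Q_bad = [[0.2, 0.8], [0.9, 0.1]]
ll_good = admixture_loglik(G, Q_good, F)
ll_bad = admixture_loglik(G, Q_bad, F)
```

Note that Q has a single row per individual: that is exactly the "average over the entire genome" mentioned above, with no information about where on the chromosomes each ancestry sits.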

2) The LAMP software is built upon WINPOP, an efficient dynamic-programming algorithm that infers locus-specific ancestries. The genome is partitioned into chromosome segments of definite ancestral origin (overlapping, contiguous windows of SNPs), and a likelihood model is optimized over each window. The goal then is to find the segment boundaries and assign each segment's origin.
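
The window idea can be sketched as follows. This is a deliberately naive illustration and not the WINPOP algorithm: it assumes one phased haplotype coded 0/1 and known allele frequencies for each reference population, treats SNPs as independent, and simply picks the most likely population in each overlapping window. All names and numbers are hypothetical.

```python
import numpy as np

def sliding_windows(n_snps, size, step):
    """Yield (start, end) index pairs of overlapping windows over n_snps markers."""
    start = 0
    while start < n_snps:
        end = min(start + size, n_snps)
        yield start, end
        if end == n_snps:
            break
        start += step

def best_population_per_window(haplotype, freqs, size, step):
    """Return (start, end, index of the most likely population) for each window."""
    h = np.asarray(haplotype)
    # Clip frequencies away from 0 and 1 so the logarithms stay finite
    f = np.clip(np.asarray(freqs, dtype=float), 1e-6, 1 - 1e-6)
    calls = []
    for start, end in sliding_windows(len(h), size, step):
        hw, fw = h[start:end], f[:, start:end]
        loglik = (hw * np.log(fw) + (1 - hw) * np.log(1 - fw)).sum(axis=1)
        calls.append((start, end, int(np.argmax(loglik))))
    return calls

# Toy haplotype (invented): first half looks like population 0, second half like population 1
haplotype = [1, 1, 1, 1, 0, 0, 0, 0]
freqs = [[0.9] * 8, [0.1] * 8]
calls = best_population_per_window(haplotype, freqs, size=4, step=2)
```

Real methods add a model of recombination on top of this, so that segment boundaries are inferred rather than fixed by the window grid.
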

I understand the problem in terms of ancestry assignment. In my experience, methods based on the inference of locus-specific ancestries are usually very accurate for ancestral deconvolution of genotype data and have consistently been shown to do better than the more popular statistical and PCA-based methods, while being able:
1) to handle more than two ancestral populations
2) to model the paths of recombination between ancestral segments.

In June 2012, the Jason Mezey Lab (Cornell University) released SupportMix, a machine learning algorithm for determining the ancestral origin of genomic segments when analyzing individuals from a population with a recent or ancient history of admixture. As regards accuracy, the authors argued that SupportMix provides a robust tool for accurate ancestral assignment by simultaneous analysis of a worldwide selection of ancestral populations. Such analyses will be critical for accurate assignment in the many worldwide admixed populations that are likely to have unexpected ancestry reflecting a richer history than is known from anthropological or historical studies (from the provisional paper: Omberg et al. 2012, "Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations"). The cited paper includes a number of other claims important for any analysis of genetic admixture: the accuracy of ancestry assignment was lower for more closely related populations but still better than that of LAMP-ANC, a method that has been shown to consistently outperform other ancestry deconvolution methods.

This overly optimistic conclusion swayed my choice between LAMP-ANC and SupportMix in favor of the latter. To be honest, I'm not the first genome blogger to use SupportMix: in July 2012 Polako from Eurogenes carried out a locus-specific analysis of Finnish genomes using SupportMix. I decided to repeat this experiment. There is, of course, a significant difference between Polako's analysis and my experiment. While Polako used modern populations as 'putative' donor populations, the final goal of my design was to imitate the results of the long-awaited update of 23andme's Ancestry Painting, which is being updated to offer more detailed results based on approximately 20 world regions, drawn from both customer data and academic reference populations. To do so, I used a dummy set of 22 putative ancestral populations simulated from the allele frequencies of the World-22 calculator.
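
One simple way to build such a dummy reference population from allele frequencies is sketched below. It is an assumption-laden illustration rather than the exact procedure used for the project: each allele is drawn independently with the component's frequency at that SNP, so linkage disequilibrium is ignored, and the function name and toy frequencies are invented.

```python
import numpy as np

def simulate_haplotypes(allele_freqs, n_haplotypes, seed=0):
    """Draw 0/1 haplotypes, one independent Bernoulli trial per SNP.

    allele_freqs: frequency of the allele coded as 1 at each SNP
    Returns an array of shape (n_haplotypes, n_snps).
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(allele_freqs, dtype=float)
    # A uniform draw in [0, 1) falls below p with probability p
    return (rng.random((n_haplotypes, p.size)) < p).astype(np.int8)

# Toy frequencies (invented) for one simulated component at three SNPs
haps = simulate_haplotypes([0.0, 1.0, 0.5], n_haplotypes=20)
```

Pairing consecutive rows of `haps` gives phased diploid "individuals" for that component.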

The experiment

SupportMix requires at least three input files: one file for each of the putative ancestral populations and one file containing the genetic information of the admixed individuals, with the additional requirement that the markers be phased. Each population should be represented by two files in Plink transposed format, a .tped file and a .tfam file.

The markers (80751 SNPs) were phased chromosome by chromosome using the default settings of the BEAGLE software. While the original Plink format does not specify the order of the alleles in the file, SupportMix works with phased data. Keeping that in mind, I converted the BEAGLE-phased dataset directly into Plink's tped format without pre-processing the dataset in Plink (hint: Kantele's beagle_to_tped script). Then I used UNIX text-processing utilities to extract the genotypes of the ancestral populations into the corresponding subsets ('references') and split the project dataset (93 individuals) into 9 subgroups.
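
For those unfamiliar with the transposed Plink layout, here is a minimal sketch of what one .tped row and one .tfam row look like when written from phased data. It is not the beagle_to_tped script mentioned above: the helper names, the SNP identifier, the positions and the family ID "MDLP" are made up. The point is only that the first allele of each pair is taken from haplotype a and the second from haplotype b, so the phase survives the conversion.

```python
def tped_line(chrom, snp_id, genetic_pos, bp_pos, haplotype_pairs):
    """One .tped row: chromosome, SNP id, genetic position, bp position,
    then two alleles per individual (haplotype a first, haplotype b second)."""
    alleles = " ".join(f"{a} {b}" for a, b in haplotype_pairs)
    return f"{chrom} {snp_id} {genetic_pos} {bp_pos} {alleles}"

def tfam_line(family_id, individual_id, father="0", mother="0",
              sex="0", phenotype="-9"):
    """One .tfam row: family, individual, father, mother, sex, phenotype."""
    return f"{family_id} {individual_id} {father} {mother} {sex} {phenotype}"

# Toy example (invented SNP and positions), two individuals at one marker
row = tped_line(1, "rs1", 0.5, 1000, [("A", "G"), ("G", "G")])
fam = tfam_line("MDLP", "V199")
```

The .tfam file has one such row per individual, in the same order as the allele pairs in every .tped row.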

Finally, I interpolated the genetic map position of each SNP along its chromosome using the Rutgers genetic maps.
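
The interpolation step can be sketched like this, assuming plain linear interpolation between the mapped markers of one chromosome. The map values below are invented and are not taken from the Rutgers maps, and the function name is hypothetical.

```python
import numpy as np

def interpolate_cm(bp_positions, map_bp, map_cm):
    """Linearly interpolate genetic positions (cM) at the given bp positions.

    map_bp must be sorted in increasing order; SNPs lying outside the map
    receive the cM value of the nearest map end.
    """
    return np.interp(bp_positions, map_bp, map_cm)

# Toy map (invented) for a single chromosome
map_bp = [1000, 2000, 4000]
map_cm = [0.0, 1.0, 2.0]
cm = interpolate_cm([1500, 3000, 5000], map_bp, map_cm)
```

This has to be done separately for each chromosome, since every chromosome has its own map.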

SupportMix was run with a configuration file specifying the default options (window_size=400, generations_from_admixture_event=6).

Below are some screenshots of SupportMix output for the MDLP project participants:

Please note that each 'recipient' (i.e. the project participant) is represented by two phased chromosomes, e.g. V199_a and V199_b. The color legend of the components used in the analysis is attached to the right side of the plot.

The complete set (Chr. 1-22 for all project participants) in tar.gz format (14.6 Mb) can be downloaded here.

SupportMix results: what to do next.

First of all, I encourage every participant of my project to compare their results with the "chromosome painting" of the MDLP World-22 calculator on John Olson's Gedmatch site. I have to allow for the possibility that the painting on the Gedmatch site might differ from the one produced by SupportMix. I would also suggest comparing SupportMix's paintings with other calculators' paintings, 23andme's Ancestry Finder, etc.
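
One way to make such a comparison less subjective is to collapse a painting into genome-wide proportions, which can then be set side by side with a calculator's percentages. The sketch below assumes you have already transcribed the painted windows of both haplotypes into (number of SNPs, component) pairs. That input layout is my own invention, not a SupportMix output format, and the component labels are placeholders.

```python
from collections import Counter

def global_fractions(window_calls):
    """Turn per-window ancestry calls into genome-wide proportions.

    window_calls: iterable of (n_snps_in_window, component_label) pairs,
    pooled over both phased haplotypes of one participant.
    """
    totals = Counter()
    for n_snps, label in window_calls:
        totals[label] += n_snps
    grand_total = sum(totals.values())
    return {label: count / grand_total for label, count in totals.items()}

# Toy painting (invented): four windows of 400 SNPs each
painting = [(400, "component_1"), (400, "component_2"),
            (400, "component_1"), (400, "component_1")]
fractions = global_fractions(painting)
```

Large gaps between these proportions and the calculator's percentages would point to the components where the two approaches disagree.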

If you are familiar with the basics of image-editing software, it is a good idea to cut and merge your chromosomes into a composite image (see example below):

Chromosome Painting (in 23andme's  style)

Chromosome Painting (an imitation of 23andme's Ancestry Finder)