Magnus Ducatus Lituaniae Project: 2011-05-29

Thursday, June 2, 2011

Graphoanalytical approach to the analysis of genetic relationships

It goes without saying that admixture analysis and MDS and PCA (statistical) calculations are the most typical methods for inferring the underlying genetic structure of human populations. But sometimes it is very useful to visualize the shared genetic components (or, rather, segments) as graphs. Graph visualization can be used to better understand the underlying patterns of genetic effects (such as genetic drift, the founder effect, etc.).

Earlier I attempted to reformulate the basic aspects of the graphoanalytical approach, proposing graph visualization techniques that can be easily applied to the analysis of IBD (identical by descent) segments. I even succeeded in producing a neat graphical output for IBD files, but the method itself required further amendment and refinement.

As I was willing to invest more time and effort into turning my preliminary considerations of graph visualization into something meaningful, this morning I decided to repeat my previous experiment. Prior to running the routine PLINK commands for detecting IBD segments, I removed almost all individuals whose Z-score lay 3 or more standard deviations from the mean. Another subset of individuals (c. 9% of the whole dataset) was removed due to surprisingly high values of the inbreeding coefficient F (in PLINK, the calculation of the inbreeding coefficient F is based on the observed versus expected number of homozygous genotypes). The champions of the inbreeding coefficient F are Orcadians and some Lithuanians/Belarusians, whose F-score is significantly higher than the European average value of F. As some of the project's participants might be interested in re-evaluating the Z-score and the inbreeding coefficient F, I've uploaded the corresponding NEAREST and HET files.
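For those who want to re-check the HET file by hand, here is a minimal Python sketch (not the commands I actually ran). It assumes the usual PLINK .het columns O(HOM), E(HOM) and N(NM), from which F = (O - E) / (N - E). The function names, the toy numbers and the generic 3-SD filter are illustrative only.

```python
from statistics import mean, stdev

def inbreeding_f(obs_hom, exp_hom, n_nonmissing):
    """F from observed vs expected homozygous genotype counts."""
    return (obs_hom - exp_hom) / (n_nonmissing - exp_hom)

def sd_outliers(scores, n_sd=3.0):
    """Return the IDs whose score lies n_sd or more sample SDs from the mean."""
    values = list(scores.values())
    mu, sd = mean(values), stdev(values)
    return {iid for iid, v in scores.items() if abs(v - mu) >= n_sd * sd}

# Toy values, not taken from the project files.
print(round(inbreeding_f(7000, 6500, 10000), 4))

toy_scores = {f"ind{i}": 0.0 for i in range(19)}
toy_scores["ind_far"] = 10.0
print(sd_outliers(toy_scores))
```

Note that with very small samples no individual can reach 3 sample SDs from the mean, so a filter like this only makes sense on the full dataset.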

Then I reduced the dataset by performing linkage-disequilibrium-based SNP pruning and removing low-quality SNPs. Next I created the genetic map (in cM) for the remaining set of circa 72,000 SNPs (this time I sorted and joined the interpolated data from HapMap) and ran the segmental sharing analysis in PLINK (it took c. 70 minutes to complete, since the scope of the analysis was restricted to the low sharing 1cM/1Mb SNPs). The resulting file (available for downloading here) was converted into a delimited table format and then read into R for network analysis with igraph.
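The interpolation step can be illustrated with a short Python sketch. This is a simplified stand-in for what I did, assuming plain linear interpolation between the two flanking reference markers and clamping outside the reference range. The function name and the toy map are hypothetical.

```python
from bisect import bisect_left

def interpolate_cm(pos_bp, ref_bp, ref_cm):
    """Linearly interpolate a genetic position (cM) for a physical position (bp).

    ref_bp must be sorted ascending; ref_cm holds the matching cM values.
    Positions outside the reference range are clamped to the nearest end.
    """
    if pos_bp <= ref_bp[0]:
        return ref_cm[0]
    if pos_bp >= ref_bp[-1]:
        return ref_cm[-1]
    i = bisect_left(ref_bp, pos_bp)
    if ref_bp[i] == pos_bp:
        return ref_cm[i]
    x0, x1 = ref_bp[i - 1], ref_bp[i]
    y0, y1 = ref_cm[i - 1], ref_cm[i]
    return y0 + (y1 - y0) * (pos_bp - x0) / (x1 - x0)

# Toy reference map, not real HapMap values.
ref_bp = [1000, 2000, 4000]
ref_cm = [0.0, 1.0, 2.0]
print([interpolate_cm(p, ref_bp, ref_cm) for p in (500, 1500, 2000, 3000)])
```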

I used the values of NSEG (the number of pairwise shared segments; see the INDIV.SUMMARY file) to create a weighted undirected graph. The graph was then analyzed using igraph's custom methods of community detection by extended modularity and the calculation of weighted betweenness. Vertex and edge betweenness are defined by the number of geodesics (shortest paths) going through a vertex or an edge. With all preprocessing of the graph done in R, I converted the igraph object into GraphML and pulled it into Gephi to apply some minor visual and aesthetic tweaks.
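To make the betweenness definition concrete, here is a small Python sketch of weighted vertex betweenness (Brandes' algorithm with Dijkstra). It is not the R/igraph code used for the actual graph. One loud assumption: shortest-path routines treat edge weights as distances, so the sketch converts NSEG into a distance of 1/NSEG (more shared segments means closer). The post does not say whether such a conversion was applied, and the function names and toy pairs are illustrative.

```python
import heapq
from collections import defaultdict
from fractions import Fraction

def build_graph(pairs):
    """pairs: (id1, id2, nseg) rows. Edge distance = 1/nseg (an assumption)."""
    adj = defaultdict(dict)
    for a, b, nseg in pairs:
        if nseg > 0:
            adj[a][b] = adj[b][a] = Fraction(1, nseg)  # exact arithmetic, no float ties
    return adj

def vertex_betweenness(adj):
    """Number of weighted geodesics through each vertex (undirected graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist, sigma, preds = {s: 0}, defaultdict(float), defaultdict(list)
        sigma[s] = 1.0
        done, order, heap = set(), [], [(0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue  # stale heap entry
            done.add(v)
            order.append(v)
            for w, wt in adj[v].items():
                nd = d + wt
                if w not in dist or nd < dist[w]:
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w] and w not in done:
                    sigma[w] += sigma[v]  # another geodesic of equal length
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2.0 for v, b in bc.items()}  # each pair was counted twice

# Toy NSEG values: A and C share little directly but both share a lot with B.
toy = build_graph([("A", "B", 4), ("B", "C", 4), ("A", "C", 1)])
print(vertex_betweenness(toy))
```

In this toy case the geodesic between A and C runs through B, so B is the only vertex with non-zero betweenness.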

You can see the final graph in its full glory in PDF format (with lossless, scale-invariant rendering), although a better representation of the graph is achieved in Gephi (compare the pictures uploaded above).

To complete the picture of the stratification patterns within the MDL project's populations, we should add some. We advise all participants to compare their positions on the PCA and MDS plots.

Wednesday, June 1, 2011

PCA plots for reference populations and project participants

I have uploaded some PCA plots to satisfy the demands of some impatient participants (including the colour-blind individuals who complained about the low quality of the previously uploaded PCA-MDS plots). I did my best and hope that the quality of the visualizations is now satisfactory (click to resize the attached images).

PCA1-PCA2

PCA1-PCA2 with trendline

PCA5-PCA1

For the curious, I have also attached 5 principal components (eigenvalues) and eigenvectors, calculated in Eigensoft (please check the attached spreadsheet):

Tuesday, May 31, 2011

Troubles with StepPCO, Part II: The Solution

Now I would like to return to the previously discussed problems with instantiating the StepPCO model.

I've been able to contact one of the software developers, and it appears that my troubles stem from the fact that spco cannot find a window that would separate the parental groups.

Indeed, if one looks at the PCA-based plot (see attached below), the samples I chose as parental groups (pop1 (Lithuanians)=colnames(Chr1[,27:32]), pop2 (Belarusians)=colnames(Chr1[,72:78])) are found right in the middle of the "cloud", so there is no separation between them (no matter how big a window one takes). In StepPCO, an extension of principal component analysis (PCA) is used to obtain a signal of admixture from an individual genome, starting from the center of each window and increasing the window until the mean PC1 coordinates for the parental populations are separated by three standard deviations ("3 sigmas") from each mean. The goal is to achieve a complete separation of the parental populations within each window, so that there is no ambiguity in assigning chromosomal segments in an admixed genome to either ancestral population.
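The window-growing idea can be sketched in a few lines of Python. This is not StepPCO's actual code. The `project` callback (which should return the PC1 coordinates of the two parental groups for a given SNP range) is hypothetical, and reading "3 sigmas from each mean" as non-overlapping 3-SD intervals is my own assumption.

```python
from statistics import mean, stdev

def separated(pc1_a, pc1_b, n_sigma=3.0):
    """True if the two groups' mean +/- n_sigma*SD intervals do not overlap."""
    gap = abs(mean(pc1_a) - mean(pc1_b))
    return gap >= n_sigma * (stdev(pc1_a) + stdev(pc1_b))

def grow_window(center, n_snps, project, start_half=10, step=10):
    """Widen a window around `center` until the parental groups separate.

    Returns (lo, hi) SNP indices, or None if even the whole chromosome
    fails to separate them (the situation described above).
    """
    half = start_half
    while True:
        lo, hi = max(0, center - half), min(n_snps, center + half)
        pc1_a, pc1_b = project(lo, hi)
        if separated(pc1_a, pc1_b):
            return lo, hi
        if lo == 0 and hi == n_snps:
            return None
        half += step

# Toy projections: in the first, separation grows with window width;
# in the second, both groups sit in the same "cloud" whatever the width.
def toy_distinct(lo, hi):
    shift = (hi - lo) / 100
    noise = (-0.1, 0.0, 0.1)
    return [-shift + e for e in noise], [shift + e for e in noise]

def toy_overlapping(lo, hi):
    return [-0.1, 0.0, 0.1], [-0.1, 0.0, 0.1]

print(grow_window(50, 100, toy_distinct))
print(grow_window(50, 100, toy_overlapping))
```

The second call never finds a separating window, which mirrors what happens when the chosen parental groups are too similar.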

Thus, in our case, there is no signal of (either complete or partial) separation between pop1 and pop2, and this lack of signal implies that the parental populations (or individuals) were not chosen correctly.

Yesterday I decided to give StepPCO a second chance (since this excellent piece of R code really deserved it). This time I limited the scope of my analysis to a thinned set of high-quality SNPs (2212 independent SNPs in linkage equilibrium) on chromosome 22.
Then I extracted 96 reference individuals from my dataset (HGDP+Behar dataset: Orcadians, Romanians, Russians, Belarusians and Lithuanians) and recalculated the mean PCA coordinates for the populations of this new pruned dataset, omitting the "real" participants of the MDL project. Instead of the default 3-SD window size, I set the window.size parameter to 100 SNPs.

PCA plot. Orcadians=forestgreen, Romanians=blue, Hungarians=brown, Lithuanians=red, Belarusians=orange, Russians=purple.

Then I calculated the StepPCO parameters for the selected genotypes, using HGDP's Orcadians and Russians as parental populations (or, to phrase it more properly, distinct populations), and ran a wavelet transform analysis on chosen individuals from the Hungarian, Romanian, Belarusian and Lithuanian populations.

The real advantage of StepPCO is that the calculated WT coefficients can be used to obtain accurate estimates of the time of admixture from suitable genome-wide SNP data in ancestral populations that can be statistically differentiated:

"The spectral analysis of the StepPCO signal revealed that the average dominant frequency for the African-Americans is located at level 1.8, which would correspond to an abundance of low frequency wavelets (that is, wider ancestry blocks), while for the Fijians and the Polynesians the average dominant frequency is at level 3.06 and 3.63 respectively, which is indicative of much narrower ancestry blocks (Figure 7). Based on simulations, the WT center of 1.8 corresponds to an admixture time of 6 generations ago (95% CI: 4-8 generations) for the African Americans. Assuming a generation time of 30 years [33], our results indicate that the admixture in the African Americans started about 180 years ago. Similarly, the simulations indicate that the WT center of 3.63 for the Polynesians corresponds to an admixture time of 90 generations (95% CI: 77-131 generations), or about 2,700 years ago (Figure 8). The time estimation for Fiji is based on simulated data with a 40% admixture rate (to match the higher admixture rate of Fiji), and here the WT center of 3.06 corresponds to an admixture time of 37 generations (95% CI: 29-39) or about 1,100 years ago. "

Using the same method, StepPCO's authors estimated an average of 19% European ancestry in African-Americans, with a wide range of less than 5% to more than 40% European ancestry across individuals. Both the average and the observation of a wide range of individual admixture estimates are in keeping with previous studies. The estimated time of admixture is about 180 years ago (95% CI: 120-240 years ago), which is probably an underestimate since admixture in the African-American population is ongoing.
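The conversion from generations to years in the quoted figures is plain arithmetic. A tiny Python sketch, using the 30-year generation time assumed in the quoted passage (the function name is mine):

```python
GENERATION_YEARS = 30  # generation time assumed in the quoted passage

def generations_to_years(point, ci_low, ci_high, gen_years=GENERATION_YEARS):
    """Convert an admixture time and its CI bounds from generations to years."""
    return point * gen_years, (ci_low * gen_years, ci_high * gen_years)

# African Americans: 6 generations (95% CI: 4-8 generations)
print(generations_to_years(6, 4, 8))
# Polynesians: 90 generations (95% CI: 77-131 generations)
print(generations_to_years(90, 77, 131))
```

The first call reproduces the 180 years (120-240 years) figure, and the second gives 2,700 years for the Polynesians.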

Returning to our own sample of 96 reference individuals, we tested the performance of the StepPCO method on Orcadians and Russians. We calculated WT centers for the statistically differentiated individuals (Orcadians: HGDP00806, HGDP00805, HGDP00801, HGDP00804, HGDP00797, HGDP00810; Russians: HGDP00879). The most common WT center is 2.44, which probably corresponds to an admixture time of 15-20 generations (the correct corresponding value can be obtained using simulations, but I haven't done that yet).