Friday, May 13, 2011

Admixture clusters, Mclust and populations concordance

The first Excel spreadsheet includes three sheets, one for each ADMIXTURE run (K=4, K=7, K=8).

The second Excel spreadsheet adds further statistics:
1) Population concordance ratio (read more about this test in Dienekes' blog) - each row represents a test in which pairs of individuals from one population were compared against individuals from the population of each column;
each column represents individuals from one population used as an "outgroup" for comparison against the pairs of individuals from each row.

2) Mclust - using the Mclust package, one can check whether the ADMIXTURE program identifies the clusters correctly

3) K=8 results for the project's populations and members
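To make the row/column layout of the concordance table concrete, here is a minimal sketch in Python. It rests on an assumption about the test: a triplet counts as concordant when the two individuals from the row population are closer to each other than either is to the outgroup individual from the column population. The exact rule in Dienekes' test may differ, and the function name, sample IDs and distance values are all hypothetical.

```python
from itertools import combinations


def concordance_ratio(dist, row_pop, col_pop):
    """Fraction of (pair, outgroup) triplets judged concordant.

    Assumed rule: the pair from row_pop must be closer to each other
    than either member is to the outgroup individual from col_pop.
    """
    concordant = 0
    total = 0
    for a, b in combinations(row_pop, 2):
        for out in col_pop:
            if out in (a, b):
                continue  # needed when row and column populations coincide
            total += 1
            if dist[a][b] < min(dist[a][out], dist[b][out]):
                concordant += 1
    return concordant / total if total else float("nan")


# Toy symmetric distance table (hypothetical values, e.g. 1 - IBS similarity).
dist = {
    "A1": {"A1": 0.00, "A2": 0.10, "B1": 0.30, "B2": 0.08},
    "A2": {"A1": 0.10, "A2": 0.00, "B1": 0.25, "B2": 0.28},
    "B1": {"A1": 0.30, "A2": 0.25, "B1": 0.00, "B2": 0.12},
    "B2": {"A1": 0.08, "A2": 0.28, "B1": 0.12, "B2": 0.00},
}

ratio = concordance_ratio(dist, ["A1", "A2"], ["B1", "B2"])
print(ratio)
```

One cell of the spreadsheet table would then correspond to one call with a given row population and column population.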

IBS similarity matrix in R

Distributing the IBS similarity matrix as an R data object has become good practice in other BGA projects. It is our turn to share the IBS R data with the project's members.

Please feel free to download the R data file here.

The instructions assume that you already have R installed:
1) In the R console, type closest("ID"), where ID is a member's ID in the project.
2) The default number of "neighbours" is 20, but you can change it by specifying the second argument, for example closest("V158",50).
3) Have fun.
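
For readers curious what a lookup like closest() does, here is a minimal Python sketch of the idea: sort one row of the similarity matrix and keep the top entries. It is not the R function shipped in the data file. The dictionary layout, the similarity values and every sample ID other than "V158" are hypothetical.

```python
def closest(ibs, sample_id, n=20):
    """Return the n samples most similar to sample_id, best match first."""
    row = ibs[sample_id]
    # Drop the self-comparison, then rank the rest by similarity.
    others = [(other, sim) for other, sim in row.items() if other != sample_id]
    others.sort(key=lambda item: item[1], reverse=True)
    return others[:n]


# Toy similarity row (hypothetical values).
ibs = {
    "V158": {"V158": 1.000, "V001": 0.731, "V002": 0.742, "V003": 0.725},
}

neighbours = closest(ibs, "V158", 2)
print(neighbours)
```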

Monday, May 9, 2011

IBD sharing

A couple of weeks ago I ran a pairwise segmental IBD sharing analysis on my dataset (160 samples) in PLINK. The analysis calculates genome-wide IBD from IBS information, provided that a large number of independent, high-quality SNPs (c. 300,000) is available. To get such a set, I pruned the dataset using the given MIND, MAF and LD thresholds: the segmental sharing analysis requires approximately independent SNPs (i.e. in linkage equilibrium) in a homogeneous sample of unrelated individuals.

I followed the strategy suggested by the creators of PLINK software:

plink --bfile mydata2 --mind 1 --geno 0.01 --maf 0.05 --make-bed --out mydata3
followed by
plink --bfile mydata3 --indep-pairwise 100 25 0.2
followed by
plink --bfile mydata3 --extract plink.prune.in --make-bed --out mydata4
I also removed a couple of relatives, since the focus of this analysis is to look for extended haplotypes shared between distantly related individuals: having very closely related individuals  will likely swamp the results of the analysis.
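
The three pruning steps above can also be scripted. The sketch below only assembles the command lines in Python; the helper name is hypothetical, and it assumes the LD-pruned SNP list is written to PLINK's default plink.prune.in, since the second command does not set --out. The commands are printed rather than executed, so running the sketch does not require PLINK.

```python
def pruning_commands(source, filtered, pruned, prune_list="plink.prune.in"):
    """Build the three PLINK calls: QC filter, LD pruning, SNP extraction."""
    return [
        # 1. Missingness and minor-allele-frequency filters.
        ["plink", "--bfile", source, "--mind", "1", "--geno", "0.01",
         "--maf", "0.05", "--make-bed", "--out", filtered],
        # 2. Pairwise LD pruning: window of 100 SNPs, step 25, r^2 threshold 0.2.
        ["plink", "--bfile", filtered, "--indep-pairwise", "100", "25", "0.2"],
        # 3. Keep only the SNPs that survived pruning.
        ["plink", "--bfile", filtered, "--extract", prune_list,
         "--make-bed", "--out", pruned],
    ]


commands = pruning_commands("mydata2", "mydata3", "mydata4")
for cmd in commands:
    print(" ".join(cmd))
    # To actually run a step (requires plink on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```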

Here are the results of the PLINK analysis (in Excel format; with more than 400,000 rows it cannot be uploaded to Google Docs) (download link).

As I was expecting, the strongest evidence of IBD sharing was detected on chromosome 6. Although the total number of pairwise shared segments (175) appears to be proportional to the length of chromosome 6, the average amount shared per segment is about 2-3 times larger than for IBD segments on other chromosomes. The intersection of the chromosome 6 shared segments also produces apparent clusters. This preponderance of segmental IBD sharing on chromosome 6 is readily explained by some well-known properties of its genomic regions. Human chromosome 6 has one of the highest densities of genotyped SNPs (SNPs per kbp), and most of those SNPs are in very strong linkage (which also means that the default PLINK strategy for obtaining a set of SNPs in linkage equilibrium simply won't work for chromosome 6).
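A per-chromosome comparison of this kind can be tabulated with a few lines of Python. This is a sketch only: it assumes each shared segment has already been reduced to a (chromosome, length in kb) pair, takes segment length as the measure of sharing, and uses made-up toy values rather than the project's results.

```python
from collections import defaultdict


def summarize_by_chromosome(segments):
    """Map chromosome -> (number of segments, mean segment length in kb)."""
    counts = defaultdict(int)
    total_kb = defaultdict(float)
    for chrom, length_kb in segments:
        counts[chrom] += 1
        total_kb[chrom] += length_kb
    return {c: (counts[c], total_kb[c] / counts[c]) for c in counts}


# Toy input (hypothetical values, not taken from the PLINK output).
segments = [(6, 4200.0), (6, 3800.0), (1, 1500.0), (2, 1700.0), (6, 4000.0)]

summary = summarize_by_chromosome(segments)
for chrom in sorted(summary):
    n_segments, mean_kb = summary[chrom]
    print(chrom, n_segments, round(mean_kb, 1))
```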
Chromosome 6 also contains the human leukocyte antigen (HLA) system, which is the name of the major histocompatibility complex (MHC) in humans. This super-locus contains a large number of genes related to immune system function; they encode cell-surface antigen-presenting proteins among many other products. Recent studies in medical genetics have revealed a high level of IBD sharing in HLA, which is indicative of strong natural selection affecting the otherwise discrete, Poisson-like distribution of IBD sharing probabilities. This particular property of IBD sharing in the HLA region is discussed in the paper "Natural Selection and the Distribution of Identity-by-Descent in the Human Genome":

The largest excess of IBD sharing is in the HLA region on chromosome 6, where the mean posterior probability of IBD sharing within the CEPH individuals is about 0.06. The inferred IBD tracts on the entire chromosome are plotted in Figure S7A and the inferred tracts in the HLA region are plotted in Figure S7B. In total we infer 312 IBD tracts in the HLA region. The length distribution for these tracts is seen in Figure S7C. The length of most of the tracts are in the range of 0.5–7 Mb long in physical distances, but in genetic distances they are only 0.5–3.5 cM long (not shown). These tracts contain 30–309 SNPs; however, all SNPs contribute information to the inference of the tracts. Three of the tracts are exceptionally long (not shown in the length distribution). They belong to two unreported avuncular pairs and a sib pair present in the data. Excluding these pairs of individuals has little or no impact on the size of the overall signal (not shown).

The HLA region has some extreme recombination patterns and it has some of the highest amounts of LD found in the genome. In addition, it is a region that is difficult to genotype accurately because of the high degree of duplicated genes and structural variants. We therefore performed an additional test to validate the inferred IBD tracts. A subset of the CEPH sample consisting of 56 individuals overlaps with the original HapMap population. These individuals have been extensively genotyped using various genotyping platforms (International Hapmap Consortium 2007) with extra genotyping in the HLA region (De Bakker et al. 2006). Of the more than 10,000 SNPs available in the HapMap phase 2 database for this region we used only a small subset for our inference. This provided us the opportunity to validate our inferred IBD tracts using the additional data. If an inferred tract for a pair of individuals is truly an IBD tract then those two individuals should have at least one allele in common for all SNPs within a tract. If the two individuals do not share an allele identical-by-state (IBS) then the SNP is not compatible with the IBD tract. In Figure S8 we show the percentage of SNPs from the HapMap phase 2 genotyping data in each of the inferred tracts that do not share at least one allele IBS (incompatible). Of the 105 tracts inferred for the subset of individuals, only 4 have more than 0.5% (the genotyping error rate we used throughout this study) of the SNPs with an incompatibility in the HapMap phase 2 data. The vast majority of tracts have less than 0.1%. In contrast, if we randomly permute the individuals, we see that most of the tracts have between 1% and 10% pairwise incompatible genotypes. Thus the IBD region inferences within the HLA region are compatible, not only with the SNPs used for inferring the IBD tracts, but with all the available phase 2 HapMap SNPs in this region. 
This shows that the IBD tracts are not incorrectly inferred due to possible confounding factors, such as remaining LD, but are indeed real. Because the HLA region is hard to genotype we also examined the region for additional deviations from Hardy–Weinberg equilibrium. Using a Mann–Whitney test we found no difference in the distribution P-values between the HLA region and other parts of the genome. Finally, if IBD tracts were erroneously inferred due to genotyping error caused by hidden structural variants, we would expect to observe an excess of heterozygosity in individuals in IBD tracts compared to individuals not in IBD tracts in the same regions. We found no such excess in the HLA region (not shown).
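
The compatibility check described in the quoted passage (two individuals in a true IBD tract must share at least one allele at every SNP in the tract) is easy to express in code. Below is a minimal Python sketch, not the authors' implementation. It assumes genotypes are given as pairs of allele letters with None for a missing call, and the toy genotypes are invented.

```python
def incompatible_fraction(genotypes_a, genotypes_b):
    """Fraction of called SNPs at which two individuals share no allele IBS."""
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("both individuals must cover the same SNPs")
    compared = 0
    incompatible = 0
    for ga, gb in zip(genotypes_a, genotypes_b):
        if ga is None or gb is None:
            continue  # skip SNPs with a missing genotype call
        compared += 1
        if not set(ga) & set(gb):
            incompatible += 1  # e.g. C/C versus T/T
    return incompatible / compared if compared else 0.0


# Toy genotypes for one inferred tract (hypothetical data).
person_1 = [("A", "A"), ("A", "G"), ("C", "C"), ("T", "T"), None]
person_2 = [("A", "G"), ("G", "G"), ("T", "T"), ("T", "C"), ("A", "A")]

fraction = incompatible_fraction(person_1, person_2)
print(fraction)
```

A tract would then be flagged when this fraction exceeds the assumed genotyping error rate, which is 0.5% in the quoted study.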
The HLA region is also important for population genetics. The authors of the paper "Linkage disequilibrium and age of HLA region SNPs in relation to classic HLA gene alleles within Europe" discovered an important pattern of haploblock sharing: haplotype blocks detected within the HLA region, as well as classic HLA gene alleles and SNPs, were predictive of northern versus southern European population membership (misclassification error rates ranged from 0 to 23%, depending on which independent population was used for prediction), indicating that this region may be a rich source of ancestry informative markers.

HLA loci are now believed to belong to the set of ancestry informative markers that are used for inferring genetic barriers on a worldwide scale.

In order to test for the existence of long extended IBD haplotypes, I ran GERMLINE (an algorithm for discovering long shared IBD segments between pairs of individuals in a large population) on a chromosome 6 subset extracted from the project "populations". GERMLINE takes phased genotype data for individuals as input and generates a list of all pairwise shared segments. I used a pipeline that phases PLINK-format data with BEAGLE and then processes it in GERMLINE.

The analysis revealed a clear genetic signal of IBD sharing between Chuvashes and Finns. I haven't checked the SNPs in the detected haplotypes for ancestry information, but I assume that most of them are the result of strong natural selection.