My Blog List

Thursday, December 29, 2011

Last posting in the year 2011

Earlier this month I experimented with Chromopainter and fineSTUCTURE software. For the sake of comparision with ADMIXTURE/LAMP results, i used PLINK linkage file with previously extracted SNPs on Chr.6. The PLINK file included thinned set of LD-pruned SNPs (9155 SNPs in total). I also limited the dataset to include the projectäs participants only. I streamlined the phasing of this dataset by using Gusev's phasing_pipeline (which is indispensable for conversion between PLINK and BEAGLE format, and phasing genotypes with BEAGLE).

After that i deployed, a conversion script for going from PLINK linkage-style PED and MAP files to ChromoPainters PHASE and MAP files. The converted MAP file was merged with HapMap's pre-compiled recombination file for Chr.6 (in order to accomplish this task i used a simple buggy but efficient AWK script). This trick allowed me to avoid/skip the painstaking exercise in UNLINKED model of Chromopainter .

As a next step, i changed default -k parameter to 200 (to give approximately 200 random samples to estimate the local variance). It is important to mention, that Chromopainter could be run in two modes: "donor mode" (which assumes the prior existence of "donor" and "recipient" haplotypic populations) and "all against all" mode. I we used the latter approach in which a single haplotype within an individual is reconstructed using the haplotypes from all other individuals in the sample as potential donors. This process is repeated for every haplotype in turn, so every individual is ultimately reconstructed in terms of all the other individuals, i.e every individual in my sample is conditioned on all the other individual.

It took me approximately 1,45 hours to finish Chromopainter's job. After that in combined Chromopainter output files, creating the file "MDLP.Chr6.chunkcounts.out", which is the coancestry matrix that fineSTRUCTURE requires as input.

After that i fed fineSTRUCTURE with required input (the coancestry matrix) and let fineSTRUCTURE to find a correct number of sample's "splits" (clusters). Surprisingly, even in such a limited (LD-prunned and tiny) dataset of SNPs, fineSTRUCTURE was able to detect four population-like (2 East-European, 1 West-European and outliers) clusters. In comparision, ADMIXTURE had some troubles with fleshing out sample's K-clusters, when i ran it on both phased and unphased version of the same Chr6. dataset (you can see them in this Excel spreadsheet).
From what i gather, author's claim holds water: fineSTRUCTURE indeed outperform ADMIXTURE/STRUCTURE approaches: "We next turn to our fineSTRUCTURE model-based analysis, again considering the unlinked coancestry matrix even though strong and variable LD exists in the dataset. We first compared performance of our unlinked model to the popular ADMIXTURE [15] software (Figure 3B and D, details in Section S8). Encouragingly, as the number of 5Mb regions increased from 5 to 200 we saw a monotonic performance increase for the no-linkage model, separating all groups with 200 markers. Further, our approach outperformed ADMIXTURE, with the ADMIXTURE performance levelling at around 60% correlation with the truth. In practice, we observed ADMIXTURE successfully splitting groups A, B and C and mostly splitting C1 and C2, but not B1 and B2, as detailed in Figures S6-11. ADMIXTURE performs inference under a model where markers are treated as unlinked, and where individuals may have genomes made up of mixtures of inferred source populations, while our simulation incorporated drift between populations, but not admixture. To examine whether violations of both these modelling assumptions explain the different results, we simulated a new dataset with the same underlying population structure of 5 populations as before, but no linkage (i.e. independence) between markers within each population. We analysed these data with STRUCTURE, which uses a similar underlying model to that of ADMIXTURE, but includes a no-admixture model (Section S7). For small datasets, STRUCTURE slightly improved performance relative to our unlinked fineSTRUCTURE model, but for larger SNP numbers, fineSTRUCTURE was able to identify all population splits (K = 5) while again, STRUCTURE was able to split only populations A, B and C (K = 3). Thus, even when LD information is not used (or even present), fineSTRUCTURE can offer advantages in some settings over these existing approaches."

With this gist of fineSTRUCTURE methodology, is is now safe to conclude that the combined set of all 22 chromosome, the linkage model can do even better. 

The painting process gives three square matrices of size NxN (N=number of individuals): 1) the frequency individuals copy chunks from each other, 2) the average length of those chunks, and 3) the mutation rate. These contain different aspects of the ancestry history, and almost completely summarize population ancestry.

The origin of some individual haplotypes could be ascertained using more comprehensive GERMLINe matching algorithm. Indeed, after applying GERMLINE algorithm to BEAGLE-phased chromosomes (1..22), i can see that the pairwise matches in haplotypes located far away from the chromosomal centromere and telomeres, tend to cluster into "ethnic" groups. However, haplotypes which are observed near centromere, are rather omnipresent - as would be expected if these regions had low rates of recombination because of proximity to centromeres.  The IBD segments detected by GERMLINE are listed (per chromosome) below:

With the scope on xMHC (see our previous posting),Of special interest here are, perhaps, haplotype matches on Chromosome 6, which match the pattern of massive extensive sharing of linked SNPs in the region of xMHC  - Chrom6-cM

Tuesday, December 27, 2011

Experimental test: inferring HLA (haplo)types from DNA nucleotide sequences using HLA*IMP framework

Molecular differences between HLA alleles vary up to 57 nucleotides within the peptide binding coding region of human Major Histocompatibility Complex (MHC) genes, but it is still unclear whether this variation results from a stochastic process or from selective constraints related to functional differences among HLA molecules. Although HLA alleles are generally treated as equidistant molecular units in population genetic studies, DNA sequence diversity among populations is also crucial to interpret the observed HLA polymorphism (Buhler S, Sanchez-Mazas A, 2011 HLA DNA Sequence Variation among Human Populations: Molecular Signatures of Demographic and Selective Events. PLoS ONE 6(2): e14643. doi:10.1371/journal.pone.0014643).

Traditionally, serology has been used for HLA typing for many decades; however, serological typing of histocompatibility class II molecules depends on the adequate expression of these molecules on the surface of B lymphocytes, the availability of viable cells and a complete set of antisera. However, the application of molecular methods (RFLP, PCR, SSO etc.) to HLA typing has led to the situation whereby nearly every laboratory performs some DNA typing for the detection of HLA alleles.

Why HLA loci are so important?

 HLA loci present the highest degree of polymorphism of all human genetic systems. Prior knowledge of the extent of diversity is essential in the development and selection of molecular typing methods. Reliable allele frequencies are also important in allogeneic unrelated hematopoietic stem cell transplantation to determine the likelihood of finding closely HLA matched donors for each patient.The genetic diversity of the HLA loci is responsible for the efficiency of the immune system in eliminating cells carrying foreign antigens. There is a need to develop a measure of this genetic diversity in order to assess how different populations are equipped to respond to foreign-antigen exposure, and to evaluate the contribution of each HLA locus.

The HLA system has been extensively studied from an evolutionary perspective. The region contains a number of closely linked genes whose products control a variety of functions concerned with the regulation of immune responses. In addition, the genetic predisposition to over 40 diseases maps to this region. A number of observations indicate that strong selection is acting on the HLA region, including its extensive polymorphism with very even allele frequencies, the preferential occurrence of high levels of variability at positions critical to antigen recognition, the great age of alleles and the patterns of linkage disequilibrium among loci. The form of the selection is unknown. Although balancing selection is a strong candidate, it seems unlikely that only one selective mechanism is operating in this complex multigene family region. Mutation, recombination and gene conversion all contribute to the generation of HLA variability. The apparent great age of many HLA alleles revealed by phylogenetic analysis suggests that the absolute rate of production of new variants is not high. Detailed studies of population and evolutionary features of the HLA region are necessary for an informed discussion of the evolution of disease predisposing genes and epitopes, and of complex multigene families (Thomson G.HLA population genetics.1991 Jun;5(2):247-60.)
Nomenclature of HLA Alleles

Each HLA allele name has a unique number corresponding to up to four sets of digits separated by colons. The length of the allele designation is dependant on the sequence of the allele and that of its nearest relative. All alleles receive at least a four digit name, which corresponds to the first two sets of digits, longer names are only assigned when necessary.The digits before the first colon describe the type, which often corresponds to the serological antigen carried by an allotype. The next set of digits are used to list the subtypes, numbers being assigned in the order in which DNA sequences have been determined. Alleles whose numbers differ in the two sets of digits must differ in one or more nucleotide substitutions that change the amino acid sequence of the encoded protein. Alleles that differ only by synonymous nucleotide substitutions (also called silent or non-coding substitutions) within the coding sequence are distinguished by the use of the third set of digits. Alleles that only differ by sequence polymorphisms in the introns or in the 5' or 3' untranslated regions that flank the exons and introns are distinguished by the use of the fourth set of digits (further information:


HLA-A    identifies HLA A locus
HLA-A1 serologically defined antigen
HLA-A* asterisk denotes HLA alleles defined by molecular methods
HLA-A*01 2 digit resolution denotes a group of alleles corresponds usually to serological group – low resolution
HLA-A*0101 4 digit resolution – sequence variation between alleles results in amino acid substitutions
HLA-A010101 6 digit resolution – non coding variation: sequence changes synonymous no amino acid substitution

HLA types and linked SNPs
 Still, classical HLA alleles (e.g. HLA-, HLA-B, etc.) are tricky to analyze with chip technology employed in popular commercial personal genomic services (23andme, FTDNA's Family Finder and deCODEme) , because analysis is a complex process, requiring a large number of multiplex PCR reactions to obtain a patient's full genotype. That is why classical HLA typing methods are often imparctical for large-scale studies.

Fortunately for us , Wellcome Trust Centre for Human Genetics described a method for imputing classical alleles from linked SNP genotype data, implemented in the imputation framework (HLA*IMP) (Alexander Dilthey, Loukas Moutsianas, Stephen Leslie, and Gil McVean. HLA*IMP – an integrated framework for imputing classical HLA alleles from SNP genotypes Bioinformatics btr061 first published online February 7, 2011 doi:10.1093/bioinformatics/btr061).

HLA*IMP  imputes HLA type information based on SNP genotype, using methods of iterative model building to select a set of informative SNPs  for particular supported genotyping chips (AffyMetrix 500K, AffyMetrix 900K, Illumina 300K, Illumina 550K,Illumina 650K, Illumina 1M). Thus, HLA*IMP enables researchers to perform classical HLA allele imputation from genotype data collected from several available genome-wide SNP sets through reference to a reference data set of over 2,500 samples of European ancestry with dense SNP data and classical HLA allele types. This reference set includes:
  • The 1958 Birth Cohort  typed both on the Illumina 1.2M and Affymetrix Genome-Wide Human SNP Array 6.0 chips (TheWellcome Trust Case Control Consortium, 2007) - 2420 genotype samples x 7733 SNP  in the extended HLA region.
  • The HapMap CEU samples (The International HapMap Consortium, 2007) and CEPH CEU+ additional samples (de Bakker et al., 2006)    - 92 samples x 7733 BC58-overlapping SNPs)

SNP haplotypes for the 1958BC and CEU+ samples were phased using IMPUTE v2  using the trio-phased HapMap samples as a reference dataset. Classically typed HLA genotypes were then phased into SNP haplotypes by using PHASE (Stephens and Scheet, 2005) applying standard settings for multiallelic loci. The combined reference dataset consists of 5024 haplotypes with data on 7733 SNPs in the HLA region. This splits up into 2474 (HLA-A), 3090 (HLA-B), 2022 (HLA-C), 175 (HLA-DQA1), 2629 (HLA-DQB1), 2665 (HLA-DRB1) locus-specific haplotypes which are used for inference.

The experiment with MDLP dataset


As i previously mentioned in my blog (re: Chromosome 6), 23andme's RelativeFinder and AncestryFinder revealed (among my results) a cluster of a prominent number of HIR-matches (315), positioned in the exactly the same subregion of HLA-MHC domain on Chr.6 (21Mb-38Mb). This remarkable group of matches constitutes almost a half of my total AF/RF matches (315/720 or 43,75%).

Earlier i suggested that such a preponderance of HIR-matches to HLA region is indicative of case of sharing the same extended HLA haplotype. Until recently, my suggestion rested solely on my intiution.  Then, i found solution to HLA*IMP challenges and so far, i've managed to make HLA*IMP work with 23andme genotyper (Illumina Omnio Express) call raw data.

For my test, i needed to make sure that my data meets the following requirements:

  • SNPs must be from the xMHC region (chromosome 6) 
  • European ancestry of the typed individuals 
  • sufficient quality and typed SNP density in the HLA region to enable reliable phasing
  • since HLA*IMP doesn't provide a direct support for 23andme's chips and i was using the combined set of genotypes from 23andme's chipset v2 and chipset v3, i've choosed "a downgraded" version of Illumina genotyping platform (Illumina 300K)
In order to my initial assumption of HLA haplotype sharing i selected 7 participants from my project (including 4 individuals without  HIR-sharing on Chromosome 6 (not detected by 23andme's AF)  me, my mother and an individual, who has a HIR-match with me and my mother in xMHC region). I  converted 23andme's raw data of the project's participants into Plink format,  merged the files into one dataset and extracted SNPs on Chromosome 6 using Plink command --chr 6. After that, i converted the genotype data from Plink format to HLA*IMP input data format. As a next step, i carried out quality control  on the genotype data, removing SNPs and individuals with too much missing data and aligning complementary SNPs to the HapMap reference strand. Finally, i phased  included genotypes to obtain haplotype data. Note: each individual was anonymized by substituting project's ID with prefix N.

The haplotype data was uploaded to HLA*IMP back-end web service    for imputing HLA types.

The imputed HLA types look as follows (each individual is represented by 2 HLA haplotypes in HLA-A:HLA-B:HLA-C:HLA-DQA:HLA-DQB:HLA-DRB format)
N1 1 101 801 701 501 201 301
N1 2 2601 2705 102 101 501 101
N6 1 3101 801 701 501 201 301
N6 2 201 1501 304 501 201 301
N3 1 6801 1501 102 101 501 101
N3 2 2301 5201 501 101 501 101
N2 1 101 801 701 501 201 301
N2 2 2601 3801 1203 102 602 1501
N5 1 301 1501 304 501 302 401
N5 2 205 5001 602 501 202 701
N7 1 101 801 701 501 301 1101
N7 2 101 1501 303 103 604 1301
N4 1 301 702 702 401 402 801
N4 2 2402 4002 202 501 301 1101

The haplotypes should be read as  follows (for example, in case of N1): HLA A*0101 : Cw*0701 : B*0801 : DRB1*0301 : DQA1*0501 : DQB1*0201. 

From the table posted above, one can easily mention a matching pattern between  N1, N2 and N7, as they share the similiar haplotype.  As a matter of fact N1 (ny mother), N2 (me) and N7 share HIR-segment (matching DNA segment equal to or greater than seven Centimorgans (abbreviated as cMs) on one segment with 700 or more SNPs), found by RelativeFinder service at 23andme.

Thus, it is safe to conclude that the imputation was accurate and  my initial suggestion seems to be backed up by HLA imputation

The practical implications of test

Aside from the medical usefulness, there are also some benefits of knowing HLA type for genetic genealogists:

1) First of all,  it is possible to determine the nature of the sharing the segments in xMHC region on Chromosome 6. Consider the aforementioned extended haplotype HLA A*0101 : Cw*0701 : B*0801 : DRB1*0301 : DQA1*0501 : DQB1*0201 (in shorthand: A1::DQ2).A1::DQ2 haplotype creates a conundrum for the evolutionary study of recombination.A1::DQ2 does not follow the expected dynamics. Other haplotypes exist in the region of Europe where this haplotype formed and expanded, some of these haplotypes also are ancestral and also are quite large. At 4.7 million nucleotides in length and ~300 genes the locus had resisted the effects of recombination, either as a consequence of recombination-obstruction within the DNA, as a consequence of repeated selection for the entire haplotype, or both. The length of the haplotype is remarkable because of the rapid rate of evolution at the HLA locus should degrade such long haplotypes. A1::DQ2 is the most frequent haplotype of its length found in US Caucasians, ~15% carry this common haplotype. The SNP analysis of the haplotype suggests a potential founding affect of 20,000 years within Europe, though conflicts in interpreting this information are now apparent. The last possible point of a constriction forcing climate was the Younger Dryas before 11,500 calendar years ago, and so the haplotype has taken on various forms of the name, Ancestral European Haplotype, lately called Ancestral Haplotype A1-B8 (AH8.1). It is one of 4 that appear common to western Europeans and other Asians. Assuming that the haplotype frequency was 50% at the Younger Dryas and declined by 50% every 500 years the haplotypes should only be present below 0.1% in any European population. Therefore it exceeds the expected frequency for a founding haplotype by almost 100 fold. For genetic genealogist, the said would mean that the hotspot of  DNA matching in xMHC region could be indicative of very distant common ancestor, probably as recent as Neolithic era.

2)It is also possible to infer the geographic ditribution of  a particular HLA haplotype. Using the same A 1::DQ2 haplotype, we could see that it is found in Iceland, Pomors of Northern Russia, the Serbians of Northern Slavic descent, Basque, and areas of Mexico where Basque settled in larger numbers. The haplotypes great abundance in the most isolated geographic region of Western Europe, Ireland, in Scandinavians and Swiss suggests that low abundance in France and Latinized Iberia are the result of displacements that took place after the Neolithic onset. This implies a founding presence in Europe that exceeds 8000 years.  

The database of allele frequencies is a very convenient tool for analyzing HLA haplotype frequency in a worldwide scale.

1 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01 Ireland South
2 A*01:01-B*08:01-C*07:01-DRB1*03:01:01-DQB1*02:01 England North West
3 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01-DPB1*04:01 Ireland South
4 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01 Poland
5 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01 USA Hispanic pop 2
6 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01-DPB1*01:01 Ireland South
7 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01 USA African American pop 4
8 A*01:01-B*08:01-C*07:01-DRB1*03:01-DQB1*02:01 USA Asian pop 2

Tuesday, December 6, 2011

Experimental test: running Admixture on phased samples and estimating ancestries at each locus in a population of admixed individuals (LAMP))

As a methodological tool, there is nothing wrong with the combined usage of Plink/Admixture/Lamp for assessing the popupaltion stratification and the levels of individual admixtures. One of  the goals of our project is to expose both the advantages and the limitation of particular methodologies in BGA analyses, and therefore this particular  experimental test seeks to expose  the grip of unified Plink/Admixture/Lamp approaches  as  a methodological contrivance.
Thus the goal of the experiment described herein  is to lay  bare some of practical and thereoretical  obstacles that have  hitherto obscured the legibility of our previous biogenographic analysis of the population substructures in Europe.

We routinely began our analysis in Plink software to filter the combined dataset to include only SNPs on the 22 autosomal chromosomes with minor allele frequency >1% and genotyping success >99%. 
Because background linkage disequilibrium (LD) can affect both principal component and structure-like analysis, wethinned the marker set by excluding SNPs in strong LD (pairwise genotypic correlation r2>0.4) in a window of 100 SNPs (sliding the window by 10 SNPs at a time). We also used the following Plink techniques for obtaining the homogeneous sample: pairwise clustering based on IBS is for detecting pairs of individuals who look more different from each other than you'd expect in a random, homogeneous sample; evaluating 

Those reference individials, who look more different (more than 3 SD) to the rest of the data set and/or have high PIHAT values ( greater than 0.05) and higher degree of inbreeding (homozygocity)* were removed from our training set.
* Note that the degree of inbreeding is defined as the probability that identical homozygocity occurs in a locus. 

The stratification of samples according to the levels of homozygocity,X axis -total ammount of homozygous segments in Kb (kilobases) units; Y-axis - average size of homozygous segments in Kb  units

The levels of individual homozygocity in data set: NSEG (number of segments) on X axis is plotted against  the total length of homozygous segments in KB (Y axis)

The pairwise clustering based on IBD. The total length of IBD segments (X axis) is plotted against PIHAT values (
P(IBD=2)+0.5*P(IBD=1) ( proportion IBD ))

After frequency and genotyping pruning, and removing the outliers, the final data set consisted of 90455 SNPs and 317  individuals (289 males, 82 females) that were used in subsequent analyses.

First of all, we used ADMIXTURE (Alexander, Novembre, Lange 2009) implementing a structure-like
(Pritchard, Stephens, Donnelly 2000) model-based maximum likelihood (ML) clustering algorithm to assess population structure in the whole data set .

In order to maintain the compatibility with MDLP calculator, we choosed K=7 as a sensible modeling choice. The Q estimates, obtained from Admixture run, were plotted  in R.

Please note that only participants of the MDLP project are plotted on this barplot. The full list of Q estimates (including "reference" populations) is available in this  spreadsheet.

The seven ancestral components (K) inferred at this level of resolution are:

Trans-Caucasian -red 
Balkanic/Mediterrean -yellow    
North-Caucasian -green
Altaic    - light blue
Balto-Slavic    - dark blue 
Balto-Finnic/North-European -purple

As usual, we labeled these components for mnemonic purposes only:  researchers should thus be cautious in interpreting  the inferred components in terms of conventional population history.

A neighbor-joining tree based on inter-population Fst distances
 As a next step we splitted whole-genome PLINK binary files (371 samples) to 22 separate chromosomal chunks and sunsequently used Admixture software to evaluate population structure on each of 22 chromosomes.  After that we used a pipeline for phasing PLINK format data with BEAGLE and converting phased datat back to PLINK format.

We assumed that the admixed population (represented by VID samples) being analyzed in our project  has arisen from the admixture of  7 separate ancestral populations, and that phased data are available from unadmixed reference populations  are closely related to the true ancestral population.  Under this assumption we again used ADMIXTURE software to infer ancestral components in the phased data set of separate chromosomes.  Please note that we intentionally left those components unlabeled.

Finally, we used LAMP (Local Ancestry in adMixed Populations) (Sankararaman et al.2008) for inferring individual admixture. This simply involves the application of either of the above procedures (locus-specific ancestries when the ancestral populations are not known or locus-specific ancestries of the ancestral populations are known) followed by averaging these locus-specific ancestries to obtain the individual admixture. We estimated allele frequencies (stratified by a categorical cluster) in Plink software (to produce summary of allele frequencies that is stratified by a categorical cluster variable, we used the plink --file data --freq --within  option).We fixed the HapMap recombination rate (estimated separately for each of 22 chromosomes) according to the units of distance specified in chromosome posfile. We also used different number of generations G (generations) 5, 10,25  , assuming that there are K=7  ancestral populations A1, …, AK that have been mixing for g=5,10,25 generations. The ancestries generated by LAMP were visualized in Generategraph package.

 We organized the output into separate  Excel spreadsheets to help facilitate analysis and interpretation of the results.  Each of provided Excel spreadsheets includes the following sections:

1) ADMIXTURE output for unphased chromosomal data (Chr*-phased)
2) ADMIXTURE output for unphased chromosomal data (Chr*-unphased)
3) LAMP results for G=5 (Chr*-lamp-gen5)
4) LAMP results for G=10 (Chr*-lamp-gen5)
5) LAMP results for G=25 (Chr*-lamp-gen5) 


Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8

The rest of chromosomal data will be added asap.



Monday, November 14, 2011

Some achievements of the MDL project

Looking back at the project's goals posited  6 months ago, we would like to recapitulate some of the most important goals in order to analyzes how they have been achieved:

1.Performing the comprehensive Plink analysis, including the estimation of homozygous ROH (shared clusters and groups of homozygosity), possible Mendelian errors, extended LD-haplotypes (based on values of R2), shared IBD segments and IBS matrix (Plink format).

Although we performed all described types of Plink analysises and eve shared some results on the project's blog, we didn't consider these results worth of extensive coverage. And likewise, there was no interest in those analysises on behalf of the project's members.

Experiments with relatedness
Graphoanalytical approach to visualizing relatedness
IBD sharing
IBS similarity matrix in R

2.Phasing the genotype files, i.e establishing the haploid phase (this is a separate analysis demanding genotypes of your parents, so it will not be performed on a regular base) (Beagle or Merlin output format). 

We performed ad-hoc phasing of the genotypes in our project (MDLP) and, in order to assess possible discrepancies between phased and unphased data, we performed ADMIXTURE analysis (with 4 assumed clusters K=4) separately for original unphased dataset and BEAGLE-phased dataset.

Analyzing admixture in phased v.unphased dataset

3. Using AISconvert (based on HIRsearch) and Germline software to detect IBD segments.

Used only occassionally in combination with other analyses

Analyzing admixture in phased v.unphased dataset 
Grapho-analytical approach to the visualisation of IBD shared segments
IBD sharing

4.Using ADMIXTURE/STRUCTURE software for detecting admixture clusters and claculating allele frequencies.

We performed a plenty of ADMIXTURE and STRUCTURE runs (using different a priori number of assumed clusters under different models of  admixture). Discussions of ADMIXTURE results contibuted  the most signficant part to the MDLP's blog.
The allele frequencies, estimated in K=7 Admixture run, were provided for creating a custom modification of DIYDodecad's calculator (MDLP).

Analyzing admixture in phased v.unphased dataset
First results: Admixture unsupervised run
Admixture analysis: sorted after Baltic-Slavic component
Admixture results: Baltic-Slavic
Admixture analysis: the rest of groupings
The output of the PLINK and ADMIXTURE algorithms
Admixture clusters, Mclust and populations concordance
DIYDodecad calculator v2.0 for my BGA project (MDLP).
Root Means Square Comparison Excel 2007 Macro Enabled XLSM spreadsheet for the Magnus Ducatus Lithuaniae Project data

and many more ..

5. Creating MDS and  PCA plots
PCA plots (Eigensoft)

PCA plots for reference populations and project participants
MDS and PCA plots: for V157-V247
A close-up on "the core" of the MDL project

6.Creating RHHmapper schemes showing the location of rare heterozygous and homozygous genotypes

RHH mapper: results for V158-V165 and V201-V202

7 months of MDLP project: alpha phase is over

We would like to announce that the project has been active for more than 6 months, i.e fairly long to accomplish some of the posited goals. The main goal of the project's preliminary (alpha) phase was to collect a statistically reliable sample for obtaining the statistically significant results. The current dataset (MDLP v2) includes 531 unrelated individuals (379 males, 152 females) with 310652 SNPs, of those 183 individuals (48 Romanians, 2 Russians,17 Chuvashes,12 Uzbeks,16 Turks,18 Armenians,15 Lezgins,20 Georgians,19 Hungarians,8 Lithuanians,8 Belorussians) from Behar et all (2010) dataset.41 individuals from HGDP (25 Russians, 16 Adygei), 175 individuals from the 1000 Genomes Project (83 British, 92 Finns) and 62 individuals from Yunusbayev et all. (2011) paper (14 Mordovians, 16 Nogays,13 Bulgarians,19 Ukrainians).

 The ethnic distribution of the whole set would look as follows (ethnic groups in red need more participants/samples )

Belarussian 18
Adygei 16
Armenians 18
Aszkenazi 2
Bulgarians 13
Chuvashs 17
Finns 92
British 83
Georgians 20
Hungarian 20
Latvian 1
Lezgins 15
Lithuanians 27
Mordovians 14
Nogays 16
Ossetians 14
Norwegians 2
East Germans 7
Others  8
Poles 18
Romanians 14
Russians 36
Swedish 2
Turks 16
Ukrainians 30
Uzbeks 12

Another interesting characteristics of sample is that one of average inbreeding coefficient in each particular population,  based on the observed versus expected number of homozygous genotypes in given population.

FID F-coefficient
Lithuanian-average 0.0158738
Finn-average 0.01375742
GBR_Orkney-average 0.013074288
Lezgin-average 0.012808472
Belorussian-average 0.011024444
GBR_Cornwall-average 0.010527961
GBR_Kent-average 0.009641047
Georgian-average 0.00949285
Turk-average 0.0093435
Hungarian-average 0.007138795
Adygei-average 0.006826329
Romanian-average 0.006763092
Russian-average 0.006179208
Uzbek-average 0.005255747
Armenian-average 0.004329326
Chuvash-average 0.004147971

The following characteristic of  MDLP - an average number of shared IBD segments per population is especially valuable for evaluating the genomic structure of population. I've limited the results to Slavic populations only.

Poles 0.878788
Belarusians 0.722008
Ukrainaians 0.676113
Russians 0.561878
Lithuanians 0.548961


Friday, November 4, 2011

The major revision of MDLP: adding Yunusbayev et all.2011 data

Earlier this month i added Bulgarians, Ukrainians and Mordovians from the reference samples in Yunusbayev et all.2011 paper and calculated PCA loadings of new obtained

As promised earlier, i have wasted a couple of hours on plotting PCA loadings (from Eigensoft analysis of my MDL BGA project) in graphical interface of indispensable R-package "BiplotGUI". I've made a lot of efforts combining diffirent types of statistical tests into one meaningful illustration.

Below are results of my experiments with Biplot.

Another cool feature of BiplotGUI is that it fully supports rgl, which is a 3D visualization system based on OpenGL. It provides a medium to high level interface for use in R, currently modelled on classic R graphics, with extensions to allow for interaction and creation of 3D animations/movies.

I've not figured yet how it works, but the next time i will be doing PCA analysis, i'll try to make 3D rotating animations for members of my project. 

(Via) DODECAD:Comparing different ADMIXTURE runs using Zombies

Dienekes Pontikos of DODECAD BGA project  was using our MDLP calculator (based on DIYDodecad methodological paradigma) as proof concept for comparing/mapping the Eurasian components inferred from the Dodecad-dv3 dataset  (West Asian, West European, East European, Mediterranean, Northeast Asian, Southeast Asian) against MDLP components (Scandinavian,Volga_Region, Altaic, Celto_Germanic, Caucassian_Anatolian_Balkanic, Balto_Slavic, North_Atlantic).

This is what he did :

".. To compare components across different projects; there has been a proliferation of different ancestry projects since the launching of Dodecad nearly a year ago, and since all of them slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how one component from one project maps onto other components in other projects. As proof of concept, I took the MDLP calculator from the Magnus Ducatus Lituaniae Project and generated 50 zombies for each of its 7 ancestral components:
  1. Scandinavian
  2. Volga_Region
  3. Altaic
  4. Celto_Germanic
  5. Caucassian_Anatolian_Balkanic
  6. Balto_Slavic
  7. North_Atlantic
 I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.

I will comment on the MDLP-to-dv3 mapping:
  1. The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian
  2. The MDLP "Volga_Region" component appears to be East European with some Northeast Asian
  3. The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it
  4. The MDLP "Celto-Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that in the context of MDLP this captures one aspect of the difference between Southwestern and Northeastern Europe -higher Mediterranean in the former-, whereas the...
  5. ... MDLP "North-Atlantic" component seems to be entirely West European, and is capturing a different aspect of east-west variation in Europe.
  6. The MDLP "Balto-Slavic" appears the reverse of the "Celto-Germanic" with lower Mediterranean and reversed East/West European
  7. Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well
A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP.Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

It appears that the "North_Atlantic" component may be centered on a small number of related individuals."

PS. Our MDLP calculator has been occasionally used by various people, and thus results arecurrently being disseminated over the large number of  the different Internet communities of DNA-genealogy/molecular anthroplogy's hobbysts. We are going to collect the results made publicly available by those hobbysts and out them into one spreadsheet for the futher analysis.