Monday, November 14, 2011

Some achievements of the MDL project

Looking back at the project's goals set 6 months ago, we would like to recapitulate the most important ones in order to analyze how they have been achieved:

1. Performing a comprehensive Plink analysis, including the estimation of ROH (runs of homozygosity; shared clusters and groups of homozygosity), possible Mendelian errors, extended LD-haplotypes (based on values of R2), shared IBD segments and the IBS matrix (Plink format).

Although we performed all of the described Plink analyses and even shared some results on the project's blog, we didn't consider these results worthy of extensive coverage. Likewise, the project's members showed no interest in those analyses.

Experiments with relatedness
Graphoanalytical approach to visualizing relatedness
IBD sharing
IBS similarity matrix in R

2. Phasing the genotype files, i.e. establishing the haploid phase (this is a separate analysis requiring your parents' genotypes, so it will not be performed on a regular basis) (Beagle or Merlin output format).

We performed ad-hoc phasing of the genotypes in our project (MDLP) and, in order to assess possible discrepancies between phased and unphased data, we ran the ADMIXTURE analysis (with 4 assumed clusters, K=4) separately for the original unphased dataset and the BEAGLE-phased dataset.

Analyzing admixture in phased v.unphased dataset

3. Using AISconvert (based on HIRsearch) and Germline software to detect IBD segments.

Used only occasionally, in combination with other analyses.

Analyzing admixture in phased v.unphased dataset 
Grapho-analytical approach to the visualisation of IBD shared segments
IBD sharing

4. Using ADMIXTURE/STRUCTURE software for detecting admixture clusters and calculating allele frequencies.

We performed plenty of ADMIXTURE and STRUCTURE runs (using different a priori numbers of assumed clusters under different models of admixture). Discussions of the ADMIXTURE results made up the most significant part of the MDLP's blog.
The allele frequencies estimated in the K=7 ADMIXTURE run were provided for creating a custom modification of DIYDodecad's calculator (MDLP).

Analyzing admixture in phased v.unphased dataset
First results: Admixture unsupervised run
Admixture analysis: sorted after Baltic-Slavic component
Admixture results: Baltic-Slavic
Admixture analysis: the rest of groupings
The output of the PLINK and ADMIXTURE algorithms
Admixture clusters, Mclust and populations concordance
DIYDodecad calculator v2.0 for my BGA project (MDLP).
Root Means Square Comparison Excel 2007 Macro Enabled XLSM spreadsheet for the Magnus Ducatus Lithuaniae Project data

and many more...

5. Creating MDS and PCA plots
PCA plots (Eigensoft)

PCA plots for reference populations and project participants
MDS and PCA plots: for V157-V247
A close-up on "the core" of the MDL project

6. Creating RHHmapper schemes showing the location of rare heterozygous and homozygous genotypes

RHH mapper: results for V158-V165 and V201-V202

7 months of MDLP project: alpha phase is over

We would like to announce that the project has been active for more than 6 months, i.e. long enough to accomplish some of the goals we set. The main goal of the project's preliminary (alpha) phase was to collect a sample large enough to yield statistically significant results. The current dataset (MDLP v2) includes 531 unrelated individuals (379 males, 152 females) with 310652 SNPs. Of those, 183 individuals (48 Romanians, 2 Russians, 17 Chuvashes, 12 Uzbeks, 16 Turks, 18 Armenians, 15 Lezgins, 20 Georgians, 19 Hungarians, 8 Lithuanians, 8 Belorussians) come from the Behar et al. (2010) dataset, 41 individuals from HGDP (25 Russians, 16 Adygei), 175 individuals from the 1000 Genomes Project (83 British, 92 Finns) and 62 individuals from the Yunusbayev et al. (2011) paper (14 Mordovians, 16 Nogays, 13 Bulgarians, 19 Ukrainians).
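As a quick sanity check on the figures in the paragraph above, the reference-set counts can be summed and compared with the stated total. This is only an arithmetic sketch using the numbers quoted in the text; the variable names are illustrative, and the paragraph does not say where the remainder comes from.

```python
# Counts quoted in the paragraph above; variable names are illustrative only.
total_individuals = 531
males, females = 379, 152

reference_sets = {
    "Behar et al. (2010)": 183,
    "HGDP": 41,
    "1000 Genomes Project": 175,
    "Yunusbayev et al. (2011)": 62,
}

# Sum of all individuals drawn from the named reference datasets.
reference_total = sum(reference_sets.values())

# Individuals not attributed to any of the named reference sets in the text.
remaining = total_individuals - reference_total

print("Males + females:", males + females)
print("From reference sets:", reference_total)
print("Not attributed to a reference set:", remaining)
```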

The ethnic distribution of the whole set looks as follows (ethnic groups in red need more participants/samples):

Belarussian 18
Adygei 16
Armenians 18
Ashkenazi 2
Bulgarians 13
Chuvashes 17
Finns 92
British 83
Georgians 20
Hungarian 20
Latvian 1
Lezgins 15
Lithuanians 27
Mordovians 14
Nogays 16
Ossetians 14
Norwegians 2
East Germans 7
Others  8
Poles 18
Romanians 14
Russians 36
Swedish 2
Turks 16
Ukrainians 30
Uzbeks 12

Another interesting characteristic of the sample is the average inbreeding coefficient in each population, based on the observed versus expected number of homozygous genotypes in that population.

FID F-coefficient
Lithuanian-average 0.0158738
Finn-average 0.01375742
GBR_Orkney-average 0.013074288
Lezgin-average 0.012808472
Belorussian-average 0.011024444
GBR_Cornwall-average 0.010527961
GBR_Kent-average 0.009641047
Georgian-average 0.00949285
Turk-average 0.0093435
Hungarian-average 0.007138795
Adygei-average 0.006826329
Romanian-average 0.006763092
Russian-average 0.006179208
Uzbek-average 0.005255747
Armenian-average 0.004329326
Chuvash-average 0.004147971
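For readers who want to reproduce this kind of table, here is a minimal sketch of the usual method-of-moments estimate, F = (O - E) / (N - E), where O is the observed number of homozygous genotypes, E the expected number and N the number of non-missing genotypes (the quantities that Plink's --het report lists per individual). The function names and the input numbers below are made up for illustration and are not the project's data; averaging per-individual F values within each population is an assumption about how the averages above were obtained.

```python
from collections import defaultdict


def inbreeding_f(observed_hom, expected_hom, n_genotypes):
    """Method-of-moments inbreeding coefficient F = (O - E) / (N - E)."""
    denominator = n_genotypes - expected_hom
    if denominator == 0:
        raise ValueError("N - E is zero; F is undefined")
    return (observed_hom - expected_hom) / denominator


def population_average_f(records):
    """Average per-individual F within each population.

    records: iterable of (population, O, E, N) tuples, one per individual.
    """
    per_population = defaultdict(list)
    for population, observed, expected, n_genotypes in records:
        per_population[population].append(
            inbreeding_f(observed, expected, n_genotypes)
        )
    return {
        population: sum(values) / len(values)
        for population, values in per_population.items()
    }


# Hypothetical per-individual counts, NOT taken from the MDLP dataset.
example_records = [
    ("PopA", 205000, 200000.0, 300000),
    ("PopA", 203000, 200000.0, 300000),
    ("PopB", 201000, 200000.0, 300000),
]

for population, mean_f in population_average_f(example_records).items():
    print(population, mean_f)
```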

The next characteristic of the MDLP dataset, the average number of shared IBD segments per population, is especially valuable for evaluating the genomic structure of a population. I've limited the results to the Slavic populations and Lithuanians.

Poles 0.878788
Belarusians 0.722008
Ukrainians 0.676113
Russians 0.561878
Lithuanians 0.548961
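The post does not state how these averages were normalized. One plausible reading, sketched below purely as an assumption, is the number of IBD segments shared between two different members of the same population, divided by the number of possible pairs in that population. The function name, sample IDs and segments are hypothetical.

```python
from collections import Counter


def mean_ibd_segments_per_pair(segments, population_of):
    """Average number of within-population IBD segments per pair.

    segments: iterable of (id1, id2) tuples, one per detected segment.
    population_of: dict mapping an individual ID to a population label.
    ASSUMPTION: normalization is by the number of distinct pairs, n*(n-1)/2.
    """
    segment_counts = Counter()
    for id1, id2 in segments:
        pop1, pop2 = population_of[id1], population_of[id2]
        # Count only segments shared by two different people of one population.
        if id1 != id2 and pop1 == pop2:
            segment_counts[pop1] += 1

    population_sizes = Counter(population_of.values())
    averages = {}
    for population, size in population_sizes.items():
        n_pairs = size * (size - 1) // 2
        if n_pairs > 0:
            averages[population] = segment_counts[population] / n_pairs
    return averages


# Hypothetical toy data, NOT taken from the MDLP dataset.
population_of = {
    "A1": "PopA", "A2": "PopA", "A3": "PopA",
    "B1": "PopB", "B2": "PopB",
}
segments = [("A1", "A2"), ("A2", "A3"), ("A1", "B1")]

for population, average in mean_ibd_segments_per_pair(segments, population_of).items():
    print(population, round(average, 6))
```

Under a different normalization (for example per individual rather than per pair) the numbers would differ, so the sketch should be read as one possible definition rather than the one used for the table.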