While I keep receiving 23andMe and FF raw data files, I decided to run a quick supervised ADMIXTURE analysis and compare the results to the LAMP/STRUCTURE output.
To keep the presentation clear and easy to follow, I will start with the simplest part of my analysis, the supervised ADMIXTURE (K=5) run. Supervised analysis allows more accurate estimation of individuals' ancestries by specifying the ancestries of the reference individuals.
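As a rough illustration of what "specifying the ancestries of the reference individuals" means in practice: ADMIXTURE's supervised mode reads a `.pop` file that sits next to the genotype files, with one line per individual in the same order as the `.fam` file. Reference individuals get a population label and individuals to be estimated get `-`. The sketch below builds such a file; the sample IDs, family IDs and the file prefix `mdl` are hypothetical, not the ones used in this run.

```python
from typing import Dict, List


def read_fam_ids(fam_text: str) -> List[str]:
    """Return the individual IDs (second column) from the text of a PLINK .fam file."""
    ids = []
    for line in fam_text.splitlines():
        fields = line.split()
        if fields:
            ids.append(fields[1])
    return ids


def build_pop_lines(ids: List[str], labels: Dict[str, str]) -> List[str]:
    """One line per individual: the population label for references, '-' otherwise."""
    return [labels.get(ind, "-") for ind in ids]


# Hypothetical three-sample .fam content: two references and one project participant.
fam_text = """\
REF Orc01 0 0 1 -9
REF Hun01 0 0 2 -9
MDL Sample01 0 0 1 -9
"""

# Hypothetical reference labels; anyone missing from this map is treated as unknown.
labels = {"Orc01": "Orcadian", "Hun01": "Hungarian"}

pop_lines = build_pop_lines(read_fam_ids(fam_text), labels)
print("\n".join(pop_lines))  # these lines would be written to mdl.pop
```

With `mdl.bed`, `mdl.bim`, `mdl.fam` and `mdl.pop` in the same directory, the run itself would then be started with something like `admixture --supervised mdl.bed 5`, where K has to match the number of distinct labels in the `.pop` file.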
I ran the supervised ADMIXTURE analysis with 6 reference populations: Orcadians and Russians (Vologda) from the HGDP; and Romanians, Hungarians, Russians (Tver), Lithuanians and Belorussians from a public dataset (Behar et al., The genome-wide structure of the Jewish people. Nature 2010 Jul 8;466(7303):238-42). In this setup, Orcadians serve as an abstract proxy for the whole NW European component, and Romanians and Hungarians as proxies for the Central European component (Hungarians represent the more specific Subcarpathian component, while we consider Romanians to have more genetic affinity to the SE (Balkan) component). Russians from Vologda define the North European component, and finally Lithuanians, Belorussians and Russians from Tver are included to represent the main genetic component of North-Eastern Europe.
Before processing the reference data in PLINK, I removed two pairs of close relatives (2 Orcadians and 2 Hungarians) and 2 Romanians with Roma admixture. Then I excluded SNPs with missing rates greater than 1% and pruned SNPs based on the variance inflation factor and pairwise genotypic correlation. After the pruning, I kept SNPs with MAF >= 0.05 and with at most 1 missing allele per person. Then I performed LD-based pruning using a window size of 50, a step of 5 and an r^2 threshold of 0.3. This left a dataset of 121 individuals and approximately 140K SNPs. 27 participants of the MDL project were included in the ADMIXTURE run:
V158
V157
V160
V161
V162
V163
V164
V165
V201
V202
V166
V167
V168
V169
V170
V171
V172
V173
V174
V175
V176
V177
V178
V179
V180
V181
V182
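The filtering and pruning steps described before the participant list can be sketched as PLINK 1.x command lines. This is only an approximation of the workflow, not the exact commands used: the file names (`ref`, `mdl`, `exclude_samples.txt`) are hypothetical, the missingness and MAF filters are combined into one call, and the variance-inflation-factor pruning and the per-person missing-allele filter are left out because their exact settings are not given in the post.

```python
from typing import List


def plink_qc_commands(prefix: str, out: str, remove_file: str) -> List[List[str]]:
    """Build PLINK 1.x argument lists for the SNP filtering and LD pruning steps."""
    filtered = f"{out}_filtered"
    pruned = f"{out}_pruned"
    return [
        # 1. Drop the excluded samples (relatives, admixed individuals), SNPs with
        #    more than 1% missing calls, and SNPs with MAF below 0.05.
        ["plink", "--bfile", prefix, "--remove", remove_file,
         "--geno", "0.01", "--maf", "0.05", "--make-bed", "--out", filtered],
        # 2. LD-based pruning: window of 50 SNPs, step of 5, r^2 threshold of 0.3.
        #    PLINK writes the retained SNP IDs to <out>.prune.in.
        ["plink", "--bfile", filtered, "--indep-pairwise", "50", "5", "0.3",
         "--out", pruned],
        # 3. Keep only the SNPs that survived the pruning.
        ["plink", "--bfile", filtered, "--extract", f"{pruned}.prune.in",
         "--make-bed", "--out", pruned],
    ]


commands = plink_qc_commands("ref", "mdl", "exclude_samples.txt")
for cmd in commands:
    print(" ".join(cmd))
```

Building the commands as lists keeps them easy to inspect before running; each one could be passed to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.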
The results are in a Google spreadsheet.
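For readers who want to build a similar table themselves: ADMIXTURE writes the ancestry fractions to a `.Q` file with one row per individual (in `.fam` order) and one column per component. A minimal sketch of pairing those rows with sample IDs and converting them to percentages follows. The sample IDs, the component names `C1` to `C5` and all numbers are made up for illustration, and the column order is an assumption that should be checked against the reference individuals, who should be close to 100% in their own column.

```python
from typing import Dict, List


def label_q_rows(fam_text: str, q_text: str,
                 components: List[str]) -> Dict[str, Dict[str, float]]:
    """Map each individual ID to its ancestry percentages per component."""
    ids = [line.split()[1] for line in fam_text.splitlines() if line.strip()]
    rows = [[float(x) for x in line.split()]
            for line in q_text.splitlines() if line.strip()]
    if len(ids) != len(rows):
        raise ValueError("the .fam and .Q files must have the same number of rows")
    table = {}
    for ind, row in zip(ids, rows):
        if len(row) != len(components):
            raise ValueError("each .Q row must have one value per component")
        table[ind] = {name: round(100 * value, 1)
                      for name, value in zip(components, row)}
    return table


# Made-up example: one reference individual and one project participant.
fam_text = "REF S1 0 0 1 -9\nMDL S2 0 0 2 -9\n"
q_text = "1.0 0.0 0.0 0.0 0.0\n0.10 0.20 0.30 0.15 0.25\n"
components = ["C1", "C2", "C3", "C4", "C5"]  # placeholder names, order assumed

table = label_q_rows(fam_text, q_text, components)
for ind, shares in table.items():
    print(ind, shares)
```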
Thanks Vadim, very interesting....appreciate all your work.
If there is a list of IDs pending, pls add myself as #V179:
Russian Father from St. Petersburg (some Finn influence there) and russian mother from the urals but since she is H5b, not sure where before then....I am surprised at the % hungarian showing here, thought it would be more belarus and grateful for the 48% russian!
Great first ADMIXTURE. As V173, my MtDNA line is H11a2 with GGgrandmother born in Poland.
Here is a zipped spreadsheet that can be sorted, matching an ID with their overall costed percentages. MS Excel 2007 format. Download and extract.
http://tinyurl.com/MDLPart-ISupervAdmixK5