Determining haplotype phase is becoming increasingly important as
we enter the era of large-scale sequencing, because many applications
depend on it, such as imputing low-frequency variants and characterizing
the relationship between genetic variation in different populations. Haplotype phase can be
generated through laboratory-based experimental methods, or it can be
estimated using computational approaches. We assess the haplotype
phasing method available in the BEAGLE software, focusing in particular on using its output in ADMIXTURE analysis.
For simplicity's sake we selected individuals from the "Balto-Slavic" cluster (cluster membership was inferred with Dienekes's Mclust using 11 MDS dimensions), which is the major cluster of our project. Here is the complete list of IDs of the selected project participants:
V158
V157
V160
V202
V169
V170
V171
V174
V176
V177
V180
V181
V188
V189
V196
V205
V208
V211
V215
V218
V220
V221
V222
V228
V225
V232
V236
V237
V235
V231
V244
V246
V238
We thinned the genotype data of the selected individuals to c. 100,000 SNPs, removing SNPs in strong LD and low-quality SNPs. We then used the GERMLINE pipeline, which phases PLINK-format data with BEAGLE and processes the result in GERMLINE (phasing was performed within a homogeneous population). Finally, to assess possible discrepancies between phased and unphased data, we ran ADMIXTURE (with K=4 assumed clusters) separately on the original unphased dataset and on the BEAGLE-phased dataset.
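The thinning and ADMIXTURE steps described above can be sketched as command lines. This is only an illustration, not the exact commands we ran: the file names, the helper names `thinning_commands` and `admixture_command`, and the thresholds (`--geno 0.01`, `--indep-pairwise 50 5 0.3`) are assumptions chosen for the example. The sketch only builds the PLINK 1.9 and ADMIXTURE invocations and does not execute them; the BEAGLE/GERMLINE phasing step is not covered.

```python
from typing import List


def thinning_commands(bfile: str, out: str, keep_file: str) -> List[List[str]]:
    """Build two PLINK 1.9 command lines for QC and LD pruning.

    bfile     -- prefix of the input .bed/.bim/.fam fileset
    out       -- prefix of the thinned output fileset
    keep_file -- text file with one "FID IID" pair per selected individual

    All thresholds below are illustrative assumptions, not the values
    used in the original analysis.
    """
    prune_prefix = f"{out}_prune"
    return [
        # Step 1: restrict to the selected individuals, drop SNPs with a
        # high missing rate, and write the list of SNPs that survive LD
        # pruning to <prune_prefix>.prune.in
        ["plink", "--bfile", bfile, "--keep", keep_file,
         "--geno", "0.01",
         "--indep-pairwise", "50", "5", "0.3",
         "--out", prune_prefix],
        # Step 2: write a new binary fileset containing only those SNPs
        ["plink", "--bfile", bfile, "--keep", keep_file,
         "--extract", f"{prune_prefix}.prune.in",
         "--make-bed", "--out", out],
    ]


def admixture_command(bed_path: str, k: int = 4) -> List[str]:
    """Build the ADMIXTURE call for K assumed clusters (K=4 in this post)."""
    return ["admixture", bed_path, str(k)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once PLINK and ADMIXTURE are installed. Whether the pruning thresholds leave roughly 100,000 SNPs depends on the input data, so they would need tuning.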
To our surprise, we were not able to find the expected significant differences between the phased and unphased multi-SNP marker genotypes (the differences range from c. 1% to 5%).
Unphased data:
Phased data:
Spreadsheet with ADMIXTURE results can be found here.
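A comparison between the unphased and phased ADMIXTURE runs can be sketched as follows. This is a minimal illustration and not the procedure used to produce the spreadsheet. It assumes two `.Q` files (one row per individual, K whitespace-separated ancestry proportions) with individuals in the same order. Because ADMIXTURE may number the clusters differently in independent runs, the sketch first tries every column permutation (only 24 for K=4) and keeps the one that minimizes the total absolute difference. The function names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Matrix = List[List[float]]


def read_q(path: str) -> Matrix:
    """Read an ADMIXTURE .Q file into a list of per-individual rows."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def align_and_compare(q_a: Matrix, q_b: Matrix) -> Tuple[Tuple[int, ...], List[float]]:
    """Match the clusters of two runs and report per-individual differences.

    Returns (perm, diffs): perm[i] is the column of q_b matched to column i
    of q_a, and diffs[n] is the largest absolute difference in any cluster
    proportion for individual n after matching.
    """
    if not q_a or len(q_a) != len(q_b):
        raise ValueError("both runs must cover the same, non-empty set of individuals")
    k = len(q_a[0])
    if any(len(row) != k for row in q_a + q_b):
        raise ValueError("all rows must have the same number of clusters")

    best_perm: Tuple[int, ...] = tuple(range(k))
    best_cost = float("inf")
    # Brute force over column orders; feasible because K is small.
    for perm in permutations(range(k)):
        cost = sum(
            abs(row_a[i] - row_b[perm[i]])
            for row_a, row_b in zip(q_a, q_b)
            for i in range(k)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost

    diffs = [
        max(abs(row_a[i] - row_b[best_perm[i]]) for i in range(k))
        for row_a, row_b in zip(q_a, q_b)
    ]
    return best_perm, diffs
```

With two matrices loaded via `read_q`, `max(diffs)` gives the largest shift in any individual's ancestry proportion between the two runs, which is the kind of figure summarized above as a difference of c. 1-5%.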