Dienekes has brought my attention to StepPCO - a vanilla software used foe estimating the time of admixture events.
The software is written in R code and available online. For the past few days i was struggling with software troubles, trying to make this piece of code run on my computer. The basic problem here,however, is the lack of clear software documentation (there is a "readme" file in distro version of StePPCO, but it is not very helpful):
So, technically speaking, StepPCO workflow consists of 4 stages:
1)Calcuation of PCO object
2. Calculation of SPCO
4.Time of admixture estimate
Despite all my efforts ito make this particular software run, i wasn't able to get past the second stage -SPCO calculation. I used a genotype file for Chr1, extracted from the transposed PLINK dataset of MDL project, a file with physical positions of snps and genetic map (with included genetic distances in cM). The first part - PCO caclulation -seemed to run smoothly, alhtough there are some problems with memory optimization, and i was able to make PCA plot from calculated PCO object (see plot attached below). In contrast to that, SPCO calculation, after running some 45-50 minutes, exited with series of marshalling errors. I have reported this bug to software's developpers and will be waiting for their instructions.
Genome Biology 2011, 12:R19 doi:10.1186/gb-2011-12-2-r19
Dating the age of admixture via wavelet transform analysis of genome-wide data
Irina Pugach et al.
Abstract
We describe a PCA-based genome scan approach to analyze genome-wide admixture structure, and introduce wavelet transform analysis as a method for estimating the time of admixture. We test the wavelet transform method with simulations and apply it to genome-wide SNP data from eight admixed human populations. The wavelet transform method offers better resolution than existing methods for dating admixture, and can be applied to either SNP or sequence data from humans or other species.
The software is written in R code and available online. For the past few days i was struggling with software troubles, trying to make this piece of code run on my computer. The basic problem here,however, is the lack of clear software documentation (there is a "readme" file in distro version of StePPCO, but it is not very helpful):
So, technically speaking, StepPCO workflow consists of 4 stages:
1)Calcuation of PCO object
source spco-wt.r into R
load the following required input files:
- genotype file: tab delimited file with genotypes for all the individuals (columns) and markers (rows); this file should be separate for each chromosome. Genotypes should be coded as 1 and -1 for the two homozygous alleles (AA and BB), 0 for the heterozygous (AB), use NA for missing data (My note: this genotype file could be easily constructed from TPED file format with some minor PERL and sed text parsing).
see example genotype file chr22 in Chr22.Rdata in this directory
- vector of physical positions of SNPs along a chromosome
see example PhysPos22 in Chr22.Rdata in this directory
- to generate plots you will need a color file, see example
colorfile.txt in this directory
EXAMPLE:
to calculate PC1 and PC2 using all individuals in our example genotype table, run
pco(chr22)->chr22.pco
In our example, genotypes are from 3 HAPMAP populations: columns 1:20
contain genotypes for ASW (individuals of African ancestry from the
Southwestern USA, known to have European admixture), columns 21:40
contain genotypes for CEU (individuals of European ancestry), and
columns 41:60 contain genotypes for Yoruban (YRI) individuals from
sub-Saharan Africa. To use only parental populations (CEU and YRI) for finding principal axes, and
and then to project admixed inds (ASW) onto it, run
pco(chr22, axis.who=colnames(chr22[,21:60]))->chr22.pco
to plot your results, run:
plot.pco(chr22.pco)
if you don't want to see individual labels for each sample, use:
plot.pco(chr22.pco, names="none")
2. Calculation of SPCO
Calculation of SPCO:3.Wavelet transform
source spco-wt.r into R
the following input files are required:
- genotype file
see example genotype file chr22 in Chr22.Rdata in this directory
- vector of physical positions along a chromosome of all SNPs in the genotype file
see example PhysPos22 in Chr22.Rdata in this directory
- table of genetic distances (in cM) between all SNPs in the genotype
file
see example GenMap22 in Chr22.Rdata in this directory
To generate such GenMap file we downloaded genome-wide recombination
rates estimated as part of the HapMap project
(http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/latest/rates/),
read the
http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/latest/rates/genetic_map_chr22_b36.txt
into R, and interpolated genetic distances, using:
interpolate.genetic.map(PhysPos22, hapmap.table)->GenMap22
- to generate plots you will need a color file, see example
colorfile.txt in this directory
To run call:
spco(table,....)
EXAMPLE:
to calculate StepPCO for our example genotype table, using YRI and CEU
as parental populations, and "3 sigma" as a window method, run
spco(chr22, pco=chr22.pco, genetic.map=GenMap22, Nbins=1024, window.size="3 sigma",pop1=colnames(chr22[,21:40]),pop2=colnames(chr22[,41:60]),max.window.size=Inf)->spco.chr22
to plot results for a particular admixed individual
(e.g. "ASW_20126"), use:
plot.spco.stats(spco.chr22, who=colnames(chr22)[19], main=paste("SPCO: CEU, YRI and ",colnames(chr22)[19]," Chromosome 22", sep=""))
source spco-wt.r into R
calculated spco is required as input
EXAMPLE:
to run wavelet transform analysis for our example genotype table, first calculate spco and then use:
spco.wtstats(spco.chr22, threshold=0.1, maxlevel=5)->wtstats.chr22
to filter out noise and normalize:
for each level find a threshold amplitude, which is present in every individual whether admixed or not, remove everything below it as noise and recalculate "centers":
threshold.chr22<-vector() for (i in 1:5){max(wtstats.chr22$wt.summary[c(21:60),i],na.rm=TRUE)->threshold.chr22[i]}
for (i in 1:60){cutoff(wtstats.chr22$wt.summary[i,1:5],thr=threshold.chr22[1:5])->wtstats.chr22$wt.summary[i,1:5]}
for (i in 1:60){wtstats.chr22$wt.summary[i,"center"] <- sum(wtstats.chr22$wt.summary[i,1:5]*(1:5))/sum(wtstats.chr22$wt.summary[i,1:5])} to normalize subtract from each calculated "center" log2 of the length of chromosome in Morgans: wtstats.chr22$wt.summary[,"center"] <- (wtstats.chr22$wt.summary[,"center"]) - (log2(0.7366522))
4.Time of admixture estimate
Time of admixture estimates are based on comparison to the results obtained for simulated data. For instructions on how to run simulations see README.recsim file.
Despite all my efforts ito make this particular software run, i wasn't able to get past the second stage -SPCO calculation. I used a genotype file for Chr1, extracted from the transposed PLINK dataset of MDL project, a file with physical positions of snps and genetic map (with included genetic distances in cM). The first part - PCO caclulation -seemed to run smoothly, alhtough there are some problems with memory optimization, and i was able to make PCA plot from calculated PCO object (see plot attached below). In contrast to that, SPCO calculation, after running some 45-50 minutes, exited with series of marshalling errors. I have reported this bug to software's developpers and will be waiting for their instructions.
I have reached stage #3 myself. In the code of stage #3 make sure to:
ReplyDelete1. The number of levels depends on the length of the chromosome (5 for small Chr 22 and 7 for long Chr 1). Consult the paper
The numbers in the readme code need to be replaced depending on how many individuals you are dealing with. I have the 'admixed' individuals first in my file, followed by group A and group B. In the following example 42 from A, 102 from B, and 471 admixed ones.
A <- 42
B <- 102
ADM <- 471
threshold.chr1<-vector()
for (i in 1:7){max(wtstats.chr1$wt.summary[c((ADM+1):(ADM+A+B)),i],na.rm=TRUE)->threshold.chr1[i]}
for (i in 1:(ADM+A+B)){cutoff(wtstats.chr1$wt.summary[i,1:7],thr=threshold.chr1[1:7])->wtstats.chr1$wt.summary[i,1:7]}
for (i in 1:(ADM+A+B)){wtstats.chr1$wt.summary[i,"center"] <- sum(wtstats.chr1$wt.summary[i,1:7]*(1:7))/sum(wtstats.chr1$wt.summary[i,1:7])}
wtstats.chr1$wt.summary[,"center"] <- (wtstats.chr1$wt.summary[,"center"]) - (log2(max(GenMap1[,2])-min(GenMap1[,2])))/100
The last item calculates the length of the chromosome in centimorgans (0.7366522cM is the length of chr22).
Hope this helps, but do e-mail me if you figure out step #4 or have any other questions.
Replace cM with M in the above.
ReplyDeleteI'm thinking this might be your problem, how to do the GenMap correctly
ReplyDeletehapmap.table<-read.table("genetic_map_chr1_b36.txt" ,header=T)[,c(1,3)]
interpolate.genetic.map(PhysPos1, hapmap.table)->GenMap1
You should run this before the code I posted above.
Thank you very much for your corrections and suggestions.
ReplyDeleteI'll have a look at StepPCO one more time tomorrow.
Dienekes said...
ReplyDelete" I'm thinking this might be your problem, how to do the GenMap correctly
hapmap.table<-read.table("genetic_map_chr1_b36.txt" ,header=T)[,c(1,3)]
interpolate.genetic.map(PhysPos1, hapmap.table)->GenMap1
You should run this before the code I posted above."
Still doesn't work for Chr1 HapMap's genetic map and PhysPos1 (your trick works only for included Chr22 example).
>interpolate.genetic.map(PhysPos1, hapmap.table)
Error in new.gen.map[definedpp, 2] <- gen.map[definedgm, 2] :incorrect number of subscripts on matrix
You get the PhysPos1 as follows:
ReplyDeletePhysPos1<-read.table("plink.map")[,4]
Or:
PhysPos1<-read.table("plink.bim")[,4]
if you keep your data in binary ped format
I hope this works.