My Blog List

Wednesday, May 4, 2011

Troubles with StepPCO

Dienekes has brought my attention to StepPCO - a vanilla software used foe estimating the time of admixture events.

Genome Biology 2011, 12:R19 doi:10.1186/gb-2011-12-2-r19

Dating the age of admixture via wavelet transform analysis of genome-wide data

Irina Pugach et al.


We describe a PCA-based genome scan approach to analyze genome-wide admixture structure, and introduce wavelet transform analysis as a method for estimating the time of admixture. We test the wavelet transform method with simulations and apply it to genome-wide SNP data from eight admixed human populations. The wavelet transform method offers better resolution than existing methods for dating admixture, and can be applied to either SNP or sequence data from humans or other species.

The software is written in R code and available online. For the past few days i was struggling with software troubles, trying to make this piece of code run on my computer. The basic problem here,however, is the lack of clear software documentation (there is a "readme" file in distro version of StePPCO, but it is not very helpful):

So, technically speaking, StepPCO workflow consists of 4 stages:
1)Calcuation of PCO object
source spco-wt.r into R
load the following required input files:
- genotype file: tab delimited file with genotypes for all the individuals (columns) and markers (rows); this file should be separate for each chromosome. Genotypes should be coded as 1 and -1 for the two homozygous alleles (AA and BB), 0 for the heterozygous (AB), use NA for missing data (My note: this genotype file could be easily constructed from TPED file format with some minor PERL and sed text parsing).

see example genotype file chr22 in Chr22.Rdata in this directory

- vector of physical positions of SNPs along a chromosome
see example PhysPos22 in Chr22.Rdata in this directory

- to generate plots you will need a color file, see example
colorfile.txt in this directory

to calculate PC1 and PC2 using all individuals in our example genotype table, run


In our example, genotypes are from 3 HAPMAP populations: columns 1:20
contain genotypes for ASW (individuals of African ancestry from the
Southwestern USA, known to have European admixture), columns 21:40
contain genotypes for CEU (individuals of European ancestry), and
columns 41:60 contain genotypes for Yoruban (YRI) individuals from
sub-Saharan Africa. To use only parental populations (CEU and YRI) for finding principal axes, and
and then to project admixed inds (ASW) onto it, run

pco(chr22, axis.who=colnames(chr22[,21:60]))->chr22.pco

to plot your results, run:

if you don't want to see individual labels for each sample, use:
plot.pco(chr22.pco, names="none")

2. Calculation of SPCO
Calculation of SPCO:
source spco-wt.r into R
the following input files are required:

- genotype file
see example genotype file chr22 in Chr22.Rdata in this directory

- vector of physical positions along a chromosome of all SNPs in the genotype file
see example PhysPos22 in Chr22.Rdata in this directory

- table of genetic distances (in cM) between all SNPs in the genotype
see example GenMap22 in Chr22.Rdata in this directory

To generate such GenMap file we downloaded genome-wide recombination
rates estimated as part of the HapMap project
read the
into R, and interpolated genetic distances, using:, hapmap.table)->GenMap22

- to generate plots you will need a color file, see example
colorfile.txt in this directory

To run call:

to calculate StepPCO for our example genotype table, using YRI and CEU
as parental populations, and "3 sigma" as a window method, run

spco(chr22, pco=chr22.pco,, Nbins=1024, window.size="3 sigma",pop1=colnames(chr22[,21:40]),pop2=colnames(chr22[,41:60]),max.window.size=Inf)->spco.chr22

to plot results for a particular admixed individual
(e.g. "ASW_20126"), use:

plot.spco.stats(spco.chr22, who=colnames(chr22)[19], main=paste("SPCO: CEU, YRI and ",colnames(chr22)[19]," Chromosome 22", sep=""))
3.Wavelet transform
source spco-wt.r into R

calculated spco is required as input

to run wavelet transform analysis for our example genotype table, first calculate spco and then use:

spco.wtstats(spco.chr22, threshold=0.1, maxlevel=5)->wtstats.chr22

to filter out noise and normalize:
for each level find a threshold amplitude, which is present in every individual whether admixed or not, remove everything below it as noise and recalculate "centers":

threshold.chr22<-vector() for (i in 1:5){max(wtstats.chr22$wt.summary[c(21:60),i],na.rm=TRUE)->threshold.chr22[i]}

for (i in 1:60){cutoff(wtstats.chr22$wt.summary[i,1:5],thr=threshold.chr22[1:5])->wtstats.chr22$wt.summary[i,1:5]}

for (i in 1:60){wtstats.chr22$wt.summary[i,"center"] <- sum(wtstats.chr22$wt.summary[i,1:5]*(1:5))/sum(wtstats.chr22$wt.summary[i,1:5])} to normalize subtract from each calculated "center" log2 of the length of chromosome in Morgans: wtstats.chr22$wt.summary[,"center"] <- (wtstats.chr22$wt.summary[,"center"]) - (log2(0.7366522))

4.Time of admixture estimate

Time of admixture estimates are based on comparison to the results obtained for simulated data. For instructions on how to run simulations see README.recsim file.

Despite all my efforts ito make this particular software run, i wasn't able to get past the second stage -SPCO calculation. I used a genotype file for Chr1, extracted from the transposed PLINK dataset of MDL project, a file with physical positions of snps and genetic map (with included genetic distances in cM). The first part - PCO caclulation -seemed to run smoothly, alhtough there are some problems with memory optimization, and i was able to make PCA plot from calculated PCO object (see plot attached below). In contrast to that, SPCO calculation, after running some 45-50 minutes, exited with series of marshalling errors. I have reported this bug to software's developpers and will be waiting for their instructions.

Tuesday, May 3, 2011

PED2RAW is now featured on Softpedia

PED2RAW is now featured on Softpedia. Congrats to Vasily Gavrilov.


Convert PED genome file fast and easy.

ped2raw is a small, command prompt based application specially designed to help you convert PED genome file into RAW SNP format. Creates multiple files per PED file, i.e. 1 file per person and it offers support for big files.

Because it's command prompt based, don't expexct any fancy GUI to guide you through the conversion process. So, to convert PED files, take ped2rawfor a spin and check out its capabilities.

STRUCTURE (K=10, N=159) results

STRUCTURE (K=10, N=159, 7 CPU days) results

Grapho-analytical approach to the visualisation of IBD shared segments

I just wasted 2 past weeks thinking of new method for visualising the scope of IBD segments' "sharedness" between different populations in project. I, personally, consider PLINK-based inference of  overlapping IBD segments to be very useful for the project's purposes . PLINK has functions to detect specific segments shared between distantly- related individuals, for discovering long shared IBD segments, one can use GERMLINE algorithm.

As mentioned in PLINK's manual, the --segment option in PLINK command line generates a file plink.segment  which has the fields (one row represents one segment between two compared individuals):

FID1       Family ID of first individual
     IID1       Individual ID of first individual
     FID2       Family ID of second individual
     IID2       Individual ID of second individual
     PHE        Phenotype concordance: -1,0,1
     CHR        Chromosome code
     BP1        Start physical position of segment (bp)
     BP2        End physical position of segment (bp)
     SNP1       Start SNP of segment
     SNP2       End SNP of segment
     NSNP       Number of SNPs in this segment
     KB         Physical length of segment (kb)
To get an idea of how IBD segments file may look like, you may want to take a look at an example of  Plink segment file, which includes a list of estimated IBD sharing segments between MDL project participants (download link).

The first thing what came in my mind  when  i was tackling  IBD OVERLAP and INDIV file formats, was to read IBD sharing data as graph objects. At least, i have not figured out a better solution and grapho-analytical approach is very straightforward. Suppose, you want to read shared segments file in R, assuming that graphs of sharedness is undirected. Then, R-routine for reading in the file and converting to an igraph object is very simple:

segments <- read.csv("IBD.shared.segments", header = F)
g <-, directed = F)
//returns number of vertices
//returns number of edges
Given a graph object (read in and transformed from  PLINK segment file format) and a set of graph manipulation rules, one can basically do with graphs whatever he or she intends to do.  For instance, one can calculate the largest “clique” of graph and create  a new graph of that largest clique. A clique is a maximally-connected subgraph in which every vertex connects to every other vertex:

lc <- largest.cliques(g) <- subgraph(g, lc[[1]])

One can also assign uniformly random weights to the edges and set the color edge attribute to “red” for edges with high weight, and  find the shortest path between two nodes of graph, etc.

More details about igraph object on Igraph's homepage.

Please consider the real-life example of graph, representing the amount of sharedness between project participants. What can we learn from the graphical representation of project's "communities" based on the amount of IBD sharedness? In our particular case we can easily identify the patterns of "IBD sharedness" within and between populations. For example, there are two "IBD communities" in Finnish "population sample" in our project (black circles on the graph). These two enigmatic "communities" appear to be "things in themselves" being focused on two sharing "midpoints" in Finnish sample (V151;V156), while V151 and V156 appear to be only Finns linked by means of IBD segment sharing to North_Russians, and, to lesser extent, Balto-Slavic cluster.

That looks pretty cool.

Monday, May 2, 2011

Splitting MDL datasets into subsets of informative and robust clusters

Yesterday i used Mclust R package on the entirety of MDL dataset to split MDL datasets into several partially-intersecting 5 subsets (see legend, each row represents a single participant, identified by the row number, e.g V+1=V1;...;V+164=V164). 5 clusters  was attained with 4 dimensions retained.

From Mclust's page

MCLUST is an R package for normal mixture modeling via EM, model-based clustering, discriminant analysis and density estimation.

The method as explained by  Dienekes Pontikos

In short, this method exploits the clusteredness of individuals along different dimensions of the MDS representation of dense genotypic data. It uses a powerful model-based clustering algorithm (MCLUST) that can infer the existence of clusters of different size, shape, and orientation in the MDS space, and which automatically optimizes for the Bayes Information Criterion, balancing off detail with parsimony.

The only parameter that I need to specify to MCLUST is the number of MDS dimensions to retain (for a more detailed analysis, see here), as extra dimensions may add "clusteredness" but also noise. In order to decide on how many dimensions to retain, I empirically run MCLUST with a different number of dimensions (from 2 to 50).

The output of the PLINK and ADMIXTURE algorithms we are going to present here, is supposed to reflect "genetic substructures" in both subsample of project participants and reference populations. I guess that the best practice for analyzing admixture would be to dissect populations, starting with K=3 or K=4 assumed explicit ancestral populations and ascending to K=7 or K=8 (although the latter K is a doubtless overkill, and that can lead to the formation of "spurious" clusters).

I used Razib Khan's script to visualize the output of Admixture run for MDL sample (please don't read too much into the first experimental results, because i disclaim all liability for harm done by the inaccurate misinterpretation of results :) ).

MDS plot of genetic distances between populations on the basis of ~500 000 snps
MDS GNUplot (read Davidski's post explaining how datasheets (read: Plink matrix files*) that can be turned into interactive genetic maps with just a couple of simple commands in GNUplot). In Plink, to perform multidimensional scaling analysis on the N x N matrix of genome-wide IBS pairwise distances, use the --mds-plot option in conjunction with --cluster. This command takes a single parameter, the number of dimensions to be extracted. For example, assuming we have already calculated the plink.genome file,
plink --file mydata --read-genome plink.genome --cluster --mds-plot 4
creates the file

I hadn't enough time to find informative angles, so please take a look at plot as it is. The legend of plot is the same as in previous posts.

The same MDS plot with population ID labels

R-Qplot visialusation of PLINK-generated mds coordinates

MDS-plot (Dim=2)

MDS plots of MDL-plink data matrix

Visualised in GNUplot: splot "plink.mds" using 4:5:6:2 with labels

R-hclust dendrogram of MDL project's genomes

Hierarchy of clusters in Hclust object

North_Russian V 1
North_Russian V 2
North_Russian V 3
North_Russian V 4
North_Russian V 5
North_Russian V 6
North_Russian V 7
North_Russian V 8
North_Russian V 9
North_Russian V 10
North_Russian V 11
North_Russian V 12
North_Russian V 13
North_Russian V 14
North_Russian V 15
North_Russian V 16
North_Russian V 17
North_Russian V 18
North_Russian V 19
North_Russian V 20
North_Russian V 21
North_Russian V 22
North_Russian V 23
North_Russian V 24
North_Russian V 25
BY V 26
BY V 27
BY V 28
BY V 29
BY V 30
BY V 31
BY V 32
BY V 33
BY V 34
Chuvash V 35
Chuvash V 36
Chuvash V 37
Chuvash V 38
Chuvash V 39
Chuvash V 40
Chuvash V 41
Chuvash V 42
Chuvash V 43
Chuvash V 44
Chuvash V 45
Chuvash V 46
Chuvash V 47
Chuvash V 48
Chuvash V 49
Chuvash V 50
Chuvash V 51
HU V 52
HU V 53
HU V 54
HU V 55
HU V 56
HU V 57
HU V 58
HU V 59
HU V 60
HU V 61
HU V 62
HU V 63
HU V 64
HU V 65
HU V 66
HU V 67
HU V 68
HU V 69
HU V 70
LT V 71
LT V 72
LT V 73
LT V 74
LT V 75
LT V 76
LT V 77
LT V 78
LT V 79
LT V 80
GE V 81
GE V 82
GE V 83
GE V 84
GE V 85
GE V 86
GE V 87
GE V 88
GE V 89
GE V 90
GE V 91
GE V 92
GE V 93
GE V 94
GE V 95
GE V 96
GE V 97
GE V 98
GE V 99
GE V 100
RO V 101
RO V 102
RO V 103
RO V 104
RO V 105
RO V 106
RO V 107
RO V 108
RO V 109
RO V 110
RO V 111
RO V 112
RO V 113
RO V 114
RO V 115
RO V 116
FI V 117
FI V 118
FI V 119
FI V 120
FI V 121
FI V 122
FI V 123
FI V 124
FI V 125
FI V 126
FI V 127
FI V 128
FI V 129
FI V 130
FI V 131
FI V 132
FI V 133
FI V 134
FI V 135
FI V 136
FI V 137
FI V 138
FI V 139
FI V 140
FI V 141
FI V 142
FI V 143
FI V 144
FI V 145
FI V 146
FI V 147
FI V 148
FI V 149
FI V 150
FI V 151
FI V 152
FI V 153
FI V 154
FI V 155
FI V 156
BY3 V 157
BY2 V 158
BY1 V 159
Polish V 160

Сообщить модератору

Admixture analysis barplot

The barplot is visualised here only to give a general picture and offer an overview of genetic substructure in represented samples form reference populations. Each column represents one individual (V1....V159 counting from left).

Admixture analysis: the rest of groupings


Admixture results: Baltic-Slavic

Just a quick follow-up to the previous post. Looking at admixture analysis results, we'll be able to identify another candidate for the Baltic-Slavic component. . That's a very interesting point about Belarusians being genetically  more similar to Lithuanians than to other Slavs. Thus, the results of unsupervised analysis seems to contradict Balanovsky's thesis, which holds that Belarusians are genetically closer to Poles (Balanovsky 2008).

Admixture analysis: sorted after Baltic-Slavic component

The results of unsupervised K=7 admixture analysis of the  whole reference  populations + 5 first participants (N=159). These results should be taken with caution for two obvious reasons: a) this  is an output of  simple,  unsupervised admixture analysis, b) K=7 - even in case of "pruned" SNP-dataset - appears in most cases to be redundant for such  inbred homozygous populations (in general, the high level of homozygosity is indicative of  low admixture levels, since it is obvious that heterozygotes will increase in an admixture population).

Origin of Belarusians

Preliminary discussions of Belarusian ethnic origin (to be clarified with results of BGA analysis)

We will start with a version of Belarusian ethnic origin according to Jermalovich.

". Initially, the entire Belarusan ethnic area was inhabited by Baltic tribes. In the sixth-to-seventh centuries, Slavic tribes came from the west. The Slavs fused with the Balts to form a new Balto-Slavic nation - the ancestors of today's Belarusans. The prevalent language became Slavic - Old Belarusan - which retained many Baltic elements in its pronunciation and vocabulary. Baltic elements are still recognizable in the folklore (dances, songs, costumes, folk ornaments, etc.) and toponymics (place names) of the present-day Belarus.
The Slavs belonged to tree main tribes: the Kryvicy, who settled in northern and central Belarus, the Dryhvicy(Drehavicans), who settled in the south along the river Prypiac' and Radzimicy(Radzimicans) who settled in the upper Dniapro (Dnepr) region. Two other tribes, the Severianie and the Viacicy(Viaticens), settled further east and only parts of their former territories are now considered ethnically Belarusan."

These Slavic tribes all adopted Christianity in its Eastern Orthodox from the Kievan Rus. For some time they were also politically dependent from Kiev. However, the Kryvicy soon established their own state in the north - the Duchy of Polacak. Later, the duchies of Turau-Pinsk and Novaharodak were established in the south and Southwest respectively. The territory of the duchy of Novaharodak, which had been colonized by both the Kryvicy and the Dryhvicy, extended along the Nioman river between Horadzien (currently know as Hrodna or Grodno) in the west and Novaharodak in the east.
The duchy of Novaharodak was virtually surrounded by unassimilated Baltic tribes: the Jacviahi(Yatvegians) in the west (who were later belarusianized), the Nalscany in the north, the Litva in the east and Northeast, and the area called Aukstota in the south-eastern part of modern Lithuania. Another Baltic tribe, the Samogitians, who lived between Aukstota and the Baltic Sea, did not merge with the Slavs but remained a separate ethnic group. It was the Samogitians and the inhabitants of Aukstota who became the ancestors of today's Lithuanians.
Aukstota - which means highland in Lithuanian - is nowhere mentioned as such in the medieval chronicles and did not seem to be a political entity. Many of its place names contain the word aukstas, or high, while Samogitia, further to the west, means lowland. The city of Vilna (currently know as Vilnius) the present capital of Lithuania, is in Aukstota and Lithuanian scholars consider Aukstota to be the heart of the Lithuanian ethnic area. They claim that Aukstota was the location of the original Litva and that it was from here that Duke Mindouh (Mindaugas in modern Lithuanian) left to conquer the adjoining Duchy of Novaharodak and established the Grand Duchy of Litva.
However, the separate ethnic character of Litva is supported by the non-Slavic names of their leaders - Mindouh, Vojsalk, etc. Mr. Jermalovic feels that Litva referred to an unassimilated Baltic ethnic island.