Friday, July 1, 2011

A close-up on "the core" of the MDL project

The PCA and MDS plots revealed very stable clusters that appear to form the core of the MDLP project. I would label these clusters after the major population groups that make up the core set of the project's populations.

1) "Russian" cluster: V164, V163, V207
2) The intermediate cluster of admixed eastern Slavs (Russians+Belarussians, Ukrainians+Russians, etc.): V219, V165, V186, V187, V161, V189, V185, V201, V202, V205, V215, V196
3) "Belarussian" cluster: V188, V158, V157, V174
4) The intermediate cluster of Belarussian-admixed Lithuanians and vice versa: V162, V214, V210, V220
5) "Lithuanian" cluster: V191, V192, V184, V211, V216, V183, V190, V218, V170
6) "Polish" cluster: V169, V205, V160, V208, V180, V187, V181, V176, V177

ADMIXTURE results for all participants up to V222

Dienekes was kind enough to share his C++ code for converting the 1000 Genomes dataset to PLINK format. Since many participants of our project have Central European and British ancestry, I have added the converted GBR dataset (“British from England and Scotland”), along with the Orcadian sample from the HGDP project and the CEU panel from the HapMap project. These samples proved useful in delineating NWE ancestry in the participants of our project.

I have also added Finns from the 1000 Genomes project to assess the possible North-Western Asian component in the reference set of the Russian population.

Here are the results of unsupervised and supervised runs of ADMIXTURE on the pruned MDLP datasets. I have intentionally left the inferred components unlabeled, since there is still considerable disagreement among genomic bloggers as to what these components might be.
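For readers who want to try a supervised run themselves: as far as I know, ADMIXTURE's supervised mode reads a .pop file with one line per individual, in the same order as the PLINK .fam file, holding either a reference population label or "-" for a sample whose ancestry is to be estimated. Below is a minimal Python sketch of building such a file. The function name, the sample IDs "ref1"/"ref2" and the labels are made up for illustration and are not the ones used for the runs here.

```python
def build_pop_lines(fam_lines, reference_labels):
    """Return one .pop line per individual listed in a PLINK .fam file.

    fam_lines: iterable of .fam rows (family ID, individual ID, father,
        mother, sex, phenotype), whitespace separated.
    reference_labels: dict mapping individual ID -> population label.
        These labels are hypothetical; the post does not list the real ones.
    """
    pop_lines = []
    for row in fam_lines:
        fields = row.split()
        if not fields:
            continue  # skip blank lines
        individual_id = fields[1]
        # "-" marks an individual whose ancestry is to be estimated.
        pop_lines.append(reference_labels.get(individual_id, "-"))
    return pop_lines


# Made-up example: two reference samples and one project participant.
fam = [
    "GBR ref1 0 0 1 -9",
    "FIN ref2 0 0 2 -9",
    "MDLP V164 0 0 1 -9",
]
labels = {"ref1": "GBR", "ref2": "FIN"}
print("\n".join(build_pop_lines(fam, labels)))
```

The order of the lines matters more than anything else here, which is why the sketch walks the .fam rows rather than the label dictionary.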

Unsupervised ADMIXTURE run:

Supervised ADMIXTURE run:

PS. I have also tried to run the STRUCTURE software on the whole MDLP dataset (a file of circa 145 Mb in STRUCTURE input format), but this trial run ended in disaster and in the suspension of one of my Linux accounts :/ It means that I won't be able to run a STRUCTURE analysis of the whole dataset for a couple of months. Instead, I will try to analyse each chromosome separately.
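A first step towards a per-chromosome analysis is to know which markers sit on which chromosome. Here is a minimal Python sketch, assuming the markers are listed in a PLINK .map file (four whitespace-separated columns: chromosome, SNP id, genetic distance, base-pair position). The function name and the rs ids in the example are made up, and this is only an illustration of the idea, not the procedure I will actually use.

```python
from collections import defaultdict


def group_markers_by_chromosome(map_lines):
    """Group SNP ids from PLINK .map rows by their chromosome code."""
    by_chromosome = defaultdict(list)
    for row in map_lines:
        fields = row.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chromosome, snp_id = fields[0], fields[1]
        by_chromosome[chromosome].append(snp_id)
    return dict(by_chromosome)


# Made-up marker rows for illustration only.
example_map = [
    "1 rs0000001 0 1000",
    "1 rs0000002 0 2000",
    "2 rs0000003 0 1500",
]
for chromosome, snps in group_markers_by_chromosome(example_map).items():
    print(chromosome, len(snps), "markers")
```

Each per-chromosome list of SNP ids could then be used to extract a smaller input file, which should keep the memory footprint of a single run well below that of the whole dataset.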

PS.PS. Pat Berge has asked me to publish the ADMIXTURE/PCA/MDS data as a spreadsheet:
PCA/MDS plot data

Sunday, June 26, 2011

SNP Map application: a new genomic toy

Smallaxe has created a platform for developing tools that help explore raw data files.
Basically, it is useful for identifying areas of your genome that may align with different regions and populations. The user interface at this point is fairly rudimentary, but it makes doing the SPSMart sort of thing quite a bit easier. It accepts 23andMe and FTDNA Family Finder data files.
Here are the guidelines for using the SNPMap tool.
The general sequence when using the tool is:
1. Load your genetic data using the File menu. The program can accept either 23andMe or FTDNA Family Finder unzipped data files. Note that the program is running locally on your computer, so your data stays with you. No need to send your data to me.

2. Select the populations you're interested in. There are two lists on the left. The top list shows the world regions. The bottom list shows the population data sets within the selected region. You can check/uncheck the regions and populations you want used in the analysis. If you've checked more than one region, then the analysis will compare those regions against each other. If you've checked only a single region, then the analysis will compare the individual populations within that region against each other. You may have to click twice on an item to toggle the check box.

3. Choose the chromosome. For speed, and to make the display of large amounts of info easier, the analysis works with a single chromosome at a time. There is a dropdown box for selecting the chromosome. If you are using FTDNA data, remember that you will have two files - one with just the X chromosome data, and one with chromosomes 1-22.

4. Click on the Recalculate button. This will perform the analysis of your data against the reference population data.
The symbol that looks like an equals sign with a slash through it means the person's genetic data is very unlike that region/population for that SNP, and other regions/populations are more likely. If the person is homozygous, then two of those symbols show up. One symbol means the person is at most 50% ancestry of that region/population at that location of their genome. Two symbols means the person may be 0% ancestry of that region/population at that location. The circle symbol (it's actually a happy face) means the person is very like that region/population, and the SNP is a good marker for that region/population vs. all the others being compared. One happy face means probably at least 50% that region/population, and two means probably 100% that region/population.
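To make steps 1 and 3 and the homozygous/heterozygous distinction above a bit more concrete, here is a minimal Python sketch. It is not the SNPMap code, which I have not seen. It assumes the 23andMe raw-data layout (comment lines starting with "#", then tab-separated rsid, chromosome, position, genotype), and all function names and example rows are made up.

```python
def parse_23andme(lines):
    """Yield (rsid, chromosome, position, genotype) from 23andMe raw-data lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header comments and blank lines
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype


def is_homozygous(genotype):
    """True for two identical called bases such as 'AA'.

    Simplification: no-calls ('--') and indel codes count as not homozygous.
    """
    return (
        len(genotype) == 2
        and genotype[0] == genotype[1]
        and genotype[0] in "ACGT"
    )


def snps_on_chromosome(lines, chromosome):
    """Keep only the records for one chromosome, as the tool does per run."""
    return [rec for rec in parse_23andme(lines) if rec[1] == chromosome]


# Made-up rows for illustration only.
raw = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAA",
    "rs0000002\t1\t2000\tAG",
    "rs0000003\t2\t1500\t--",
]
for rsid, _, _, genotype in snps_on_chromosome(raw, "1"):
    print(rsid, genotype, "homozygous" if is_homozygous(genotype) else "not homozygous")
```

A homozygous call is the case where the tool would show two symbols for a SNP rather than one.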

As you get more familiar with the program, you can adjust the Reliable/Noisy slider to change the threshold the program uses to distinguish SNPs of interest.

By default, only SNPs of interest are displayed. If you check the All SNPs checkbox and then Recalculate, all the SNPs in your data for which there is any population data will be displayed.

If you select a row in the Populations list and then right click in the list, a menu will pop up with some options, including opening the Yale ALFRED website with more detailed information about the selected population.

You can select one or more rows in the SNP results list and then right click in the list to see a menu of options. If you select a single row, the menu will have items for opening the Yale ALFRED website with detailed population information about the selected SNP, and an item for opening the NIH database website containing a variety of detailed SNP information. If you've selected one or more rows, the menu will let you copy the list of selected SNP ids to the clipboard. This is useful for pasting into the SNP list entry on the SPSMart website for further SNP exploration. You can also copy all the selected data to the clipboard in comma delimited format suitable for pasting into a spreadsheet program such as Excel or OpenOffice.
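If you would rather post-process the copied rows with a script than in a spreadsheet, the two clipboard formats described above are easy to reproduce. Here is a minimal Python sketch. The column names and example rows are my own assumptions, since the exact columns the tool copies are not documented here.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Return the rows as comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def snp_id_list(rows):
    """Return just the SNP ids, one per line, for pasting into a SNP list box."""
    return "\n".join(row[0] for row in rows)


# Made-up rows and column names for illustration only.
header = ["rsid", "chromosome", "position", "genotype"]
selected = [
    ("rs0000001", "1", 1000, "AA"),
    ("rs0000002", "1", 2000, "AG"),
]
print(rows_to_csv(selected, header))
print(snp_id_list(selected))
```

Using the csv module rather than joining strings by hand means that any field containing a comma gets quoted correctly.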