Tuesday, October 31, 2017
It's an unfortunate reality that most commercial genetic ancestry tests out there are rather lame. They're not wrong per se, but that's probably the best that can be said about them. And let's be honest, that's no longer enough considering how far this area of science has come in recent years.
To try and remedy this problem, I'll be offering a wide range of highly accurate and unique, but low cost, ancestry tests here, in my makeshift online store, based on analyses from my other blog (see here). These tests will focus on either recent or ancient ancestry, or both, using the latest reference samples from scientific literature whenever possible.
To make a purchase, send your request, autosomal genotype data (from AncestryDNA, FTDNA or 23andMe) and money (via PayPal) to eurogenesblog at gmail dot com.
Let's get things rolling with my genetic and linguistic landscape of Europe north of the Alps, Balkans and Pyrenees (see here). For a mere $6 USD I will pinpoint your location on the plot below amongst a variety of modern-day and ancient individuals. You'll also receive the principal component coordinates, which you can use to model your ancestry proportions (for instance, like here).
Please keep in mind, however, that to ensure sensible results in this particular analysis, practically all of your ancestry has to derive from Central, Eastern and/or Northern Europe. Most of my other tests won't be so restrictive.
Sunday, September 10, 2017
This is the first of a series of guides to modeling your ancient ancestry with the Global 10/nMonte2 method. I do already have a user guide for running Global 10 and Basal-rich K7 data with nMonte and 4Mix (see here). However, in this series I’m going to recommend specific models that produce results similar to those from my experiments with other methods, such as qpAdm, as well as from scientific literature. Hopefully, this will help users achieve more sensible and accurate outcomes, and avoid problems such as overfitting.
Let’s start with models for modern-day Europeans that focus on Yamnaya-related ancestry, which very likely represents a genetic signal of early Indo-European dispersals during the Early to Middle Bronze Age from the Pontic-Caspian steppe.
It’s now clear via a wide range of methods that about half of the genomes of modern-day Eastern and Northern Europeans, and up to about a quarter of the genomes of modern-day Southern Europeans, are derived from such Yamnaya-related sources. Any tests dealing with ancient European substructures that don’t, one way or another, reflect this robust inference must be considered inadequate.
So if my models are to be useful, then this is what they must show. And indeed they do. Here are a few examples focusing on modern-day and ancient England, in chronological order:
England_Iron_Age
Yamnaya_Samara 49.75
Barcin_N 32.3
Hungary_HG 17.95
distance%=0.5318 / distance=0.005318
England_Roman
Yamnaya_Samara 45.65
Barcin_N 33.35
Hungary_HG 21
distance%=0.4668 / distance=0.004668
England_Anglo-Saxon
Yamnaya_Samara 44.95
Barcin_N 31.6
Hungary_HG 23.45
distance%=0.5409 / distance=0.005409
English_Cornwall
Yamnaya_Samara 44.55
Barcin_N 36.95
Hungary_HG 18.5
distance%=0.3699 / distance=0.003699
English_Kent
Yamnaya_Samara 45.2
Barcin_N 36.85
Hungary_HG 17.95
distance%=0.4875 / distance=0.004875
The full output is available in a zip folder HERE. I’m not claiming that these ancestry proportions are perfect, especially for Southern Europeans, who generally have very complex ancestry, but they do make a lot of sense.
One obvious problem with the Global 10 is that some of its dimensions, or PCs, exaggerate affinity between modern-day and Mesolithic Europeans. This is especially true for PC6. Hence, to try and mitigate this problem I decided to remove PC6 from the Global 10 datasheet used in my analysis.
To try these models on your own genome, remove PC6 from your Global 10 coordinates file, and use the data text files provided in the zip folder linked to above. It’s best to rely on the datasheets specifically designed for your ethnic group or region of Europe.
But feel free to tweak my models. There’s no harm in experimenting if you’re cautious and sensible about it. Indeed, using Iberia_HG or Loschbour along with Hungary_HG appears to produce more accurate outcomes for many Western Europeans.
The important, but often neglected, point to keep in mind is that I designed the Global 10 to help replicate results from more reliable but technically less accessible methods, and not to challenge any generally accepted models.
In the near future, a wider choice of ancient samples should enable me to fine tune and improve the models. For instance, a slightly more eastern-shifted forager reference population than Hungary_HG, such as the yet to be published Lithuanian Narva samples (see here), will probably shift the results slightly for Northeast Europeans, perhaps by bringing down their Yamnaya-related ancestry proportions by a few per cent. Moreover, adding a wide range of yet to be published Middle to Late Neolithic European samples, such as those from the Globular Amphora Culture (GAC), should prove an interesting exercise.
Please note that the discussion pertaining to this post is at my other blog HERE.
See also...
Global 10: A fresh look at global genetic diversity
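For anyone unsure how to remove PC6 from a coordinates file, here is a minimal Python sketch of that step. It assumes the file is comma-separated, with a header row naming the columns PC1 to PC10 and the sample label in the first column. That layout is an assumption, not something spelled out in this post, so check your own file first. The function name drop_column and the sample values are made up purely for illustration.

```python
import csv
import io

def drop_column(csv_text, column="PC6"):
    """Return csv_text with the named column removed from every row.

    Assumes the first row is a header containing the column name.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    idx = header.index(column)  # raises ValueError if the column is absent
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Keep everything before and after the unwanted column.
        writer.writerow(row[:idx] + row[idx + 1:])
    return out.getvalue()

# Made-up coordinates for illustration only; these are not real Global 10 values.
example = (
    ",PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10\n"
    "Sample_A,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.10\n"
)
trimmed = drop_column(example)
print(trimmed)
```

The same column has to be missing from both your own coordinates and the reference datasheet, otherwise the dimensions won't line up.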
Thursday, August 10, 2017
I've updated the Basal-rich K7 spreadsheet and the Global 10 datasheets with a plethora of ancient individuals and populations, including Anglo-Saxons, British Celts (labeled England_IA), Minoans, Mycenaeans, Bronze Age Iberians and many more.
The Basal-rich K7
Global 10: A fresh look at global genetic diversity
An nMonte and 4mix guide for the participants of the Basal-rich K7 and/or Global 10 tests
Please note that the discussion pertaining to this post is at my other blog HERE.
Monday, May 8, 2017
Copied from a thread at the Anthrogenica forum because unfortunately it seems that a lot of people can't access the post:
This is an nMonte and 4mix guide I have written for people who donated to the Eurogenes Project in order to take part in the Basal-rich K7 and/or Global 10 tests of that project and subsequently received their test results. For information on how to participate in one or both of the Basal-rich K7 and Global 10 tests, see the link below:
Fund-raising offer: Basal-rich K7 and/or Global 10 genetic map
The results you receive from Davidski by email include your Basal-rich K7 component percentages and your position on the Basal-rich K7 PCA if you took the Basal-rich K7 test, and your Global 10 PCA coordinates and your position on the Global 10 PCA if you took the Global 10 test. You will need your Basal-rich K7 component percentages and/or Global 10 PCA coordinates in order to make use of nMonte and 4mix, which model you as a mix of different populations, reporting the ancestry percentage for each population and the distance of the overall fit, based on either your Basal-rich K7 or your Global 10 results.
You can download nMonte and 4mix from these links respectively:
nMonte
4Mix
Because it can run multiple targets at the same time, I linked to 4mix_multi rather than classic 4mix. They are basically the same in all other respects.
In order to use nMonte and 4mix you need to have the R software installed on your PC. You can download it from one of the mirrors here:
CRAN mirrors
Making a target file for Basal-rich K7:
Open Notepad and copy and paste the Basal-rich K7 component names and your Basal-rich K7 component percentages along with your name in this format:
Basal-rich K7 spreadsheet
Global 10 datasheet
Save the input file as input.
Here is an example of a Basal-rich K7 input file for nMonte:
https://www.familytreedna.com/groups/anatol-balkan-caucas/about
https://www.facebook.com/groups/800912433320422/
See also...
A more specific guide to modeling your genome with the Global 10/nMonte method: Your ancient ancestry #1
Thursday, September 22, 2016
Judging by the Google search terms that are bringing traffic to this and my other blogs, a total newb to the scene is analyzing the Orcadian samples from the HGDP at GEDmatch with my K15 test. Please keep in mind that you will not see coherent results for many of the academic samples available online when using my tests. That's because I used these samples to source the allele frequencies for the tests. As a result, their ancestry proportions will often be very different from those of other samples from the same ethnic groups that were not used in this way. I call this problem the calculator effect, and it's described in my blog posts at the links below:
Wednesday, July 22, 2015
A few people are asking me about the effects of marker overlap or genotype rate on test accuracy. Logic dictates that the better the overlap, the more accurate the results, but this isn't strictly true. Here's what I've learned over the years:
- accuracy doesn't necessarily improve with higher marker overlap; it improves (up to a certain point) with more markers. In other words, a well designed test based on 200,000 SNPs will produce very accurate results for a genotype file with a marker overlap of 50%. On the other hand, another well designed test, based on just 50,000 SNPs, is likely to produce less accurate results for a genotype file with a marker overlap of 100%.
- you will still see accurate results using as little as 25,000 SNPs, as long as the test doesn't suffer from any serious problems
- poorly designed tests, such as those based on fewer than 1,000 reference samples, always produce garbage results no matter what the marker overlap
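The arithmetic behind the first point can be sketched in a few lines of Python. This is only an illustration, assuming that the number of markers actually used is simply the test's SNP count multiplied by the overlap. The function name usable_markers is made up for this example.

```python
def usable_markers(test_snps, overlap):
    """Number of a test's SNPs that are present in a genotype file.

    overlap is the marker overlap as a fraction between 0 and 1.
    """
    return round(test_snps * overlap)

# The two cases from the first point above.
large_test = usable_markers(200_000, 0.50)  # 200,000 SNP test, 50% overlap
small_test = usable_markers(50_000, 1.00)   # 50,000 SNP test, 100% overlap
print(large_test, small_test)
```

The larger test still ends up using twice as many markers as the smaller one, even at half the overlap. Both are also well above the 25,000 SNPs mentioned in the second point.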
So how can you tell a well designed test from a poorly designed one? It's easy, just have a look at the results they're producing for people with less complex ancestry. For instance, ask a Lithuanian, Swede or Pole what they're seeing at the top of their oracles. Is the Swede seeing Swedish or, say, German? If the answer is German instead of Swedish, or at least some type of Scandinavian, then the test is garbage and best ignored.
By the way, the recent Allentoft et al. paper on the ancient genomics of Eurasia includes a useful discussion on the effects of missing markers on the accuracy of both ADMIXTURE and PCA results. Refer to section 6.2 in the freely available supplementary info PDF here.
Tuesday, May 12, 2015
Thanks to Eurogenes project member DESEUK1. A zip file with the R script, instructions and a couple of data sheets is available here.
So let's model Poles as a bunch of ancient genomes from Central and Eastern Europe using output from my K8 analysis.
Copy & Paste: source('4mix.r')
Copy & Paste: getMix('K8avg.csv', 'target.txt', 'HungaryGamba_EN', 'HungaryGamba_HG', 'Karelia_HG', 'Corded_Ware_LN')
After a few seconds you should see the results...
Target = 19% HungaryGamba_EN + 14% HungaryGamba_HG + 2% Karelia_HG + 65% Corded_Ware_LN @ D = 0.0062
Obviously the script can use ancestry proportions and/or population averages from any test, provided they're formatted properly. The accuracy of the modeling will depend on the quality of the input.
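For those curious about what this kind of four-way modeling involves, here is a rough Python sketch of one way to do it: try every combination of four percentages that sum to 100, and keep the one whose weighted average is closest (by Euclidean distance) to the target. This is not the 4mix script and I'm not claiming it works the same way internally. The function name four_way_mix, the population names Pop_A to Pop_D and all of the numbers are made up for illustration.

```python
import math
from itertools import product

def four_way_mix(target, sources, step=1):
    """Brute-force the best-fitting mixture of exactly four sources.

    target  -- list of component values for the individual being modeled
    sources -- dict of four {name: list of component values}
    step    -- grid resolution in percentage points
    Returns (weights_by_name, distance).
    """
    names = list(sources)
    if len(names) != 4:
        raise ValueError("exactly four source populations are required")
    a, b, c, d = (sources[n] for n in names)
    best_weights, best_dist = None, math.inf
    grid = range(0, 101, step)
    for wa, wb, wc in product(grid, repeat=3):
        wd = 100 - wa - wb - wc  # the fourth weight is whatever is left over
        if wd < 0:
            continue
        mix = [(wa * x + wb * y + wc * z + wd * w) / 100
               for x, y, z, w in zip(a, b, c, d)]
        dist = math.dist(mix, target)  # Euclidean distance, Python 3.8+
        if dist < best_dist:
            best_weights, best_dist = (wa, wb, wc, wd), dist
    return dict(zip(names, best_weights)), best_dist

# Made-up component averages (fractions summing to 1), not real K8 output.
sources = {
    "Pop_A": [0.70, 0.10, 0.10, 0.10],
    "Pop_B": [0.10, 0.70, 0.10, 0.10],
    "Pop_C": [0.10, 0.10, 0.70, 0.10],
    "Pop_D": [0.10, 0.10, 0.10, 0.70],
}
# Built by hand as exactly 20% Pop_A + 10% Pop_B + 5% Pop_C + 65% Pop_D.
target = [0.22, 0.16, 0.13, 0.49]

weights, dist = four_way_mix(target, sources)
print(weights, f"D = {dist:.4f}")
```

Because the target here was constructed as an exact mix, the search recovers 20/10/5/65 with a distance of essentially zero. Real data will never fit that cleanly, which is why the distance is worth reporting alongside the percentages.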
Update 19/05/2015: A new version of the 4mix script that can run multiple targets is available here, courtesy of Open Genomes.