Current Version: 27 Jan. 2002
Method: Walsh, Bruce, 2001. Estimating the time to the MRCA for the Y chromosome or mtDNA for a pair of individuals. Genetics 158: 897-912.
The goal is to use molecular markers (DNA sequences showing variation in the population; also called genetic markers or simply markers) to provide information on how closely the Y chromosomes from two individuals are related. Relatedness is quantified by TMRCA, the Time to the Most Recent Common Ancestor (MRCA), which is how many generations the two examined Y chromosomes are from a common ancestor.
Estimates of TMRCA are based on the observed number of mutations by which the two Y chromosomes differ. Since mutations occur at random, the estimate of a TMRCA is not an exact number (e.g., 7 generations), but rather a probability distribution. As one uses more and more markers, the distribution becomes tighter and tighter about its mean value (its variance becomes smaller), and estimates have higher precision.
In an ideal world, we could directly count the total number of mutations at which two chromosomes differ. This is not quite as straightforward as it sounds, as it requires assumptions about the underlying mutation process. Further, our estimates of mutation rates are still somewhat imprecise. To accommodate both these concerns, we use models that make certain assumptions. The philosophy has been to incorporate the most recent information, while recognizing that we simply don't yet have all the information needed to correctly apply many of our refinements. As this information becomes available (such as good estimates of marker-specific mutation rates), the calculator will be updated to incorporate new data. Much of the data will come from the analysis of the growing FTDNA database.
From the user standpoint, the good news is that once your genotype has been obtained, you can return to this website as these refinements are incorporated to obtain an even more refined estimate of the time to MRCA between you and the individual(s) of interest.
The ideal genetic marker for genealogical studies is one that has a very high mutation rate, so that it will only stay in its current state for a few generations. A type of DNA sequence called a microsatellite has this property, with mutation rates on the order of 1/500. Microsatellites are small blocks of DNA with repeated sequences; for example, ATATATAT is a sequence with four AT repeats. When mutations occur in microsatellites, the length (or size) of the satellite changes, for example from a four repeat to a five or three repeat.
One can also use other markers, the most common being SNPs (pronounced "Snips"), for single nucleotide polymorphisms. The problem with SNPs for genealogical work is that they have very low mutation rates. This makes them very useful for very deep genealogies (e.g., thousands of generations), but not very informative for very recent genealogies (in the tens to hundreds of generations). Hence, SNP markers are not very informative for genealogists.
It is likely that mutation rates differ among microsatellites. Again, if we could find those with the highest mutation rates, these would provide the most information. Since this question is still unresolved, for a first pass we will assume all microsatellite markers have the same rate (as information on rate differences becomes available, the calculator will be updated to allow for marker-specific differences).
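As a small illustration of how a microsatellite is scored, the sketch below counts back-to-back copies of a repeat motif and reports the difference in repeat number between two alleles. The function names and the second example sequence are hypothetical, chosen only for illustration; real genotyping reports the repeat counts directly.

```python
def count_repeats(sequence: str, motif: str) -> int:
    """Count back-to-back copies of `motif` starting at the beginning of `sequence`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    count = 0
    # Advance one motif length at a time while the motif keeps repeating.
    while sequence.startswith(motif, count * len(motif)):
        count += 1
    return count


def step_difference(allele_a: str, allele_b: str, motif: str) -> int:
    """Absolute difference in repeat number between two alleles."""
    return abs(count_repeats(allele_a, motif) - count_repeats(allele_b, motif))


# The example from the text: ATATATAT has four AT repeats.
print(count_repeats("ATATATAT", "AT"))
# A hypothetical five-repeat allele differs from it by one step.
print(step_difference("ATATATAT", "ATATATATAT", "AT"))
```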
For now, we have used the best overall estimate for our markers and two bounds in the calculator. These are
As mentioned, a potential problem when comparing the marker information from two individuals is counting the
Second, when two individuals differ by (say) two steps in the length of a microsatellite marker, is this two mutations? One mutation? The correct counting is critical. Most microsatellite mutations appear to be single-step changes, but in rare cases (about 1/30 to 1/50 of mutations) a mutation is a two-step change.
Since we are somewhat ignorant of the actual mutational details, two different assumptions are used to model mutations. These are based on standard models from molecular population genetics, and are
Note from the calculator that these methods give essentially the same result when the fraction of total matches is very high and differ only as the fraction of matches decreases. Further, the stepwise model always gives higher estimates than the infinite alleles model (as it assumes more mutations could have occurred, and hence gives longer times).
The Assumptions
The infinite alleles model is just a fancy term that population geneticists use to refer to a model where each new mutant is different (i.e., there are an infinite number of states that an allele can mutate to, hence each mutation is assumed to be unique). The motivation is to consider a long DNA sequence, where any two changes are extremely likely to give two different sequences.
If each new mutant is different, then we don't have to correct for mutations that we did not see and hence did not score, and (as the figure below shows) we can simply look at the differences between marker loci and count the actual number of mutations that have occurred (three in the case below: the red mutation on the left individual's Y chromosome and the black and green on the right individual's).
Probability of a match: Under the infinite alleles model, a particular marker locus is the same in both individuals if and only if no mutations have occurred. For two individuals that had a last common ancestor t generations ago, this probability is Exp[-2ut], where u is the assumed mutation rate. If markers do not match, it is assumed that only a single mutation has occurred, regardless of how much the markers differ by.
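The match probability Exp[-2ut] can be turned into a likelihood for an observed number of matches. The following is a minimal sketch, not the calculator's actual code: it assumes markers are independent and share a single rate, uses u = 1/500 (the order of magnitude quoted for microsatellites) purely for illustration, and the function names are hypothetical.

```python
import math


def match_probability(u: float, t: float) -> float:
    """Probability that one marker is identical in both individuals: Exp[-2ut]."""
    return math.exp(-2.0 * u * t)


def match_likelihood(n_markers: int, n_matches: int, u: float, t: float) -> float:
    """Binomial probability of n_matches matching markers out of n_markers,
    assuming independent markers with a common mutation rate u (an assumption)."""
    p = match_probability(u, t)
    n_mismatches = n_markers - n_matches
    return math.comb(n_markers, n_matches) * p**n_matches * (1.0 - p) ** n_mismatches


U = 1.0 / 500.0  # illustrative rate only, not a calibrated estimate

for t in (5, 25, 100):
    print(t, match_probability(U, t), match_likelihood(12, 11, U, t))
```

Scanning the likelihood over t shows which times are best supported by a given match count; for 11 matches out of 12, for example, very small and very large values of t both score poorly.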
Advantages: This is the simplest mutation model, simply scoring loci as match/no match.
Disadvantages: There is a risk of undercounting the total number of mutations, and hence underestimating the actual TMRCA. If individuals are identical at a large fraction of the markers, this risk is very small. As individuals differ at more and more markers, the undercounting can become more severe.
The Assumptions
The stepwise mutational model tries to better account for the actual mutational process that occurs at microsatellite markers. What is scored is marker length. Hence, as the picture below shows, two mutations, (say) a -1 mutation followed by a +1 mutation, return the marker to its original length.
The stepwise mutation model looks at the frequency spectrum (0, 1, 2, 3, ...) of the mismatches, namely how many loci show no mismatches, 1 mismatch, 2 mismatches, etc. Its simplest form is the one-step, symmetric model, which assumes only one step per mutation, with equal probability of increasing (+1) or decreasing (-1). More complex stepwise mutational models can be constructed, but this is a little premature until more information on the mutational process is available. Currently, there is very good evidence for the one-step model, as only 1/30 to 1/50 of all (the small number of) observed mutations are two-steps. It must be stressed that these are actual mutations that are observed (i.e., father and son differ by two), as opposed to two individuals that differ by two, where we have no information as to whether this is one, two, or multiple steps.
Probability of a match: Under the one-step, symmetric stepwise model, the probability that two individuals with a last common ancestor t generations ago differ by i steps at a marker is 2Exp[-2ut] * Bessel(i, 2ut) for i = 1, 2, etc., and Exp[-2ut] * Bessel(0, 2ut) for an exact match (i = 0), where Bessel(k, x) is the k-th order modified type I Bessel function (see Walsh 2001 for details). The factor of 2 for i >= 1 counts differences in either direction (+i or -i).
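The stepwise probabilities can be evaluated numerically. The sketch below is an illustration under stated assumptions rather than the calculator's implementation: the Bessel function is computed from its standard power series (adequate for the small values of 2ut and i that arise here), the rate u = 1/500 is illustrative, and the function names are hypothetical. The factor of 2 is applied only for i >= 1, where a difference of +i and one of -i are both counted; for i = 0 there is a single case, which is what makes the probabilities sum to one.

```python
import math


def bessel_i(k: int, x: float, terms: int = 60) -> float:
    """Modified Bessel function of the first kind, order k, via its power series.
    Intended for small x and k; not a general-purpose implementation."""
    return sum(
        (x / 2.0) ** (2 * m + k) / (math.factorial(m) * math.factorial(m + k))
        for m in range(terms)
    )


def prob_differ_by(i: int, u: float, t: float) -> float:
    """Probability that two markers differ by i steps (in absolute value)
    under the one-step, symmetric stepwise model."""
    x = 2.0 * u * t
    base = math.exp(-x) * bessel_i(i, x)
    return base if i == 0 else 2.0 * base


U = 1.0 / 500.0  # illustrative rate only

for i in range(4):
    print(i, prob_differ_by(i, U, 100))
```

Note that the probability of an exact match here is slightly larger than the infinite alleles value Exp[-2ut], because a +1 mutation followed by a -1 mutation restores the original length.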
Advantages: The stepwise model more fully accounts for the likely underlying mutation process.
Disadvantages: There are a large number of stepwise models one could use -- for example, we could allow for two steps, but would then have to assign these probabilities. Hence, the stepwise model is certainly an improvement, but it, too, may be the wrong mutational model for a particular locus. Current fits of the data show that the symmetric one-step model fits the data well, but this may certainly change for any particular marker. If such information becomes available, it will be incorporated.
If most mutations are one-step, but a few two-step mutants exist, then the true distribution of the TMRCA falls between the infinite alleles and stepwise models. For example, suppose we observed an exact match at 20 markers, and a two-step mismatch at one marker. This is more probably a single mutation as opposed to two (or more) one-step mutations. If the latter were true, we would expect other loci to have mutated as well, unless they had lower mutation rates.
All of the times to MRCA reported here are given in generations. This is the natural way of expressing the results, and it also lets one easily translate them into years back to a common ancestor by assuming an average length, in years, for a human generation. The values in the literature range from 15 to 25 years. Further, the average generation time likely varies with geographic location and hence among different groups of people. This information can also be incorporated into a translation of generations into years.
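The translation from generations to years is a single multiplication. As a brief sketch, the snippet below converts a hypothetical estimate of 7 generations using three assumed generation lengths spanning the 15 to 25 year range quoted above; the function name is illustrative.

```python
def generations_to_years(generations: float, years_per_generation: float) -> float:
    """Convert a time in generations to years for an assumed generation length."""
    return generations * years_per_generation


# Hypothetical TMRCA of 7 generations under three assumed generation lengths.
for years_per_generation in (15, 20, 25):
    print(years_per_generation, generations_to_years(7, years_per_generation))
```

The same conversion applies to each end of a probability interval, so uncertainty about the generation length widens the interval in years.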
The approach we use to compute the probability curves for TMRCA is a Bayesian analysis. Detailed notes on Bayesian Analysis can be found on Bruce Walsh's Issues in Biostatistical Analysis course webpage (in the form of a 21 page pdf file).
In brief, a Bayesian analysis starts with some prior knowledge about the TMRCA, and then uses the marker information to construct a posterior distribution given both the initial assumption and the observed marker data. The reader may ask "Just what prior knowledge do we have about the time to the MRCA?". Actually, the effective size of the population (a long-term historical weighting of population sizes) provides the distribution of TMRCA for two random individuals, as this time follows a geometric distribution with parameter 1/Ne, Ne being the effective population size. (The interested reader can find more details in Bruce Walsh's lecture on Genetic Drift from his Genetics 320 course at the University of Arizona.) Hence, without knowledge of the marker genotypes, this is our prior distribution for TMRCA for two individuals. Obviously, the marker information provides for a much more refined estimate.
We assumed a value of Ne = 50,000 in our analysis. Actually, the prior value of Ne chosen has essentially no effect unless (i) it is very small (500 or less) and (ii) the two individuals are rather divergent. In this case, the smaller assumed Ne decreases the estimated TMRCA compared to that for a larger population.
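The Bayesian calculation can be sketched on a grid of generations. This is a simplified illustration, not the calculator's code: it combines the geometric prior with parameter 1/Ne and the infinite alleles match probability Exp[-2ut], assumes independent markers with a common illustrative rate u = 1/500, and truncates the grid at a hypothetical t_max (reasonable when most markers match, since the likelihood then decays quickly with t). All function and parameter names are assumptions.

```python
import math


def posterior_tmrca(n_markers, n_matches, u=1.0 / 500.0, ne=50_000, t_max=5_000):
    """Posterior probabilities for t = 1..t_max generations:
    geometric prior (parameter 1/ne) times the infinite alleles likelihood."""
    lam = 1.0 / ne
    weights = []
    for t in range(1, t_max + 1):
        prior = lam * (1.0 - lam) ** (t - 1)
        p_match = math.exp(-2.0 * u * t)
        likelihood = p_match**n_matches * (1.0 - p_match) ** (n_markers - n_matches)
        weights.append(prior * likelihood)
    total = sum(weights)
    return [w / total for w in weights]  # normalise so the grid sums to one


def quantile(posterior, q):
    """Smallest t whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for t, w in enumerate(posterior, start=1):
        cumulative += w
        if cumulative >= q:
            return t
    return len(posterior)


post_12 = posterior_tmrca(12, 12)
post_21 = posterior_tmrca(21, 21)
print("median, 12 of 12 matching:", quantile(post_12, 0.5))
print("median, 21 of 21 matching:", quantile(post_21, 0.5))
print("95% bound, 21 of 21 matching:", quantile(post_21, 0.95))
```

Rerunning with a much smaller value of ne shows the prior's influence directly, and comparing the 12 and 21 marker cases shows how additional matching markers tighten the distribution.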
While we used a prior based on a large effective population size, if two individuals share the same common last name, this prior information very likely makes the population we are assuming to be sampling from a much smaller set of individuals. One way to incorporate this is by assuming a much smaller effective population size in the prior. While this approach is formally correct, the problem is that we really have to make a guess as to just what value of Ne to use. Hence, we have not implemented this refinement option, although we may in the future.
A few final comments are in order.
With all this apparent uncertainty (i.e., the various different assumptions about mutation rates, models, generation times, and the like), one might quite reasonably ask what this all means.
First, this does highlight the current sources of uncertainty precluding even more accurate dating. As mentioned, the good news is that once individuals have had their markers scored, they can return here as we refine our current estimates, incorporating new information on specific marker mutation rates and models and the like.
Second, notice that when individuals are quite genetically similar, the different mutational models have little effect, providing very similar results. Where the results become most model-dependent is when individuals differ at several markers, in which case we need more care in estimating the actual total number of mutations from the observed differences in marker patterns. While the results are much more dependent on mutation rates, we have a good deal of confidence that the value we assumed (the standard rate) will turn out to be an accurate rate for most markers. The high and low values were chosen as extremes to bound the results.
Finally, the most powerful way to increase accuracy is to increase the number of markers scored (it is important that these markers have high mutation rates, i.e., microsatellites as opposed to SNPs). This highlights an approach that individuals might wish to consider when getting genotyped. If you are interested in whether a particular person is a relative, a simple 12 marker test can exclude the possibility of being relatives (i.e., if there are two or more mismatches). If there is zero or one mismatch, that person is a reasonably close relative. By now moving to a 21 marker test, a more accurate estimate of the TMRCA can be obtained. By future incorporation of even more markers beyond the set of 21, we can further increase the accuracy of the estimate. The point here is that one can always add additional markers at any time, providing even more refinement should that be wished.
One final caveat about marker number. As mentioned, for markers to be informative, they must not simply be markers for markers' sake. For example, a test that incorporates (say) 10 SNP markers really offers very little power for fine resolution of genealogical information, as these markers show too little variability in the population. They could certainly exclude someone from being a close relative, but this is also done (and much more efficiently) by using a few markers with higher mutation rates. You can see this effect by looking at the difference in the TMRCA under the standard and low mutation rates. SNP markers likely have mutation rates at least one and most likely several orders of magnitude below even the low rate we used. It is the number of highly variable markers (i.e., microsatellites), not simply the total number of markers, that provides the finer level of resolution.
The point here is that a random sample will reflect the actual number of differences. The calculations take this into account exactly, as what we are estimating is the probability that a locus does not change, and then translating this into a time using the mutation rate.
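To see why two or more mismatches on a 12 marker test point away from a close relationship, one can compute how likely zero or one mismatch is for a given time to the common ancestor. The sketch below is an illustration under assumptions not taken from the calculator: it uses the infinite alleles match probability Exp[-2ut], treats markers as independent with a common illustrative rate u = 1/500, and the function name is hypothetical.

```python
import math


def prob_at_most_one_mismatch(n_markers: int, u: float, t: float) -> float:
    """Probability of zero or one mismatching marker out of n_markers when the
    common ancestor lived t generations ago (infinite alleles model)."""
    p_match = math.exp(-2.0 * u * t)
    all_match = p_match**n_markers
    one_mismatch = n_markers * p_match ** (n_markers - 1) * (1.0 - p_match)
    return all_match + one_mismatch


U = 1.0 / 500.0  # illustrative rate only

for t in (5, 20, 50, 200):
    print(t, prob_at_most_one_mismatch(12, U, t))
```

Under these assumptions the probability stays high for recent common ancestors and falls steadily as t grows, which is the sense in which a result of two or more mismatches argues against a close relationship.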