Computing Genetic Distances


A service of Family Tree DNA.

Current Version: 24 November 2003

Genetic distances --- The stepwise mutation model --- Technical details --- Replies to some FAQ's (Under construction)


Method: Walsh, Bruce, 2001. Estimating the time to the MRCA for the Y chromosome or mtDNA for a pair of individuals. Genetics 158: 897--912.

Genetic Distances

When trying to reconstruct a deep family tree, some measure (or metric) of genetic distance is required. In an ideal world, we could directly count the total number of mutations at which two chromosomes differ. This is not quite as straightforward as it sounds, as it makes assumptions about the underlying mutation process. For most Y-chromosome markers, the standard model is the Stepwise Mutation Model. This follows the changes in marker length, allowing a step up (or down) by one at each mutation. Note that this means that an observed match could result from no mutations, from one up and one down mutation, from two up and two down mutations, etc. For a genetic distance, we wish to translate the observed difference in a marker allele between two individuals (say an allele of length 24 and one of length 22) into the actual number of mutations that have occurred. Under the simplest stepwise mutation model, mutations are equally likely to increase or decrease the allelic state by one step, and the calculations below assume this symmetric single-step model.

Translating the observed difference in allele sizes into an estimate of the actual number of underlying mutations

The following table translates an observed number of changes into an estimate of the actual number of mutations (assuming a mutation rate of 1/500).

What the table shows is that for two Y alleles with an MRCA 200 (or fewer) generations ago, an excellent estimator of the total number of mutations is simply the sum of all differences. For example, for a TMRCA of 200 generations, an observed difference of 2 steps corresponds to an estimated 2.11 mutations.

Implications for the "squaring" distance rule. If the squaring rule were appropriate, an observed difference of 0 would map to 0, 1 to 1, 2 to 4, 3 to 9, and 4 to 16. This is certainly not the case. For example, for an MRCA of 1000 generations, an observed difference of 2 steps corresponds to an estimate of 4.08 mutations, which looks like the squaring rule. However, an observed difference of 1 corresponds to 2.04 actual mutations (not 1 as predicted), and observed differences of 3 and 4 steps correspond to 3.5 and 5.4 actual mutations, respectively, not the 9 and 16 predicted by the squaring rule. Most striking, a zero difference actually corresponds to an estimate of 3.45 mutations! This last result arises because at 1000 generations we expect (on average) about 4 mutations per marker (2ut = 2 x 1000/500 = 4), and many combinations of four mutations result in a zero final difference.

Observed number of steps    t = 50   t = 100   t = 200   t = 300   t = 400   t = 500   t = 750  t = 1000
0                             0.02      0.08      0.30      0.62      0.99      1.40      2.43      3.45
1                             1.00      1.01      1.06      1.12      1.20      1.31      1.64      2.04
2                             2.01      2.03      2.11      2.23      2.41      2.62      3.28      4.08
3                             3.00      3.01      3.02      3.05      3.09      3.14      3.31      3.54
4                             4.00      4.02      4.06      4.14      4.25      4.29      4.84      5.43

Note that these corrections depend on the actual time t (in generations) to the MRCA. However, this time can be estimated using the distribution of Y marker differences (e.g., with the TMRCA calculator).

The Stepwise Mutation Model

The Assumptions

The stepwise mutational model tries to better account for the actual mutational process that occurs at microsatellite markers. What is scored is marker length. Hence, as the picture below shows, two mutations, say a -1 mutation followed by a +1 mutation, return the marker to its original length.
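To illustrate how scoring only marker length can hide mutations, here is a minimal simulation sketch in Python. The function name, the seed, and the number of trials are illustrative choices, not part of the original method; the only modelling assumption is the symmetric one-step model described on this page.

```python
import random

def simulate_marker(n_mutations, rng):
    """Apply n_mutations single-step changes, each +1 or -1 with equal
    probability, and return the net change in marker length."""
    return sum(rng.choice((-1, 1)) for _ in range(n_mutations))

rng = random.Random(0)  # fixed, arbitrary seed so that runs are repeatable
trials = 10_000

for n in (1, 2, 3, 4):
    # Count trials in which the observed difference is smaller than the
    # actual number of mutations (some steps cancelled each other out).
    hidden = sum(abs(simulate_marker(n, rng)) < n for _ in range(trials))
    print(f"{n} mutations: {hidden / trials:.2f} of trials show fewer than {n} steps")
```

The observed difference can never exceed the number of mutations, and it always has the same parity (odd or even) as the number of mutations.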

The stepwise mutation model looks at the frequency spectrum (0, 1, 2, 3, ...) of the mismatches, namely how many loci show no mismatches, 1 mismatch, 2 mismatches, etc. Its simplest form is the one-step, symmetric model, which assumes only one step per mutation, with equal probability of increasing (+1) or decreasing (-1). More complex stepwise mutational models can be constructed, but this is a little premature until more information on the mutational process is available. Currently, there is very good evidence for the one-step model, as only 1/30 to 1/50 of all (the small number of) observed mutations are two-steps. It must be stressed that these are actual mutations that are observed (i.e., father and son differ by two), as opposed to two individuals that differ by two, where we have no information as to whether this is one, two, or multiple steps.

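Probability of a match: Under the one-step, symmetric stepwise model, the probability that two individuals with a last common ancestor t generations ago differ by i steps at a marker is 2Exp[-2ut] * Bessel(i, 2ut) for i = 1, 2, etc., and Exp[-2ut] * Bessel(0, 2ut) for an exact match (i = 0). Here Bessel(i, x) is the i-th order modified type I Bessel function (see Walsh 2001 for details). The factor of 2 for i > 0 appears because a net change of +i and a net change of -i are both observed as a difference of i.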
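As a rough illustration, the Bessel-function probability can be evaluated in Python using only the standard library. This is a sketch, not the calculator's implementation: the function names and the series truncation (`terms=40`) are assumptions made here, and the Bessel function is computed from its standard power series rather than a library routine.

```python
import math

def bessel_i(order, x, terms=40):
    """Modified Bessel function of the first kind, I_order(x),
    evaluated from its power series (truncated after `terms` terms)."""
    return sum(
        (x / 2) ** (2 * m + order) / (math.factorial(m) * math.factorial(m + order))
        for m in range(terms)
    )

def prob_difference(i, t, u=1 / 500):
    """Probability that two individuals whose MRCA lived t generations ago
    differ by i steps at a marker with mutation rate u."""
    x = 2 * u * t
    factor = 1 if i == 0 else 2  # +i and -i are both seen as a difference of i
    return factor * math.exp(-x) * bessel_i(i, x)

# Distribution of observed differences for a few times to the MRCA.
for t in (50, 200, 1000):
    row = ", ".join(f"{prob_difference(i, t):.3f}" for i in range(5))
    print(f"t = {t}: {row}")
```

Summed over all i, these probabilities add up to 1, which is a useful check on the factor of 2.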

Advantages: The stepwise model more fully accounts for the likely underlying mutation process.

Disadvantages: There are a large number of stepwise models one could use -- for example, we could allow for two steps, but would then have to assign these probabilities. Hence, the stepwise model is certainly an improvement, but it, too, may be the wrong mutational model for a particular locus. Current fits of the data show that the symmetric one-step model fits the data well, but this may certainly change for any particular marker. If such information becomes available, it will be incorporated.

If most mutations are one-step, but a few two-step mutants exist, then the true distribution of the TMRCA falls between the infinite alleles and stepwise models. For example, suppose we observed an exact match at 20 markers, and a two-step mismatch at one marker. This is more probably a single mutation as opposed to two (or more) one-step mutations. If the latter were true, we would expect other loci to have mutated as well, unless they had lower mutation rates.

Technical Details

The values in the above table were generated as follows.

First, we are assuming the symmetric single-step model, in which a mutation occurs at a marker with probability u = 1/500 per generation. With probability 1/2 the mutation increases the length by one; otherwise it decreases the length by one.

This model is an example of what is called a Markov chain. We are interested in the distribution of differences (0, 1, 2, 3, etc.) given that a total of M mutations have occurred. For example, for two mutations, the probability is 50% for no observed difference and 50% for an observed two-step difference.

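The general expression for the probability of k differences given a total of M mutations is just

$$\Pr(k \mid M) = \begin{cases} \binom{M}{M/2}\, 2^{-M}, & k = 0,\ M \text{ even} \\ 2 \binom{M}{(M+k)/2}\, 2^{-M}, & 1 \le k \le M,\ M + k \text{ even} \\ 0, & \text{otherwise} \end{cases}$$

where the factor of 2 for k > 0 counts net changes of +k and -k, both of which are observed as a difference of k.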
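A minimal Python sketch of this conditional probability, assuming the symmetric one-step model, follows. The function name is illustrative and not taken from the original page.

```python
from math import comb

def prob_diff_given_mutations(k, m):
    """Probability of an observed difference of k steps given that a total
    of m single-step mutations (each +1 or -1, equally likely) occurred."""
    # k cannot exceed m, and k must have the same parity as m.
    if k < 0 or k > m or (m - k) % 2:
        return 0.0
    ways = comb(m, (m + k) // 2)  # sequences with a net change of +k
    if k > 0:
        ways *= 2                 # a net change of -k is equally likely
    return ways / 2 ** m

# The two-mutation example from the text: half the time no difference is
# observed, half the time a two-step difference is observed.
print(prob_diff_given_mutations(0, 2), prob_diff_given_mutations(2, 2))
```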

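The next step is to compute the probability that M mutations have occurred given t generations. The two lineages leading back to the MRCA together span 2t generations, so the number of mutations M simply follows a Poisson distribution with parameter T = 2ut,

$$\Pr(M \mid t) = \frac{e^{-T}\, T^{M}}{M!}, \qquad T = 2ut.$$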
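The Poisson step can be sketched in Python as follows. The function name and the default rate of 1/500 (taken from the text) are the only choices made here.

```python
import math

def prob_mutations(m, t, u=1 / 500):
    """Poisson probability that a total of m mutations occurred on the two
    lineages, given t generations to the MRCA and mutation rate u."""
    rate = 2 * u * t  # the Poisson parameter T = 2ut
    return math.exp(-rate) * rate ** m / math.factorial(m)

# Expected number of mutations per marker (2ut) for the times in the table.
for t in (50, 100, 200, 300, 400, 500, 750, 1000):
    print(f"t = {t}: expected mutations = {2 * t / 500:.2f}, "
          f"P(no mutations) = {prob_mutations(0, t):.3f}")
```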

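Hence, the probability that the individuals differ by k (observed) differences given t generations to the MRCA follows by summing over all possible numbers of mutations,

$$\Pr(k \mid t) = \sum_{M=k}^{\infty} \Pr(k \mid M)\, \Pr(M \mid t).$$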
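A sketch of this sum in Python is given below. To keep the block self-contained it restates the two ingredients (the probability of k differences given M mutations, and the Poisson probability of M mutations). The function names and the truncation of the infinite sum at `max_mutations=80` are assumptions made for illustration.

```python
from math import comb, exp, factorial

def prob_diff_given_mutations(k, m):
    """Probability of an observed difference k given m one-step mutations."""
    if k > m or (m - k) % 2:
        return 0.0
    return comb(m, (m + k) // 2) * (2 if k else 1) / 2 ** m

def prob_mutations(m, rate):
    """Poisson probability of m mutations when the expected number is rate."""
    return exp(-rate) * rate ** m / factorial(m)

def prob_diff_given_time(k, t, u=1 / 500, max_mutations=80):
    """Probability of an observed difference k given t generations to the
    MRCA, summing over the (truncated) number of mutations."""
    rate = 2 * u * t
    return sum(
        prob_diff_given_mutations(k, m) * prob_mutations(m, rate)
        for m in range(k, max_mutations + 1)
    )

for k in range(5):
    print(f"k = {k}: {prob_diff_given_time(k, 1000):.4f}")
```

For the rates used on this page (2ut of at most 4), terms beyond 80 mutations are negligible, so the truncation should not affect the printed digits.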

Finally, we can obtain an expression for the item of interest, namely Prob(M mutations | an observed difference of k). This allows us to estimate the actual number of mutations given the observed number of differences. It follows using Bayes' theorem,

$$\Pr(M \mid k, t) = \frac{\Pr(k \mid M)\, \Pr(M \mid t)}{\Pr(k \mid t)}.$$

We have expressions for all of these probabilities. Plugging in gives the table values.
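The whole calculation can be sketched in a few lines of Python. This is not the original code behind the table: it assumes that the tabulated estimate is the posterior mean of M, truncates the sum at 80 mutations, and uses illustrative function names.

```python
from math import comb, exp, factorial

def posterior_mutations(k, t, u=1 / 500, max_mutations=80):
    """Return {M: Prob(M mutations | observed difference k)} via Bayes' theorem."""
    rate = 2 * u * t
    joint = {}
    # Only M >= k with the same parity as k can give a difference of k.
    for m in range(k, max_mutations + 1, 2):
        walk = comb(m, (m + k) // 2) * (2 if k else 1) / 2 ** m  # Prob(k | M)
        poisson = exp(-rate) * rate ** m / factorial(m)          # Prob(M | t)
        joint[m] = walk * poisson
    total = sum(joint.values())                                  # Prob(k | t)
    return {m: p / total for m, p in joint.items()}

def expected_mutations(k, t, u=1 / 500):
    """Posterior mean number of mutations given an observed difference k
    (assumed here to be the estimate reported in the table)."""
    return sum(m * p for m, p in posterior_mutations(k, t, u).items())

for k in (0, 2):
    print(f"k = {k}, t = 1000: {expected_mutations(k, 1000):.2f}")
```

Under these assumptions the sketch gives about 3.45 for k = 0 and 4.08 for k = 2 at t = 1000, in line with the table. Values for other k may differ from the tabulated ones if the original calculation used different conventions.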

Replies to some FAQ's