# Lectures 12-13: Match Probabilities: Population Genetics and the NRC II

(version 7 Jan 2008)

This material is copyrighted and MAY NOT be used for commercial purposes


## Computing Match Probabilities

There are two outcomes when we use DNA to compare a crime scene sample and a suspect: Exclusion and Failure to Exclude.

With an Exclusion, we are done, as the suspect did not contribute that crime sample.

The fun begins when we have a failure to exclude, namely when the DNA profiles from the crime scene and from our suspect match (or, more correctly, when we fail to exclude the suspect).

In the earlier days of DNA testing (RFLP markers), there was concern as to whether two bands actually "matched", and how one computes the probability of a random person also producing a match.

The National Research Council (NRC) was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy's purposes of furthering knowledge and advising the federal government. It produces reports on a number of issues where science and public policy overlap, such as food safety, and was asked to produce a document on DNA evidence.

In 1992, the NRC issued DNA Technology in Forensic Science. You can read this online.

However, this report was very controversial, with many feeling that the proposed methods were overly conservative and that population genetics was not well represented.

As a result, and coupled with the change in technology from RFLP to PCR-based systems, a second report (usually called NRC II) was issued in 1996.

National Research Council (NRC) (1996). The Evaluation of Forensic DNA Evidence. National Academy Press, Washington DC. NRC II is also online.

The methods for computing match probabilities suggested by the NRC II are now the common standard in DNA court cases around the country.

## Computing Match Probabilities: Why all of the Fuss?

At first blush, computing match probabilities would seem easy.

For example, suppose we have two marker loci, where the crime sample has an 11,12 at the first marker (call it M1) and a 13,13 at the second marker (call it M2).

Recall from Lecture 5 that the numbers refer to alleles, so this sample is an 11,12 heterozygote at M1 and a 13,13 homozygote at M2.

Thus, from probability, your initial calculation of the match probability would be

### $\Pr(\text{Match}) = [\,2 \Pr(11 \text{ at M1}) \Pr(12 \text{ at M1})\,] \times \Pr(13 \text{ at M2})^2$

So if the frequency of allele 11 is 0.3, the frequency of allele 12 is 0.1, and the frequency of allele 13 is 0.1, the match probability is just

### $\Pr(\text{Match}) = (2 \times 0.3 \times 0.1) \times 0.1^2 = 0.0006$

We have made two key assumptions to get this simple result:

1. Hardy-Weinberg: $\text{freq}(Aa) = 2\,\text{freq}(A)\,\text{freq}(a)$ and $\text{freq}(AA) = \text{freq}(A)^2$

2. Independence: The events at Marker one have no impact on the events at Marker two, so that

• Pr(11,12 at M1 and 13,13 at M2) = Pr(11,12 at M1) * Pr(13,13 at M2)

This is often called the product rule.
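The calculation above can be sketched in a few lines of Python. This is only an illustration of the two assumptions (Hardy-Weinberg within a marker, the product rule across markers); the function names and the dictionary layout are made up for this example and are not part of any forensic software.

```python
def genotype_frequency(genotype, allele_freqs):
    """Hardy-Weinberg frequency of one genotype, given as a pair of alleles."""
    a, b = genotype
    if a == b:
        return allele_freqs[a] ** 2               # homozygote: p^2
    return 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote: 2pq


def match_probability(profile, freqs_by_marker):
    """Product rule: multiply the genotype frequencies across markers."""
    prob = 1.0
    for marker, genotype in profile.items():
        prob *= genotype_frequency(genotype, freqs_by_marker[marker])
    return prob


# The two-marker example from the text (allele frequencies as given there).
profile = {"M1": (11, 12), "M2": (13, 13)}
freqs = {"M1": {11: 0.3, 12: 0.1}, "M2": {13: 0.1}}
print(match_probability(profile, freqs))  # approximately 0.0006
```

Keeping the per-marker step separate from the multiplication makes it easy to see which assumption each line relies on.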

Where can we go wrong?

## Departures from Hardy-Weinberg

Population genetics is the field that deals with how genes behave in populations. The Hardy-Weinberg law, which dates from the early 1900s, states that under certain circumstances,

### $\text{freq}(AA) = \text{freq}(A)^2, \qquad \text{freq}(Aa) = 2\,\text{freq}(A)\,\text{freq}(a)$

In other words, it allows us to relate genotype frequencies to allele frequencies.

Two critical assumptions (for our purposes) are

• Random mating (no mating of even distant relatives)

• No population structure, just one single random mating population is involved.

### Random mating

Departures from random mating typically result in homozygotes being more common and heterozygotes less common than Hardy-Weinberg predicts.

If theta measures the departure from random mating, then

### $\text{freq}(AA) = \text{freq}(A)^2 + \theta\,\text{freq}(A)\,(1 - \text{freq}(A))$

Suppose freq(A) = 0.1, theta = 0.03

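$\text{freq}(A)^2 = 0.1^2 = 0.01$

$\text{freq}(A)^2 + \theta\,\text{freq}(A)\,(1 - \text{freq}(A)) = 0.01 + 0.03 \times 0.1 \times 0.9 = 0.0127$

Even this small a value of theta raises the homozygote frequency by a factor of 1.27. If all 13 markers were homozygous for alleles at this frequency, the differences would compound to $1.27^{13}$, or roughly a 22-fold difference.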
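To check arithmetic of this kind, a minimal sketch such as the following can help. It assumes, purely for illustration, that every one of the 13 loci is homozygous for an allele of the same frequency; real profiles mix homozygous and heterozygous loci with different allele frequencies. The function name is hypothetical.

```python
def homozygote_frequency(p, theta):
    """freq(AA) = p^2 + theta * p * (1 - p), the theta-adjusted homozygote frequency."""
    return p ** 2 + theta * p * (1 - p)


p = 0.1
for theta in (0.0, 0.01, 0.03):
    # Ratio of the adjusted frequency to the plain Hardy-Weinberg value p^2.
    ratio = homozygote_frequency(p, theta) / p ** 2
    # Compounded over 13 loci, assuming each is homozygous with the same p.
    print(f"theta={theta}: per-locus ratio {ratio:.4f}, over 13 loci {ratio ** 13:.2f}")
```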

### Population Structure

We can also see departures from Hardy-Weinberg if our "population" is really two (or more) subpopulations and we do not account for this structure.
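A small numerical sketch shows how this happens (the effect is commonly known as the Wahlund effect). The two subpopulation allele frequencies and the equal mixing weights below are invented for illustration; each subpopulation is assumed to be in Hardy-Weinberg on its own.

```python
def pooled_homozygote_frequency(subpop_freqs, weights):
    """Actual freq(AA) in the pooled sample: weighted average of p^2 over subpopulations."""
    return sum(w * p ** 2 for p, w in zip(subpop_freqs, weights))


def naive_homozygote_frequency(subpop_freqs, weights):
    """freq(AA) predicted by applying Hardy-Weinberg to the pooled allele frequency."""
    p_bar = sum(w * p for p, w in zip(subpop_freqs, weights))
    return p_bar ** 2


subpop_freqs = [0.05, 0.15]  # hypothetical frequency of allele A in each subpopulation
weights = [0.5, 0.5]         # hypothetical share of each subpopulation in the pool

actual = pooled_homozygote_frequency(subpop_freqs, weights)
naive = naive_homozygote_frequency(subpop_freqs, weights)
print(actual, naive)  # the pooled sample has more homozygotes than p_bar^2 predicts
```

The excess of homozygotes equals the variance of the allele frequency across subpopulations, so it vanishes only when the subpopulations share the same allele frequency.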

## Lack of Independence

Two critical assumptions (for our purposes) are

• Markers are genetically unlinked -- they are on different chromosomes.

• Marker loci are statistically unlinked: knowing the genotype at one marker gives us no information about the genotype at the second.

When two (or more) markers reside on the same chromosome, they tend to be inherited together.

This is typically not a problem for autosomal markers, as these are chosen to reside on different chromosomes.

This will be an issue when we deal with Y-chromosomal markers.

### Statistical Associations Among Unlinked Markers

Population structure, admixture (matings between different populations), and other historical features of recent population structure can create associations even between unlinked loci.

For example, suppose a sample mixes two populations, where everyone in population 1 is AA at the first marker and BB at the second, while no one in population 2 carries either genotype. Then knowing that the genotype is AA at the first marker tells us the individual is from population 1, and hence the second marker has to be BB.

This statistical association is often called (somewhat misleadingly) linkage disequilibrium or LD, although no linkage is required in this example. It has also been called gametic phase disequilibrium, so you see why the term LD is used.

While this is an extreme example, even a small amount of information (say, a genotype probability at one marker rising from 0.10 to 0.15 once another marker is known) can have a huge impact when compounded over multiple loci.
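The following sketch illustrates how mixing two populations creates an association between unlinked markers. All numbers are invented: each population is described by its probability of AA at the first marker and BB at the second, the markers are assumed independent within each population, and the two populations are mixed in equal proportions.

```python
def mixture_probabilities(populations, weights):
    """Return (joint, marginal_1, marginal_2) genotype probabilities in a mixed sample.

    Each population is a pair (Pr(AA at marker 1), Pr(BB at marker 2)), with the
    two markers assumed independent within each population.
    """
    joint = sum(w * a * b for (a, b), w in zip(populations, weights))
    marginal_1 = sum(w * a for (a, _), w in zip(populations, weights))
    marginal_2 = sum(w * b for (_, b), w in zip(populations, weights))
    return joint, marginal_1, marginal_2


cases = [
    ("extreme", [(1.0, 1.0), (0.0, 0.0)]),   # AA always implies BB
    ("mild", [(0.2, 0.2), (0.05, 0.05)]),    # modest frequency differences
]
for label, populations in cases:
    joint, m1, m2 = mixture_probabilities(populations, [0.5, 0.5])
    print(label, "product rule:", m1 * m2, "actual:", joint, "Pr(BB | AA):", joint / m1)
```

In both cases the actual joint probability exceeds the product of the marginals, even though the markers are independent within each population.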

## The NRC II Recommendations

Recommendation 4.1.

• In general, the calculation of a profile frequency should be made with the product rule.

• If the race of the person who left the evidence-sample DNA is known, the database for the person's race should be used; if the race is not known, calculations for all the racial groups to which possible suspects belong should be made.

• For systems in which exact genotypes can be determined, $p^2 + p(1-p)\theta$ should be used for the frequency at such a locus instead of $p^2$.

• A conservative value of theta for the US population is 0.01; for some small, isolated populations, a value of 0.03 may be more appropriate.

• $2 p_i p_j$ should be used for heterozygotes.
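As a rough illustration of Recommendation 4.1, the sketch below applies the theta-adjusted homozygote formula and the unadjusted heterozygote formula at each locus, then multiplies across loci. The function names, the data layout, and the choice of the earlier two-marker example are assumptions made for this example only.

```python
def locus_frequency(genotype, allele_freqs, theta=0.01):
    """Per-locus genotype frequency following Recommendation 4.1."""
    a, b = genotype
    if a == b:
        p = allele_freqs[a]
        return p ** 2 + p * (1 - p) * theta       # homozygote, theta-adjusted
    return 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote, 2 * p_i * p_j


def profile_frequency(profile, freqs_by_marker, theta=0.01):
    """Product rule across loci, using the per-locus frequencies above."""
    result = 1.0
    for marker, genotype in profile.items():
        result *= locus_frequency(genotype, freqs_by_marker[marker], theta)
    return result


# Illustrative profile and allele frequencies (same values as the earlier example).
profile = {"M1": (11, 12), "M2": (13, 13)}
freqs = {"M1": {11: 0.3, 12: 0.1}, "M2": {13: 0.1}}
print(profile_frequency(profile, freqs, theta=0.01))
print(profile_frequency(profile, freqs, theta=0.03))
```

Only the homozygous locus changes with theta here, which matches the recommendation to leave the heterozygote term as $2 p_i p_j$.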

Recommendation 4.2.

• If the particular subpopulation from which the evidence sample came is known, the allele frequencies for the specific subgroup should be used as described in Recommendation 4.1.

• If allele frequencies for the subgroup are not available, although data for the full population are, then the calculations should use the population-structure equations 4.10 for each locus, and the resulting values should then be multiplied. Equation 4.10 is
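The equation itself is missing from this copy of the notes. The form usually quoted from NRC II is given here from memory and should be checked against the report before being relied on. For a homozygote (4.10a) and a heterozygote (4.10b):

$$P(A_iA_i \mid A_iA_i) = \frac{[2\theta + (1-\theta)p_i]\,[3\theta + (1-\theta)p_i]}{(1+\theta)(1+2\theta)}$$

$$P(A_iA_j \mid A_iA_j) = \frac{2\,[\theta + (1-\theta)p_i]\,[\theta + (1-\theta)p_j]}{(1+\theta)(1+2\theta)}$$

A minimal Python sketch under that assumption, with a hypothetical function name and data layout:

```python
def conditional_match_probability(genotype, allele_freqs, theta):
    """Per-locus match probability in the form assumed above for Equation 4.10."""
    a, b = genotype
    denominator = (1 + theta) * (1 + 2 * theta)
    if a == b:
        p = allele_freqs[a]
        # Homozygote case (assumed 4.10a)
        return (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p) / denominator
    p_i, p_j = allele_freqs[a], allele_freqs[b]
    # Heterozygote case (assumed 4.10b)
    return 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j) / denominator


# Multiply the per-locus values, as the recommendation describes.
profile = {"M1": (11, 12), "M2": (13, 13)}
freqs = {"M1": {11: 0.3, 12: 0.1}, "M2": {13: 0.1}}
result = 1.0
for marker, genotype in profile.items():
    result *= conditional_match_probability(genotype, freqs[marker], theta=0.01)
print(result)
```

Setting theta to zero recovers the Hardy-Weinberg values $p_i^2$ and $2 p_i p_j$, which is a quick sanity check on the formulas.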

Recommendation 4.3.

• If the person who contributed the evidence sample is from a group or tribe for which no adequate database exists, data from several other groups or tribes thought to be closely related to it should be used.

• The profile frequency should be calculated as described in Recommendation 4.1 for each group or tribe.

Recommendation 4.4.

• If the possible contributors of the evidence sample include relatives of the suspect, DNA profiles of those relatives should be obtained.

• If these profiles cannot be obtained, the probability of finding the evidence profile in those relatives should be calculated with Formulae 4.8 or 4.9. These can be found on page 113 of NRC II.