Hardy Weinberg Equilibrium, Lecture notes of Statistics


Duncan Taylor

duncan.taylor@sa.gov.au

Hardy Weinberg equilibrium

Allele frequencies

Genotype probabilities

Confidence intervals

Databases

Linkage

When a DNA profile is obtained from a crime scene and there is a reference DNA profile to compare it against, we need some way to assess the significance of the similar or different allelic information:

A statistical weighting. The way that this statistical weighting is generated has a lot of population genetics and statistics behind it.

I will just scratch the surface of some of the key ideas

Hardy Weinberg Equilibrium

To apply a statistical weighting we must know the ‘rarity’ of the DNA profile we are examining.

This is governed by the 'rarity' of the components (alleles) that make up the reference DNA profile in the population of interest: p^2 + 2pq + q^2 = 1. This works for an 'ideal' population, i.e. one that is in Hardy Weinberg equilibrium.

Genotype frequencies are constant between generations and all frequencies sum to 1

So for example, if we wanted to calculate a profile frequency for a homozygous locus [A,A] where the frequency of [A] is 0.3: Profile frequency = p^2 = (0.3)^2 = 0.09. Or for a heterozygous locus [A,B] where the frequency of [A] is 0.3 and of [B] is 0.7: Profile frequency = 2pq = 2(0.3)(0.7) = 0.42.

Why is 2 out the front here for heterozygotes?

In our population where p = 0.3 and q = 0.7 we would expect genotypes in the following proportions:

         A (0.3)                B (0.7)
A (0.3)  AA (0.3 x 0.3) = 0.09  AB (0.3 x 0.7) = 0.21
B (0.7)  AB (0.7 x 0.3) = 0.21  BB (0.7 x 0.7) = 0.49

A heterozygote individual who contains an [A] and a [B] could actually be [A,B] or [B,A] = pq + qp = 2pq

And p^2 + 2pq + q^2 = (0.3)^2 + 2(0.3)(0.7) + (0.7)^2 = 0.09 + 0.42 + 0.49 = 1
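As a minimal sketch of these product-rule calculations (in Python, using only the allele frequencies quoted above), the same numbers can be reproduced as follows:

# Product-rule (HWE) profile frequencies for a single locus.
p_A, p_B = 0.3, 0.7

homozygote_AA = p_A ** 2           # p^2  -> 0.09
heterozygote_AB = 2 * p_A * p_B    # 2pq  -> 0.42
homozygote_BB = p_B ** 2           # q^2  -> 0.49

# Under HWE the genotype proportions sum to 1.
assert abs(homozygote_AA + heterozygote_AB + homozygote_BB - 1.0) < 1e-9
print(homozygote_AA, heterozygote_AB, homozygote_BB)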

This idea can be extended to multi-allele scenarios, e.g. consider a population with allele frequencies A=0.2, B=0.3, C=0.2, D=0.2, E=0.05, F=0.05. The expected (ordered) genotype proportions are the pairwise products of the allele frequencies:

         A=0.2   B=0.3   C=0.2   D=0.2   E=0.05   F=0.05
A=0.2    0.04    0.06    0.04    0.04    0.01     0.01
B=0.3    0.06    0.09    0.06    0.06    0.015    0.015
C=0.2    0.04    0.06    0.04    0.04    0.01     0.01
D=0.2    0.04    0.06    0.04    0.04    0.01     0.01
E=0.05   0.01    0.015   0.01    0.01    0.0025   0.0025
F=0.05   0.01    0.015   0.01    0.01    0.0025   0.0025
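The same table can be produced programmatically; a small sketch (the frequency vector is the one from the example above):

import numpy as np

# Allele frequencies for A..F from the example above.
freqs = np.array([0.2, 0.3, 0.2, 0.2, 0.05, 0.05])
labels = list("ABCDEF")

# Pairwise products p_i * p_j; an unordered heterozygote [i,j] takes the
# sum of the (i,j) and (j,i) cells, i.e. 2 * p_i * p_j.
pairwise = np.outer(freqs, freqs)

for label, row in zip(labels, pairwise):
    print(label, row.round(4))

# Sanity check: all ordered genotype proportions sum to 1.
assert np.isclose(pairwise.sum(), 1.0)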

These genotype proportions are only valid under the assumption of Hardy Weinberg equilibrium, i.e. random mating, a large population, no mutation, no selection and no migration. Imagine an island population that is in Hardy Weinberg equilibrium (with an effectively infinite number of dots/individuals), in which some individuals have the genotype [Y,Y], some [B,Y] and some [B,B].

Hardy Weinberg Island

This is the same population after 50 generations

Because the population is in HWE genotype frequencies remain constant and the population remains the same


Oh no! Someone from West HWI has started a feud with someone from East HWI, and the island is divided. The population sub-structure now means that there is no more random mating (which violates one of the requirements for HWE).


After one generation there would be no difference, so let's speed it up a little and power through a few more generations.

Move forward by 1 generation


Frequency [Y,Y] = 0.25, so the report says: "The chance of seeing this genotype in another unrelated individual is 1 in 4".



But every individual on West HW Island is [Y,Y], and so the chance of seeing this profile again in another West HW Island individual is 1 in 1. By assuming the population is in HWE, the strength of the evidence has been grossly overstated.



i.e. allele frequencies cannot be used to accurately determine genotype frequencies any longer

The imperfect world

Reality Island

Ideally we would recognise the fact that the population is not in HWE and we would calculate the statistic using a database of only West HW Island individuals. Reality differs from the ideal model in a number of ways. We will pick up the population from where we left off.

Old boundaries forgotten – there may be some mating between populations at the border


People may migrate over the border and modern technology allows travel from one side of the island to the other


Rare mutations will occur, which will make some individuals very different from the general population


Selection comes into play when disease hits the island and it turns out one of the genotypes is susceptible


Migrations from far off lands will introduce exotic genotypes into the island


Now the picture is much more like reality. Clear boundaries do not exist but general trends can be seen


Population databases

In reality we aren’t able to obtain the genotypes for every individual in the population. Practically we can only obtain samples from a small percentage of the population, and then use that to draw inferences about the whole population. The subsets of individuals are called population databases, and they are used to determine allele frequencies.

16 x [Y,Y], 17 x [B,B], 3 x [B,Y], 1 x [B,D] – 37 individuals in total

Population Databases – Allele frequencies

32 [Y] alleles, 34 [B] alleles, 3 [B] alleles and 3 [Y] alleles, 1 [B] allele and 1 [D] allele – 74 alleles in total.

Allele frequencies can be calculated by: frequency = (# of that allele) / (total # alleles), e.g. for [Y]: pY = (32 + 3)/74 = 0.473, or 47.3%. Similarly pB = 51.4% and pD = 1.3%.
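The same counting exercise can be sketched in Python (the genotype tallies are those of the database above):

from collections import Counter

# 16 x [Y,Y], 17 x [B,B], 3 x [B,Y], 1 x [B,D] = 37 individuals, 74 alleles.
database = [("Y", "Y")] * 16 + [("B", "B")] * 17 + [("B", "Y")] * 3 + [("B", "D")]

allele_counts = Counter(a for genotype in database for a in genotype)
total_alleles = sum(allele_counts.values())  # two alleles per individual -> 74

for allele, count in sorted(allele_counts.items()):
    print(f"p{allele} = {count}/{total_alleles} = {count / total_alleles:.3f}")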

There is an obvious problem with this database: it spans two populations and so contains substructure. But how could you tell that just from the data that was collected? We will come back to this a bit later.


A better approach would be, once it is recognised that there is some structure in the population, to sample the two populations separately e.g.

Population Database 1

Population Database 2


These databases still aren't in HWE, but will give closer estimates of genotype frequencies from the allele frequencies. They may or may not show departures from HWE when tested using Fisher's Exact test, which is quite weak in its ability to detect these departures.


We expect human populations to depart from HWE (even if we don't detect any departures). Why is this? Because we violate the assumptions required for HWE, i.e.:

  • We don’t randomly mate
  • We immigrate and emigrate
  • Mutations occur
  • There may be some selective pressures
  • Our population size isn’t infinite

So if we can't use the HWE formulae (p^2 and 2pq) to determine genotype frequencies, then what can we do? We need some way to take into account the fact that people in a finitely sized population are distantly related and hence… inbred.

But very distantly. This means that some alleles are more common (and, by the inverse, others rarer) than under HWE estimates.

Inbreeding

Coancestry

Substructure

DNA profiles from current forensic profiling kits can have frequencies in the order of 1 in 10^22.

This means that each DNA profile is incredibly rare. However, given that we have seen a profile already, it is more likely that we will see that profile again.

This is because populations are not infinitely large and the choice of mate is not random (violations of the HWE model)

Theta

Picture individuals grouped into families, within sub-populations, within the whole population; only the most recent generations would be alive.



Looking at the population that is alive at the present day, we would see a picture more like that shown below. Regardless of which individuals breed within a subpopulation, there is going to be a distant level of relatedness between them (theta).


With no breeding between sub-populations, an allele from anyone on the left has 0 probability of being IBD with an allele from anyone on the right.

This means that if we have a profile from someone on the left, it won't give us any information about how common the profile is on the right.


Person 1 [A,B] and Person 2 [C,D] found one sub-population, and Person 3 [A,B] and Person 4 [C,D] found the other, so that initially pA = 0.25, pB = 0.25, pC = 0.25, pD = 0.25 in both sub-populations.

Over the following generations, offspring genotypes such as [A,A], [A,C], [A,D], [A,B], [B,C], [B,D] and [B,B] arise in different proportions in the two sub-populations, until (in this illustration) one sub-population ends up with pA = 1, pB = 0, pC = 0, pD = 0 and the other with pA = 0, pB = 0.75, pC = 0.25, pD = 0.

Alleles randomly passed on to offspring cause allele frequencies between the two subpopulations to differ

  1. Common theta values used in forensic calculations: Caucasian θ = 0.01 (1%); Aboriginal θ = 0.03 (3%)
  2. Thetas that correspond to familial relationships: first cousins θ = 0.063 (6.3%); second cousins θ = 0.016 (1.6%). Notice that the thetas used in forensic settings are high compared to what you might expect (i.e. the theta of 1% for Caucasians suggests that we might be regularly inbreeding to a level close to second cousins)
  • We always try to concede as much doubt to the defendant as is reasonable – i.e. the higher the level of theta, the more potential inbreeding we are accounting for, and this will make the profile of interest more likely to be seen again in that population
  • Theta's use is two-fold. As well as being a co-ancestry coefficient it can also be used as a measure of the genetic distance between populations, so we can use a higher value to take into account the fact that the relevant population might differ from the one our allele frequency database was drawn from

Theta as a measure of distance between populations also explains why theta is higher in Aboriginal groups than for Caucasian groups.


θ1 is needed to describe the diversity within sub-population 1, θ2 the diversity within sub-population 2, and θ3 the diversity within the entire population.


Now we need to look at the population structure of Aboriginal and Caucasian Australians (illustrated by the Horton map of Aboriginal Australia).

θ1 is needed to describe the diversity within a tribe, θ2 the diversity within a traditional regional group, and θ3 the diversity within the entire state.


Caucasians are a lot more boring… they tend to be the same all over the world, with very little geographic substructure. This means that a smaller theta can be used to cover the genetic diversity within Caucasians.


Match Probability

We need some way of taking this inbreeding into account, and we do this with the use of θ, a co-ancestry coefficient (also called theta, or FST). θ was most notably incorporated into a matching statistic in a 1994 paper by Balding and Nichols (Forensic Science International 64: 125-140, 1994). Strictly speaking the definition of theta is "the proportion of times that two alleles randomly chosen in a population will be Identical By Descent (IBD)".

Matching statistics – Match Probability

Alleles are IBD when two alleles of the same designation have originated from a common ancestor, rather than arising independently by mutation. So we can now move away from profile frequencies and the Hardy Weinberg formulae (known as the product rule formulae) and on to a conditional probability, called a Match Probability, that incorporates θ. More about this a bit later.

The Match Probability does not make the assumption of HWE and so is a more appropriate matching statistic to use for human populations. It tells us the probability of seeing a profile a second time given that we have already seen that profile once, i.e. if the suspect is [A,A] (and we are assuming that the suspect is not the offender) then we want to know the probability of seeing [A,A] again (in the true offender). This is written as Pr(AA|AA), where the "|" means "given that we've seen".
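One commonly quoted single-locus form of these theta-corrected match probabilities (following the Balding and Nichols approach cited above) can be sketched as follows; this is an illustrative sketch, not necessarily the exact formulation used later in the lecture, and the frequencies passed in are examples only:

def match_prob_homozygote(p, theta):
    # Pr(AA | AA): probability the next person sampled is also [A,A],
    # given we have already seen one [A,A] individual.
    return ((2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)) / (
        (1 + theta) * (1 + 2 * theta))

def match_prob_heterozygote(p, q, theta):
    # Pr(AB | AB) under the same co-ancestry model.
    return (2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q)) / (
        (1 + theta) * (1 + 2 * theta))

# With theta = 0 the product rule (p^2 and 2pq) is recovered.
print(match_prob_homozygote(0.3, 0.0))         # 0.09
print(match_prob_heterozygote(0.3, 0.7, 0.0))  # 0.42
print(match_prob_homozygote(0.3, 0.01))        # > 0.09 once co-ancestry is included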


Imagine that the group of individuals we chose for the database were, by chance, different from those we actually chose.

The allele frequencies would then differ from the allele frequencies calculated using the last database (and both will be different from the true allele frequencies in the population).

Population databases – confidence intervals

Take allele 'Z' for example. We know the true value of pZ in the population. In our first database we did not observe 'Z', so our estimate was pZ = 0; in the database seen here the estimate is a small non-zero value.


Now imagine databases being randomly chosen from the population many times: the frequencies of the alleles will vary, but will be distributed around the true value in the whole population.

Using allele 'Z' as an example, we would expect a distribution of allele frequencies for 'Z' that centres around the true value in the population but deviates slightly to either side. This is called sampling variation.
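One way to see this sampling variation is to resample a database repeatedly (a bootstrap). A minimal sketch, assuming the small [Y]/[B]/[D] database from earlier and looking at how the estimate of the rare [D] allele's frequency varies:

import random

# Alleles pooled from the earlier example database: 35 x Y, 38 x B, 1 x D (74 total).
alleles = ["Y"] * 35 + ["B"] * 38 + ["D"]

def bootstrap_frequencies(allele, n_boot=10_000):
    # Resample the database with replacement and record the frequency estimate.
    estimates = []
    for _ in range(n_boot):
        sample = random.choices(alleles, k=len(alleles))
        estimates.append(sample.count(allele) / len(sample))
    return sorted(estimates)

est = bootstrap_frequencies("D")
# A crude 99% interval from the bootstrap distribution:
print(est[int(0.005 * len(est))], est[int(0.995 * len(est))])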


The distribution below shows the spread of allele frequencies that repeated sampling of databases has produced for allele 'Z' (frequency on the x-axis, from 0 to about 0.12; proportion of databases showing 'Z' at that frequency on the y-axis).


Our match probability calculation gives a single point estimate of Pr(BY|BY). But now take into account that sampling variation means the allele frequencies (and hence the match probability) will differ depending on how the database is chosen.


Using the distribution of allele frequencies (caused by sampling variation) we calculate a distribution for the match probability.


The match probability point estimate, Pr(BY|BY) = 0.495, is the apex of the graph. We choose a percentage of this curve to determine a confidence interval that takes sampling variation into account.


For example, a 99% confidence interval takes the inner 99% of the area under this curve. Typically we would report the upper end of the interval, i.e. 0.6, as that concedes doubt to the defendant. NOTE: using the allele frequencies of the whole population (pB = 0.452 and pY = 0.468) we would obtain Pr(BY|BY) = 0.42 (which is captured within the 99% confidence interval).

(Graph: the distribution of the match probability, with 99% of the area lying between 0.37 and 0.6 and small tail areas on either side.)


From the database shown we have the following: frequency [B,Y] = 0.486 (assuming HWE); Pr(BY|BY) = 0.495 (taking co-ancestry into account); 99% interval for Pr(BY|BY) = 0.37 to 0.6 (with sampling variation).

But the true rarity of [B,Y] (from a simple count of the green dots) is very different from any of our calculated values. No amount of mathematical adjustment is going to overcome a poorly constructed database (like the one seen here, which spans multiple populations). This is where database validation and Fisher's exact test become VERY important.


Validating databases

There are tests that we subject our population databases to prior to using them, to ensure they are fit for forensic use. There are many, many forensic papers that describe population databases for countries and groups all over the world, all of which will have had some analyses undertaken on them.

Multi-testing problem

For a p-value cut-off of 0.05 we would expect that, if no dependencies existed, 5% of comparisons would show a significant departure from equilibrium by chance alone. The more tests we do, the more significant p-values we will get, and this is the multi-testing problem. With 15 Identifiler loci there end up being 120 comparisons, so we would expect approximately 6 to have p-values < 0.05. This needs to be taken into account when assessing HWE/LE departures.
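To make that arithmetic explicit (assuming the 120 comparisons are the 15 within-locus HWE tests plus the 105 pairwise between-locus comparisons), a quick sketch:

from math import comb

n_loci = 15
pairwise_le_tests = comb(n_loci, 2)      # 105 between-locus comparisons
within_locus_hwe_tests = n_loci          # 15 within-locus HWE tests
total_tests = pairwise_le_tests + within_locus_hwe_tests  # 120

alpha = 0.05
expected_by_chance = total_tests * alpha  # ~6 significant results by chance alone
print(total_tests, expected_by_chance)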


Two ways of dealing with the multi-testing problem:

  • A graphical representation
  • A truncated product method

The graphical method is appealing and visually easy to understand. It is based on the idea that, for multiple Fisher's exact tests, the p-values (for a database with no dependencies) should be evenly spread over the range 0 to 1. This means that if all the p-values were ordered in ascending order, they should fall on a line with the equation y = x.

Graph the ordered p-values against the expected values to produce a p-p plot.

(The p-p plot shows the observed values plotted against the expected, the y = x line, and a 95% 'envelope' around it.)

Databases that do not contain dependencies will fall within the 95% envelope; databases that do contain dependencies will fall outside it. Below are p-p plots for two pan-Australian databases, one Aboriginal and one Caucasian.
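A minimal sketch of how such a p-p plot could be constructed (the p-values here are randomly generated placeholders, and the simulated 95% envelope is omitted):

import numpy as np
import matplotlib.pyplot as plt

# Placeholder p-values from a set of Fisher's exact tests (illustrative only).
p_values = np.sort(np.random.uniform(0, 1, size=120))

# Expected quantiles if there are no dependencies: evenly spread over 0 to 1.
expected = (np.arange(1, len(p_values) + 1) - 0.5) / len(p_values)

plt.plot(expected, p_values, "o", label="observed vs expected")
plt.plot([0, 1], [0, 1], label="y = x line")
plt.xlabel("expected p-value")
plt.ylabel("observed p-value")
plt.legend()
plt.show()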


The second method for assessing departures from HWE and LE is the truncated product method.

This method considers all the p-values together to see whether the results from the multiple tests, as a whole, show evidence of significance.


The truncated product method states that the sum of -2ln(p) values from 't' independent tests should have a chi-squared distribution with 2t degrees of freedom. If you are interested in reading about why this is, see: Zaykin, D., Zhivotovsky, L. A. and Weir, B. S. (2002) Truncated product method for combining p-values. Genetic Epidemiology 22: 170-185.


This method is most easily carried out in Excel. FGA shows a significant p-value, but overall the data show no significant disequilibrium.

Locus   p-value (HWE)   -2ln(p)
CSF     0.53            1.27
D12     0.10            4.61
D13     0.26            2.69
D16     0.51            1.35
D18     0.25            2.77
...
TPOX    0.35            2.10
vWA     1.00            0.00
FGA     0.01            9.21

Sum[-2ln(p)] = 50.66 over all loci, which is compared against a chi-squared distribution with 2t d.o.f., e.g. in Excel: =CHIDIST(50.66, 40)
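The same combination can be sketched in Python; this uses only the loci visible in the table above (so the total differs from 50.66) and the simple untruncated combination of all p-values:

import math
from scipy.stats import chi2

# Per-locus HWE p-values from the table above (subset of loci only).
p_values = [0.53, 0.10, 0.26, 0.51, 0.25, 0.35, 1.00, 0.01]

statistic = sum(-2 * math.log(p) for p in p_values)
dof = 2 * len(p_values)               # 2t degrees of freedom
combined_p = chi2.sf(statistic, dof)  # equivalent to Excel's CHIDIST(statistic, dof)

print(f"sum(-2 ln p) = {statistic:.2f}, d.o.f. = {dof}, combined p = {combined_p:.3f}")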

Delving further into Fisher

You can delve a little further into dependencies by looking at the observed numbers of genotypes against the expected numbers at each locus (based on allele frequencies), e.g. genotype [8,11] or genotype [9,14]. Counts of alleles are shown in the right-hand column and bottom row of the table; combined with the total, these can be used to generate expected genotype frequencies.


If in equilibrium, these values should adhere to a chi-squared distribution with 1 degree of freedom, so at the 95% level the 1 d.o.f. chi-squared critical value is 3.84. This means that if we calculate, for each genotype, the value (Observed - Expected)^2 / Expected, any values > 3.84 are a significant departure.
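A small sketch of that per-genotype check (the observed and expected counts are made-up placeholders):

# Illustrative observed and expected genotype counts for one locus.
observed = {"[8,11]": 16, "[9,14]": 3}
expected = {"[8,11]": 6.0, "[9,14]": 4.5}

CRITICAL_VALUE = 3.84  # chi-squared critical value, 1 d.o.f., 95% level

for genotype in observed:
    o, e = observed[genotype], expected[genotype]
    statistic = (o - e) ** 2 / e
    flag = "significant departure" if statistic > CRITICAL_VALUE else "ok"
    print(f"{genotype}: (O-E)^2/E = {statistic:.2f} -> {flag}")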


Doing this can give you some further information than the p-values alone would. It can tell you whether disequilibrium is being caused by a few rare genotypes. Also, if significant values lie on the diagonal (indicating an excess of homozygotes), this can indicate that the database has substructure.

This excess homozygosity arising from grouped populations is known as the Wahlund effect.

A quick word about Linkage

Gregor Johann Mendel (1822–1884), Austrian Augustinian monk and scientist, studied the inheritance of certain traits in pea plants.

Linkage

The law of segregation – each individual has two 'factors' controlling a given characteristic, one being a copy of a corresponding factor in the father of the individual and one being a copy of the corresponding factor in the mother. Further, a randomly selected one of the two factors is copied to each child, independently for different children and independently of the factor contributed by the spouse.

Linked loci will take longer to reach a level of equilibrium than unlinked loci.

The amount of linkage disequilibrium (D) in a population, 'n' generations after an evolutionary event, can be determined by:

Dn = (1 - R)^n x D0

where D0 is the level of linkage disequilibrium caused by the evolutionary event and R is the recombination fraction.


The graph below shows the re-equilibration from an initial level of linkage disequilibrium D0 = 0.1 under various levels of recombination (R).
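A minimal sketch that reproduces this kind of decay curve (the recombination fractions chosen here are just examples):

import numpy as np
import matplotlib.pyplot as plt

D0 = 0.1                          # initial linkage disequilibrium
generations = np.arange(0, 51)

# R = 0 is completely linked; R = 0.5 is completely unlinked.
for R in (0.0, 0.05, 0.1, 0.25, 0.5):
    Dn = (1 - R) ** generations * D0   # Dn = (1 - R)^n x D0
    plt.plot(generations, Dn, label=f"R = {R}")

plt.xlabel("generations (n)")
plt.ylabel("linkage disequilibrium (Dn)")
plt.legend()
plt.show()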


Completely linked loci (i.e. those where recombination never occurs) will never be able to recover from the evolutionary event

Completely unlinked loci halve the amount of linkage disequilibrium each generation, but will still show some linkage disequilibrium. This is the reason that linkage equilibrium has the additional required assumption (beyond the HWE assumptions) that an infinite number of generations has elapsed since any disturbing force.
