











Typology: Lecture notes
When a DNA profile is obtained from a crime scene and there is a reference DNA profile to compare it with, we need some way to assess the significance of the similar or different allelic information: a statistical weighting.
The way that this statistical weighting is generated has a lot of population genetics and statistics behind it.
I will just scratch the surface of some of the key ideas.
To apply a statistical weighting we must know the ‘rarity’ of the DNA profile we are examining.
This is governed by the ‘rarity’ of the components (alleles) that make up the reference DNA profile in the population of interest:
p^2 + 2pq + q^2 = 1
This works for an ‘ideal’ population, i.e. one that is in Hardy Weinberg equilibrium.
Genotype frequencies are constant between generations and all frequencies sum to 1
So for example, if we wanted to calculate a profile frequency for a homozygous locus [A,A] where the frequency of [A] is 0.3:
Profile frequency = p^2 = (0.3)^2 = 0.09
Or for a heterozygous locus [A,B] where the frequency of [A] is 0.3 and [B] is 0.7:
Profile frequency = 2pq = 2(0.3)(0.7) = 0.42
Why is there a 2 out the front here for heterozygotes?
In our population where p = 0.3 and q = 0.7 we would expect genotypes in the following proportions:

           A (0.3)                 B (0.7)
A (0.3)    AA (0.3^2) = 0.09       AB (0.3x0.7) = 0.21
B (0.7)    AB (0.7x0.3) = 0.21     BB (0.7^2) = 0.49
A heterozygote individual who contains an [A] and a [B] could actually be [A,B] or [B,A] = pq + qp = 2pq
And p^2 +2pq+q^2 =(0.3)^2 +2(0.3)(0.7)+(0.7)^2 = 0.09+0.42+0.49 = 1
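The two-allele arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `hwe_genotype_frequencies` is made up for illustration, and the values p = 0.3 and q = 0.7 are the ones used in the example.

```python
def hwe_genotype_frequencies(p, q):
    """Expected genotype frequencies at a two-allele locus, assuming HWE."""
    if abs(p + q - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return {
        "AA": p * p,      # homozygote: p^2
        "AB": 2 * p * q,  # heterozygote: [A,B] or [B,A], so pq + qp
        "BB": q * q,      # homozygote: q^2
    }


freqs = hwe_genotype_frequencies(0.3, 0.7)
for genotype, frequency in freqs.items():
    print(genotype, round(frequency, 4))
print("total", round(sum(freqs.values()), 4))
```

The total printed at the end is the check that all genotype frequencies sum to 1.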
This idea can be extended to multi-allele scenarios, e.g. consider a population with frequencies A=0.2, B=0.3, C=0.2, D=0.2, E=0.05, F=0.05. Each cell is the product of its row and column allele frequencies:

            A (0.2)   B (0.3)   C (0.2)   D (0.2)   E (0.05)   F (0.05)
A (0.2)     0.04      0.06      0.04      0.04      0.01       0.01
B (0.3)     0.06      0.09      0.06      0.06      0.015      0.015
C (0.2)     0.04      0.06      0.04      0.04      0.01       0.01
D (0.2)     0.04      0.06      0.04      0.04      0.01       0.01
E (0.05)    0.01      0.015     0.01      0.01      0.0025     0.0025
F (0.05)    0.01      0.015     0.01      0.01      0.0025     0.0025
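The same calculation for any number of alleles might be sketched as follows. The helper name `genotype_frequencies` is hypothetical; the allele frequencies are the six given for this example. Each heterozygote appears in two cells of the square table, so it is counted here as 2 x p_i x p_j.

```python
from itertools import combinations_with_replacement

allele_freqs = {"A": 0.2, "B": 0.3, "C": 0.2, "D": 0.2, "E": 0.05, "F": 0.05}


def genotype_frequencies(freqs):
    """Expected frequency of every unordered genotype, assuming HWE."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            expected[(a, b)] = freqs[a] ** 2          # homozygote
        else:
            expected[(a, b)] = 2 * freqs[a] * freqs[b]  # heterozygote
    return expected


expected = genotype_frequencies(allele_freqs)
for genotype, frequency in expected.items():
    print(genotype, round(frequency, 4))
print("total", round(sum(expected.values()), 4))
```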
These genotype proportions are only valid under the assumption of Hardy Weinberg equilibrium, i.e. random mating, an infinitely large population, no mutation, no selection and no migration.
This island population is in Hardy Weinberg equilibrium (imagine there are an infinite number of dots).
[Figure legend: dots are coloured by genotype: [Y,Y], [B,Y] or [B,B].]
Hardy Weinberg Island
This is the same population after 50 generations
Because the population is in HWE, genotype frequencies remain constant and the population remains the same.
Oh no! Someone from West HWI has started a feud with someone from East HWI. The island is divided. The population sub-structure now means that there is no more random mating (which violates one of the requirements for HWE).
After one generation there would be no difference, so let's speed it up a little and power through a few more generations.
Move forward by 1 generation
Frequency [Y,Y] = 0.25, so the report says “The chance of seeing this genotype in another unrelated individual is 1 in 4”.
But every individual on West HW Island is [Y,Y], and so the chance of seeing this profile again in another West HW Islander is 1 in 1. By assuming the population is in HWE the strength of the evidence has been grossly overstated.
i.e. allele frequencies cannot be used to accurately determine genotype frequencies any longer
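The size of the overstatement can be illustrated with a toy version of the divided island. The numbers here are assumptions made for the sketch: 500 individuals on each side, everyone in the West being [Y,Y] (as stated above) and everyone in the East being [B,B] (not stated in the notes, but it gives the pooled 1 in 4 figure).

```python
def allele_frequency(genotype_counts, allele):
    """Frequency of one allele in a set of genotype counts."""
    total_alleles = 2 * sum(genotype_counts.values())
    copies = sum(g.count(allele) * n for g, n in genotype_counts.items())
    return copies / total_alleles


# Hypothetical island: sizes and the East genotype are assumptions.
west = {("Y", "Y"): 500}
east = {("B", "B"): 500}
pooled = {**west, **east}

p_y = allele_frequency(pooled, "Y")
hwe_yy = p_y ** 2  # what the product rule reports for [Y,Y]
true_yy_west = west[("Y", "Y")] / sum(west.values())  # what is actually seen in the West

print("pooled pY:", p_y)
print("HWE estimate of [Y,Y]:", hwe_yy)
print("actual [Y,Y] frequency in the West:", true_yy_west)
```

The pooled allele frequency is perfectly accurate, yet the genotype frequency derived from it is wrong for either half of the island.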
Reality Island
Ideally we would recognise the fact that the population is not in HWE and we would calculate the statistic in a database of only West HW Island individuals. Reality differs from the ideal model in a number of ways. We will pick up the population from where we left off.
Old boundaries forgotten – there may be some mating between populations at the border
People may migrate over the border and modern technology allows travel from one side of the island to the other
Rare mutations will occur, which will make some individuals very different from the general population
Selection comes into play when disease hits the island and it turns out one of the genotypes is susceptible
Migrations from far off lands will introduce exotic genotypes into the island
Now the picture is much more like reality. Clear boundaries do not exist but general trends can be seen
In reality we aren’t able to obtain the genotypes for every individual in the population. Practically we can only obtain samples from a small percentage of the population, and then use that to draw inferences about the whole population. The subsets of individuals are called population databases, and they are used to determine allele frequencies.
Genotype    Count
[Y,Y]       16
[B,B]       17
[B,Y]       3
[B,D]       1
In total    37
Population Databases – Allele frequencies
32 [Y] alleles
34 [B] alleles
3 [B] alleles and 3 [Y] alleles
1 [B] allele and 1 [D] allele
74 alleles in total
Allele frequencies can be calculated by:
allele frequency = (# copies of the allele) / (total # alleles)
e.g. for [Y], pY = (32+3)/74 = 0.473, or 47.3%. Similarly pB = 51.4% and pD = 1.4%.
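The allele counting can be written as a short sketch. The genotype counts are those of the database above; the function name `allele_frequencies` is just an illustrative choice.

```python
from collections import Counter

# Genotype counts from the database above (37 individuals).
database = {("Y", "Y"): 16, ("B", "B"): 17, ("B", "Y"): 3, ("B", "D"): 1}


def allele_frequencies(genotype_counts):
    """Count every allele copy and divide by the total number of alleles."""
    allele_counts = Counter()
    for genotype, n in genotype_counts.items():
        for allele in genotype:
            allele_counts[allele] += n
    total = sum(allele_counts.values())
    return {a: c / total for a, c in allele_counts.items()}, total


freqs, total_alleles = allele_frequencies(database)
print("alleles in total:", total_alleles)
for allele, f in sorted(freqs.items()):
    print(allele, round(f, 3))
```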
There is an obvious problem with this database: it spans two populations and so contains substructure. But how could you tell just from the data that was collected? We will come back to this a bit later.
Population Databases
A better approach would be, once it is recognised that there is some structure in the population, to sample the two populations separately e.g.
Population Database 1
Population Database 2
These databases still aren’t in HWE, but will give closer estimates of genotype frequency from the allele frequencies. They may or may not show departures from HWE when tested using Fisher’s exact test, which is quite weak in its ability to detect these departures.
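We expect human populations to depart from HWE (even if we don’t detect any departure). Why is this? Because we violate the assumptions required for HWE, i.e. random mating, an infinitely large population, no mutation, no selection and no migration.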
So if we can’t use the HWE formulae (p^2 and 2pq) to determine genotype frequencies, then what can we do?
We need some way to take into account the fact that people in a finitely sized population are distantly related and hence… inbred. But very distantly.
This means that some alleles are more common (and by the inverse others would be rarer) than under HWE estimates.
DNA profiles from current forensic profiling kits can have frequencies in the order of 1 in 10^22.
This means that each DNA profile is incredibly rare. However, given that we have seen a profile already, it is more likely that we will see that profile again.
This is because populations are not infinitely large and the choice of mate is not random (violations of the HWE model)
Theta
[Figure: nested groupings labelled population, sub-population and family.]
These would be the only generations alive
Looking at the population that is alive at the present day, we would see a picture more like that shown below. Regardless of which individuals breed within a subpopulation, there is going to be a distant level of relatedness between them (theta).
An allele from anyone on the left has 0 probability of being identical by descent (IBD) to an allele from anyone on the right, because there is no breeding between sub-populations.
This means if we have a profile from someone on the left it won’t give us any information about how common the profile is on the right.
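[Figure: two separate family trees. Founders are Person1 [A,B] and Person2 [C,D] in one sub-population, and Person3 [A,B] and Person4 [C,D] in the other. Descendant genotypes shown include [A,C], [A,A], [A,D], [A,B], [B,C], [B,D], [C,B] and [B,B].]
At the start: pA=0.25, pB=0.25, pC=0.25, pD=0.25 for both sub-pops
After several generations: pA=1, pB=0, pC=0, pD=0 in one sub-pop; pA=0, pB=0.75, pC=0.25, pD=0 in the other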
Alleles randomly passed on to offspring cause allele frequencies between the two subpopulations to differ
Theta as a measure of distance between populations also explains why theta is higher in Aboriginal groups than for Caucasian groups.
[Figure: nested family, sub-population and population groupings, labelled θ1, θ2 and θ3.]
θ1 is needed to describe the diversity within subpopulation 1.
θ2 is needed to describe the diversity within subpopulation 2.
θ3 is needed to describe the diversity within the entire population.
Now we need to look at the population structure of Aboriginal and Caucasian Australians.
[Figure: Horton map, labelled θ1 - Tribe, θ2 - Region, θ3 - State.]
θ1 is needed to describe the diversity within a tribe.
θ2 is needed to describe the diversity within a traditional regional group.
θ3 is needed to describe the diversity within the entire state.
Caucasians are a lot more boring… They tend to be the same all over the world, with very little geographic substructure. This means that a smaller theta can be used to cover the genetic diversity within Caucasians.
[Figure label: θ1 - World]
We need some way of taking this inbreeding into account, and we do this with the use of θ, a co-ancestry coefficient (also called theta, or FST).
θ was most notably incorporated into a matching statistic in a 1994 paper by Balding and Nichols (Forensic Science International 64:125-140, 1994).
Strictly speaking the definition of theta is “The proportion of times that two alleles randomly chosen in a population will be Identical By Descent (IBD)”.
Matching statistics – Match Probability
IBD alleles are when two alleles of the same designation have originated from a common ancestor, rather than by mutation.
So we can now move away from profile frequencies and the Hardy Weinberg formulae (known as the product rule formulae) and onto a conditional probability called a Match Probability that incorporates θ. More about this a bit later.
The Match Probability does not make the assumption of HWE and so is a more appropriate matching statistic to use for human populations. It tells us the probability of seeing a profile a second time given that we have already seen that profile once, i.e. if the suspect is [A,A] (and we are assuming that the suspect is not the offender) then we want to know the probability of seeing [A,A] again (in the true offender). This is written as Pr(AA|AA), where the “|” means ‘given that we’ve seen’.
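The notes do not write out the match probability formulae at this point. As a hedged sketch, the code below uses the forms commonly attributed to Balding and Nichols (1994) for a homozygote and a heterozygote. Whether the lecture uses exactly these forms is an assumption, and the function names and the value θ = 0.03 are illustrative only.

```python
def match_probability_homozygote(p, theta):
    """Pr(AA|AA) in the commonly cited Balding-Nichols form (assumed here)."""
    numerator = (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)
    denominator = (1 + theta) * (1 + 2 * theta)
    return numerator / denominator


def match_probability_heterozygote(p_a, p_b, theta):
    """Pr(AB|AB) in the commonly cited Balding-Nichols form (assumed here)."""
    numerator = 2 * (theta + (1 - theta) * p_a) * (theta + (1 - theta) * p_b)
    denominator = (1 + theta) * (1 + 2 * theta)
    return numerator / denominator


# With theta = 0 the formulae collapse to the product rule (p^2 and 2pq).
print(match_probability_homozygote(0.3, 0.0))
print(match_probability_heterozygote(0.3, 0.7, 0.0))

# For a rare allele, a positive theta (illustrative value) raises the match probability.
print(match_probability_homozygote(0.05, 0.03), "vs", 0.05 ** 2)
```

Setting θ to 0 is a useful sanity check, because it recovers the Hardy Weinberg profile frequencies from earlier in the notes.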
Imagine that the group of individuals we chose for the database were, by chance, different from those we actually chose.
The allele frequencies will differ from the allele frequencies calculated using the last database (and both will be different from the true allele frequencies in the population).
Population databases – confidence intervals
Take Allele ‘Z’ for example.
We know that in the population pZ = 0.
In our first database we did not observe ‘Z’ so: pZ = 0
In the database seen here: pZ = 0.
Now imagine databases being randomly chosen from the population many times: the frequencies of alleles will vary but will be distributed around the true value in the whole population.
Using allele ‘Z’ as an example, we would expect a distribution of allele frequencies for ‘Z’ that would centre around 0. (the true value in the population) but will deviate slightly either side. This is called sampling variation.
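Sampling variation is easy to see in a small simulation. Everything numeric here is an assumption made for illustration: the true frequency of ‘Z’ is set to a hypothetical 0.05, each database has 37 individuals, and 5000 databases are drawn.

```python
import random


def sample_allele_frequency(true_freq, n_individuals, rng):
    """Draw one database and return the observed frequency of the allele."""
    n_alleles = 2 * n_individuals
    copies = sum(1 for _ in range(n_alleles) if rng.random() < true_freq)
    return copies / n_alleles


rng = random.Random(1)  # fixed seed so the sketch is repeatable
TRUE_PZ = 0.05          # hypothetical true population frequency of 'Z'
DATABASE_SIZE = 37      # hypothetical number of individuals per database

estimates = [
    sample_allele_frequency(TRUE_PZ, DATABASE_SIZE, rng) for _ in range(5000)
]
mean_estimate = sum(estimates) / len(estimates)
share_missing = estimates.count(0.0) / len(estimates)

print("mean of the estimates:", round(mean_estimate, 4))
print("smallest and largest estimate:", min(estimates), max(estimates))
print("share of databases that never saw 'Z':", round(share_missing, 4))
```

The estimates centre on the true value, but individual databases can miss a rare allele entirely, as in the first database above.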
The distribution below shows the spread of allele frequencies that repeated sampling of databases has produced for allele ‘Z’.
[Figure: histogram for allele ‘Z’. x-axis: frequency, with ticks from 0 in steps of 0.02; y-axis: proportion of databases that showed ‘Z’ at that frequency.]
[Figure: the same histogram of allele ‘Z’ frequencies, annotated with the value 0.016.]
Our match probability calculation: Pr(BY|BY)=0.
But now take into account that sampling variation means that allele frequencies will differ depending on how the database is chosen.
Using the distribution of allele frequencies (caused by sampling variation) we calculate a distribution for the match probability.
The match probability point estimate: Pr(BY|BY)=0. which is the apex of the graph. We choose a percentage of this curve to determine the confidence interval that takes sampling variation into account.
[Figure: distribution of the match probability; x-axis: match probability, from 0 to 1.]
For example, a 99% confidence interval takes the inner 99% of the area under this curve.
Typically we report the upper bound of the MP, i.e. 0.6, as that is conceding doubt to the defendant.
NOTE: Using the allele frequencies of the whole population (PB=0.452 & PY=0.468) we would obtain Pr(BY|BY)=0.42 (which is captured within the 99% confidence interval).
[Figure: distribution of the match probability (x-axis from 0 to 1), with the central 99% of the area marked from a lower bound of 0.37 and the remaining area split between the two tails.]
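One way to turn sampling variation into an interval is to resample the database itself and take the inner 99% of the resulting values. The sketch below does this for the simple 2pq estimate of [B,Y] using the database counts from earlier. The resampling scheme (a bootstrap), the helper names and the seed are all assumptions, and it is not meant to reproduce the 0.37 to 0.6 interval quoted in these notes, which comes from the lecture's own method.

```python
import random

# Database from earlier in the notes: 16 [Y,Y], 17 [B,B], 3 [B,Y], 1 [B,D].
DATABASE = (
    [("Y", "Y")] * 16 + [("B", "B")] * 17 + [("B", "Y")] * 3 + [("B", "D")] * 1
)


def het_frequency(genotypes, a, b):
    """2 * pA * pB, with allele frequencies counted from the genotypes."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    p_a = alleles.count(a) / len(alleles)
    p_b = alleles.count(b) / len(alleles)
    return 2 * p_a * p_b


def percentile_interval(values, level):
    """Inner `level` share of the sorted values (simple index-based cut)."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low_index = int(tail * (len(ordered) - 1))
    high_index = int(round((1 - tail) * (len(ordered) - 1)))
    return ordered[low_index], ordered[high_index]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
point = het_frequency(DATABASE, "B", "Y")
resampled = [
    het_frequency([rng.choice(DATABASE) for _ in DATABASE], "B", "Y")
    for _ in range(2000)
]
low, high = percentile_interval(resampled, 0.99)

print("point estimate:", round(point, 3))
print("99% interval:", round(low, 3), "to", round(high, 3))
```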
From the database shown we have the following:
frequency [B,Y] = 0.486 (assuming HWE)
Pr(BY|BY) = 0.495 (taking co-ancestry into account)
99% Pr(BY|BY) = 0.37 to 0.6 (with sampling variation)
But the true rarity of [B,Y] (from a simple count of the green dots) is 0. Which is very different from any of our calculated values.
No amount of mathematical adjustment is going to overcome a poorly constructed database (like the one seen here, which spans multiple populations).
This is where database validation and Fisher’s exact test become VERY important.
Validating databases
There are tests that we subject our population databases to prior to using them. This is to ensure they are fit for forensic use. There are many many many forensic papers that describe population databases for countries and groups all over the world, all of which will have had some analyses undertaken on them.
Multi-testing problem
For a p-value cut-off of 0.05 we would expect that, if no dependencies existed, 5% of comparisons would show a significant departure from equilibrium by chance alone. The more tests we do, the more significant p-values we will get, and this is the multi-testing problem.
With 15 Identifiler loci there end up being 120 comparisons, so we would expect approximately 6 to have p-values < 0.05. This needs to be taken into account when assessing HWE/LE departures.
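The figure of 120 comparisons is not broken down in the notes. One reading that gives exactly 120, assumed in the sketch below, is one HWE test per locus plus one LE test for every pair of loci.

```python
from math import comb


def expected_false_positives(n_loci, alpha):
    """Number of tests and the number expected to be significant by chance."""
    hwe_tests = n_loci           # assumption: one HWE test per locus
    le_tests = comb(n_loci, 2)   # assumption: one LE test per pair of loci
    total_tests = hwe_tests + le_tests
    return total_tests, total_tests * alpha


total_tests, expected = expected_false_positives(15, 0.05)
print("tests:", total_tests)
print("expected significant by chance:", expected)
```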
Two ways of dealing with the multi-testing problem:
The first is to graph the ordered p-values against the expected values to produce a p-p plot.
[Figure: p-p plot of observed values vs expected, with the y=x line and a 95% ‘envelope’.]
Databases that do not contain dependencies will fall within the 95% envelope; databases that do contain dependencies will fall outside.
Below are two pan-Australian databases, one Aboriginal and one Caucasian.
The second method for assessing departures from HWE and LE is the truncated product method.
This method considers all the p-values together to see whether the results from the multiple tests, as a whole, show evidence for significance.
The truncated product method states that the sum of -2ln(p-values) from ‘t’ independent tests should have a chi-squared distribution with 2t degrees of freedom.
If you are interested in reading about why this is then read the paper: Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H. and Weir, B. S. (2002) Truncated product method for combining P-values. Genetic Epidemiology 22: 170-185.
This method is most easily carried out in Excel.

Locus    p-value (HWE)    -2ln(p)
CSF      0.53             1.
D12      0.10             4.
D13      0.26             2.
D16      0.51             1.
D18      0.25             2.
FGA      0.01             9.
TPOX     0.35             2.13
vWA      1.00             0.
Sum[-2ln(p)] = 50.66 with 2t d.o.f.; =Chidist(50.66, 40) gives p = 0.12

FGA is showing a significant p-value, but overall the data has no significant disequilibrium.
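The spreadsheet step can also be done without Excel. The sketch below sums -2ln(p) over the p-values and evaluates the upper tail of the chi-squared distribution, using the closed form that exists when the degrees of freedom are even (always the case with 2t). The function names are illustrative, and the statistic 50.66 with 40 degrees of freedom is taken from the table above.

```python
from math import exp, log


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable with even dof.

    For dof = 2t this equals exp(-x/2) * sum_{k<t} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return exp(-half) * total


def combine_p_values(p_values):
    """Sum of -2ln(p) and its p-value on 2t degrees of freedom.

    Every p-value must be greater than 0, otherwise log() fails.
    """
    statistic = sum(-2 * log(p) for p in p_values)
    return statistic, chi2_sf_even_dof(statistic, 2 * len(p_values))


# Same inputs as the spreadsheet call =Chidist(50.66, 40).
print(round(chi2_sf_even_dof(50.66, 40), 3))
```

With a single p-value the combined result is just that p-value again, which is a quick way to check the implementation.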
You can delve a little further into dependencies by looking at the observed numbers of genotypes against the expected numbers at each locus (based on allele frequencies).
[Figure: table of genotype counts at one locus, with genotypes [8,11] and [9,14] labelled.]
Counts of alleles are shown in the right hand column and bottom row. Combined with the total these can be used to generate expected genotype frequencies.
If in equilibrium these values should adhere to a chi-squared distribution with 1 degree of freedom, so for a 95% confidence interval the 1 d.o.f. chi-squared critical value is 3.84.
This means that if we calculate the value:
(Observed - Expected)^2 / Expected
any values > 3.84 are a significant departure.
Doing this can give you some further information than the p-values alone would.
It can tell you that disequilibrium is being caused by a few rare genotypes.
Also, if significant values are on the diagonal (indicating an excess of homozygotes) then this can indicate the database has substructure.
This excess homozygosity from grouped populations is known as the Wahlund effect.
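As a sketch of this per-genotype check, the code below applies it to the small [Y]/[B]/[D] database from earlier in the notes, which was said to span two populations. The function name and the choice of that database as the worked example are assumptions made here for illustration.

```python
CRITICAL_1DOF_95 = 3.84  # chi-squared critical value quoted above

# Database from earlier in the notes (37 individuals).
observed = {("Y", "Y"): 16, ("B", "B"): 17, ("B", "Y"): 3, ("B", "D"): 1}


def genotype_departures(observed_counts):
    """(observed, expected, (O-E)^2/E) for every possible genotype."""
    n = sum(observed_counts.values())
    allele_counts = {}
    for (a, b), count in observed_counts.items():
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    results = {}
    alleles = sorted(freqs)
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            if a == b:
                expected = n * freqs[a] ** 2
                obs = observed_counts.get((a, a), 0)
            else:
                expected = n * 2 * freqs[a] * freqs[b]
                obs = observed_counts.get((a, b), 0) + observed_counts.get((b, a), 0)
            results[(a, b)] = (obs, expected, (obs - expected) ** 2 / expected)
    return results


results = genotype_departures(observed)
flagged = {g for g, (_, _, value) in results.items() if value > CRITICAL_1DOF_95}
for genotype, (obs, expected, value) in results.items():
    print(genotype, obs, round(expected, 2), round(value, 2))
print("significant:", sorted(flagged))
```

In this example both homozygotes are in excess and the [B,Y] heterozygote is deficient, which is the pattern described above for a database with substructure.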
Gregor Johann Mendel (1822 – 1884)
Austrian Augustinian monk and scientist
Studied inheritance of certain traits in pea plants
The law of segregation – Each individual has two ‘factors’ controlling a given characteristic, one being a copy of a corresponding factor in the father of the individual and one being a copy of the corresponding factor in the mother of the individual. Further, a copy of a randomly selected one of the two factors is passed to each child, independently for different children and independently of the factor contributed by the spouse.
Linked loci will take longer to reach a level of equilibrium than unlinked loci.
The amount of linkage disequilibrium (D) in a population ‘n’ generations after an evolutionary event can be determined by:
D_n = (1 - R)^n D_0
where D_0 = the level of linkage disequilibrium caused by the evolutionary event and R = the level of recombination between the loci.
Linkage
The graph below shows the re-equilibration from an initial level of linkage disequilibrium, D_0 = 0.1, for various levels of recombination (R).
Completely linked loci (i.e. those where recombination never occurs) will never be able to recover from the evolutionary event
Completely unlinked loci halve the amount of linkage disequilibrium each generation, but will still show some linkage disequilibrium.
This is the reason that linkage equilibrium has the additional required assumption (beyond HWE assumptions) that an infinite number of generations has elapsed since any disturbing force.
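The decay of linkage disequilibrium can be tabulated with a short sketch. It uses the initial value 0.1 from the graph described above. The particular recombination levels and generation counts are illustrative choices, with R = 0 standing for completely linked loci and R = 0.5 for completely unlinked loci.

```python
def linkage_disequilibrium(d0, r, n):
    """D_n = (1 - R)^n * D_0: disequilibrium left after n generations."""
    return (1 - r) ** n * d0


D0 = 0.1  # initial disequilibrium, as in the graph described above
for r in (0.0, 0.01, 0.1, 0.5):  # illustrative recombination levels
    row = [round(linkage_disequilibrium(D0, r, n), 5) for n in (0, 1, 5, 10, 50)]
    print("R =", r, row)
```

With R = 0 the value never changes, and with R = 0.5 it halves every generation but only reaches zero in the limit.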