Mathematics of Forensic DNA Identification World Trade Center Project Extracting Information from Kinships and Limited Profiles Jonathan Hoyle Gene Codes Corporation 2/17/03 Introduction - 2,795 people were killed in the World Trade Center attacks on September 11, 2001. - 20,000 remains were recovered, the vast majority of which would require DNA matching for identification. - Existing software tools for DNA identification proved wholly inadequate for the scope and magnitude of this project. Timeline - September 17: Armed Forces DNA Identification Lab [AFDIL] asks Gene Codes to update Sequencher(tm) for the Pentagon and Shanksville crashes. - September 28: Office of the Chief Medical Examiner [OCME] in New York City contacts us for new software. - October 15: Using the Extreme Programming [XP] methodology, software development is underway. - December 13: M-FISys (Mass-Fatality Identification System) has its first release to the OCME. - Since: Weekly releases personally delivered to the OCME, to accommodate rapidly changing requirements. Identification Technologies - Technologies used for Identification -- STR -- mtDNA -- SNP - Methods used: -- Direct Match to a Personal Effect -- Kinship Analysis STR: Short Tandem Repeats - A repeat of a short sequence of bases (4 or 5) - For example, at locus position D7S280, it is the four base sequence gata we look for: ...gatagatagatagatagatagatagatgtttatctc... - In the above example, gata is repeated 6 times with a 3-base partial repeat. - "6.3" is therefore assigned for this allele. - Being diploid, we have two alleles per locus, thus (up to) two values are stored, e.g. 6.3/8. STR Frequency - In 1997, the FBI standardized on 13 STR loci used in the national database, CoDIS. - Frequency data for each locus/allele value is available for various races. For example: Locus: D16S539 TPOX D3S1358 FGA D7S820 vWA D13S317 TH01 Allele: 11/13 8 15.2 21/13.2 10/11 15.2 11/12 9.3 Freq: 8.55% 39.4% 0.099% 0.796% 14.6% 0.099% 18.2% 9.21% - Since STR loci are independent, these frequencies can be multiplied: 5.6 x 10^-13 - Likelihood = 1 / Frequency = 1.8 x 10^12 STR Profiles - M-FISys STR profile contains 16 elements: -- Amelogenin (Gender) -- 13 CoDIS Core Loci -- 2 PowerPlex Loci: - Penta D - Penta E - Minimum Likelihood: 7.6 x 10^15 STR Likelihood Threshold - OCME wants a minimum likelihood for identification which ensures a chance of a mismatch to be less than 1 in a million. - Assuming a population of 5000, what is the smallest n such that a 10n min likelihood yields a mismatch prob < 10^-6 ? - Since likelihood is the inverse of probability, p = 1 / 10^n - The probability of no mismatch is q = 1 - p = 1 - 1 / 10^n - The prob. of no mismatch in 5000 = 1 - q^5000 = 1-(1-1/10^n)^5000 - Thus we have the inequality: 1 - (1 - 1 / 10^n )^5000 < 1 / 1,000,000 - Solving for n we get: n > -log10(1 - (1 - 10^-6 )^(1/5000)) = 9.699 - Therefore we set n = 10. Direct STR Identification - A victim remain (called a disaster sample) can be identified by direct match if its profile is either: -- complete and matches Personal Effects (2 modalities) -- partial with no mismatches, with a likelihood >= 10^10 amongst common loci - A sample was further investigated if its STR profile likelihood >= 10^10 and with either: -- a single mismatch only, supported by Kinship -- mismatches due only to allelic dropout Partial Profiles - All STR profiles containing at least 11 CoDIS markers or more will have likelihoods ³ 10^10 - 70% of the victim samples yielded partial profiles (missing at least one CoDIS marker) - 25% of these partial profiles had likelihood values >= 10^10 - Leaving half of victim samples which cannot be identified through STR means alone (using these parameters). STR Likelihood: Locus Probability - Likelihood = 1 / Probability Frequency - OCME has locus-allele frequency data - Locus Probability can be first approximated by ignoring population structure and using the Hardy-Weinberg proportions: p^2 for homozygous alleles: p = frequency of allele 2pq for heterozygous alleles: p,q = frequency of each allele - Above assumes an infinite population with random mating STR Likelihood: F - Because the population is finite, we introduce the inbreeding coefficient F - Factoring this into the H-W equations: p2 + p(1-p)F for homozygous alleles 2pq(1-F) for heterozygous alleles - Because F is very small, 1-F is close to 1, we round it to remain conservative: p2 + p(1-p)F for homozygous alleles 2pq for heterozygous alleles - OCME chooses the standard F = 0.03 STR Likelihood: Profiles - Once we have calculated the probability frequency for each locus, we can calculate the likelihood of the entire profile: - If Pk (Ak) is the probability of allele A at locus k, we can define the likelihood of STR profile S as: L(S) = Product(k in Alleles)[1 / Pk (Ak)] - Note that this works even for partial profiles STR Likelihood: Race - OCME has frequency values for four population groups: Asian, Black, Caucasian & Hispanic - Cannot always rely on reported race, and the race is unknown for a disaster sample - M-FISys computes the Likelihood value across all four races and chooses the lowest value, just to be on the safer, more conservative side. STR: Kinship Analysis - Many times there was not sufficient data to perform an STR direct match. - Cheek swabs from family members of missing persons are taken, and a pedigree tree in M-FISys can be generated. - Likelihoods are calculated on victim samples to determine to which pedigree(s) they belong. - Kinship Analysis was not performed if more than one relative was in the victim list. Kinship Analysis: Likelihood - As with direct STR, Kinship Likelihood is: -- the product of Locus Likelihoods over common loci -- the Likelihood Ratio >= 10^6 -- calculated across all four races, using the lowest, most conservative value -- uses frequency data from the OCME - Analysis was performed for these relations: Parent-Child, Full Sibling, Half Sibling Kinship Algorithm - M-FISys uses the Kinship algorithm as implemented by Dr. George Carmody of Carleton University - Kinship Locus Likelihood defined as: k = r2x2 + r1x1 + r0x0 - where the ri's are relationship proportions: Parent-Child: r2 = 0 r1 = 1 r0 = 0 Full Sibling: r2 = 1/4 r1 = 1/2 r0 = 1/4 Half Sibling: r2 = 1/2 r1 = 1/2 r0 = 0 First Cousin: r2 = 3/4 r1 = 1/4 r0 = 0 - and with p & q the frequencies of the high & low alleles resp., the xi's are defined as: X2 = p^2 if victim is homozygous and matches an allele = 2pq otherwise X1 = 0 if relative & victim share no common allele = p if relative homozygous & shares low allele = q if relative homozygous & shares high allele = p/2 if relative heterozygous & shares low allele = q/2 if relative heterozygous & shares high allele = (p+q)/2 if relative & victim are identical X0 = 1 if relative & victim alleles are identical = 0 otherwise Mitochondrial DNA Analysis - Some victim samples were so degraded that sufficient STR data was not available for either direct STR match or Kinship analysis. - mtDNA is hardier material, surviving under conditions which nuclear DNA degrades - mtDNA is a 16,569-based circular genome. - It is maternally inherited, and thus not unique. - 5% of the Caucasian population share the same common mitotype. mtDNA Analysis - Mito-typing involves direct sequencing of two highly variable regions of mtDNA. - The two areas used for mitotyping (HV1 & HV2) are not in a coding region. - Only a sample's differences from the Anderson Sequence (an internationally accepted standard) need be tracked. - However, 25% of the WTC victims had no maternally-related kin samples. Mito Likelihood - To determine likelihood for a given mitotype, we begin by counting its frequency x in the FBI mtCoDIS data of size n. - The 95% confidence interval for a population proportion with Binomial distribution is estimated by the formula: [ m - 1.96s/sqrt(n), m + 1.96s/sqrt(n) ] where m is the mean and s is the standard deviation. - Since the probability p is just the number database hits, we set p = x/n, and so we have m = p and s = Ãp(1-p) . - Thus we have as the upper bound: x/n + sqrt(x(n-x))/n . - If there are no database entries, we use: 1 - a^(1/n) with a = 0.05 - Likelihood = 1 / Frequency Introduction to SNP's - Single Nucleotide Polymorphisms - Represents single base differences - Work pioneered by the GeneScreen division of Orchid Biosciences - By being able to collect data from very short sequences, this technology offers a great deal of hope for the identification of badly degraded samples SNP Selection - SNP's occur on average every 100-300 bases within the human genome. - 2 out of every 3 SNP's involve replacing a C with a T. - Of these, there is a panel of 70 which are chosen, specifically those in which C and T are equally likely. SNP Likelihood - A complete profile of 70 SNP's each with an independent probability of 1/2 would yield a likelihood of match at 2^70 = 10^21. - The probabilities are independent if the SNP's are unlinked, which we define to be at least 50MB apart. - Unfortunately, it is not possible to have 70 SNP's 50MB apart in a 3GB genome. SNP Independence - A study by Dr. Ranajit Chakraborty of the Center for Genome Information concluded: -- Allelic dependence is very low: 5.71% as compared to 5% expected by chance alone -- Average heterozygosity of 46% across three population groups: Causian, Black, Hispanic -- Despite lack of theoretically independent loci, his study supports the use of this 70 SNP panel for identification purposes Non-Equiprobable SNP's - Conservative likelihoods can be calculated even without the assumption of equi-probability. - All bi-allelic heterozygous alleles have a minimum likelihood of 2, regardless of frequency: f = 2pq = 2p(1-p) <= 0.5 for all p in [0,1]; Thus L = 1/f >= 2 - The minimum likelihood of a SNP profile containing n heterozygous alleles is thus 2^n. - As Forensic Mathematician Charles Brenner notes, even if the SNP frequencies were 0.1 and 0.9, 99% of cases will have 10 heterozygous loci out of 100. M-FISys SNP Form Combining Technologies for Partial Profiles - The M-FISys software package is designed for rapid cross-pollination of STR, Kinship, mtDNA and SNP data of DNA samples. - Consistent or conflicting data in one technology can help determine experimental errors resulting in another technology. - M-FISys also generates Quality Control reports for finding such inconsistencies. Combining SNP's & STR's - By selectively choosing SNP's which are unlinked to each other and existing STR loci, independent likelihoods can be multiplied. - With the exception of CSF1PO & D5S818, all STR loci are on different chromosomes. - Thus any unlinked SNP's on an unused chromosome can be included in likelihood calculations. - STR profiles below threshold are missing >= 3 loci - Even if only 10 SNP's are used, the likelihood can be increased by 3 orders of magnitude! (2^10 = 10^+3) More Information Gene Codes Forensics 775 Technology Drive, Suite 100A Ann Arbor, MI 48108 (734) 769-7249 http://www.genecodes.com Updated Slides: http://www.jonhoyle.com/GeneCodes Thank You!