the fundamental problem of forensic statistics - the evidential value of a rare y-haplotype match
DESCRIPTION
Two new estimators of the evidential value of a rare haplotype match for a Y-STR dna profile are proposed. One is based on the Good-Turing coverage estimator, the other based on Orlitsky et al. MLE of the spectrum of a probability distribution (vector of probabilities ordered from large to small) based on the observed spectrum (vector of observed frequencies ordered from large to small). These two are compared to a model-based approach "discrete Laplace mixture model". "Less is more": it can pay off to discard some of the information (the actual haplotypes) ... the true likelihood ratio is lower, but the precision with which it can be estimated is much, much better.TRANSCRIPT
![Page 1: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/1.jpg)
The fundamental problem of forensic statistics
Richard Gill, Giulia Cereda
ICFIS 2014
Taking account of three levels of uncertainty !
When less can be more
![Page 2: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/2.jpg)
If you are a cat / dog …
(thanks to Paul Roberts)
statisticians/lawyers = cats/dogs/people?
![Page 3: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/3.jpg)
If you are a cat / dog …• Google “Ben Geen”
• http://thejusticegap.com/2014/08/become-convicted-serial-killer-without-killing-anyone/
• How to become a convicted serial killer (without killing anyone), by yours truly
• http://bengeen.wordpress.com/
![Page 4: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/4.jpg)
The fundamental problem of forensic statistics
Richard Gill, Giulia Cereda
ICFIS 2014
Taking account of three levels of uncertainty
![Page 5: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/5.jpg)
Fundamental problem of forensic mathematics—The evidential value of a rarehaplotype
Charles H. Brenner a,b,*a School of Public Health, Forensic Science Group, U.C. Berkeley, Berkeley, CA United Statesb DNA!VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States
1. Introduction
1.1. Mitochondrial DNA and Y-chromosomes
In recent years there has been increasing interest in using Y-chromosomal haplotypes [1–5] or mtDNA for forensic identifica-tion. These haplotype systems are also much used for bodyidentification, especially for old graves [6,7]. The advantages forsome kinds of problems are considerable. Both methods aredesirable for attacking kinship problems involving remoterelatives because sex-linked traits are not diluted 50% eachgeneration. Both methods are useful for some scant crime stains –mtDNA because of the high copy number, the Y-chromosomebecause it can be detected and amplified unambiguously evenwhen it is a minor component compared to the female victim in a
rape sample. However, an mtDNA or a Y-chromosomal haplotypemust be treated mathematically as a single indivisible (‘‘atomic’’)trait; so unlike those traditional DNA methods which examineseveral traits that are approximately independent of one another,no multiplication of probabilities is possible. Therefore it is vital tohave a sound fundamental understanding of atomic trait matchingprobabilities in order to make a reasonable assessment of thestrength of identification evidence.
1.2. The problem
Evidence – such as a Y-haplotype – is identical between a crimescene and a suspected donor. How strong is the evidence that thesuspect is the donor? In particular, this paper discusses the criticalmatching probability question, ‘‘What is the probability that arandom non-donor would by chance match a newly observedtype?’’
Such interesting questions and complications as possibledependence among traits and suitability of the reference popula-tion sample are outside the domain of this paper, essential though
Forensic Science International: Genetics 4 (2010) 281–291
A R T I C L E I N F O
Article history:Received 16 June 2009Received in revised form 20 October 2009Accepted 21 October 2009
Keywords:HaplotypeStain matchingmtDNAY-haplotypeForensic mathematicsLikelihood ratioMatching probability
A B S T R A C T
Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other)identification. For different reasons, each of them is sometimes detectable in a crime stain for whichautosomal typing fails. But they also present special problems, including a fundamental mathematicalone: When a rare haplotype is shared between suspect and crime scene, how strong is the evidencelinking the two? Assume a reference population sample is available which contains n " 1 haplotypes.The most interesting situation as well as the most common one is that the crime scene haplotype wasnever observed in the population sample. The traditional tools of product rule and sample frequency arenot useful when there are no components to multiply and the sample frequency is zero. A useful statisticis the fraction k of the population sample that consists of ‘‘singletons’’ – of once-observed types. A simpleargument shows that the probability for a random innocent suspect to match a previously unobservedcrime scene type is (1 " k)/n – distinctly less than 1/n, likely ten times less. The robust validity of thismodel is confirmed by testing it against a range of population models.
This paper hinges above all on one key insight: probability is not frequency. The common buterroneous ‘‘frequency’’ approach adopts population frequency as a surrogate for matching probabilityand attempts the intractable problem of guessing how many instances exist of the specific haplotype at acertain crime. Probability, by contrast, depends by definition only on the available data. Hence if differenthaplotypes but with the same data occur in two different crimes, although the frequencies are different(and are hopelessly elusive), the matching probabilities are the same, and are not hard to find.
! 2009 Elsevier Ireland Ltd. All rights reserved.
* Correspondence address: 6801 Thornhill Drive, Oakland CA 94611-1336. USA.Tel.: +l 510 339 1911; fax: +l 510 339 1181.
E-mail address: [email protected].
Contents lists available at ScienceDirect
Forensic Science International: Genetics
journa l homepage: www.e lsev ier .com/ locate / fs ig
1872-4973/$ – see front matter ! 2009 Elsevier Ireland Ltd. All rights reserved.doi:10.1016/j.fsigen.2009.10.013
(2014)
(2010)
Forensic Population Genetics - Original Research
Understanding Y haplotype matching probability
Charles H. Brenner a,b,*a Human Rights Center, U.C. Berkeley, Berkeley, CA, United Statesb DNA!VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States
1. Introduction – understanding Y haplotypes
For understanding the forensic use of Y-haplotype evidence,rather than adapt the methods and habits that have evolved for theanalysis of autosomal DNA evidence it is more appropriate andproductive to start over from the beginning. Evidence is quantifiedby a likelihood ratio built from the probability for a coincidentalmatch by an innocent suspect; that fact remains. All else is up forgrabs.
The genetic rules for the Y haplotype are different in severalways from the autosomal rules and these differences havepopulation genetic consequences, which in turn affect matchingprobabilities. The most important genetic difference is of course
the lack of recombination. One obvious consequence everyoneknows: a matching probability can’t be obtained by multiplicationacross loci. The haplotype must be treated as a unit. A Y-haplotypethus seems analogous to an autosomal allele, but we will see thatcopying the treatment of autosomal loci would be to fall into a trap,to err in several respects. For a start, the idea that sample frequencyapproximates population frequency approximates matchingprobability is too careless when most haplotypes in the populationare completely unrepresented in the sample, as is the usual casewith Y-haplotypes composed of multiple STR loci. Another habitfrom autosomal practice that doesn’t apply sensibly to Y-haplotypes is the treatment of u – the Cockerham/NRC II [4,5]notation for allele sharing by common descent. The unexpectedreason why it is not is explained in Section 2.4.
In the autosomal case, a simple model considers only alleleprobabilities (sometimes carelessly called ‘‘frequencies’’ – seeSection 4) and taking u into account is a refinement whose
Forensic Science International: Genetics 8 (2014) 233–243
A R T I C L E I N F O
Article history:Received 21 March 2013Received in revised form 6 October 2013Accepted 19 October 2013
Keywords:HaplotypeY-haplotypeLikelihood ratioWeight of evidence calculationProbabilityModel
A B S T R A C T
The Y haplotype population-genetic terrain is better explored from a fresh perspective rather than byanalogy with the more familiar autosomal ideas. For haplotype matching probabilities, versus forautosomal matching probabilities, explicit attention to modelling – such as how evolution got us wherewe are – is much more important while consideration of population frequency is much less so. This paperexplores, extends, and explains some of the concepts of ‘‘Fundamental problem of forensic mathematics– the evidential strength of a rare haplotype match’’ [1]. That earlier paper presented and validated a‘‘kappa method’’ formula for the evidential strength when a suspect matches a previously unseenhaplotype (such as a Y-haplotype) at the crime scene. Mathematical implications of the kappa methodare intuitive and reasonable. Suspicions to the contrary raised in [2] rest on elementary errors.
Critical to deriving the kappa method or any sensible evidential calculation is understanding thatthinking about haplotype population frequency is a red herring; the pivotal question is one of matchingprobability. But confusion between the two is unfortunately institutionalized in much of the forensicworld. Examples make clear why (matching) probability is not (population) frequency and whyuncertainty intervals on matching probabilities are merely confused thinking. Forensic matchingcalculations should be based on a model, on stipulated premises. The model inevitably onlyapproximates reality, and any error in the results comes only from error in the model, the inexactnessof the approximation. Sampling variation does not measure that inexactness and hence is not helpful inexplaining evidence and is in fact an impediment.
Alternative haplotype matching probability approaches that various authors have considered arereviewed. Some are based on no model and cannot be taken seriously. For the others, some evaluation ofthe models is discussed. Recent evidence supports the adequacy of the simple exchangability model onwhich the kappa method rests. However, to make progress toward forensic calculation of Y haplotypemixture evidence a different tack is needed. The ‘‘Laplace distribution’’ model of Andersen et al. [3] whichestimates haplotype frequencies by identifying haplotype clusters in population data looks useful.
! 2013 Published by Elsevier Ireland Ltd.
* Correspondence to: 6801 Thornhill Drive, Oakland, CA 94611, United States.E-mail address: [email protected]
Contents lists available at ScienceDirect
Forensic Science International: Genetics
jou r nal h o mep ag e: w ww .e lsev ier . co m / loc ate / fs ig
1872-4973/$ – see front matter ! 2013 Published by Elsevier Ireland Ltd.http://dx.doi.org/10.1016/j.fsigen.2013.10.007
a distribution very strongly favoring very small values of f(Fig. 4(b)). Next, in the real world of forensic genetics there isusually a population sample – e.g. D as in Section 3.1.1 – that can beincorporated via Bayes’ theorem. For example, let p, p ! 1, denotethe number of observations of the haplotype of interest in D. That’san event whose probability of occurrence, c(p, f), is easily writtendown:
cð p; f Þ ¼ Prð p observationsjpopulation frequency ¼ f Þ
¼ f pð1 % f ÞjDj% p jDjp
! "
and by Bayes’ theorem the matching probability m(p) is
mð pÞ ¼Z 1
f ¼0wð f Þcð p; f Þdf : (8)
Note that this formula works even with no reference data. In thatcase p = 1 and D= {T} (because of ‘‘extending’’ the database with thecrime scene observation per Section 3.1.1), c(1, f) = f and formula(8) looks very like formula (6).
For the case of wð f Þ as in formula (7), via standard propertiesof the beta distribution m(p) has the simple form given inSection A.2.4 of [1]:
mð pÞ ¼ pjDj þ 8799
:
And as above that’s the matching probability, end. Any impulseto decorate a probability by credible or confidence intervals isonly from momentum and confusion, not from mathematicalreasoning.
Which is not to say the number is scientifically exact. It’s not; itwill be to some extent wrong. But the source of error is uncertaintlyabout the model and the premises, not sampling variation. If Ddoesn’t represent the relevant population, that can be a problem –the magnitude of which has nothing to do with sampling variation.Since the infinite alleles model isn’t an accurate description ofevolution, the result cannot be accurate. In this regard having alarger D may paper over deficiencies in the model. But even so theextent of that isn’t measured by credible/confidence intervals orsampling variation.
5. Concluding discussion
This paper has two related aims. First is to clarify thecorrectness of the k method [1] in the face of published criticism[2]. The criticism is unsound, resting on misunderstanding anderrors as Section 3.6.3 shows. The k method (4) holds up well for sosimple a model. Recent work, especially concordant results in [3],confirm that the method is right within its assumptions, andmoreover shows that the model simplification sacrifices someaccuracy but usually not very much. An additional benefit of theLaplace method [3], and to me a more important one, is that it givesplausible probabilities for never-observed haplotypes (which mustbe considered as part of analyzing Y mixtures) and for multiply-observed haplotypes.
The second aim grew out of the first. Understanding the basis ofthe k approach requires understanding the Y haplotype terrain.And that in turn means casting aside preconceptions that we holdfrom autosomal experience and resisting the impulse to assumeobvious seeming parallels between the two. There are a great manyimportant differences.
1. Obviously the product rule is out for Y haplotypes – nothingnew, everyone knew that.
2. Sample frequency as an estimate for matching probability (i.e.‘‘blind counting’’, Section 3.1) is not bad for autosomal work
although the augmented count (Section 3.1.1) is clearlybetter. For Y haplotype work even the latter leaves a lot on thetable; sample frequency severely overestimates matchingprobability.
3. To clear the table better, it helps a lot in the Y case to look at thewhole reference database, not just observances of the targethaplotype. For autosomal work we never thought to do that (forthe good reason that it wouldn’t help much).
4. Confidence or credible intervals represent careless thinking butare mostly harmless in the autosomal arena. For Y, that carelessthinking is an impediment to understanding and to expressingthe actual evidential strength.
5. The ‘‘unrelated person’’ concept is an acceptable approxima-tion for autosomal work; a fatal one for understanding Yhaplotypes.
6. The autosomal idea of a u correction is almost backwards fromthe reality of the Y haplotype matching world where u is thestarting point, and where literal application of ‘‘standard’’autosomal theory gives negative probabilities.
7. Thinking about models is vital for understanding and develop-ing a Y haplotype matching probability approach. Someautosomal practice acknowledges population substructure –good, but a mere academic exercise by comparison with theimportance of theory for Y.
8. Finally there is another distinction outside the ambit of thispaper, the common concern about possible geographicalclustering of Y haplotypes. To what extent clustering compli-cates evidential calculation and what to do about it are topics toexplore.
Acknowledgements
I thank David DeGusta for invaluable advice with themanuscript and the referees for prodding me to greater clarity.
References
[1] C. Brenner, Fundamental problem of forensic mathematics – the evidential valueof a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.
[2] J. Buckleton, M. Krawczak, B. Weir, The interpretation of lineage markers inforensic DNA testing, Forensic Sci. Int. Genet. 5 (2011) 78–83.
[3] M.M. Andersen, P.S. Eriksen, N. Morling, The discrete Laplace exponentialfamily and estimation of Y-STR haplotype frequencies, J. Theor. Biol. 329 (2013)39–51.
[4] C. Cockerham, Variance of gene frequencies, Evolution 23 (1) (1969) 72–84.[5] J.F. Crow, et al., The Evaluation of Forensic DNA Evidence, National Research
Council, 1996.[6] Applied Biosystems, http://www6.appliedbiosystems.com/yfilerdatabase/, 2009.[7] M. Slatkin, An exact test for neutrality based on the Ewens sampling distribution,
Genet. Res. 64 (1994) 71–74.[8] Wolfram Inc., Polya’s Random Walk Constants, http://mathworld.wolfram.com/
PolyasRandomWalkConstants.html.[9] Long Bing, et al., Population genetics for 17 Y-STR loci (AmpFISTRY-filer
TM) in Luzhou Han ethnic group, Forensic Sci. Int. Genet. 7 (2) (2013)e23–e26.
[10] M. Slatkin, A correction to the exact test based on the Ewens sampling distribu-tion, Genet. Res. 68 (1996) 259–260.
[11] W. Ewens, The sampling theory of selectively neutral alleles, Theor. Popul. Biol. 3(1972) 87–112.
[12] SWGDAM, SWGDAM Haplotype Frequencies, http://www.fbi.gov/about-us/lab/forensic-science-communications/fsc/oct2009/standards/2009_01_stan-dards01.htm/, 2009.
[13] L. Roewer, M. Kayser, P. de Knijff, K. Anslinger, A. Betz, A. Caglia, D. Corach, S. Fredi,L. Henke, M. Hidding, H. Krgel, R. Lessig, M. Nagy, V. Pascali, W. Parson, B. Rolf, C.Schmitt, R. Szibor, J. Teifel-Greding, M. Krawczak, A new method for the evalua-tion of matches in non-recombining genomes: to Y-chromosomal short tandemrepeat (STR) haplotypes in European males, Forensic Sci. Int. 114 (1) (2000)31–43.
[14] M. Krawczak, Forensic evaluation of Y-STR haplotype matches: a comment,Forensic Sci. Int. 118 (2) (2001) 114–115.
[15] S. Willuweit, A. Caliebe, M.M. Andersen, L. Roewer, Y-STR frequency surveyingmethod: a critical reappraisal, Forensic Sci. Int. Genet. 5 (2001) 84–90.
[16] C. Brenner, The ‘‘frequency surveying’’ approach cannot work, http://dna-view.-com/downloads/documents/RareHaplotypes/Critique/%20haplotype/%20A4.pdf.
C.H. Brenner / Forensic Science International: Genetics 8 (2014) 233–243242
![Page 6: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/6.jpg)
Three levels of uncertainty
• Who did it? (prosecution vs. defence)
• The probability model
• The probabilities in the model
![Page 7: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/7.jpg)
The dogma• E = Evidence
• B = Background
• H, K = hypotheses of prosecution and defence
• LR = Pr( E | B, H) / Pr( E | B, K)!
• Pr = Probability
![Page 8: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/8.jpg)
The dogma
• The court informs the expert what is E, what is B, what are H and K
• The expert simply computes LR, ie the expert “knows” Pr
![Page 9: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/9.jpg)
The practice• The expert selects from all the information given to him,
on the basis of his own expertise (common practice), what he is going to take as E, B etc
• She uses background knowledge (convention, genetics) to specify a family of possible probabilities, in other words a model M = ( Pr( … | 𝜽 ) : 𝜽 in 𝚹 )
• She uses a data-base D to estimate 𝜽, giving estimate 𝜽
• She “computes” LR= Pr( E | B, H, 𝜽) / Pr( E | B, K, 𝜽)ˆ
ˆ
ˆˆApologies to all Bayesians here for my shamelessly frequentist point of view
![Page 10: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/10.jpg)
Different experts make different choices, get different answers
• In practice, an expert’s choices are all made simultaneously (convention, convenience, tractability, … )
• The fact that M is only a model and 𝜽 only a “best guess” is swept under the carpet
• Sometimes that could be reasonable (i.e., not misleading)
ˆ
![Page 11: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/11.jpg)
My thesis: let’s make this bug into a feature
• Depending on the size and quality of the data-base, different choices of E and B, different choices of M, and different ways to estimate 𝜽 might give results which are more or less reliable
• “Less is more”: (sometimes) by only taking account of less of the evidence, the final result may be much more reliable
![Page 12: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/12.jpg)
Extreme example: the rare haplotype problem
• Database - let’s pretend it is a random sample of Y-STR haplotypes from a given population
• Evidence - match of haplotype suspect, perpetrator
• Rare haplotype problem: it’s a new haplotype
• Prosecution - the suspect is the perpetrator
• Defence - the suspect and the perpetrator are different; the match due purely by chance
![Page 13: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/13.jpg)
# Supplementary data to the following publica7on
# Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,
# de Knijff P, Jobling MA, Tyler-‐Smith C, Krawczak M. 'Signature of
# recent historical events in the European Y-‐chromosomal STR haplotype
# distribu7on.' (2005) Hum Genet. 2005 Jan 20;
# LINK: hYp://dx.doi.org/10.1007/s00439-‐004-‐1201-‐z
# Licensed under the Academic Free License v. 2.1
# LINK: hYp://www.opensource.org/licenses/afl-‐2.1.php
pop dys19 dys389i dys389ii dys390 dys391 dys392 dys3931 Albania 12 13 30 24 10 11 132 Albania 12 13 30 24 10 11 143 Albania 13 12 30 24 10 11 134 Albania 13 13 29 23 10 11 135 Albania 13 13 29 24 10 11 146 Albania 13 13 29 24 11 13 138 Albania 13 13 30 24 10 11 139 Albania 13 13 30 24 10 11 13
10 Albania 13 13 30 24 10 11 13
12734 Anatolia, Turkey 16 14 31 24 10 11 1312735 Anatolia, Turkey 16 14 32 24 9 11 1312736 Anatolia, Turkey 17 12 28 21 10 11 1512737 Anatolia, Turkey 17 13 30 21 9 11 1312738 Anatolia, Turkey 17 13 30 25 10 11 13
Example (Ex. 1)
![Page 14: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/14.jpg)
●
●●●
●
●
●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 500 1000 1500 2000 2500
010
020
030
040
050
060
0
Y−chromosome data−base, N=12727
index
orde
red
frequ
encyN=12797 # distinct profiles: 2489 # singletons: 1397 # pairs: 379
![Page 15: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/15.jpg)
Dogma & current best practice
• LR = 1 / Pr(haplotype)
• Enormous literature with all kinds of ways to estimate Pr(haplotype)
• Current hi-tech solution: Andersen et al. (2013)
• Model: mixture of subpopulations; within each subpopulation, all locus repeat numbers independent discrete Laplace distributed
• Given # subpopulations, estimate model parameters by EM
• Number of subpopulations estimated by BIC
• LR calculated by plug-in
![Page 16: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/16.jpg)
Less is more• Consider the data-base as part of the evidence
• Forget the haplotypes !!!
• We are throwing away very relevant data but we don’t know how to use it precisely (esp. when data-base is small, case haplotype is new)
• Discrete Laplace model is “just” a model
• By increasing the number of subpopulations we can in the limit approximate any distribution … finding a good fit to the data in the database is not the same as finding a good estimate of the probability of an unusual new datapoint
![Page 17: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/17.jpg)
Forget the haplotypes
• Database reduces to ordered list (from large to small) of frequencies of different observed haplotypes
• Evidence: database plus fact of a new haplotype observed once (prosecution), twice (defence)
Database D is reduced to “spectrum of frequencies”
Unknown parameter 𝜽 = “spectrum of probabilities”
![Page 18: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/18.jpg)
Good-Turing type estimators
• LR = ratio of
• Pr( on extending sample from size N to N+1, we see one new species)
• Pr( on extending sample from size N to N+2, we see one new species, twice)
![Page 19: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/19.jpg)
Good-Turing type estimators• What happens when we increase database from size N
to size N+1 (or to N+2) has almost same probability as when increasing from N–1 (or N – 2) to N
• Ignoring that distinction:
• All permutations of the original data-base were equally likely to have occurred, so # singletons / N is an unbiased estimator of the prosecution’s probability
• (2 # duplets / N) x (1 / (N–1)) is an unbiased estimator of the defence’s probability
![Page 20: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/20.jpg)
My Brenner-inspired LR
N ·N(1)2 ·N(2)
P1k=1(1� pk)
NpkP1k=1(1� pk)Np
2k
N = size of database, N(1) = # singletons, N(2) = # duplets
ˆ
![Page 21: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/21.jpg)
Another option
• Estimate spectrum of haplotype probabilities from spectrum of observed haplotype frequencies by MLE – Orlitsky et al. (2004); Anevksi, Gill, Zohren (2013)!
• Then estimate prosecution, defence probabilities by plug-in from MLE
N ·N(1)2 ·N(2)
LR =
P1k=1(1� pk)
NpkP1k=1(1� pk)Np
2k
dLR =P1k=1(1� bpk)
N bpkP1k=1(1� bpk)N bp
2k
N ·N(1)2 ·N(2)
LR =
P1k=1(1� pk)
NpkP1k=1(1� pk)Np
2k
dLR =P1k=1(1� bpk)
N bpkP1k=1(1� bpk)N bp
2k
![Page 22: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/22.jpg)
Example (Ex. 2)
• We (visitors from Mars) go on safari. We see N = 3 mammals. Two are of one species and one of another (e.g., two lions, one tiger)
• *Naive estimator* of spectrum of probabilities is (2/3, 1/3, 0, 0, …)
• MLE is (1/2, 1/2, 0, 0, …)
N ·N(1)2 ·N(2)
LR =
P1k=1(1� pk)
NpkP1k=1(1� pk)Np
2k
dLR =P1k=1(1� bpk)
N bpkP1k=1(1� bpk)N bp
2k
lik(p1, p2, p3, ...) = p21p2 + p
22p1 + . . . ; p1 � p2 � p3 � . . . ,
X
k
pk = 1
Orlitsky et al. (2004) Anevski et al. (2013)
![Page 23: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/23.jpg)
frequency 27 23 16 14 13 12 11 10 9 8 7 6 5 4 3 2 1replicates 1 1 2 1 1 1 2 2 1 3 2 6 11 33 71 253 1434
Biometrika (2002), 89, 3, pp. 669-681 ? 2002 Biometrika Trust Printed in Great Britain
A Poisson model for the coverage problem with a genomic application
BY CHANG XUAN MAO Interdepartmental Group in Biostatistics, University of California, Berkeley,
367 Evans Hall, Berkeley, California 94720-3860, U.S.A. [email protected]
AND BRUCE G. LINDSAY Department of Statistics, Pennsylvania State University, University Park,
Pennsylvania 16802-2111, U.S.A. [email protected]
SUMMARY Suppose a population has infinitely many individuals and is partitioned into unknown
N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-f coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in *, with * being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is pre- sented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.
Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.
1. INTRODUCTION
Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situ- ations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i=1,2,..., N, with ic being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown m,'s are subject to the constraint 7 ~•{ 1 = 1. A random sample of individuals is taken from the population. Let X, be the number of individuals from the ith class, called the frequency in the sample. If X, = 0, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the Xi's, is unknown. Let nk be the number of classes with frequency k and s be the number of
Example (Ex. 3)
Biometrika (2002), 89, 3, pp. 669-681 ? 2002 Biometrika Trust Printed in Great Britain
A Poisson model for the coverage problem with a genomic application
BY CHANG XUAN MAO Interdepartmental Group in Biostatistics, University of California, Berkeley,
367 Evans Hall, Berkeley, California 94720-3860, U.S.A. [email protected]
AND BRUCE G. LINDSAY Department of Statistics, Pennsylvania State University, University Park,
Pennsylvania 16802-2111, U.S.A. [email protected]
SUMMARY Suppose a population has infinitely many individuals and is partitioned into unknown
N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-f coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in *, with * being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is pre- sented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.
Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.
1. INTRODUCTION
Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situ- ations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i=1,2,..., N, with ic being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown m,'s are subject to the constraint 7 ~•{ 1 = 1. A random sample of individuals is taken from the population. Let X, be the number of individuals from the ith class, called the frequency in the sample. If X, = 0, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the Xi's, is unknown. Let nk be the number of classes with frequency k and s be the number of
– naive – mle
– naive!– mle
LR = 2200!!
LR = 7400!Good-Turing = 7300
N = 2586; 2.1 x 107 iterations of SA-MH-EM (right panel: y-axis logarithmic)
ˆˆ
(mle has mass 0.3 spread thinly beyond 2000 – for optimisation, taken infinitely thin, inf. far)
(Tomato genes ordered by frequency of expression pk = probability gene k is being expressed)
0 500 1000 1500 2000
0.000
0.002
0.004
0.006
0.008
0.010
0 500 1000 1500 2000
1e−04
2e−04
5e−04
1e−03
2e−03
5e−03
1e−02
![Page 24: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/24.jpg)
An experiment
• Compare DISCLAP and Good-Turing
• *Known* population
• Small database
• A new rare-haplotype match case
![Page 25: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/25.jpg)
# Supplementary data to the following publica7on
# Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,
# de Knijff P, Jobling MA, Tyler-‐Smith C, Krawczak M. 'Signature of
# recent historical events in the European Y-‐chromosomal STR haplotype
# distribu7on.' (2005) Hum Genet. 2005 Jan 20;
# LINK: hYp://dx.doi.org/10.1007/s00439-‐004-‐1201-‐z
# Licensed under the Academic Free License v. 2.1
# LINK: hYp://www.opensource.org/licenses/afl-‐2.1.php
pop dys19 dys389i dys389ii dys390 dys391 dys392 dys3931 Albania 12 13 30 24 10 11 132 Albania 12 13 30 24 10 11 143 Albania 13 12 30 24 10 11 134 Albania 13 13 29 23 10 11 135 Albania 13 13 29 24 10 11 146 Albania 13 13 29 24 11 13 138 Albania 13 13 30 24 10 11 139 Albania 13 13 30 24 10 11 13
10 Albania 13 13 30 24 10 11 13
12734 Anatolia, Turkey 16 14 31 24 10 11 1312735 Anatolia, Turkey 16 14 32 24 9 11 1312736 Anatolia, Turkey 17 12 28 21 10 11 1512737 Anatolia, Turkey 17 13 30 21 9 11 1312738 Anatolia, Turkey 17 13 30 25 10 11 13
Example (Ex. 1)
![Page 26: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/26.jpg)
●
●●●
●
●
●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 500 1000 1500 2000 2500
010
020
030
040
050
060
0
Y−chromosome data−base, N=12727
index
orde
red
frequ
encyN=12797 # distinct profiles: 2489 # singletons: 1397 # pairs: 379
![Page 27: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/27.jpg)
An experiment• Truth = 2005 Roewer et al. 7 locus database with
each of the 12738 duplicated the same, v. large number of times
• Data-base = random sample of size 100
• Case = a new random haplotype different from all in data-base
• Repeat this, 1000 times
![Page 28: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/28.jpg)
• The DISCLAP model is not “true”, except as an ever better approximation with larger and larger number of subpopulations
• So it suffers from both further levels of uncertainty: “wrong model”; “unknown model parameters”
• The Good-Turing model is “true”, so only suffers from one level of uncertainty: “unknown model parameters”
• The database is very small (N = 100)
• We were “dumb users” - used DISCLAP out of the box, default parameters
The experiment is biased against DISCLAP
![Page 29: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/29.jpg)
Performance of Good-Turing
2.6
2.3
3.6
y-axis: log10 LR; x-axis: replicate
11 Simulations N=100 17
Min 1st Qu. Median Mean 3rd Qu. Max sd
log10 ˆLRg 2.261 2.547 2.656 2.678 2.780 3.558 0.175log10 LRg 2.603 2.603 2.603 2.603 2.603 2.603 0
Tab. 1: log10 ˆLRg against the true value log10 LRg
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
0 200 400 600 800 1000
2.4
2.6
2.8
3.0
3.2
3.4
3.6
Index
log10(LR
g_est)
(a)
●●
●
●
●●
●
●●
●●
●●
●
2.4
2.6
2.8
3.0
3.2
3.4
3.6
log10(LR
g_est)
(b)
Fig. 1: Scatter plot (s) and box plot (b) of the values of log10 ˆLRg. In blue the true value of log10 LRg
Histogram of y
log10(LRg_est)
Frequency
2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6
050
100
150
200
250
Fig. 2: Histogram of log10 ˆLRg. In blue the true values of log10 LRg
![Page 30: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/30.jpg)
Performance of DISCLAP
estimated log10 LR true log10 LR
<< true log10 LR Good-Turing
Superimposed Histograms
x
Frequency
2 4 6 8 10
050
100
150
200
250
![Page 31: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/31.jpg)
Comparison DISCLAP : Good-Turing
y-axis: estimated minus true log10 LR x-axis: 1000 replicates
●
●●●●●●●
●●●
●●●
●●
●●
●●
●●
●●●●●
●
●●●●●●
●●●●●●●●●●●●●●
●
●●
●●●●●●●●●●
●●●●
●●●●●●●●●●●
●
●
●
●●●●●●●
●●●●●●●
●●●
●●
●
●●●●●●●●
●●●
●●●
●●●
●
●●●●●●●●●●
●●●●●●●●
●●●●●●●
●●●●
●●
●●●●
●
●
●●●●●●●●
●
●●
●●
●
●●●●●
●
●
●●
●●
●
●
●
●●●●
●●
●●
●●
●●●●●
●
●●
●●●●●
●●●
●
●●●●●●●●●●●●
●
●●●●●●●●●●●●
●
●
●
●●●●●●●●●●
●
●●●●●●●●
●
●●●
●
●
●●●●●●
●●●●
●●
●●
●●
●
●●●●●●●●●●
●
●●●●
●
●
●●
●
●●●
●
●●●●●●●●●
●●
●●●●●●●●
●
●●
●●
●
●●●●
●●
●●●
●
●
●●
●●●
●
●
●
●
●●
●
●
●●
●
●
●●●●●●
●●●
●●●
●●
●●
●
●●●●●
●
●●
●●●●●●●
●●
●
●
●●●●●
●
●●
●●
●
●
●
●
●●●
●●●●●●
●●
●
●
●
●●●
●●●●
●●●●
●
●
●●●●
●●
●
●●●●
●
●●●
●●●
●●
●
●
●●●
●●
●
●●●
●●
●
●●
●●●●
●
●●
●●●●●●●●
●●●●
●●
●
●●●●●●●●●●●
●●●
●●●●
●●●
●●●●●●●●●●
●
●
●●●
●
●
●●●
●
●●●●●●
●
●●●●●●●●
●●●●
●
●●●
●●●
●
●●
●●●
●●
●●●●●●●
●●●●●●●●●●
●
●●
●
●
●
●●●
●●●●●●●
●●
●●●
●●●●
●
●●
●
●
●
●●●●
●
●●●
●
●●
●
●
●●●●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●●
●●
●●●●●●●●●
●
●
●
●●
●●●●
●●●●●●●●●●●●●●●●●●●●
●
●
●●●
●●
●●●●●
●
●
●
●
●●●●
●
●●
●
●●●
●
●●●●●
●●●●
●●
●●●●
●●
●●
●
●●●●●●
●●
●
●●
●●●●●●●●●
●
●●
●●●
●●
●●●●
●
●●
●●●●
●●
●●●●●●
●
●●●●
●●●
●
●●●●
●
●
●
●●●●●●
●●●●●●●●
●●●
●
●●
●
●
●
●●
●
●●●●●●
●●●●
●●●
●
●
●●
●●
●●
●
●
●●●●●●●●●
●
●
●
●●●
●●●●●●
●
●
●
●●
●
●
●●●
●●
●●●●
●●●●●●●●
●
●●●●●●●●●●
●●●●
●
●
●●
●
●●●
●
●●
●●
●
●●●●●
●
●
●
●
●●●●●●●●●●●
●●
●
●●
●
●
●●●●●●
●●
●●●●
●●
●
●
●
●
●
●●●
●
●●●●●●●●●
●●●
●
●
●●●●●
●
●●●●●
●
0 200 400 600 800 1000
02
46
Index
log10(Ratio_G
ill)●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●●
●●
●●
●●
●
●
●
●●
●●
●●●
●
●
●
●
●●●●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●
●●
●●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●●
●
●
●
●
●●●
●●
●
●●
●●
●
●
●
●
●
●●
●
●●●●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●●
●●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●●●
●
●
●
●
●●
●
●●●
●
●
●
●
●●
●
●
●
0 200 400 600 800 1000
02
46
Index
log10(Ratio_Andersen)
![Page 32: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/32.jpg)
Conclusions (1)• Throwing away the haplotype information reduces “true”
weight of evidence (log10 LR) typically from about 3.5 to about 2.5
• Good-Turing is fairly unbiased and has rather small variance (typical error about 0.2, rarely 0.5)
• DISCLAP “out of the box” is biased in favour of the prosecution by an order of magnitude (error in log10 LR of about 1.0) and the errors are heavily skewed to the right
• The defence could therefore object to use of DISCLAP (in this way…) and their objection ought to be accepted
![Page 33: The fundamental problem of forensic statistics - the evidential value of a rare Y-haplotype match](https://reader035.vdocuments.us/reader035/viewer/2022070302/547df7ebb479598e508b4a90/html5/thumbnails/33.jpg)
Conclusions (2)• Much more work to do …
• Publications in preparation (Giulia Cereda and me)
• For MLE work on spectrum (cf. Mao example…)
!
!
• Something completely different, for the cats/dogs here: Google “Ben Geen”
http://arxiv.org/abs/1312.1200 Estimating a probability mass function with unknown labels Dragi Anevski, Richard D. Gill, Stefan Zohren !(Building on Orlitsky et al., 2004)