the fundamental problem of forensic statistics - the evidential value of a rare y-haplotype match

The fundamental problem of forensic statistics

Richard Gill, Giulia Cereda

ICFIS 2014

Taking account of three levels of uncertainty !

When less can be more

If you are a cat / dog …

(thanks to Paul Roberts)

statisticians/lawyers = cats/dogs/people?

If you are a cat / dog …• Google “Ben Geen”

• http://thejusticegap.com/2014/08/become-convicted-serial-killer-without-killing-anyone/

• How to become a convicted serial killer (without killing anyone), by yours truly

• http://bengeen.wordpress.com/

http://thejusticegap.com/2014/08/become-convicted-serial-killer-without-killing-anyone/

http://bengeen.wordpress.com/

The fundamental problem of forensic statistics

Richard Gill, Giulia Cereda

ICFIS 2014

Taking account of three levels of uncertainty

Fundamental problem of forensic mathematics—The evidential value of a rarehaplotype

Charles H. Brenner a,b,*a School of Public Health, Forensic Science Group, U.C. Berkeley, Berkeley, CA United Statesb DNA!VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States

1. Introduction

1.1. Mitochondrial DNA and Y-chromosomes

In recent years there has been increasing interest in using Y-chromosomal haplotypes [1–5] or mtDNA for forensic identifica-tion. These haplotype systems are also much used for bodyidentification, especially for old graves [6,7]. The advantages forsome kinds of problems are considerable. Both methods aredesirable for attacking kinship problems involving remoterelatives because sex-linked traits are not diluted 50% eachgeneration. Both methods are useful for some scant crime stains –mtDNA because of the high copy number, the Y-chromosomebecause it can be detected and amplified unambiguously evenwhen it is a minor component compared to the female victim in a

rape sample. However, an mtDNA or a Y-chromosomal haplotypemust be treated mathematically as a single indivisible (‘‘atomic’’)trait; so unlike those traditional DNA methods which examineseveral traits that are approximately independent of one another,no multiplication of probabilities is possible. Therefore it is vital tohave a sound fundamental understanding of atomic trait matchingprobabilities in order to make a reasonable assessment of thestrength of identification evidence.

1.2. The problem

Evidence – such as a Y-haplotype – is identical between a crimescene and a suspected donor. How strong is the evidence that thesuspect is the donor? In particular, this paper discusses the criticalmatching probability question, ‘‘What is the probability that arandom non-donor would by chance match a newly observedtype?’’

Such interesting questions and complications as possibledependence among traits and suitability of the reference popula-tion sample are outside the domain of this paper, essential though

Forensic Science International: Genetics 4 (2010) 281–291

A R T I C L E I N F O

Article history:Received 16 June 2009Received in revised form 20 October 2009Accepted 21 October 2009

Keywords:HaplotypeStain matchingmtDNAY-haplotypeForensic mathematicsLikelihood ratioMatching probability

A B S T R A C T

Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other)identification. For different reasons, each of them is sometimes detectable in a crime stain for whichautosomal typing fails. But they also present special problems, including a fundamental mathematicalone: When a rare haplotype is shared between suspect and crime scene, how strong is the evidencelinking the two? Assume a reference population sample is available which contains n " 1 haplotypes.The most interesting situation as well as the most common one is that the crime scene haplotype wasnever observed in the population sample. The traditional tools of product rule and sample frequency arenot useful when there are no components to multiply and the sample frequency is zero. A useful statisticis the fraction k of the population sample that consists of ‘‘singletons’’ – of once-observed types. A simpleargument shows that the probability for a random innocent suspect to match a previously unobservedcrime scene type is (1 " k)/n – distinctly less than 1/n, likely ten times less. The robust validity of thismodel is confirmed by testing it against a range of population models.

This paper hinges above all on one key insight: probability is not frequency. The common buterroneous ‘‘frequency’’ approach adopts population frequency as a surrogate for matching probabilityand attempts the intractable problem of guessing how many instances exist of the specific haplotype at acertain crime. Probability, by contrast, depends by definition only on the available data. Hence if differenthaplotypes but with the same data occur in two different crimes, although the frequencies are different(and are hopelessly elusive), the matching probabilities are the same, and are not hard to find.

! 2009 Elsevier Ireland Ltd. All rights reserved.

* Correspondence address: 6801 Thornhill Drive, Oakland CA 94611-1336. USA.Tel.: +l 510 339 1911; fax: +l 510 339 1181.

E-mail address: [email protected].

Contents lists available at ScienceDirect

Forensic Science International: Genetics

journa l homepage: www.e lsev ier .com/ locate / fs ig

1872-4973/$ – see front matter ! 2009 Elsevier Ireland Ltd. All rights reserved.doi:10.1016/j.fsigen.2009.10.013

(2014)

(2010)

Forensic Population Genetics - Original Research

Understanding Y haplotype matching probability

Charles H. Brenner a,b,*a Human Rights Center, U.C. Berkeley, Berkeley, CA, United Statesb DNA!VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States

1. Introduction – understanding Y haplotypes

For understanding the forensic use of Y-haplotype evidence,rather than adapt the methods and habits that have evolved for theanalysis of autosomal DNA evidence it is more appropriate andproductive to start over from the beginning. Evidence is quantifiedby a likelihood ratio built from the probability for a coincidentalmatch by an innocent suspect; that fact remains. All else is up forgrabs.

The genetic rules for the Y haplotype are different in severalways from the autosomal rules and these differences havepopulation genetic consequences, which in turn affect matchingprobabilities. The most important genetic difference is of course

the lack of recombination. One obvious consequence everyoneknows: a matching probability can’t be obtained by multiplicationacross loci. The haplotype must be treated as a unit. A Y-haplotypethus seems analogous to an autosomal allele, but we will see thatcopying the treatment of autosomal loci would be to fall into a trap,to err in several respects. For a start, the idea that sample frequencyapproximates population frequency approximates matchingprobability is too careless when most haplotypes in the populationare completely unrepresented in the sample, as is the usual casewith Y-haplotypes composed of multiple STR loci. Another habitfrom autosomal practice that doesn’t apply sensibly to Y-haplotypes is the treatment of u – the Cockerham/NRC II [4,5]notation for allele sharing by common descent. The unexpectedreason why it is not is explained in Section 2.4.

In the autosomal case, a simple model considers only alleleprobabilities (sometimes carelessly called ‘‘frequencies’’ – seeSection 4) and taking u into account is a refinement whose

Forensic Science International: Genetics 8 (2014) 233–243

A R T I C L E I N F O

Article history:Received 21 March 2013Received in revised form 6 October 2013Accepted 19 October 2013

Keywords:HaplotypeY-haplotypeLikelihood ratioWeight of evidence calculationProbabilityModel

A B S T R A C T

The Y haplotype population-genetic terrain is better explored from a fresh perspective rather than byanalogy with the more familiar autosomal ideas. For haplotype matching probabilities, versus forautosomal matching probabilities, explicit attention to modelling – such as how evolution got us wherewe are – is much more important while consideration of population frequency is much less so. This paperexplores, extends, and explains some of the concepts of ‘‘Fundamental problem of forensic mathematics– the evidential strength of a rare haplotype match’’ [1]. That earlier paper presented and validated a‘‘kappa method’’ formula for the evidential strength when a suspect matches a previously unseenhaplotype (such as a Y-haplotype) at the crime scene. Mathematical implications of the kappa methodare intuitive and reasonable. Suspicions to the contrary raised in [2] rest on elementary errors.

Critical to deriving the kappa method or any sensible evidential calculation is understanding thatthinking about haplotype population frequency is a red herring; the pivotal question is one of matchingprobability. But confusion between the two is unfortunately institutionalized in much of the forensicworld. Examples make clear why (matching) probability is not (population) frequency and whyuncertainty intervals on matching probabilities are merely confused thinking. Forensic matchingcalculations should be based on a model, on stipulated premises. The model inevitably onlyapproximates reality, and any error in the results comes only from error in the model, the inexactnessof the approximation. Sampling variation does not measure that inexactness and hence is not helpful inexplaining evidence and is in fact an impediment.

Alternative haplotype matching probability approaches that various authors have considered arereviewed. Some are based on no model and cannot be taken seriously. For the others, some evaluation ofthe models is discussed. Recent evidence supports the adequacy of the simple exchangability model onwhich the kappa method rests. However, to make progress toward forensic calculation of Y haplotypemixture evidence a different tack is needed. The ‘‘Laplace distribution’’ model of Andersen et al. [3] whichestimates haplotype frequencies by identifying haplotype clusters in population data looks useful.

! 2013 Published by Elsevier Ireland Ltd.

* Correspondence to: 6801 Thornhill Drive, Oakland, CA 94611, United States.E-mail address: [email protected]

Contents lists available at ScienceDirect

Forensic Science International: Genetics

jou r nal h o mep ag e: w ww .e lsev ier . co m / loc ate / fs ig

1872-4973/$ – see front matter ! 2013 Published by Elsevier Ireland Ltd.http://dx.doi.org/10.1016/j.fsigen.2013.10.007

a distribution very strongly favoring very small values of f(Fig. 4(b)). Next, in the real world of forensic genetics there isusually a population sample – e.g. D as in Section 3.1.1 – that can beincorporated via Bayes’ theorem. For example, let p, p ! 1, denotethe number of observations of the haplotype of interest in D. That’san event whose probability of occurrence, c(p, f), is easily writtendown:

cð p; f Þ ¼ Prð p observationsjpopulation frequency ¼ f Þ

¼ f pð1 % f ÞjDj% p jDjp

! "

and by Bayes’ theorem the matching probability m(p) is

mð pÞ ¼Z 1

f ¼0wð f Þcð p; f Þdf : (8)

Note that this formula works even with no reference data. In thatcase p = 1 and D= {T} (because of ‘‘extending’’ the database with thecrime scene observation per Section 3.1.1), c(1, f) = f and formula(8) looks very like formula (6).

For the case of wð f Þ as in formula (7), via standard propertiesof the beta distribution m(p) has the simple form given inSection A.2.4 of [1]:

mð pÞ ¼ pjDj þ 8799

:

And as above that’s the matching probability, end. Any impulseto decorate a probability by credible or confidence intervals isonly from momentum and confusion, not from mathematicalreasoning.

Which is not to say the number is scientifically exact. It’s not; itwill be to some extent wrong. But the source of error is uncertaintlyabout the model and the premises, not sampling variation. If Ddoesn’t represent the relevant population, that can be a problem –the magnitude of which has nothing to do with sampling variation.Since the infinite alleles model isn’t an accurate description ofevolution, the result cannot be accurate. In this regard having alarger D may paper over deficiencies in the model. But even so theextent of that isn’t measured by credible/confidence intervals orsampling variation.

5. Concluding discussion

This paper has two related aims. First is to clarify thecorrectness of the k method [1] in the face of published criticism[2]. The criticism is unsound, resting on misunderstanding anderrors as Section 3.6.3 shows. The k method (4) holds up well for sosimple a model. Recent work, especially concordant results in [3],confirm that the method is right within its assumptions, andmoreover shows that the model simplification sacrifices someaccuracy but usually not very much. An additional benefit of theLaplace method [3], and to me a more important one, is that it givesplausible probabilities for never-observed haplotypes (which mustbe considered as part of analyzing Y mixtures) and for multiply-observed haplotypes.

The second aim grew out of the first. Understanding the basis ofthe k approach requires understanding the Y haplotype terrain.And that in turn means casting aside preconceptions that we holdfrom autosomal experience and resisting the impulse to assumeobvious seeming parallels between the two. There are a great manyimportant differences.

1. Obviously the product rule is out for Y haplotypes – nothingnew, everyone knew that.

2. Sample frequency as an estimate for matching probability (i.e.‘‘blind counting’’, Section 3.1) is not bad for autosomal work

although the augmented count (Section 3.1.1) is clearlybetter. For Y haplotype work even the latter leaves a lot on thetable; sample frequency severely overestimates matchingprobability.

3. To clear the table better, it helps a lot in the Y case to look at thewhole reference database, not just observances of the targethaplotype. For autosomal work we never thought to do that (forthe good reason that it wouldn’t help much).

4. Confidence or credible intervals represent careless thinking butare mostly harmless in the autosomal arena. For Y, that carelessthinking is an impediment to understanding and to expressingthe actual evidential strength.

5. The ‘‘unrelated person’’ concept is an acceptable approxima-tion for autosomal work; a fatal one for understanding Yhaplotypes.

6. The autosomal idea of a u correction is almost backwards fromthe reality of the Y haplotype matching world where u is thestarting point, and where literal application of ‘‘standard’’autosomal theory gives negative probabilities.

7. Thinking about models is vital for understanding and develop-ing a Y haplotype matching probability approach. Someautosomal practice acknowledges population substructure –good, but a mere academic exercise by comparison with theimportance of theory for Y.

8. Finally there is another distinction outside the ambit of thispaper, the common concern about possible geographicalclustering of Y haplotypes. To what extent clustering compli-cates evidential calculation and what to do about it are topics toexplore.

Acknowledgements

I thank David DeGusta for invaluable advice with themanuscript and the referees for prodding me to greater clarity.

References

[1] C. Brenner, Fundamental problem of forensic mathematics – the evidential valueof a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.

[2] J. Buckleton, M. Krawczak, B. Weir, The interpretation of lineage markers inforensic DNA testing, Forensic Sci. Int. Genet. 5 (2011) 78–83.

[3] M.M. Andersen, P.S. Eriksen, N. Morling, The discrete Laplace exponentialfamily and estimation of Y-STR haplotype frequencies, J. Theor. Biol. 329 (2013)39–51.

[4] C. Cockerham, Variance of gene frequencies, Evolution 23 (1) (1969) 72–84.[5] J.F. Crow, et al., The Evaluation of Forensic DNA Evidence, National Research

Council, 1996.[6] Applied Biosystems, http://www6.appliedbiosystems.com/yfilerdatabase/, 2009.[7] M. Slatkin, An exact test for neutrality based on the Ewens sampling distribution,

Genet. Res. 64 (1994) 71–74.[8] Wolfram Inc., Polya’s Random Walk Constants, http://mathworld.wolfram.com/

PolyasRandomWalkConstants.html.[9] Long Bing, et al., Population genetics for 17 Y-STR loci (AmpFISTRY-filer

TM) in Luzhou Han ethnic group, Forensic Sci. Int. Genet. 7 (2) (2013)e23–e26.

[10] M. Slatkin, A correction to the exact test based on the Ewens sampling distribu-tion, Genet. Res. 68 (1996) 259–260.

[11] W. Ewens, The sampling theory of selectively neutral alleles, Theor. Popul. Biol. 3(1972) 87–112.

[12] SWGDAM, SWGDAM Haplotype Frequencies, http://www.fbi.gov/about-us/lab/forensic-science-communications/fsc/oct2009/standards/2009_01_stan-dards01.htm/, 2009.

[13] L. Roewer, M. Kayser, P. de Knijff, K. Anslinger, A. Betz, A. Caglia, D. Corach, S. Fredi,L. Henke, M. Hidding, H. Krgel, R. Lessig, M. Nagy, V. Pascali, W. Parson, B. Rolf, C.Schmitt, R. Szibor, J. Teifel-Greding, M. Krawczak, A new method for the evalua-tion of matches in non-recombining genomes: to Y-chromosomal short tandemrepeat (STR) haplotypes in European males, Forensic Sci. Int. 114 (1) (2000)31–43.

[14] M. Krawczak, Forensic evaluation of Y-STR haplotype matches: a comment,Forensic Sci. Int. 118 (2) (2001) 114–115.

[15] S. Willuweit, A. Caliebe, M.M. Andersen, L. Roewer, Y-STR frequency surveyingmethod: a critical reappraisal, Forensic Sci. Int. Genet. 5 (2001) 84–90.

[16] C. Brenner, The ‘‘frequency surveying’’ approach cannot work, http://dna-view.-com/downloads/documents/RareHaplotypes/Critique/%20haplotype/%20A4.pdf.

C.H. Brenner / Forensic Science International: Genetics 8 (2014) 233–243242

Three levels of uncertainty

• Who did it? (prosecution vs. defence)

• The probability model

• The probabilities in the model

The dogma• E = Evidence

• B = Background

• H, K = hypotheses of prosecution and defence

• LR = Pr( E | B, H) / Pr( E | B, K)!

• Pr = Probability

The dogma

• The court informs the expert what is E, what is B, what are H and K

• The expert simply computes LR, ie the expert “knows” Pr

The practice• The expert selects from all the information given to him,

on the basis of his own expertise (common practice), what he is going to take as E, B etc

• She uses background knowledge (convention, genetics) to specify a family of possible probabilities, in other words a model M = ( Pr( … | 𝜽 ) : 𝜽 in 𝚹 )

• She uses a data-base D to estimate 𝜽, giving estimate 𝜽

• She “computes” LR= Pr( E | B, H, 𝜽) / Pr( E | B, K, 𝜽)ˆ

ˆ

ˆˆApologies to all Bayesians here for my shamelessly frequentist point of view

Different experts make different choices, get different answers

• In practice, an expert’s choices are all made simultaneously (convention, convenience, tractability, … )

• The fact that M is only a model and 𝜽 only a “best guess” is swept under the carpet

• Sometimes that could be reasonable (i.e., not misleading)

ˆ

My thesis: let’s make this bug into a feature

• Depending on the size and quality of the data-base, different choices of E and B, different choices of M, and different ways to estimate 𝜽 might give results which are more or less reliable

• “Less is more”: (sometimes) by only taking account of less of the evidence, the final result may be much more reliable

Extreme example: the rare haplotype problem

• Database - let’s pretend it is a random sample of Y-STR haplotypes from a given population

• Evidence - match of haplotype suspect, perpetrator

• Rare haplotype problem: it’s a new haplotype

• Prosecution - the suspect is the perpetrator

• Defence - the suspect and the perpetrator are different; the match due purely by chance

# Supplementary data to the following publica7on

# Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,

# de Knijff P, Jobling MA, Tyler-‐Smith C, Krawczak M. 'Signature of

# recent historical events in the European Y-‐chromosomal STR haplotype

# distribu7on.' (2005) Hum Genet. 2005 Jan 20;

# LINK: hYp://dx.doi.org/10.1007/s00439-‐004-‐1201-‐z

# Licensed under the Academic Free License v. 2.1

# LINK: hYp://www.opensource.org/licenses/afl-‐2.1.php

pop dys19 dys389i dys389ii dys390 dys391 dys392 dys3931 Albania 12 13 30 24 10 11 132 Albania 12 13 30 24 10 11 143 Albania 13 12 30 24 10 11 134 Albania 13 13 29 23 10 11 135 Albania 13 13 29 24 10 11 146 Albania 13 13 29 24 11 13 138 Albania 13 13 30 24 10 11 139 Albania 13 13 30 24 10 11 13

10 Albania 13 13 30 24 10 11 13

12734 Anatolia, Turkey 16 14 31 24 10 11 1312735 Anatolia, Turkey 16 14 32 24 9 11 1312736 Anatolia, Turkey 17 12 28 21 10 11 1512737 Anatolia, Turkey 17 13 30 21 9 11 1312738 Anatolia, Turkey 17 13 30 25 10 11 13

Example (Ex. 1)

●

●●●

●

●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 500 1000 1500 2000 2500

010

020

030

040

050

060

0

Y−chromosome data−base, N=12727

index

orde

red

frequ

encyN=12797 # distinct profiles: 2489 # singletons: 1397 # pairs: 379

Dogma & current best practice

• LR = 1 / Pr(haplotype)

• Enormous literature with all kinds of ways to estimate Pr(haplotype)

• Current hi-tech solution: Andersen et al. (2013)

• Model: mixture of subpopulations; within each subpopulation, all locus repeat numbers independent discrete Laplace distributed

• Given # subpopulations, estimate model parameters by EM

• Number of subpopulations estimated by BIC

• LR calculated by plug-in

Less is more• Consider the data-base as part of the evidence

• Forget the haplotypes !!!

• We are throwing away very relevant data but we don’t know how to use it precisely (esp. when data-base is small, case haplotype is new)

• Discrete Laplace model is “just” a model

• By increasing the number of subpopulations we can in the limit approximate any distribution … finding a good fit to the data in the database is not the same as finding a good estimate of the probability of an unusual new datapoint

Forget the haplotypes

• Database reduces to ordered list (from large to small) of frequencies of different observed haplotypes

• Evidence: database plus fact of a new haplotype observed once (prosecution), twice (defence)

Database D is reduced to “spectrum of frequencies”

Unknown parameter 𝜽 = “spectrum of probabilities”

Good-Turing type estimators

• LR = ratio of

• Pr( on extending sample from size N to N+1, we see one new species)

• Pr( on extending sample from size N to N+2, we see one new species, twice)

Good-Turing type estimators• What happens when we increase database from size N

to size N+1 (or to N+2) has almost same probability as when increasing from N–1 (or N – 2) to N

• Ignoring that distinction:

• All permutations of the original data-base were equally likely to have occurred, so # singletons / N is an unbiased estimator of the prosecution’s probability

• (2 # duplets / N) x (1 / (N–1)) is an unbiased estimator of the defence’s probability

My Brenner-inspired LR

N ·N(1)2 ·N(2)

P1k=1(1� pk)

NpkP1k=1(1� pk)Np

2k

N = size of database, N(1) = # singletons, N(2) = # duplets

ˆ

Another option

• Estimate spectrum of haplotype probabilities from spectrum of observed haplotype frequencies by MLE – Orlitsky et al. (2004); Anevksi, Gill, Zohren (2013)!

• Then estimate prosecution, defence probabilities by plug-in from MLE

N ·N(1)2 ·N(2)

LR =

P1k=1(1� pk)

NpkP1k=1(1� pk)Np

2k

dLR =P1k=1(1� bpk)

N bpkP1k=1(1� bpk)N bp

2k

N ·N(1)2 ·N(2)

LR =

P1k=1(1� pk)

NpkP1k=1(1� pk)Np

2k



2k

Example (Ex. 2)

• We (visitors from Mars) go on safari. We see N = 3 mammals. Two are of one species and one of another (e.g., two lions, one tiger)

• *Naive estimator* of spectrum of probabilities is (2/3, 1/3, 0, 0, …)

• MLE is (1/2, 1/2, 0, 0, …)

N ·N(1)2 ·N(2)

LR =

P1k=1(1� pk)

NpkP1k=1(1� pk)Np

2k



2k

lik(p1, p2, p3, ...) = p21p2 + p

22p1 + . . . ; p1 � p2 � p3 � . . . ,

X

k

pk = 1

Orlitsky et al. (2004) Anevski et al. (2013)

frequency 27 23 16 14 13 12 11 10 9 8 7 6 5 4 3 2 1replicates 1 1 2 1 1 1 2 2 1 3 2 6 11 33 71 253 1434

Biometrika (2002), 89, 3, pp. 669-681 ? 2002 Biometrika Trust Printed in Great Britain

A Poisson model for the coverage problem with a genomic application

BY CHANG XUAN MAO Interdepartmental Group in Biostatistics, University of California, Berkeley,

367 Evans Hall, Berkeley, California 94720-3860, U.S.A. [email protected]

AND BRUCE G. LINDSAY Department of Statistics, Pennsylvania State University, University Park,

Pennsylvania 16802-2111, U.S.A. [email protected]

SUMMARY Suppose a population has infinitely many individuals and is partitioned into unknown

N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-f coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in *, with * being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.

Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.

1. INTRODUCTION

Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situ- ations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i=1,2,..., N, with ic being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown m,'s are subject to the constraint 7 ~•{ 1 = 1. A random sample of individuals is taken from the population. Let X, be the number of individuals from the ith class, called the frequency in the sample. If X, = 0, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the Xi's, is unknown. Let nk be the number of classes with frequency k and s be the number of

Example (Ex. 3)

Biometrika (2002), 89, 3, pp. 669-681 ? 2002 Biometrika Trust Printed in Great Britain

A Poisson model for the coverage problem with a genomic application

BY CHANG XUAN MAO Interdepartmental Group in Biostatistics, University of California, Berkeley,

367 Evans Hall, Berkeley, California 94720-3860, U.S.A. [email protected]

AND BRUCE G. LINDSAY Department of Statistics, Pennsylvania State University, University Park,

Pennsylvania 16802-2111, U.S.A. [email protected]

SUMMARY Suppose a population has infinitely many individuals and is partitioned into unknown

N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-f coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in *, with * being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.

Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.

1. INTRODUCTION

Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situ- ations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i=1,2,..., N, with ic being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown m,'s are subject to the constraint 7 ~•{ 1 = 1. A random sample of individuals is taken from the population. Let X, be the number of individuals from the ith class, called the frequency in the sample. If X, = 0, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the Xi's, is unknown. Let nk be the number of classes with frequency k and s be the number of

– naive – mle

– naive!– mle

LR = 2200!!

LR = 7400!Good-Turing = 7300

N = 2586; 2.1 x 107 iterations of SA-MH-EM (right panel: y-axis logarithmic)

ˆˆ

(mle has mass 0.3 spread thinly beyond 2000 – for optimisation, taken infinitely thin, inf. far)

(Tomato genes ordered by frequency of expression pk = probability gene k is being expressed)

0 500 1000 1500 2000

0.000

0.002

0.004

0.006

0.008

0.010

0 500 1000 1500 2000

1e−04

2e−04

5e−04

1e−03

2e−03

5e−03

1e−02

An experiment

• Compare DISCLAP and Good-Turing

• *Known* population

• Small database

• A new rare-haplotype match case

# Supplementary data to the following publica7on

# Roewer L, Croucher PJ, Willuweit S, Lu TT, Kayser M, Lessig R,

# de Knijff P, Jobling MA, Tyler-‐Smith C, Krawczak M. 'Signature of

# recent historical events in the European Y-‐chromosomal STR haplotype

# distribu7on.' (2005) Hum Genet. 2005 Jan 20;

# LINK: hYp://dx.doi.org/10.1007/s00439-‐004-‐1201-‐z

# Licensed under the Academic Free License v. 2.1

# LINK: hYp://www.opensource.org/licenses/afl-‐2.1.php

pop dys19 dys389i dys389ii dys390 dys391 dys392 dys3931 Albania 12 13 30 24 10 11 132 Albania 12 13 30 24 10 11 143 Albania 13 12 30 24 10 11 134 Albania 13 13 29 23 10 11 135 Albania 13 13 29 24 10 11 146 Albania 13 13 29 24 11 13 138 Albania 13 13 30 24 10 11 139 Albania 13 13 30 24 10 11 13

10 Albania 13 13 30 24 10 11 13

12734 Anatolia, Turkey 16 14 31 24 10 11 1312735 Anatolia, Turkey 16 14 32 24 9 11 1312736 Anatolia, Turkey 17 12 28 21 10 11 1512737 Anatolia, Turkey 17 13 30 21 9 11 1312738 Anatolia, Turkey 17 13 30 25 10 11 13

Example (Ex. 1)

●

●●●

●

●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 500 1000 1500 2000 2500

010

020

030

040

050

060

0

Y−chromosome data−base, N=12727

index

orde

red

frequ

encyN=12797 # distinct profiles: 2489 # singletons: 1397 # pairs: 379

An experiment• Truth = 2005 Roewer et al. 7 locus database with

each of the 12738 duplicated the same, v. large number of times

• Data-base = random sample of size 100

• Case = a new random haplotype different from all in data-base

• Repeat this, 1000 times

• The DISCLAP model is not “true”, except as an ever better approximation with larger and larger number of subpopulations

• So it suffers from both further levels of uncertainty: “wrong model”; “unknown model parameters”

• The Good-Turing model is “true”, so only suffers from one level of uncertainty: “unknown model parameters”

• The database is very small (N = 100)

• We were “dumb users” - used DISCLAP out of the box, default parameters

The experiment is biased against DISCLAP

Performance of Good-Turing

2.6

2.3

3.6

y-axis: log10 LR; x-axis: replicate

11 Simulations N=100 17

Min 1st Qu. Median Mean 3rd Qu. Max sd

log10 ˆLRg 2.261 2.547 2.656 2.678 2.780 3.558 0.175log10 LRg 2.603 2.603 2.603 2.603 2.603 2.603 0

Tab. 1: log10 ˆLRg against the true value log10 LRg

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●●

●

●

●

●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●

●●

●

●

●●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●●

●

●●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●●

●●

●

●

●

●

●

●●●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●●

●

●

●

●

●

●

●●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

0 200 400 600 800 1000

2.4

2.6

2.8

3.0

3.2

3.4

3.6

Index

log10(LR

g_est)

(a)

●●

●

●

●●

●

●●

●●

●●

●

2.4

2.6

2.8

3.0

3.2

3.4

3.6

log10(LR

g_est)

(b)

Fig. 1: Scatter plot (s) and box plot (b) of the values of log10 ˆLRg. In blue the true value of log10 LRg

Histogram of y

log10(LRg_est)

Frequency

2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6

050

100

150

200

250

Fig. 2: Histogram of log10 ˆLRg. In blue the true values of log10 LRg

Performance of DISCLAP

estimated log10 LR true log10 LR

<< true log10 LR Good-Turing

Superimposed Histograms

x

Frequency

2 4 6 8 10

050

100

150

200

250

Comparison DISCLAP : Good-Turing

y-axis: estimated minus true log10 LR x-axis: 1000 replicates

●

●●●●●●●

●●●

●●●

●●

●●

●●

●●

●●●●●

●

●●●●●●

●●●●●●●●●●●●●●

●

●●

●●●●●●●●●●

●●●●

●●●●●●●●●●●

●

●

●

●●●●●●●

●●●●●●●

●●●

●●

●

●●●●●●●●

●●●

●●●

●●●

●

●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●

●●

●●●●

●

●

●●●●●●●●

●

●●

●●

●

●●●●●

●

●

●●

●●

●

●

●

●●●●

●●

●●

●●

●●●●●

●

●●

●●●●●

●●●

●

●●●●●●●●●●●●

●

●●●●●●●●●●●●

●

●

●

●●●●●●●●●●

●

●●●●●●●●

●

●●●

●

●

●●●●●●

●●●●

●●

●●

●●

●

●●●●●●●●●●

●

●●●●

●

●

●●

●

●●●

●

●●●●●●●●●

●●

●●●●●●●●

●

●●

●●

●

●●●●

●●

●●●

●

●

●●

●●●

●

●

●

●

●●

●

●

●●

●

●

●●●●●●

●●●

●●●

●●

●●

●

●●●●●

●

●●

●●●●●●●

●●

●

●

●●●●●

●

●●

●●

●

●

●

●

●●●

●●●●●●

●●

●

●

●

●●●

●●●●

●●●●

●

●

●●●●

●●

●

●●●●

●

●●●

●●●

●●

●

●

●●●

●●

●

●●●

●●

●

●●

●●●●

●

●●

●●●●●●●●

●●●●

●●

●

●●●●●●●●●●●

●●●

●●●●

●●●

●●●●●●●●●●

●

●

●●●

●

●

●●●

●

●●●●●●

●

●●●●●●●●

●●●●

●

●●●

●●●

●

●●

●●●

●●

●●●●●●●

●●●●●●●●●●

●

●●

●

●

●

●●●

●●●●●●●

●●

●●●

●●●●

●

●●

●

●

●

●●●●

●

●●●

●

●●

●

●

●●●●

●

●

●

●

●

●●

●●

●

●●

●

●

●

●

●●

●●

●●●●●●●●●

●

●

●

●●

●●●●

●●●●●●●●●●●●●●●●●●●●

●

●

●●●

●●

●●●●●

●

●

●

●

●●●●

●

●●

●

●●●

●

●●●●●

●●●●

●●

●●●●

●●

●●

●

●●●●●●

●●

●

●●

●●●●●●●●●

●

●●

●●●

●●

●●●●

●

●●

●●●●

●●

●●●●●●

●

●●●●

●●●

●

●●●●

●

●

●

●●●●●●

●●●●●●●●

●●●

●

●●

●

●

●

●●

●

●●●●●●

●●●●

●●●

●

●

●●

●●

●●

●

●

●●●●●●●●●

●

●

●

●●●

●●●●●●

●

●

●

●●

●

●

●●●

●●

●●●●

●●●●●●●●

●

●●●●●●●●●●

●●●●

●

●

●●

●

●●●

●

●●

●●

●

●●●●●

●

●

●

●

●●●●●●●●●●●

●●

●

●●

●

●

●●●●●●

●●

●●●●

●●

●

●

●

●

●

●●●

●

●●●●●●●●●

●●●

●

●

●●●●●

●

●●●●●

●

0 200 400 600 800 1000

02

46

Index

log10(Ratio_G

ill)●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●●

●●

●●

●●

●

●

●

●●

●●

●●●

●

●

●

●

●●●●

●●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●●●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●●

●

●●

●●

●

●

●

●

●●

●●

●●●●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●●

●

●

●

●●

●

●

●

●●

●

●●

●

●

●

●

●

●●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●●

●●●

●●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●●

●

●

●●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●●

●

●

●

●●

●

●

●

●

●●●

●●

●

●●

●●

●

●

●

●

●

●●

●

●●●●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●●

●●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●●

●●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●●

●●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●●●●

●

●

●

●

●●

●

●●●

●

●

●

●

●●

●

●

●

0 200 400 600 800 1000

02

46

Index

log10(Ratio_Andersen)

Conclusions (1)• Throwing away the haplotype information reduces “true”

weight of evidence (log10 LR) typically from about 3.5 to about 2.5

• Good-Turing is fairly unbiased and has rather small variance (typical error about 0.2, rarely 0.5)

• DISCLAP “out of the box” is biased in favour of the prosecution by an order of magnitude (error in log10 LR of about 1.0) and the errors are heavily skewed to the right

• The defence could therefore object to use of DISCLAP (in this way…) and their objection ought to be accepted

Conclusions (2)• Much more work to do …

• Publications in preparation (Giulia Cereda and me)

• For MLE work on spectrum (cf. Mao example…)

!

!

• Something completely different, for the cats/dogs here: Google “Ben Geen”

http://arxiv.org/abs/1312.1200 Estimating a probability mass function with unknown labels Dragi Anevski, Richard D. Gill, Stefan Zohren !(Building on Orlitsky et al., 2004)

http://arxiv.org/abs/1312.1200