Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing


Page 1

Ryan Reece
[email protected]
http://rreece.github.io/

Feb 16, 2018
Insight Data Science - AI Fellows Workshop

Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing

Page 2

Outline

1. Maximum likelihood estimators
2. Variance of MLE → Confidence intervals
3.
4. Efficiency example
5. A/B testing


Page 3

[Figure: ATLAS 2011 Higgs-search mass distributions, panels (a)–(l): data, total background, and expected signal; see caption below.]

FIG. 1. Invariant or transverse mass distributions for the selected candidate events, the total background, and the signal expected in the following channels: (a) H → γγ, (b) H → ZZ(*) → ℓ+ℓ−ℓ+ℓ− in the entire mass range, (c) H → ZZ(*) → ℓ+ℓ−ℓ+ℓ− in the low mass range, (d) H → ZZ → ℓ+ℓ−νν, (e) b-tagged selection and (f) untagged selection for H → ZZ → ℓ+ℓ−qq, (g) H → WW(*) → ℓ+νℓ−ν + 0 jets, (h) H → WW(*) → ℓ+νℓ−ν + 1 jet, (i) H → WW(*) → ℓ+νℓ−ν + 2 jets, (j) H → WW → ℓνqq′ + 0 jets, (k) H → WW → ℓνqq′ + 1 jet and (l) H → WW → ℓνqq′ + 2 jets. The H → WW(*) → ℓ+νℓ−ν + 2-jets distribution is shown before the final selection requirements are applied.

Is this significant?

[arXiv:1207.0319]

3 events, with a local p0 of ≈ 2%.

Statistical questions:

• How can we calculate the best-fit estimate of some parameter?

‣ Point estimation and confidence intervals

• How can we be precise and rigorous about how confident we are that a model is wrong?

‣ Hypothesis testing

Page 4

Confidence Intervals

• A frequentist confidence interval is constructed such that, given the model, if the experiment were repeated, each time creating an interval, 95% (or other CL) of the intervals would contain the true population parameter (i.e. the interval has ≈95% coverage).

‣ They can be one-sided exclusions, e.g. X > 100.2 at 95% CL

‣ Two-sided measurements, e.g. X = 125.1 ± 0.2 at 68% CL

‣ Contours in 2 or more parameters

• This is not the same as saying “There is a 95% probability that the true parameter is in my interval”. Any probability assigned to a parameter strictly involves a Bayesian prior probability.

• Bayes' theorem: P(Theory | Data) ∝ P(Data | Theory) P(Theory), where P(Data | Theory) is the "likelihood" and P(Theory) is the "prior".

χ² Fits and Confidence Contours

\[ \chi^2 = \sum_i \frac{(C_i - P_i)^2}{\sigma(C_i)^2 + \sigma(P_i)^2} + \chi^2(\text{search constraints}) + \sum_i \frac{(f^{\mathrm{SM,obs}}_i - f^{\mathrm{SM,fit}}_i)^2}{\sigma(f^{\mathrm{SM}}_i)^2} \]

[Plot: confidence contours in the (m0, m1/2) plane, with and without WMAP; arXiv:0808.4128v1]

We'll discuss how one goes from the statistic on the left to the plot on the right.


Page 5

Maximum likelihood method

Page 6

Maximum likelihood

Primer on the Maximum Likelihood Method

Consider a Gaussian distributed measurement:

\[ f_1(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( \frac{-(x-\mu)^2}{2\sigma^2} \right) \]

If we repeat the measurement, the joint PDF is just a product:

\[ f(\vec{x}|\mu,\sigma) = \prod_i f_1(x_i|\mu,\sigma) \]

The likelihood function is the same function as the PDF, only thought of as a function of the parameters, given the data. The experiment is over.

\[ L(\mu,\sigma|\vec{x}) = f(\vec{x}|\mu,\sigma) \]

The likelihood principle states that the best estimates of the true parameters are the values which maximize the likelihood.


Page 7

Maximum likelihood

It is often more convenient to consider the log likelihood, which has the same maximum:

\[ \ln L = \ln \prod_i f_1 = \sum_i \ln f_1 = \sum_i \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \]

Maximize:

\[ \left.\frac{\partial \ln L}{\partial \mu}\right|_{\hat\mu} = \sum_i \frac{x_i - \hat\mu}{\sigma^2} = 0 \;\Rightarrow\; \sum_i (x_i - \hat\mu) = 0 \;\Rightarrow\; \hat\mu = \frac{1}{N} \sum_i x_i = \bar{x} \]

Which agrees with our intuition that the best estimate of the mean of a Gaussian is the sample mean.

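As a quick numerical check (my addition, not in the original slides), one can maximize ln L directly and confirm that the MLE equals the sample mean; the data and sigma below are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0                          # assume sigma is known
x = rng.normal(loc=5.0, scale=sigma, size=1000)

def neg_log_likelihood(mu):
    # -ln L(mu | x) for a Gaussian with known sigma (constants kept for clarity)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

result = minimize_scalar(neg_log_likelihood)
print(result.x, x.mean())            # the numerical MLE matches the sample mean
```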

Page 8

Maximum likelihood

Note that in the case of a Gaussian PDF, maximizing likelihood is equivalent to minimizing χ²:

\[ \ln L = \sum_i \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \]

is maximized when

\[ \chi^2 = \sum_i \frac{(x_i - \mu)^2}{\sigma^2} \]

is minimized.

This was a simple example of what statisticians call point estimation. Now we would like to quantify our error on this estimate.


Page 9

Variance of MLEs

Page 10

Variance


(Recap of the Gaussian example above: µ is the "true parameter"; its Maximum Likelihood Estimator (MLE) is µ̂ = x̄.)


Assuming this model, how confident are we that µ̂ is close to µ? ➡ What is the variance of µ̂?

Page 11

Variance

One would think that if the likelihood function varies rather slowly near the peak, then there is a wide range of values of the parameters that are consistent with the data, and thus the estimate should have a large error.

To see the behavior of the likelihood function near the peak, consider the Taylor expansion of a general ln L of some parameter θ, near its maximum likelihood estimate θ̂:

\[ \ln L(\theta) = \ln L(\hat\theta) + \underbrace{\left.\frac{\partial \ln L}{\partial \theta}\right|_{\hat\theta}}_{=\,0} (\theta - \hat\theta) + \frac{1}{2!} \underbrace{\left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta}}_{-1/s^2} (\theta - \hat\theta)^2 + \cdots \]

Dropping the remaining terms would imply that

\[ L(\theta) = L(\hat\theta) \exp\!\left( \frac{-(\theta - \hat\theta)^2}{2 s^2} \right) \]

Note that ln L(θ) is then parabolic.


Page 12

Variance

So if this were a good approximation, we would expect that the variance of θ̂ would be given by

\[ V[\hat\theta] \equiv \sigma_{\hat\theta}^2 = s^2 = -\left( \left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right)^{-1} \]

It turns out that there is more truth to this than you would think, given by an important theorem in statistics, the Cramér-Rao inequality:

\[ V[\hat\theta] \;\ge\; \left( 1 + \frac{\partial b}{\partial \theta} \right)^{2} \bigg/ \; E\!\left[ -\left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right] \]

where b is the bias of the estimator. An estimator's efficiency is defined to measure to what extent this inequality is an equality:

\[ \varepsilon[\hat\theta] \;\equiv\; \frac{1 \Big/ E\!\left[ -\left.\dfrac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right]}{V[\hat\theta]} \]

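A small Monte Carlo illustration of efficiency (an aside, not from the slides): for the Gaussian mean, compare the empirical variance of µ̂ = x̄ across pseudo-experiments to the Cramér-Rao bound σ²/N; their ratio should be ≈ 1, since the sample mean is unbiased and fully efficient:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, N, n_experiments = 5.0, 2.0, 100, 20000

# the MLE of the mean in each pseudo-experiment is the sample mean
mu_hats = rng.normal(mu_true, sigma, size=(n_experiments, N)).mean(axis=1)

cramer_rao_bound = sigma**2 / N           # 1 / E[-d2 lnL / dmu2] = sigma^2 / N
print(cramer_rao_bound / mu_hats.var())   # efficiency: ~1.0
```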

Page 13

Variance

It can be shown that in the large sample limit, maximum likelihood estimators are unbiased and 100% efficient.

Therefore, in principle, one can calculate the variance of an ML estimator with

\[ V[\hat\theta] = -\left( E\!\left[ \left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right] \right)^{-1} \]

Calculating the expectation value would involve an analytic integration over the PDFs of all our possible measurements, or a Monte Carlo simulation of it. In practice, one usually uses the observed maximum likelihood estimate in place of the expectation:

\[ V[\hat\theta] = -\left( \left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right)^{-1} \]

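A sketch of the numerical route (continuing the Gaussian example; the finite-difference step is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0
x = rng.normal(5.0, sigma, size=1000)

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

mu_hat = x.mean()                    # MLE of the Gaussian mean
h = 1e-4                             # finite-difference step

# central second difference approximates d2 lnL / dmu2 at the MLE
d2lnL = (log_likelihood(mu_hat + h) - 2 * log_likelihood(mu_hat)
         + log_likelihood(mu_hat - h)) / h**2

print(-1.0 / d2lnL, sigma**2 / len(x))    # matches the analytic sigma^2 / N
```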

Page 14

Variance

Let's go back to our simple example of a Gaussian likelihood to test this method of calculating the ML estimator's variance.

\[ V[\hat\mu] = -\left( \left.\frac{\partial^2 \ln L}{\partial \mu^2}\right|_{\hat\mu} \right)^{-1} \]

\[ \frac{\partial^2 \ln L}{\partial \mu^2} = \frac{\partial^2}{\partial \mu^2} \sum_i \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right) = \frac{\partial}{\partial \mu} \sum_i \frac{x_i - \mu}{\sigma^2} = \sum_i \frac{-1}{\sigma^2} = \frac{-N}{\sigma^2} \]

\[ \Rightarrow\; V[\hat\mu] = \frac{\sigma^2}{N} \;\Rightarrow\; \sigma_{\hat\mu} = \frac{\sigma}{\sqrt{N}} \]

Which many of you will recognize as the proper error on the sample mean. If you are unfamiliar with it, we can actually derive it analytically in this case.


Page 15

Variance

Analytic variance of a Gaussian:

\[ V[\bar{x}] = E[\bar{x}^2] - \underbrace{E[\bar{x}]^2}_{\mu^2} = E\!\left[ \left( \frac{1}{N}\sum_i x_i \right)\!\left( \frac{1}{N}\sum_j x_j \right) \right] - \mu^2 \]

\[ = \frac{1}{N^2}\, E\!\left[ \sum_{i \ne j} x_i\, x_j + \sum_i x_i^2 \right] - \mu^2 = \frac{1}{N^2} \left( \sum_{i \ne j} \underbrace{E[x]^2}_{\mu^2} + \sum_i E[x^2] \right) - \mu^2 \]

To find E[x²], consider

\[ V[x] = \sigma^2 = E[x^2] - E[x]^2 = E[x^2] - \mu^2 \;\Rightarrow\; E[x^2] = \sigma^2 + \mu^2 \]


Page 16

Variance

\[ \Rightarrow\; V[\bar{x}] = \frac{1}{N^2}\left( \sum_{i \ne j} \mu^2 + \sum_i (\sigma^2 + \mu^2) \right) - \mu^2 = \frac{1}{N^2}\left[ (N^2 - N)\mu^2 + N(\sigma^2 + \mu^2) \right] - \mu^2 = \frac{\sigma^2}{N} \]

Which verifies the result we got from calculating derivatives of the likelihood function:

\[ V[\hat\theta] = -\left( \left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta} \right)^{-1} \]

In practice, one usually doesn't calculate this analytically, but instead:

• calculates the derivatives numerically, or
• uses the Δln L or Δχ² method, described now.


Page 17


Variance by Δln L

Back to our Taylor expansion of ln L:

\[ \ln L(\theta) = \ln L(\hat\theta) + \frac{1}{2!} \underbrace{\left.\frac{\partial^2 \ln L}{\partial \theta^2}\right|_{\hat\theta}}_{-1/\sigma_{\hat\theta}^2} (\theta - \hat\theta)^2 + \cdots \]

Let \( \Delta \ln L(\theta) \equiv \ln L(\theta) - \ln L(\hat\theta) \), so that

\[ \Delta \ln L(\theta) \simeq \frac{-(\theta - \hat\theta)^2}{2 \sigma_{\hat\theta}^2} \]

Evaluating at \( \theta \to \hat\theta \pm n\,\sigma_{\hat\theta} \):

\[ \Delta \ln L(\hat\theta \pm n\,\sigma_{\hat\theta}) = \frac{-(\pm n\,\sigma_{\hat\theta})^2}{2 \sigma_{\hat\theta}^2} = \frac{-n^2}{2} \]


Page 18


Variance by Δln L

\[ \Delta \ln L(\hat\theta \pm n\,\sigma_{\hat\theta}) = \frac{-n^2}{2} \]

This is the most common definition of the 68% and 95% confidence intervals:

68% / 1σ_θ̂: Δln L = −1/2
95% / 2σ_θ̂: Δln L = −2

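A minimal sketch of the Δln L method in code (my own illustration on the Gaussian example): scan the parameter and read off the 68% interval where Δln L crosses −1/2:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
x = rng.normal(5.0, sigma, size=400)

def lnL(mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

mu_hat = x.mean()
mus = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 2001)
delta_lnL = np.array([lnL(m) for m in mus]) - lnL(mu_hat)

inside = mus[delta_lnL >= -0.5]      # 68% interval: Delta lnL >= -1/2
print(inside.min(), inside.max())    # ~ mu_hat +/- sigma/sqrt(N) = mu_hat +/- 0.1
```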

Page 19


Variance by Δχ²

Recall that in the case that the PDF is Gaussian, ln L is just the χ² statistic:

\[ \ln L = \frac{-\chi^2}{2}, \qquad \chi^2 = \sum \frac{(x - \theta)^2}{\sigma^2} \]

\[ \Delta \ln L(\hat\theta \pm n\,\sigma_{\hat\theta}) = \ln L(\hat\theta \pm n\,\sigma_{\hat\theta}) - \ln L_{\max} = \frac{-n^2}{2} \]

\[ \Rightarrow\; -\frac{1}{2}\left( \chi^2(\hat\theta \pm n\,\sigma_{\hat\theta}) - \chi^2_{\min} \right) = \frac{-n^2}{2} \]

\[ \Delta\chi^2(\hat\theta \pm n\,\sigma_{\hat\theta}) = n^2 \]

68% / 1σ_θ̂: Δχ² = 1
95% / 2σ_θ̂: Δχ² = 4 (3.84 for exactly 95%)

Page 20


Variance by Δχ²: multi-dimensional case

Values of Q_c = Δχ² giving coverage c for n parameters:

c       n=1    n=2    n=3
0.683   1.00   2.30   3.53
0.95    3.84   5.99   7.82
0.99    6.63   9.21   11.3
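The table can be reproduced with scipy (an aside, not in the slides), since Q_c is just the chi-square quantile with n degrees of freedom:

```python
from scipy.stats import chi2

# Q_c = chi-square quantile (ppf) for coverage c with n degrees of freedom
for c in (0.683, 0.95, 0.99):
    print(c, [round(chi2.ppf(c, df=n), 2) for n in (1, 2, 3)])
# reproduces the Q_c table above (up to rounding)
```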

A note about confidence contours/intervals

A 95% confidence contour doesn't mean that the true value of the parameter is in the contour with 95% probability. That would be a Bayesian probability (a probability of belief). It means that if the model is correct, we have properly estimated our errors, and if we were to repeat the experiment over and over again, each time creating a new contour, then the contour would cover the true parameter in 95% of the experiments (a frequentist probability).




Page 21

Example: measuring an efficiency & A/B testing

Page 22

Efficiency

(embedded slide from Louise Heelan, "Calc Efficiency Uncertainties": Bayesian statistics, mode vs. mean)

Out of n trials I measure k "conversions". Estimate the conversion rate and its precision/confidence. Without a precision, we cannot know if observed changes are significant.

Page 23

Efficiency

Measuring occupancy ∼ measuring a coin's fairness

Binomial distribution:

\[ f(k; n, p) = \binom{n}{k}\, p^k\, (1-p)^{n-k} \]

\[ \sigma_k^2 = n\, p\, (1-p) \]

\[ \sigma_\varepsilon^2 = \left( \frac{\sigma_k}{n} \right)^{2} = \frac{p\,(1-p)}{n} \]

\[ p \approx 0.01 \;\Rightarrow\; \sigma_\varepsilon^2 = \frac{(0.01)(0.99)}{n} \approx \frac{0.01}{n} \;\Rightarrow\; \sigma_\varepsilon = \frac{1}{10\sqrt{n}} \]

MLE: \( \hat\varepsilon = k/n \)

Number of events needed for some precision (for p = 0.01):

n         σ_ε      σ_ε/p
400       0.0050   50%
1,000     0.0032   32%
2,000     0.0022   22%
10,000    0.0010   10%
10^6      0.0001   1%
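A small numerical companion (my addition): the binomial MLE and its Gaussian-approximation uncertainty, reproducing the n = 1,000 row of the table:

```python
import numpy as np

def conversion_rate(k, n):
    """Binomial MLE of the conversion rate and its (Gaussian-approx) uncertainty."""
    eps = k / n                              # MLE: eps_hat = k / n
    sigma = np.sqrt(eps * (1 - eps) / n)     # sigma_eps = sqrt(p(1-p)/n)
    return eps, sigma

eps, sigma = conversion_rate(k=10, n=1000)   # p ~ 0.01 with n = 1000 trials
print(eps, sigma, sigma / eps)               # ~0.01, ~0.0032, ~32%
```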


Page 24

A/B testing

Example taken from: https://www.blitzresults.com/en/ab-tests/


Data like a 1-dim, 4-bin histogram (converted / not converted, for A and B).

Assume the conversion rate is unchanged from A to B:

\[ \hat\varepsilon = \frac{N'_A}{N_A} \simeq \frac{N'_B}{N_B} \simeq \frac{N'_A + N'_B}{N_A + N_B} \]

Construct

\[ \chi^2 = \frac{(N'_A - N_A\,\hat\varepsilon)^2}{N_A\,\hat\varepsilon} + \frac{\left( (N_A - N'_A) - N_A\,(1 - \hat\varepsilon) \right)^2}{N_A\,(1 - \hat\varepsilon)} + (A \to B) \]
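A sketch of the computation (the visitor counts below are hypothetical; the slide's actual numbers come from an image not captured in this transcript):

```python
def ab_chi2(NA, NA_conv, NB, NB_conv):
    """4-term chi^2 comparing A and B against a shared conversion rate."""
    eps = (NA_conv + NB_conv) / (NA + NB)    # pooled MLE under the null
    chi2 = 0.0
    for n, k in ((NA, NA_conv), (NB, NB_conv)):
        chi2 += (k - n * eps)**2 / (n * eps)                    # converted bin
        chi2 += ((n - k) - n * (1 - eps))**2 / (n * (1 - eps))  # not-converted bin
    return chi2

# hypothetical counts: 5% vs 7% conversion on 2000 visitors each
print(ab_chi2(NA=2000, NA_conv=100, NB=2000, NB_conv=140))
```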

Page 25

A/B testing

Plugging the above values in gives: χ² = 7.52.

Remember:
1σ (68%): Δχ² = 1
2σ (95%): Δχ² = 3.84
3σ (99%): Δχ² = 6.63

7.52 > 6.63 → >99% CL → significant improvement in the B model.
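The corresponding p-value can be checked directly (an aside; there is 1 degree of freedom here, since the pooled conversion rate is fitted):

```python
from scipy.stats import chi2

print(chi2.sf(7.52, df=1))    # survival function = 1 - CDF: ~0.006, beyond 99% CL
```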

Page 26

Hypothesis test

Page 27

Take-aways

• MLEs are the best.
• Don't just calculate the MLE, find its variance!
• Quantify significance with a confidence interval.
• Under many common assumptions, ln L is approximately parabolic, and one can calculate Δln L or Δχ² to determine confidence intervals/contours.
• In the simplest case of measuring an efficiency, A/B testing amounts to a 4-term χ² that can be calculated by hand.
• If the Δχ² is large, the change is significant!


Page 28

Back up slides

Page 29

Efficiency: Results

Other examples with numbers (from Louise Heelan, "Calc Efficiency Uncertainties"):

• example of different techniques for my top trigger analysis (105200, r635, ≈30k events)
• two examples of a particular bin (low and high statistics)

Method     Numerator   Denominator   Mean (Mode)        Variance   Uncertainty σ
Poisson    1           45            0.0222             0.00050    0.02246
Binomial   1           45            0.0222             0.00048    0.02197
Bayesian   1           45            0.04255 (0.0222)   0.00085    0.02913

Method     Numerator   Denominator   Mean (Mode)        Variance   Uncertainty σ
Poisson    100         106           0.9433             0.01729    0.13151
Binomial   100         106           0.9433             0.00050    0.02244
Bayesian   100         106           0.9352 (0.9433)    0.00056    0.02358
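The Binomial and Bayesian rows can be reproduced as follows (a sketch assuming a flat prior, so the posterior is Beta(k+1, n−k+1); the Bayesian mean, mode, and variance then follow from that distribution):

```python
from scipy.stats import beta

def binomial_row(k, n):
    eps = k / n                              # MLE
    return eps, eps * (1 - eps) / n          # mean, variance

def bayesian_row(k, n):
    post = beta(k + 1, n - k + 1)            # flat-prior posterior
    mode = k / n                             # mode of Beta(k+1, n-k+1)
    return post.mean(), mode, post.var()

print(binomial_row(1, 45))    # ~(0.0222, 0.00048): matches the Binomial row
print(bayesian_row(1, 45))    # ~(0.0426, 0.0222, 0.00085): matches the Bayesian row
```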

Page 30

Hypothesis testing

• Null hypothesis, H0: the SM
• Alternative hypothesis, H1: some new physics
• Type-I error: false positive rate (α)
• Type-II error: false negative rate (β)
• Power: 1 − β
• Want to maximize power for a fixed false positive rate
• Particle physics has a tradition of claiming discovery at 5σ ⇒ p0 = 2.9×10⁻⁷ ≈ 1 in 3.5 million, and presents exclusion with p0 = 5% (95% CL "coverage").
• Neyman-Pearson lemma (1933): the most powerful test for fixed α is the likelihood ratio:

The Neyman-Pearson Lemma (slide: Kyle Cranmer)

The region W that minimizes the probability of wrongly accepting H0 is just a contour of the likelihood ratio:

\[ \frac{L(x|H_0)}{L(x|H_1)} > k_\alpha \]

This is the goal! The problem is we don't have access to L(x|H0) & L(x|H1).
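A toy illustration (my own, for two fully specified Gaussian hypotheses, where, unlike the realistic case just described, both likelihoods are known): calibrate k_α by Monte Carlo under H0, then apply the likelihood-ratio test:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def lr(x):
    # likelihood ratio L(x|H0) / L(x|H1) for H0: mu=0 vs H1: mu=1 (sigma=1)
    return norm.pdf(x, 0, 1).prod(axis=-1) / norm.pdf(x, 1, 1).prod(axis=-1)

# calibrate k_alpha under H0 so that P(reject H0 | H0) = alpha = 0.05
toys_h0 = rng.normal(0, 1, size=(100_000, 10))
k_alpha = np.quantile(lr(toys_h0), 0.05)     # reject when LR < k_alpha

x_obs = rng.normal(1, 1, size=10)            # pretend data actually drawn from H1
print(lr(x_obs) < k_alpha)                   # True -> reject H0
```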

Prediction via Monte Carlo simulation (slide: Kyle Cranmer): the enormous detectors are still being constructed, but we have detailed simulations of the detectors' response. The advancements in theoretical predictions, detector simulation, tracking, calorimetry, triggering, and computing set the bar high for equivalent advances in our statistical treatment of the data.

Page 31

Power & Significance

Page 32


Variance by Δln L

What if L(θ) is not Gaussian, i.e. ln L(θ) is not parabolic?

Likelihood functions have an invariance property, such that if g(x) is a monotonic function, then the maximum likelihood estimate of g(θ) is g(θ̂). In principle, one can find a change-of-variables function g(θ) for which ln L(g(θ)) is parabolic as a function of g(θ). Therefore, using the invariance of the likelihood function, one can make inferences about a parameter of a non-Gaussian likelihood function without actually finding such a transformation [James p. 234].


Page 33

Systematics

from: http://philosophy-in-figures.tumblr.com/