TRANSCRIPT
Statistical methods in LHC data analysis, part I.2
Luca Lista, INFN Napoli
Luca Lista Statistical methods in LHC data analysis 2
Contents
• Hypothesis testing
• Neyman-Pearson lemma and likelihood ratio
• Multivariate analysis (elements)
• Chi-square fits and goodness-of-fit
• Confidence intervals
• Feldman-Cousins ordering
Multivariate discrimination
• The problem:
– A signal and a background are characterized by n variables with different distributions for the two cases: signal and background
• Generalization to more than two classes is a straightforward extension
– Given a measurement (event) of n variables with some discriminating power, (x1, …, xn), identify (discriminate) the event as signal or background
• Properties of a discriminator:
– Selection efficiency: probability of the right answer
– Misidentification probability (for background)
– Purity: fraction of signal in a positively identified sample
• Depends on the signal and background composition! It is not a property of the discriminator alone
– Fake rate: fraction of background in a positively identified sample, = 1 − purity
Test of hypothesis: terminology
• Naming used by statisticians is usually less natural for physics applications than the previous slide
• H0 = null hypothesis
– E.g.: a sample contains only background; a particle is a pion; etc.
• H1 = alternative hypothesis
– E.g.: a sample contains background + signal; a particle is a muon; etc.
• α = significance level: probability to reject H0 if it is true (error of the first kind)
– With H0 = background, α = efficiency for background (mis-id probability)
• β = probability to accept H0 if H1 is true (error of the second kind)
– 1 − β = efficiency for signal
Cut analysis
• Cut on one (or more) variables:
– If x > xcut → signal
– Else → background
[Figure: signal and background distributions vs x with a cut at xcut; shaded areas show the selection efficiency (1 − β) and the mis-id probability (α)]
Efficiency vs mis-id
• Varying the cut, both the efficiency and the mis-id probability change
[Figure: efficiency vs mis-id probability (ROC curve), both axes from 0 to 1]
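The efficiency vs mis-id curve can be traced numerically by scanning the cut value. A minimal sketch, assuming hypothetical Gaussian-distributed signal (mean +1) and background (mean −1) samples of a single discriminating variable:

```python
import numpy as np

# Hypothetical 1-D discriminating variable: signal peaks at +1, background at -1
rng = np.random.default_rng(42)
sig = rng.normal(+1.0, 1.0, 100_000)
bkg = rng.normal(-1.0, 1.0, 100_000)

cuts = np.linspace(-4.0, 4.0, 81)
eff = np.array([(sig > c).mean() for c in cuts])    # signal efficiency (1 - beta)
misid = np.array([(bkg > c).mean() for c in cuts])  # background mis-id (alpha)
```

Plotting `eff` against `misid` for all cut values gives the curve sketched above; tightening the cut lowers both quantities together.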
Variations on cut analyses
• Cut on multiple variables:
– AND/OR of single cuts
• Multi-dimensional cuts:
– Linear cuts
– Piece-wise linear cuts
– Non-linear combinations
• At some point it becomes hard to find optimal cut values, or too many cuts are required
– How to determine the cuts? By looking at control samples
– Control samples could be MC, or selected data decays
– Note: cut selection must be done a priori, before looking at the data, to avoid biases!
Straight cuts or something else?
• Straight cuts may not be optimal in all cases
Neyman-Pearson lemma
• Fixing the signal efficiency (1 − β), a selection based on the likelihood ratio gives the lowest possible mis-id probability (α):
λ(x) = L(x|H1) / L(x|H0) > k
• If we can't use the likelihood ratio, we can choose other discriminators, or "test statistics"
• A test statistic is any function of x (like λ(x)) that allows one to discriminate between the two hypotheses
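For two simple hypotheses the likelihood ratio can be written down directly. A minimal sketch with hypothetical equal-width Gaussians (signal mean +1, background mean −1), for which λ(x) is monotone in x, so λ(x) > k is equivalent to a simple cut on x:

```python
import math

def gauss(x, mu, sigma=1.0):
    # Gaussian PDF
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lam(x):
    # lambda(x) = L(x|H1) / L(x|H0); hypothetical means +1 (signal), -1 (background)
    # for equal widths this reduces to exp(2x): monotone, i.e. a cut on x itself
    return gauss(x, +1.0) / gauss(x, -1.0)
```

Monotonicity is why, in this simple case, the straight cut of the previous slides is already optimal; with correlated or multi-modal PDFs the likelihood-ratio contour is no longer a straight cut.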
Likelihood ratio discriminator
• We take the ratio of the likelihoods defined under the two hypotheses:
Q(x) = L(x|H1) / L(x|H0)
• Q may also depend on a number of unknown parameters (θ1, …, θN)
• Best discriminator, if the multi-dimensional likelihood is perfectly known (Neyman-Pearson lemma)
• Great effort goes into getting the correct ratio
– E.g.: Matrix Element Techniques for the top mass and single top at the Tevatron
Likelihood factorization
• We take the ratio of the likelihoods defined under the two hypotheses, assuming the PDF factorizes as a product of 1-D PDFs:
λ(x) = ∏i fi(xi|H1) / ∏i fi(xi|H0), if the PDF can be factorized into independent components
• Approximate in the case of a non-perfectly factorized PDF
– E.g.: correlations among the variables
• A rotation or other judicious transformation in the variables' space may be used to remove the correlation
– Sometimes even a different one for the s and b hypotheses
Building projective PDFs
• PDFs for the likelihood discriminator
– If the variables are not uncorrelated, uncorrelated variables need to be found first, otherwise the plain PDF product is suboptimal
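A projective likelihood discriminator can be sketched as a product of 1-D PDFs per variable. The Gaussian shapes and their parameters below are hypothetical placeholders for the projective PDFs one would build from control samples:

```python
import math

def g(x, mu, sigma):
    # 1-D Gaussian PDF, standing in for a projective PDF built from control samples
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical per-variable (mean, sigma) under the two hypotheses
SIG = [(+1.0, 1.0), (+0.5, 0.8)]
BKG = [(-1.0, 1.0), (-0.5, 0.8)]

def projective_lr(xs):
    """lambda(x) = prod_i f_i(x_i|s) / prod_i f_i(x_i|b); exact only if the x_i are independent."""
    num = den = 1.0
    for x, (ms, ss), (mb, sb) in zip(xs, SIG, BKG):
        num *= g(x, ms, ss)
        den *= g(x, mb, sb)
    return num / den

def output(xs):
    # common normalized form L = lambda / (1 + lambda), bounded in [0, 1]
    l = projective_lr(xs)
    return l / (1.0 + l)
```

The normalized form in [0, 1] is the quantity one typically histograms and cuts on (e.g. L > 0.5).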
Likelihood ratio output
• Good separation achieved in this case
[Figure: likelihood-ratio output distributions for signal and background (TMVA); selection cut at L > 0.5]
Fisher discriminator
• Combines a number of variables into a single discriminator
• Equivalent to projecting the distributions along a line
• Uses the linear combination of inputs that maximizes the distance between the means of the two classes while minimizing the variance within each class
• The maximization problem can be solved with linear algebra
Sir Ronald Aylmer Fisher (1890-1962)
Rewriting the Fisher discriminant
• m1, m2 are the two samples' mean vectors; Σ1, Σ2 are the two samples' covariance matrices
• Transform with a linear vector of coefficients w
– w is normal to the discriminating hyperplane
• SB = (m1 − m2)(m1 − m2)ᵀ: "between-classes scatter matrix"
• SW = Σ1 + Σ2: "within-classes scatter matrix"
Maximizing the Fisher discriminant
• Either compute the derivatives of J(w) = (wᵀ SB w)/(wᵀ SW w) w.r.t. the wi and set them to zero,
• or, equivalently, solve the eigenvalue problem SW⁻¹ SB w = λw, whose solution is w ∝ SW⁻¹ (m1 − m2)
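The linear-algebra solution w ∝ SW⁻¹(m1 − m2) is a few lines in practice. A minimal sketch on hypothetical correlated-Gaussian samples for the two classes:

```python
import numpy as np

# Hypothetical two-class toy data with correlated variables
rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
A = rng.multivariate_normal([+1.0, +1.0], cov, size=5000)  # class 1
B = rng.multivariate_normal([-1.0, -1.0], cov, size=5000)  # class 2

m1, m2 = A.mean(axis=0), B.mean(axis=0)
Sw = np.cov(A.T) + np.cov(B.T)      # within-classes scatter matrix
w = np.linalg.solve(Sw, m1 - m2)    # w ~ Sw^{-1} (m1 - m2)

# Fisher output: projection of each event onto w
fisher_A, fisher_B = A @ w, B @ w
```

Cutting on the projected value (e.g. F > 0 here, by symmetry of the toy) reproduces the single-discriminator selection described above.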
Fisher in the previous example
• Not always optimal: it's a linear cut, after all!
[Figure: Fisher output distributions for signal and background; selection cut at F > 0]
Other discriminator methods
• Artificial Neural Networks
• Boosted Decision Trees
• These topics are beyond the scope of this tutorial
– A brief sketch will be given just for completeness
• More details in the TMVA package
– http://tmva.sourceforge.net/
Artificial Neural Networks
• Artificial simplified model of how neurons work
[Figure: feed-forward network: input layer (x0, x1, x2, …, xp), hidden layers connected by weights wij, output layer y; Φ = activation function]
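The feed-forward structure in the diagram can be sketched as a single-hidden-layer forward pass. The weights below are hypothetical placeholders (in practice they are obtained by training):

```python
import math

def sigmoid(z):
    # a common choice of activation function phi
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Single hidden layer: y = phi( sum_j w2_j * phi( sum_i w1_ji * x_i + b1_j ) + b2 )."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
```

Usage with arbitrary (untrained) weights, e.g. `forward([0.5, -1.0], [[1.0, 2.0], [0.5, -1.0]], [0.0, 0.0], [1.0, -1.0], 0.0)`, returns a value in (0, 1) that plays the role of the network discriminator output y.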
Networks vs other discriminators
• An artificial neural network with a single hidden layer may approximate any analytical function within a given approximation if the number of neurons is sufficiently high
• Adding more hidden layers can make the approximation more efficient
– i.e.: smaller total number of neurons
• Demonstration in:
– H. N. Mhaskar, Neural Networks for Optimal Approximation of Smooth and Analytic Functions, Neural Computation, Vol. 8, No. 1, pp. 164-177 (1996):
"We prove that neural networks with a single hidden layer are capable of providing an optimal order of approximation for functions assumed to possess a given number of derivatives, if the activation function evaluated by each principal element satisfies certain technical conditions"
(Boosted) Decision Trees
• Select as usual a set of discriminating variables
• Progressively split the sample according to subsequent cuts on single discriminating variables
• Optimize the splitting cuts in order to obtain the best signal/background separation
• Repeat the splitting until the sample contains mostly signal or mostly background, or the statistics of the split samples is too low to continue
• Many different trees need to be combined for a robust and effective discrimination ("forest")
[Figure: decision tree with successive branchings ending in leaves]
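The core operation of tree growing, optimizing a single splitting cut, can be sketched with the Gini impurity as separation criterion (one common choice; TMVA offers several):

```python
def gini(p):
    # impurity of a node with signal fraction p; zero for a pure node
    return 2.0 * p * (1.0 - p)

def best_split(xs, labels):
    """Scan candidate cut values on one variable; return the cut minimizing
    the size-weighted Gini impurity of the two daughter nodes."""
    best_cut, best_score = None, float("inf")
    for cut in sorted(set(xs)):
        left = [l for x, l in zip(xs, labels) if x < cut]
        right = [l for x, l in zip(xs, labels) if x >= cut]
        if not left or not right:
            continue
        score = sum(len(s) * gini(sum(s) / len(s)) for s in (left, right)) / len(xs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score
```

Recursing on each daughter node until the stopping conditions above are met grows one tree; boosting then reweights misclassified events and grows the next tree of the forest.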
A strongly non-linear case
[Figure: signal and background distributions in the (x, y) plane]
Classifiers separation
[Figure: output distributions for the Fisher, projective likelihood ratio, neural network, and BDT classifiers]
Cutting on classifiers output (I)
[Figure: selected regions in the (x, y) plane for Fisher > 0 and L > 0.5]
Cutting on classifiers output (II)
[Figure: selected regions in the (x, y) plane for NN > 0 and BDT > 0]
Jerzy Neyman's confidence intervals
• Scan an unknown parameter θ
• Given θ, compute the interval [x1, x2] that contains x with probability C.L. = 1 − α
• An ordering rule is needed!
• Invert the confidence belt, and find the interval [θ1, θ2] for a given experimental outcome of x
• A fraction 1 − α of the experiments will produce x such that the corresponding interval [θ1, θ2] contains the true value of θ (coverage probability)
• Note that the random variables are [θ1, θ2], not θ
From the PDG statistics review
RooStats::NeymanConstruction
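The belt construction and inversion can be sketched numerically. This minimal example assumes a Gaussian measurement with known σ and the central-interval ordering rule (the upper-limit ordering of the next slide would change only `acceptance`):

```python
import math

def gauss_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_ppf(p):
    # invert the standard Gaussian CDF by bisection
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gauss_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def acceptance(mu, cl=0.90, sigma=1.0):
    """Central ordering: [x1, x2] with probability (1-CL)/2 in each tail."""
    z = gauss_ppf(0.5 * (1.0 + cl))
    return mu - z * sigma, mu + z * sigma

def invert_belt(x_obs, cl=0.90, sigma=1.0):
    """Scan mu on a grid; keep values whose acceptance interval contains x_obs."""
    kept = []
    for i in range(-1000, 1001):
        mu = i / 100.0
        x1, x2 = acceptance(mu, cl, sigma)
        if x1 <= x_obs <= x2:
            kept.append(mu)
    return min(kept), max(kept)
```

For the Gaussian case the inversion simply returns [x − 1.645σ, x + 1.645σ] at 90% C.L., but the same scan works for any PDF where no closed form exists.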
Ordering rule
• Different choices of the interval giving the same probability 1 − α are possible
• For fixed θ = θ0 we can have different choices
[Figure: f(x|θ0) with two interval choices: upper-limit choice (area α in one tail) and central interval (area α/2 in each tail)]
Feldman-Cousins ordering
• Find the contour of the likelihood ratio that gives an area 1 − α:
R = {x : f(x|θ) / f(x|θbest(x)) > k}
[Figure: f(x|θ0) and the ratio f(x|θ0)/f(x|θbest(x)); the acceptance region contains the x values with the highest ratio]
RooStats::FeldmanCousins
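The ordering rule can be sketched numerically for the classic case of a Gaussian measurement of a parameter bounded by μ ≥ 0 (the example of the following slides); grid range and spacing below are arbitrary numerical choices:

```python
import math

def f(x, mu):
    # Gaussian PDF with sigma = 1
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def fc_acceptance(mu, cl=0.90, lo=-6.0, hi=8.0, n=14001):
    """Feldman-Cousins acceptance region for a Gaussian measurement with mu >= 0.
    Grid points are ranked by R(x) = f(x|mu) / f(x|mu_best), mu_best = max(x, 0),
    and accumulated until the desired probability content is reached."""
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    ranked = sorted(xs, key=lambda x: -(f(x, mu) / f(x, max(x, 0.0))))
    region, p = [], 0.0
    for x in ranked:
        region.append(x)
        p += f(x, mu) * dx
        if p >= cl:
            break
    return min(region), max(region)
```

Far from the boundary (μ ≫ 0) the region reduces to the usual central interval μ ± 1.645; near μ = 0 the region becomes one-sided, which is what removes the flip-flopping problem of the next slide.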
"Flip-flopping"
• When to quote a central value or an upper limit?
• E.g.:
– "Quote a 90% C.L. upper limit if the measurement is below 3σ; quote a central value otherwise"
• Upper limit vs. central interval decided according to the observed data
• This produces incorrect coverage!
• Feldman-Cousins interval ordering guarantees the correct coverage
"Flip-flopping" with a Gaussian PDF
• Assume a Gaussian with fixed width: σ = 1
[Figure: confidence belt in the (x, μ) plane: 90% upper limit μ < x + 1.28155 quoted below the threshold, 90% central interval μ = x ± 1.64485 above it; the tail probabilities (5%, 10%) mix where the two prescriptions join]
From Feldman and Cousins' paper
Coverage is 85% for low μ!
Feldman-Cousins approach
• Define the acceptance range such that:
– P(x|μ) / P(x|μbest(x)) > k
• For a Gaussian with the physical bound μ ≥ 0: μbest = max(x, 0), i.e. μbest = x for x ≥ 0
• The solution can be found numerically
[Figure: resulting belt, giving upper limits at low x, asymmetric errors in the transition region, and the usual symmetric errors at large x]
• Will see more when talking about upper limits…
Binomial confidence interval
• Using the proper Neyman belt inversion, e.g. the Feldman-Cousins method, avoids odd problems, like null errors when estimating efficiencies equal to 0 or 1, which would occur using the central-limit formula:
σε ≈ √(ε(1 − ε)/N)
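The pathology and its cure can be sketched with the classic exact (Clopper-Pearson) belt inversion, shown here instead of the Feldman-Cousins construction for brevity; both avoid the null-error problem:

```python
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, cl=0.90):
    """Exact central binomial interval from Neyman belt inversion."""
    a = 0.5 * (1.0 - cl)
    def bisect(pred):
        lo, hi = 0.0, 1.0
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    low = 0.0 if k == 0 else bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p) < a)
    high = 1.0 if n == k else bisect(lambda p: binom_cdf(k, n, p) >= a)
    return low, high

# Central-limit formula at eps = 1: null error, no matter the sample size
eps_hat, n = 1.0, 10
wald_err = (eps_hat * (1.0 - eps_hat) / n) ** 0.5
```

For 10 successes out of 10 trials the central-limit error is exactly zero, while the belt inversion correctly gives a non-trivial lower bound (p^10 = 0.05, i.e. ≈ 0.74 at 90% C.L.).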
Binned fits: minimum χ²
• Bin entries can be approximated as Gaussian for a sufficiently large number of entries, with r.m.s. equal to √ni (Neyman's χ²):
χ² = Σi (ni − μi)² / ni
• The expected number of entries μi is often approximated by the value of a continuous function f at the center xi of the bin: μi = f(xi; θ1, …, θm)
• The denominator ni could be replaced by μi = f(xi; θ1, …, θm) (Pearson's χ²)
• Usually simpler to implement than unbinned ML fits
• Analytic solutions exist for linear and other simple problems
• Unbinned ML fits are impractical for large sample sizes
• Binned fits can give poor results for a small number of entries
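The poor small-sample behavior of the Neyman χ² can be seen in the simplest case, fitting a constant μ to the bin contents, where the minimum has a closed form (a sketch with hypothetical bin counts):

```python
def neyman_chi2(counts, mu):
    """Neyman's chi2 for a constant model mu_i = mu: sum_i (n_i - mu)^2 / n_i."""
    return sum((n - mu) ** 2 / n for n in counts)

def fit_constant(counts):
    """Analytic minimum of the Neyman chi2 for a constant model: the harmonic
    mean of the counts, which sits below the arithmetic mean because downward
    fluctuations get smaller denominators -- the small-count bias."""
    return len(counts) / sum(1.0 / n for n in counts)
```

For bins with 9 and 11 entries the fit returns 9.9 rather than 10: each downward fluctuation is over-weighted, which is one motivation for the Pearson variant and for the binned likelihood of the following slides.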
Fit quality
• The value of the maximum likelihood obtained in a fit, compared with its expected distribution, doesn't give any information about the goodness of the fit
• Chi-square test:
– The χ² of a fit with a Gaussian underlying model should be distributed according to a known PDF (the χ² distribution, where n is the number of degrees of freedom)
– Sometimes this is not the case, if the model can't be sufficiently well approximated by a Gaussian
– The integral of the right-most tail (P(χ² > X) = p) is one example of a so-called 'p-value'
• Beware! p-values are not the "probability of the fit hypothesis"
– That would be a Bayesian probability, with a different meaning, and should be computed in a different way (→ next lecture)!
Binned likelihood
• Assume our sample is a binned histogram from an event-counting experiment (obeying Poissonian statistics), with no need for a Gaussian approximation
• We can build a likelihood function multiplying Poisson distributions for the numbers of entries in each bin, {ni}, with expected numbers of entries depending on some unknown parameters: μi(θ1, …, θk)
• We can then minimize −2 ln L
Binned likelihood ratio
• A better alternative to the (Gaussian-inspired, Neyman's and Pearson's) χ² has been proposed by Baker and Cousins using the likelihood ratio:
χ²λ = 2 Σi [μi(θ) − ni + ni ln(ni / μi(θ))]
• Same minimum as on the previous slide, since only a constant term has been added to the log-likelihood
• It also provides goodness-of-fit information, and asymptotically obeys a chi-squared distribution with n − k degrees of freedom (Wilks' theorem)
S. Baker and R. Cousins, Clarification of the use of chi-square and likelihood functions in fits to histograms, Nucl. Instr. Meth. 221 (1984) 437
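The Baker-Cousins statistic is straightforward to compute from the bin contents and the model expectations (a minimal sketch; the n ln n term is dropped for empty bins, where the Poisson likelihood ratio has no such term):

```python
import math

def baker_cousins_chi2(counts, expected):
    """chi2_lambda = 2 * sum_i [ mu_i - n_i + n_i * ln(n_i / mu_i) ]."""
    total = 0.0
    for n, mu in zip(counts, expected):
        total += mu - n
        if n > 0:
            total += n * math.log(n / mu)
    return 2.0 * total
```

It vanishes for a perfect fit, and for large counts it lies between the Pearson and Neyman values (e.g. for n = 100, μ = 110 it gives ≈ 0.94, vs 0.91 Pearson and 1.00 Neyman), while remaining well defined for low-count bins.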
Combining measurements with χ²
• Two measurements of the same quantity m, with different uncorrelated (Gaussian) errors: m1 ± σ1, m2 ± σ2
• Build the χ²:
χ² = (m − m1)²/σ1² + (m − m2)²/σ2²
• Minimize the χ² w.r.t. m
• Estimate of m: m̂ = (m1/σ1² + m2/σ2²) / (1/σ1² + 1/σ2²)
• Error estimate: 1/σm̂² = 1/σ1² + 1/σ2²
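The minimization result above is the familiar inverse-variance weighted average; a minimal sketch:

```python
def combine(m1, s1, m2, s2):
    """Minimum of chi2 = (m - m1)^2/s1^2 + (m - m2)^2/s2^2 and its error."""
    w1, w2 = 1.0 / s1 ** 2, 1.0 / s2 ** 2
    m_hat = (w1 * m1 + w2 * m2) / (w1 + w2)   # weighted average
    err = (w1 + w2) ** -0.5                   # 1/err^2 = 1/s1^2 + 1/s2^2
    return m_hat, err
```

Two equal-precision measurements, e.g. `combine(10.0, 1.0, 14.0, 1.0)`, average to 12 with error 1/√2, smaller than either input error.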
Covariance and covariance matrix
• Definitions:
– Covariance: cov(x, y) = ⟨xy⟩ − ⟨x⟩⟨y⟩
– Correlation: ρxy = cov(x, y) / (σx σy)
• Correlated n-dimensional Gaussian:
f(x) = (2π)^(−n/2) |C|^(−1/2) exp(−½ (x − μ)ᵀ C⁻¹ (x − μ))
• where C is the covariance matrix: Cij = cov(xi, xj)
Two-dimensional Gaussian
• Product of two independent Gaussians with different σ
• Apply a rotation in the (x, y) plane
Two-dimensional Gaussian (cont.)
• The rotation preserves the metric
• Covariance in the rotated coordinates: off-diagonal terms appear, proportional to the rotation angle and to the difference of the variances
Two-dimensional Gaussian (cont.)
• A pictorial view of an iso-probability contour
[Figure: iso-probability ellipse in the (x, y) plane, shown in the original and rotated coordinates]
1D projections
• PDF projections are (1D) Gaussians
• Areas of the 1σ and 2σ contours differ in 1D and 2D!
[Figure: iso-probability ellipse with axes σ1, σ2 and its 1D projections]

  kσ      P1D     P2D
  1       0.6827  0.3934
  2       0.9545  0.8647
  3       0.9973  0.9889
  1.515   0.8702  0.6827
  2.486   0.9871  0.9545
  3.439   0.9994  0.9973
Generalization of χ² to n dimensions
• If we have n measurements, (m1, …, mn), with an n×n covariance matrix Cij, the chi-squared can be generalized as follows:
χ² = Σij (mi − μi) (C⁻¹)ij (mj − μj)
• More details in the PDG statistics review
Combining correlated measurements
• Correlation coefficient ρ ≠ 0
• Build the χ² including the correlation terms, i.e. with covariance matrix C = [[σ1², ρσ1σ2], [ρσ1σ2, σ2²]]
• Minimization gives:
m̂ = [(σ2² − ρσ1σ2) m1 + (σ1² − ρσ1σ2) m2] / (σ1² + σ2² − 2ρσ1σ2)
σm̂² = σ1² σ2² (1 − ρ²) / (σ1² + σ2² − 2ρσ1σ2)
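The correlated weighted average (the two-measurement BLUE combination) is a direct transcription of the formulas above; a minimal sketch:

```python
def combine_correlated(m1, s1, m2, s2, rho):
    """Minimum of the chi2 with covariance C = [[s1^2, rho*s1*s2],
    [rho*s1*s2, s2^2]], and its error."""
    c12 = rho * s1 * s2
    den = s1 ** 2 + s2 ** 2 - 2.0 * c12
    m_hat = ((s2 ** 2 - c12) * m1 + (s1 ** 2 - c12) * m2) / den
    var = s1 ** 2 * s2 ** 2 * (1.0 - rho ** 2) / den
    return m_hat, var ** 0.5
```

For ρ = 0 this reduces to the uncorrelated weighted average of the previous slide; a positive correlation inflates the combined error (e.g. equal errors of 1 with ρ = 0.5 give √0.75 ≈ 0.87 instead of 1/√2 ≈ 0.71).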
Correlated errors
• The "common error" σC is defined as: σC² = ρσ1σ2 = cov(m1, m2)
• Using error propagation, this also implies that the correlated part of the two errors can be treated as a single shared uncertainty
• The previous formulas can then be rewritten in terms of σC
H. Greenlee, Combining CDF and D0 Physics Results, Fermilab Workshop on Confidence Limits, March 28, 2000
Toy Monte Carlo
• Generate a large number of experiments according to the fit model, with fixed parameters (θ)
• Fit all the toy samples as if they were the real data samples
• Study the distributions of the fit quantities
• Parameter pulls: p = (θest − θtrue)/σest
– Verify the absence of bias: ⟨p⟩ = 0
– Verify the correct error estimate: σ(p) = 1
• The statistical uncertainty will depend on the number of toy Monte Carlo experiments
• The distribution of the maximum likelihood (or −2 ln L) gives no information about the quality of the fit
• Goodness of fit for ML in more than one dimension is still an open and debated issue
• The likelihood ratio w.r.t. a null hypothesis is often preferred
– Asymptotically distributed as a chi-square
– Determine the C.L. of the fit to real data as the fraction of toy cases with a worse value of the maximum log-likelihood ratio
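A minimal toy-MC pull study, assuming the simplest possible model (Gaussian sample with known σ, ML estimate = sample mean); real use cases would rerun the full fit on each toy:

```python
import random
import statistics

random.seed(1)
TRUE_MU, SIGMA, N = 5.0, 2.0, 100

pulls = []
for _ in range(2000):
    # generate one toy experiment with the fixed true parameters
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    est = statistics.fmean(sample)      # ML estimate of the mean
    err = SIGMA / N ** 0.5              # its error for known sigma
    pulls.append((est - TRUE_MU) / err)

bias = statistics.fmean(pulls)   # should be compatible with 0
width = statistics.stdev(pulls)  # should be compatible with 1
```

A non-zero mean of the pulls signals a bias of the estimator; a width different from 1 signals an incorrect error estimate, both within the statistical precision set by the number of toys.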
Kolmogorov-Smirnov test
• Assume you have a sample {x1, …, xn}; you want to test whether the set is compatible with being produced by random variables obeying a PDF f(x)
• The test consists of building the cumulative distributions of the set and of the PDF:
Fn(x) = (1/n) Σi θ(x − xi), F(x) = ∫ f(x') dx' for x' < x
• The distance between the two cumulative distributions is evaluated as:
Dn = supx |Fn(x) − F(x)|
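Since the empirical cumulative Fn(x) is a step function, the supremum is reached at one of the sample points, so Dn can be computed exactly by checking both sides of each step (a minimal sketch):

```python
def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)|, with F_n the empirical CDF of the sample."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x: check both sides
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d
```

For example, testing against a uniform distribution on [0, 1] amounts to `ks_statistic(sample, lambda x: x)`; the same function works for any target CDF.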
Kolmogorov-Smirnov test in a picture
[Figure: empirical cumulative Fn(x), stepping from 0 to 1 at x1, x2, …, xn, compared with the cumulative F(x); Dn is the maximum distance between the two]
Kolmogorov distribution
• For large n:
– Dn converges to zero (small Dn = good agreement)
– K = √n Dn has a distribution, independent of f(x), known as the Kolmogorov distribution (related to Brownian motion)
• The Kolmogorov distribution is:
P(K ≤ x) = 1 − 2 Σk=1…∞ (−1)^(k−1) e^(−2k²x²)
• Caveats with the KS test:
– Very common in HEP, but not always appropriate
– If the shape or parameters of the PDF f(x) are determined from the sample (i.e.: with a fit), the distribution of √n Dn may deviate from the Kolmogorov distribution
– A toy Monte Carlo method could be used in those cases to evaluate the distribution of √n Dn
Two-sample KS test
• We can test whether two samples {x1, …, xn}, {y1, …, ym} follow the same distribution using the distance:
Dn,m = supx |F1,n(x) − F2,m(x)|
• The variable that asymptotically follows the Kolmogorov distribution is, in this case:
√(nm/(n + m)) Dn,m
A concrete χ² example
Electro-weak precision tests
Electro-weak precision tests
• SM inputs from LEP (Aleph, Delphi, L3, Opal), SLC (SLD), Tevatron (CDF, D0)
Higgs mass prediction
• Global χ² analysis, using ZFitter for detailed SM calculations
• Correlation terms are not negligible, even cross-experiment (LEP energy, …)
• Higgs mass prediction from the indirect effect on radiative corrections
References
• G. J. Feldman, R. D. Cousins, "Unified approach to the classical statistical analysis of small signals", Phys. Rev. D 57, 3873-3889 (1998)
• J. Friedman, T. Hastie and R. Tibshirani, "The Elements of Statistical Learning", Springer Series in Statistics, 2001
• A. Webb, "Statistical Pattern Recognition", 2nd Edition, J. Wiley & Sons Ltd, 2002
• L. I. Kuncheva, "Combining Pattern Classifiers", J. Wiley & Sons, 2004
• Artificial Neural Networks:
– Bing Cheng and D. M. Titterington, "Neural Networks: A Review from a Statistical Perspective", Statist. Sci. 9 (1), 2-30 (1994)
– Robert P. W. Duin, "Learned from Neural Networks", http://ict.ewi.tudelft.nl/~duin/papers/asci_00_NNReview.pdf
• Boosted decision trees:
– R. E. Schapire, "The boosting approach to machine learning: an overview", MSRI Workshop on Nonlinear Estimation and Classification, 2002
– Y. Freund, R. E. Schapire, "A short introduction to boosting", J. Jpn. Soc. Artif. Intell. 14 (5), 771 (1999)
– B. P. Roe et al., "Boosted decision trees as an alternative to artificial neural networks for particle identification", Nucl. Instrum. Meth. A543 (2005) 577-584, http://arxiv.org/abs/physics/0408124
– Bauer and Kohavi, "An empirical comparison of voting classification algorithms", Machine Learning 36 (1999)