MLHEP Lectures - day 1, basic track


Page 1: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics

Lectures 1 & 2

Alex Rogozhnikov

Lund, MLHEP 2016


Page 2: MLHEP Lectures - day 1, basic track

Intro notes

Two tracks:
- introductory course (this one)
- advanced track: Mon, Tue, Wed; then the two tracks are merged

Introductory track:
- two lectures and two practice seminars each day

Kaggle challenges:
- 'Triggers' - only for the advanced track, lasts 3 days
- 'Higgs' - for both tracks, lasts 7 days
- know the material? Spend more time on the challenges!

Page 3: MLHEP Lectures - day 1, basic track

Intro notes - 2

Chat rooms:
- gitter
- if you want to share something between teams, please do it publicly (via chat)

Repository:
- the glossary is in the repository

Page 4: MLHEP Lectures - day 1, basic track

What is Machine Learning about?
- a method of teaching computers to make and improve predictions or behaviors based on some data?
- a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible?
- a type of AI that provides computers with the ability to learn without being explicitly programmed?
- something in between statistics, AI, optimization theory, signal processing and pattern matching?

Page 5: MLHEP Lectures - day 1, basic track

What is Machine Learning about

Inference of statistical dependencies which gives us the ability to predict

Page 6: MLHEP Lectures - day 1, basic track

What is Machine Learning about

Inference of statistical dependencies which gives us the ability to predict

Data is cheap, knowledge is precious


Page 7: MLHEP Lectures - day 1, basic track

Machine Learning is used in:
- search engines
- spam detection
- security: virus detection, DDoS defense
- computer vision and speech recognition
- market basket analysis, customer relationship management (CRM), churn prediction
- credit scoring / insurance scoring, fraud detection
- health monitoring
- traffic jam prediction, self-driving cars
- advertisement systems / recommendation systems / news clustering

Page 8: MLHEP Lectures - day 1, basic track

Machine Learning is used in:
- search engines
- spam detection
- security: virus detection, DDoS defense
- computer vision and speech recognition
- market basket analysis, customer relationship management (CRM), churn prediction
- credit scoring / insurance scoring, fraud detection
- health monitoring
- traffic jam prediction, self-driving cars
- advertisement systems / recommendation systems / news clustering
- and hundreds more

Page 9: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics:
- Triggers (LHCb; CMS to join soon)
- Particle identification
- Calibration
- Tagging
- Stripping line
- Analysis

Page 10: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics:
- Triggers (LHCb; CMS to join soon)
- Particle identification
- Calibration
- Tagging
- Stripping line
- Analysis

At each stage different data is used and different information is inferred, but the underlying ideas are quite similar.

Page 11: MLHEP Lectures - day 1, basic track

General notion

In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$:
- $i$ is an index of an event
- $x_i$ is the vector of features available for event $i$
- $y_i$ is the target: the value we need to predict

features = observables = variables

Page 12: MLHEP Lectures - day 1, basic track

Classification problem

$y_i \in Y$, where $Y$ is a finite set of labels.

Examples:

- particle identification based on information about the track
  $x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$,
  $Y = \{\text{electron}, \text{muon}, \text{pion}, \ldots\}$
- binary classification: $Y = \{0, 1\}$, where 1 is signal and 0 is background

Page 13: MLHEP Lectures - day 1, basic track

Regression problem

$y \in \mathbb{R}$

Examples:
- predicting the price of a house by its position
- predicting the number of customers / money income
- reconstructing the real momentum of a particle

Page 14: MLHEP Lectures - day 1, basic track

Regression problem

$y \in \mathbb{R}$

Examples:
- predicting the price of a house by its position
- predicting the number of customers / money income
- reconstructing the real momentum of a particle

Why do we need automatic classification/regression?
- in applications, up to thousands of features
- higher quality
- much faster adaptation to new problems

Page 15: MLHEP Lectures - day 1, basic track

Classification based on nearest neighbours

Given a training set of objects and their labels $\{x_i, y_i\}$, we predict the label for a new observation $x$:

$\hat{y} = y_j, \qquad j = \arg\min_i \rho(x, x_i)$

Here and below, $\rho(x, \tilde{x})$ is the distance in the space of features.

Page 16: MLHEP Lectures - day 1, basic track

Visualization of decision rule

Consider a classification problem with 2 features:

$x_i = (x_i^1, x_i^2), \qquad y_i \in Y = \{0, 1\}$

Page 17: MLHEP Lectures - day 1, basic track

$k$ Nearest Neighbours ($k$NN)

A better way is to use $k$ neighbours:

$p_y(x) = \frac{\#\{\text{$k$NN events of $x$ in class $y$}\}}{k}$
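As an illustration, a minimal $k$NN classification sketch with scikit-learn (not from the slides; the two-blob dataset is hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical 2-feature dataset: two Gaussian blobs as classes 0 and 1
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5
knn.fit(X, y)
# predicted p_y(x) = fraction of the k nearest neighbours that belong to class y
print(knn.predict_proba([[1.0, 1.0]]))
```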

Page 18: MLHEP Lectures - day 1, basic track


Page 19: MLHEP Lectures - day 1, basic track

k = 1, 2, 5, 30

Page 20: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Page 21: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Page 22: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Quality is lower when $k > 1$.

Page 23: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Quality is lower when $k > 1$.

This doesn't mean $k = 1$ is the best; it means we cannot use training events to estimate quality.

When a classifier's decision rule is too complex and captures details of the training data that are not relevant to the distribution, we call this overfitting (more details tomorrow).

Page 24: MLHEP Lectures - day 1, basic track

Regression using $k$NN

Regression with $k$ nearest neighbours is done by averaging the outputs:

$\hat{y} = \frac{1}{k} \sum_{j \in \mathrm{knn}(x)} y_j$
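A matching regression sketch (again a hypothetical dataset): the prediction for $x$ is the plain average of the $k$ nearest outputs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# hypothetical 1-feature data: noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

reg = KNeighborsRegressor(n_neighbors=10)  # prediction = mean of 10 nearest y_j
reg.fit(X, y)
print(reg.predict([[3.0]]))
```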

Page 25: MLHEP Lectures - day 1, basic track

$k$NN with weights

Average the neighbours' outputs with weights; the closer the neighbour, the higher the weight of its contribution, e.g.:

$\hat{y} = \frac{\sum_{j \in \mathrm{knn}(x)} w_j y_j}{\sum_{j \in \mathrm{knn}(x)} w_j}, \qquad w_j = 1 / \rho(x, x_j)$
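In scikit-learn this inverse-distance weighting is available out of the box; a sketch continuing the regression example above:

```python
from sklearn.neighbors import KNeighborsRegressor

# weights='distance' implements w_j = 1 / rho(x, x_j) from the slide
reg_w = KNeighborsRegressor(n_neighbors=10, weights='distance')
reg_w.fit(X, y)  # X, y: the noisy-sine data from the previous sketch
print(reg_w.predict([[3.0]]))
```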

Page 26: MLHEP Lectures - day 1, basic track

Computational complexity

Given that the dimensionality of the space is $d$ and there are $n$ training samples:

- training time: ~ O(save a link to the data)
- prediction time: $O(n \times d)$ for each sample

Page 27: MLHEP Lectures - day 1, basic track

Spatial index: ball tree


Page 28: MLHEP Lectures - day 1, basic track

Ball tree
- training time ~ $O(d \times n \log n)$
- prediction time ~ $O(d \times \log n)$ for each sample

Another option exists: the KD-tree.
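A small sketch of scikit-learn's BallTree on synthetic points (the sizes are arbitrary), where construction costs roughly $O(d \, n \log n)$ and each query roughly $O(d \log n)$:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 3))     # n = 10000 samples, d = 3 features

tree = BallTree(X)                  # build the spatial index once
dist, ind = tree.query(X[:2], k=5)  # 5 nearest neighbours of the first 2 points
print(ind)
```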

Page 29: MLHEP Lectures - day 1, basic track

Overview of $k$NN
- awesomely simple classifier and regressor
- provides too-optimistic quality on training data
- quite slow, though optimizations exist
- too sensitive to the scale of features
- hard times with high-dimensional data

Page 30: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Page 31: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Change the scale of the first feature by a factor of 10:

$\rho(x, \tilde{x})^2 = (10x^1 - 10\tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2 \approx 100\,(x^1 - \tilde{x}^1)^2$

Page 32: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Change the scale of the first feature by a factor of 10:

$\rho(x, \tilde{x})^2 = (10x^1 - 10\tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2 \approx 100\,(x^1 - \tilde{x}^1)^2$

Scaling of features frequently increases quality.
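A sketch of the usual remedy: standardize features (zero mean, unit variance) before any distance-based method. The pipeline and data here are illustrative, not from the slides.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# second feature has a 100x larger scale and would dominate Euclidean distance
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(1, 1, size=(100, 2))])
X[:, 1] *= 100
y = np.array([0] * 100 + [1] * 100)

# StandardScaler removes the scale mismatch before kNN computes distances
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.predict([[0.5, 50.0]]))
```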

Page 33: MLHEP Lectures - day 1, basic track

Distance function matters

Minkowski distance: $\rho(x, \tilde{x})^p = \sum_l (x^l - \tilde{x}^l)^p$

Canberra: $\rho(x, \tilde{x}) = \sum_l \frac{|x^l - \tilde{x}^l|}{|x^l| + |\tilde{x}^l|}$

Cosine metric: $\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$
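These distances are available in scipy; a quick sketch (the vectors are arbitrary examples):

```python
import numpy as np
from scipy.spatial.distance import minkowski, canberra, cosine

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 1.0])

print(minkowski(x, z, p=3))  # Minkowski distance with p = 3
print(canberra(x, z))        # sum_l |x_l - z_l| / (|x_l| + |z_l|)
print(cosine(x, z))          # note: scipy returns 1 - <x, z> / (|x| |z|)
```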

Page 34: MLHEP Lectures - day 1, basic track

Problems with high dimensions

With higher dimensions ($d \gg 1$) the neighbouring points are further away.

Example: consider $n$ training data points distributed uniformly in the unit cube:
- the expected number of points in a ball of radius $r$ is proportional to $r^d$
- to collect the same number of neighbours, we need to take $r = \mathrm{const}^{1/d} \to 1$

$k$NN suffers from the curse of dimensionality.
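A one-line illustration of the formula: the radius needed to capture a fixed fraction (here 1%) of uniform points in the unit cube approaches 1 as $d$ grows.

```python
# r = fraction ** (1/d): side of a sub-cube capturing 1% of uniform points
for d in [1, 2, 10, 100, 1000]:
    print(d, 0.01 ** (1.0 / d))
```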

Page 35: MLHEP Lectures - day 1, basic track

Measuring quality of binary classification

The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red).

Which classifier provides better discrimination?

Page 36: MLHEP Lectures - day 1, basic track

Measuring quality of binary classification

The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red).

Which classifier provides better discrimination?

Discrimination is identical in all three cases.

Page 37: MLHEP Lectures - day 1, basic track

ROC curve demonstration


Page 38: MLHEP Lectures - day 1, basic track

ROC curve


Page 39: MLHEP Lectures - day 1, basic track

ROC curve

These distributions have the same ROC curve (the ROC curve is the passed-signal vs passed-background dependency).

Page 40: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold

Page 41: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information

Page 42: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
- ROC curve = information about the order of events:

  b b s b s b ... s s b s s

Page 43: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
- ROC curve = information about the order of events:

  b b s b s b ... s s b s s

Comparison of algorithms should be based on the information from the ROC curve.

Page 44: MLHEP Lectures - day 1, basic track

Terminology and Conventions
- fpr = background efficiency = $\epsilon_b$
- tpr = signal efficiency = $\epsilon_s$

Page 45: MLHEP Lectures - day 1, basic track

Terminology and Conventions
- fpr = background efficiency = $\epsilon_b$
- tpr = signal efficiency = $\epsilon_s$

Page 46: MLHEP Lectures - day 1, basic track

ROC AUC (area under the ROC curve)

$\text{ROC AUC} = P(r_b < r_s)$

where $r_b$, $r_s$ are the predictions of random background and signal events.
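A sketch of how the ROC curve and ROC AUC are computed with scikit-learn, on hypothetical classifier outputs (two overlapping Gaussians, as in the pictures):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = np.array([0] * 1000 + [1] * 1000)
# background scores around 0, signal scores shifted to the right
scores = np.concatenate([rng.normal(0, 1, 1000), rng.normal(1, 1, 1000)])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # fpr/tpr for every threshold
print(roc_auc_score(y_true, scores))              # estimates P(r_b < r_s)
```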

Page 47: MLHEP Lectures - day 1, basic track

Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)

Page 48: MLHEP Lectures - day 1, basic track

Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)

Applications frequently demand a different metric.

Page 49: MLHEP Lectures - day 1, basic track

$n$-minutes break

Page 50: MLHEP Lectures - day 1, basic track

Recapitulation

1. Statistical ML: applications and problems
2. ML in HEP
3. $k$ nearest neighbours classifier and regressor
4. ROC curve, ROC AUC

Page 51: MLHEP Lectures - day 1, basic track

Statistical Machine Learning

The machine learning we use in practice is based on statistics.

Main assumption: the data is generated from a probabilistic distribution $p(x, y)$.

Does there really exist a distribution of people / pages / texts?

Page 52: MLHEP Lectures - day 1, basic track

Statistical Machine Learning

The machine learning we use in practice is based on statistics.

Main assumption: the data is generated from a probabilistic distribution $p(x, y)$.

Does there really exist a distribution of people / pages / texts?

In HEP these distributions do exist.

Page 53: MLHEP Lectures - day 1, basic track

Optimal classification. Bayes optimal classifier

Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:

$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x \mid y)}{p(x)}$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\,p(x \mid y = 1)}{p(y = 0)\,p(x \mid y = 0)}$

Page 54: MLHEP Lectures - day 1, basic track

Optimal classification. Bayes optimal classifier

Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:

$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x \mid y)}{p(x)}$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\,p(x \mid y = 1)}{p(y = 0)\,p(x \mid y = 0)}$

Lemma (Neyman–Pearson): the best classification quality is provided by $\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}$ (the Bayes optimal classifier).

Page 55: MLHEP Lectures - day 1, basic track

Optimal Binary Classification

The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$

Page 56: MLHEP Lectures - day 1, basic track

Optimal Binary Classification

The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$

How can we estimate the terms of this expression?

Page 57: MLHEP Lectures - day 1, basic track

Histogram density estimation

Count the number of samples in each bin and normalize.
- fast
- choice of binning is crucial
- number of bins grows exponentially → curse of dimensionality
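A minimal numpy sketch of histogram density estimation (synthetic 1D data):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(size=1000)

# density=True normalizes counts so the histogram integrates to 1
hist, edges = np.histogram(data, bins=30, density=True)
print(hist.max(), edges[0], edges[-1])
```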

Page 58: MLHEP Lectures - day 1, basic track

Kernel density estimation

$f(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - x_i}{h}\right)$

$K(x)$ is the kernel, $h$ is the bandwidth.

Typically a Gaussian kernel is used, but there are many others.

The approach is very close to weighted $k$NN.
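A sketch with scikit-learn's KernelDensity (Gaussian kernel; the bandwidth value here is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
data = rng.normal(size=(1000, 1))

kde = KernelDensity(kernel='gaussian', bandwidth=0.3)
kde.fit(data)
# score_samples returns log f(x); exponentiate to get the density itself
print(np.exp(kde.score_samples([[0.0], [2.0]])))
```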

Page 59: MLHEP Lectures - day 1, basic track

Kernel density estimation: bandwidth selection

Silverman's rule of thumb:

$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$
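The rule is easy to compute directly; a sketch for 1D data:

```python
import numpy as np

def silverman_bandwidth(data):
    # h = sigma_hat * (4 / (3 n)) ** (1/5) for 1D data
    return np.std(data) * (4.0 / (3.0 * len(data))) ** 0.2

rng = np.random.RandomState(0)
print(silverman_bandwidth(rng.normal(size=1000)))
```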

Page 60: MLHEP Lectures - day 1, basic track

Kernel density estimation: bandwidth selection

Silverman's rule of thumb:

$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$

This may be irrelevant if the data is far from being Gaussian.

Page 61: MLHEP Lectures - day 1, basic track

Parametric density estimation

Family of density functions: $f(x; \theta)$.

Problem: estimate the parameters of a Gaussian distribution:

$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

Page 62: MLHEP Lectures - day 1, basic track

QDA (Quadratic discriminant analysis)

Reconstructing the probabilities $p(x \mid y = 1)$, $p(x \mid y = 0)$ from data, assuming those are multidimensional normal distributions:

$p(x \mid y = 0) \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad p(x \mid y = 1) \sim \mathcal{N}(\mu_1, \Sigma_1)$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)}\,\frac{p(x \mid y = 1)}{p(x \mid y = 0)} = \mathrm{const} \times \frac{\exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)\right)}{\exp\!\left(-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)\right)}$

$= \exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \mathrm{const}\right)$

with the prior ratio estimated as $n_1 / n_0$.
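A QDA sketch with scikit-learn, assuming a synthetic two-Gaussian dataset (the class shapes are hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 2], [[1.0, -0.3], [-0.3, 2.0]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis()  # fits mu_c and Sigma_c for each class
qda.fit(X, y)
print(qda.predict_proba([[1.0, 1.0]]))
```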

Page 63: MLHEP Lectures - day 1, basic track


Page 64: MLHEP Lectures - day 1, basic track

QDA complexity

$n$ samples, $d$ dimensions.

Training consists of fitting $p(x \mid y = 0)$ and $p(x \mid y = 1)$ and takes $O(n d^2 + d^3)$:
- computing the covariance matrix: $O(n d^2)$
- inverting the covariance matrix: $O(d^3)$

Prediction takes $O(d^2)$ for each sample, spent on computing the dot product.

Page 65: MLHEP Lectures - day 1, basic track

QDA overview
- simple decision rule
- fast prediction
- many parameters to reconstruct in high dimensions
- data almost never has a Gaussian distribution

Page 66: MLHEP Lectures - day 1, basic track

Gaussian mixtures for density estimation

Mixture of distributions:

$f(x) = \sum_{c \,\in\, \text{components}} \pi_c f_c(x; \theta_c), \qquad \sum_{c \,\in\, \text{components}} \pi_c = 1$

Mixture of Gaussian distributions:

$f(x) = \sum_{c \,\in\, \text{components}} \pi_c f(x; \mu_c, \Sigma_c)$

Parameters to be found: $\pi_1, \ldots, \pi_C$; $\mu_1, \ldots, \mu_C$; $\Sigma_1, \ldots, \Sigma_C$
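A density-estimation sketch with scikit-learn's GaussianMixture (synthetic 1D mixture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
data = np.vstack([rng.normal(-2, 1.0, size=(300, 1)),
                  rng.normal(3, 0.5, size=(200, 1))])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data)
print(gmm.weights_)                        # pi_c
print(gmm.means_.ravel())                  # mu_c
print(np.exp(gmm.score_samples([[0.0]])))  # density f(x) at x = 0
```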

Page 67: MLHEP Lectures - day 1, basic track


Page 68: MLHEP Lectures - day 1, basic track

Gaussian mixtures: finding parameters

Criterion: maximize the likelihood (use MLE to find optimal parameters):

$\sum_i \log f(x_i; \theta) \to \max_\theta$

- no analytic solution
- we can use general-purpose optimization methods

Page 69: MLHEP Lectures - day 1, basic track

Gaussian mixtures: finding parameters

Criterion: maximize the likelihood (use MLE to find optimal parameters):

$\sum_i \log f(x_i; \theta) \to \max_\theta$

- no analytic solution
- we can use general-purpose optimization methods

In mixtures, the parameters split into two groups:
- $\theta_1, \ldots, \theta_C$: parameters of the components
- $\pi_1, \ldots, \pi_C$: contributions of the components

Page 70: MLHEP Lectures - day 1, basic track

Expectation-Maximization algorithm [Dempster et al., 1977]

Idea: introduce a set of hidden variables $\pi_c(x)$.

Expectation step:

$\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$

Maximization step:

$\pi_c \leftarrow \frac{1}{n} \sum_i \pi_c(x_i)$

$\theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x_i) \log f_c(x_i; \theta)$

The maximization step is trivial for Gaussian distributions. The EM algorithm is more stable and has good convergence properties.
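A bare-bones EM sketch for a 1D two-component Gaussian mixture, following the E/M steps above (the initial guesses are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 200)])

pi = np.array([0.5, 0.5])     # component contributions
mu = np.array([-1.0, 1.0])    # component means
sigma = np.array([1.0, 1.0])  # component standard deviations

for _ in range(100):
    # E-step: responsibilities pi_c(x_i)
    dens = np.stack([p * norm.pdf(data, m, s) for p, m, s in zip(pi, mu, sigma)])
    resp = dens / dens.sum(axis=0)

    # M-step: closed-form updates for Gaussian components
    n_c = resp.sum(axis=1)
    pi = n_c / len(data)
    mu = (resp * data).sum(axis=1) / n_c
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / n_c)

print(pi, mu, sigma)
```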

Page 71: MLHEP Lectures - day 1, basic track

EM algorithm


Page 72: MLHEP Lectures - day 1, basic track

EM algorithm


Page 73: MLHEP Lectures - day 1, basic track

The classification model based on mixture density estimation is called MDA (mixture discriminant analysis).

Generative approach

Generative approach: try to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.

QDA and MDA are generative classifiers.

Page 74: MLHEP Lectures - day 1, basic track

The classification model based on mixture density estimation is called MDA (mixture discriminant analysis).

Generative approach

Generative approach: try to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.

QDA and MDA are generative classifiers.

Problems of the generative approach:
- real-life distributions can hardly be reconstructed
- especially in high-dimensional spaces
- so we switch to the discriminative approach: guessing $p(y \mid x)$ directly

Page 75: MLHEP Lectures - day 1, basic track

Classification: truck vs car


Page 76: MLHEP Lectures - day 1, basic track

If we can avoid density estimation, we'd better do it.


Page 77: MLHEP Lectures - day 1, basic track

Linear decision rule

The decision function is linear:

$d(x) = \langle w, x \rangle + w_0$

$\begin{cases} d(x) > 0 \;\to\; \hat{y} = +1 \\ d(x) < 0 \;\to\; \hat{y} = -1 \end{cases}$

This is a parametric model (finding parameters $w$, $w_0$). QDA & MDA are parametric as well.

Page 78: MLHEP Lectures - day 1, basic track

Finding Optimal Parameters

A good initial guess: find $w$, $w_0$ such that the classification error is minimal:

$\mathcal{E} = \sum_{i \,\in\, \text{events}} \mathbb{1}_{y_i \neq \hat{y}_i}, \qquad \hat{y}_i = \mathrm{sgn}(d(x_i))$

Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$.

Discontinuous optimization (arrrrgh!)

Page 79: MLHEP Lectures - day 1, basic track

Finding Optimal Parameters - 2

Discontinuous optimization; solution: let's make the decision rule smooth:

$p_{+1}(x) = f(d(x))$

$p_{-1}(x) = 1 - p_{+1}(x)$

$\begin{cases} f(0) = 0.5 \\ f(x) > 0.5 & \text{if } x > 0 \\ f(x) < 0.5 & \text{if } x < 0 \end{cases}$

Page 80: MLHEP Lectures - day 1, basic track

Logistic function

$\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$

Properties:

1. monotonic, $\sigma(x) \in (0, 1)$
2. $\sigma(x) + \sigma(-x) = 1$
3. $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
4. $2\,\sigma(x) = 1 + \tanh(x/2)$
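A quick numerical check of properties 2 and 4 (the function definition is just numpy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
print(np.allclose(sigmoid(x) + sigmoid(-x), 1))         # property 2
print(np.allclose(2 * sigmoid(x), 1 + np.tanh(x / 2)))  # property 4
```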

Page 81: MLHEP Lectures - day 1, basic track

Logistic regression

Define the probabilities obtained with the logistic function:

$d(x) = \langle w, x \rangle + w_0$

$p_{+1}(x) = \sigma(d(x))$

$p_{-1}(x) = \sigma(-d(x))$

and optimize the log-likelihood:

$\sum_{i \,\in\, \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$

Important exercise: find an expression and build a plot for $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$
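A logistic-regression sketch with scikit-learn on hypothetical data with labels $y \in \{-1, +1\}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, size=(200, 2)),
               rng.normal(1, 1, size=(200, 2))])
y = np.array([-1] * 200 + [+1] * 200)

clf = LogisticRegression()  # minimizes the log-loss above to find w, w0
clf.fit(X, y)
print(clf.coef_, clf.intercept_)        # w and w0
print(clf.predict_proba([[0.0, 0.0]]))  # [sigma(-d(x)), sigma(d(x))]
```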

Page 82: MLHEP Lectures - day 1, basic track

Linear model for regression

How can we use the linear function $d(x) = \langle w, x \rangle + w_0$ for regression?

Simplification of notation: $x^0 = 1$, $x = (1, x^1, \ldots, x^d)$, so that

$d(x) = \langle w, x \rangle$

Page 83: MLHEP Lectures - day 1, basic track

Linear regression (ordinary least squares)

We can use a linear function for regression: $d(x_i) = y_i$.

This is a linear system with $d + 1$ variables and $n$ equations.

Minimize OLS, aka MSE (mean squared error):

$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$

Explicit solution:

$\left(\sum_i x_i x_i^T\right) w = \sum_i y_i x_i$
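The explicit solution is a single linear solve; a numpy sketch with synthetic data (the true weights are made up for the check):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # prepend x^0 = 1
w_true = np.array([0.5, 1.0, -2.0, 3.0])                   # hypothetical weights
y = X @ w_true + rng.normal(0, 0.1, size=n)

# solve (sum_i x_i x_i^T) w = sum_i y_i x_i, i.e. (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```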

Page 84: MLHEP Lectures - day 1, basic track

Linear regression
- can use some other error, but there is no explicit solution in other cases
- demonstrates the properties of linear models:
  - reliable estimates when $n \gg d$
  - able to fit the data completely if $n = d$
  - undefined when $n < d$

Page 85: MLHEP Lectures - day 1, basic track

Data Scientist Pipeline

- experiments in an appropriate high-level language or environment
- after the experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA)

The second point is not always needed.

Page 86: MLHEP Lectures - day 1, basic track

Scientific Python

NumPy: vectorized computations in Python

Matplotlib: for plotting

Pandas: for data manipulation and analysis (based on NumPy)

Page 87: MLHEP Lectures - day 1, basic track

Scientific Python

Scikit-learn: the most popular library for machine learning

SciPy: libraries for science and engineering

root_numpy: a convenient way to work with ROOT files

Page 88: MLHEP Lectures - day 1, basic track
