MLHEP Lectures - day 1, basic track


Page 1: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics

Lectures 1 & 2

Alex Rogozhnikov

Lund, MLHEP 2016


Page 2: MLHEP Lectures - day 1, basic track

Intro notes

Two tracks:
- introductory course (this one)
- advanced track: Mon, Tue, Wed; then the two tracks are merged

Introductory track:
- two lectures and two practice seminars each day

Kaggle challenges:
- 'Triggers' - only for the advanced track, lasts 3 days
- 'Higgs' - for both tracks, lasts 7 days
- know the material? Spend more time on the challenges!

Page 3: MLHEP Lectures - day 1, basic track

Intro notes - 2

Chat rooms:
- gitter
- if you want to share something between teams, please do it publicly (via chat)

Repository:
- the glossary is in the repository

Page 4: MLHEP Lectures - day 1, basic track

What is Machine Learning about?
- a method of teaching computers to make and improve predictions or behaviors based on some data?
- a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible?
- a type of AI that provides computers with the ability to learn without being explicitly programmed?
- something in between statistics, AI, optimization theory, signal processing and pattern matching?

Page 5: MLHEP Lectures - day 1, basic track

What is Machine Learning about

Inference of statistical dependencies which gives us the ability to predict

Page 6: MLHEP Lectures - day 1, basic track

What is Machine Learning about

Inference of statistical dependencies which gives us the ability to predict

Data is cheap, knowledge is precious


Page 7: MLHEP Lectures - day 1, basic track

Machine Learning is used in:
- search engines
- spam detection
- security: virus detection, DDoS defense
- computer vision and speech recognition
- market basket analysis, customer relationship management (CRM), churn prediction
- credit scoring / insurance scoring, fraud detection
- health monitoring
- traffic jam prediction, self-driving cars
- advertisement systems / recommendation systems / news clustering

Page 8: MLHEP Lectures - day 1, basic track

Machine Learning is used in:
- search engines
- spam detection
- security: virus detection, DDoS defense
- computer vision and speech recognition
- market basket analysis, customer relationship management (CRM), churn prediction
- credit scoring / insurance scoring, fraud detection
- health monitoring
- traffic jam prediction, self-driving cars
- advertisement systems / recommendation systems / news clustering
- and hundreds more

Page 9: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics:
- Triggers (LHCb; CMS to join soon)
- Particle identification
- Calibration
- Tagging
- Stripping line
- Analysis

Page 10: MLHEP Lectures - day 1, basic track

Machine Learning in High Energy Physics:
- Triggers (LHCb; CMS to join soon)
- Particle identification
- Calibration
- Tagging
- Stripping line
- Analysis

At each stage different data is used and different information is inferred, but the underlying ideas are quite similar.

Page 11: MLHEP Lectures - day 1, basic track

General notion

In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$:
- $i$ is an index of an event
- $x_i$ is the vector of features available for event $i$
- $y_i$ is the target: the value we need to predict

features = observables = variables

Page 12: MLHEP Lectures - day 1, basic track

Classification problem

$y_i \in Y$, where $Y$ is a finite set of labels.

Examples:

- particle identification based on information about the track
  $x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$,
  $Y = \{\text{electron}, \text{muon}, \text{pion}, \ldots\}$
- binary classification: $Y = \{0, 1\}$, where 1 is signal and 0 is background

Page 13: MLHEP Lectures - day 1, basic track

Regression problem

$y \in \mathbb{R}$

Examples:
- predicting the price of a house by its position
- predicting the number of customers / money income
- reconstructing the real momentum of a particle

Page 14: MLHEP Lectures - day 1, basic track

Regression problem

$y \in \mathbb{R}$

Examples:
- predicting the price of a house by its position
- predicting the number of customers / money income
- reconstructing the real momentum of a particle

Why do we need automatic classification/regression?
- in applications, up to thousands of features
- higher quality
- much faster adaptation to new problems

Page 15: MLHEP Lectures - day 1, basic track

Classification based on nearest neighbours

Given a training set of objects and their labels $\{x_i, y_i\}$, we predict the label for a new observation $x$:

$\hat{y} = y_j, \qquad j = \arg\min_i \rho(x, x_i)$

Here and below, $\rho(x, \tilde{x})$ is the distance in the space of features.

Page 16: MLHEP Lectures - day 1, basic track

Visualization of decision rule

Consider a classification problem with 2 features:

$x_i = (x_i^1, x_i^2), \qquad y_i \in Y = \{0, 1\}$

Page 17: MLHEP Lectures - day 1, basic track

$k$ Nearest Neighbours ($k$NN)

A better way is to use $k$ neighbours:

$p_y(x) = \frac{\#\{\text{$k$NN events of $x$ in class $y$}\}}{k}$
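As an illustration, a minimal $k$NN classification sketch with scikit-learn (not from the slides; the two-blob dataset is hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical 2-feature dataset: two Gaussian blobs as classes 0 and 1
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5
knn.fit(X, y)
# predicted p_y(x) = fraction of the k nearest neighbours that belong to class y
print(knn.predict_proba([[1.0, 1.0]]))
```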

Page 18: MLHEP Lectures - day 1, basic track


Page 19: MLHEP Lectures - day 1, basic track

k = 1, 2, 5, 30

Page 20: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Page 21: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Page 22: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Quality is lower when $k > 1$.

Page 23: MLHEP Lectures - day 1, basic track

Overfitting

What is the quality of classification on the training dataset when $k = 1$?

Answer: it is ideal (the closest neighbour is the event itself).

Quality is lower when $k > 1$.

This doesn't mean $k = 1$ is the best; it means we cannot use training events to estimate quality.

When a classifier's decision rule is too complex and captures details of the training data that are not relevant to the distribution, we call this overfitting (more details tomorrow).

Page 24: MLHEP Lectures - day 1, basic track

Regression using $k$NN

Regression with $k$ nearest neighbours is done by averaging the outputs:

$\hat{y} = \frac{1}{k} \sum_{j \in \mathrm{knn}(x)} y_j$
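A matching regression sketch (again a hypothetical dataset): the prediction for $x$ is the plain average of the $k$ nearest outputs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# hypothetical 1-feature data: noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

reg = KNeighborsRegressor(n_neighbors=10)  # prediction = mean of 10 nearest y_j
reg.fit(X, y)
print(reg.predict([[3.0]]))
```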

Page 25: MLHEP Lectures - day 1, basic track

$k$NN with weights

Average the neighbours' outputs with weights; the closer the neighbour, the higher the weight of its contribution, e.g.:

$\hat{y} = \frac{\sum_{j \in \mathrm{knn}(x)} w_j y_j}{\sum_{j \in \mathrm{knn}(x)} w_j}, \qquad w_j = 1 / \rho(x, x_j)$
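In scikit-learn this inverse-distance weighting is available out of the box; a sketch continuing the regression example above:

```python
from sklearn.neighbors import KNeighborsRegressor

# weights='distance' implements w_j = 1 / rho(x, x_j) from the slide
reg_w = KNeighborsRegressor(n_neighbors=10, weights='distance')
reg_w.fit(X, y)  # X, y: the noisy-sine data from the previous sketch
print(reg_w.predict([[3.0]]))
```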

Page 26: MLHEP Lectures - day 1, basic track

Computational complexity

Given that the dimensionality of the space is $d$ and there are $n$ training samples:

- training time: ~ O(save a link to the data)
- prediction time: $O(n \times d)$ for each sample

Page 27: MLHEP Lectures - day 1, basic track

Spatial index: ball tree


Page 28: MLHEP Lectures - day 1, basic track

Ball tree
- training time ~ $O(d \times n \log n)$
- prediction time ~ $O(d \times \log n)$ for each sample

Another option exists: the KD-tree.
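A small sketch of scikit-learn's BallTree on synthetic points (the sizes are arbitrary), where construction costs roughly $O(d \, n \log n)$ and each query roughly $O(d \log n)$:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 3))     # n = 10000 samples, d = 3 features

tree = BallTree(X)                  # build the spatial index once
dist, ind = tree.query(X[:2], k=5)  # 5 nearest neighbours of the first 2 points
print(ind)
```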

Page 29: MLHEP Lectures - day 1, basic track

Overview of $k$NN
- awesomely simple classifier and regressor
- provides too-optimistic quality on training data
- quite slow, though optimizations exist
- too sensitive to the scale of features
- hard times with high-dimensional data

Page 30: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Page 31: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Change the scale of the first feature by a factor of 10:

$\rho(x, \tilde{x})^2 = (10x^1 - 10\tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2 \approx 100\,(x^1 - \tilde{x}^1)^2$

Page 32: MLHEP Lectures - day 1, basic track

Sensitivity to scale of features

Euclidean distance:

$\rho(x, \tilde{x})^2 = (x^1 - \tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2$

Change the scale of the first feature by a factor of 10:

$\rho(x, \tilde{x})^2 = (10x^1 - 10\tilde{x}^1)^2 + (x^2 - \tilde{x}^2)^2 + \cdots + (x^d - \tilde{x}^d)^2 \approx 100\,(x^1 - \tilde{x}^1)^2$

Scaling of features frequently increases quality.
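A sketch of the usual remedy: standardize features (zero mean, unit variance) before any distance-based method. The pipeline and data here are illustrative, not from the slides.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# second feature has a 100x larger scale and would dominate Euclidean distance
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(1, 1, size=(100, 2))])
X[:, 1] *= 100
y = np.array([0] * 100 + [1] * 100)

# StandardScaler removes the scale mismatch before kNN computes distances
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.predict([[0.5, 50.0]]))
```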

Page 33: MLHEP Lectures - day 1, basic track

Distance function matters

Minkowski distance: $\rho(x, \tilde{x})^p = \sum_l (x^l - \tilde{x}^l)^p$

Canberra: $\rho(x, \tilde{x}) = \sum_l \frac{|x^l - \tilde{x}^l|}{|x^l| + |\tilde{x}^l|}$

Cosine metric: $\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$
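These distances are available in scipy; a quick sketch (the vectors are arbitrary examples):

```python
import numpy as np
from scipy.spatial.distance import minkowski, canberra, cosine

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 1.0])

print(minkowski(x, z, p=3))  # Minkowski distance with p = 3
print(canberra(x, z))        # sum_l |x_l - z_l| / (|x_l| + |z_l|)
print(cosine(x, z))          # note: scipy returns 1 - <x, z> / (|x| |z|)
```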

Page 34: MLHEP Lectures - day 1, basic track

Problems with high dimensions

With higher dimensions ($d \gg 1$) the neighbouring points are further away.

Example: consider $n$ training data points distributed uniformly in the unit cube:
- the expected number of points in a ball of radius $r$ is proportional to $r^d$
- to collect the same number of neighbours, we need to take $r = \mathrm{const}^{1/d} \to 1$

$k$NN suffers from the curse of dimensionality.
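A one-line illustration of the formula: the radius needed to capture a fixed fraction (here 1%) of uniform points in the unit cube approaches 1 as $d$ grows.

```python
# r = fraction ** (1/d): side of a sub-cube capturing 1% of uniform points
for d in [1, 2, 10, 100, 1000]:
    print(d, 0.01 ** (1.0 / d))
```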

Page 35: MLHEP Lectures - day 1, basic track

Measuring quality of binary classification

The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red).

Which classifier provides better discrimination?

Page 36: MLHEP Lectures - day 1, basic track

Measuring quality of binary classification

The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red).

Which classifier provides better discrimination?

Discrimination is identical in all three cases.

Page 37: MLHEP Lectures - day 1, basic track

ROC curve demonstration


Page 38: MLHEP Lectures - day 1, basic track

ROC curve


Page 39: MLHEP Lectures - day 1, basic track

ROC curve

These distributions have the same ROC curve (the ROC curve is the passed-signal vs passed-background dependency).

Page 40: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold

Page 41: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information

Page 42: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
- ROC curve = information about the order of events:

  b b s b s b ... s s b s s

Page 43: MLHEP Lectures - day 1, basic track

ROC curve
- defined only for binary classification
- contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold
- particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
- ROC curve = information about the order of events:

  b b s b s b ... s s b s s

Comparison of algorithms should be based on the information from the ROC curve.

Page 44: MLHEP Lectures - day 1, basic track

Terminology and Conventions
- fpr = background efficiency = $\epsilon_b$
- tpr = signal efficiency = $\epsilon_s$

Page 45: MLHEP Lectures - day 1, basic track

Terminology and Conventions
- fpr = background efficiency = $\epsilon_b$
- tpr = signal efficiency = $\epsilon_s$

Page 46: MLHEP Lectures - day 1, basic track

ROC AUC (area under the ROC curve)

$\text{ROC AUC} = P(r_b < r_s)$

where $r_b$, $r_s$ are the predictions of random background and signal events.
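A sketch of how the ROC curve and ROC AUC are computed with scikit-learn, on hypothetical classifier outputs (two overlapping Gaussians, as in the pictures):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = np.array([0] * 1000 + [1] * 1000)
# background scores around 0, signal scores shifted to the right
scores = np.concatenate([rng.normal(0, 1, 1000), rng.normal(1, 1, 1000)])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # fpr/tpr for every threshold
print(roc_auc_score(y_true, scores))              # estimates P(r_b < r_s)
```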

Page 47: MLHEP Lectures - day 1, basic track

Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)

Page 48: MLHEP Lectures - day 1, basic track

Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)

Applications frequently demand a different metric.

Page 49: MLHEP Lectures - day 1, basic track

$n$-minutes break

Page 50: MLHEP Lectures - day 1, basic track

Recapitulation

1. Statistical ML: applications and problems
2. ML in HEP
3. $k$ nearest neighbours classifier and regressor
4. ROC curve, ROC AUC

Page 51: MLHEP Lectures - day 1, basic track

Statistical Machine Learning

The machine learning we use in practice is based on statistics.

Main assumption: the data is generated from a probabilistic distribution $p(x, y)$.

Does there really exist a distribution of people / pages / texts?

Page 52: MLHEP Lectures - day 1, basic track

Statistical Machine Learning

The machine learning we use in practice is based on statistics.

Main assumption: the data is generated from a probabilistic distribution $p(x, y)$.

Does there really exist a distribution of people / pages / texts?

In HEP these distributions do exist.

Page 53: MLHEP Lectures - day 1, basic track

Optimal classification. Bayes optimal classifier

Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:

$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x \mid y)}{p(x)}$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\,p(x \mid y = 1)}{p(y = 0)\,p(x \mid y = 0)}$

Page 54: MLHEP Lectures - day 1, basic track

Optimal classification. Bayes optimal classifier

Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:

$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x \mid y)}{p(x)}$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\,p(x \mid y = 1)}{p(y = 0)\,p(x \mid y = 0)}$

Lemma (Neyman–Pearson): the best classification quality is provided by $\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}$ (the Bayes optimal classifier).

Page 55: MLHEP Lectures - day 1, basic track

Optimal Binary Classification

The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$

Page 56: MLHEP Lectures - day 1, basic track

Optimal Binary Classification

The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$

How can we estimate the terms of this expression?

Page 57: MLHEP Lectures - day 1, basic track

Histogram density estimation

Count the number of samples in each bin and normalize.
- fast
- choice of binning is crucial
- number of bins grows exponentially → curse of dimensionality
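A minimal numpy sketch of histogram density estimation (synthetic 1D data):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(size=1000)

# density=True normalizes counts so the histogram integrates to 1
hist, edges = np.histogram(data, bins=30, density=True)
print(hist.max(), edges[0], edges[-1])
```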

Page 58: MLHEP Lectures - day 1, basic track

Kernel density estimation

$f(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - x_i}{h}\right)$

$K(x)$ is the kernel, $h$ is the bandwidth.

Typically a Gaussian kernel is used, but there are many others.

The approach is very close to weighted $k$NN.
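A sketch with scikit-learn's KernelDensity (Gaussian kernel; the bandwidth value here is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
data = rng.normal(size=(1000, 1))

kde = KernelDensity(kernel='gaussian', bandwidth=0.3)
kde.fit(data)
# score_samples returns log f(x); exponentiate to get the density itself
print(np.exp(kde.score_samples([[0.0], [2.0]])))
```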

Page 59: MLHEP Lectures - day 1, basic track

Kernel density estimation: bandwidth selection

Silverman's rule of thumb:

$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$
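The rule is easy to compute directly; a sketch for 1D data:

```python
import numpy as np

def silverman_bandwidth(data):
    # h = sigma_hat * (4 / (3 n)) ** (1/5) for 1D data
    return np.std(data) * (4.0 / (3.0 * len(data))) ** 0.2

rng = np.random.RandomState(0)
print(silverman_bandwidth(rng.normal(size=1000)))
```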

Page 60: MLHEP Lectures - day 1, basic track

Kernel density estimation: bandwidth selection

Silverman's rule of thumb:

$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$

This may be irrelevant if the data is far from being Gaussian.

Page 61: MLHEP Lectures - day 1, basic track

Parametric density estimation

Family of density functions: $f(x; \theta)$.

Problem: estimate the parameters of a Gaussian distribution:

$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

Page 62: MLHEP Lectures - day 1, basic track

QDA (Quadratic discriminant analysis)

Reconstructing the probabilities $p(x \mid y = 1)$, $p(x \mid y = 0)$ from data, assuming those are multidimensional normal distributions:

$p(x \mid y = 0) \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad p(x \mid y = 1) \sim \mathcal{N}(\mu_1, \Sigma_1)$

$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)}\,\frac{p(x \mid y = 1)}{p(x \mid y = 0)} = \mathrm{const} \times \frac{\exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)\right)}{\exp\!\left(-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)\right)}$

$= \exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \mathrm{const}\right)$

with the prior ratio estimated as $n_1 / n_0$.
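A QDA sketch with scikit-learn, assuming a synthetic two-Gaussian dataset (the class shapes are hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 2], [[1.0, -0.3], [-0.3, 2.0]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis()  # fits mu_c and Sigma_c for each class
qda.fit(X, y)
print(qda.predict_proba([[1.0, 1.0]]))
```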

Page 63: MLHEP Lectures - day 1, basic track


Page 64: MLHEP Lectures - day 1, basic track

QDA complexity

$n$ samples, $d$ dimensions.

Training consists of fitting $p(x \mid y = 0)$ and $p(x \mid y = 1)$ and takes $O(n d^2 + d^3)$:
- computing the covariance matrix: $O(n d^2)$
- inverting the covariance matrix: $O(d^3)$

Prediction takes $O(d^2)$ for each sample, spent on computing the dot product.

Page 65: MLHEP Lectures - day 1, basic track

QDA overview
- simple decision rule
- fast prediction
- many parameters to reconstruct in high dimensions
- data almost never has a Gaussian distribution

Page 66: MLHEP Lectures - day 1, basic track

Gaussian mixtures for density estimation

Mixture of distributions:

$f(x) = \sum_{c \,\in\, \text{components}} \pi_c f_c(x; \theta_c), \qquad \sum_{c \,\in\, \text{components}} \pi_c = 1$

Mixture of Gaussian distributions:

$f(x) = \sum_{c \,\in\, \text{components}} \pi_c f(x; \mu_c, \Sigma_c)$

Parameters to be found: $\pi_1, \ldots, \pi_C$; $\mu_1, \ldots, \mu_C$; $\Sigma_1, \ldots, \Sigma_C$
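A density-estimation sketch with scikit-learn's GaussianMixture (synthetic 1D mixture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
data = np.vstack([rng.normal(-2, 1.0, size=(300, 1)),
                  rng.normal(3, 0.5, size=(200, 1))])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data)
print(gmm.weights_)                        # pi_c
print(gmm.means_.ravel())                  # mu_c
print(np.exp(gmm.score_samples([[0.0]])))  # density f(x) at x = 0
```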

Page 67: MLHEP Lectures - day 1, basic track


Page 68: MLHEP Lectures - day 1, basic track

Gaussian mixtures: finding parameters

Criterion: maximize the likelihood (use MLE to find optimal parameters):

$\sum_i \log f(x_i; \theta) \to \max_\theta$

- no analytic solution
- we can use general-purpose optimization methods

Page 69: MLHEP Lectures - day 1, basic track

Gaussian mixtures: finding parameters

Criterion: maximize the likelihood (use MLE to find optimal parameters):

$\sum_i \log f(x_i; \theta) \to \max_\theta$

- no analytic solution
- we can use general-purpose optimization methods

In mixtures, the parameters split into two groups:
- $\theta_1, \ldots, \theta_C$: parameters of the components
- $\pi_1, \ldots, \pi_C$: contributions of the components

Page 70: MLHEP Lectures - day 1, basic track

Expectation-Maximization algorithm [Dempster et al., 1977]

Idea: introduce a set of hidden variables $\pi_c(x)$.

Expectation step:

$\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$

Maximization step:

$\pi_c \leftarrow \frac{1}{n} \sum_i \pi_c(x_i)$

$\theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x_i) \log f_c(x_i; \theta)$

The maximization step is trivial for Gaussian distributions. The EM algorithm is more stable and has good convergence properties.
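A bare-bones EM sketch for a 1D two-component Gaussian mixture, following the E/M steps above (the initial guesses are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 200)])

pi = np.array([0.5, 0.5])     # component contributions
mu = np.array([-1.0, 1.0])    # component means
sigma = np.array([1.0, 1.0])  # component standard deviations

for _ in range(100):
    # E-step: responsibilities pi_c(x_i)
    dens = np.stack([p * norm.pdf(data, m, s) for p, m, s in zip(pi, mu, sigma)])
    resp = dens / dens.sum(axis=0)

    # M-step: closed-form updates for Gaussian components
    n_c = resp.sum(axis=1)
    pi = n_c / len(data)
    mu = (resp * data).sum(axis=1) / n_c
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / n_c)

print(pi, mu, sigma)
```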

Page 71: MLHEP Lectures - day 1, basic track

EM algorithm


Page 72: MLHEP Lectures - day 1, basic track

EM algorithm


Page 73: MLHEP Lectures - day 1, basic track

The classification model based on mixture density estimation is called MDA (mixture discriminant analysis).

Generative approach

Generative approach: try to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.

QDA and MDA are generative classifiers.

Page 74: MLHEP Lectures - day 1, basic track

The classification model based on mixture density estimation is called MDA (mixture discriminant analysis).

Generative approach

Generative approach: try to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.

QDA and MDA are generative classifiers.

Problems of the generative approach:
- real-life distributions can hardly be reconstructed
- especially in high-dimensional spaces
- so we switch to the discriminative approach: guessing $p(y \mid x)$ directly

Page 75: MLHEP Lectures - day 1, basic track

Classification: truck vs car


Page 76: MLHEP Lectures - day 1, basic track

If we can avoid density estimation, we'd better do it.


Page 77: MLHEP Lectures - day 1, basic track

Linear decision rule

The decision function is linear:

$d(x) = \langle w, x \rangle + w_0$

$\begin{cases} d(x) > 0 \;\to\; \hat{y} = +1 \\ d(x) < 0 \;\to\; \hat{y} = -1 \end{cases}$

This is a parametric model (finding parameters $w$, $w_0$). QDA & MDA are parametric as well.

Page 78: MLHEP Lectures - day 1, basic track

Finding Optimal Parameters

A good initial guess: find $w$, $w_0$ such that the classification error is minimal:

$\mathcal{E} = \sum_{i \,\in\, \text{events}} \mathbb{1}_{y_i \neq \hat{y}_i}, \qquad \hat{y}_i = \mathrm{sgn}(d(x_i))$

Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$.

Discontinuous optimization (arrrrgh!)

Page 79: MLHEP Lectures - day 1, basic track

Finding Optimal Parameters - 2

Discontinuous optimization; solution: let's make the decision rule smooth:

$p_{+1}(x) = f(d(x))$

$p_{-1}(x) = 1 - p_{+1}(x)$

$\begin{cases} f(0) = 0.5 \\ f(x) > 0.5 & \text{if } x > 0 \\ f(x) < 0.5 & \text{if } x < 0 \end{cases}$

Page 80: MLHEP Lectures - day 1, basic track

Logistic function

$\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$

Properties:

1. monotonic, $\sigma(x) \in (0, 1)$
2. $\sigma(x) + \sigma(-x) = 1$
3. $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
4. $2\,\sigma(x) = 1 + \tanh(x/2)$
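A quick numerical check of properties 2 and 4 (the function definition is just numpy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)
print(np.allclose(sigmoid(x) + sigmoid(-x), 1))         # property 2
print(np.allclose(2 * sigmoid(x), 1 + np.tanh(x / 2)))  # property 4
```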

Page 81: MLHEP Lectures - day 1, basic track

Logistic regression

Define the probabilities obtained with the logistic function:

$d(x) = \langle w, x \rangle + w_0$

$p_{+1}(x) = \sigma(d(x))$

$p_{-1}(x) = \sigma(-d(x))$

and optimize the log-likelihood:

$\sum_{i \,\in\, \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$

Important exercise: find an expression and build a plot for $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$
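A logistic-regression sketch with scikit-learn on hypothetical data with labels $y \in \{-1, +1\}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, size=(200, 2)),
               rng.normal(1, 1, size=(200, 2))])
y = np.array([-1] * 200 + [+1] * 200)

clf = LogisticRegression()  # minimizes the log-loss above to find w, w0
clf.fit(X, y)
print(clf.coef_, clf.intercept_)        # w and w0
print(clf.predict_proba([[0.0, 0.0]]))  # [sigma(-d(x)), sigma(d(x))]
```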

Page 82: MLHEP Lectures - day 1, basic track

Linear model for regression

How can we use the linear function $d(x) = \langle w, x \rangle + w_0$ for regression?

Simplification of notation: $x^0 = 1$, $x = (1, x^1, \ldots, x^d)$, so that

$d(x) = \langle w, x \rangle$

Page 83: MLHEP Lectures - day 1, basic track

Linear regression (ordinary least squares)

We can use a linear function for regression: $d(x_i) = y_i$.

This is a linear system with $d + 1$ variables and $n$ equations.

Minimize OLS, aka MSE (mean squared error):

$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$

Explicit solution:

$\left(\sum_i x_i x_i^T\right) w = \sum_i y_i x_i$
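The explicit solution is a single linear solve; a numpy sketch with synthetic data (the true weights are made up for the check):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # prepend x^0 = 1
w_true = np.array([0.5, 1.0, -2.0, 3.0])                   # hypothetical weights
y = X @ w_true + rng.normal(0, 0.1, size=n)

# solve (sum_i x_i x_i^T) w = sum_i y_i x_i, i.e. (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```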

Page 84: MLHEP Lectures - day 1, basic track

Linear regression
- can use some other error, but there is no explicit solution in other cases
- demonstrates the properties of linear models:
  - reliable estimates when $n \gg d$
  - able to fit the data completely if $n = d$
  - undefined when $n < d$

Page 85: MLHEP Lectures - day 1, basic track

Data Scientist Pipeline

- experiments in an appropriate high-level language or environment
- after the experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA)

The second point is not always needed.

Page 86: MLHEP Lectures - day 1, basic track

Scientific Python

NumPy: vectorized computations in Python

Matplotlib: for plotting

Pandas: for data manipulation and analysis (based on NumPy)

Page 87: MLHEP Lectures - day 1, basic track

Scientific Python

Scikit-learn: the most popular library for machine learning

SciPy: libraries for science and engineering

root_numpy: a convenient way to work with ROOT files

Page 88: MLHEP Lectures - day 1, basic track
