Statistical Learning Methods in HEAP
C. Kiesling, MPI for Physics, Munich - ACAT03 Workshop, KEK, Japan, Dec. 2003
Jens Zimmermann,Christian Kiesling
Max-Planck-Institut für Physik, München
MPI für extraterrestrische Physik, München
Forschungszentrum Jülich GmbH
Statistical Learning: Introduction with a simple example
Occam's Razor
Decision Trees
Local Density Estimators
Methods Based on Linear Separation
Examples: Triggers in HEP and Astrophysics
Conclusion
Statistical Learning
• Does not use prior knowledge: "no theory required"
• Learns only from examples: "trial and error", "learning by reinforcement"
• Two classes of statistical learning: discrete output 0/1 ("classification") and continuous output ("regression")
• Applications in High Energy and Astrophysics: background suppression / purification of events, and estimation of parameters not directly measured
A simple Example: Preparing a Talk
[Scatter plot: # slides vs. # formulas for the talks in the sample, classes "Experimentalists" and "Theorists"]

# formulas   # slides
    42          21
    28           8
    71          19
    64          31
    29          36
    15          34
    48          44
    56          51
    25          55
    12          16

Database established by Jens during the Young Scientists Meeting at MPI
Discriminating Theorists from Experimentalists: A First Analysis
[Histograms of # formulas and # slides for Experimentalists vs. Theorists, and a scatter plot of # slides vs. # formulas]
[Annotations: "First talks handed in", "Talks a week before the meeting"]
Completely separable, but only via complicated boundary
[Scatter plot: # slides vs. # formulas with such a complicated separating boundary drawn]
First Problems
[Scatter plot: # slides vs. # formulas with the new talk marked]

New talk by Ludger: 28 formulas on 31 slides.
At this point we cannot know which feature is "real"!
Use train/test or cross-validation!
Simple "model", but no complete separation.
See Overtraining - Want Generalization Need Regularization
We want to tune the parameters of the learning algorithm according to the amount of overtraining observed!
[Scatter plots: training sample and test sample in the (# formulas, # slides) plane]

E = (1/N) Σ_{i=1}^{N} ( t_i − out(x_i) )²

[Plot: error E vs. training epochs for the training set and the test set; the test-set error rising again signals overtraining]
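The error monitored on both samples is the mean squared deviation between target and model output. A minimal sketch in Python (the targets and model outputs below are hypothetical stand-ins, not the talk-sample data):

```python
def mse(targets, outputs):
    """E = (1/N) * sum_i (t_i - out(x_i))^2."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

# Hypothetical targets (th = 1, exp = 0) and model outputs:
train_error = mse([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])  # 0.025
test_error = mse([1, 0], [0.6, 0.5])                   # 0.205
```

Overtraining shows up when train_error keeps falling while test_error starts to rise.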
See Overtraining - Want Generalization Need Regularization
[Scatter plots: training sample and test sample in the (# formulas, # slides) plane]

E = (1/N) Σ_{i=1}^{N} ( t_i − out(x_i, w) )²

[Plot: error E vs. training epochs for the training set and the test set]

Regularization will ensure adequate performance (e.g. via VC dimensions): limit the complexity of the model.
"Factor 10" rule ("Uncle Bernie's Rule #2"): use roughly ten times more training examples than free parameters.
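The "factor 10" rule fits in one line of code. A sketch, assuming the usual reading of the rule of thumb (at least ten times more training examples than free parameters):

```python
def enough_training_data(n_examples, n_parameters, factor=10):
    """Rule of thumb ("Uncle Bernie's Rule #2"): demand roughly
    `factor` times more training examples than free parameters."""
    return n_examples >= factor * n_parameters

# A network with 9 weights would want at least 90 training talks,
# so a 10-talk sample is far too small:
small_sample_ok = enough_training_data(10, 9)
```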
Philosophy: Occam's Razor

"Pluralitas non est ponenda sine necessitate." (Plurality should not be posited without necessity; 14th century)

• Do not make assumptions unless they are really necessary.
• From theories that describe the same phenomenon equally well, choose the one that contains the fewest assumptions.

First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. (Yes! But not of much use.)

Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error. (No! "No free lunch" theorem, Wolpert 1996.)
Decision Trees
[Histograms of # formulas and # slides illustrating the cuts]

Decision tree on all events:
  #formulas < 20 → exp
  #formulas > 60 → th
  rest (20 < #formulas < 60): split on #slides:
    #slides > 40 → exp
    #slides < 40 → th

Classify Ringaile: 31 formulas on 32 slides → th

Regularization: pruning
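The tree above can be hand-coded directly. A minimal sketch, using the class labels exp/th from the slide:

```python
def classify_talk(n_formulas, n_slides):
    """Decision tree from the slide: first cut on #formulas,
    then on #slides for the intermediate region."""
    if n_formulas < 20:
        return "exp"
    if n_formulas > 60:
        return "th"
    # rest: 20 < #formulas < 60
    if n_slides > 40:
        return "exp"
    return "th"

# Ringaile: 31 formulas on 32 slides
result = classify_talk(31, 32)  # -> "th"
```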
Local Density Estimators
Search for similar, already classified events within a specified region, and count the members of the two classes in that region.
[Two scatter plots: # slides vs. # formulas, with the neighbourhood region around the query point indicated]
Maximum Likelihood
[Histograms: projections of # formulas and # slides for the two classes, evaluated at 31 formulas and 32 slides]

p_Th  = (3/5) · (2/5) = 0.24
p_Exp = (1/5) · (1/5) = 0.04
→ out = Th

The correlation gets lost completely by the projection!

Regularization: binning
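The projected-likelihood estimate can be sketched as follows. The bin edges and the small training sample are hypothetical, but the mechanism (multiplying per-axis bin frequencies, which discards all correlations) is the one described above:

```python
def bin_index(value, edges):
    """Index of the histogram bin that contains value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return len(edges) - 2  # overflow goes into the last bin

def projected_likelihood(query, events, edges_per_axis):
    """Product over axes of the fraction of events sharing the
    query's bin - the projection loses the correlations."""
    n = len(events)
    p = 1.0
    for axis, edges in enumerate(edges_per_axis):
        b = bin_index(query[axis], edges)
        in_bin = sum(1 for e in events if bin_index(e[axis], edges) == b)
        p *= in_bin / n
    return p

# Hypothetical theorist sample, binned in steps of 30 on both axes:
th = [(40, 20), (50, 40), (10, 25)]
edges = [[0, 30, 60, 90], [0, 30, 60, 90]]
p_th = projected_likelihood((45, 15), th, edges)  # (2/3) * (2/3)
```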
k-Nearest-Neighbour
[Scatter plot: the query point with its k nearest training points and the resulting output, shown for k = 1, 2, 3, 4, 5]

For every evaluation position, the distances to each training position need to be determined!

Regularization: the parameter k
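A k-NN classifier takes only a few lines. The (formulas, slides) pairs below come from the data table earlier in the talk, but the attached class labels are hypothetical:

```python
import math

def knn_classify(query, training, k):
    """training: list of ((formulas, slides), label) pairs.  Note that
    the distance to *every* training point is computed - exactly the
    cost the slide points out."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Hypothetical labels attached to points from the data table:
training = [((42, 21), "th"), ((28, 8), "th"), ((29, 36), "exp"),
            ((15, 34), "exp"), ((48, 44), "exp")]
answer = knn_classify((31, 32), training, k=3)  # -> "exp"
```

Using an odd k avoids ties in the majority vote.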
Range Search

[k-d tree on the 10 training points, with splits alternating in x and y, and the corresponding partition of the (# formulas, # slides) plane; a query box is drawn around the evaluation point]

The tree needs to be traversed only partially if the box size is small enough!
Small box: only points 1, 2, 4, 9 are checked.
Large box: all points are checked.

Regularization: box size
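A 2-d tree with a box query can be sketched as follows. The points and the box are made up, but the pruning (skipping subtrees that cannot intersect the box) is what lets the tree be traversed only partially:

```python
def build_kdtree(points, depth=0):
    """Split alternately on x and y; the median point becomes the node."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid], depth + 1),
            "right": build_kdtree(pts[mid + 1:], depth + 1)}

def range_search(node, box, found, visited):
    """box = ((xmin, xmax), (ymin, ymax)).  Subtrees on a side of the
    splitting line that cannot reach the box are never visited."""
    if node is None:
        return
    visited.append(node["point"])
    p, axis = node["point"], node["axis"]
    if all(lo <= p[d] <= hi for d, (lo, hi) in enumerate(box)):
        found.append(p)
    lo, hi = box[axis]
    if lo <= p[axis]:
        range_search(node["left"], box, found, visited)
    if p[axis] <= hi:
        range_search(node["right"], box, found, visited)

tree = build_kdtree([(1, 1), (2, 5), (3, 8), (5, 2), (6, 6), (8, 3)])
found, visited = [], []
range_search(tree, ((0, 2), (0, 6)), found, visited)
# small box: only part of the tree is visited
```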
Methods Based on Linear Separation
Divide the input space into regions separated by one or more hyperplanes.
Extrapolation is done!
[Two scatter plots: # slides vs. # formulas with separating hyperplanes (straight lines) drawn]

LDA (Fisher discriminant)
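The Fisher discriminant direction w ∝ S_W⁻¹ (m₁ − m₂) can be computed by hand in two dimensions. A sketch with small made-up samples (not the talk data):

```python
def mean2(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter2(points, m):
    """2x2 scatter matrix sum_p (p - m)(p - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = (p[0] - m[0], p[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """w = S_W^-1 (m_a - m_b): the normal of the separating hyperplane."""
    ma, mb = mean2(class_a), mean2(class_b)
    sa, sb = scatter2(class_a, ma), scatter2(class_b, mb)
    s = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    dm = (ma[0] - mb[0], ma[1] - mb[1])
    # invert the 2x2 within-class scatter matrix explicitly
    return [(s[1][1] * dm[0] - s[0][1] * dm[1]) / det,
            (-s[1][0] * dm[0] + s[0][0] * dm[1]) / det]

w = fisher_direction([(0, 0), (1, 0), (0, 1)], [(4, 0), (5, 0), (4, 1)])
```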
Neural Networks
σ(a) = 1 / (1 + e^(−a))

Each neuron computes  y = σ( Σ_i w_i x_i + s )

[Network diagram: inputs # formulas and # slides, two hidden neurons, one output; example weights −50, +0.1, +1.1, −1.1, +20, +0.2, +3.6, +3.6, −1.8]

[Surface plot: network output (between 0 and 1) over the (# formulas, # slides) plane]

Network with two hidden neurons, trained by gradient descent on
  E = (1/N) Σ_{i=1}^{N} ( t_i − out(x_i) )²
(arbitrary numbers of inputs and hidden neurons are possible)

Regularization: # hidden neurons, weight decay
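The forward pass of such a network is only a few lines. The weights below are made up (the slide's actual weight-to-connection assignment is not recoverable here):

```python
import math

def sigmoid(a):
    """sigma(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Two inputs -> sigmoid hidden layer -> one sigmoid output;
    each neuron computes sigma(sum_i w_i x_i + s)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * hi for w, hi in zip(w_out, h)) + b_out)

# Made-up weights for a 2-2-1 network, evaluated on (31 formulas, 32 slides):
out = forward([31, 32], [[0.1, 1.1], [-1.1, 0.2]],
              [-20, 20], [3.6, 3.6], -1.8)
```

In training, these weights would be adjusted by gradient descent on the error E.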
Support Vector Machines
Separating hyperplane with maximum distance to each data point: maximum margin classifier.

Found by setting up the condition for correct classification,
  y_i ( w · x_i + b ) ≥ 1 ,
and minimizing (1/2) ||w||², which leads to the Lagrangian
  L = (1/2) ||w||² − Σ_i α_i [ y_i ( w · x_i + b ) − 1 ]

A necessary condition for a minimum is  w = Σ_i α_i y_i x_i .
The output becomes  out = sgn( Σ_i α_i y_i ( x_i · x ) + b ).

Only linear separation? No! Replace the dot products:  x · y → Φ(x) · Φ(y).
The mapping to feature space, Φ: R^d → F, is hidden in a kernel  K(x, y) = Φ(x) · Φ(y).

Non-separable case: minimize  (1/2) ||w||² + C Σ_i ξ_i  instead of  (1/2) ||w||².
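Evaluating the kernelized decision function out = sgn( Σ_i α_i y_i K(x_i, x) + b ) is straightforward. The support vectors, α's and b below are made up; in practice they come from solving the quadratic programme above:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma ||x - y||^2): a dot product in feature space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """out = sgn( sum_i alpha_i y_i K(x_i, x) + b )."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1

# Made-up solution of the dual problem:
svs, ys, alphas, b = [(0.0, 0.0), (2.0, 2.0)], [1, -1], [1.0, 1.0], 0.0
label = svm_decision((0.0, 0.0), svs, ys, alphas, b)  # -> 1
```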
Physics Applications: Neural Network Trigger at HERA
keep physics, reject background
H1
Trigger for J/ψ Events

H1
Eff @ Rej = 95%:
NN   99.6%
SVM  98.3%
k-NN 97.7%
RS   97.5%
C4.5 97.5%
ML   91.2%
LDA  82%
Triggering Charged Current Events
[Diagram of a charged current event, e p → ν X via W exchange; example signal and background event displays]

Eff @ Rej = 80%:
NN   74%
SVM  73%
C4.5 72%
RS   72%
k-NN 71%
LDA  68%
ML   65%
Astrophysics: MAGIC - Gamma/Hadron Separation
Random Forest: 93.3    Neural Net: 96.5
(the numbers give the signal (photon) enhancement factor)

Training with data and MC; evaluation with data.

[Camera images: photon shower vs. hadron shower]
Future Experiment XEUS: Position of X-ray Photons
σ of reconstruction in µm:
NN   3.6
SVM  3.6
k-NN 3.7
RS   3.7
ETA  3.9
CCOM 4.0

[XEUS pn-CCD sketch: pixel structure ~300 µm, electron cloud ~10 µm, electron potential and transfer direction]

x_COM = Σ_i c_i x_i / Σ_i c_i
x_CCOM = x_COM + f(x_COM)   (corrected centre of mass; f denotes the applied correction)

(Application of statistical learning to regression problems)
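The centre-of-mass estimate x_COM = Σ_i c_i x_i / Σ_i c_i that the corrected methods start from can be sketched directly; the pixel positions and collected charges below are made up:

```python
def center_of_mass(positions, charges):
    """x_COM = sum_i c_i x_i / sum_i c_i (charge-weighted mean position)."""
    return sum(c * x for c, x in zip(charges, positions)) / sum(charges)

# Made-up pixel positions (in µm) and collected charges:
x_com = center_of_mass([0.0, 75.0, 150.0], [1.0, 2.0, 1.0])  # 75.0
```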
Conclusion
• Statistical learning theory is full of subtle details (models, statistics)
• Neural networks were found superior in the HEP and astrophysics applications (classification, regression) studied so far
• Widely used statistical learning methods were studied:
  • Decision trees
  • LDE: ML, k-NN, RS
  • Linear separation: LDA, neural nets, SVMs
• Further applications (trigger, offline analyses) under study