ACAT 2011, Eckhard von Toerne, 06 Sept 2011
Status of TMVA, the Toolkit for MultiVariate Analysis
Eckhard von Toerne (University of Bonn)
For the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss
Outline
• Overview
• New developments
• Recent physics results that use TMVA

Web site: http://tmva.sourceforge.net/
See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]
What is TMVA
• Supervised learning
• Classification and regression tasks
• Easy to train, evaluate and compare various MVA methods
• Various preprocessing methods (decorrelation, PCA, Gaussianisation, ...)
• Integrated in ROOT
[Figure: example MVA output distribution]
TMVA workflow
• Training:
  – Classification: learn the features of the different event classes from a sample with known signal/background composition
  – Regression: learn the functional dependence between input variables and targets
• Testing:
  – Evaluate the performance of the trained classifier/regressor on an independent test sample
  – Compare different methods
• Application:
  – Apply the classifier/regressor to real data
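The train/test steps above can be illustrated with a toy one-variable cut classifier (plain C++, not the TMVA API; `Event`, `trainCut` and `testAccuracy` are illustrative names): training picks the cut with the best accuracy on a labelled sample, testing evaluates it on an independent sample.

```cpp
#include <vector>

// Toy "event": one discriminating variable plus its true class.
struct Event { double x; bool isSignal; };

// Training: scan cut positions and keep the one with the best
// classification accuracy on the training sample.
double trainCut(const std::vector<Event>& train) {
    double bestCut = 0.0;
    int bestCorrect = -1;
    for (double cut = -5.0; cut <= 5.0; cut += 0.1) {
        int correct = 0;
        for (const Event& e : train)
            if ((e.x > cut) == e.isSignal) ++correct;
        if (correct > bestCorrect) { bestCorrect = correct; bestCut = cut; }
    }
    return bestCut;
}

// Testing: evaluate the trained cut on an independent sample.
double testAccuracy(const std::vector<Event>& test, double cut) {
    int correct = 0;
    for (const Event& e : test)
        if ((e.x > cut) == e.isSignal) ++correct;
    return double(correct) / test.size();
}
```

In a real analysis the test sample must be statistically independent of the training sample, otherwise the measured performance is biased by overtraining.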
Classification/Regression
[Figure: scatter plots of two variables x1, x2 with hypotheses H0 and H1, separated by different decision boundaries]
• Classification of signal/background: how to find the best decision boundary?
• Regression: how to determine the correct model?
How to choose a method?
• Training sample with only few events? The number of "parameters" must be limited: use a linear classifier or FDA, a small BDT, or a small MLP
• Variables are uncorrelated (or have only linear correlations): use Likelihood
• I just want something simple: use Cuts, LD, or Fisher
• Methods for complex problems: use BDT, MLP, or SVM

List of acronyms:
BDT = boosted decision tree, see manual page 103
ANN = artificial neural network
MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92
FDA = function discriminant analysis, see manual p. 87
LD = linear discriminant, manual p. 85
SVM = support vector machine, manual p. 98; SVM currently available only for classification
Cuts = as in "cut selection", manual p. 56
Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83
Feed-forward Multilayer Perceptron
[Figure: network with N_var discriminating input variables in one input layer, k hidden layers with M_1..M_k nodes each, and one output layer with 2 output classes (signal and background); weights w_{ij} connect consecutive layers]
Output of node j in layer k:
x_j^{(k)} = A\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)} x_i^{(k-1)} \right)
with inputs x_i^{(0)}, i = 1..N_var, outputs x_{1,2}^{(k+1)}, and the "activation" function
A(x) = \frac{1}{1 + e^{-x}}
Artificial Neural Networks
• Modelling of arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions"
• Advantages:
  – very flexible, no assumption about the function necessary
• Disadvantages:
  – "black box"
  – needs tuning
  – seed dependent
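A minimal feed-forward pass with a sigmoid activation can be sketched as follows (plain C++, not the TMVA implementation; the weight layout, with the bias as the first entry of each node's weight vector, is an illustrative assumption):

```cpp
#include <cmath>
#include <vector>

// Sigmoid "neuron activation" function A(x) = 1 / (1 + exp(-x)).
double activation(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One feed-forward layer: node j receives a bias weight wj[0] plus
// the weighted sum of the previous layer's outputs.
std::vector<double> layer(const std::vector<double>& in,
                          const std::vector<std::vector<double>>& w) {
    std::vector<double> out;
    for (const auto& wj : w) {               // wj = {bias, w_1, ..., w_n}
        double sum = wj[0];
        for (size_t i = 0; i < in.size(); ++i) sum += wj[i + 1] * in[i];
        out.push_back(activation(sum));
    }
    return out;
}
```

Chaining such layers gives the full network; training (backpropagation) adjusts the weights, which is where the tuning and seed dependence mentioned above come in.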
Boosted Decision Trees
• Grow a forest of decision trees and determine the event class/target by majority vote
• Weights of misclassified events are increased in the next iteration
• Advantages:
  – ignores weak variables
  – works out of the box
• Disadvantages:
  – vulnerable to overtraining
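The majority-vote step can be sketched with single-cut "stumps" standing in for full decision trees (plain C++, not the TMVA implementation; `Stump` and `classify` are illustrative names, and the boost weights alpha are assumed to come from a prior boosting iteration):

```cpp
#include <vector>

// A decision "stump": classifies an event as signal if variable x
// is above the stump's cut; each stump carries a boost weight alpha.
struct Stump { double cut; double alpha; };

// Forest decision by weighted majority vote: sum the boost weights of
// the trees voting signal (+alpha) vs background (-alpha).
bool classify(const std::vector<Stump>& forest, double x) {
    double score = 0.0;
    for (const Stump& s : forest)
        score += (x > s.cut) ? s.alpha : -s.alpha;
    return score > 0.0;   // signal if the weighted vote is positive
}
```

Boosting then re-weights misclassified training events before growing the next tree, so later trees focus on the hard cases.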
No Single Best Classifier…
[Table: classifiers Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit and SVM rated against the criteria: performance for no/linear and for nonlinear correlations; training and response speed; robustness against overtraining and weak input variables; curse of dimensionality; transparency]
The properties of the function discriminant (FDA) depend on the chosen function.
Neyman-Pearson Lemma
The likelihood ratio used as "selection criterion" y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve.
Varying the cut in y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve.
How to choose the "cut"? One needs to know the prior probabilities (S, B abundances):
• Measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
• Discovery of a signal: maximum of S/√B
• Precision measurement: high purity (p)
• Trigger selection: high efficiency (ε)
[Figure: ROC curve, signal efficiency vs background rejection (1 - ε_backgr.), showing random guessing on the diagonal, good and better classification above it, and the "limit" given by the likelihood ratio; the two ends of the curve are annotated "type-1 error small, type-2 error large" and "type-1 error large, type-2 error small"]
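The figures of merit listed above can be written down directly (a sketch; S and B denote the expected signal and background yields after the cut):

```cpp
#include <cmath>

// Figures of merit for choosing the cut on the classifier output,
// given expected signal (S) and background (B) yields after the cut.

// Cross-section measurement: maximize S / sqrt(S + B).
double crossSectionFoM(double S, double B) { return S / std::sqrt(S + B); }

// Discovery of a signal: maximize S / sqrt(B).
double discoveryFoM(double S, double B) { return S / std::sqrt(B); }

// Precision measurement: require high purity p = S / (S + B).
double purity(double S, double B) { return S / (S + B); }
```

Scanning the cut and evaluating the appropriate figure of merit at each working point selects the cut best suited to the analysis goal.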
Performance with toy data
• 3-dimensional distribution; signal: sum of Gaussians; background: flat
• Theoretical limit calculated using the Neyman-Pearson lemma
• Neural net (MLP) with two hidden layers and backpropagation training; the Bayesian option has little influence on high-statistics training
• The TMVA ANN converges towards the theoretical limit for sufficient N_train (~100k)
Recent developments
• Current version: TMVA 4.1.2, in ROOT release 5.30
• Unit-test framework for daily software and method performance validation (C. Rosemann, E.v.T.)
• Multiclass classification for MLP, BDTG, FDA
• BDT automatic parameter optimization for building the tree architecture
• New method to treat data with distinct sub-populations (Method Category)
• Optional Bayesian treatment of ANN weights in MLP with back-propagation (Jiahang Zhong)
• Extended PDEFoam functionality (A. Voigt)
• Variable transformations on a user-defined subset of variables
Unit test (C. Rosemann, E.v.T.)
• Automated framework to verify functionality and performance (ours is based on B. Eckel's description)
• A slimmed version runs every night on various OS

****************************************************************
 TMVA - U N I T test : Summary
****************************************************************
Test   0 : Event [107/107]....................................OK
Test   1 : VariableInfo [31/31]...............................OK
Test   2 : DataSetInfo [20/20]................................OK
Test   3 : DataSet [15/15]....................................OK
Test   4 : Factory [16/16]....................................OK
Test   7 : LDN_selVar_Gauss [4/4].............................OK
....
Test 107 : BoostedPDEFoam [4/4]..............................OK
Test 108 : BoostedDTPDEFoam [4/4]............................OK
Total number of failures: 0
***************************************************************
And now: switching from statistics to physics
…acknowledging the hard work of our users
Review of recent results
• CMS H→WW search: BDT trained on individual m_H samples with 10 variables. Expect 1.47 signal events at m_H = 160 GeV, compared to 1.27 events with the cut-based analysis (on 4 variables) and the same background.
Phys.Lett. B699 (2011) 25-47.
Review of recent results
• Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys.Rev. D79 (2009) 112010. 7 input variables; MLP with one hidden layer.
[Figure: signal and background distributions]
Review of recent results
• CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys.Rev.Lett. 103 (2009) 092002. BDT analysis with ~20 input variables; lepton + missing E_T + jets; results for the s+t channel.
Review of recent results using TMVA
• CDF+D0 combined Higgs working group, arXiv:1107.5518 [hep-ex]. (SVM)
• CMS Coll., H→WW search, Phys.Lett. B699 (2011) 25-47. (BDT)
• IceCube Coll., arXiv:1101.1692 [astro-ph]. (MLP)
• D0 Coll., top pairs, Phys.Rev. D84 (2011) 012008. (BDT)
• IceCube Coll., Phys.Rev. D83 (2011) 012001. (BDT)
• IceCube Coll., Phys.Rev. D82 (2010) 112003. (BDT)
• D0 Coll., Higgs search, Phys.Rev.Lett. 105 (2010) 251801. (BDT)
• CDF Coll., single top, Phys.Rev. D82 (2010) 112005. (BDT)
• D0 Coll., single top, Phys.Lett. B690 (2010) 5-14. (BDT)
• D0 Coll., top pairs, Phys.Rev. D82 (2010) 032002. (Likelihood)
• CDF Coll., single top observation, Phys.Rev.Lett. 103 (2009) 092002. (BDT)
• Super-Kamiokande Coll., Phys.Rev. D79 (2009) 112010. (MLP)
• BABAR Coll., Phys.Rev. D79 (2009) 051101. (BDT)
• + other papers
• + several ATLAS papers with TMVA about to come out…
Thank you for using TMVA!
Summary
• TMVA: a versatile package for classification and regression tasks
• Integrated into ROOT
• Easy to train classifiers/regression methods
• A multitude of physics results based on TMVA are coming out
• Thank you for your attention!
Credits
• Several similar data-mining efforts with rising importance in most fields of science and industry
• TMVA is open-source software
• Use and redistribution of the source is permitted according to the terms of the BSD license

Contributors to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).
Spare Slides
A complete TMVA training/testing session

void TMVAnalysis()
{
   // Create Factory
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
   TMVA::Factory* factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // Add variables/targets
   TFile* input = TFile::Open( "tmva_example.root" );
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );
   //factory->AddTarget( "tarval", 'F' );

   // Initialize trees
   factory->AddSignalTree( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
   //factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );
   factory->PrepareTrainingAndTestTree( "", "",
      "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

   // Book MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // Train, test and evaluate
   factory->TrainAllMethods();
   // factory->TrainAllMethodsForRegression();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
What is a multi-variate analysis?
• "Combine" all input variables into one output variable
• Supervised learning means learning by example: the program extracts patterns from the training data
[Figure: input variables mapped to classifier output]
Metaclassifiers – Category Classifier and Boosting