ACAT 2011, Eckhard von Toerne, 06 Sept 2011
Status of TMVA, the Toolkit for MultiVariate Analysis
Eckhard von Toerne (University of Bonn)
For the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss
Outline
• Overview
• New developments
• Recent physics results that use TMVA

Web site: http://tmva.sourceforge.net/
See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]
What is TMVA
• Supervised learning
• Classification and regression tasks
• Easy to train, evaluate and compare various MVA methods
• Various preprocessing methods (decorrelation, PCA, Gaussianisation, ...)
• Integrated in ROOT
[Figure: example MVA output distribution]
TMVA workflow
• Training:
  – Classification: learn the features of the different event classes from a sample with known signal/background composition
  – Regression: learn the functional dependence between input variables and targets
• Testing:
  – Evaluate the performance of the trained classifier/regressor on an independent test sample
  – Compare different methods
• Application:
  – Apply the classifier/regressor to real data
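The train/test steps above can be illustrated with a toy one-variable cut classifier (plain C++, not the TMVA API; `Event`, `trainCut` and `testAccuracy` are illustrative names): training picks the cut with the best accuracy on a labelled sample, testing evaluates it on an independent sample.

```cpp
#include <vector>

// Toy "event": one discriminating variable plus its true class.
struct Event { double x; bool isSignal; };

// Training: scan cut positions and keep the one with the best
// classification accuracy on the training sample.
double trainCut(const std::vector<Event>& train) {
    double bestCut = 0.0;
    int bestCorrect = -1;
    for (double cut = -5.0; cut <= 5.0; cut += 0.1) {
        int correct = 0;
        for (const Event& e : train)
            if ((e.x > cut) == e.isSignal) ++correct;
        if (correct > bestCorrect) { bestCorrect = correct; bestCut = cut; }
    }
    return bestCut;
}

// Testing: evaluate the trained cut on an independent sample.
double testAccuracy(const std::vector<Event>& test, double cut) {
    int correct = 0;
    for (const Event& e : test)
        if ((e.x > cut) == e.isSignal) ++correct;
    return double(correct) / test.size();
}
```

In a real analysis the test sample must be statistically independent of the training sample, otherwise the measured performance is biased by overtraining.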
Classification/Regression
[Figure: scatter plots of two variables x1, x2 with hypotheses H0 and H1, separated by different decision boundaries]
• Classification of signal/background: how to find the best decision boundary?
• Regression: how to determine the correct model?
How to choose a method?
• Training sample with only few events? The number of "parameters" must be limited: use a linear classifier or FDA, a small BDT, or a small MLP
• Variables are uncorrelated (or have only linear correlations): use Likelihood
• I just want something simple: use Cuts, LD, or Fisher
• Methods for complex problems: use BDT, MLP, or SVM

List of acronyms:
BDT = boosted decision tree, see manual page 103
ANN = artificial neural network
MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92
FDA = function discriminant analysis, see manual p. 87
LD = linear discriminant, manual p. 85
SVM = support vector machine, manual p. 98; SVM currently available only for classification
Cuts = as in "cut selection", manual p. 56
Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83
Feed-forward Multilayer Perceptron
[Figure: network with N_var discriminating input variables in one input layer, k hidden layers with M_1..M_k nodes each, and one output layer with 2 output classes (signal and background); weights w_{ij} connect consecutive layers]
Output of node j in layer k:
x_j^{(k)} = A\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)} x_i^{(k-1)} \right)
with inputs x_i^{(0)}, i = 1..N_var, outputs x_{1,2}^{(k+1)}, and the "activation" function
A(x) = \frac{1}{1 + e^{-x}}
Artificial Neural Networks
• Modelling of arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions"
• Advantages:
  – very flexible, no assumption about the function necessary
• Disadvantages:
  – "black box"
  – needs tuning
  – seed dependent
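A minimal feed-forward pass with a sigmoid activation can be sketched as follows (plain C++, not the TMVA implementation; the weight layout, with the bias as the first entry of each node's weight vector, is an illustrative assumption):

```cpp
#include <cmath>
#include <vector>

// Sigmoid "neuron activation" function A(x) = 1 / (1 + exp(-x)).
double activation(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One feed-forward layer: node j receives a bias weight wj[0] plus
// the weighted sum of the previous layer's outputs.
std::vector<double> layer(const std::vector<double>& in,
                          const std::vector<std::vector<double>>& w) {
    std::vector<double> out;
    for (const auto& wj : w) {               // wj = {bias, w_1, ..., w_n}
        double sum = wj[0];
        for (size_t i = 0; i < in.size(); ++i) sum += wj[i + 1] * in[i];
        out.push_back(activation(sum));
    }
    return out;
}
```

Chaining such layers gives the full network; training (backpropagation) adjusts the weights, which is where the tuning and seed dependence mentioned above come in.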
Boosted Decision Trees
• Grow a forest of decision trees and determine the event class/target by majority vote
• Weights of misclassified events are increased in the next iteration
• Advantages:
  – ignores weak variables
  – works out of the box
• Disadvantages:
  – vulnerable to overtraining
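The majority-vote step can be sketched with single-cut "stumps" standing in for full decision trees (plain C++, not the TMVA implementation; `Stump` and `classify` are illustrative names, and the boost weights alpha are assumed to come from a prior boosting iteration):

```cpp
#include <vector>

// A decision "stump": classifies an event as signal if variable x
// is above the stump's cut; each stump carries a boost weight alpha.
struct Stump { double cut; double alpha; };

// Forest decision by weighted majority vote: sum the boost weights of
// the trees voting signal (+alpha) vs background (-alpha).
bool classify(const std::vector<Stump>& forest, double x) {
    double score = 0.0;
    for (const Stump& s : forest)
        score += (x > s.cut) ? s.alpha : -s.alpha;
    return score > 0.0;   // signal if the weighted vote is positive
}
```

Boosting then re-weights misclassified training events before growing the next tree, so later trees focus on the hard cases.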
No Single Best Classifier…
[Table: classifiers Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit and SVM rated against the criteria: performance for no/linear and for nonlinear correlations; training and response speed; robustness against overtraining and weak input variables; curse of dimensionality; transparency]
The properties of the function discriminant (FDA) depend on the chosen function.
Neyman-Pearson Lemma
The likelihood ratio used as "selection criterion" y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve.
Varying the cut in y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve.
How to choose the "cut"? One needs to know the prior probabilities (S, B abundances):
• Measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
• Discovery of a signal: maximum of S/√B
• Precision measurement: high purity (p)
• Trigger selection: high efficiency (ε)
[Figure: ROC curve, signal efficiency vs background rejection (1 - ε_backgr.), showing random guessing on the diagonal, good and better classification above it, and the "limit" given by the likelihood ratio; the two ends of the curve are annotated "type-1 error small, type-2 error large" and "type-1 error large, type-2 error small"]
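The figures of merit listed above can be written down directly (a sketch; S and B denote the expected signal and background yields after the cut):

```cpp
#include <cmath>

// Figures of merit for choosing the cut on the classifier output,
// given expected signal (S) and background (B) yields after the cut.

// Cross-section measurement: maximize S / sqrt(S + B).
double crossSectionFoM(double S, double B) { return S / std::sqrt(S + B); }

// Discovery of a signal: maximize S / sqrt(B).
double discoveryFoM(double S, double B) { return S / std::sqrt(B); }

// Precision measurement: require high purity p = S / (S + B).
double purity(double S, double B) { return S / (S + B); }
```

Scanning the cut and evaluating the appropriate figure of merit at each working point selects the cut best suited to the analysis goal.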
Performance with toy data
• 3-dimensional distribution; signal: sum of Gaussians; background: flat
• Theoretical limit calculated using the Neyman-Pearson lemma
• Neural net (MLP) with two hidden layers and backpropagation training; the Bayesian option has little influence on high-statistics training
• The TMVA ANN converges towards the theoretical limit for sufficient N_train (~100k)
Recent developments
• Current version: TMVA 4.1.2, in ROOT release 5.30
• Unit-test framework for daily software and method performance validation (C. Rosemann, E.v.T.)
• Multiclass classification for MLP, BDTG, FDA
• BDT automatic parameter optimization for building the tree architecture
• New method to treat data with distinct sub-populations (Method Category)
• Optional Bayesian treatment of ANN weights in MLP with back-propagation (Jiahang Zhong)
• Extended PDEFoam functionality (A. Voigt)
• Variable transformations on a user-defined subset of variables
Unit test (C. Rosemann, E.v.T.)
• Automated framework to verify functionality and performance (ours is based on B. Eckel's description)
• A slimmed version runs every night on various OS

****************************************************************
 TMVA - U N I T test : Summary
****************************************************************
Test   0 : Event [107/107]....................................OK
Test   1 : VariableInfo [31/31]...............................OK
Test   2 : DataSetInfo [20/20]................................OK
Test   3 : DataSet [15/15]....................................OK
Test   4 : Factory [16/16]....................................OK
Test   7 : LDN_selVar_Gauss [4/4].............................OK
....
Test 107 : BoostedPDEFoam [4/4]..............................OK
Test 108 : BoostedDTPDEFoam [4/4]............................OK
Total number of failures: 0
***************************************************************
And now: switching from statistics to physics
…acknowledging the hard work of our users
Review of recent results
• CMS H→WW search: BDT trained on individual m_H samples with 10 variables. Expect 1.47 signal events at m_H = 160 GeV, compared to 1.27 events with the cut-based analysis (on 4 variables) and the same background.
Phys.Lett. B699 (2011) 25-47.
Review of recent results
• Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys.Rev. D79 (2009) 112010. 7 input variables; MLP with one hidden layer.
[Figure: signal and background distributions]
Review of recent results
• CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys.Rev.Lett. 103 (2009) 092002. BDT analysis with ~20 input variables; lepton + missing E_T + jets; results for the s+t channel.
Review of recent results using TMVA
• CDF+D0 combined Higgs working group, arXiv:1107.5518 [hep-ex]. (SVM)
• CMS Coll., H→WW search, Phys.Lett. B699 (2011) 25-47. (BDT)
• IceCube Coll., arXiv:1101.1692 [astro-ph]. (MLP)
• D0 Coll., top pairs, Phys.Rev. D84 (2011) 012008. (BDT)
• IceCube Coll., Phys.Rev. D83 (2011) 012001. (BDT)
• IceCube Coll., Phys.Rev. D82 (2010) 112003. (BDT)
• D0 Coll., Higgs search, Phys.Rev.Lett. 105 (2010) 251801. (BDT)
• CDF Coll., single top, Phys.Rev. D82 (2010) 112005. (BDT)
• D0 Coll., single top, Phys.Lett. B690 (2010) 5-14. (BDT)
• D0 Coll., top pairs, Phys.Rev. D82 (2010) 032002. (Likelihood)
• CDF Coll., single top observation, Phys.Rev.Lett. 103 (2009) 092002. (BDT)
• Super-Kamiokande Coll., Phys.Rev. D79 (2009) 112010. (MLP)
• BABAR Coll., Phys.Rev. D79 (2009) 051101. (BDT)
• + other papers
• + several ATLAS papers with TMVA about to come out…
Thank you for using TMVA!
Summary
• TMVA: a versatile package for classification and regression tasks
• Integrated into ROOT
• Easy to train classifiers/regression methods
• A multitude of physics results based on TMVA are coming out
• Thank you for your attention!
Credits
• Several similar data-mining efforts with rising importance in most fields of science and industry
• TMVA is open-source software
• Use and redistribution of the source is permitted according to the terms of the BSD license

Contributors to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).
Spare Slides
A complete TMVA training/testing session

void TMVAnalysis()
{
   // Create Factory
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
   TMVA::Factory* factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // Add variables/targets
   TFile* input = TFile::Open( "tmva_example.root" );
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );
   //factory->AddTarget( "tarval", 'F' );

   // Initialize trees
   factory->AddSignalTree( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
   //factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );
   factory->PrepareTrainingAndTestTree( "", "",
      "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

   // Book MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // Train, test and evaluate
   factory->TrainAllMethods();
   // factory->TrainAllMethodsForRegression();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
What is a multi-variate analysis?
• "Combine" all input variables into one output variable
• Supervised learning means learning by example: the program extracts patterns from the training data
[Figure: input variables mapped to classifier output]
Metaclassifiers – Category Classifier and Boosting