1/21 acat 2011eckhard von toerne 06.sept 2011 status of tmva the toolkit for multivariate analysis...

25
1/21 ACAT 2011 Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss

Upload: lora-spencer

Post on 19-Jan-2018

217 views

Category:

Documents


0 download

DESCRIPTION

3/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 What is TMVA Supervised learning Classification and Regression tasks Easy to train, evaluate and compare various MVA methods Various preprocessing methods (Decorr.,PCA, Gauss...) Integrated in ROOT MVA output

TRANSCRIPT

Page 1: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

1/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Status of TMVA the Toolkit for

MultiVariate AnalysisEckhard von Toerne (University of Bonn)For the TMVA core developer team:A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss

Page 2: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

2/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Outline• Overview• New developments• Recent physics results that use TMVA

– Web-Site: http://tmva.sourceforge.net/– See also: "TMVA - Toolkit for Multivariate Data Analysis , A.

Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]

Page 3: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

3/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

What is TMVA• Supervised learning• Classification and

Regression tasks• Easy to train,

evaluate and compare various MVA methods

• Various preprocessing methods (Decorr.,PCA, Gauss...)

• Integrated in ROOT

MVA output

Page 4: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

4/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

• Training:– Classification:

Learn the features of the differentevent classes from a sample withknown signal/background composition

– Regression:Learn the functional dependencebetween input variables and targets

• Testing:– Evaluate the performance of the trained

classifier/regressor on an independenttest sample

– Compare different methods

• Application:– Apply the classifier/regressor to real data

TMVA workflow

Page 5: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

5/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

H1

H0

x1

x2

Classification of signal/background – How to find best decision boundary?

H1

H0

x1

x2H1

H0

x1

x2

Regression – How to determine the correct model?

Classification/Regression

Page 6: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

6/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

• If you have a training sample with only few events? Number of „parameters“ must be limited Use Linear classifier or FDA, small BDT, small MLP

• Variables are uncorrelated (or only linear corrs) likelihood• I just want something simple use Cuts, LD, Fisher• Methods for complex problems use BDT, MLP, SVM

How to choose a method?

List of acronyms:BDT = boosted decision tree, see manual page 103ANN = articifical neural networkMLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92FDA = functional discriminant analysis, see manual p. 87LD = linear discriminant , manual p. 85SVM = support vector machine, manual p. 98 , SVM currently available only for classificationCuts = like in “cut selection“, manual p. 56Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83

Page 7: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

7/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Performance Speed Robustness Curse of Dim.

Transparency Regression

No/linear correlations

Nonlinear correlations

Training Response Overtraining Weak input vars

1D multi D

1( ) 1 xA x e

1

i

. . .N

1 input layer k hidden layers 1 ouput layer

1

j

M1

. . .

. . . 1

. . .Mk

2 output classes (signal and background)

Nvar discriminating input variables

11w

ijw

1jw. . .. . .

1( ) ( ) ( ) ( 1)

01

kMk k k k

j j ij ii

x w w xA

var

(0)1..i Nx

( 1)1,2kx

(“Activation” function)

with:

Feed

-for

war

d M

ultil

ayer

Per

cept

ron

• Modelling of arbitrary nonlinear functions as a nonlinear combination of simple „neuron activation functions“

• Advantages:– very flexible,

no assumptionabout the functionnecessary

• Disadvantages:– „black box“– needs tuning– seed dependent

Artificial Neural Networks

Page 8: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

8/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Performance Speed Robustness Curse of Dim.

Transparency Regression

No/linear correlations

Nonlinear correlations

Training Response Overtraining Weak input vars

1D multi D

/ /

• Grow a forest of decision trees and determine the event class/target by majority vote

• Weights of misclassified events are increased in the next iteration

• Advantages:– ignores weak

variables– works out of

the box

• Disadvantages:– vulnerable to overtraining

Boosted Decision Trees

Page 9: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

9/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

No Single Best Classifier…

Criteria

Classifiers

Cuts Likeli-hood

PDERS/ k-NN H-Matrix Fisher MLP BDT RuleFit SVM

Perfor-mance

no / linear correlations nonlinear

correlations

SpeedTraining

Response /

Robust-ness

Overtraining Weak input variables

Curse of dimensionality Transparency

The properties of the Function discriminant (FDA) depend on the chosen function

Page 10: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

10/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Neyman-Pearson:The Likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection. i.e. it maximizes the area under the “Receiver Operation Characteristics” (ROC) curve

Varying y(x)>“cut” moves working point (efficiency and purity) along ROC curve

How to choose “cut”? need to know prior probabilities (S, B abundances) • Measurement of signal cross section: maximum of S/√(S+B) or equiv. √(·p)• Discovery of a signal : maximum of S/√(B)• Precision measurement: high purity (p)• Trigger selection: high efficiency (

0 1

1

0

1-

back

gr.

signal

random guessing

good classification

better classification

“limit” in ROC curve

given by likelihood ratio

Type-1 error smallType-2 error large

Type-1 error large Type-2 error small

Neyman-Pearson Lemma

Page 11: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

11/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

• 3-dimensional distribution• Signal: sum of Gaussians• Background=flat

• Theoretical limit calculated using Neyman-Pearson Lemma

• Neural net (MLP) with two hidden layers and backpropagation training. Bayesian option has little influence on high statistics training

• TMVA-ANN converges towards theoretical limit for sufficient Ntrain (~100k)

Performance with toy data

Page 12: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

12/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Recent developments• Current version: TMVA version 4.1.2 in root release 5.30• Unit test framework for daily software and method performance

validation (C. Rosemann, E.v.T.)• Multiclass classification for MLP, BDTG, FDA• BDT automatic parameter optimization for building the tree

architecture• new method to treat data with distinct sub-populations (Method

Category)• Optional Bayesian treatment of ANN weights in MLP with back-

propagation (Jiahang Zhong)• Extended PDEFoam functionality (A. Voigt)• Variable transformations on a user-defined subset of variables

Page 13: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

13/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Unit test• automated framework to verify functionality and

performance (ours based on B. Eckel‘s description)• slimmed version runs every night on various OS

C. Rosemann, E.v.T.

**************************************************************** TMVA - U N I T test : Summary ****************************************************************Test 0 : Event [107/107]....................................OKTest 1 : VariableInfo [31/31]...............................OKTest 2 : DataSetInfo [20/20]................................OKTest 3 : DataSet [15/15]....................................OKTest 4 : Factory [16/16]....................................OKTest 7 : LDN_selVar_Gauss [4/4].............................OK....Test 107 : BoostedPDEFoam [4/4]..............................OKTest 108 : BoostedDTPDEFoam [4/4]............................OKTotal number of failures: 0***************************************************************

Page 14: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

14/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

And now: switching from statistics to physics

…acknowledging the hard work of our users

Page 15: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

15/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Review of recent results

• x BDT trained on individual mH samples with 10 variables. Expect 1.47 ev signal events at mH=160GeV .. compared to 1.27ev with cut-based analysis (on 4 variables) and same bgd.

Phys.Lett.B699:25-47,2011.

Page 16: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

16/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Super-Kamiokande Coll., “Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification“ Phys.Rev.D79:112010,2009.7 input variables MLP with one hidden layer

Signal Background

Review of recent results

Page 17: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

17/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

•BDT analysis with ~20 input variables•lepton + ET-mis+ jets •Results for s+t-channel

• CDF Coll., “First Observation of Electroweak Single Top Quark Production“, Phys.Rev.Lett.103:092002,2009.

Review of recent results

Page 18: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

18/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Review of recent results using TMVA

• CDF+D0 combined higgs working group, hep-ex/1107.5518. (SVM)

• CMS Coll., H-->WW search, Phys.Lett.B699:25-47,2011. (BDT)• IceCube Coll., astro-ph/1101.1692. (MLP)• D0 Coll., top-pairs, Phys.Rev.D84:012008,2011. (BDT)• IceCube Coll., Phys.Rev.D83:012001,2011. (BDT)• IceCube Coll., Phys.Rev.D82:112003,2010. (BDT)• D0 Coll., Higgs search, Phys.Rev.Lett.105:251801,2010. (BDT) • CDF Coll., single top, Phys.Rev.D82:112005,2010. (BDT)• D0 Coll., single top, Phys.Lett.B690:5-14,2010. (BDT) • D0 Coll., top pairs, Phys.Rev.D82:032002,2010. (Likelihood)• CDF Coll., single top obs., Phys.Rev.Lett.103:092002,2009. (BDT)• Super-Kamiokande Coll., Phys.Rev.D79:112010,2009. (MLP)• BABAR Coll., Phys.Rev.D79:051101,2009. (BDT)• + other papers• + several ATLAS papers with TMVA about to come out…

• + many ATLAS results about to be published…

Page 19: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

19/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Review of recent results using TMVA

• CDF+D0 combined higgs working group, hep-ex/1107.5518. (SVM)

• CMS Coll., H-->WW search, Phys.Lett.B699:25-47,2011. (BDT)• IceCube Coll., astro-ph/1101.1692. (MLP)• D0 Coll., top-pairs, Phys.Rev.D84:012008,2011. (BDT)• IceCube Coll., Phys.Rev.D83:012001,2011. (BDT)• IceCube Coll., Phys.Rev.D82:112003,2010. (BDT)• D0 Coll., Higgs search, Phys.Rev.Lett.105:251801,2010. (BDT) • CDF Coll., single top, Phys.Rev.D82:112005,2010. (BDT)• D0 Coll., single top, Phys.Lett.B690:5-14,2010. (BDT) • D0 Coll., top pairs, Phys.Rev.D82:032002,2010. (Likelihood)• CDF Coll., single top obs., Phys.Rev.Lett.103:092002,2009. (BDT)• Super-Kamiokande Coll., Phys.Rev.D79:112010,2009. (MLP)• BABAR Coll., Phys.Rev.D79:051101,2009. (BDT)• + other papers• + several ATLAS papers with TMVA about to come out…

• + many ATLAS results about to be published…

Thank you for u

sing TMVA !

Page 20: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

20/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Summary• TMVA versatile package for

classification and regression tasks• Integrated into ROOT• Easy to train classifiers/regression

methods• A multitude of physics results based on

TMVA are coming out• Thank you for your attention!

Page 21: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

21/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

• Several similar data mining efforts with rising importance in most fields of science and industry

• TMVA is open source software

• Use & redistribution of source permitted according to terms in BSD license

Contributed to TMVA have: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academica Sinica, Taipeh).

Credits

Page 22: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

22/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Spare Slides

Page 23: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

23/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

void TMVAnalysis( ){ TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" ); TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile,"!V");

TFile *input = TFile::Open("tmva_example.root"); factory->AddVariable("var1+var2", 'F'); factory->AddVariable("var1-var2", 'F'); //factory->AddTarget("tarval", 'F'); factory->AddSignalTree ( (TTree*)input->Get("TreeS"), 1.0 ); factory->AddBackgroundTree ( (TTree*)input->Get("TreeB"), 1.0 ); //factory->AddRegressionTree ( (TTree*)input->Get("regTree"), 1.0 ); factory->PrepareTrainingAndTestTree( "", "", "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" ); factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood", "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" ); factory->BookMethod( TMVA::Types::kMLP, "MLP", "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" ); factory->TrainAllMethods(); // factory->TrainAllMethodsForRegression(); factory->TestAllMethods(); factory->EvaluateAllMethods(); outputFile->Close(); delete factory;}

A complete TMVA training/testing session

Create Factory

Add variables/targets

Initialize Trees

Book MVA methods

Train, test and evaluate

Page 24: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

24/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

What is a multi-variate analysis?

• “Combine“ all input variables into one output variable

• Supervised learning means learning by example: the program extracts patterns from training data

Input Variables

Classifier Output

Page 25: 1/21 ACAT 2011Eckhard von Toerne 06.Sept 2011 Status of TMVA the Toolkit for MultiVariate Analysis Eckhard von Toerne (University of Bonn) For the TMVA

25/21ACAT 2011 Eckhard von Toerne 06.Sept 2011

Metaclassifiers – Category Classifier and Boosting