Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube
Tim Ruhe, TU Dortmund
Outline
- Data mining is more...
- Why is IceCube interesting (from a machine learning point of view)?
- Data preprocessing and dimensionality reduction
- Training and validation of a learning algorithm
- Results
- Other detector configurations?
- Summary & Outlook
Data Mining is more...
Diagram: annotated examples (historical data, simulations) are fed to a learning algorithm, which produces a model; applying the model to new, unannotated data yields information, knowledge and, eventually, Nobel prize(s). Preprocessing precedes the learning step (garbage in, garbage out), and the application of the model is validated.
Why is IceCube interesting from a machine learning point of view?
- Huge amount of data
- Highly imbalanced distribution of event classes (signal and background)
- Huge amount of data to be processed by the learner (Big Data)
- A real-life problem
Preprocessing (1): Reducing the Data Volume Through Cuts
- Background rejection: 91.4%
- Signal efficiency: 57.1%
BUT: the remaining background is significantly harder to reject!
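The two quoted numbers are simple surviving fractions of simulated events. A minimal sketch of how they are computed from pass counts (the counts below are made up to reproduce the quoted percentages; they are not the real event numbers):

```python
# Background rejection and signal efficiency from simulated event counts.
# The counts passed in are illustrative assumptions, not analysis values.

def cut_performance(signal_pass, signal_total, background_pass, background_total):
    efficiency = signal_pass / signal_total               # fraction of signal kept
    rejection = 1.0 - background_pass / background_total  # fraction of background removed
    return efficiency, rejection
```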
Preprocessing (2): Variable Selection
Tim Ruhe | Statistische Methoden der Datenanalyse
- Check for missing values: exclude a variable if its fraction of missing values exceeds 30%.
- Check for potential bias: exclude everything that is useless, redundant, or a source of potential bias.
- Check for correlations: exclude everything that has a correlation of 1.0.
Automated feature selection then reduces 2600 variables to 477 variables.
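The missing-value rule above can be sketched in a few lines. This is an illustrative filter, not the analysis code; column names and the use of `None` for missing entries are assumptions:

```python
# Drop variables whose fraction of missing values (None) exceeds 30%,
# mirroring the manual preselection rule described above.

def drop_missing(columns, max_missing=0.3):
    """Keep only variables whose missing-value fraction is <= max_missing."""
    return {name: vals for name, vals in columns.items()
            if vals.count(None) / len(vals) <= max_missing}
```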
Relevance vs. Redundancy: MRMR (continuous case)
Relevance: $D = \frac{1}{|S|} \sum_{x_i \in S} F(x_i, c)$
Redundancy: $R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} |c(x_i, x_j)|$
MRMR: $\max(D - R)$ or $\max(D / R)$
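A greedy MRMR selection can be sketched as follows. This is a simplification for illustration: it uses the absolute Pearson correlation both for relevance (to the target) and redundancy (to already-selected features), rather than the F-statistic used in the continuous MRMR criterion; all names are assumptions:

```python
# Greedy MRMR sketch: at each step, pick the feature maximizing
# relevance minus mean redundancy to the already-selected features.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def mrmr(features, target, k):
    """Greedily select k features from a dict name -> value list."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = abs(pearson(features[name], target))
            redundancy = (sum(abs(pearson(features[name], features[s]))
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```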
Feature Selection Stability
Jaccard: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Stability is estimated as the average Jaccard index over many sets of selected variables.
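The averaging step can be sketched directly: compute the Jaccard index for every pair of selected-variable sets (e.g. one set per cross-validation fold) and take the mean.

```python
# Feature-selection stability as the average pairwise Jaccard index
# over several selections of variables.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stability(selections):
    """Average Jaccard index over all pairs of selections."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```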
Comparing Forward Selection and MRMR
Training and Validation of a Random Forest
- Use an ensemble of simple decision trees.
- Obtain the final classification as an average over all trees:
$s = \frac{1}{n_{\text{trees}}} \sum_{i=1}^{n_{\text{trees}}} s_i$
- 5-fold cross validation to validate the performance of the forest.
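The averaging formula above is trivial to write down in code. The "trees" here are stand-in callables returning a per-tree score $s_i$; real decision trees would take their place:

```python
# Forest output as the mean of the per-tree scores s_i,
# matching s = (1 / n_trees) * sum_i s_i.

def forest_score(trees, event):
    """Average the per-tree scores over all trees for one event."""
    return sum(tree(event) for tree in trees) / len(trees)
```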
Random Forest and Cross Validation in Detail (1)
- Background muons: 750,000 in total (CORSIKA, Polygonato); 600,000 available for training
- Neutrinos: 70,000 in total (NuGen, E^-2 spectrum); 56,000 available for training
- Sampling: 27,000 events drawn from each class
Random Forest and Cross Validation in Detail (2)
- 150,000 background muons available for testing; 14,000 neutrinos available for testing
- 27,000 events sampled per class
- Train, apply, repeat (×5)
- 500 trees
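The balanced sampling step, drawing an equal number of events from each class for every cross-validation round, can be sketched as below (27,000 per class in the analysis; the function and its arguments are illustrative assumptions):

```python
# Balanced per-class sampling for one cross-validation round:
# draw the same number of events from signal and background.
import random

def balanced_sample(signal, background, n, seed=0):
    """Draw n events from each class without replacement."""
    rng = random.Random(seed)
    return rng.sample(signal, n), rng.sample(background, n)
```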
Random Forest Output
We need an additional cut on the output of the Random Forest!
Random Forest Output: Cut at 500 Trees
- 28,830 ± 480 expected neutrino candidates
- Applied to experimental data, this yields 27,771 neutrino candidates
- Background rejection: 99.9999%
- Signal efficiency: 18.2%
- Estimated purity: (99.59 ± 0.37)%
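A "cut at 500 trees" amounts to keeping only events whose forest score sits at (or near) the maximum, i.e. essentially all trees vote "signal". A minimal sketch, where the threshold of 1.0 is an assumption for illustration; in the analysis the cut value is tuned on simulation:

```python
# Final selection cut on the random forest output score.

def select_candidates(scores, threshold=1.0):
    """Return indices of events passing the forest-score cut."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```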
Unfolding the Spectrum
TRUEE
This is no data mining... but it ain't magic either.
Moving on... IC79
- 212 neutrino candidates per day
- 66,885 neutrino candidates in total
- 330 ± 200 background muons
The entire analysis chain can be applied to other detector configurations, with minor changes (e.g. the ice model).
Summary and Outlook
- 99.9999% background rejection (MRMR + Random Forest)
- Purities above 99% are routinely achieved
- Future improvements? By starting at an earlier analysis level...
Backup Slides
RapidMiner in a Nutshell
- Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
- Operator-based, written in Java
- It used to be open source
- Many, many plugins due to a rather active community
- One of the most widely used data mining tools
What I like about it
- Data flow is nicely visualized and can be easily followed and comprehended
- Rather easy to learn, even without programming experience
- Large community (updates, bugfixes, plugins)
- Professional tool (they actually make money with it!)
- Good support
- Many tutorials can be found online, even specialized ones
- Most operators work like a charm
- Extendable
Relevance vs. Redundancy: MRMR (discrete case)
Relevance: $D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$
Redundancy: $R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$
MRMR: $\max(D - R)$ or $\max(D / R)$
where $I$ denotes the mutual information.
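The mutual information for discrete variables can be estimated from empirical joint frequencies. A minimal sketch (a plug-in estimator; the real analysis may bin or estimate differently):

```python
# Discrete mutual information I(X; Y) in bits, estimated from
# empirical joint and marginal frequencies.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```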
Feature Selection Stability
Jaccard: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Kuncheva: $I_C(A, B) = \frac{rn - k^2}{k(n - k)}$, with $r = |A \cap B|$, $k = |A| = |B|$, and $n$ the total number of features.
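Kuncheva's consistency index translates directly into code; it assumes two equal-size selections of k features out of n in total:

```python
# Kuncheva's consistency index I_C = (r*n - k^2) / (k*(n - k)),
# where r = |A ∩ B|, k = |A| = |B|, and n is the total feature count.

def kuncheva(a, b, n):
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k, "Kuncheva's index assumes equal-size selections"
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))
```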
Ensemble Methods
- With weights (e.g. boosting)
- Without weights (e.g. Random Forest)
Random Forest: What is randomized?
Randomness 1: Events the tree is trained on (bagging)
Randomness 2: Variables that are available for a split
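The two sources of randomness can be sketched in isolation. The functions below are illustrative stand-ins, not the forest implementation: one draws a bootstrap sample of events, the other the random variable subset offered at a split.

```python
# The two randomized ingredients of a random forest, in isolation.
import random

def bootstrap_sample(events, rng):
    """Randomness 1: draw events with replacement (bagging)."""
    return [rng.choice(events) for _ in events]

def split_candidates(variables, n_try, rng):
    """Randomness 2: random subset of variables available for a split."""
    return rng.sample(variables, n_try)
```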
Are we actually better than simpler methods?