Big Data Competition: Maximizing Your Potential (Exampled with the 2014 Higgs Boson Machine Learning Challenge)
DESCRIPTION
The Higgs Boson Machine Learning Challenge was one of the biggest data-analysis competitions in the world. To succeed in it, Cheng applied his knowledge of computer science, mathematics, statistics, and physics, along with the problem-solving discipline developed during his training in civil engineering. In this presentation, Cheng uses his experience in the competition to illustrate some important elements of big data analytics and why they matter. The content touches several disciplines, including physics, statistics, and mathematics, but no background in these areas is required to understand its essence. In brief, the presentation covers: an effective framework for general data mining projects; an introduction to the competition and its physics background; various techniques in data exploration and some traps to avoid; various ways of feature enhancement; model building and selection; and optimization of model performance.
TRANSCRIPT
BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL
EXAMPLED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE
Dr. Cheng CHEN email: [email protected]
twitter: @cheng_chen_us
Development Consulting International LLC
goDCI.com
1
This presentation is copyright protected ©
Ohio State University, Tongji University
Ph.D. Civil Engineering
M.S. Applied Statistics
Minor Computer Science
Advanced trainings:
City and Regional Planning
Industrial and Systems Engineering
Mathematics
Passion: (this) machine learning
PRESENTER
2
• Goal: improve the procedure that produces the selection region of the Higgs boson
• 4-month duration
• 1,785 teams
• Many machine learning experts, statisticians, and physicists
• Top 5 teams from 5 different countries
HIGGS BOSON MACHINE LEARNING CHALLENGE
3
Netherlands
Hungary
France
Russia
U.S.A/China
http://www.kaggle.com/c/higgs-boson/leaderboard
Background
4
(Framework diagram) Background: read, discuss. Data: understand, explore, enhance (visualize, find, reduce, generate, innovate). Model: train, select, optimize (apply, cross validate, fine-tune). Validate. ©
Background
5
READ AND DISCUSS
6
• a.k.a the God Particle (explains some mass)
• A fundamental particle theorized in 1964 in the Standard Model of Particle Physics
• “Considered” discovered in 2011 – 2013 at the LHC by CERN
• A number of prestigious awards in 2013, including a Nobel prize
HIGGS BOSON
7
http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg
A "definitive" answer might require "another few years" after the collider's 2015 restart. (deputy chair of physics at Brookhaven National Laboratory)
http://en.wikipedia.org/wiki/Higgs_boson
• Established in 1954
• Birth of World Wide Web (1989)
CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
8
maps.google.com
• 27 km (17 mi) in circumference
• 175 meters (574 ft) beneath ground
• Built from 1998 to 2008
• Over 10,000 scientists and engineers
• Over 100 countries
• Seven particle detectors
LARGE HADRON COLLIDER (LHC)
9
https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference
http://en.wikipedia.org/wiki/Large_Hadron_Collider
• 46 meters long
• 25 meters in diameter
• Weighs about 7,000 tonnes
• Contains some 3000 km of cable
• Involves roughly 3,000 physicists from over 175 institutions in 38 countries.
ATLAS
10
http://en.wikipedia.org/wiki/Large_Hadron_Collider
http://higgsml.lal.in2p3.fr/documentation/
• The Higgs boson cannot be measured directly (it decays immediately into lighter particles)
• Other particles can decay into the same set of lighter particles
• PRODUCTION and DECAY of the Higgs boson depend on its mass, which was not predicted by theory (now we know it is close to 125 GeV)
CHALLENGES IN DETECTION OF HIGGS BOSON
13
https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf
Seeing a circular shadow does not mean the real object is a sphere
• Raw data collected from LHC
• Hundreds of millions of proton-proton collisions (events) per second
• 400 events of interest are selected per second
– Signal events (i.e., the Higgs boson)
– Background events (i.e., other particles)
• Events in an ad hoc selection region (in certain channels) exceeding background noise
CURRENT DETECTION MECHANISM
14
Needs improvement in significance and robustness in selection criteria
• Simulated Data
• Fixed mass (125 GeV)
• Simplified decay channel
–Next Slide
• Simplified background events (three representative types only)
– Decay of the Z boson (91.2 GeV) into tau-tau
– Decay of a pair of top quarks into a lepton and a hadronic tau
– "Decay" of the W boson into a lepton and a hadronic tau due to imperfections in the particle identification procedure
• Simplified objective function (significance score)
SIMPLIFICATIONS FOR COMPETITION
15
• Decay of the tau-tau channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)
SIMPLIFIED DECAY CHANNEL
16
hadronic tau: a bunch of hadrons
SIMPLIFIED DECAY CHANNEL
18
Jets and MET: vectorized momenta are given
hadronic tau: a bunch of hadrons
Background
19
• 250,000 training
• 550,000 testing
• 30 variables
– 17 primitive (momenta, directions)
– 13 derived
DATA DIMENSION
20
4 rows in training data

EventId  DER_mass_MMC  DER_mass_transverse_met_lep  DER_mass_vis  DER_pt_h  DER_deltaeta_jet_jet  DER_mass_jet_jet  DER_prodeta_jet_jet  DER_deltar_tau_lep  DER_pt_tot  DER_sum_pt
100000   138.47        51.655                       97.827        27.98     0.91                  124.711           2.666                3.064               41.928      197.76
100001   160.937       68.768                       103.235       48.146    NA                    NA                NA                   3.473               2.078       125.157
100002   NA            162.172                      125.953       35.635    NA                    NA                NA                   3.148               9.336       197.814
100003   143.905       81.417                       80.943        0.414     NA                    NA                NA                   3.31                0.414       75.968

EventId  DER_pt_ratio_lep_tau  DER_met_phi_centrality  DER_lep_eta_centrality  PRI_tau_pt  PRI_tau_eta  PRI_tau_phi  PRI_lep_pt  PRI_lep_eta  PRI_lep_phi  PRI_met
100000   1.582                 1.396                   0.2                     32.638      1.017        0.381        51.626      2.273        -2.414       16.824
100001   0.879                 1.414                   NA                      42.014      2.039        -3.011       36.918      0.501        0.103        44.704
100002   3.776                 1.414                   NA                      32.154      -0.705       -2.093       121.409     -0.953       1.052        54.283
100003   2.354                 -1.285                  NA                      22.647      -1.655       0.01         53.321      -0.522       -3.1         31.082

EventId  PRI_met_phi  PRI_met_sumet  PRI_jet_num  PRI_jet_leading_pt  PRI_jet_leading_eta  PRI_jet_leading_phi  PRI_jet_subleading_pt  PRI_jet_subleading_eta  PRI_jet_subleading_phi  PRI_jet_all_pt
100000   -0.277       258.733        2            67.435              2.15                 0.444                46.062                 1.24                    -2.475                  113.497
100001   -1.916       164.546        1            46.226              0.725                1.158                NA                     NA                      NA                      46.226
100002   -2.186       260.414        1            44.251              2.053                -2.028               NA                     NA                      NA                      44.251
100003   0.06         86.062         0            NA                  NA                   NA                   NA                     NA                      NA                      0

EventId  Weight            Label
100000   0.00265331133733  s
100001   2.23358448717     b
100002   2.34738894364     b
100003   5.44637821192     b

Data loaded correctly. Notice the NA values.
MISSING VALUES
21
col_name                 NA_count  NA_pct
DER_mass_MMC               38,114     15%
DER_deltaeta_jet_jet      177,457     71%
DER_mass_jet_jet          177,457     71%
DER_prodeta_jet_jet       177,457     71%
DER_lep_eta_centrality    177,457     71%
PRI_jet_leading_pt         99,913     40%
PRI_jet_leading_eta        99,913     40%
PRI_jet_leading_phi        99,913     40%
PRI_jet_subleading_pt     177,457     71%
PRI_jet_subleading_eta    177,457     71%
PRI_jet_subleading_phi    177,457     71%
(all other columns, including EventId, Weight, and Label, have no missing values)
MISSING VALUES
22
Notice the consistency in missing values
• Assign a value
–Generate a random value
– Fit a value (mean, median, nearest neighbor, etc.)
– Fix a value (domain knowledge)
• Remove the record
• Leave as is
HOW TO HANDLE MISSING VALUES
23
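The strategies above can be sketched in a few lines. A minimal Python sketch on a made-up mini-column; in the actual challenge CSVs missing entries are encoded with the sentinel -999.0, but the values below are illustrative only:

```python
# Hypothetical mini-column using the challenge's NA sentinel (-999.0);
# the numbers themselves are made up for illustration.
NA = -999.0
col = [138.47, NA, 160.9, NA, 143.9]

observed = [v for v in col if v != NA]

# Strategy 1: assign a fitted value (here, the mean of the observed entries)
mean_val = sum(observed) / len(observed)
imputed = [mean_val if v == NA else v for v in col]

# Strategy 2: remove the record entirely
kept = [v for v in col if v != NA]

# Strategy 3: leave as is, but add an indicator column so the model can use
# "missingness" itself (useful here, since the NAs are structural, e.g.
# jet columns are undefined when PRI_jet_num is too small)
is_na = [1 if v == NA else 0 for v in col]
```

Which strategy wins depends on why the value is missing; the consistency noted on the previous slide suggests the missingness here carries information rather than noise.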
HISTOGRAM
25
(Histograms of PRI_jet_leading_pt: raw counts, log transformation, inverse transformation)
Density is more meaningful in the range of x; no fuzzy jump at the edge
HISTOGRAM (CONT’D)
26
(Histograms of DER_pt_h: raw counts, log transformation, inverse transformation)
Bi-modality is revealed
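The effect of a transformation on a histogram is easy to reproduce. A sketch with synthetic heavy-tailed data standing in for a momentum-like column (not the challenge data): the raw histogram piles everything into the first bins, while the log-scale histogram spreads the same counts out and puts the peak near the middle.

```python
import math
import random

random.seed(0)
# Lognormal values as a stand-in for a skewed, positive momentum column
values = [math.exp(random.gauss(3.5, 0.8)) for _ in range(10_000)]

def hist(xs, nbins):
    """Equal-width bin counts over the range of xs."""
    lo, hi = min(xs), max(xs)
    w = (hi - lo) / nbins or 1.0
    counts = [0] * nbins
    for x in xs:
        counts[min(int((x - lo) / w), nbins - 1)] += 1
    return counts

raw = hist(values, 20)                          # mass crammed into bin 0
logged = hist([math.log(v) for v in values], 20)  # roughly symmetric
```

With the raw scale, one bin dominates and the rest look empty; after the log transform the underlying bell shape is visible, which is exactly why the slides try several transformations per variable.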
INTERACTIVE VISUALIZATION R SHINY
27
http://chencheng.shinyapps.io/demo_higgsDEMO
INTERACTIVE VISUALIZATION R SHINY
28
http://chencheng.shinyapps.io/demo_higgsDEMO
INTERACTIVE VISUALIZATION R SHINY
29
Use a reasonable number of bins to display the underlying distribution
http://chencheng.shinyapps.io/demo_higgsDEMO
INTERACTIVE VISUALIZATION R SHINY
30
Use a reasonable transformation to display the underlying distribution
http://chencheng.shinyapps.io/demo_higgsDEMO
HISTOGRAM (CONT’D)
31
(Histogram of PRI_tau_eta)
Transformations are sometimes not necessary
32
Do that for all 30 variables
PAIRWISE CORRELATIONS
33
(Scatter plot and histograms of PRI_lep_phi & PRI_met_phi, background vs. signal)
PAIRWISE CORRELATIONS
34
(PRI_lep_phi & PRI_met_phi, background vs. signal)
Set the transparency parameter appropriately to reveal important patterns
PAIRWISE CORRELATIONS
35
(PRI_lep_phi & PRI_met_phi, background vs. signal)
Correlation coefficient == 0 does not mean no correlation
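The warning that a zero correlation coefficient does not mean no correlation has a classic demonstration: points on a circle, where x and y are completely dependent yet the Pearson coefficient is essentially zero.

```python
import math

# Points on the unit circle: x and y satisfy x**2 + y**2 == 1 exactly,
# yet their linear (Pearson) correlation is ~0.
n = 360
xs = [math.cos(math.radians(t)) for t in range(n)]
ys = [math.sin(math.radians(t)) for t in range(n)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

r = pearson(xs, ys)   # ~0 despite perfect (nonlinear) dependence
```

This is why the slides lean on scatter plots rather than correlation matrices: structure like the phi-phi pattern above is invisible to a single linear coefficient.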
FEATURE ENHANCEMENT: ROTATION
37
(Rotated PRI_lep_phi & PRI_met_phi, background vs. signal)
Validate visual "evidence" from various perspectives
PAIRWISE VARIABLES — LOW RES.
39
(DER_pt_h & DER_deltar_tau_lep, background vs. signal)
PAIRWISE VARIABLES — HIGH RES.
40
Try high resolution
(DER_pt_h & DER_deltar_tau_lep, background vs. signal)
PAIRWISE VARIABLES — HIGH RES.
41
Curve fitting
(DER_pt_h & DER_deltar_tau_lep, background vs. signal)
FEATURE ENHANCEMENT: CURVE FITTING
42
Enhance a variable based on its correlation with another variable
(DER_pt_h & DER_deltar_tau_lep, background vs. signal)
FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI
43
Domain knowledge
(DER_pt_h & PRI_lep_phi, background vs. signal)
FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI
44
Feature enhancement by applying domain knowledge
(DER_pt_h & PRI_lep_phi, background vs. signal)
FEATURE ENHANCEMENT: ROTATION
45
(PRI_jet_leading_eta & PRI_jet_subleading_eta, background vs. signal)
• Select variable(s): One var. for histogram, two var. for scatter plot
DATA DRILL DOWN
46
http://chencheng.shinyapps.io/demo_higgsDEMO
• Dynamically select a subset of data — PRI_jet_num = 2
DATA DRILL DOWN
47
http://chencheng.shinyapps.io/demo_higgsDEMO
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
DATA DRILL DOWN
48
http://chencheng.shinyapps.io/demo_higgsDEMO
• Dynamically select a subset of data — PRI_jet_num = 3
DATA DRILL DOWN
49
http://chencheng.shinyapps.io/demo_higgsDEMO
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
DATA DRILL DOWN
50
http://chencheng.shinyapps.io/demo_higgsDEMO
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
DATA DRILL DOWN
51
PRI_jet_num = 2 PRI_jet_num = 3
Interactive data visualization techniques are helpful
http://chencheng.shinyapps.io/demo_higgsDEMO
52
Do that for all 30 * 29 ~= 900 pairs
PARTICLE LOCATION — (0, S)
53
Convert numerical data back into actual objects with meaning
Animation
PARTICLE LOCATION — (0, B)
54
Animation
• Distance ratio between MET-Lep and Tau-Lep
d(MET, Lep)/d(Tau, Lep)
INSPIRATION FROM ANIMATION
55
Inspiration from meaningful visualization can be helpful
(Histogram of dist_ratio_met_lep_tau, background vs. signal)
• Distance ratio between MET-Lep and Tau-Lep
d(MET, Lep)/d(Tau, Lep)
INSPIRATION FROM ANIMATION
56
Adjust visualization for better efficiency
(Histograms of dist_ratio_met_lep_tau, background vs. signal)
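The slide gives only the formula d(MET, Lep)/d(Tau, Lep), so the sketch below has to assume a distance: azimuthal separation delta-phi (wrapped into [0, pi]), since MET carries no eta. The column names follow the challenge CSV; the sample values are taken from the first training row shown earlier.

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal separation, wrapped into [0, pi]."""
    d = abs(phi1 - phi2) % (2 * math.pi)
    return d if d <= math.pi else 2 * math.pi - d

def dist_ratio(met_phi, lep_phi, tau_phi):
    # Assumed reading of the slide's d(MET, Lep) / d(Tau, Lep)
    return delta_phi(met_phi, lep_phi) / delta_phi(tau_phi, lep_phi)

# Values from event 100000 (PRI_met_phi, PRI_lep_phi, PRI_tau_phi)
ratio = dist_ratio(met_phi=-0.277, lep_phi=-2.414, tau_phi=0.381)
```

Whatever the exact distance used in the talk, the point stands: a feature invented from watching the event animation, not from the column list, ended up separating signal from background.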
• Variable reduction
– Simple rotation
– Transformation
– Domain knowledge
– …
• Feature generation
– Domain knowledge
– Inspiration from various visualizations
– Statistical approaches
– …
FEATURE ENHANCEMENT
57
(Examples: principal component analysis, distance_ratio, rotation by phi, curve fitting, 45-degree rotation)
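The "simple rotation" idea can be shown in miniature. A sketch on toy points (not challenge data): rotating a correlated pair by 45 degrees turns a diagonal band into an axis-aligned one, so a single rotated coordinate carries the information that previously needed both.

```python
import math

def rotate(x, y, deg):
    """Rotate the point (x, y) by deg degrees about the origin."""
    t = math.radians(deg)
    return (x * math.cos(t) + y * math.sin(t),
            -x * math.sin(t) + y * math.cos(t))

# Toy points lying near the diagonal y = x (illustrative only)
pts = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.05), (3.0, 2.9)]
rotated = [rotate(x, y, 45) for x, y in pts]

# After rotation the second coordinate is ~0 for every point: the band
# collapsed onto the first axis, and coordinate 2 now measures only the
# deviation from the band -- a candidate variable to drop or to keep as
# a new feature, depending on what the scatter plots show.
```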
Background
58
• Gradient boosting tree
• Neural network
• Bayesian network
• Support vector machine
• Generalized additive model
MODELS
59
• Decision tree
–Build many shallow trees
• Boosting
–Build trees based on residual
• Bagging
– Each tree uses a subset of the data
• Ensembling
–Combine the trees
GRADIENT BOOSTING TREE
61
• Regression tree
DECISION TREE
63
(Scatter plot of toy data: y vs. x, with x in [0, 10] and y in [-1, 1])
• Regression tree
DECISION TREE
64
Depth = 1
(Regression tree with node depth = 1: split at x < 6.614; root mean 0.19, n=100; leaves -0.08, n=64 and 0.66, n=36)
• Regression tree
DECISION TREE
65
Depth = 2
(Regression tree with node depth = 2: splits at x < 6.614, x >= 3.049, x >= 8.953; leaves -0.53 n=40, 0.67 n=24, 0.086 n=7, 0.8 n=29)
• Regression tree
DECISION TREE
66
Depth = 3
(Regression tree with node depth = 3: additional splits at x < 5.862 and x < 7.207; leaves -0.67 n=32, 0.045 n=8, 0.67 n=24, 0.086 n=7, 0.57 n=7, 0.87 n=22)
• Regression tree
DECISION TREE
67
Depth = 4
(Regression tree with node depth = 4: additional split at x >= 3.594; leaves -0.8 n=25, -0.23 n=7, 0.045 n=8, 0.67 n=24, 0.086 n=7, 0.57 n=7, 0.87 n=22)
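What the tree-growing routine is doing at each node can be re-implemented in a few lines: try every split point and keep the one that minimises squared error when each side is predicted by its mean. A sketch on noise-free y = sin(x) (so the chosen cut will differ from the cut in the slides, which came from a noisy sample):

```python
import math

# 100 grid points on [0, 10), same toy problem shape as the slides
xs = [i * 0.1 for i in range(100)]
ys = [math.sin(x) for x in xs]

def sse(vals):
    """Sum of squared errors around the mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, ys):
    """Exhaustive search for the single best split point."""
    best = (float("inf"), None)
    for i in range(1, len(xs)):
        cut = (xs[i - 1] + xs[i]) / 2
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        best = min(best, (sse(left) + sse(right), cut))
    return best[1]

cut = best_split(xs, ys)
left_mean = sum(y for x, y in zip(xs, ys) if x < cut) / sum(x < cut for x in xs)
```

Growing the tree to depth 2, 3, 4 is just this search applied recursively to each side, which is why the fitted step function on the slides hugs the sine curve more closely at every depth.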
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
X = X0[Index_train]; Y = Y0[Index_train];
v_resid = Y - latest_model(X);
tree_add = train_tree(X, v_resid);
latest_model += LEARNING_RATE * tree_add
DECISION TREE
68
base model
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
X = X0[Index_train]; Y = Y0[Index_train];
v_resid = Y - latest_model(X);
tree_add = train_tree(X, v_resid);
latest_model += LEARNING_RATE * tree_add
GRADIENT BOOSTING TREE (V. 1)
69
get the residuals
fit a tree for residuals
additive model
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
X = X0[Index_train]; Y = Y0[Index_train];
v_resid = Y - latest_model(X);
tree_add = train_tree(X, v_resid);
latest_model += LEARNING_RATE * tree_add
(STOCHASTIC) GRADIENT BOOSTING TREE
70
get sampled index
sampled records as input
store input
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y, wts);
for ii = 1:NUM_ITER
Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
X = X0[Index_train]; Y = Y0[Index_train];
v_resid = Y - wts * latest_model(X);
tree_add = train_tree(X, v_resid, wts);
latest_model += LEARNING_RATE * tree_add
(STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT
71
X0 = X; Y0 = Y;
latest_model = train_base_model(X, Y, wts);
for ii = 1:NUM_ITER
Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
X = X0[Index_train]; Y = Y0[Index_train];
v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
model_add_base = train_base_model(X, v_pseudo_resid, wts);
alpha = linear_search(cost_function, model_add_base, X, Y, wts);
latest_model += LEARNING_RATE * (alpha * model_add_base)
(GENERAL) GRADIENT BOOSTING
72
[Stochastic Gradient Boosting] Jerome H. Friedman, 1999
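The pseudocode above translates directly to runnable Python. A compact sketch with squared loss (so the pseudo-residual is just the ordinary residual), one-split stumps standing in for train_tree, and a toy sine target in place of the challenge data; constant names mirror the slides:

```python
import math
import random

LEARNING_RATE, NUM_ITER, FRAC_TRAIN = 0.3, 50, 0.7

def fit_stump(xs, ys):
    """Depth-1 regression tree: one split, predict the side means."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    best = (float("inf"), 0.0, 0.0, 0.0)
    for k in range(1, n):
        cut = (xs[order[k - 1]] + xs[order[k]]) / 2
        L = [ys[i] for i in order[:k]]
        R = [ys[i] for i in order[k:]]
        ml, mr = sum(L) / len(L), sum(R) / len(R)
        err = sum((v - ml) ** 2 for v in L) + sum((v - mr) ** 2 for v in R)
        if err < best[0]:
            best = (err, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda x: ml if x < cut else mr

random.seed(1)
X0 = [random.uniform(0, 10) for _ in range(200)]
Y0 = [math.sin(x) for x in X0]

trees = [fit_stump(X0, Y0)]                     # base model

def predict(x):
    # base tree at full weight, boosted trees shrunk by the learning rate
    return trees[0](x) + sum(LEARNING_RATE * t(x) for t in trees[1:])

base_mse = sum((y - predict(x)) ** 2 for x, y in zip(X0, Y0)) / len(X0)

for _ in range(NUM_ITER):
    idx = random.sample(range(len(X0)), int(FRAC_TRAIN * len(X0)))  # bagging
    X = [X0[i] for i in idx]
    Y = [Y0[i] for i in idx]
    resid = [y - predict(x) for x, y in zip(X, Y)]  # residuals of current model
    trees.append(fit_stump(X, resid))               # fit next tree to residuals

mse = sum((y - predict(x)) ** 2 for x, y in zip(X0, Y0)) / len(X0)
```

The training error drops far below that of the single base stump, which is the whole argument for the additive model: each shallow tree only has to fix what the ensemble so far gets wrong.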
Background
73
gbm_model = gbm.fit(
x=train[,x_vars, with = FALSE],
y=train$Label,
distribution = char_distr,
w = w,
n.trees = n_trees,
interaction.depth = num_inter,
n.minobsinnode = min_obs_node,
shrinkage = shrinkage_rate,
bag.fraction = frac_bag)
APPLYING GBM IN R
74
VARIABLE IMPORTANCE
75
(Bar chart of relative importance by variable)
APPLY MODEL ON TEST DATA
76
EventId Score RankOrder Class
1 0.98 501 s
2 0.42 259,579 b
3 0.46 264,125 b
. . . .
. . . .
449,998 0.86 31,154 s
449,999 0.12 489,251 b
550,000 0.79 110,154 b
Background
77
• Number of iterations
• Minimum observations per node
• Fraction of bagging (0.5 ~ 0.8)
• Learning rate (<0.1)
• Depth of tree (4 ~ 8)
GRADIENT BOOSTING PARAMETERS
78
Background
79
• Split training data
– 70% for training
– 30% for cross validation
• Train model (70%)
• Measure performance (30%)
CROSS VALIDATION
80
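The split described above, sketched on stand-in record IDs (250,000 events, as in the training set):

```python
import random

random.seed(42)                      # seed chosen here for reproducibility
records = list(range(250_000))       # stand-in event IDs
random.shuffle(records)

cut = int(0.7 * len(records))
train_idx = records[:cut]            # 70%: train the model
valid_idx = records[cut:]            # 30%: measure performance
```

Shuffling before splitting matters: if the file has any ordering (by channel, by weight), a naive head/tail split would put different populations on the two sides.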
PERFORMANCE BASED ON AMS
81
Trade-off between:
– Ratio of signal/background events
– Number of records in the selection region

EventId   Score  RankOrder  Class  truth
1         0.98   501        S      S
2         0.42   259,579    B
3         0.46   264,125    B
...
449,998   0.86   31,154     S      B
449,999   0.12   489,251    B
550,000   0.79   110,154    B

Selection Region: s = sum(S), b = sum(B)
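The score itself is the approximate median significance (AMS) from the challenge documentation, a function of the weighted sums s (signal) and b (background) inside the selection region, with regularisation constant b_reg = 10:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance, as defined by the challenge."""
    return math.sqrt(2.0 * ((s + b + b_reg)
                            * math.log(1.0 + s / (b + b_reg)) - s))

# The trade-off on this slide: widening the selection region raises s but
# also b, and extra background at fixed signal always lowers the score.
```

For s small relative to b, AMS behaves like the familiar s / sqrt(b + b_reg), which makes the significance interpretation concrete.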
PERFORMANCE BASED ON AMS
82
(AMS vs. percentile and AMS vs. percentage of signal)
COMPARE TWO MODEL RESULTS
83
(AMS vs. percentile and vs. percentage of signal, training and cross validation)
COMPARE TWO MODEL RESULTS
84
(AMS vs. percentile and vs. percentage of signal, training and cross validation)
AMS BY NUM. ITERATION
85
Animation
(AMS vs. percentile as the number of iterations grows)
Background
86
HEAT MAP OF AMS ON B-S PLAN
87
(axes: s and b; annotation: >> 4)
OPTIMIZATION BASED ON OBJECTIVE FUNCTION
88
(AMS vs. percentile, with operating points A, B, C)
HEAT MAP OF AMS ON B-S PLAN
89
(points A, B, C on the b-s plane)
HEAT MAP OF AMS ON B-S PLAN
90
(points A, B, C on the b-s plane)
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function
AMS CURVE ON B-S PLAN
91
(points A, B, C; partial derivative of AMS against s; partial derivative of AMS against b)
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function
Ratio of the derivatives ==> relative weight
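The weighting trick can be checked numerically: differentiate the AMS objective (definition from the challenge documentation, b_reg = 10) and take the ratio of the partials as the relative event weight. The operating point below is made up for illustration:

```python
import math

def ams(s, b, b_reg=10.0):
    return math.sqrt(2.0 * ((s + b + b_reg)
                            * math.log(1.0 + s / (b + b_reg)) - s))

def ams_partials(s, b, h=1e-6):
    """Central-difference partial derivatives of AMS w.r.t. s and b."""
    d_s = (ams(s + h, b) - ams(s - h, b)) / (2 * h)
    d_b = (ams(s, b + h) - ams(s, b - h)) / (2 * h)
    return d_s, d_b

s, b = 200.0, 1500.0            # illustrative operating point, not real totals
d_s, d_b = ams_partials(s, b)
rel_weight = d_s / -d_b         # ratio of the derivatives -> relative weight
```

Since d_s is positive and d_b negative, one extra signal event is worth many background events near a typical operating point, which is exactly the asymmetry the reweighted training exploits.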
IMPROVEMENT DUE TO WEIGHTING
92
(AMS and AMS* vs. Num_Iterations)
IMPROVEMENT DUE TO WEIGHTING (CONT’D)
93
(AMS and AMS* vs. Num_Iterations)
AUGMENTED GRADIENT BOOSTING
94
(Loop: Apply GBM -> Weight Adjustment) ©
AUGMENTED GRADIENT BOOSTING
95
(Loop: Apply GBM -> Weight Adjustment -> Remove very high and very low score records from train and test) ©
IMPROVEMENT DUE TO ELIMINATION
96
(AMS and AMS* vs. Num_Iterations)
IMPROVEMENT DUE TO ELIMINATION (CONT’D)
97
(AMS and AMS* vs. Num_Iterations)
AUGMENTED GRADIENT BOOSTING
98
(Loop: Apply ML Model -> Weight Adjustment -> Remove very high and very low score records from train and test) ©
Background
99
• Version control (Git, SourceTree)
– Effectively implement many different ideas
• File organization
– Efficiently pull out the file needed
• Effective code (R, Python)
– It matters greatly when dealing with big data
OTHER TOPICS
100
Thank you for your participation!
Any Questions?
goDCI.com