what is wrong with data challenges
TRANSCRIPT
Center for Data Science Paris-Saclay
CNRS & Université Paris-Saclay Center for Data Science
BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
Why am I so critical?
Why do I mitigate our own success with the HiggsML?
Because I believe that there is enormous potential in
open innovation/crowdsourcing in science.
The current data challenge format is a single point in the landscape.
INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » → EXPLOSION OF TOOLS (Olga Kokshagina, 2015)
• Crowdsourcing is a model that leverages novel technologies (Web 2.0, mobile apps, social networks)
• to build content and a structured set of information by gathering contributions from large groups of individuals
CROWDSOURCING ANNOTATION
CROWDSOURCING COLLECTION AND ANNOTATION
CROWDSOURCING MATH
CROWDSOURCING ANALYTICS
OPEN SOURCE
NEW PUBLICATION MODELS
THE BOOK TO READ
OUTLINE
• Summary of our conclusions after the HiggsML challenge
• the good, the bad, and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping (RAMP)
• an experimental format we have been developing
CIML WORKSHOP TOMORROW
THE GOOD
• Publicity, awareness
• both in physics (about the technology) and in ML (about the problem)
• Triggering open data
• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (xgboost, keras)
THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate
THE UGLY
• 18 months to prepare
• legal issues, access to data
• problem formulation: intellectually far more interesting than the challenge itself, but difficult to “market” or to crowdsource
• once a problem is formalized and formatted into a challenge, the problem is essentially solved (“learning is easy”, as Gaël Varoquaux puts it)
THE UGLY
• We asked the wrong question, on purpose!
• because the right questions are complex and don’t fit the challenge setup
• would have led to way less participation
• would have led to bitterness among the participants, bad (?) for marketing
PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
• https://www.kaggle.com/c/higgs-boson
PUBLICITY, AWARENESS
[Embedded slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL deck “Learning to discover”]
AWARENESS DYNAMICS
• HEPML workshop @NIPS14
• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
• http://indico.cern.ch/event/395374/
• Flavors of physics challenge
• https://www.kaggle.com/c/flavours-of-physics
LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step for pro participants: checking whether the effort is worth it (risk assessment)
• variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
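The CV-bagging idea above can be sketched as follows. This is a minimal illustration, assuming a toy one-feature dataset and a deliberately trivial threshold “model”; it is not the winner’s actual pipeline. The point is the structure: train one model per random train/validation split, then average the predictions of all the models at test time instead of refitting a single model.

```python
import random
from statistics import mean

random.seed(0)

# Toy data: one feature; signal (label 1) shifted from background (label 0).
X = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(1.5, 1) for _ in range(200)]
y = [0] * 200 + [1] * 200

def fit(X_tr, y_tr):
    # Hypothetical minimal "model": the midpoint between the two class means.
    m0 = mean(x for x, t in zip(X_tr, y_tr) if t == 0)
    m1 = mean(x for x, t in zip(X_tr, y_tr) if t == 1)
    return 0.5 * (m0 + m1)

def predict(threshold, x):
    return 1.0 if x > threshold else 0.0

# CV bagging: one model per random 80/20 train/validation split,
# predictions averaged over all K models at test time.
K = 10
models = []
for _ in range(K):
    idx = random.sample(range(len(X)), int(0.8 * len(X)))  # training fold
    models.append(fit([X[i] for i in idx], [y[i] for i in idx]))

X_test = [-1.0, 0.75, 3.0]
bagged = [mean(predict(m, x) for m in models) for x in X_test]
print(bagged)  # clear background -> 0.0, clear signal -> 1.0, boundary point in between
```

Averaging over folds also gives a built-in spread of out-of-fold scores, which is exactly the variance estimate mentioned above.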
BENCHMARKING
[Embedded slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL deck “Learning to discover”]
But what score did we optimize?
And why?
CLASSIFICATION FOR DISCOVERY

Goal: optimize the expected discovery significance.

[Figure: signal and background probability densities, and expected counts per year (flux × time), as a function of the selection threshold]

Example: expected background after the selection, say b = 100 events; total count, say 150 events; the excess is s = 50 events, so AMS = s/√b = 50/√100 = 5 sigma.
…approaches a simple asymptotic form related to the chi-squared distribution in the large-sample limit. In practice the asymptotic formulae are found to provide a useful approximation even for moderate data samples (see, e.g., [6]). Assuming that these hold, the p-value of the background-only hypothesis from an observed value of q_0 is found to be

p = 1 - \Phi(\sqrt{q_0}),  (11)

where \Phi is the standard Gaussian cumulative distribution. In particle physics it is customary to convert the p-value into the equivalent significance Z, defined as

Z = \Phi^{-1}(1 - p),  (12)

where \Phi^{-1} is the standard normal quantile. Eqs. (11) and (12) lead therefore to the simple result

Z = \sqrt{q_0} = \sqrt{2\left(n \ln\frac{n}{\mu_b} - n + \mu_b\right)}  (13)

if n > \mu_b, and Z = 0 otherwise. The quantity Z measures the statistical significance in units of standard deviations or “sigmas”. Often in particle physics a significance of at least Z = 5 (a five-sigma effect) is regarded as sufficient to claim a discovery. This corresponds to finding the p-value less than 2.9 × 10^-7. [11]
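The p ↔ Z conversions of Eqs. (11)-(12) are easy to check numerically; a quick sketch using the Python standard library, with NormalDist playing the role of Φ:

```python
from statistics import NormalDist

phi = NormalDist()  # standard normal distribution

def z_to_p(z):
    # Eq. (11): p = 1 - Phi(Z)
    return 1.0 - phi.cdf(z)

def p_to_z(p):
    # Eq. (12): Z = Phi^{-1}(1 - p)
    return phi.inv_cdf(1.0 - p)

# A five-sigma effect corresponds to a p-value of about 2.9e-7:
print(z_to_p(5.0))  # ~2.87e-7
```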
4.2 The median discovery significance

Eq. (13) represents the significance that we would obtain for a given number of events n observed in the search region G, knowing the background expectation \mu_b. When optimizing the design of the classifier g which defines the search region G = {x : g(x) = s}, we do not know n and \mu_b. As usual in empirical risk minimization [9], we estimate the expectation \mu_b by its empirical counterpart b from Eq. (5). We then replace n by s + b to obtain the approximate median significance

AMS_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)}.  (14)

Taking into consideration that (x + 1)\ln(x + 1) = x + x^2/2 + O(x^3), AMS_2 can be rewritten as

AMS_2 = AMS_3 \times \sqrt{1 + O\left(\left(\frac{s}{b}\right)^3\right)},

where

AMS_3 = \frac{s}{\sqrt{b}}.  (15)

The two criteria Eqs. (14) and (15) are practically indistinguishable when b ≫ s. This approximation often holds in practice and may, depending on the chosen search region, be a valid surrogate in the Challenge.
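As a numerical sanity check of Eqs. (14) and (15), here is a short sketch using the toy numbers s = 50, b = 100 from the discovery-significance example earlier in the deck:

```python
import math

def ams2(s, b):
    # Eq. (14): AMS2 = sqrt(2((s + b) ln(1 + s/b) - s))
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def ams3(s, b):
    # Eq. (15): leading-order approximation AMS3 = s / sqrt(b)
    return s / math.sqrt(b)

print(ams3(50, 100))  # 5.0: the "5 sigma" toy example
print(ams2(50, 100))  # ~4.65: close, but s/b = 0.5 is not small here

# The two criteria agree closely when b >> s:
print(ams2(10, 1000), ams3(10, 1000))  # ~0.3157 vs ~0.3162
```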
In preliminary runs it happened sometimes that AMS_2 was maximized in small selection regions G, resulting in a large variance of the AMS. While large variance in the real analysis is not necessarily a problem, it would make it difficult to reliably compare the participants of the Challenge if the optimal region was small. So, in order to decrease the variance of the AMS, we decided to bias the optimal selection region towards larger regions by adding an artificial shift b_reg to b. The value b_reg = 10 was determined using preliminary experiments.

[11] This extremely high threshold for statistical significance is motivated by a number of factors related to multiple testing, accounting for mismodeling, and the high standard one would like to require for an important discovery.
How to handle systematic (model) uncertainties?

• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background b ~ N(\mu_b, \sigma_b):

L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / 2\sigma_b^2}

• Profile likelihood ratio \lambda(0) = L(0, \hat{\hat{\mu}}_b) / L(\hat{\mu}_s, \hat{\mu}_b)
• The new Approximate Median Significance (by Glen Cowan):

AMS = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}

where

b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)
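A direct transcription of the new AMS into Python (a sketch for sanity-checking only, not official analysis code). As σ_b → 0 the penalty term vanishes, b_0 → b, and the expression falls back to the old AMS of Eq. (14); a sizable σ_b dilutes the significance:

```python
import math

def ams_old(s, b):
    # Old AMS (Eq. 14), no systematics
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def ams_new(s, b, sigma_b):
    # New AMS with background systematic uncertainty sigma_b (Glen Cowan's formula)
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2)**2 + 4.0 * (s + b) * sigma_b**2))
    return math.sqrt(2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0)**2 / sigma_b**2)

print(ams_old(50, 100))         # ~4.65
print(ams_new(50, 100, 1e-6))   # ~4.65: sigma_b -> 0 recovers the old AMS
print(ams_new(50, 100, 10.0))   # ~3.29: systematics dilute the significance
```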
HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?
[The new AMS formula again, with a figure comparing the new AMS, the ATLAS analysis, and the old AMS]
THE TWO MOST COMMON DATA CHALLENGE KILLERS
Leakage
Variance of the test score
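The second killer is easy to quantify: even on a frozen test set, the score is a random variable. A minimal bootstrap sketch, with simulated per-event correctness and made-up numbers (true accuracy 0.8, 1000 test events):

```python
import random

random.seed(0)

# Simulated per-event correctness of a classifier on the test set
# (all numbers are illustrative, not from the HiggsML data).
n = 1000
correct = [random.random() < 0.8 for _ in range(n)]

# Bootstrap: resample the test set with replacement, recompute the score.
scores = []
for _ in range(2000):
    sample = [correct[random.randrange(n)] for _ in range(n)]
    scores.append(sum(sample) / n)

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
print(std)  # ~0.013
```

For accuracy the bootstrap standard deviation matches the binomial formula √(p(1−p)/n) ≈ 0.013 here; two submissions whose scores differ by less than a few of these units cannot be reliably ranked.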
VARIANCE OF THE TEST SCORE
DATA CHALLENGES
• Challenges are useful for
• generating visibility in the data science community for novel application domains
• benchmarking state-of-the-art techniques in a fair way on well-defined problems
• finding talented data scientists
• Limitations
• not necessarily adapted to solving complex and open-ended data science problems in realistic environments
• no direct access to solutions and data scientists
• no incentive to collaborate
We decided to design something better
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking
WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• The need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far
UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
The Paris-Saclay Center for Data Science: Data Science for scientific Data
250 researchers in 35 laboratories

Domain science:
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Data science:
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

[Diagram: data science (statistics, machine learning, information retrieval, signal processing, data visualization, databases), domain science (human society, life, brain, earth, universe), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles: data scientist, applied scientist, domain scientist, data engineer, software engineer]

datascience-paris-saclay.fr
@SaclayCDS
LIST/CEA
THE DATA SCIENCE LANDSCAPE

[Diagram: domain science (energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain), data science (statistics, machine learning, information retrieval, signal processing, data visualization, databases), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer]
https://medium.com/@balazskegl
TOOLS: LANDSCAPE TO ECOSYSTEM
[Diagram: the same landscape turned into an ecosystem; the roles (data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer) connect data science, tool building, and the data domains through concrete CDS instruments:]
• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges
• coding sprints • Open Software Initiative • code consolidator and engineering projects
• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform
NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
• independent subtasks
• reduces barriers
• broadens the range of available expertise
• Encouraging small contributions
• Rich and well-structured information commons
• so people can build on earlier work
RAMPS
• Single-day coding sessions
• 20-40 participants
• preparation is similar to challenges
• Goals
• focusing and motivating top talents
• promoting collaboration, speed, and efficiency
• solving (prototyping) real problems
TRAINING SPRINTS
• Single-day training sessions
• 20-40 participants
• focusing on a single subject (deep learning, model tuning, functional data, etc.)
• preparing RAMPs
ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
ANALYTICS TOOLS TO MONITOR PROGRESS
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Jan 15 The HiggsML challenge
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Apr 10 Classifying variable stars
VARIABLE STARS
VARIABLE STARS
accuracy improvement: 89% to 96%
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 June 16 and Sept 26 Predicting El Niño
RAPID ANALYTICS AND MODEL PROTOTYPING
RMSE improvement: 0.9˚C to 0.4˚C
2015 October 8 Insect classification
RAPID ANALYTICS AND MODEL PROTOTYPING
RAPID ANALYTICS AND MODEL PROTOTYPING
accuracy improvement: 30% to 70%
CONCLUSIONS
• Explore the open innovation space
• read Nielsen’s book
• Drop me an email ([email protected]) if you are interested in beta-testing the RAMP tool
• Come to our CIML WS tomorrow
THANK YOU!