what is wrong with data challenges
TRANSCRIPT
Center for Data Science Paris-Saclay
CNRS & Université Paris-Saclay Center for Data Science
BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
Why am I so critical?
Why do I mitigate our own success with the HiggsML?
Because I believe that there is enormous potential in
open innovation/crowdsourcing in science.
The current data challenge format is a single point in the landscape.
INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » → EXPLOSION OF TOOLS (Olga Kokshagina, 2015)
• Crowdsourcing is a model that leverages novel technologies (Web 2.0, mobile apps, social networks)
• to build content and a structured set of information by gathering contributions from large groups of individuals
CROWDSOURCING ANNOTATION
CROWDSOURCING COLLECTION AND ANNOTATION
CROWDSOURCING MATH
CROWDSOURCING ANALYTICS
OPEN SOURCE
NEW PUBLICATION MODELS
THE BOOK TO READ
OUTLINE
• Summary of our conclusions after the HiggsML challenge
• the good, the bad, and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping (RAMP)
• an experimental format we have been developing
CIML WORKSHOP TOMORROW
THE GOOD
• Publicity, awareness
• both in physics (about the technology) and in ML (about the problem)
• Triggering open data
• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (xgboost, keras)
THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate
THE UGLY
• 18 months to prepare
• legal issues, access to data
• problem formulation: intellectually far more interesting than the challenge itself, but difficult to “market” or to crowdsource
• once a problem is formalized and formatted into a challenge, the problem is essentially solved (“learning is easy”, as Gaël Varoquaux puts it)
THE UGLY
• We asked the wrong question, on purpose!
• because the right questions are complex and don’t fit the challenge setup
• would have led to way less participation
• would have led to bitterness among the participants, bad (?) for marketing
PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
• https://www.kaggle.com/c/higgs-boson
PUBLICITY, AWARENESS
[Embedded slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL deck “Learning to discover”]
AWARENESS DYNAMICS
• HEPML workshop @NIPS14
• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
• http://indico.cern.ch/event/395374/
• Flavors of physics challenge
• https://www.kaggle.com/c/flavours-of-physics
LEARNING FROM THE WINNER
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step for pro participants: checking whether the effort is worth it (risk assessment)
• variance estimate of the score
• Don’t use the public leaderboard score for model selection
• None of Gábor’s 200 out-of-the-ordinary ideas worked
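The CV-bagging idea above can be sketched as follows. This is a minimal illustration, assuming a toy one-feature dataset and a deliberately trivial threshold “model”; it is not the winner’s actual pipeline. The point is the structure: train one model per random train/validation split, then average the predictions of all the models at test time instead of refitting a single model.

```python
import random
from statistics import mean

random.seed(0)

# Toy data: one feature; signal (label 1) shifted from background (label 0).
X = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(1.5, 1) for _ in range(200)]
y = [0] * 200 + [1] * 200

def fit(X_tr, y_tr):
    # Hypothetical minimal "model": the midpoint between the two class means.
    m0 = mean(x for x, t in zip(X_tr, y_tr) if t == 0)
    m1 = mean(x for x, t in zip(X_tr, y_tr) if t == 1)
    return 0.5 * (m0 + m1)

def predict(threshold, x):
    return 1.0 if x > threshold else 0.0

# CV bagging: one model per random 80/20 train/validation split,
# predictions averaged over all K models at test time.
K = 10
models = []
for _ in range(K):
    idx = random.sample(range(len(X)), int(0.8 * len(X)))  # training fold
    models.append(fit([X[i] for i in idx], [y[i] for i in idx]))

X_test = [-1.0, 0.75, 3.0]
bagged = [mean(predict(m, x) for m in models) for x in X_test]
print(bagged)  # clear background -> 0.0, clear signal -> 1.0, boundary point in between
```

Averaging over folds also gives a built-in spread of out-of-fold scores, which is exactly the variance estimate mentioned above.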
BENCHMARKING
[Embedded slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL deck “Learning to discover”]
But what score did we optimize?
And why?
CLASSIFICATION FOR DISCOVERY

Goal: optimize the expected discovery significance.

[Figure: signal and background probability densities, and expected counts per year (flux × time), as a function of the selection threshold]

Example: expected background after the selection, say b = 100 events; total count, say 150 events; the excess is s = 50 events, so AMS = s/√b = 50/√100 = 5 sigma.
…approaches a simple asymptotic form related to the chi-squared distribution in the large-sample limit. In practice the asymptotic formulae are found to provide a useful approximation even for moderate data samples (see, e.g., [6]). Assuming that these hold, the p-value of the background-only hypothesis from an observed value of q_0 is found to be

p = 1 - \Phi(\sqrt{q_0}),  (11)

where \Phi is the standard Gaussian cumulative distribution. In particle physics it is customary to convert the p-value into the equivalent significance Z, defined as

Z = \Phi^{-1}(1 - p),  (12)

where \Phi^{-1} is the standard normal quantile. Eqs. (11) and (12) lead therefore to the simple result

Z = \sqrt{q_0} = \sqrt{2\left(n \ln\frac{n}{\mu_b} - n + \mu_b\right)}  (13)

if n > \mu_b, and Z = 0 otherwise. The quantity Z measures the statistical significance in units of standard deviations or “sigmas”. Often in particle physics a significance of at least Z = 5 (a five-sigma effect) is regarded as sufficient to claim a discovery. This corresponds to finding the p-value less than 2.9 × 10^-7. [11]
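The p ↔ Z conversions of Eqs. (11)-(12) are easy to check numerically; a quick sketch using the Python standard library, with NormalDist playing the role of Φ:

```python
from statistics import NormalDist

phi = NormalDist()  # standard normal distribution

def z_to_p(z):
    # Eq. (11): p = 1 - Phi(Z)
    return 1.0 - phi.cdf(z)

def p_to_z(p):
    # Eq. (12): Z = Phi^{-1}(1 - p)
    return phi.inv_cdf(1.0 - p)

# A five-sigma effect corresponds to a p-value of about 2.9e-7:
print(z_to_p(5.0))  # ~2.87e-7
```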
4.2 The median discovery significance

Eq. (13) represents the significance that we would obtain for a given number of events n observed in the search region G, knowing the background expectation \mu_b. When optimizing the design of the classifier g which defines the search region G = {x : g(x) = s}, we do not know n and \mu_b. As usual in empirical risk minimization [9], we estimate the expectation \mu_b by its empirical counterpart b from Eq. (5). We then replace n by s + b to obtain the approximate median significance

AMS_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)}.  (14)

Taking into consideration that (x + 1)\ln(x + 1) = x + x^2/2 + O(x^3), AMS_2 can be rewritten as

AMS_2 = AMS_3 \times \sqrt{1 + O\left(\left(\frac{s}{b}\right)^3\right)},

where

AMS_3 = \frac{s}{\sqrt{b}}.  (15)

The two criteria Eqs. (14) and (15) are practically indistinguishable when b ≫ s. This approximation often holds in practice and may, depending on the chosen search region, be a valid surrogate in the Challenge.
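As a numerical sanity check of Eqs. (14) and (15), here is a short sketch using the toy numbers s = 50, b = 100 from the discovery-significance example earlier in the deck:

```python
import math

def ams2(s, b):
    # Eq. (14): AMS2 = sqrt(2((s + b) ln(1 + s/b) - s))
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def ams3(s, b):
    # Eq. (15): leading-order approximation AMS3 = s / sqrt(b)
    return s / math.sqrt(b)

print(ams3(50, 100))  # 5.0: the "5 sigma" toy example
print(ams2(50, 100))  # ~4.65: close, but s/b = 0.5 is not small here

# The two criteria agree closely when b >> s:
print(ams2(10, 1000), ams3(10, 1000))  # ~0.3157 vs ~0.3162
```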
In preliminary runs it happened sometimes that AMS_2 was maximized in small selection regions G, resulting in a large variance of the AMS. While large variance in the real analysis is not necessarily a problem, it would make it difficult to reliably compare the participants of the Challenge if the optimal region was small. So, in order to decrease the variance of the AMS, we decided to bias the optimal selection region towards larger regions by adding an artificial shift b_reg to b. The value b_reg = 10 was determined using preliminary experiments.

[11] This extremely high threshold for statistical significance is motivated by a number of factors related to multiple testing, accounting for mismodeling, and the high standard one would like to require for an important discovery.
How to handle systematic (model) uncertainties?

• OK, so let’s design an objective function that can take background systematics into consideration
• Likelihood with unknown background b ~ N(\mu_b, \sigma_b):

L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / 2\sigma_b^2}

• Profile likelihood ratio \lambda(0) = L(0, \hat{\hat{\mu}}_b) / L(\hat{\mu}_s, \hat{\mu}_b)
• The new Approximate Median Significance (by Glen Cowan):

AMS = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}

where

b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\sigma_b^2}\right)
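A direct transcription of the new AMS into Python (a sketch for sanity-checking only, not official analysis code). As σ_b → 0 the penalty term vanishes, b_0 → b, and the expression falls back to the old AMS of Eq. (14); a sizable σ_b dilutes the significance:

```python
import math

def ams_old(s, b):
    # Old AMS (Eq. 14), no systematics
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def ams_new(s, b, sigma_b):
    # New AMS with background systematic uncertainty sigma_b (Glen Cowan's formula)
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2)**2 + 4.0 * (s + b) * sigma_b**2))
    return math.sqrt(2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0)**2 / sigma_b**2)

print(ams_old(50, 100))         # ~4.65
print(ams_new(50, 100, 1e-6))   # ~4.65: sigma_b -> 0 recovers the old AMS
print(ams_new(50, 100, 10.0))   # ~3.29: systematics dilute the significance
```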
HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn’t we use it?
[The new AMS formula again, with a figure comparing the new AMS, the ATLAS analysis, and the old AMS]
THE TWO MOST COMMON DATA CHALLENGE KILLERS
Leakage
Variance of the test score
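The second killer is easy to quantify: even on a frozen test set, the score is a random variable. A minimal bootstrap sketch, with simulated per-event correctness and made-up numbers (true accuracy 0.8, 1000 test events):

```python
import random

random.seed(0)

# Simulated per-event correctness of a classifier on the test set
# (all numbers are illustrative, not from the HiggsML data).
n = 1000
correct = [random.random() < 0.8 for _ in range(n)]

# Bootstrap: resample the test set with replacement, recompute the score.
scores = []
for _ in range(2000):
    sample = [correct[random.randrange(n)] for _ in range(n)]
    scores.append(sum(sample) / n)

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
print(std)  # ~0.013
```

For accuracy the bootstrap standard deviation matches the binomial formula √(p(1−p)/n) ≈ 0.013 here; two submissions whose scores differ by less than a few of these units cannot be reliably ranked.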
VARIANCE OF THE TEST SCORE
DATA CHALLENGES
• Challenges are useful for
• generating visibility in the data science community for novel application domains
• benchmarking state-of-the-art techniques in a fair way on well-defined problems
• finding talented data scientists
• Limitations
• not necessarily adapted to solving complex and open-ended data science problems in realistic environments
• no direct access to solutions and data scientists
• no incentive to collaborate
We decided to design something better
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking
WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• The need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen’s book: Reinventing Discovery
• 5+ iterations so far
UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
The Paris-Saclay Center for Data Science: Data Science for scientific Data
250 researchers in 35 laboratories

Domain science:
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Data science:
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

[Diagram: data science (statistics, machine learning, information retrieval, signal processing, data visualization, databases), domain science (human society, life, brain, earth, universe), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles: data scientist, applied scientist, domain scientist, data engineer, software engineer]

datascience-paris-saclay.fr
@SaclayCDS
LIST/CEA
THE DATA SCIENCE LANDSCAPE

[Diagram: domain science (energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain), data science (statistics, machine learning, information retrieval, signal processing, data visualization, databases), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer]
https://medium.com/@balazskegl
TOOLS: LANDSCAPE TO ECOSYSTEM
[Diagram: the same landscape turned into an ecosystem; the roles (data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer) connect data science, tool building, and the data domains through concrete CDS instruments:]
• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges
• coding sprints • Open Software Initiative • code consolidator and engineering projects
• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform
NIELSEN’S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
• independent subtasks
• reduces barriers
• broadens the range of available expertise
• Encouraging small contributions
• Rich and well-structured information commons
• so people can build on earlier work
RAMPS
• Single-day coding sessions
• 20-40 participants
• preparation is similar to challenges
• Goals
• focusing and motivating top talents
• promoting collaboration, speed, and efficiency
• solving (prototyping) real problems
TRAINING SPRINTS
• Single-day training sessions
• 20-40 participants
• focusing on a single subject (deep learning, model tuning, functional data, etc.)
• preparing RAMPs
ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
ANALYTICS TOOLS TO MONITOR PROGRESS
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Jan 15 The HiggsML challenge
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 Apr 10 Classifying variable stars
VARIABLE STARS
VARIABLE STARS
accuracy improvement: 89% to 96%
RAPID ANALYTICS AND MODEL PROTOTYPING
2015 June 16 and Sept 26 Predicting El Niño
RAPID ANALYTICS AND MODEL PROTOTYPING
RMSE improvement: 0.9˚C to 0.4˚C
2015 October 8 Insect classification
RAPID ANALYTICS AND MODEL PROTOTYPING
RAPID ANALYTICS AND MODEL PROTOTYPING
accuracy improvement: 30% to 70%
CONCLUSIONS
• Explore the open innovation space
• read Nielsen’s book
• Drop me an email ([email protected]) if you are interested in beta-testing the RAMP tool
• Come to our CIML WS tomorrow
THANK YOU!