RAMP: data challenges with modularization and code submission
Center for Data Science Paris-Saclay, B. Kégl (CNRS)
BALÁZS KÉGL, Université Paris-Saclay / CNRS
RAMP: DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION
LESSONS LEARNED
• A short history of RAMPs
  • motivations, design principles, and the current tool
• Three data challenges
  • anomaly detection in the LHC ATLAS detector
  • classifying and regressing on molecular spectra
  • time series forecasting of El Niño
• What have we learned?
  • number of participants, incentives?
  • open vs closed?
  • blending vs human ingenuity
OUTLINE
Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
Chemistry: EA4041/UPSud
Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
The Paris-Saclay Center for Data Science: Data Science for Scientific Data
250 researchers in 35 laboratories
Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
Visualization: INRIA, LIMSI
Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Domain science: human society, life, brain, earth, universe
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data scientist, Applied scientist, Domain scientist, Data engineer, Software engineer
Center for Data Science Paris-Saclay
datascience-paris-saclay.fr
@SaclayCDS
LIST/CEA
A multi-disciplinary initiative: building interfaces, matching people, and helping them launch projects
345 affiliated researchers, 50 laboratories
CDS: A SET OF INNOVATIVE TOOLS AND PROCESSES TO CONNECT DATA SCIENCE AND DOMAIN SCIENCE COMMUNITIES
Data scientist, Data trainer, Applied scientist, Domain expert, Software engineer, Data engineer
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• interdisciplinary projects
• data challenges
• ultrawalls and interactive visualization
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and TSs
• IT platform for linked data
• Organizers have no direct access to solutions
• Emphasize competition: participants cannot build on each other’s solutions
• No modularization: ideas go unnoticed unless packaged into a top submission
LIMITATIONS OF DATA CHALLENGES
• Challenge with code submission
• Following Nielsen’s three crowdsourcing principles:
  • modularity: pipelines are sliced into workflow elements (modules) that can be tackled independently
  • encourage small contributions: e.g., copy another submission, add features, change the hyperparameters, resubmit
  • rich and well-structured information commons: open and download each other’s code, discuss on Slack
RAMP
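As a rough illustration of the modularity and small-contribution principles, the sketch below splits a pipeline into two independently replaceable workflow elements, a feature extractor and a regressor. The class names, method signatures, and toy data are assumptions made for this example and are not the actual RAMP interface.

```python
from statistics import mean


class FeatureExtractor:
    """Workflow element 1: turns a raw series into a fixed-length feature vector."""

    def transform(self, series):
        # Two hand-made features: the overall mean and the last observed value.
        return [mean(series), series[-1]]


class MeanRegressor:
    """Workflow element 2: a deliberately trivial regressor (predicts the training mean)."""

    def fit(self, X, y):
        self.prediction_ = mean(y)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


class LastValueRegressor(MeanRegressor):
    """A 'small contribution': copy the baseline, change one element, resubmit."""

    def fit(self, X, y):
        return self  # nothing to learn

    def predict(self, X):
        return [x[1] for x in X]  # reuse the 'last value' feature as the prediction


def train_and_predict(extractor, regressor, train_series, y_train, test_series):
    """Glue code: the two elements only communicate through plain lists."""
    X_train = [extractor.transform(s) for s in train_series]
    X_test = [extractor.transform(s) for s in test_series]
    return regressor.fit(X_train, y_train).predict(X_test)


# Invented toy data: each series is followed by its next value.
train = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
targets = [4.0, 5.0]
test = [[3.0, 4.0, 5.0]]

baseline = train_and_predict(FeatureExtractor(), MeanRegressor(), train, targets, test)
variant = train_and_predict(FeatureExtractor(), LastValueRegressor(), train, targets, test)
```

Because the glue code only depends on `transform`, `fit`, and `predict`, a participant can swap in a new regressor without touching the feature extractor, and vice versa.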
• Roughly two formats
  • single-day hackathons with 20-50 participants, open leaderboard, 15-minute timeout
  • 1-3 week course challenges with up to 150 students (but no real limit): closed phase with 1-3 submissions per day, followed by an open phase with a 15-minute timeout
• 500+ users, 3000+ models
RAMP RAPID ANALYTICS AND MODEL PROTOTYPING
frontend
DB
backend
users, submissions, score, problems, workflow, starting kit, crossval
data pipeline
train+test+blend
Three recent RAMPs
ANOMALY DETECTION IN THE LHC ATLAS DETECTOR
reconstruction + simulated anomalies → classifier → anomaly (isSkewed = 1) or correct (isSkewed = 0)?
CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA
chemotherapy drug in elastic pocket → laser spectrometer → molecular spectra
feature extractor 1, feature extractor 2
classifier → drug type
regressor → concentration
FORECASTING EL NIÑO SIX MONTHS AHEAD
… 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 …
→ feature extractor → x (a fixed-length feature vector) → regressor
• We give the full series to the feature extractor
• It could look ahead into the future (even inadvertently)
• We check for lookahead with a randomized test
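One way such a randomized lookahead test could work is sketched below: overwrite every value after time t with random noise, recompute the features, and flag the extractor if the features at t change. The slides do not describe the actual test, so the function names and the `extractor(series, t)` signature are hypothetical.

```python
import random


def has_lookahead(extractor, series, t, n_trials=20, seed=0):
    """Return True if the features computed at time t depend on values after t."""
    rng = random.Random(seed)
    reference = extractor(series, t)
    for _ in range(n_trials):
        perturbed = list(series)
        # Overwrite the future (indices > t) with random noise.
        for i in range(t + 1, len(perturbed)):
            perturbed[i] = rng.uniform(0.0, 1000.0)
        if extractor(perturbed, t) != reference:
            return True
    return False


def honest_extractor(series, t):
    """Uses only the last three values up to and including t."""
    window = series[max(0, t - 2): t + 1]
    return [sum(window) / len(window), series[t]]


def leaky_extractor(series, t):
    """Inadvertently uses the mean of the full series, future included."""
    return [sum(series) / len(series), series[t]]


# The temperature values shown on the slide, used here as a toy series.
temperatures = [300.14, 299.83, 298.76, 299.87, 299.82, 300.15, 300.10, 299.50]
honest_flag = has_lookahead(honest_extractor, temperatures, t=4)
leaky_flag = has_lookahead(leaky_extractor, temperatures, t=4)
```

A test of this kind can only detect lookahead, not prove its absence: an extractor that passes may still use the future in a way the random perturbations happen not to reveal.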
Analyzing the analysis
OPEN PHASE LETS PARTICIPANTS CATCH UP
T-SNE ON TEST PREDICTIONS
starting kit
the crowd
early influencers
inventors
the single-day hackathon ceiling
what you achieved with a well-tuned deep net
the diversity gap
the human blender gap
blending is immune to overfitting
the single-day hackathon floor
• Course RAMPs beat single-day hackathons significantly
  • larger number of students?
  • longer RAMPs?
  • master-level students are better than data science researchers?
  • stronger incentives?
  • a closed phase preceding an open phase (vs. a purely open RAMP) helps create diversity?
• The open phase helps novice participants catch up: the goal of teaching!
• The open phase sometimes also improves the best and the blended scores
• Human blending often beats machine blending
• Human feature engineering easily beats deep learning on some data
WHAT WE LEARNED
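To make "machine blending" concrete, here is a minimal sketch that blends two models by averaging their predictions and compares root-mean-square errors. The talk does not say which blending method RAMP uses, so plain averaging is an assumption chosen for brevity, and the numbers are invented toy data.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def blend(predictions):
    """Average the predictions of several models, point by point."""
    return [sum(column) / len(column) for column in zip(*predictions)]


y_true = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 2.5, 2.5, 3.5]  # errors: +0.5, +0.5, -0.5, -0.5
model_b = [0.5, 2.5, 3.5, 4.0]  # errors: -0.5, +0.5, +0.5,  0.0

blended = blend([model_a, model_b])
rmse_a = rmse(y_true, model_a)
rmse_b = rmse(y_true, model_b)
rmse_blend = rmse(y_true, blended)
```

In this toy case the blend beats both models because their errors partly cancel, which is why diversity among submissions matters. When the models make the same mistakes, averaging brings no such gain.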
• Fast development of analytics solutions
• Teaching support
• Networking
• Support for collaborative team work
THE RAMP TOOL
A prototyping tool for collaborative development of data science workflows
• Open sourcing and packaging for easy deployment
• More RAMPs: stay tuned, and sign up at http://www.ramp.studio if interested
WHAT’S NEXT