center for data science paris-saclay
TRANSCRIPT
![Page 1: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/1.jpg)
1
Center for Data ScienceParis-Saclay
DR / CNRS LAL & LRI
CNRS & University Paris-Sud
BALÁZS KÉGL
Pr / UPSud
LRI
CÉCILE GERMAIN
MdC / Telecom ParisTech
LTCI
ALEXANDRE GRAMFORT
MdC / Mines ParisTech
CGS
AKIN KAZAKÇI
Pr / ENSAE
Laboratoire de Statistique
ARNAK DALALYAN
![Page 2: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/2.jpg)
2
Meta
![Page 3: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/3.jpg)
3
I will not talk about science
![Page 4: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/4.jpg)
4
I will talk about management
(of) (data) science
![Page 5: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/5.jpg)
Center for Data ScienceParis-Saclay
• My eight-year of experience interfacing between high-energy physics and data science
• Our one-year experience of running PSCDS
• Extensive discussions with management scientist and the MI/Mastodons/MaDICS for the last year
5
WHERE DOES IT COME FROM?
![Page 6: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/6.jpg)
Center for Data ScienceParis-Saclay6
UNIVERSITÉ PARIS-SACLAY
19 founding partners
![Page 7: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/7.jpg)
Center for Data ScienceParis-Saclay
UNIVERSITÉ PARIS-SACLAY
7
+ horizontal multi-disciplinary and multi-partner initiatives (“lidex”) to create cohesion
![Page 8: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/8.jpg)
Center for Data ScienceParis-Saclay8
Biology & bioinformaticsIBISC/UEvry LRI/UPSudHepatinovCESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/AgroMIAj-MIG/INRALMAS/Centrale
ChemistryEA4041/UPSud
Earth sciencesLATMOS/UVSQ GEOPS/UPSudIPSL/UVSQLSCE/UVSQLMD/Polytechnique
EconomyLM/ENSAE RITM/UPSudLFA/ENSAE
NeuroscienceUNICOG/InsermU1000/InsermNeuroSpin/CEA
Particle physics astrophysics & cosmologyLPP/Polytechnique DMPH/ONERACosmoStat/CEAIAS/UPSudAIM/CEALAL/UPSud
The Paris-Saclay Center for Data ScienceData Science for scientific Data
250 researchers in 35 laboratories
Machine learningLRI/UPSud LTCI/TelecomCMLA/Cachan LS/ENSAELIX/PolytechniqueMIA/AgroCMA/PolytechniqueLSS/SupélecCVN/Centrale LMAS/CentraleDTIM/ONERAIBISC/UEvry
VisualizationINRIALIMSI
Signal processingLTCI/TelecomCMA/PolytechniqueCVN/CentraleLSS/SupélecCMLA/CachanLIMSIDTIM/ONERA
StatisticsLMO/UPSud LS/ENSAELSS/SupélecCMA/PolytechniqueLMAS/CentraleMIA/AgroParisTech
Data sciencestatistics
machine learninginformation retrieval
signal processingdata visualization
databases
Domain sciencehuman society
life brain earth
universe
Tool buildingsoftware engineering
clouds/gridshigh-performance
computingoptimization
Data scientist
Applied scientist
Domain scientist
Data engineer
Software engineer
Center for Data ScienceParis-Saclay
datascience-paris-saclay.fr
@SaclayCDS
Center for Data ScienceParis-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
![Page 9: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/9.jpg)
Center for Data ScienceParis-Saclay
DATA SCIENCE
9
Design of automated methods
to analyze massive and complex data
to extract useful information
![Page 10: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/10.jpg)
Center for Data ScienceParis-Saclay10
CENTER FOR DATA SCIENCE=
DATA CENTER
We are focusing on inference:
data knowledge
Interfacing with HPC, cloud, storage, production
![Page 11: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/11.jpg)
Center for Data ScienceParis-Saclay11
PARAMETERS
• 2 years: April 2014 - June 2016, 1.2M€
• +1 year, conditional on evaluation
• Light management
• executive committee of 17 members
• work groups
• management, strategy (around objectives)
• thematic (around scientific themes),
• open to everyone to propose and to participate
![Page 12: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/12.jpg)
Center for Data ScienceParis-Saclay
Domain scienceenergy and physical sciences
health and life sciences Earth and environment
economy and society brain
12
THE DATA SCIENCE LANDSCAPEData scientist
Data trainer
Applied scientist
Domain scientistSoftware engineer
Data engineer
Data sciencestatistics
machine learning information retrieval
signal processing data visualization
databases
Tool building software engineering
clouds/grids high-performance
computing optimization
![Page 13: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/13.jpg)
Center for Data ScienceParis-Saclay
• Manpower
• especially at the interfaces
• industrial brain-drain
• Incentives
• data scientists are not incentivized to work on domain science
• scientists are not incentivized to work on tools
• Access
• no well-developed channels to identify the right experts for a given problem
• Tools
• few tools that can help domain scientists and data scientists to collaborate efficiently
13
CHALLENGES
![Page 14: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/14.jpg)
Center for Data ScienceParis-Saclay
TOOLS
14
We are designing and learning to manage tools to
accompany data science projects with different needs
![Page 15: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/15.jpg)
Center for Data ScienceParis-Saclay
TOOLS: LANDSCAPE TO ECOSYSTEM
15
Data scientist
Data trainer
Applied scientist
Domain expertSoftware engineer
Data engineer
Tool building Data domains
Data sciencestatistics
machine learning information retrieval
signal processing data visualization
databases
• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges
• coding sprints • Open Software Initiative • code consolidator and engineering projects
software engineeringclouds/grids
high-performancecomputing
optimization
energy and physical sciences health and life sciences Earth and environment
economy and society brain
• data science bootcamps • IT platform for linked data • annotation tool • SaaS data science platform
![Page 16: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/16.jpg)
Center for Data ScienceParis-Saclay16
POSTDOCS, THESES, SABBATICALS
• Common selection criteria
• scientific quality
• expected results both in domain science and data science
• relevance and feasibility
• (real) scientific data, available at the start of the project
• interdisciplinarity (PIs both from domain and data sciences, different LABEXs)
• community building
• organizing and participating in thematic days, bootcamps, workshops
![Page 17: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/17.jpg)
Center for Data ScienceParis-Saclay17
ENGINEERING AND CODE CONSOLIDATING PROJECTS
• No research
• implementation, maintenance, integration
• 1 year engineering projects
• 3-6 month code consolidating projects
• a postdoc or PhD student drops research during the project, implements his/her research code in a professional software, or integrates it into a toolbox
![Page 18: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/18.jpg)
Center for Data ScienceParis-Saclay
• A window to open data at Paris-Saclay
• We are not storing or handling existing large data sets
• Rather indexing, linking, and mapping, embedding in the worldwide linked data (RDF) ecosystem
• Storing small data sets of small teams is possible
• Subsets of large sets for prototyping
• Or simply store metadata plus pointer
18
IT PLATFORM FOR LINKED DATA
http://io.datascience-paris-saclay.fr/
![Page 19: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/19.jpg)
Center for Data ScienceParis-Saclay
IT PLATFORM FOR LINKED DATA
19
![Page 20: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/20.jpg)
Center for Data ScienceParis-Saclay20
BOOTCAMPS
• Single-day coding sessions
• 20-30 participants
• Goals
• training PhD students, postdocs, engineers, senior researchers for hands-on data science (problem types, tools)
• solving (prototyping) real data science problems
• networking, knowing each other
![Page 21: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/21.jpg)
21
BOOTCAMPS
![Page 22: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/22.jpg)
Center for Data ScienceParis-Saclay22
BOOTCAMPS
![Page 23: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/23.jpg)
23
BOOTCAMPS
![Page 24: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/24.jpg)
24
BOOTCAMPS
![Page 25: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/25.jpg)
Center for Data ScienceParis-Saclay
• A data challenge is a recently developed unconventional dissemination and communication tool
• a scientific or industrial data producer arrives with a well-defined problem and a corresponding annotated data set
• defines a quantitative goal
• makes the problem and part of the data set (the training set) public on a dedicated site
• data science experts then take the public training data and submit solutions for a test set with hidden annotations
• submissions are evaluated numerically using the quantitative measure
• contestants are listed on a leaderboard
• after a predefined time, typically a couple of months, the final results are revealed and the winners are awarded
25
DATA CHALLENGES
![Page 26: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/26.jpg)
Center for Data ScienceParis-Saclay
• The HiggsML challenge on Kaggle
• https://www.kaggle.com/c/higgs-boson
• 1785 teams, huge publicity
• significant improvement on baseline
• 18 month preparation, yet partially missing the target
26
DATA CHALLENGES
![Page 27: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/27.jpg)
Center for Data ScienceParis-Saclay
• Challenges are useful for
• generating visibility in the data science community about novel application domains
• benchmarking in a fair way state-of-the-art techniques on well-defined problems
• finding talented data scientists
• Limitations
• not necessary adapted to solving complex and open-ended data science problems in realistic environments
• emphasizes competition
27
DATA CHALLENGES
![Page 28: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/28.jpg)
28
DESIGN AND INNOVATION STRATEGY WORKSHOPS
![Page 29: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/29.jpg)
Center for Data ScienceParis-Saclay
• Putting domain scientists, data scientists, and management scientist in the same room
• Getting them understand each other
• Keeping them collectively creative
• The goal: identifying and defining projects
• low-hanging fruits
• breakthrough projects
• long-term vision
29
DESIGN AND INNOVATION STRATEGY WORKSHOPS
![Page 30: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/30.jpg)
Center for Data ScienceParis-Saclay
innovative design
=
interaction and joint expansion of concepts
and knowledge
30
DESIGN AND INNOVATION STRATEGY WORKSHOPS
C/K design theory
![Page 31: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/31.jpg)
Center for Data ScienceParis-Saclay31
DESIGN AND INNOVATION STRATEGY WORKSHOPS
[P]$Project$building!Ini3alisa3on$
[K]$Knowledge$sharing$
Workshops$
[C]$IFM?Design$Workshops$
[RUN]!
DKCP process: linearizing C-K dynamics
![Page 32: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/32.jpg)
Center for Data ScienceParis-Saclay32
TAKE HOME MESSAGE NO1
If you are interested in adapting any of these tools in
your project/site, feel free to contact us, we would be
happy to share our experience
![Page 33: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/33.jpg)
Center for Data ScienceParis-Saclay
• We need data engineers and trainers to support research
• ideally: 75% research scientists, 25% research engineers
• We are lucky in France that the position exists in public research
• Tasks
• integrating research code into general-purpose professional software (e.g., scikit-learn, Torch)
• providing an interface between computational infrastructure (e.g., clouds) and scientists
• training scientists to use the tools
33
WHAT CNRS CAN DO
![Page 34: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/34.jpg)
Center for Data ScienceParis-Saclay34
WHAT CNRS CAN DO
Incentives
![Page 35: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/35.jpg)
Center for Data ScienceParis-Saclay35
WHAT CNRS CAN DOMost data scientists, as other scientists, are trained and incentivized to do research on highly specialized domains. They search scientific visibility in their international community, which is equally highly specialized, because their carrier advancement is almost entirely based on peer-reviewed publications. Even when they would have the expertise, they have little incentive to venture into the tool builder (data engineer) role since software authorship has little value in their evaluation, and it can only serve them implicitly through the visibility they gain in the community of tool users. By the same token, they have little incentive to venture into domain sciences and to tackle economic or societal challenges. It is possible that a domain science or an industrial project requires new techniques which then can be published in data science venues, but this is not guaranteed at all. It usually takes heavy investment of time and effort to be able to understand domain problems, so excursions into domain sciences are highly risky. Even when such collaborations are established, the data scientist has a strong prior to use his/her favorite methodology which is not necessarily the best solution for a given problem. Finally, the data scientist has little incentive to bring the project to full fruition, and he/she often “runs away” with an abstract data science problem (and solution) extracted from the project. Symmetrically, domain scientist and industrial domain experts have no incentive to advance data science and to develop and publish new techniques, as long as their data science problems get solved. When they venture into tool development, they have little incentive in developing general purpose tools.
![Page 36: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/36.jpg)
Center for Data ScienceParis-Saclay
Affirmative action
36
TAKE HOME MESSAGE NO2
![Page 37: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/37.jpg)
• Overweight (e.g, double count) out-of-domain papers in every evaluation
• Make software tool ownership count
Affirmative action
37
TAKE HOME MESSAGE NO2
![Page 38: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/38.jpg)
Center for Data ScienceParis-Saclay38
TAKE HOME MESSAGE NO3
Co-locality is important.
It would be desirable that data science for scientific
data in France be grouped into 5-6 sites of similar size.
Resources and experience can and should be shared.
![Page 39: Center for Data Science Paris-Saclay](https://reader030.vdocuments.us/reader030/viewer/2022020314/5868e0bf1a28ab78408c2b20/html5/thumbnails/39.jpg)
Center for Data ScienceParis-Saclay39
THANK YOU!