center for data science paris-saclay

39
1 Center for Data Science Paris-Saclay DR / CNRS LAL & LRI CNRS & University Paris-Sud BALÁZS KÉGL Pr / UPSud LRI CÉCILE GERMAIN MdC / Telecom ParisTech LTCI ALEXANDRE GRAMFORT MdC / Mines ParisTech CGS AKIN KAZAKÇI Pr / ENSAE Laboratoire de Statistique ARNAK DALALYAN

Upload: trinhkhanh

Post on 01-Jan-2017

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Center for Data Science Paris-Saclay

1

Center for Data ScienceParis-Saclay

DR / CNRS LAL & LRI

CNRS & University Paris-Sud

BALÁZS KÉGL

Pr / UPSud

LRI

CÉCILE GERMAIN

MdC / Telecom ParisTech

LTCI

ALEXANDRE GRAMFORT

MdC / Mines ParisTech

CGS

AKIN KAZAKÇI

Pr / ENSAE

Laboratoire de Statistique

ARNAK DALALYAN

Page 2: Center for Data Science Paris-Saclay

2

Meta

Page 3: Center for Data Science Paris-Saclay

3

I will not talk about science

Page 4: Center for Data Science Paris-Saclay

4

I will talk about management

(of) (data) science

Page 5: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• My eight-year of experience interfacing between high-energy physics and data science

• Our one-year experience of running PSCDS

• Extensive discussions with management scientist and the MI/Mastodons/MaDICS for the last year

5

WHERE DOES IT COME FROM?

Page 6: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay6

UNIVERSITÉ PARIS-SACLAY

19 founding partners

Page 7: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

UNIVERSITÉ PARIS-SACLAY

7

+ horizontal multi-disciplinary and multi-partner initiatives (“lidex”) to create cohesion

Page 8: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay8

Biology & bioinformaticsIBISC/UEvry LRI/UPSudHepatinovCESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/AgroMIAj-MIG/INRALMAS/Centrale

ChemistryEA4041/UPSud

Earth sciencesLATMOS/UVSQ GEOPS/UPSudIPSL/UVSQLSCE/UVSQLMD/Polytechnique

EconomyLM/ENSAE RITM/UPSudLFA/ENSAE

NeuroscienceUNICOG/InsermU1000/InsermNeuroSpin/CEA

Particle physics astrophysics & cosmologyLPP/Polytechnique DMPH/ONERACosmoStat/CEAIAS/UPSudAIM/CEALAL/UPSud

The Paris-Saclay Center for Data ScienceData Science for scientific Data

250 researchers in 35 laboratories

Machine learningLRI/UPSud LTCI/TelecomCMLA/Cachan LS/ENSAELIX/PolytechniqueMIA/AgroCMA/PolytechniqueLSS/SupélecCVN/Centrale LMAS/CentraleDTIM/ONERAIBISC/UEvry

VisualizationINRIALIMSI

Signal processingLTCI/TelecomCMA/PolytechniqueCVN/CentraleLSS/SupélecCMLA/CachanLIMSIDTIM/ONERA

StatisticsLMO/UPSud LS/ENSAELSS/SupélecCMA/PolytechniqueLMAS/CentraleMIA/AgroParisTech

Data sciencestatistics

machine learninginformation retrieval

signal processingdata visualization

databases

Domain sciencehuman society

life brain earth

universe

Tool buildingsoftware engineering

clouds/gridshigh-performance

computingoptimization

Data scientist

Applied scientist

Domain scientist

Data engineer

Software engineer

Center for Data ScienceParis-Saclay

datascience-paris-saclay.fr

@SaclayCDS

Center for Data ScienceParis-Saclay

A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

Page 9: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

DATA SCIENCE

9

Design of automated methods

to analyze massive and complex data

to extract useful information

Page 10: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay10

CENTER FOR DATA SCIENCE=

DATA CENTER

We are focusing on inference:

data knowledge

Interfacing with HPC, cloud, storage, production

Page 11: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay11

PARAMETERS

• 2 years: April 2014 - June 2016, 1.2M€

• +1 year, conditional on evaluation

• Light management

• executive committee of 17 members

• work groups

• management, strategy (around objectives)

• thematic (around scientific themes),

• open to everyone to propose and to participate

Page 12: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

Domain scienceenergy and physical sciences

health and life sciences Earth and environment

economy and society brain

12

THE DATA SCIENCE LANDSCAPEData scientist

Data trainer

Applied scientist

Domain scientistSoftware engineer

Data engineer

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

Tool building software engineering

clouds/grids high-performance

computing optimization

Page 13: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• Manpower

• especially at the interfaces

• industrial brain-drain

• Incentives

• data scientists are not incentivized to work on domain science

• scientists are not incentivized to work on tools

• Access

• no well-developed channels to identify the right experts for a given problem

• Tools

• few tools that can help domain scientists and data scientists to collaborate efficiently

13

CHALLENGES

Page 14: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

TOOLS

14

We are designing and learning to manage tools to

accompany data science projects with different needs

Page 15: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

TOOLS: LANDSCAPE TO ECOSYSTEM

15

Data scientist

Data trainer

Applied scientist

Domain expertSoftware engineer

Data engineer

Tool building Data domains

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

software engineeringclouds/grids

high-performancecomputing

optimization

energy and physical sciences health and life sciences Earth and environment

economy and society brain

• data science bootcamps • IT platform for linked data • annotation tool • SaaS data science platform

Page 16: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay16

POSTDOCS, THESES, SABBATICALS

• Common selection criteria

• scientific quality

• expected results both in domain science and data science

• relevance and feasibility

• (real) scientific data, available at the start of the project

• interdisciplinarity (PIs both from domain and data sciences, different LABEXs)

• community building

• organizing and participating in thematic days, bootcamps, workshops

Page 17: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay17

ENGINEERING AND CODE CONSOLIDATING PROJECTS

• No research

• implementation, maintenance, integration

• 1 year engineering projects

• 3-6 month code consolidating projects

• a postdoc or PhD student drops research during the project, implements his/her research code in a professional software, or integrates it into a toolbox

Page 18: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• A window to open data at Paris-Saclay

• We are not storing or handling existing large data sets

• Rather indexing, linking, and mapping, embedding in the worldwide linked data (RDF) ecosystem

• Storing small data sets of small teams is possible

• Subsets of large sets for prototyping

• Or simply store metadata plus pointer

18

IT PLATFORM FOR LINKED DATA

http://io.datascience-paris-saclay.fr/

Page 19: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

IT PLATFORM FOR LINKED DATA

19

Page 20: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay20

BOOTCAMPS

• Single-day coding sessions

• 20-30 participants

• Goals

• training PhD students, postdocs, engineers, senior researchers for hands-on data science (problem types, tools)

• solving (prototyping) real data science problems

• networking, knowing each other

Page 21: Center for Data Science Paris-Saclay

21

BOOTCAMPS

Page 22: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay22

BOOTCAMPS

Page 23: Center for Data Science Paris-Saclay

23

BOOTCAMPS

Page 24: Center for Data Science Paris-Saclay

24

BOOTCAMPS

Page 25: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• A data challenge is a recently developed unconventional dissemination and communication tool

• a scientific or industrial data producer arrives with a well-defined problem and a corresponding annotated data set

• defines a quantitative goal

• makes the problem and part of the data set (the training set) public on a dedicated site

• data science experts then take the public training data and submit solutions for a test set with hidden annotations

• submissions are evaluated numerically using the quantitative measure

• contestants are listed on a leaderboard

• after a predefined time, typically a couple of months, the final results are revealed and the winners are awarded

25

DATA CHALLENGES

Page 26: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson

• 1785 teams, huge publicity

• significant improvement on baseline

• 18 month preparation, yet partially missing the target

26

DATA CHALLENGES

Page 27: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking in a fair way state-of-the-art techniques on well-defined problems

• finding talented data scientists

• Limitations

• not necessary adapted to solving complex and open-ended data science problems in realistic environments

• emphasizes competition

27

DATA CHALLENGES

Page 28: Center for Data Science Paris-Saclay

28

DESIGN AND INNOVATION STRATEGY WORKSHOPS

Page 29: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• Putting domain scientists, data scientists, and management scientist in the same room

• Getting them understand each other

• Keeping them collectively creative

• The goal: identifying and defining projects

• low-hanging fruits

• breakthrough projects

• long-term vision

29

DESIGN AND INNOVATION STRATEGY WORKSHOPS

Page 30: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

innovative design

=

interaction and joint expansion of concepts

and knowledge

30

DESIGN AND INNOVATION STRATEGY WORKSHOPS

C/K design theory

Page 31: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay31

DESIGN AND INNOVATION STRATEGY WORKSHOPS

[P]$Project$building!Ini3alisa3on$

[K]$Knowledge$sharing$

Workshops$

[C]$IFM?Design$Workshops$

[RUN]!

DKCP process: linearizing C-K dynamics

Page 32: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay32

TAKE HOME MESSAGE NO1

If you are interested in adapting any of these tools in

your project/site, feel free to contact us, we would be

happy to share our experience

Page 33: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

• We need data engineers and trainers to support research

• ideally: 75% research scientists, 25% research engineers

• We are lucky in France that the position exists in public research

• Tasks

• integrating research code into general-purpose professional software (e.g., scikit-learn, Torch)

• providing an interface between computational infrastructure (e.g., clouds) and scientists

• training scientists to use the tools

33

WHAT CNRS CAN DO

Page 34: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay34

WHAT CNRS CAN DO

Incentives

Page 35: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay35

WHAT CNRS CAN DOMost data scientists, as other scientists, are trained and incentivized to do research on highly specialized domains. They search scientific visibility in their international community, which is equally highly specialized, because their carrier advancement is almost entirely based on peer-reviewed publications. Even when they would have the expertise, they have little incentive to venture into the tool builder (data engineer) role since software authorship has little value in their evaluation, and it can only serve them implicitly through the visibility they gain in the community of tool users. By the same token, they have little incentive to venture into domain sciences and to tackle economic or societal challenges. It is possible that a domain science or an industrial project requires new techniques which then can be published in data science venues, but this is not guaranteed at all. It usually takes heavy investment of time and effort to be able to understand domain problems, so excursions into domain sciences are highly risky. Even when such collaborations are established, the data scientist has a strong prior to use his/her favorite methodology which is not necessarily the best solution for a given problem. Finally, the data scientist has little incentive to bring the project to full fruition, and he/she often “runs away” with an abstract data science problem (and solution) extracted from the project. Symmetrically, domain scientist and industrial domain experts have no incentive to advance data science and to develop and publish new techniques, as long as their data science problems get solved. When they venture into tool development, they have little incentive in developing general purpose tools.

Page 36: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay

Affirmative action

36

TAKE HOME MESSAGE NO2

Page 37: Center for Data Science Paris-Saclay

• Overweight (e.g, double count) out-of-domain papers in every evaluation

• Make software tool ownership count

Affirmative action

37

TAKE HOME MESSAGE NO2

Page 38: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay38

TAKE HOME MESSAGE NO3

Co-locality is important.

It would be desirable that data science for scientific

data in France be grouped into 5-6 sites of similar size.

Resources and experience can and should be shared.

Page 39: Center for Data Science Paris-Saclay

Center for Data ScienceParis-Saclay39

THANK YOU!