DATA SCIENCE TEAM COLLABORATION
FORGET ABOUT MEETING ME HALFWAY, TAKE ME THE LAST MILE
#OpenDataScienceMeans #AnacondaCON Ian.Stokes-Rees @ijstokes
OGT molecular dynamics simulation
Protein “mouth” opening, 1 µs
CERN computing facility
Geneva, Switzerland
SUCCESS COMES FROM TEAM WORK
http://bit.ly/ac17-collab
IAN: ENGINEER, PHYSICIST, BIOLOGIST?
• Ian Stokes-Rees, @ijstokes
• Product Marketing Manager
• Computational Scientist
• Passionate advocate of Open Data Science
• Educator and evangelist for use of Python and Anaconda
FIRST TASTE OF “BIG DATA” COMPUTING
• 100,000 acoustic tri-phone models
• 100 parameters per model
• 10 million parameters to estimate
• adaptation = real-time adjustment
• computation = tricky!
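As a quick check of the figures on this slide, the parameter count follows directly from the number of models and the size of each one. The variable names are illustrative only and are not taken from the original system.

```python
# Back-of-the-envelope check of the slide's figures.
n_models = 100_000        # acoustic tri-phone models
params_per_model = 100    # parameters per model

total_params = n_models * params_per_model
print(f"{total_params:,} parameters to estimate")
```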
PhD on CERN LHCb COMPUTING TEAM
Distributed computing infrastructure
• 1000s of concurrent users
• 100s of federated computing centers
• no centralized control
• 1M+ servers with software installed
• 20+ year life span
• 20 GB of data per second
• 14 hours per day
• 7 days a week
• 7 months of the year
March 26, 2010
LHCb first physics at 3.5 TeV
HOW DO CERN PHYSICISTS DO THIS?
• Some smart people over there
  • Who brought us the Web, HTTP, and HTML?
• Big Data
  • Multi-PB per year
• Large collaborating teams
  • 1000s of people accessing systems
• Computation critical
  • Or there is no way to make sense of the data
  • And discover new physics
December 2, 2016
LHCb proton-lead collisions
CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate
HOW WOULD YOU DO IT?
Custom hardware: CMS L0 muon trigger ASIC
Giant compute and storage clusters
Wicked fast algorithms written in Fortran and C
Python: the Swiss army knife for computational physics
PYTHON: LINGUA FRANCA FOR DATA SCIENCE
• Human readable
• Easy to learn
• Object oriented
• Cleanly wraps C and Fortran
• Amazing foundation of high quality data science libraries
• Suitable for scripting, algorithms, data processing and applications
THE CALCULUS OF NEWTON AND LEIBNIZ
SOMETIMES ESOTERIC IS OK
http://bit.ly/ac17-collab
HERMITS AND HIGH PRIESTS
NPS, Richard Proenneke 1985
MOLECULAR BIOLOGY: FROM PROTONS TO PROTEINS
• It takes 3-9 months in the wet lab to prepare protein samples
• Once prepared, it takes only a few days to “image” those samples and produce digitized representations
• However, the “images” aren’t yet 3D atomic models
• That takes from weeks to months to complete, sitting behind a computer
• You may know it as protein folding
Nature, 2011 PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker
HOW DO WE ACCELERATE THE TIME TO INSIGHT?
http://bit.ly/ac17-collab
SUCCESS COMES FROM TEAM WORK
http://bit.ly/ac17-collab
WHAT DOES “HALF WAY” LOOK LIKE?
Today’s “good” data science environment:
• Provide high performance computing resources
  • For example, Hadoop infrastructure
• Deploy a wide selection of the most popular analysis software
• Training and documentation
• Technical support
FISH OUT OF WATER
• Why would we take an expert biochemist and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great biochemist?
FISH OUT OF WATER
• Why would we take an expert business analyst and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great business analyst?
SUCCESS COMES FROM TEAM WORK
http://bit.ly/ac17-collab
TAKE ME THE LAST MILE
• DevOps engineer pre-configures scalable computation
  • Laptop to server to cluster
  • DevOps team is a partner, not a service provider
• Software engineer creates and customizes software for the task, project or individual
  • Avoiding generic, static software setups
• Data scientist composes workflow
• Analyst is provided simple high level interface
  • With option to “drill down”
WHAT ABOUT THOSE PROTEINS?
• Normally it takes 10-200 hours of computing time to match a “template” protein fragment to the imaging data
• There are 100k templates (known protein “folds”) to choose from
• “Be stupid” and just try them all – sometimes you’ll be surprised!
• I spent 18 months working with biochemists and IT sys admins across the country to create a sensible parallel & distributed workflow
• 4-40 hours wall clock time to run 2k-20k hour parallel computation
• Real-time updates of results
• Web based interface to access summary and detailed data viz
• Analysis performed in Jupyter Notebook, allowing customization
• File-system based to enable “drill down” and direct access
• 6M hours per year (~700 years), peak parallelism 20k cores
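The “just try them all” idea is an embarrassingly parallel search: every template can be scored independently and the results ranked afterwards. The sketch below illustrates that shape only. `score_template` and `search_all` are hypothetical names, the scoring is a toy stand-in, and a local thread pool takes the place of the distributed cluster the slide describes.

```python
from concurrent.futures import ThreadPoolExecutor


def score_template(template_id: int) -> float:
    """Hypothetical stand-in for matching one template to the imaging data.

    The real match takes 10-200 hours of computing time per template; this
    toy score only exists to make the example runnable.
    """
    return ((template_id * 7919) % 1000) / 1000.0


def search_all(template_ids, max_workers=8):
    """Score every template independently; return (id, score) pairs, best first."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_template, template_ids))
    return sorted(zip(template_ids, scores), key=lambda pair: pair[1], reverse=True)


ranked = search_all(range(1000))
best_id, best_score = ranked[0]
print(f"best template: {best_id} (score {best_score:.3f})")

# Rough arithmetic behind the slide's "6M hours per year (~700 years)".
HOURS_PER_YEAR = 24 * 365
print(f"{6_000_000 / HOURS_PER_YEAR:.0f} years of single-core computing")
```

Because each template is scored in isolation, the same structure can be handed to a process pool or a cluster scheduler without changing the ranking step.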
DATA SCIENCE PATTERN
• How is it done today?
• What is the opportunity for improvement?
• Prototype and evaluate – is it better? Rinse and repeat
• Standardize and automate the workflow/model
• Scale the workflow/model
• Preprocess and distribute the data
• Instrument execution and set quality metrics
• Establish easy access interface
• Create programmatic APIs
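One way to picture the “instrument execution and set quality metrics” step is a small decorator that times each stage of a workflow and checks the total against a budget. This is a minimal sketch under assumed names: `instrumented`, `preprocess`, `model`, `run_workflow` and the one-second budget are all hypothetical and do not come from the talk.

```python
import time
from functools import wraps

timings = {}  # step name -> list of wall-clock durations in seconds


def instrumented(func):
    """Record how long each call to a workflow step takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings.setdefault(func.__name__, []).append(elapsed)
    return wrapper


@instrumented
def preprocess(values):
    """Hypothetical preprocessing step: drop missing values."""
    return [v for v in values if v is not None]


@instrumented
def model(values):
    """Hypothetical model step: a plain average."""
    return sum(values) / len(values)


def run_workflow(raw, max_seconds=1.0):
    """Run the steps in order and check a simple quality metric: a time budget."""
    timings.clear()  # measure this run only
    result = model(preprocess(raw))
    total = sum(sum(durations) for durations in timings.values())
    return {"result": result, "seconds": total, "within_budget": total <= max_seconds}


report = run_workflow([1.0, None, 2.0, 3.0])
print(report)
```

The same wrapper can be applied to any step, so timing data accumulates as a side effect of running the workflow rather than through separate bookkeeping.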
FIN
SUCCESS COMES FROM TEAM WORK
Remember the footnote?
Collaborative cross-functional teams
http://bit.ly/ac17-collab
BREAKING DATA SCIENCE OPEN
ANACONDA & COLLABORATION
http://bit.ly/ac17-collab
STEP 1: ANACONDA
http://continuum.io/downloads
NOTEBOOKS FOR DATA SCIENCE COLLABORATION
Do you understand why notebooks are so popular?
There are many angles to this, but my take:
• Visual record of the data science process
• They tell a story, and support rich hyperlinked prose
• Data can be embedded
• Algorithms or analysis techniques are captured
• Interactive visualizations are inline
• Sharable
• Reproducible*
STEP 2: ANACONDA CLOUD
http://anaconda.org
STEP 2: (MY) ANACONDA CLOUD
http://anaconda.org/ijstokes
STEP 3: ANACONDA ENTERPRISE (TODAY)
STEP 3: ANACONDA ENTERPRISE (COMING SOON)
ANACONDA: GIVING SUPERPOWERS TO THE PEOPLE WHO CHANGE THE WORLD
TEAMS
http://bit.ly/ac17-collab
THANK YOU! QUESTIONS?
Ian Stokes-Rees @ijstokes
http://bit.ly/ac17-collab