Data scientists and analysts are also software engineers
TRANSCRIPT
DATA SCIENTISTS AND ANALYSTS ARE ALSO SOFTWARE ENGINEERS
W. Whipple Neely
Director of Data Science, EA
THIS TALK IS ABOUT …
Moving data science and analytics teams to a software development model.
• The motivation is so that we can create repeatable, verifiable processes.
• It also means that we can bring powerful but “personal” analysis environments (such as R) into producing enterprise-level systems, to create work that typical dashboarding systems cannot achieve.
• In many ways this is a story about one set of teams. It may not apply to all groups, but it has helped ours.
THE TYPICAL VENN DIAGRAM: WHO IS A DATA SCIENTIST
Statistics
Some Version of Domain Expertise
Computer Science
“hacker skills”
Data Science
“What kind of person does all this? What abilities make a data scientist successful? Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser.”
Davenport and Patil, Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, 2012
“Hacker skills” is the wrong term
GOOGLE IMAGE SEARCH: “WHO DATA SCIENTIST VENN DIAGRAM”
WHAT WE DO INSTEAD OF WHO WE ARE
Data Science
Engineering: data engineering, coding discipline, software engineering, style guides, reproducibility, source code control, regression tests
Science: math, stats, computer science, machine learning, probability models, economics, “substantive domain expertise”, vast quantities of common sense
Collaboration: rules of engagement, empathy, communication and listening skills, flexibility, reliability, extreme social skills
THE PROBLEMS
We have a team of data scientists who are experts at probability modeling and machine learning, and a few of them are pretty good at programming in R, Matlab or Python on a laptop. However …
1. Most have no experience of team programming.
2. Many come without experience of creating software that others can use, or that is robust enough to run.
3. Creating an enterprise-level repeatable process can’t be left to the kind of programming that most of us do on our laptops.
4. There is no easy intermediate step between working on a laptop and creating something that works on the enterprise platform.
WHERE WE STARTED
Write R or Python Script -> Run Script Manually -> Update Report
OR
Write R or Python Script -> Run Script Manually -> Update a Static Model Implementation
THE PROBLEMS WITH WHERE WE STARTED
• Code/methods/models got lost.
• Lots of manual work.
• No automated checks for correctness or robustness of models or predictions.
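As an illustration of what an automated check on predictions could look like, here is a minimal sketch in Python. The function names, the [0, 1] range, and the 0.05 tolerance are assumptions made for this example; they are not taken from the system described in this talk.

```python
import math


def check_predictions(predictions, lower=0.0, upper=1.0):
    """Return a list of problems found; an empty list means the checks passed.

    The [lower, upper] range is an assumed sanity bound, e.g. for probabilities.
    """
    problems = []
    if not predictions:
        problems.append("no predictions were produced")
    for i, p in enumerate(predictions):
        if not math.isfinite(p):
            problems.append(f"prediction {i} is not finite: {p}")
        elif not lower <= p <= upper:
            problems.append(f"prediction {i} is outside [{lower}, {upper}]: {p}")
    return problems


def check_against_baseline(predictions, baseline, tolerance=0.05):
    """Regression check: flag predictions that moved more than `tolerance`
    away from a previously stored baseline run."""
    if len(predictions) != len(baseline):
        return [f"expected {len(baseline)} predictions, got {len(predictions)}"]
    return [
        f"prediction {i} drifted from {b} to {p}"
        for i, (p, b) in enumerate(zip(predictions, baseline))
        if abs(p - b) > tolerance
    ]
```

Checks like these can run after every scheduled model run, so that a broken input or a drifting model is reported instead of silently updating a report.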
WE TALKED TO THE TEAMS ABOUT WHAT WAS WRONG
“Our analysts are pretty good at writing scripts and generating reports, but our team needs help with the bookends: scheduling tasks and serving the reports automatically”
– Colleen Chrisco, Director of Analytics, PopCap Games
IN TERMS OF OUR DIAGRAM
Data Science
Engineering: data engineering, coding discipline, software engineering, style guides, reproducibility, source code control, regression tests
Science: math, stats, computer science, machine learning, probability models, economics, “substantive domain expertise”, vast quantities of common sense
Collaboration: rules of engagement, empathy, communication and listening skills, flexibility, reliability, extreme social skills
THIS WAS A LITTLE SCARY FOR SOME OF OUR TEAMS ….
“We’re not programmers.”
“I don’t even know where to start.”
“I’ve never scheduled a job before.”
SO, TO ANSWER THESE CONCERNS WE DID THE FOLLOWING…
Perforce R Server
Script Inputs: csv, DBs, URL, logs, RDS
Script Outputs: csv, DBs, email, doc, pdf, html, shiny, RDS
1. Check in Code: P4V, R-Checkin
2. Submit Job: Schedule file, API, Web
3. Run Script: Reporting, Models, ETLs, Forecasting
R Script
By “we did the following” I really mean that we hired a brilliant computer scientist named Ben Weber who became part of the team. Ben learned the workflows of the team members and created this system for us.
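To make the “submit job, run script” steps concrete, here is a rough Python sketch of a tiny job runner. It is not the system Ben built; the `Job` class, `run_job`, `run_schedule`, and the example commands are all hypothetical names chosen for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Job:
    name: str      # hypothetical label for the job
    command: list  # program and arguments, e.g. ["Rscript", "report.R"]


def run_job(job, timeout_seconds=3600):
    """Run one job and return a small record of what happened.

    A non-zero exit code is recorded in the result rather than raised.
    """
    result = subprocess.run(
        job.command, capture_output=True, text=True, timeout=timeout_seconds
    )
    return {
        "job": job.name,
        "succeeded": result.returncode == 0,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def run_schedule(jobs):
    """Run every job in order and collect one record per job."""
    return [run_job(job) for job in jobs]


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere Python is installed;
    # a real entry might be ["Rscript", "weekly_report.R"] (hypothetical file).
    schedule = [
        Job("hello", [sys.executable, "-c", "print('report updated')"]),
        Job("broken", [sys.executable, "-c", "raise SystemExit(1)"]),
    ]
    for record in run_schedule(schedule):
        print(record["job"], "ok" if record["succeeded"] else "FAILED")
```

The point of the sketch is the shape of the workflow: the script lives in source control, a schedule names what to run, and every run leaves a record that someone other than the author can inspect.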
WHERE IT LANDED US
• We’d automated.
• We’d gotten the “bookends” covered.
• Many analytics teams, including the data science team, are using the system.
As a result …
• Teams started using the technology to improve their work.
• Teams became more efficient: “I no longer have to be a walking dashboard.”
• Astonishingly, these teams now have their routine code in source control.
BUT IT DIDN’T SOLVE EVERYTHING
• We had produced more tools and simplified tasks, but hadn’t really created a culture of being a software-producing organization.
• We had extended the laptop model … a little, by introducing VMs that could run the code.
And giving teams more tools had introduced some issues …
• A proliferation of models/predictions being run without curating the processes.
• People leave, and their work continues to be run automatically … This is not always a bad thing, but it is often not a good thing either.
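One way to start curating automated work is a periodic review of what is scheduled and who owns it. The sketch below is a hypothetical Python illustration, not part of the system described here; the `ScheduledJob` fields and the 180-day review window are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScheduledJob:
    name: str
    owner: str           # person responsible for the job
    last_reviewed: date  # when someone last confirmed the job is still needed


def jobs_needing_review(jobs, active_owners, today, max_age_days=180):
    """Return (job name, reason) pairs for jobs that should be looked at."""
    flagged = []
    for job in jobs:
        if job.owner not in active_owners:
            flagged.append((job.name, "owner has left the team"))
        elif (today - job.last_reviewed).days > max_age_days:
            flagged.append((job.name, "not reviewed recently"))
    return flagged
```

A list like this does not decide whether a job should keep running; it only makes sure someone on the team makes that decision deliberately.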
WHAT WE KNEW WE HAD TO DO NEXT
We needed to make a cultural change from what is essentially “hacking” to engineering.
• So, we did start hiring people with more software engineering skills.
• Introduced a style guide for our R code.
• We started code and project reviews.
• Hired a very non-technical writer to start helping the team produce documentation on our internal Confluence site.
• Started providing training in team programming, engineering, and new languages (Spark, Python).
• Assigned some of the positions on the team to be the software/coding gurus.
WHAT’S NEXT
• Dev/Test/Prod environments.
• Upgrading our toolset to work with RStudio Server and Git.
• Pair programming: a team member with software skills as their primary background programming together with a data scientist who has focused on statistical modeling and machine learning.