Pay no attention to the man behind the curtain… The unseen work behind data science and analytics
Accelerate Data Science conference October 18, 2017 Mark Madsen www.ThirdNature.net @markmadsen
Copyright Third Nature, Inc.
INTRO
The problem we’re (really) trying to solve, current state
The focus is largely on machine learning today
You are here
The craft model of information delivery does not scale
So we shifted to data publishing
Industrialized data delivery for self-service access.
Increased data capture and BI maturity leads to more data-intensive practices, rising complexity
Pareto analysis of the share of buyers who make up 80% of sales volume for products, in this case Coke.
Data source: CMO Council
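A minimal sketch of the Pareto calculation described above, in Python. The purchase volumes are invented for illustration and are not the CMO Council figures, and the function name `share_of_buyers_for_volume` is hypothetical. It counts the heaviest buyers first and reports the smallest share of buyers needed to reach 80% of volume.

```python
def share_of_buyers_for_volume(purchases, target=0.80):
    """Return the smallest fraction of buyers whose combined purchases
    reach `target` of total volume, counting the heaviest buyers first."""
    volumes = sorted(purchases, reverse=True)
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    running = 0.0
    for count, volume in enumerate(volumes, start=1):
        running += volume
        if running >= target * total:
            return count / len(volumes)
    return 1.0


# Hypothetical purchase volume per buyer (illustrative only).
example_purchases = [60, 25, 5, 4, 2, 2, 1, 1, 0, 0]
buyer_share = share_of_buyers_for_volume(example_purchases)
print(f"{buyer_share:.0%} of buyers account for 80% of volume")
```

The number only describes the concentration. It says nothing about why those buyers differ, which is the question the next slide raises.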
What makes these customers different? How does this affect a new product launch, or line extensions?
These are not the type of questions you can answer with only queries and reporting.
Data source: CMO Council
Compounding the problem: observations, not transactions
Event data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
The old problem was access, the new one is analysis
The applied view of data science
Five basic things you can do:
▪Prediction – what is most likely to happen?
▪Estimation – what’s the future value of a variable?
▪Description – what relationships exist in the data?
▪Simulation – what could happen?
▪Prescription – what should you do?
Applying analytics isn’t just putting them on a screen
There are different models of use at machine and human speed:
Decision-Action
▪ Human decision support
▪ Humans moderating machine decisions
▪ Machine decisions
Monitor-Alert
▪ Human monitoring
▪ Machine monitoring
THE NATURE OF THE PROBLEM FOR ORGANIZATIONS
Implementing data science is a problem of multiple perspectives
We don’t have an analytics problem, just like we didn’t have a BI problem
The origin of analytics as “business intelligence” was stated well in 1958:
“…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” ~ H. P. Luhn, “A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf
Our goal is analytics as a capability, not a technology
Three constituencies
Stakeholder, aka the recipient
Analyst, aka the data scientist
Builder, aka the engineer
Starting points
Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem.
Many more start with builders: technology solutions looking for problems, e.g. 55% of the IT-driven Hadoop and Spark projects over the last five years.
The right place to start? Stakeholders. The goal to achieve, the problem to solve.
NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE
Each constituency has its own set of problems to deal with
The myth that still drives analytics – analytic gold
All we need is a fat pipe and pans working in parallel…
Analytic insights that result in no action are expensive trivia.
It’s not the insight, but what you do with it, that matters.
As a manager: what would you do in this situation?
Perennially difficult: What question do you address?
What’s possible?
How do you know what’s feasible and what isn’t? (both technically and financially)
You don’t, unless you know the data science and the business (and even then maybe not; ML makes no guarantees).
It takes domain expertise, analytic expertise and intuition. That’s why you need analysts.
Important questions for managers
1. What is the goal?
2. Is the goal worth achieving?
3. Do you have a clearly stated, measurable goal?
4. Do you have the data required?
If they don’t realize this is important, they complain about analysts asking them a bunch of (obvious*) questions.
*Not really
There are processes you can put in place to find problems to address, prioritize them and determine how to deploy the solutions for them.
Applying analytics is not an analytics problem
Applying analytics is not in the analyst’s control.
It’s not in the engineer’s control.
It’s in the control of the people involved in the process.
Failures are often in execution, not in analytics development.
For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.
NATURE OF THE PROBLEM FROM THE ANALYST’S PERSPECTIVE
The analytics process at a high level
Diagram: Kate Matsudaira
The nature of analytics problems is researching the unknown rather than accessing the known.
Repeat for each new problem
Diagram: Kate Matsudaira
Important: no two analytics projects are entirely alike
Different goals = different data, preparation, algorithm
Different algorithms have different resource consumption profiles and scaling ability.
Each requires its own custom-engineered data features
Starting at the start: Do you have a clearly stated, measurable goal?
The main hurdle: just getting the data
Do you know where to find it? Because it’s unlikely to be in the data warehouse.
Do you have access to it?
Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out.
Do you have the right data?
Many machine learning techniques require labeled (known good) training data:
Supervised learning: a person has to define the correct output for some portion of the data. Data is divided into training sets used for model building and test sets for validating the results.
• What is spam and what isn’t?
• What does a fraudulent transaction look like?
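The training/test division described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a recommended procedure: the messages and their labels are invented, the 25% test fraction is arbitrary, and `split_labeled_data` is a hypothetical helper name.

```python
import random


def split_labeled_data(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical hand-labeled messages: (text, is_spam).
labeled_messages = [
    ("win a free prize now", True),
    ("meeting moved to 3pm", False),
    ("cheap pills online", True),
    ("quarterly report attached", False),
    ("claim your reward today", True),
    ("lunch on friday?", False),
    ("you have been selected", True),
    ("notes from the review", False),
]
training_set, test_set = split_labeled_data(labeled_messages)
print(len(training_set), "training examples,", len(test_set), "test examples")
```

The expensive part is not the split. It is the labels in the second column, which a person has to supply.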
Do you have enough of the right data?
ML needs a lot; you may be disappointed in your own efforts
Define the business problem
Translate the problem into an analytic context
Select appropriate data
Learn the data
Create a model set
Fix problems with data
Transform data
Build models
Assess models
Deploy models
Assess results
Source: Michael Berry, Data Miners Inc.
What does an expert analyst really do?
What does an expert analyst do?
You can’t model data for this in advance.
Where do analysts spend their time? Mostly on data work
Define the business problem
Translate the problem into an analytic context
Select appropriate data
Learn the data
Create a model set
Fix problems with data
Transform data
Build models
Assess models
Deploy models
Assess results
% of time spent
70% 30%
Source: Michael Berry, Data Miners Inc.
Feature engineering is the core of the process
Lots of data (as attributes) makes things harder
Lots of data (instances) makes things slow
Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are.
Cleaning up data, choosing attributes, deriving features is not a technical problem as much as a creative one.
The best way to enable data scientists is to remove data management obstacles.
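As a small illustration of constructing features from raw data, the Python sketch below turns a raw event log into one row of derived attributes per customer. The event layout, the three features chosen and the name `build_customer_features` are all assumptions made for this example; which features actually help is the creative part and depends on the problem.

```python
from collections import defaultdict
from datetime import date


def build_customer_features(events, as_of):
    """Turn raw (customer_id, event_date, amount) rows into one feature
    row per customer: purchase count, total spend, days since last purchase."""
    grouped = defaultdict(list)
    for customer_id, event_date, amount in events:
        grouped[customer_id].append((event_date, amount))

    features = {}
    for customer_id, rows in grouped.items():
        last_purchase = max(event_date for event_date, _ in rows)
        features[customer_id] = {
            "purchase_count": len(rows),
            "total_spend": sum(amount for _, amount in rows),
            "days_since_last_purchase": (as_of - last_purchase).days,
        }
    return features


# Hypothetical raw purchase events; the column layout is an assumption.
raw_events = [
    ("c1", date(2017, 9, 1), 20.0),
    ("c1", date(2017, 10, 10), 35.0),
    ("c2", date(2017, 6, 15), 12.5),
]
customer_features = build_customer_features(raw_events, as_of=date(2017, 10, 18))
print(customer_features["c1"])
```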
Where do most of the analytics tools focus?
Define the business problem
Translate the problem into an analytic context
Select appropriate data
Learn the data
Create a model set
Fix problems with data
Transform data
Build models
Assess models
Deploy models
Assess results
Source: Michael Berry, Data Miners Inc.
Where do most of the analytics-as-a-service (aaS) offerings focus?
Define the business problem
Translate the problem into an analytic context
Select appropriate data
Learn the data
Create a model set
Fix problems with data
Transform data
Build models
Assess models
Deploy models
Assess results
Source: Michael Berry, Data Miners Inc.
The analyst’s workspace in BI is relatively spare
The analyst’s workspace needs to be more like a kitchen than like BI vending machines
NATURE OF THE PROBLEM FROM THE BUILDER’S PERSPECTIVE
IT and Ops people want to know “what to build?”
Giant data platform? Self service tools?
Analytics requires different processes and workloads
None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query.
Ad-hoc analysis ≠ Exploratory data analysis ≠ Batch analytics ≠ Real-time analytics
A real analytics production workflow
Hatch, CIKM ’11
Embedding analytics: less voodoo, more engineering
Things engineering and operations worry about
Engineering time and effort
▪ Introduction of new technology, complexity
▪ Integration - deployment of models requires linking different types of environments, creating supportable workflows for the analysts
▪ Ability to develop and deploy at the required speed
Supportability
▪ Automation
▪ The environment requires additional monitoring, other technology and processes, particularly for customer-facing work
▪ Support costs (time and money)
SLAs
▪ Availability – if analytics are tied to production operations, particularly customer-facing ones, this becomes important and difficult because it’s not standard application work
▪ Performance and scalability – have to manage unpredictable workloads, resource conflicts between model development and model execution
The world changes, do the models?
In BI you maintain ETL and schemas; in ML you maintain models.
“Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change.
When you adjust the model you need to know it is better again
▪ Better save the data used to build the model
▪ Better save the model
▪ Baseline and measurements
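A rough Python sketch of the three points above: keep a fingerprint of the training data, keep the model’s parameters, and keep a baseline measurement to compare against later. Everything here is an assumption for illustration: the record layout, the use of accuracy as the metric, the 0.02 tolerance, and the names `fingerprint`, `make_model_record` and `has_decayed`.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the training rows so the exact data
    behind a model can be identified later."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_model_record(name, version, training_rows, parameters, baseline_accuracy):
    """Bundle what is needed to judge later whether a change is an improvement."""
    return {
        "name": name,
        "version": version,
        "training_data_sha256": fingerprint(training_rows),
        "parameters": parameters,
        "baseline_accuracy": baseline_accuracy,
    }


def has_decayed(record, current_accuracy, tolerance=0.02):
    """True when current accuracy falls more than `tolerance` below the baseline."""
    return current_accuracy < record["baseline_accuracy"] - tolerance


# Hypothetical spam model with made-up data, parameters and scores.
training_rows = [["win a free prize now", 1], ["meeting moved to 3pm", 0]]
record = make_model_record(
    "spam-filter", 1, training_rows, {"threshold": 0.5}, baseline_accuracy=0.95
)
print(has_decayed(record, current_accuracy=0.90))
```

Without the stored baseline there is nothing to compare an adjusted model against, so "better again" becomes a matter of opinion.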
You need a system of record for analytics
THREE PERSPECTIVES, ONE SOLUTION?
There are requirements from all constituents. You need to put them together to have a complete picture of what’s needed.
The missing stakeholder
There is another stakeholder: analytics management - the CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist.
The perspective and problems of the person responsible for overseeing the team and its efforts span the organization and multiple projects.
Repeatability
Operational predictability
Reproducibility
Analytics solutions are interdisciplinary
Team composition is best when the skills and backgrounds are mixed.
Domain knowledge is still valuable – ignore the AI and ML hype saying that it’s all math and engineering.
Data management and engineering are a necessary part of much of this work.
Data scientists and engineers work from opposing directions
exploration: help people ask the right questions, frame them, define measurable goals
modeling: define models that run to determine answers or carry out actions
integration: deliver the results / product in production, at scale
applications: build data science models into applications and delivery systems
infrastructure: provide the systems and practices to build and run the desired models
Diagram concept: Paco Nathan
Using a matrix to plan the project team
Image: Paco Nathan
This is a team sport, not a solo act
Image: Paco Nathan
We already know the craft model doesn’t scale. How do we industrialize like we did for BI?
There is an extensive list of requirements to support
Primary requirements needed by constituents S D E
Data catalog and ability to search it for datasets X X
Self-service access to curated data X
Self-service access to uncurated (unknown, new) data X X
Temporary storage for working with data X
Data integration, cleaning, transformation, preparation tools and environment X X
Persistent storage for source data used by production models X X
Persistent storage for training, testing, production data used by models X X
Storage and management of models X X
Deployment, monitoring, decommissioning models X
Lineage, traceability of changes made for data used by models X X
Lineage, traceability for model changes X X X
Managing baseline data / metrics for comparing model performance X X X
Managing ongoing data / metrics for tracking ongoing model performance X X X
S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
Non-answer #1: “Innovation as Procurement”
Software vendors want to sell you one thing: high margin software.
Most assume the data is there and ready to use by their application – just load it.
Most of the work lies in data integration, cleaning and data management.
Embedding analytics in a process adds infrastructure that most organizations don’t have and can’t support. It takes new infrastructure.
Non-answer #2: Best Practices
“78% of high performing companies have a centralized data science team in place in their organization” – follow their lead!
This is called survival bias. Flipping a coin is often as effective as “Do what they did.”
The problem: you have directions to cross a minefield but no map of where to start.
The enterprise focus needs to be on repeatability - where it can be supported
Key focus for the organization: Infrastructure vs Application
Infrastructure enables value, applications deliver value.
Enable applications by pushing the reusable elements down into the platform.
The infrastructure is a hidden combination of technology, process and methods.
Data management is a key element of infrastructure
Multiple contexts of use, differing quality levels
You need to keep the original because, just as in baking, you can’t unmake dough once it’s mixed.
Manage your data (or it will manage you)
Data management is where both analysts and developers are weakest.
Modern engineering practices are where data management is weakest.
You need to bridge the groups and practices in the organization if you want to make this work repeatable.
Conclusion: new stuff eventually becomes old stuff
About the Presenter
Mark Madsen is president of Third Nature, an advisory firm focused on analytics, data and technology strategy.
Mark is an award-winning author, architect and CTO who has received awards for his work from the American Productivity & Quality Center, the Smithsonian Institution and industry associations.
He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Artificial Intelligence and Strata program committees. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is an advisory firm focused on practices and technology in
analytics, information strategy, business intelligence and data management.
Our goal is to help organizations solve problems using data. We offer
education, advisory and research services to support business and IT
organizations. We also provide product-related consulting to software
vendors in the data industry.
We specialize in strategy and architecture, so we look at emerging
technologies and markets, evaluating how technologies are applied to solve
problems rather than simply comparing product features. We fill the gap
between what industry analyst firms cover and what organizations need.