From Raw Data to Deployed Product Fast & Agile with CRISP-DM Michał Łopuszyński AnalyticsConf, Gdańsk, 2016.11.15

Upload: michal-lopuszynski

Post on 16-Apr-2017

Page 1: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

From Raw Data to Deployed Product

Fast & Agile with CRISP-DM

Michał Łopuszyński

AnalyticsConf, Gdańsk, 2016.11.15

Page 2: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

About me

• I work at ICM UW
  Supercomputing centre, weather forecast, virtual library, open science platform, visualization solutions, ...
  Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis, ...

• Our group = Applied Data Analysis Lab
  • Automatic information extraction from PDFs
  • Text-mining in scientific literature
  • Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings)

adalab.icm.edu.pl

Page 3: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Introduction

Page 4: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Product – Components

[Diagram: a Data Product built from five components: Problem, Data, Modelling Process, Metrics, End User Exposition]

Page 6: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

What is CRISP-DM?

Cross Industry Standard Process for Data Mining

Developed in 1996 by big players in data analysis
• SPSS, Teradata, Daimler, OHRA, NCR
• I follow "CRISP-DM 1.0 Step-by-step data mining guide"

Most popular methodology for data-centric projects
• See KDNuggets Polls 2007, 2014
• Runner-up: SEMMA

I find it agile
• Introduces almost no overhead
• Emphasizes adaptive transitions between project phases

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 7: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Business Understanding

Page 8: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Business Understanding

• Determine business objectives
• Assess situation: resources (data!), risks, costs & benefits
• Determine data mining goals: ideally with quantitative success criteria
• Develop project plan: estimate time line, budget, but also tools and techniques

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 9: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Business Understanding

• Difficult!
• Often, you have to enter a new field
• You have to explain data science limitations to non-experts
  • No, performance will not be 100%
  • We need much more data to train an accurate model
  • For tomorrow, it is impossible

[Image: xkcd comic. Source: http://xkcd.com/1425]

Page 10: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Business Understanding – my DOs and DON'Ts

• Have a lot of patience for vaguely defined problems
• Do not waste your time on ill-defined, unrealistic projects
• Learn to concretize or even reduce the scope of the initial idea
  • Data sample
  • Real-life use cases
  • Quantitative success metrics
• Try to talk as much as possible with domain experts

Page 11: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Understanding

Page 12: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Understanding

• Collect initial data
• Describe data: persist results
• Explore data: persist results
• Verify data quality: carefully document problems and issues found!

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 13: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Understanding – Validate Everything

<judgement id="...">
  <date>3013-12-04 00:00:00.0 CET</date>
  <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
  <courtId>15250000</courtId>
  <departmentId>503</departmentId>
  <chairman>Małgorzata ...</chairman>
  <judges>
    <judge>Małgorzata ...</judge>
  </judges>
  ...
</judgement>

<judgement id="...">
  <date>2012-10-01 00:00:00.0 CEST</date>
  <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
  <courtId>15450500</courtId>
  <departmentId>6027</departmentId>
  <judges>
    <judge>Piotr ...</judge>
    <judge>wskazał</judge>
    <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
    ...
  </judges>
</judgement>

Anomalies: the first record is dated in the year 3013; in the second record, fragments of a sentence ("wskazał" = "indicated", "czego wymaga art. 17a ust. 2 ustawy" = "which is required by Art. 17a(2) of the act") appear as judge names.
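
Checks for the two kinds of problems visible in the records above (an implausible date and non-names in the judge list) can be automated. The following is a minimal sketch, not the method used in the talk: the sample document, the invented judge names, the year range and the "2 to 4 capitalised words" name rule are all assumptions that would have to be agreed with a domain expert.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the records above; judge names are invented.
SAMPLE = """
<judgements>
  <judgement id="a">
    <date>3013-12-04 00:00:00.0 CET</date>
    <judges><judge>Anna Kowalska</judge></judges>
  </judgement>
  <judgement id="b">
    <date>2012-10-01 00:00:00.0 CEST</date>
    <judges>
      <judge>Piotr Nowak</judge>
      <judge>wskazał</judge>
      <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
    </judges>
  </judgement>
</judgements>
"""

# Assumed plausibility window for judgment dates.
MIN_YEAR, MAX_YEAR = 1900, 2016


def looks_like_name(text):
    """Assumed rule: 2 to 4 words, each capitalised and purely alphabetic (hyphens allowed)."""
    words = text.split()
    return 2 <= len(words) <= 4 and all(
        w[0].isupper() and w.replace("-", "").isalpha() for w in words
    )


def validate(xml_text):
    """Return (judgement id, field, offending value) for every failed check."""
    problems = []
    for judgement in ET.fromstring(xml_text).iter("judgement"):
        jid = judgement.get("id")
        date = judgement.findtext("date", default="")
        year = date[:4]
        if not (year.isdigit() and MIN_YEAR <= int(year) <= MAX_YEAR):
            problems.append((jid, "date", date))
        for judge in judgement.iter("judge"):
            name = (judge.text or "").strip()
            if not looks_like_name(name):
                problems.append((jid, "judge", name))
    return problems


problems = validate(SAMPLE)
for problem in problems:
    print(problem)
```

Persisting the list of problems together with the rules that produced it keeps the data quality findings documented.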

Page 14: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Understanding – Spot Anomalies

[Figure: histogram of a certain smooth quantity measured using "precise equipment"]

Explanation: the effect of a human interface between the precise equipment and the database
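
One way such a human effect can show up is digit preference: readings typed in by hand cluster on values ending in 0 or 5. Whether this was the anomaly in the histogram above is an assumption; the sketch below only illustrates how the last-digit distribution of integer readings can be checked. The function names, the threshold factor and the synthetic data are hypothetical.

```python
from collections import Counter


def last_digit_shares(values):
    """Share of each last digit (0-9) among integer readings."""
    counts = Counter(abs(v) % 10 for v in values)
    total = len(values)
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


def preferred_digits(values, factor=2.0):
    """Digits occurring more than `factor` times the uniform share of 10%."""
    shares = last_digit_shares(values)
    return sorted(d for d, share in shares.items() if share > factor * 0.1)


# Synthetic readings: one half recorded exactly,
# the other half rounded by hand to the nearest multiple of 5.
exact = list(range(100, 200))
rounded = [5 * round(v / 5) for v in range(100, 200)]
readings = exact + rounded

print(preferred_digits(exact))     # no digit stands out
print(preferred_digits(readings))  # digits over-represented after rounding
```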

Page 16: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Understanding – my DOs and DON'Ts

• Do not trust data quality estimates provided by your customer
  Verify, as far as you can, whether your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
• Understand anomalies, outliers, missing data
• Do not economize on this phase
  • The earlier you discover issues with your data, the better (yes, your data will have issues!)
  • Data understanding leads to domain understanding; it will pay off in the modelling phase
• Investigate what sort of processing was applied to the raw data
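
Two of the properties listed above, completeness and deduplication, are cheap to verify yourself. A minimal sketch, assuming records arrive as a list of dictionaries; the field names and the choice of `id` as the key are hypothetical:

```python
def quality_report(records, key_fields):
    """Count missing values per field and repeated keys in a list of dicts."""
    fields = sorted({field for record in records for field in record})
    missing = {
        field: sum(1 for record in records if record.get(field) in (None, ""))
        for field in fields
    }
    seen, duplicates = set(), 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical records with one empty field, one absent field and one repeated id.
records = [
    {"id": 1, "court": "A", "date": "2012-10-01"},
    {"id": 2, "court": "", "date": "2013-01-15"},
    {"id": 2, "court": "B", "date": "2013-01-15"},
    {"id": 3, "court": "C"},
]

report = quality_report(records, key_fields=["id"])
print(report)
```

Properties such as representativeness or stationarity need domain knowledge and cannot be reduced to counts like these.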

Page 17: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Preparation

Page 18: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Preparation

• Select data
• Clean data
• Construct data: generate derived attributes
• Integrate data: merge information from different sources
• Format data: convert to a format convenient for modelling

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 19: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Preparation

• Tedious!
• Use workflow tools to document, automate & parallelize data prep.
  • Make, Drake
  • Oozie, Azkaban, Luigi, Airflow, ...

[Figure: example workflow dependency graph with the nodes classification-jsonl, data-aux/class-riffle, data-clean/joind-jsonl, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, stat/basic, stat/basic-fp7, stat/collab, metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl]
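
The core idea shared by the workflow tools named above is a dependency graph: every target is built only after the targets it depends on. A minimal sketch of that ordering using Python's standard `graphlib` module (Python 3.9+); the target names are hypothetical and are not the ones from the workflow on the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each target maps to the set of targets it depends on.
DEPENDENCIES = {
    "raw": set(),
    "clean": {"raw"},
    "metadata": {"raw"},
    "joined": {"clean", "metadata"},
    "stats": {"joined"},
}


def run_pipeline(dependencies, build):
    """Call build(target) for every target, dependencies first; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for target in order:
        build(target)
    return order


built = []
order = run_pipeline(DEPENDENCIES, built.append)
print(order)
```

Real workflow tools add what this sketch leaves out: skipping targets that are already up to date, running independent targets in parallel, and logging.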

Page 20: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Preparation

Data understanding and preparation will usually consume half or more of your project time!

[Figure: charts of the share of project time spent on data preparation (axes: percentage of time, percentage of responses). Poll question: "What % of time in your data mining project(s) is spent on data cleaning and preparation?"]

Source: M. A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011)

Source: KDNuggets Poll 2003

Page 21: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Data Preparation – my DOs and DON'Ts

• Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
• Automate this phase as far as possible
  • Use workflow tools to help you with the above
• When merging multiple sources, track provenance of your data
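
Provenance tracking can be as simple as storing, next to every merged value, the name of the source it came from. A minimal sketch under stated assumptions: the source names, the field names and the "first source wins, later sources only fill gaps" policy are hypothetical choices, not something prescribed in the talk.

```python
def merge_with_provenance(sources, key):
    """Merge records from named sources and record where each field value came from."""
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(record[key], {"values": {}, "provenance": {}})
            for field, value in record.items():
                if field == key:
                    continue
                # Assumed policy: the first source wins, later sources only fill gaps.
                if field not in entry["values"]:
                    entry["values"][field] = value
                    entry["provenance"][field] = source_name
    return merged


# Hypothetical sources, listed in order of trust.
sources = {
    "registry": [{"id": 1, "title": "Project A"}],
    "crawl": [
        {"id": 1, "title": "project a", "budget": 100},
        {"id": 2, "title": "Project B"},
    ],
}

merged = merge_with_provenance(sources, key="id")
print(merged[1])
```

When a value later turns out to be wrong, the provenance entry tells you which upstream source to investigate.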

Page 22: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Modelling

Page 23: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Modelling

• Select modelling technique: assumptions, measure of accuracy
• Generate test design
• Build model: feature engineering, optimize model parameters
• Assess model: iterate the above

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 24: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Modelling – Tooling Selection

• Where will your model be deployed?
• Do you need to distribute your computations? (avoid!)
• Should I use a general purpose language?
  Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services
• Should I use a data analysis language?
  Depth = easy data manipulation, latest models and statistical techniques available
• Can I afford a prototype?

[Figure: languages placed on two axes, Breadth (quality of general purpose tooling) and Depth (quality of data analysis tooling): C++, Java, C#, R, Matlab, Mathematica, Python, Scala, Clojure, F#]

Page 25: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Modelling – Resist the Hype

"We have to use X for this project! X is the best software/method/technology ever!"

Hadoop, Spark, Deep Learning, NoSQL, HBase, XGBoost, Cloud

• Be adventurous, but also critical when it comes to technology/method choice! None is a silver bullet for everything!
• ROI, not the hype, should drive your choices!

Page 26: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Modelling – my DOs and DON'Ts

• Develop your model with deployment conditions in mind
• Allocate time for hyperparameter optimization
• Whenever possible, peek inside your model and consult it with a domain expert
  • Assess feature importance
  • Run your model on simulated data
• Be creative with your features (feature engineering)
  • Esp. from textual data or time series you can generate a lot of standard features
• Make a conscious decision about missing data (NAs) and outliers (regression!)
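
Running a model on simulated data means generating data from a relationship you know and checking that the fitted model recovers it. A minimal sketch with a least-squares line standing in for the model; the true coefficients, the noise level and the sample size are arbitrary assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Simulated data with known ground truth (assumed values) and Gaussian noise.
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0
rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"recovered slope={slope:.3f}, intercept={intercept:.3f}")
```

If the pipeline cannot recover a relationship planted in the data, it is unlikely to find a real one.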

Page 27: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Evaluation

Page 28: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Evaluation

• Evaluate results: business success criteria fulfilled?
• Review process
• Determine next steps: to deploy or not to deploy?

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 29: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Evaluation – watch out for overfitting & leakage

• Overfitting & leakage are lethal dangers for every model
• Overfitting = learning too much from data
  • Well-known danger, a lot of techniques to avoid it (cross-validation, regularization, early stopping, ...)
• Data leakage = artificially injecting parts of the solution into the input data
  • Hard to define precisely, best understood by example
    • Using parts of the training set in the test set
    • Time series: mixing past and future
    • Meaningful identifiers
  • Much lower awareness, not many techniques to avoid it
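
Two of the leakage examples above, mixing past and future and reusing training rows in the test set, can be guarded against mechanically. A minimal sketch; the row layout (`id`, `day`, `y`) and the cutoff date are hypothetical.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Train on rows strictly before `cutoff`, test on the rest."""
    train = [row for row in rows if row["day"] < cutoff]
    test = [row for row in rows if row["day"] >= cutoff]
    return train, test


def shared_ids(train, test):
    """Identifiers present in both sets: a sign that test rows leaked into training."""
    return {row["id"] for row in train} & {row["id"] for row in test}


# Hypothetical daily observations for 1-10 January 2016.
rows = [{"id": i, "day": date(2016, 1, 1 + i), "y": i % 2} for i in range(10)]

train, test = chronological_split(rows, cutoff=date(2016, 1, 8))
print(len(train), len(test), shared_ids(train, test))

# A leaky variant for comparison: one test row copied into the training set.
leaky_train = train + [test[0]]
print(shared_ids(leaky_train, test))
```

A random shuffle before splitting would let the model see rows from after the cutoff during training, which is exactly the past/future mixing described above.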

Page 30: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Evaluation – watch out for overfitting & leakage

A good overview of the leakage problem is presented in this paper.

Page 31: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Evaluation – my DOs and DON'Ts

• Work with the performance criteria dictated by your customer's business model
• Assess not only performance, but also practical aspects related to deployment, for example:
  • Training and prediction speed
  • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
• Keep in mind the dreadful modelling dangers: leakage & overfitting
• Consider pre-deployment (à la paper trading) as a part of the evaluation strategy
• Remember the "too good to be true" principle (a useful, but crude filter)

Page 32: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Deployment

Page 33: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Deployment

• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project: collect lessons learned!

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 34: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Deployment – my DOs and DON'Ts

Read this paper for excellent insights!

Page 35: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Summary

[Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]

Page 36: From Raw Data to Deployed Product. Fast & Agile with CRISP-DM

Thank you!

Questions?

@lopusz