Building a data science software that works for everybody: Learnings

Upload: dataconomy-media

Post on 08-Jan-2017

Page 1: "What we learned from 5 years of building a data science software that actually works for everybody." Dr. Dennis Proppe, CTO and Chief Data Scientist at GPredictive GmbH

Statistics SaaS solution

Predictive Models

Customer Journey

Retail & eCommerce

One single purpose

Outline

● What did we build?

We wanted to automate predictive model building for everyone, not just statistics nerds.

● Why did we want to build such a thing? Everybody said it would be impossible.

● What did we do wrong at first? Almost everything.

● What did we learn? A lot.

Learning #1

Choose your domain and own it

Learning #1: Choose your domain and own it!

● No free lunch

○ all algorithms are equal when averaged over all problems

● Double down on your niche

○ encode its business & domain specifics into your product

→ otherwise no lunch at all, paid or free.

Product Offering in 2012

● “You name it, we have it”

○ Predictive Pricing

○ Predictive Inventory Management

○ Predictive Lead Scoring

○ Predictive Customer Intelligence

Product Offering in 2013

● “Well, we can’t do everything!”

○ Predictive Pricing

○ Predictive Inventory Management

○ Predictive Lead Scoring

○ Predictive Customer Intelligence

Product Offering in 2014

● “Let’s double down on the CRM thing.”

○ Predictive Pricing

○ Predictive Inventory Management

○ Predictive Lead Scoring

○ Predictive Customer Intelligence

Product Offering in 2015

● “Lead scoring is just not our thing…”

○ Predictive Pricing

○ Predictive Inventory Management

○ Predictive Lead Scoring

○ Predictive Customer Intelligence

Product Offering in 2016

● “We are the retail experts.”

○ Predictive Pricing

○ Predictive Inventory Management

○ Predictive Lead Scoring

○ Predictive Customer Intelligence, but only retail or e-tail with > 20 million € annual revenue

Learning #2

Obsess about the Data

Learning #2: Be obsessed about the data you get

● If you have doubled down on your niche, lay out an exact plan of what data your software accepts, and stick to it.

● If possible, design a fixed schema that is still flexible.

● Programmatically reject all data that does not comply 100% with the schema.

● Because...
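
The rejection rule above could be sketched roughly as follows. This is only an illustration under assumptions: the column names, the `EVENT_SCHEMA` mapping, `SchemaError`, and the all-or-nothing batch policy are hypothetical and not taken from the talk.

```python
from datetime import datetime

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
EVENT_SCHEMA = {
    "timestamp": datetime.fromisoformat,
    "category": str,
    "customer_id": str,
}


class SchemaError(ValueError):
    """Raised when a row does not comply 100% with the schema."""


def validate_row(row):
    # Reject missing or unexpected columns outright instead of guessing.
    if set(row) != set(EVENT_SCHEMA):
        raise SchemaError(f"unexpected columns: {sorted(row)}")
    parsed = {}
    for column, parse in EVENT_SCHEMA.items():
        value = row[column]
        if not isinstance(value, str) or not value.strip():
            raise SchemaError(f"empty or non-string value in {column!r}")
        try:
            parsed[column] = parse(value)
        except ValueError as exc:
            raise SchemaError(f"bad value in {column!r}: {value!r}") from exc
    return parsed


def validate_batch(rows):
    # Assumed policy: one bad row rejects the whole delivery.
    return [validate_row(row) for row in rows]
```

Rejecting the whole delivery is one possible reading of "100% comply"; a softer variant could collect all errors and report them back to whoever supplied the data.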

Learning #2: Practical detail

Wait, how can a schema be fixed and still flexible?

● We use a set of tables that are very narrow and very long.

● We call this the “event” format. Each row carries essentially only three pieces of information:

○ Timestamp (“When did the event happen?”)

○ Category (“Which event happened?”)

○ Who (“Which customer is linked to this event?”)

● Combined with an intelligent feature derivation engine, one can extract a surprising number of very good features from such a table.
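
To make the event format concrete, here is a minimal sketch of deriving per-customer features from such a table. The sample events, the `derive_features` function, and the two feature types (count and days since last event per category) are assumptions for illustration; the talk does not describe how the actual feature derivation engine works.

```python
from collections import defaultdict
from datetime import datetime

# Made-up example events: (timestamp, category, customer) and nothing else.
EVENTS = [
    (datetime(2016, 1, 5), "purchase", "c1"),
    (datetime(2016, 2, 9), "newsletter_click", "c1"),
    (datetime(2016, 3, 1), "purchase", "c1"),
    (datetime(2016, 2, 20), "purchase", "c2"),
]


def derive_features(events, as_of):
    """Per customer: event count and days since last event, per category."""
    counts = defaultdict(lambda: defaultdict(int))
    last_seen = defaultdict(dict)
    for timestamp, category, customer in events:
        if timestamp > as_of:
            continue  # ignore events after the reference date
        counts[customer][category] += 1
        previous = last_seen[customer].get(category)
        if previous is None or timestamp > previous:
            last_seen[customer][category] = timestamp
    features = {}
    for customer, per_category in counts.items():
        row = {}
        for category, n in per_category.items():
            row[f"{category}_count"] = n
            days = (as_of - last_seen[customer][category]).days
            row[f"{category}_days_since_last"] = days
        features[customer] = row
    return features


features = derive_features(EVENTS, as_of=datetime(2016, 3, 31))
```

Because every new event type is just another value in the category column, the table layout never changes while the set of derivable features keeps growing.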

Learning #3

Automate and Test

Learning #3: Automate and Test everything

Doing data science is not for the faint of heart

● Between data acquisition, data preparation, modelling, scoring and evaluation there are hundreds of small, tedious tasks that all have one thing in common:

● One simple error in just one of them renders the complete result wrong.

○ Extract the wrong target column during data extraction: Wrong!

○ Screw up one SQL statement during data prep: Wrong!

○ Mess up one line of code in modelling and scoring: Wrong!

Learning #3: Automate and Test everything

Think like a developer, not like a data wrangler

● Data science features are developed test-driven.

● A test suite tests the complete process chain every night and at every deploy.

● There is no manual process management. The process is in the software.

● Hero-style “save-the-day” mentality is frowned upon.
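
A test of the complete process chain might look something like the sketch below. Everything in it is a toy stand-in: the `prepare`, `train`, and `score` stages, the threshold "model", and the sample rows are hypothetical and only show the idea of asserting on the end-to-end result rather than on single steps.

```python
def prepare(rows):
    # Toy data prep: keep rows with a known target, split features from target.
    clean = [r for r in rows if r["bought"] in (0, 1)]
    return [r["visits"] for r in clean], [r["bought"] for r in clean]


def train(visits, bought):
    # Toy "model": a threshold halfway between the two class means.
    buyers = [v for v, b in zip(visits, bought) if b == 1]
    others = [v for v, b in zip(visits, bought) if b == 0]
    return (sum(buyers) / len(buyers) + sum(others) / len(others)) / 2


def score(threshold, visits):
    return [1 if v > threshold else 0 for v in visits]


def run_chain(rows):
    # The whole process lives in code, so a test can run it unattended.
    visits, bought = prepare(rows)
    threshold = train(visits, bought)
    return score(threshold, visits), bought


def test_full_chain():
    pairs = [(1, 0), (2, 0), (8, 1), (9, 1)]
    rows = [{"visits": v, "bought": b} for v, b in pairs]
    rows.append({"visits": 5, "bought": None})  # broken row must be dropped
    predictions, bought = run_chain(rows)
    assert len(predictions) == 4
    assert predictions == bought  # separable toy data must be scored perfectly


test_full_chain()
```

A function like `test_full_chain` can be picked up by a test runner and scheduled nightly and on every deploy, which is the cadence the slide describes.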

Learning #4

Algorithms are not that important

Learning #4: Algorithms are not that important

● We found that the difference between bootstrapped reruns of the same algorithm is larger than the difference between algorithms.

● So we ditched ensembles and deliver pure, stock Random Forests today.

● Caution: This does not contradict the no-free-lunch theorem. We found this out for our domain after carefully running thousands of experiments.
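
The comparison described above can be illustrated with a small sketch: measure how much a score varies across bootstrap resamples and set that against the gap between two algorithms. This is a simplification under assumptions. The rows are made up, the helper names are hypothetical, and `evaluate` here only re-scores fixed predictions on resampled rows, whereas a full rerun would also retrain the model on each resample.

```python
import random
from statistics import mean, stdev


def bootstrap_scores(evaluate, data, n_runs=200, seed=0):
    """Score `evaluate` on bootstrap resamples (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [evaluate([rng.choice(data) for _ in data]) for _ in range(n_runs)]


def gap_vs_spread(scores_a, scores_b):
    """Return (gap between mean scores, larger within-algorithm std dev)."""
    gap = abs(mean(scores_a) - mean(scores_b))
    spread = max(stdev(scores_a), stdev(scores_b))
    return gap, spread


# Made-up evaluation rows: (algorithm A correct?, algorithm B correct?).
ROWS = [(i < 80, i < 79) for i in range(100)]


def accuracy_a(rows):
    return sum(a for a, _ in rows) / len(rows)


def accuracy_b(rows):
    return sum(b for _, b in rows) / len(rows)


gap, spread = gap_vs_spread(
    bootstrap_scores(accuracy_a, ROWS),
    bootstrap_scores(accuracy_b, ROWS),
)
print(f"gap between algorithms: {gap:.3f}, bootstrap spread: {spread:.3f}")
```

When the spread is larger than the gap, as the talk reports for its domain, switching algorithms buys less than the noise of a single rerun.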

Learning #5

Runtime does matter

Learning #5: Runtime does matter

● The number of iterations matters in data science.

● To win, you need to be able to fail fast, well before the deadline.

● When using software, people really don’t like to wait.

● Speed is a genuine feature!

Learning #5: Runtime does matter

● 2012: We deliver results on the complete data science process in 4 weeks. Awesome!

● 2015: Software now includes data acquisition, down to a week.

● 2016: Our software needs between an hour and 8 hours for the complete process.

● 2017 (Our vision): Each step completes in < 60 seconds. Total runtime: < 10 minutes.
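
One way to hold a pipeline to targets like the 2017 vision is to time every step and flag anything over budget. The sketch below is an assumption about how that might be done; `run_timed`, `over_budget`, and the step names are hypothetical, and only the two budget figures come from the slide.

```python
import time

STEP_BUDGET_SECONDS = 60     # from the 2017 vision: each step under 60 seconds
TOTAL_BUDGET_SECONDS = 600   # total runtime under 10 minutes


def run_timed(steps, data):
    """Run (name, function) steps in order and record wall-clock time per step."""
    timings = {}
    for name, step in steps:
        start = time.perf_counter()
        data = step(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def over_budget(timings):
    """Names of steps over the per-step budget, plus 'TOTAL' if the sum is too."""
    slow = [n for n, seconds in timings.items() if seconds >= STEP_BUDGET_SECONDS]
    if sum(timings.values()) >= TOTAL_BUDGET_SECONDS:
        slow.append("TOTAL")
    return slow


# Hypothetical two-step pipeline standing in for preparation and scoring.
steps = [("prepare", lambda d: [x * 2 for x in d]), ("score", sum)]
result, timings = run_timed(steps, [1, 2, 3])
```

Note that the two budgets together leave room for roughly ten steps: eleven steps of 59 seconds each would pass every per-step check and still miss the 10-minute total.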

Contact details

Dr. Dennis Proppe, Co-Founder & CTO

Gpredictive GmbH • Lilienstr. 11 • 20095 Hamburg

+49 40 209316212

+49 176 84022723

gpredictive.de