"what we learned from 5 years of building a data science software that actually works for...
Building a data science software that works for everybody: Learnings
Statistics SaaS solution
Predictive Models
Customer Journey
Retail & eCommerce
One single purpose
Outline
● What did we build?
We wanted to automate predictive model building for everyone, not just statistics nerds.
● Why do we want to build such a thing? Everybody said it would be impossible.
● What did we do wrong at first? Almost everything.
● What did we learn? A lot.
Learning #1
Choose your domain and own it
Learning #1: Choose your domain and own it!
● No free lunch
○ averaged over all possible problems, all algorithms perform equally well
● Double down on your niche
○ encode its business & domain specifics into your product
→ otherwise no lunch at all, paid or free.
Product Offering in 2012
● “You name it, we have it”
○ Predictive Pricing
○ Predictive Inventory Management
○ Predictive Lead Scoring
○ Predictive Customer Intelligence
Product Offering in 2013
● “Well, we can’t do everything!”
○ Predictive Pricing
○ Predictive Inventory Management
○ Predictive Lead Scoring
○ Predictive Customer Intelligence
Product Offering in 2014
● “Let’s double down on the CRM thing.”
○ Predictive Pricing
○ Predictive Inventory Management
○ Predictive Lead Scoring
○ Predictive Customer Intelligence
Product Offering in 2015
● “Lead scoring is just not our thing…”
○ Predictive Pricing
○ Predictive Inventory Management
○ Predictive Lead Scoring
○ Predictive Customer Intelligence
Product Offering in 2016
● “We are the retail experts.”
○ Predictive Pricing
○ Predictive Inventory Management
○ Predictive Lead Scoring
○ Predictive Customer Intelligence, but only for retail or e-tail with > 20 million € annual revenue
Learning #2
Obsess about the Data
Learning #2: Be obsessed about the data you get
● Once you have doubled down on your niche, lay out an exact plan of what data your software accepts, and stick to it.
● If possible, design a fixed schema that is still flexible.
● Programmatically reject any data that does not comply 100% with the schema.
● Because...
Learning #2: Practice Detail
Wait, how can a schema be fixed and still flexible?
● We use a set of tables that are very narrow and very long.
● We call this the “event” format. Each row carries essentially three pieces of information:
○ Timestamp (“When did the event happen?”)
○ Category (“Which event happened?”)
○ Who (“Which customer is linked to this event?”)
● Combined with an intelligent feature derivation engine, a surprising number of very good features can be extracted from such a table.
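A minimal sketch of the event format described above, assuming Python and plain dictionaries as rows. The column names, the list of allowed categories, and both function names are hypothetical; the talk does not show the actual schema or the feature derivation engine. The sketch rejects any row that does not match the three-column layout and then derives a few simple per-customer features.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event categories; the talk does not list the real ones.
ALLOWED_CATEGORIES = {"order", "newsletter_open", "return"}
EXPECTED_COLUMNS = {"timestamp", "category", "customer_id"}


def validate_event(row):
    """Return (timestamp, category, customer_id) or raise ValueError."""
    if set(row) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {sorted(row)}")
    try:
        timestamp = datetime.fromisoformat(row["timestamp"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"bad timestamp: {row['timestamp']!r}") from exc
    category = row["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    customer_id = row["customer_id"]
    if not isinstance(customer_id, str) or not customer_id:
        raise ValueError("customer_id must be a non-empty string")
    return timestamp, category, customer_id


def derive_features(rows, customer_id, as_of):
    """Count events per category and measure recency for one customer."""
    # A single non-compliant row rejects the whole load.
    events = [validate_event(row) for row in rows]
    mine = [(ts, cat) for ts, cat, who in events
            if who == customer_id and ts <= as_of]
    counts = Counter(cat for _, cat in mine)
    features = {f"n_{cat}": counts.get(cat, 0)
                for cat in sorted(ALLOWED_CATEGORIES)}
    last = max((ts for ts, _ in mine), default=None)
    features["days_since_last_event"] = (as_of - last).days if last else None
    return features


rows = [
    {"timestamp": "2016-01-05T10:00:00", "category": "order", "customer_id": "c1"},
    {"timestamp": "2016-02-01T09:30:00", "category": "newsletter_open", "customer_id": "c1"},
    {"timestamp": "2016-02-03T12:00:00", "category": "order", "customer_id": "c2"},
]
features = derive_features(rows, "c1", as_of=datetime(2016, 3, 1))
```

Because every source is squeezed into the same three columns, new event types only extend the category list; the table layout and the derivation code stay unchanged.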
Learning #3
Automate and Test
Learning #3: Automate and Test everything
Doing Data Science is not for the faint of heart
● Between data acquisition, data preparation, modelling, scoring and evaluation there are hundreds of small, tedious tasks that all have one thing in common:
● A single error in just one of them makes the complete result wrong.
○ Extract the wrong target column during data extraction: Wrong!
○ Screw up one SQL statement during data prep: Wrong!
○ Mess up one line of code in modelling and scoring: Wrong!
Think like a developer, not like a data wrangler
● Data Science features are developed test-driven.
● A test suite tests the complete process chain every night and at every deploy.
● There is no manual process management. The process is in the software.
● Hero-style “save-the-day” mentality is frowned upon.
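One way to picture a test of the complete process chain is an end-to-end check on synthetic data with a planted rule. The sketch below is an assumption for illustration, not the actual test suite: the pipeline steps, the toy threshold "model", and all names are hypothetical. It shows the idea that a mistake in any step (such as extracting the wrong target column) makes a test fail instead of silently producing a wrong result.

```python
def extract(records, target_column):
    """Split raw records into feature rows and a list of targets."""
    features, targets = [], []
    for record in records:
        if target_column not in record:
            raise KeyError(f"target column {target_column!r} missing")
        targets.append(record[target_column])
        features.append({k: v for k, v in record.items() if k != target_column})
    return features, targets


def fit_threshold(features, targets, column):
    """Toy stand-in for a model: the cut-off on one column with fewest errors."""
    def errors(cut):
        return sum((row[column] >= cut) != bool(target)
                   for row, target in zip(features, targets))
    return min(sorted({row[column] for row in features}), key=errors)


def score(features, column, cut):
    return [int(row[column] >= cut) for row in features]


def run_pipeline(records, target_column, column):
    """Extraction, modelling, scoring and evaluation in one call."""
    features, targets = extract(records, target_column)
    cut = fit_threshold(features, targets, column)
    predictions = score(features, column, cut)
    accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
    return cut, accuracy


def test_pipeline_recovers_planted_rule():
    # Synthetic data: a customer is "active" exactly when orders >= 3.
    records = [{"orders": n, "active": int(n >= 3)} for n in range(7)]
    cut, accuracy = run_pipeline(records, target_column="active", column="orders")
    assert cut == 3
    assert accuracy == 1.0


def test_wrong_target_column_fails_loudly():
    records = [{"orders": 1, "active": 0}]
    try:
        run_pipeline(records, target_column="actve", column="orders")
    except KeyError:
        return
    raise AssertionError("a misspelled target column must not pass silently")
```

Functions named `test_*` like these can be collected by a test runner such as pytest and run nightly and on every deploy.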
Learning #4
Algorithms are not that important
Learning #4: Algorithms are not that important
● We found that the difference between bootstrapped reruns of the same algorithm is larger than the difference between algorithms.
● So we ditched ensembles and deliver pure, stock Random Forests today.
● Caution: This does not contradict the no-free-lunch theorem. We found this out for our domain after carefully running thousands of experiments.
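The comparison above can be sketched as two numbers: the spread of scores across bootstrapped reruns of one algorithm, and the spread of the average scores across algorithms. The function name is hypothetical, and the scores below are made-up placeholders purely to show the calculation; they are not results from the talk.

```python
from statistics import mean, stdev


def spread_report(scores_by_algorithm):
    """Return (within-algorithm spread, between-algorithm spread).

    within:  average standard deviation over the reruns of each algorithm
    between: standard deviation of the per-algorithm mean scores
    """
    per_algorithm_std = [stdev(scores) for scores in scores_by_algorithm.values()]
    per_algorithm_mean = [mean(scores) for scores in scores_by_algorithm.values()]
    return mean(per_algorithm_std), stdev(per_algorithm_mean)


# Placeholder scores (e.g. AUC) from four bootstrapped reruns per algorithm.
# These numbers are invented for illustration only.
scores = {
    "random_forest": [0.81, 0.79, 0.83, 0.80],
    "gradient_boosting": [0.82, 0.80, 0.81, 0.79],
    "logistic_regression": [0.80, 0.78, 0.82, 0.80],
}
within, between = spread_report(scores)
reruns_dominate = within > between
```

If `reruns_dominate` holds on real experiments for a given domain, switching algorithms buys less than the noise between reruns, which is the situation the slide describes.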
Learning #5
Runtime does matter
Learning #5: Runtime does matter
● The number of iterations matters in Data Science.
● You need to be able to fail fast, well before the deadline, to win.
● When using software, people really don’t like to wait.
● Speed is a genuine feature!
Learning #5: Runtime does matter
● 2012: We deliver results on the complete data science process in 4 weeks. Awesome!
● 2015: The software now includes data acquisition; turnaround is down to a week.
● 2016: Our software needs between 1 and 8 hours for the complete process.
● 2017 (our vision): Each step completes in < 60 seconds. Total runtime: < 10 minutes.
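A runtime target like the 2017 vision can be checked mechanically. The sketch below is an assumption about how one might do that, not a description of the actual software: it times each pipeline step and reports which ones exceed a per-step budget of 60 seconds and whether the total stays under 10 minutes. The function name, step names, and the injectable `clock` parameter are hypothetical.

```python
import time

STEP_BUDGET_SECONDS = 60     # "each step completes in < 60 seconds"
TOTAL_BUDGET_SECONDS = 600   # "total runtime: < 10 minutes"


def run_with_budget(steps, step_budget=STEP_BUDGET_SECONDS,
                    total_budget=TOTAL_BUDGET_SECONDS,
                    clock=time.perf_counter):
    """Run (name, callable) steps in order and compare runtimes to budgets.

    Returns (timings, steps_over_budget, total_within_budget).
    `clock` is injectable so the check itself can be tested without waiting.
    """
    timings = {}
    for name, step in steps:
        start = clock()
        step()
        timings[name] = clock() - start
    over_budget = [name for name, seconds in timings.items()
                   if seconds >= step_budget]
    total_ok = sum(timings.values()) < total_budget
    return timings, over_budget, total_ok


# Usage with placeholder steps that do nothing.
steps = [("acquire", lambda: None), ("prepare", lambda: None),
         ("model", lambda: None), ("score", lambda: None)]
timings, over_budget, total_ok = run_with_budget(steps)
```

Running such a check in the nightly test suite turns speed into something that can regress visibly, like any other feature.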
Contact details
Gpredictive GmbH • Lilienstr. 11 • 20095 Hamburg
+49 40 209316212
+49 176 84022723 gpredictive.de
Co-Founder & CTO
Dr. Dennis Proppe