© 2014 Cisco and/or its affiliates. All rights reserved.
A Predictive Model Factory Picks Up Steam
H2O and Cisco’s Propensity to Buy Factory
Lou Carvalheira
H2O World - Nov 2014
Who we are:
• 20 professionals with advanced degrees in Statistics, Mathematics and Econometrics
• Contributed to more than $3 Billion in additional bookings to Cisco since 2007 (measured with the use of control groups in models deployed by the Marketing and Sales organizations)
• Recipients of the 2008 Gold Award for Analytical Modeling from the National Conference for Database Marketing
What we do:
• Predictive Modeling, Customer Valuation, Forecasting, Optimization
Our Mission:
• Deliver insights that will influence and improve Sales and Marketing initiatives and that are derived through the use of Statistics and Data Mining
Build Statistical Model
Historical data:
• Firmographics
• Past Purchase Behavior
• Contacts (# and types)
• Marketing Interactions
• Cust Sat surveys
• Macroeconomic indicators
• Purchase / Non Purchase
Model outputs:
• SCORE: probability that a company will buy a specific technology in the next quarter
• VALUE: bookings amount that Cisco will likely see IF the company in fact buys the technology
[Timeline diagram. Labels: time axis t with quarters … Q1 Q2 Q3 Q4 (most recent closed quarter); past purchase behavior and marketing interactions; firmographics and contact data; scoring happens here; predicted purchase window]
Latest data:
• Firmographics
• Past Purchase Behavior
• Contacts (# and types)
• Marketing Interactions
• Cust Sat surveys
• Macroeconomic indicators
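The SCORE and VALUE outputs above can be combined in several ways. The slides do not say how the two are combined downstream, so the following is only a minimal sketch under the assumption that accounts are ranked by probability-weighted bookings (score times value). The `Prediction` class, its field names, and both function names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    company: str   # hypothetical account identifier
    score: float   # stage 1 output: probability of purchase next quarter
    value: float   # stage 2 output: bookings IF the company buys


def expected_bookings(p: Prediction) -> float:
    """Probability-weighted bookings (an assumed combination rule)."""
    if not 0.0 <= p.score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    return p.score * p.value


def rank_by_expected_bookings(preds):
    """Return predictions sorted from highest to lowest expected bookings."""
    return sorted(preds, key=expected_bookings, reverse=True)
```

Under this assumption, a low-probability account with a large potential deal can outrank a near-certain small one, which is one reason to keep the two outputs separate until the final ranking.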
Too many models (60,000!)
From scratch every time
Users: “my region is different”
Cisco is constantly introducing new products and services
2 distinct universes of companies: internal and external (160M)
Models by product, country, company size, and mktg objective
Tech business changes a lot: new patterns arise every time
Companies change: mergers, acquisitions, in & out of business
Improvements in data collection may make more info available
The truth: too few data miners for “artisanal” approach to modeling
[Process diagram. Stages: Massive Data Prep; Model Training (for all potential products); Scoring (when new data is available); Embedding Control Groups (Naïve, Random, Challenger); Deployment (SAS, Data Warehouse, Salesforce, etc.); End of Quarter Results Assessment (Country & Regional). Tools: SAS + SAS Ent. Miner; SAS, Teradata, BO, Tableau]
Challenges:
• Busy SAS environment, shared by other groups (using mostly EG)
• Training and scoring would sometimes take more than 4 weeks!
• Decision Trees only
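The "Embedding Control Groups" and "Results Assessment" stages imply comparing accounts that received model-driven activity against accounts held out from it. The slides do not describe the actual calculation or how the Naïve, Random, and Challenger groups are built, so this is only a minimal sketch assuming a simple difference in mean bookings per account. The function name and arguments are hypothetical.

```python
from statistics import mean


def incremental_bookings(targeted, control):
    """Estimate bookings attributable to acting on the scores.

    targeted: bookings per account that received the model-driven activity
    control:  bookings per comparable account held out from that activity
    Assumes the control group is representative of the targeted accounts.
    """
    if not targeted or not control:
        raise ValueError("both groups must be non-empty")
    lift_per_account = mean(targeted) - mean(control)
    return lift_per_account * len(targeted)
```

The estimate is only as good as the comparability of the two groups, which is presumably why the control groups are embedded before deployment rather than chosen afterwards.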
[Process diagram with H2O. Stages: Massive Data Prep; Model Training; Scoring; Embedding Control Groups; Deployment; Results Assessment]
H2O in small cluster
• 4 nodes running on CentOS
• 24 cores, 128GB memory each
• Using R to control flow of process
Training with
• Tens of millions of observations
• GLM, Random Forest, Gradient Boosting
• Different algorithms used in ensembles and compared
Results in
• 2 days to train and score all models!
• More data, more patterns being identified
• More techniques compared
• More accuracy
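Comparing algorithms and ensembling them, as described above, needs a common metric on the same validation rows. The sketch below is an assumption-laden illustration in plain Python, not the H2O or R code used in the factory: `auc` is a direct pairwise implementation of the area under the ROC curve (fine for small examples, too slow for tens of millions of rows), and `ensemble_mean` assumes the ensemble is a simple average of the models' probabilities. Both function names are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_mean(*score_lists):
    """Average the predicted probabilities of several models row by row."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


# Toy validation set: 1 = purchased, 0 = did not purchase.
labels = [1, 0, 1, 0]
glm_scores = [0.9, 0.4, 0.3, 0.2]
rf_scores = [0.6, 0.5, 0.7, 0.1]
blended = ensemble_mean(glm_scores, rf_scores)
```

On this toy data the GLM scores rank three of the four positive/negative pairs correctly, while the forest and the blend rank all four correctly.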
Before without H2O
[Timeline diagram over Q1 and Q2. Labels: Data Refresh Q1; P2B Training; Scoring models; Data Refresh Q2; Prepare, execute Mktg & Sales activities]
Now with H2O
[Timeline diagram over Q1 and Q2, repeated in each quarter. Labels: Data Refresh; Train & score; Prepare, execute Mktg & Sales activities]
Without H2O:
• Models needed to be prepared in advance, so as not to delay scoring
• More time preparing models, less time left for using the scores in the sales activities
With H2O:
• Newer buying patterns incorporated immediately into models
• Scores are published sooner: more time for planning and executing activities
1) Define environment and main parameters
2) Read training and scoring files
• Reserve subset of training for validation
3) Define list of predictors and target variables
4) Train first stage for each target product (what is the probability of purchase?)
• Train a GLM and evaluate model against validation
• Train a couple of Random Forests using different architectures (fewer, deeper trees vs. more, shallower trees) and evaluate models against validation
• If product has traditionally been hard to predict, then train a GBM and evaluate model against validation
• Use best model (AUC) to score. If more than one has good results, use ensemble to compose probability of purchase
5) Train second stage for each target product (how much will be purchased?)
• Train a GLM and evaluate results on validation set
• Train a GBM and evaluate results on validation set
• Choose the best model to score and predict purchase value
6) Save intermediate results and treat the next target product (step 4)
7) Save final score files and clean things up
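The control flow in steps 2, 4, and 5 can be sketched as follows. This is an illustrative outline in plain Python, not the R script that drives H2O in the factory. All names are hypothetical, the `evaluate` callable stands in for real training and validation, and the `good_enough` tolerance is an assumption, since the slides do not define what counts as a "good result".

```python
import random


def split_validation(rows, fraction=0.2, seed=42):
    """Step 2: reserve a random subset of the training rows for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def stage_one_plan(product, hard_to_predict):
    """Step 4: a GLM and two forests always; a GBM only for hard products."""
    plan = ["glm", "rf_fewer_deeper", "rf_more_shallower"]
    if product in hard_to_predict:
        plan.append("gbm")
    return plan


def choose_scorers(validation_auc, good_enough=0.02):
    """Step 4: keep the best model by AUC plus any model close to it.

    More than one returned name means their probabilities are ensembled.
    The 0.02 tolerance is an assumed threshold, not taken from the slides.
    """
    best = max(validation_auc.values())
    return sorted(n for n, a in validation_auc.items() if best - a <= good_enough)


def choose_value_model(validation_error):
    """Step 5: pick the second-stage model with the lowest validation error."""
    return min(validation_error, key=validation_error.get)


def run_stage_one(product, hard_to_predict, evaluate):
    """Train every planned algorithm and select the scorer(s).

    evaluate(product, algo) must return a validation AUC; it is a
    placeholder for the actual training and validation calls.
    """
    plan = stage_one_plan(product, hard_to_predict)
    aucs = {algo: evaluate(product, algo) for algo in plan}
    return choose_scorers(aucs)
```

Keeping the selection rules as small functions makes it straightforward to loop over every target product (step 6) and to change the tolerance or the list of hard products without touching the training code.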
Improvements
• P2B factory is 15x faster with H2O
• Quicker techniques for simpler problems, deeper for harder ones (grid searches!)
• Ensembles improved accuracy and stability of models significantly
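As a rough consistency check on the 15x figure, the earlier slides give "more than 4 weeks sometimes" for the SAS-based process and 2 days with H2O. A minimal sketch of that arithmetic, assuming 7-day weeks and treating 4 weeks as a lower bound (the function name is hypothetical):

```python
def speedup(before_days, after_days):
    """Ratio of elapsed time before and after the change."""
    if after_days <= 0:
        raise ValueError("after_days must be positive")
    return before_days / after_days


# 4 weeks is a lower bound for the old process, so this is a lower bound too.
lower_bound = speedup(before_days=4 * 7, after_days=2)
```

The lower bound works out to 14x, in line with the reported 15x given that the old process sometimes ran longer than 4 weeks.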
Lessons Learned
• Memory is your friend! Even with few nodes, the speed improvement over traditional data mining tools is substantial
• H2O becomes really powerful and robust when combined with R
• Rely on 0xdata's extremely responsive support
• Anxious to see more data preparation capabilities in H2O
Throughout the last decade, Cisco has increasingly relied on advanced analytics to drive marketing and sales efforts.
The P2B Factory has been a fundamental component of that drive, but it needed to expand, predict new products and services, increase its accuracy, and do it all in less time.
H2O allowed that improvement to happen with its powerful in-memory distributed computing algorithms, great support team, and cost-effective solution.
Thank you.