Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

Rapid Productionalization of Predictive Models: In-database Modeling with Revolution Analytics on Teradata. Skylar Lyon, Accenture Analytics

Upload: revolution-analytics

Post on 10-Nov-2014


DESCRIPTION

[Presentation by Skylar Lyon at DataWeek 2014, September 17, 2014.] I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in the database. What did I do? I offered my favorite type of solution - quick and dirty. At the outset, I wasn't sure how easy it would be. Nor was I certain what performance gains would be realized. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata. This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?

TRANSCRIPT

Page 1: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

Rapid Productionalization of Predictive Models

In-database Modeling with Revolution Analytics on Teradata

Skylar Lyon

Accenture Analytics

Page 2: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

Copyright © 2014 Accenture. All rights reserved.

• 7 years of experience with focus on big data and predictive analytics - using discrete choice modeling, random forest classification, ensemble modeling, and clustering

• Technology experience includes: Hadoop, Accumulo, PostgreSQL, QGIS, JBoss, Tomcat, R, GeoMesa, and more

• Worked from Army installations across the nation and also had the opportunity to travel twice to Baghdad to deploy solutions downrange.

Skylar Lyon

Accenture Analytics

Introduction

Page 3: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

• New Customer Analytics team for Silicon Valley Internet eCommerce giant

• Data scientists developing predictive models

• Deferred focus on productionalization

• Joined as Big Data Infrastructure and Analytics Lead

Project background and my involvement

How we got here

Page 4: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

• 50+ independent variables, including categorical variables encoded as indicator variables

• Train from a small sample (many thousands of rows) – not a problem in and of itself

• Score across the entire corpus (many hundreds of millions of rows) – slightly more challenging

Binomial logistic regression

Colleague's CRAN R model
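
The setup on this slide (fit a binomial logistic regression on a small sample, then score a much larger table) can be sketched in plain CRAN R. This is a minimal illustration on synthetic data, not the colleague's model: the column names, the simulated relationship, the sample sizes, and the chunked scoring loop are all assumptions made for the example.

```r
# Illustrative only: synthetic data stands in for the real table.
set.seed(42)
n_all <- 100000                      # stand-in for the full corpus
corpus <- data.frame(
  x1 = rnorm(n_all),
  x2 = rnorm(n_all),
  segment = factor(sample(c("a", "b", "c"), n_all, replace = TRUE))
)

# Hypothetical relationship used only to simulate a 0/1 outcome.
lin <- -0.5 + 0.8 * corpus$x1 - 0.4 * corpus$x2 + 0.3 * (corpus$segment == "b")
corpus$y <- rbinom(n_all, size = 1, prob = plogis(lin))

# Train on a small random sample; glm() expands the factor into
# indicator variables automatically.
train_idx <- sample(n_all, 5000)
fit <- glm(y ~ x1 + x2 + segment, data = corpus[train_idx, ],
           family = "binomial")

# Score the whole corpus in chunks so each predict() call handles
# a bounded number of rows.
chunk_size <- 20000
scores <- numeric(n_all)
for (s in seq(1, n_all, by = chunk_size)) {
  rows <- s:min(s + chunk_size - 1, n_all)
  scores[rows] <- predict(fit, newdata = corpus[rows, ], type = "response")
}
```

With client-side R, every row to be scored has to be pulled out of the database first, which is the step that becomes expensive at hundreds of millions of rows.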

Page 5: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

We moved compute to data

We optimized the current productionalization process

[Diagram: the process before and after]

Reduced 5+ hour process to 40 seconds

Page 6: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

5+ hours to 40 seconds: we recommend that this now become the de facto productionalization process

Benchmarking our optimized process

[Chart; axis labels: rows, minutes]

Page 7: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

Before

trainit <- glm(as.formula(specs[[i]]), data = training.data, family='binomial', maxit=iters)
fits <- predict(trainit, newdata=test.data, type='response')

After

trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family='binomial', maxIterations=iters)
fits <- rxPredict(trainit, newdata=test.data, type='response')

Recode CRAN R to Rx R

Optimization process
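
The snippets on this slide reference `specs[[i]]` and `iters` without showing how they are defined. Below is a minimal sketch of how the "before" (CRAN R) loop might look, assuming `specs` is a list of formula strings and `iters` is an iteration cap. The data, the formulas, the `make_data` helper, and the choice to collect one prediction vector per spec are hypothetical. The rx variant is not reproduced here because it requires Revolution R Enterprise.

```r
# Synthetic stand-ins for training.data and test.data (assumptions).
set.seed(1)
make_data <- function(n) {
  d <- data.frame(a = rnorm(n), b = rnorm(n))
  d$y <- rbinom(n, 1, plogis(0.3 + d$a - 0.5 * d$b))
  d
}
training.data <- make_data(2000)
test.data <- make_data(500)

# Hypothetical model specifications, stored as formula strings.
specs <- list("y ~ a", "y ~ a + b")
iters <- 50   # cap on fitting iterations, passed through to glm.control()

# Fit each specification and keep its predicted probabilities.
fits <- vector("list", length(specs))
for (i in seq_along(specs)) {
  trainit <- glm(as.formula(specs[[i]]), data = training.data,
                 family = 'binomial', maxit = iters)
  fits[[i]] <- predict(trainit, newdata = test.data, type = 'response')
}
```

Keeping the specifications as strings means the same list can drive either fitting function, so the recode stays confined to the two calls inside the loop.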

Page 8: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

• Train in-database on much larger set – reduces need to sample

• Nearly “native” R language – decreases deploy time

• Hadoop support – score in multiple data warehouses

Technology is increasing the data science team’s options and opportunities

Additional benefits to new process

Page 9: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

• Technical Considerations

Table of Contents

Appendix

Page 10: Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

• Teradata environment – 4 node, 1700 series appliance server

• Revolution R Enterprise – version 7.1, running R 3.0.2

Environment setup

Technical considerations