Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata
DESCRIPTION
[Presentation by Skylar Lyon at DataWeek 2014, September 17, 2014.] I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in a database. What did I do? I offered my favorite type of solution - quick and dirty. At the outset, I wasn't sure how easy it would be. Nor was I certain of the performance gains we would realize. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata. This presentation outlines my approach to leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?
TRANSCRIPT
Rapid Productionalization of Predictive Models
In-database Modeling with Revolution Analytics on Teradata
Skylar Lyon
Accenture Analytics
Copyright © 2014 Accenture. All rights reserved. 2
• 7 years of experience with a focus on big data and predictive analytics - using discrete choice modeling, random forest classification, ensemble modeling, and clustering
• Technology experience includes: Hadoop, Accumulo, PostgreSQL, qGIS, JBoss, Tomcat, R, GeoMesa, and more
• Worked from Army installations across the nation and also had the opportunity to travel twice to Baghdad to deploy solutions downrange.
Skylar Lyon
Accenture Analytics
Introduction
• New Customer Analytics team for Silicon Valley Internet eCommerce giant
• Data scientists developing predictive models
• Deferred focus on productionalization
• Joined as Big Data Infrastructure and Analytics Lead
Project background and my involvement
How we got here
• 50+ independent variables, including categoricals expanded to indicator variables
• Train from small sample (many thousands) – not a problem in and of itself
• Scoring across entire corpus (many hundred millions) – slightly more challenging
Binomial logistic regression
Colleague's CRAN R model
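The starting point described above - a binomial logistic regression trained on a sample of a few thousand rows, then scored against a much larger set - can be sketched in plain CRAN R. The formula, column names, and data frames below are hypothetical stand-ins for the colleague's actual specification:

```r
# Sketch of the original CRAN R workflow (hypothetical data and spec).
set.seed(42)
n <- 5000
training.data <- data.frame(
  y  = rbinom(n, 1, 0.3),
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical -> indicator variables
)

spec <- "y ~ x1 + x2"

# Train on the small sample
trainit <- glm(as.formula(spec), data = training.data,
               family = "binomial", maxit = 25)

# Score: predicted probabilities for new rows
test.data <- training.data[sample(n, 100), ]
fits <- predict(trainit, newdata = test.data, type = "response")
```

Training on a few thousand rows is easy; the pain point is that `predict` has to run over the entire corpus, which in this project meant pulling 400+ million rows out of the database and through a single R session.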
We moved compute to data
We optimized the current productionalization process
Before After
Reduced 5+ hour process to 40 seconds
5+ hours to 40 seconds: the recommendation is that this now become the de facto productionalization process
Benchmarking our optimized process
[Chart: scoring time in minutes vs. number of rows, before and after optimization]
Before:
trainit <- glm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxit = iters)
fits <- predict(trainit, newdata = test.data, type = 'response')

After:
trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxIterations = iters)
fits <- rxPredict(trainit, newdata = test.data, type = 'response')
Recode CRAN R to Rx R
Optimization process
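Beyond the one-for-one function swap (glm to rxGlm, predict to rxPredict), moving the compute to the data means pointing RevoScaleR at the Teradata appliance via a compute context and in-database data sources. The sketch below shows the general shape; the connection string, share directories, and table names are hypothetical placeholders, and exact RxInTeradata arguments vary by Revolution R Enterprise version:

```r
library(RevoScaleR)

# Hypothetical connection details - replace with your appliance's values
tdConnString <- "DRIVER=Teradata;DBCNAME=tdhost;UID=user;PWD=pw"

# In-database compute context: the model runs on the Teradata nodes
tdCompute <- RxInTeradata(
  connectionString = tdConnString,
  shareDir         = "/tmp/revoShare",   # local scratch directory
  remoteShareDir   = "/tmp/revoShare",   # scratch directory on the appliance
  wait             = TRUE
)
rxSetComputeContext(tdCompute)

# Data stays in the warehouse; only the model specification travels
corpus  <- RxTeradata(table = "all_customers", connectionString = tdConnString)
trainit <- rxGlm(y ~ x1 + x2, data = corpus, family = 'binomial')

# Write scores back to a table rather than pulling rows into R
rxPredict(trainit, data = corpus,
          outData = RxTeradata(table = "customer_scores",
                               connectionString = tdConnString))
```

The key design point is that scoring 400+ million rows never leaves the appliance, which is where the 5+ hours to 40 seconds reduction comes from.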
• Train in-database on much larger set – reduces need to sample
• Nearly “native” R language – decreases deploy time
• Hadoop support – score in multiple data warehouses
Technology is increasing data science team’s options and opportunities
Additional benefits to new process
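The Hadoop support mentioned above follows the same pattern: because RevoScaleR separates the script from the compute context, the same rxGlm/rxPredict code can be retargeted at a Hadoop cluster by swapping contexts. A minimal sketch, with hypothetical host names and share paths:

```r
library(RevoScaleR)

# Hypothetical Hadoop settings - the modeling code itself is unchanged
hadoopCompute <- RxHadoopMR(
  sshUsername  = "analyst",
  sshHostname  = "hadoop-edge",              # edge node of the cluster
  shareDir     = "/var/RevoShare/analyst",   # local share directory
  hdfsShareDir = "/user/RevoShare/analyst"   # HDFS share directory
)
rxSetComputeContext(hadoopCompute)
# ... same rxGlm / rxPredict calls as in the Teradata version ...
```

This is what lets one data science team score in multiple data warehouses without maintaining parallel codebases.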
• Technical Considerations
Table of Contents
Appendix
• Teradata environment – 4-node, 1700-series appliance
• Revolution R Enterprise – version 7.1, running R 3.0.2
Environment setup
Technical considerations