TRANSCRIPT

Software Architecture & Predictive Models in R
Harlan D. Harris, PhD
Director, Data Science, EAB
April 2015
Architecture = Choices

Choices
• technologies
• boundaries
• what does what
• who does what

Questions

"high level structure of a software system, the discipline of creating such structures, and the documentation of these structures" - Wikipedia
Is It a Data Product?
Systems Development Life Cycle
Model &/vs. Application
Microservices (Martin Fowler)
Predict Daily? Right Now?
• Batch or on-demand?
• How far in advance do you know what entities you need to predict?
On-demand Prediction Performance
Fit Yearly? Daily? Ongoing?
• Yearly = mostly manual, by experts
• Daily = automatic, with monitoring
• Ongoing = specialized algorithms
Concept/Covariate Drift
How Many Models?
• One? One per product line? One per user?
• How much design / feature engineering?
• At what granularity do you need to apply domain knowledge?
The Data Science Venn Diagram (Drew Conway)
Logging? Monitoring? Error Handling?
• What happens if it breaks?
• What happens (to you) if it’s been broken for a month and you didn’t know?
"There's only one right answer and starting point for a data product: understanding how you will evaluate performance, and building evaluation tools." —Ruslan Belkin, "Everything We Wish We'd Known About Building Data Products"
Case 1: Annotating Hourly
https://github.com/HarlanH/r-server-template
Answers…
• Data product — lots of maintenance
• 100s to 1000s of users
• Small component of several applications
• R has to be in the loop
• Predict hourly (if change), daily (otherwise)
• Fit just quarterly
• One model per product line (~10)
• Logged, monitored, robust error handling
Architecture

[Diagram: the Prediction System reads biz data, model files, and config files, and writes predictions consumed by web services and applications. It emits log files to a log monitoring system, alongside quality monitoring and a web status page. Offline modeling scripts refit models quarterly; predictions refresh on data update, or every 12 h.]
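The "on data update, or every 12 h" schedule in the architecture above can be sketched as a polling loop. This is a minimal illustration, not the talk's code: `data.version` and `score.all.entities` are hypothetical stand-ins for checking whether the business data changed and for rescoring everything.

```r
# Sketch of the prediction loop: rescore when business data changes,
# or when the last scoring pass is older than max.age.hours.
# data.version() and score.all.entities() are hypothetical stand-ins.
run.loop <- function(data.version, score.all.entities,
                     max.age.hours = 12, iterations = 3) {
  last.version <- NULL
  last.scored  <- as.POSIXct(0, origin = "1970-01-01")  # "never"
  runs <- 0L
  for (i in seq_len(iterations)) {
    v <- data.version()
    stale <- difftime(Sys.time(), last.scored, units = "hours") >= max.age.hours
    if (!identical(v, last.version) || stale) {
      score.all.entities()
      last.version <- v
      last.scored  <- Sys.time()
      runs <- runs + 1L
    }
    # Sys.sleep(60)  # a real loop would sleep between polls
  }
  runs
}
```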
Storing Predictions
• entity ID (FK)
• last-predicted timestamp
• expected value
• cumulative probability vector
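A minimal sketch of one such prediction record as an R data frame. The column names and the CSV serialization of the cumulative probability vector are hypothetical, not the talk's actual schema.

```r
# Build one prediction row per entity. The CDF vector is validated
# (non-decreasing, in [0, 1]) and stored as comma-separated text.
make.prediction.row <- function(entity.id, expected, cum.probs) {
  stopifnot(!is.unsorted(cum.probs),             # a CDF must be non-decreasing
            all(cum.probs >= 0 & cum.probs <= 1))
  data.frame(
    entity_id      = entity.id,                  # FK to the business entity
    last_predicted = Sys.time(),                 # when this row was scored
    expected_value = expected,
    cum_prob       = paste(cum.probs, collapse = ","),  # CDF as CSV text
    stringsAsFactors = FALSE
  )
}

row <- make.prediction.row(42L, 3.7, c(0.1, 0.4, 0.8, 1.0))
```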
Imprecise = Trustworthy!
Nitty-Gritty
• Configuration
  • source() model & environment config files
• Logging
  • log4r to Zenoss
• Error Handling
  • tryCatch(), quit(status=1)
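The three bullets above compose into one pattern: source() configuration, log everything, wrap the run in tryCatch() and exit non-zero on failure so monitoring notices. This sketch uses a plain-file logger in place of log4r (which the real system pointed at Zenoss), and hypothetical file names, so it is runnable as-is.

```r
# Plain-file logger standing in for log4r-to-Zenoss.
logfile <- tempfile(fileext = ".log")
log.line <- function(level, msg) {
  cat(format(Sys.time()), level, msg, "\n", file = logfile, append = TRUE)
}

# Configuration: settings live in plain R files pulled in with source().
# (A tiny config is written to a temp file here so the sketch runs.)
cfg <- tempfile(fileext = ".R")
writeLines('model.version <- "1.0"', cfg)
source(cfg)

run <- function(fail = FALSE) {
  tryCatch({
    log.line("INFO", paste("starting run, model", model.version))
    if (fail) stop("simulated failure")
    log.line("INFO", "run complete")
    0L
  }, error = function(e) {
    log.line("ERROR", paste("fatal:", conditionMessage(e)))
    # In the real batch job: quit(save = "no", status = 1), so the
    # monitoring system sees a non-zero exit code.
    1L
  })
}
```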
Web Server for Status
• R has a built-in web server!
• Store state (counts, errors, uptime, etc.) in a global variable
• Web page serves state to admins
• Not performant for serving predictions
library(Rook)  # provides Rhttpd, RhttpdApp, Request, Response

# Rook app: renders the current state (held in globals) as a status page.
TestAppStatus <- function(req.env) {
  req <- Request$new(req.env)
  res <- Response$new()
  res.html.str <- '<HTML><head><title>%s</title></head>
<body><h1>%s %s</h1>
<p>%s...</p>
</body>
</HTML>'
  # app.name, app.version, and foo (the state to display) are globals
  # maintained by the main prediction loop.
  res$write(sprintf(res.html.str,
                    app.name, app.name, app.version,
                    foo))
  res$finish()
}

ws <- Rhttpd$new()
ws$add(RhttpdApp$new(name = 'TestApp', app = TestAppStatus))
ws$start(port = app.port, quiet = TRUE)
info(lgr, paste('web server at:', ws$full_url(1)))
Testing
• Pre-deployment testing
  • testthat for unit testing
  • held-out historical data, test algorithms / params
  • recent production data, test for edge cases
• Data-flow testing
  • working with QA staff
  • change entity data, watch for updated predictions
  • monitor logs
• Post-deployment testing
  • store predictions, then compare with reality
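A sketch of the testthat-style unit tests mentioned above. The scoring function here is a hypothetical stand-in for the real model; the tests check the kinds of invariants (valid probabilities, monotonicity, edge-case inputs) that held-out and recent production data would exercise.

```r
library(testthat)

# Hypothetical scoring function standing in for the real model:
# a logistic curve, so outputs are probabilities, increasing in x.
score <- function(x) plogis(0.5 * x)

test_that("scores are valid probabilities", {
  s <- score(c(-10, 0, 10))
  expect_true(all(s >= 0 & s <= 1))
})

test_that("scores increase with the feature", {
  expect_true(all(diff(score(-5:5)) > 0))
})

# Edge cases of the kind recent production data surfaces:
test_that("degenerate inputs don't crash scoring", {
  expect_equal(length(score(numeric(0))), 0)
  expect_true(is.na(score(NA_real_)))
})
```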
Case 1b: Refitting & Scoring Daily
Jeremy Stanley @ Sailthru
More answers…
• Data product — lots of maintenance
• 1000s of users
• Small component of several applications
• R has to be in the loop
• Predict batch and on demand
• Fit annually
• One model per customer (~150)
• Logged, monitored, robust error handling
Decision Process
• technologies
• boundaries
• what does what
• who does what

Rubric
• System Cost
• Developer Cost
• Performance & Scaling
• Model Flexibility
• Operational Complexity
• Vendor Support & Direction
Sci’Ops API Architecture

[Diagram: a SQL Data Refinery and a Hadoop Data Lake feed ScienceOps modeling scripts, which produce models served to Applications through an API; predictions are logged, and log files flow to a log monitoring system.]