Software Architecture & Predictive Models in R

Harlan D. Harris, PhD Director, Data Science, EAB

April 2015

Architecture = Choices

"high level structure of a software system, the discipline of creating such structures, and the documentation of these structures" - Wikipedia

Choices

• technologies
• boundaries
• what does what
• who does what

Questions

Is It a Data Product?

Systems Development Life Cycle

How Many Users?

Model &/vs. Application

Microservices (Martin Fowler)

R in the Loop?

Fitting / Scoring
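The fitting/scoring split can be sketched in base R as two separate steps that share only a model file. This is a minimal illustration, not the talk's implementation: the function names, the `lm` model, and the use of `mtcars` are all assumptions chosen to keep the example self-contained.

```r
# Fitting step (offline, run rarely): train a model and persist it to disk.
# The formula and data set are placeholders for a real modeling script.
fit_and_save <- function(train, path) {
  model <- lm(mpg ~ wt + hp, data = train)
  saveRDS(model, path)
  invisible(path)
}

# Scoring step (production): load the stored model and predict for new rows.
score <- function(newdata, path) {
  model <- readRDS(path)
  as.numeric(predict(model, newdata = newdata))
}

model_path <- tempfile(fileext = ".rds")
fit_and_save(mtcars, model_path)
preds <- score(head(mtcars, 3), model_path)
```

Keeping the two steps apart means the scoring process only needs the model file, so the two can run on different schedules.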

Predict Daily? Right Now?

• Batch or On-demand?
• How far in advance do you know what entities you need to predict?

On-demand Prediction Performance

Fit Yearly? Daily? Ongoing?

• Yearly = mostly-manual, by experts
• Daily = automatic, with monitoring
• Ongoing = specialized algorithms

Concept/Covariate Drift
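One simple way to watch for covariate drift is to compare a feature's distribution at training time with its recent production distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test from base R. The function name, threshold, and simulated data are assumptions for illustration. It does not detect concept drift, which needs observed outcomes.

```r
# Flag covariate drift for one numeric feature with a two-sample KS test.
# alpha is an illustrative threshold, not a recommendation.
covariate_drift <- function(train_x, recent_x, alpha = 0.01) {
  # ks.test() warns when the data contain ties; suppressed for this sketch.
  test <- suppressWarnings(ks.test(train_x, recent_x))
  list(statistic = unname(test$statistic),
       p_value   = test$p.value,
       drifted   = test$p.value < alpha)
}

set.seed(42)
train_x   <- rnorm(500, mean = 0)  # feature values seen at fit time
shifted_x <- rnorm(500, mean = 1)  # simulated production values, mean shifted

drift <- covariate_drift(train_x, shifted_x)
drift$drifted
```

In a daily-refit setting, a check like this could run per feature and write its result to the log that the monitoring system watches.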

How Many Models?

• One?
• One per product line?
• One per user?
• How much design / feature engineering?
• At what granularity do you need to apply domain knowledge?
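The "one model per product line" option can be sketched by splitting the data on a grouping column and fitting the same formula to each piece. The helper names, the `lm` model, and the use of `iris` (with `Species` standing in for product line) are assumptions for illustration only.

```r
# Fit one model per group (e.g. product line) using a shared formula.
fit_per_group <- function(data, group_col, formula) {
  pieces <- split(data, data[[group_col]])
  lapply(pieces, function(d) lm(formula, data = d))
}

# Score each row with the model that matches its group.
predict_per_group <- function(models, newdata, group_col) {
  groups <- as.character(newdata[[group_col]])
  preds <- numeric(nrow(newdata))
  for (g in unique(groups)) {
    rows <- groups == g
    preds[rows] <- predict(models[[g]], newdata = newdata[rows, , drop = FALSE])
  }
  preds
}

models <- fit_per_group(iris, "Species", Sepal.Length ~ Petal.Length)
preds  <- predict_per_group(models, iris[c(1, 51, 101), ], "Species")
```

A shared formula keeps feature engineering in one place. Per-group formulas would allow more domain knowledge per model at the cost of more maintenance.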

The Data Science Venn Diagram (Drew Conway)

Logging? Monitoring? Error Handling?

• What happens if it breaks?
• What happens (to you) if it’s been broken for a month and you didn’t know?

"There's only one right answer and starting point for a data product: Understanding how will you evaluate performance, and building evaluation tools." - Ruslan Belkin, Everything We Wish We'd Known About Building Data Products

Case 1: Annotating Hourly

https://github.com/HarlanH/r-server-template

Answers…

• Data product: lots of maintenance
• 100s to 1000s of users
• Small component of several applications
• R has to be in the loop
• Predict hourly (if change), daily (otherwise)
• Fit just quarterly
• One model per product line (~10)
• Logged, monitored, robust error handling

Architecture

[Diagram components: biz data, predictions, web services, applications, model files, config files, log files, log monitoring system, quality monitoring, Prediction System, offline modeling scripts, web status. Timing labels: quarterly; on data update, or every 12 h]
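The "on data update, or every 12 h" trigger can be expressed as a small decision function. This is a sketch under assumptions: the function name, argument names, and the idea of tracking per-entity timestamps are illustrative and not taken from the system described here.

```r
# Decide whether an entity needs a fresh prediction: either its data changed
# since the last prediction, or the last prediction is older than max_age_hours.
needs_rescoring <- function(last_predicted, last_data_update,
                            now = Sys.time(), max_age_hours = 12) {
  if (is.na(last_predicted)) return(TRUE)  # never predicted before
  data_changed <- !is.na(last_data_update) && last_data_update > last_predicted
  age_hours <- as.numeric(difftime(now, last_predicted, units = "hours"))
  data_changed || age_hours >= max_age_hours
}

now <- as.POSIXct("2015-04-01 12:00:00", tz = "UTC")
fresh   <- needs_rescoring(now - 3600, now - 7200, now)            # no change, 1 h old
changed <- needs_rescoring(now - 3600, now - 1800, now)            # data updated since
stale   <- needs_rescoring(now - 13 * 3600, now - 20 * 3600, now)  # 13 h old
```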

Storing Predictions

• entity ID (FK)
• last-predicted timestamp
• expected value
• cumulative probability vector

Imprecise = Trustworthy!
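The stored fields could be assembled as one row per entity, as sketched below. The code assumes the model outputs a probability for each of a set of discrete outcome values. The function name, column names, and the comma-separated encoding of the cumulative probability vector are all illustrative assumptions.

```r
# Build one storable row for an entity from per-outcome probabilities.
# probs[i] is the predicted probability of outcome values[i].
make_prediction_record <- function(entity_id, probs, values,
                                   predicted_at = Sys.time()) {
  stopifnot(length(probs) == length(values), abs(sum(probs) - 1) < 1e-8)
  data.frame(
    entity_id      = entity_id,                # foreign key to the entity
    last_predicted = predicted_at,             # last-predicted timestamp
    expected_value = sum(probs * values),      # point estimate
    cum_probs      = paste(round(cumsum(probs), 4), collapse = ","),
    stringsAsFactors = FALSE
  )
}

rec <- make_prediction_record(1001L,
                              probs  = c(0.2, 0.5, 0.3),
                              values = c(0, 1, 2))
```

Storing the cumulative vector alongside the expected value lets an application show a range rather than a single number, which fits the "Imprecise = Trustworthy!" point.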

Nitty-Gritty

• Configuration
  • source() model & environment config files
• Logging
  • log4r to Zenoss
• Error Handling
  • tryCatch(), quit(status=1)
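The configuration and error-handling pattern might look roughly like the sketch below. It uses only base R, so the logging function is a plain callback rather than log4r. The helper names, the config values, and the injectable `on_fatal` hook (added so the example can run without ending the R session) are assumptions.

```r
# Load a config file written as plain R assignments, via source().
load_config <- function(path) {
  env <- new.env()
  source(path, local = env)
  as.list(env)
}

# Run a step; on error, log it and exit with a non-zero status so that an
# external scheduler or monitor can notice the failure.
run_step <- function(step, log_fn = message,
                     on_fatal = function() quit(status = 1)) {
  tryCatch(
    step(),
    error = function(e) {
      log_fn(paste("FATAL:", conditionMessage(e)))
      on_fatal()
    }
  )
}

# Hypothetical config file with illustrative values.
config_path <- tempfile(fileext = ".R")
writeLines(c('app.name <- "TestApp"', 'app.port <- 8080'), config_path)
config <- load_config(config_path)

# Demonstrate the error path without actually quitting.
messages <- character(0)
result <- run_step(
  function() stop("database unreachable"),
  log_fn   = function(msg) messages <<- c(messages, msg),
  on_fatal = function() "would exit with status 1"
)
```

In production the default `on_fatal` would end the process, and `log_fn` would be replaced by a logger whose output feeds the monitoring system.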

Web Server for Status

• R has a built-in web server!
• Store state (counts, errors, uptime, etc.) in global variable
• Web page serves state to admins
• Not performant for serving predictions

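library(Rook)   # provides Rhttpd, RhttpdApp, Request, Response
library(log4r)  # provides info()

# app.name, app.version, app.port and lgr are assumed to be set earlier
# (e.g. by the source()'d config files); foo stands in for the status text.

# Rook app: render the current state as a small HTML status page.
TestAppStatus <- function(req.env) {
  req <- Request$new(req.env)
  res <- Response$new()
  res.html.str <- '<html><head><title>%s</title></head>
<body><h1>%s %s</h1>
<p>%s...</p>
</body>
</html>'
  res$write(sprintf(res.html.str,
                    app.name, app.name, app.version,
                    foo))
  res$finish()
}

# Start R's built-in web server and register the status app.
# The app function must be defined before it is passed to RhttpdApp$new().
ws <- Rhttpd$new()
ws$add(RhttpdApp$new(name = 'TestApp', app = TestAppStatus))
ws$start(port = app.port, quiet = TRUE)
info(lgr, paste('web server at:', ws$full_url(1)))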

Testing

• Pre-deployment testing
  • testthat for unit testing
  • held-out historical data, test algorithms / params
  • recent production data, test for edge cases
• Data-flow testing
  • working with QA staff
  • change entity data, watch for updated predictions
  • monitor logs
• Post-deployment testing
  • store predictions, then compare with reality
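The post-deployment step, "store predictions, then compare with reality", can be sketched as a join between stored predictions and later-observed outcomes. The table layouts, column names, and the choice of mean absolute error and bias as metrics are assumptions for illustration.

```r
# Join stored predictions to observed outcomes and summarize the errors.
evaluate_stored_predictions <- function(predictions, actuals) {
  joined <- merge(predictions, actuals, by = "entity_id")
  errors <- joined$expected_value - joined$actual_value
  list(n    = nrow(joined),
       mae  = mean(abs(errors)),  # average size of the miss
       bias = mean(errors))       # systematic over- or under-prediction
}

# Illustrative data only.
predictions <- data.frame(entity_id = c(1, 2, 3),
                          expected_value = c(10, 20, 30))
actuals     <- data.frame(entity_id = c(1, 2, 3),
                          actual_value = c(12, 18, 30))

metrics <- evaluate_stored_predictions(predictions, actuals)
```

Tracked over time, metrics like these give the quality-monitoring side something concrete to alert on.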

Case 1b: Refitting & Scoring Daily

Jeremy Stanley @ Sailthru

Case 2: Many models, fit annually, scoring now

More answers…

• Data product: lots of maintenance
• 1000s of users
• Small component of several applications
• R has to be in the loop
• Predict batch and on demand
• Fit annually
• One model per customer (~150)
• Logged, monitored, robust error handling

Decision Process

technologies

boundaries

what does what

who does what

• System Cost
• Developer Cost
• Performance & Scaling
• Model Flexibility
• Operational Complexity
• Vendor Support & Direction

Rubric

YHat’s ScienceOps

Sci’Ops API Architecture

[Diagram components: SQL Data Refinery, Hadoop Data Lake, modeling scripts, models, ScienceOps, API, Applications, logged predictions, log files, log monitoring system]

Thanks!

@HarlanH