Velox at SF Data Mining Meetup

Post on 17-Jul-2015

VELOX: MODELS IN ACTION

Dan Crankshaw UC Berkeley AMPLab

crankshaw@cs.berkeley.edu

Marin Software 2015

Algorithms, Machines, and People

“deep questions over dirty and heterogenous data”

GraphX

BERKELEY DATA ANALYTICS STACK (BDAS)

Spark

Spark Streaming, Spark SQL, BlinkDB, GraphX, MLlib, MLbase

HDFS, S3, …, Tachyon

Mesos, Hadoop YARN

Catify: Music for Cats

MODELING TASK

Ratings

Songs

Prediction

CatID  Song  Score
1      16    2.1
1      14    3.7
3      273   4.2
4      14    1.9

Pipeline

Tachyon + HDFS

Node.js App Server

Apache Web Server

Songs

Users

O(users * songs)

Training Data

Precomputed Ratings

Black box

What’s wrong?

1. Serving system: low-latency but high staleness

2. Batch training: slow incremental maintenance, no serving

3. Ad-hoc model management

VELOX GOALS

1. Low latency and fresh predictions

2. Break the abstraction: model-specific optimizations

3. Unified system eases operation

THE MISSING PIECE IN BDAS

Training: Spark Streaming, Spark SQL, GraphX, ML library, BlinkDB, MLbase

Management + Serving: Velox (ModelManager, PredictionService)

Spark

HDFS, S3, …, Tachyon

Mesos

VELOX ARCHITECTURE

Standalone Scala Service

Shared-Nothing Serving Cluster

Personalized Predictions as a Service

Automatic Integration with Spark

SYSTEM ARCHITECTURE

frontend.js

uuid: 01-10

uuid: 11-20

uuid: 20-30

uuid: 4

Predictions via REST: returns score

Feedback via REST: model updated in realtime

master, worker, worker, worker

Batch train RPC: returns batch trained model

PREDICTION SERVICE

PREDICTION API

Simple point queries:

GET /velox/catify/predict?userid=22&song=27632

More complex ordering queries:

GET /velox/catify/predict_top_k?userid=22&k=100

Personalized Predictions: low-latency and scalable partitioning

Intelligent Caching: sharing and re-use of model partial-state
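
A minimal client-side sketch of the two query shapes above. Only the URL paths and parameter names (`userid`, `song`, `k`) come from the slides; the `VeloxPredictUrls` object and its method names are hypothetical helpers, not part of Velox.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: only the two URL shapes are taken from the slides.
object VeloxPredictUrls {
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  private def query(params: Seq[(String, String)]): String =
    params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

  // Point query: predicted score for one (user, song) pair.
  def predict(model: String, userId: Long, song: Long): String =
    s"/velox/${enc(model)}/predict?" +
      query(Seq("userid" -> userId.toString, "song" -> song.toString))

  // Ordering query: the k best-scoring items for one user.
  def predictTopK(model: String, userId: Long, k: Int): String =
    s"/velox/${enc(model)}/predict_top_k?" +
      query(Seq("userid" -> userId.toString, "k" -> k.toString))
}
```

Calling `VeloxPredictUrls.predict("catify", 22L, 27632L)` yields the point-query path shown on the slide; how the response body is encoded is not stated in the deck.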

PREDICTION EXECUTION

def predict(u: UUID, x: Context)

Read: look up user model (uuid → model)

Primary key lookup

Partition queries by user: always local

Compute Features: $f(x; \theta)$, user independent

Feature computation could be costly

Cache features for reuse across users
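
The two steps above (read the user model, then compute or reuse features) can be sketched roughly as follows. This is an illustration under assumptions, not Velox's implementation: `Context` is modelled as a plain song id, `computeFeatures` is a made-up stand-in for the shared feature model, both stores are in-memory maps, and the final score is assumed to be a dot product of user weights and features.

```scala
import java.util.UUID
import scala.collection.mutable

// Sketch only: Context, the feature function, and both stores are assumptions.
object PredictSketch {
  type Context = Int // hypothetical: a song id stands in for the query context

  // Per-user weight vectors keyed by user id (the primary-key lookup).
  val userModels = mutable.Map.empty[UUID, Array[Double]]

  // Shared feature cache: features are user independent, so one entry serves every user.
  private val featureCache = mutable.Map.empty[Context, Array[Double]]
  var featureComputations = 0 // counts cache misses, for illustration only

  // Made-up stand-in for a potentially costly shared feature model.
  private def computeFeatures(x: Context): Array[Double] = {
    featureComputations += 1
    Array(1.0, x % 3.0, x % 5.0)
  }

  def predict(u: UUID, x: Context): Double = {
    val w = userModels(u)                                       // read the user model
    val f = featureCache.getOrElseUpdate(x, computeFeatures(x)) // reuse across users
    w.zip(f).map { case (wi, fi) => wi * fi }.sum               // assumed dot-product score
  }
}
```

Two different users asking about the same song trigger only one call to `computeFeatures`, which is the reuse the slide describes.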

TOP-K QUERIES

Query predicate to pre-filter candidate set

All Songs → Playlist, Keywords → Candidate Songs

Score and rank all candidates

By exploiting split model design we can leverage:

A. Shrivastava, P. Li. “Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS).” NIPS’14 Best Paper

Y. Low and A. X. Zheng. “Fast Top-K Similarity Queries Via Matrix Compression.” CIKM 2012
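
A rough sketch of the baseline described here: pre-filter with a predicate, then score and rank every remaining candidate. It is the exhaustive version, linear in the candidate set, and does not implement the ALSH or matrix-compression techniques from the cited papers. The `Song` type, its fields, and the dot-product score are assumptions made for illustration.

```scala
// Sketch only: Song, its fields, and the feature vectors are hypothetical.
object TopKSketch {
  final case class Song(id: Int, keywords: Set[String], features: Array[Double])

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Pre-filter with the query predicate, then score and rank every candidate.
  def topK(userWeights: Array[Double], songs: Seq[Song], k: Int)(
      predicate: Song => Boolean): Seq[(Int, Double)] =
    songs
      .filter(predicate)                              // candidate set
      .map(s => s.id -> dot(userWeights, s.features)) // score each candidate
      .sortBy { case (_, score) => -score }           // best first
      .take(k)
}
```

A call such as `TopKSketch.topK(weights, songs, 100)(_.keywords.contains("jazz"))` returns up to 100 `(songId, score)` pairs, highest score first.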

MODEL MANAGER

PERSONALIZED MODELING

A Separate Model for Each User?

Computationally inefficient: many complex models

Statistically inefficient: not enough data per user

Input (Song) → Rating

Split

PERSONALIZED SPLIT MODEL

Input (Song)

Shared Basis Feature Model: Big Data, Changes Slowly, Train in Batch!

Personalized User Model: Small Data, Changes Quickly, Train Online!

Meow

Terrible

MATHEMATICAL FORMULATION

Input (Song): $x$

Shared Basis Feature Models: $f(x; \theta)$, changes slowly

Personalized User Model: $w_u$, highly dynamic

$f(x; \theta) \cdot w_u = \text{Rating}$

Meow

FEEDBACK API

Simple direct value feedback:

POST /velox/catify/observe?userid=22&song=27&score=3.7

Online Learning: continuously update user models in Velox

Offline Learning: logged to Tachyon for feature learning in Spark

Evaluation: continuously assess model performance

ONLINE LEARNING

velox.jar

def observe(u: UUID, x: Context, y: Score)

Write: update user model with new training data

Stochastic gradient descent

Incremental linear algebra
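
One way the stochastic gradient descent option could look for a single user's weight vector. This is a sketch under stated assumptions: a squared-error loss, a fixed learning rate of 0.1, zero-initialised weights, and an `observe` that takes already-computed features rather than the `Context` in the slide's signature. None of these choices is confirmed by the deck.

```scala
import java.util.UUID
import scala.collection.mutable

// Sketch only: the loss, learning rate, and zero initialisation are assumptions.
object OnlineLearningSketch {
  val userModels = mutable.Map.empty[UUID, Array[Double]]
  val learningRate = 0.1 // hypothetical step size

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // One SGD step on 0.5 * (w . f - y)^2, where f holds precomputed features.
  def observe(u: UUID, features: Array[Double], y: Double): Unit = {
    val w = userModels.getOrElseUpdate(u, Array.fill(features.length)(0.0))
    val error = dot(w, features) - y
    // The gradient with respect to w is error * f; step against it.
    userModels(u) = w.zip(features).map { case (wi, fi) =>
      wi - learningRate * error * fi
    }
  }

  def predict(u: UUID, features: Array[Double]): Double =
    dot(userModels(u), features)
}
```

Each call touches only one user's small weight vector, which is what makes a per-observation update cheap enough to run on the serving path.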

OFFLINE OR NEARLINE LEARNING

def  retrain(trainingData:  RDD)

Spark-Based Training Algs.

$w_u \cdot f(x; \theta)$

Automated retraining policies

Efficient batch training using Spark

Incremental learning using Spark Streaming

Data Model

Sample Bias: model affects the training data.

ALWAYS SERVE THE BEST SONG?

[Plot: predicted rating per song]

VELOX SOLUTION

[Plot: predicted rating per song]

With prob. 1 - ϵ serve the best predicted song

With prob. ϵ pick a random song

Epsilon Greedy

Active Learning

Opportunity to explore new systems for this emerging analytics workload
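
The epsilon-greedy rule can be sketched in a few lines. The function name, the `(songId, predictedRating)` input shape, and the injected random generator are assumptions for illustration; the deck only states the two probabilities.

```scala
import scala.util.Random

// Sketch only: the name and the (songId, predictedRating) input shape are assumptions.
object EpsilonGreedySketch {
  // With probability 1 - epsilon return the best predicted song; otherwise
  // return a uniformly random song so that its true rating can be observed.
  def chooseSong(predicted: Seq[(Int, Double)], epsilon: Double, rng: Random): Int = {
    require(predicted.nonEmpty, "need at least one candidate")
    require(epsilon >= 0.0 && epsilon <= 1.0, "epsilon must be in [0, 1]")
    if (rng.nextDouble() < epsilon)
      predicted(rng.nextInt(predicted.length))._1 // explore
    else
      predicted.maxBy(_._2)._1 // exploit
  }
}
```

Serving an occasional random song gives the model ratings for songs it would never otherwise recommend, which counters the sample bias noted earlier.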

BEYOND RECOMMENDER SYSTEMS

1. Spam and anomaly detection

2. Device/location specific modeling

3. YOUR machine learning application

SUMMARY

Today: model training and serving rely on ad-hoc, manual processes spread across multiple systems

The Velox system automatically maintains multiple models while providing low-latency, fresh, and personalized predictions

Velox will be open-source: coming soon to BDAS

https://amplab.cs.berkeley.edu/projects/velox/

crankshaw@cs.berkeley.edu

QUESTIONS?
