IMCSummit 2015 - Day 1 Developer Track - Building Fast, Scalable Machine Learning Pipelines

Building Fast, Scalable Machine Learning Pipelines. Vlad Giverts, Sr Director of Software Engineering, Workday


Posted on 15-Aug-2015



TRANSCRIPT

Page 1:

Building Fast, Scalable Machine Learning Pipelines

Vlad Giverts, Sr Director of Software Engineering, Workday

Pages 9-15 (one diagram, built up across slides):

Identified Recruit

Web Crawlers

HDFS

MR1

Data Pipeline

Hadoop

Solr

Pages 16-20 (one diagram, built up across slides):

Identified Data Pipeline 1.0

Facebook Data -> Parse -> Normalize -> Index -> Publish
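The four stages named on these slides can be read as a simple function chain. The sketch below is a minimal plain-Python illustration, not Identified's implementation: the JSON input format, the record fields (`name`, `employer`), the in-memory inverted index, and every function name are assumptions made for the example.

```python
import json

def parse(raw_line):
    """Parse: turn one raw JSON line (assumed input format) into a dict."""
    return json.loads(raw_line)

def normalize(record):
    """Normalize: collapse whitespace and lower-case every field value."""
    return {key: " ".join(str(value).split()).lower() for key, value in record.items()}

def index(records):
    """Index: map each token to the positions of the records that contain it."""
    inverted = {}
    for position, record in enumerate(records):
        for value in record.values():
            for token in value.split():
                inverted.setdefault(token, set()).add(position)
    return inverted

def publish(records, inverted):
    """Publish: bundle records and index (a stand-in for pushing to a search cluster)."""
    return {"records": records, "index": inverted}

def run_pipeline(raw_lines):
    records = [normalize(parse(line)) for line in raw_lines]
    return publish(records, index(records))

# Hypothetical input, made up for the example.
raw = [
    '{"name": "  Ada  Lovelace ", "employer": "Analytical Engines"}',
    '{"name": "Grace Hopper", "employer": "Navy"}',
]
snapshot = run_pipeline(raw)
```

Keeping each stage a pure function of its input makes it straightforward to rerun one stage or to distribute the per-record stages (Parse, Normalize) across workers.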

Pages 21-26 (one diagram, built up across slides):

Identified Data Pipeline 2.0

Facebook Data -> Parse -> Normalize

Twitter Data -> Parse -> Normalize

Doximity Data -> Parse -> Normalize

Merge (ML) -> Index -> Publish
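The "Merge (ML)" step has to decide when records from different sources describe the same person. The slides do not say which model was used, so the sketch below swaps in a hand-weighted string-similarity score as a stand-in for a learned matcher. The field names, weights, threshold, and sample records are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; a trained classifier would replace these.
WEIGHTS = {"name": 0.6, "employer": 0.4}
THRESHOLD = 0.85

def similarity(a, b):
    """String similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()

def match_score(left, right):
    """Weighted similarity over the fields listed in WEIGHTS."""
    return sum(
        weight * similarity(left.get(field, ""), right.get(field, ""))
        for field, weight in WEIGHTS.items()
    )

def merge(records):
    """Greedy grouping: a record joins the first group whose first member it matches."""
    groups = []
    for record in records:
        for group in groups:
            if match_score(group[0], record) >= THRESHOLD:
                group.append(record)
                break
        else:
            groups.append([record])
    return groups

# Made-up, already-normalized records from the three sources on the slide.
records = [
    {"source": "facebook", "name": "grace hopper", "employer": "us navy"},
    {"source": "twitter", "name": "grace hopper", "employer": "u.s. navy"},
    {"source": "doximity", "name": "alan turing", "employer": "bletchley park"},
]
groups = merge(records)
```

With these sample records, the Facebook and Twitter entries end up in one group and the Doximity entry stays on its own. Comparing each record only against a group's first member keeps the sketch short; a real matcher would also need blocking to avoid comparing every pair.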

Pages 31-37 (one diagram, built up across slides):

Retention Risk

Spark

HDFS

ElasticSearch

YARN

ML Pipeline

Kafka

Indexing

Pages 38-39 (one diagram, built up across slides):

ML Pipeline

Snapshot Data -> Feature Extraction

Page 40:

Data and “Features”

Tenure

Time in Current Function

Pay Range Penetration

Manager Attrition Rate

Num Promotions

Avg Time Between Promotions
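As a rough illustration of how the features listed on this slide could be derived from one employee's snapshot record, here is a minimal sketch. The record layout and every field name are hypothetical. Two definitions are assumptions as well: the manager attrition rate is taken as the share of the manager's reports who left, and the average time between promotions counts the gap from hire to first promotion.

```python
from datetime import date

def years_between(start, end):
    """Elapsed time in years, using an average year length of 365.25 days."""
    return (end - start).days / 365.25

def extract_features(employee, as_of):
    """Compute the slide's features for one employee as of a snapshot date."""
    promotions = sorted(employee["promotion_dates"])
    # Milestones: the hire date followed by each promotion date (assumed definition).
    milestones = [employee["hire_date"]] + promotions
    gaps = [years_between(a, b) for a, b in zip(milestones, milestones[1:])]
    pay_min, pay_max = employee["pay_range"]
    return {
        "tenure_years": years_between(employee["hire_date"], as_of),
        "time_in_current_function_years": years_between(employee["function_start"], as_of),
        # Range penetration: where the salary sits in its range (0 = minimum, 1 = maximum).
        "pay_range_penetration": (employee["salary"] - pay_min) / (pay_max - pay_min),
        # Assumed definition: reports who left divided by all reports.
        "manager_attrition_rate": employee["manager_reports_left"] / employee["manager_reports_total"],
        "num_promotions": len(promotions),
        "avg_years_between_promotions": sum(gaps) / len(gaps) if gaps else None,
    }

# Made-up record for illustration only.
employee = {
    "hire_date": date(2010, 1, 1),
    "function_start": date(2013, 1, 1),
    "promotion_dates": [date(2012, 1, 1), date(2014, 1, 1)],
    "salary": 90000,
    "pay_range": (80000, 120000),
    "manager_reports_left": 2,
    "manager_reports_total": 10,
}
features = extract_features(employee, as_of=date(2015, 1, 1))
```

Passing the snapshot date explicitly (`as_of`) lets the same function produce features for any past quarter, which is what a time-based training setup needs.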

Pages 41-43 (one diagram, built up across slides):

ML Pipeline

Snapshot Data -> Feature Extraction -> Model Training -> Model Validation

Page 44:

Training and Validation

Timeline, 2014 to 2016:

Barry: Raise: $1,000

Ravi: Left :(

John: Left :(

Albert: Promoted!

Yury: Hired

Tejas: Changed Teams

Pages 45-46 (one diagram, built up across slides):

Training and Validation

Q1 '14, Q2 '14, Q3 '14, Q4 '14, Q1 '15, Q2 '15

TRAINING, VALIDATION
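The timeline on these slides suggests splitting by time rather than at random: fit on earlier quarters and validate on later ones, so that no information from the future leaks into training. The sketch below illustrates that idea only. The transcript does not preserve where the TRAINING and VALIDATION labels sit, so the four-quarter/two-quarter boundary is an assumption, the rows are made up, and the single-feature cutoff "model" is a toy stand-in for whatever was actually trained.

```python
QUARTERS = ["Q1 '14", "Q2 '14", "Q3 '14", "Q4 '14", "Q1 '15", "Q2 '15"]

def time_split(rows, n_train_quarters):
    """Rows from the first n quarters go to training, the rest to validation."""
    train_quarters = set(QUARTERS[:n_train_quarters])
    train = [row for row in rows if row["quarter"] in train_quarters]
    valid = [row for row in rows if row["quarter"] not in train_quarters]
    return train, valid

def accuracy(cutoff, rows):
    """Share of rows where 'tenure below cutoff' agrees with the observed outcome."""
    return sum((row["tenure_years"] < cutoff) == row["left"] for row in rows) / len(rows)

def fit_cutoff(train):
    """Toy model: pick the tenure cutoff with the best accuracy on the training rows."""
    candidates = sorted({row["tenure_years"] for row in train})
    return max(candidates, key=lambda cutoff: accuracy(cutoff, train))

# Made-up rows: one employee snapshot per quarter, with whether the person later left.
rows = [
    {"quarter": "Q1 '14", "tenure_years": 0.5, "left": True},
    {"quarter": "Q2 '14", "tenure_years": 4.0, "left": False},
    {"quarter": "Q3 '14", "tenure_years": 1.0, "left": True},
    {"quarter": "Q4 '14", "tenure_years": 6.0, "left": False},
    {"quarter": "Q1 '15", "tenure_years": 0.8, "left": True},
    {"quarter": "Q2 '15", "tenure_years": 5.0, "left": False},
]
train, valid = time_split(rows, n_train_quarters=4)
cutoff = fit_cutoff(train)
validation_accuracy = accuracy(cutoff, valid)
```

Only the training rows influence the chosen cutoff; the validation quarters are used solely to check how well it carries forward in time.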

Pages 47-48 (one diagram, built up across slides):

ML Pipeline

Snapshot Data -> Feature Extraction -> Model Training -> Model Validation -> Evaluation

Pages 49-51 (one diagram, built up across slides):

Evaluation

Q1 '14, Q2 '14, Q3 '14, Q4 '14, Q1 '15, Q2 '15, Q3 '15, Q4 '15

PREDICTION

Pages 52-53 (one diagram, built up across slides):

ML Pipeline

Snapshot Data -> Feature Extraction -> Model Training -> Model Validation -> Evaluation -> Publish Results
