Distributed Time Travel for Feature Generation
Prasanna Padmanabhan, DB Tsai, Mohammad H. Taghavi
Turn on Netflix, and the absolute best content for you would automatically start playing
Everything is a Recommendation
[Figure labels: Ranking, Rows]
• Over 80% of what members watch comes from our recommendations
• Recommendations are driven by Machine Learning Algorithms
Data Driven
• Try an idea offline using historical data to see if it would have made better recommendations
• If it did, deploy a live A/B test to see if it performs well in Production
Why build a Time Machine?
Quickly try ideas on historical data and transition to online A/B test
The Past
• Generate features based on event data logged in Hive
– Need to reimplement features for online A/B test
– Data discrepancies between offline and online sources
• Log features online where the model will be used
– Need to deploy each idea into production
• Feature generation calls online services and filters data past a certain time
– Works only when a service records a log of historical events
– Additional load on online services
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Time Travel using Snapshots
• Snapshot online services and use the snapshot data offline to generate features
• Share facts and features between experiments without calling live systems
How to build a Time Machine
Context Selection
Data Snapshots
APIs for Time Travel
Context Selection
[Diagram: Hive → Context Selection (stratified sampling, runs once a day) → Context Set in S3. Contexts are tagged with metadata.]
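The stratified sampling step above can be sketched in a few lines. This is a minimal, in-memory illustration only: the slides do not show the actual job, so the function name `stratified_sample`, the context fields (`profile_id`, `country`, `device`), and the sampling fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(contexts, stratum_of, fraction, seed=0):
    """Sample roughly the same fraction of contexts from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ctx in contexts:
        strata[stratum_of(ctx)].append(ctx)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keep at least one context per stratum so small strata are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical contexts tagged with metadata: 25 from BR, 75 from US.
contexts = [
    {"profile_id": i, "country": "BR" if i % 4 == 0 else "US", "device": "tv"}
    for i in range(100)
]

# Stratify by country so each country keeps its share of the sample.
context_set = stratified_sample(contexts, lambda c: c["country"], fraction=0.2)
```

Stratifying on a metadata tag keeps the sampled context set representative along that dimension, which plain uniform sampling does not guarantee for small groups.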
Data Snapshots
[Diagram: Context Set (S3) → Data Snapshots (runs once a day) → Snapshot (S3, Parquet). The snapshot job goes through Prana (Netflix Libraries) to reach the Viewing History Service, MyList Service, and Ratings Service (Thrift).]
• Snapshot data for each Context
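A rough sketch of the snapshot idea, assuming a plain dictionary in place of Parquet files in S3 and two hypothetical local functions in place of the real services. The names `take_snapshot` and `read_snapshot`, the key layout, and the date are illustrative assumptions, not the actual API.

```python
from datetime import date


# Hypothetical stand-ins for online services; real ones would be remote calls.
def viewing_history_service(profile_id):
    return ["video_1", "video_2"] if profile_id == 1 else []


def mylist_service(profile_id):
    return ["video_9"]


SERVICES = {
    "viewing_history": viewing_history_service,
    "mylist": mylist_service,
}


def take_snapshot(store, snapshot_date, context_set):
    """Record what each service returns for each context on snapshot_date."""
    for ctx in context_set:
        for name, service in SERVICES.items():
            key = (snapshot_date, name, ctx["profile_id"])
            store[key] = service(ctx["profile_id"])


def read_snapshot(store, destination_date, name, profile_id):
    """Time travel: return the data as it was recorded on destination_date."""
    return store[(destination_date, name, profile_id)]


store = {}  # in-memory stand-in for the snapshot storage
take_snapshot(store, date(2016, 2, 1), [{"profile_id": 1}, {"profile_id": 2}])
history = read_snapshot(store, date(2016, 2, 1), "viewing_history", 1)
```

Because reads go to the stored snapshot rather than the live service, offline experiments see the data as it was at the destination time and add no load to the online services.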
APIs for Time Travel
Data Architecture
[Diagram combining the three stages]
• Context Selection: runs once a day; stratified sampling over Hive; contexts tagged with metadata; output is the Context Set in S3
• Data Snapshots: runs once a day; goes through Prana (Netflix Libraries) to the Viewing History Service, MyList Service, and Ratings Service (Thrift); output is the Snapshot in S3
• Batch APIs: RDD of Snapshot Objects
Generating Features via Time Travel
Great Scott!
• DeLorean: A time-traveling vehicle
– uses data snapshots to travel in time
– scales with Apache Spark
– prototypes new ideas with Zeppelin
– requires minimal code changes from experimentation to A/B test to production
https://en.wikipedia.org/wiki/Emmett_Brown
There’s the DeLorean!
Running Time Travel Experiment
Select the destination time
Bring it up to 88 miles per hour!
Running Time Travel Experiment
[Flow diagram]
Offline Experiment:
• Design Experiment
• Collect Label Dataset
• DeLorean: Offline Feature Generation (using Selected Contexts)
• Distributed Model Training: parallel training of individual models using different executors
• Compute Validation Metrics
• Model Testing
• Choose best model
• Bad Metrics: Design a New Experiment to Test Out Different Ideas
Online System:
• Good Metrics: Online AB Testing
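The train, validate, and choose-best part of this loop can be illustrated with a toy example. Everything here is an assumption made for the sketch: a thread pool stands in for Spark executors, a one-parameter ridge regression stands in for the real models, and the data points are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy (feature, label) pairs; a real experiment would use generated features.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
validation = [(4.0, 8.0), (5.0, 10.0)]


def train_model(lam):
    """Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def validation_mse(w):
    """Mean squared error of the model y = w * x on the validation set."""
    return sum((y - w * x) ** 2 for x, y in validation) / len(validation)


def run_candidate(lam):
    w = train_model(lam)
    return lam, w, validation_mse(w)


# Each candidate model is trained independently, so they can run in parallel.
candidates = [0.0, 1.0, 10.0]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_candidate, candidates))

# Choose the model with the best (lowest) validation metric.
best_lam, best_w, best_mse = min(results, key=lambda r: r[2])
```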
DeLorean Input Data
• Contexts: The setting for evaluating a set of items (e.g. tuples of member profile, country, time, device).
• Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities).
• Labels: For supervised learning, this will be the label (target) for each item.
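One way to picture these three inputs is as simple record types. The class and field names below are assumptions chosen to mirror the bullet points; they are not the actual DeLorean types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Context:
    """The setting in which a set of items is evaluated."""
    profile_id: int
    country: str
    time: str  # timestamp of the context, kept as a string for simplicity
    device: str


@dataclass(frozen=True)
class LabeledItem:
    """An element to be scored or ranked, with its supervised-learning label."""
    item_id: str
    label: float


@dataclass(frozen=True)
class Example:
    """One context together with the labeled items evaluated in it."""
    context: Context
    items: List[LabeledItem]


example = Example(
    context=Context(profile_id=1, country="US",
                    time="2016-02-01T20:00:00Z", device="tv"),
    items=[LabeledItem("video_1", 1.0), LabeledItem("video_2", 0.0)],
)
```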
Feature Encoders
• Compute features for each item in a given context
• Each type of raw data element has its own data key
• Data map is a map from data keys to data objects in a given context
• Data map is consumed by feature encoder to compute features
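The relationship between data keys, the data map, and feature encoders might look roughly like this. The `FeatureEncoder` class, the key constants, and the two example encoders are hypothetical and exist only to show the shape of the idea.

```python
# Hypothetical data keys, one per type of raw data element.
VIEWING_HISTORY = "viewing_history"
MYLIST = "mylist"


class FeatureEncoder:
    """Computes one feature for an item from the data map of a context."""

    def __init__(self, name, required_keys, fn):
        self.name = name
        self.required_keys = required_keys  # data keys this encoder reads
        self.fn = fn

    def encode(self, item, data_map):
        missing = [k for k in self.required_keys if k not in data_map]
        if missing:
            raise KeyError(f"{self.name} is missing data keys: {missing}")
        return self.fn(item, data_map)


in_mylist = FeatureEncoder(
    "in_mylist", [MYLIST],
    lambda item, data: 1.0 if item in data[MYLIST] else 0.0,
)
already_watched = FeatureEncoder(
    "already_watched", [VIEWING_HISTORY],
    lambda item, data: 1.0 if item in data[VIEWING_HISTORY] else 0.0,
)

# Data map for one context: data key -> data object.
data_map = {VIEWING_HISTORY: ["video_1"], MYLIST: ["video_2", "video_3"]}

features = {
    enc.name: enc.encode("video_2", data_map)
    for enc in (in_mylist, already_watched)
}
```

Declaring the required keys on each encoder lets the caller work out which data elements must be loaded before any feature is computed.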
Two Types of Data Elements
• Context-dependent data elements
– Viewing History
– MyList
– ...
• Context-independent data elements
– Video Metadata
– Genre Metadata
– ...
Video Country of Origin Matching Fraction
[Figure: Context-Items pairs are combined with a context-dependent data element (Viewing History) and a context-independent data element (Video Metadata) to produce Features. Values shown in the figure: 0.5, 0.5, 0.5 and 1.0, 0.0, 1.0.]
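A sketch of how such a feature could be computed, under one plausible reading of its name: the fraction of videos in the viewing history whose country of origin matches that of the item being scored. The function name, the metadata layout, and the sample videos are assumptions, and the numbers are not meant to reproduce the figure.

```python
def country_of_origin_matching_fraction(item, viewing_history, video_metadata):
    """Fraction of watched videos that share the item's country of origin."""
    if not viewing_history:
        return 0.0
    item_country = video_metadata[item]["country"]
    matches = sum(
        1 for video in viewing_history
        if video_metadata[video]["country"] == item_country
    )
    return matches / len(viewing_history)


# Context-independent data element: the same for every context.
video_metadata = {
    "video_a": {"country": "US"},
    "video_b": {"country": "US"},
    "video_c": {"country": "FR"},
    "video_d": {"country": "FR"},
    "video_e": {"country": "JP"},
}

# Context-dependent data element: one member's viewing history.
viewing_history = ["video_a", "video_c"]

fractions = {
    item: country_of_origin_matching_fraction(
        item, viewing_history, video_metadata)
    for item in ["video_b", "video_d", "video_e"]
}
```

The encoder needs both kinds of data element: the viewing history changes with each context, while the video metadata can be loaded once and shared.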
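Feature Generation
[Diagram components: S3 Snapshot, Data Elements, Data in POJOs, Data Keys, Data Map, Feature Model (JSON), Required Feature Keys, Feature Encoders, Label Data, Features, Labels, Model Training. Data Elements load snapshot data into a Data Map indexed by Data Keys; the Feature Encoders named by the Feature Model consume the Data Map and Label Data to produce the Features and Labels used for Model Training.]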
Features
• Represented as Spark DataFrames
• Nested structure avoids data shuffling in the ranking process
• Stored in Parquet format in S3
Features
[Figure: feature table with a Context column and a nested column holding Item, label, and features]
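The nested layout can be pictured with plain Python records: one row per context, with that context's items, labels, and features nested inside it. The field names and the single feature `f1` are hypothetical. A real pipeline would hold these rows in a Spark DataFrame; lists and dictionaries are used here only to show why ranking needs no regrouping.

```python
# One row per context; items, labels, and features are nested in the row.
rows = [
    {
        "context": {"profile_id": 1, "country": "US"},
        "items": [
            {"item_id": "video_1", "label": 0.0, "features": {"f1": 0.2}},
            {"item_id": "video_2", "label": 1.0, "features": {"f1": 0.9}},
        ],
    },
    {
        "context": {"profile_id": 2, "country": "BR"},
        "items": [
            {"item_id": "video_3", "label": 1.0, "features": {"f1": 0.4}},
        ],
    },
]


def rank_row(row, score):
    """Rank the items of a single context using only data inside its row."""
    ordered = sorted(
        row["items"], key=lambda it: score(it["features"]), reverse=True)
    return [it["item_id"] for it in ordered]


# Every row is ranked independently; no grouping across rows is required.
ranked = [rank_row(row, lambda f: f["f1"]) for row in rows]
```

With a flat layout of one row per (context, item) pair, the same ranking would first require grouping rows by context, which in a distributed setting means a shuffle.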
Going Online
[Diagram]
• Offline Experiment: S3 Snapshot → DeLorean: Offline Feature Generation → Model Training / Validation / Testing
• Online System: Viewing History Service, MyList Service, Ratings Service → Online Feature Generation → Online Ranking / Scoring Service
• Deploy models from the offline experiment to the online system
• Shared Feature Encoders are used by both offline and online feature generation
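The point of sharing feature encoders can be shown with a small sketch: the same encoder function runs offline against snapshot data and online against a live service, and only the way the data map is built differs. The encoder, the snapshot dictionary, and the `live_viewing_history` function are hypothetical stand-ins.

```python
def already_watched(item, data_map):
    """Shared feature encoder: identical code offline and online."""
    return 1.0 if item in data_map["viewing_history"] else 0.0


# Offline: the data map is filled from a stored snapshot (stand-in for S3).
snapshot = {1: ["video_1", "video_2"]}


def offline_data_map(profile_id):
    return {"viewing_history": snapshot[profile_id]}


# Online: the data map is filled by calling a service (stubbed here).
def live_viewing_history(profile_id):
    return ["video_1", "video_2"]


def online_data_map(profile_id):
    return {"viewing_history": live_viewing_history(profile_id)}


offline_feature = already_watched("video_2", offline_data_map(1))
online_feature = already_watched("video_2", online_data_map(1))
```

Since the encoder is not reimplemented for production, a feature that matched offline should match online whenever the underlying data is the same.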
Conclusion
Spark helped us significantly reduce the time from an idea to an A/B test
Future work
Event Driven Data Snapshots
Time Travel to the Future!!
We’re hiring! (come talk to us)
https://jobs.netflix.com/
Tech Blog: http://bit.ly/sparktimetravel