Building a Scalable Recommendation Engine with Apache Spark
TRANSCRIPT
![Page 1: Building a Scalable Recommendation Engine with Apache Spark](https://reader031.vdocuments.us/reader031/viewer/2022013120/58a1a7731a28ab5c788bf1c2/html5/thumbnails/1.jpg)
Nov / 14 / 16
Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch
Nick Pentreath
About
§ @MLnick
§ Principal Engineer, IBM
§ Apache Spark PMC
§ Focused on machine learning
§ Author of Machine Learning with Spark
Agenda
§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark, Kafka & Elasticsearch?
§ Kafka & Spark Streaming
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models with Elasticsearch
§ Monitoring, feedback & re-training
§ Scaling model serving
§ Demo
Recommender Systems & the ML Workflow
Recommender Systems
Overview
The Machine Learning Workflow
Perception
Data ??? Machine Learning ??? $$$
The Machine Learning Workflow
Reality
Data
• Historical
• Streaming
Ingest
Data Processing
• Feature transformation & engineering
Model Training
• Model selection & evaluation
Deploy
• Pipelines, not just models
• Versioning
Live System
• Predict given new data
• Monitoring & live evaluation
Feedback Loop
Spark DataFrames
Spark ML
Various ???
Stream (Kafka)
Missing piece!
The Machine Learning Workflow
Recommender Version
Data
Ingest
Data Processing
• Aggregation
• Handle implicit data
Model Training
• ALS
• Ranking-style evaluation
Deploy
• Model size & complexity
Live System
• User & item recommendations
• Monitoring, filters
Feedback => another Event Type
Spark DataFrames
Spark ML
Elasticsearch
• User & Item Metadata
• Events
Elasticsearch
Stream (Kafka)
Data Modeling for Recommender Systems
Data model
User and Item Metadata
System Requirements
User and Item Metadata
Filtering & Grouping
Business Rules
User Interactions
Anatomy of a User Event
Implicit preference data
• Page view
• eCommerce – cart, purchase
• Media – preview, watch, listen
Intent data
• Search query
Explicit preference data
• Rating
• Review
Social network interactions
• Like
• Share
• Follow
Data model
Anatomy of a User Event
How to handle implicit feedback?
Anatomy of a User Event
Why Kafka, Spark & Elasticsearch?
Why Kafka?
Scalability
§ De facto standard for a centralized enterprise message / event queue
Integration
§ Integrates with just about every storage & processing system
§ Good Spark Streaming integration – 1st class citizen
§ Including for Structured Streaming (but still very new & rough!)
Why Spark?
DataFrames
§ Events & metadata are “lightly structured” data
§ Suited to DataFrames
§ Pluggable external data source support
Spark ML
§ Spark ML pipelines – including scalable ALS model for collaborative filtering
§ Implicit feedback & NMF in ALS
§ Cross-validation
§ Custom transformers & algorithms
Why Elasticsearch?
Storage
§ Native JSON
§ Scalable
§ Good support for time-series / event data
§ Kibana for data visualisation
§ Integration with Spark DataFrames
Scoring
§ Full-text search
§ Filtering
§ Aggregations (grouping)
§ Search ~== recommendation (more later)
Kafka for Recommender Systems
Event Data Pipeline
Kafka
Spark Streaming
Item analytics & aggregation
User analytics & aggregation
Event store
Dashboards
Write to Event Store
Spark Streaming
Event store
Kibana Dashboards
Spark Streaming
Dashboards
Item Metadata Analytics
Spark Streaming
Item analytics & aggregation
Aggregated activity metrics
User Metadata Analytics
Spark Streaming
User analytics & aggregation
Aggregated activity metrics & item exclusions
Structured Streaming
Status
§ Still early days
§ Initial Kafka support in Spark 2.0.2
§ No ES support yet – not clear if it will be a full-blown datasource or ForeachWriter
§ For now, you can create a custom ForeachWriter for your needs
Spark ML for Collaborative Filtering
Matrix Factorization
Collaborative Filtering
Observed ratings (sparse matrix; cell positions lost in extraction): 315 2 / 12 1
First factor matrix (5 × 3):
−1.1 3.2 4.3
0.2 1.4 3.1
2.5 0.3 2.3
4.3 −2.4 0.5
3.6 0.3 1.2
Second factor matrix (3 × 4):
0.2 1.7 2.3 0.1
1.9 0.4 0.8 −0.3
1.5 −1.2 0.3 1.2
Prediction
Collaborative Filtering
Observed ratings (sparse matrix; cell positions lost in extraction): 315 2 / 12 1
First factor matrix (5 × 3):
−1.1 3.2 4.3
0.2 1.4 3.1
2.5 0.3 2.3
4.3 −2.4 0.5
3.6 0.3 1.2
Second factor matrix (3 × 4):
0.2 1.7 2.3 0.1
1.9 0.4 0.8 −0.3
1.5 −1.2 0.3 1.2
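As a rough illustration of the prediction step, the sketch below scores a user-item pair as the dot product of the corresponding factor vectors. It assumes the numbers on this slide form a 5 × 3 user-factor matrix and a 3 × 4 item-factor matrix; that reading, and the helper names `predict` and `top_items`, are assumptions made for illustration rather than something the slide states.

```python
# Assumed reading of the slide: 5 users x 3 latent factors.
user_factors = [
    [-1.1, 3.2, 4.3],
    [0.2, 1.4, 3.1],
    [2.5, 0.3, 2.3],
    [4.3, -2.4, 0.5],
    [3.6, 0.3, 1.2],
]

# Assumed reading of the slide: 3 latent factors x 4 items.
item_factors = [
    [0.2, 1.7, 2.3, 0.1],
    [1.9, 0.4, 0.8, -0.3],
    [1.5, -1.2, 0.3, 1.2],
]


def predict(user, item):
    """Predicted preference = dot product of the user row and the item column."""
    return sum(
        user_factors[user][f] * item_factors[f][item]
        for f in range(len(item_factors))
    )


def top_items(user, n):
    """Return the indices of the n highest-scoring items for a user."""
    num_items = len(item_factors[0])
    ranked = sorted(range(num_items), key=lambda item: predict(user, item), reverse=True)
    return ranked[:n]


print(round(predict(0, 0), 2))  # 12.31
print(top_items(0, 2))          # [0, 3]
```

Ranking every item this way is the brute-force scoring that the later slides on scaling try to speed up.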
Loading Data in Spark ML
Collaborative Filtering
Implicit Preference Data
Alternating Least Squares
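For orientation, the implicit-feedback objective from "Collaborative Filtering for Implicit Feedback Datasets" (listed in the references) can be written as below. The symbols are the paper's, not the slide's: $r_{ui}$ is the raw interaction strength (for example an event count), $p_{ui}$ the binary preference, $c_{ui}$ the confidence, $x_u$ and $y_i$ the user and item factor vectors, and $\alpha$, $\lambda$ tunable hyperparameters.

```latex
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{if } r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

\min_{x_*,\, y_*} \;
\sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

ALS alternates between solving for all $x_u$ with the $y_i$ fixed and vice versa; each half-step is a regularised least-squares problem.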
Deploying & Scoring Recommendation Models
Full-text Search & Similarity
Prelude: Search
Pipeline stages: Analysis, Term vectors, Scoring, Ranking
Query: “cat videos”
Query term vector:
0 1 ⋯ 1 0 ⋯
Document term vectors:
0 0 ⋯ 0 1 ⋯
0 1 ⋯ 1 1 ⋯
1 1 ⋯ 0 0 ⋯
1 0 ⋯ 0 1 ⋯
Similarity
Sort results
Can we use the same machinery?
Recommendation
Pipeline stages: Analysis, Term vectors, Scoring, Ranking
User (or item) vector:
1.2 ⋯ −0.2 0.3
?
Document term vectors:
0 0 ⋯ 0 1 ⋯
0 1 ⋯ 1 1 ⋯
1 1 ⋯ 0 0 ⋯
1 0 ⋯ 0 1 ⋯
Similarity
Sort results
Dot product & cosine similarity… the same as we need for recommendations!
Delimited Payload Filter
Elasticsearch Term Vectors
Raw vector
1.2 ⋯ −0.2 0.3
Term vector with payloads
0|1.2 ⋯ 3|-0.2 4|0.3
Custom analyzer
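The encoding on this slide can be sketched as follows: each vector position becomes a "term" and its value becomes that term's payload, separated by `|`. The function name `to_payload_string` and the analyzer name `payload_analyzer` are hypothetical. The settings dictionary assumes an Elasticsearch version where the token filter is called `delimited_payload_filter` (later releases renamed it `delimited_payload`) and relies on that filter's defaults of a `|` delimiter and float encoding.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as 'index|value' tokens separated by spaces."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))


# Assumed index settings: a whitespace tokenizer splits the tokens, and the
# delimited payload filter stores the number after the delimiter as the
# payload of each term. "payload_analyzer" is a made-up name.
analyzer_settings = {
    "analysis": {
        "analyzer": {
            "payload_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["delimited_payload_filter"],
            }
        }
    }
}

print(to_payload_string([1.2, -0.2, 0.3]))  # 0|1.2 1|-0.2 2|0.3
```

A field indexed with such an analyzer would also need term vectors with payloads enabled in its mapping so that a scoring script can read them back.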
Custom scoring function
Elasticsearch Scoring
• Native script (Java), compiled for speed
• Scoring function computes dot product by:
§ For each document vector index (“term”), retrieve payload
§ score += payload * query(i)
• Normalizes with query vector norm and document vector norm for cosine similarity
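The scoring loop described above can be mimicked outside Elasticsearch. The sketch below is plain Python, not the plugin's Java code: it parses an `index|value` term vector, accumulates `payload * query[i]`, and optionally normalises by both vector norms for cosine similarity. The names `parse_payloads` and `score` are hypothetical.

```python
import math


def parse_payloads(term_vector, delimiter="|"):
    """Parse '0|1.2 1|-0.2' into {0: 1.2, 1: -0.2}."""
    payloads = {}
    for token in term_vector.split():
        index, value = token.split(delimiter)
        payloads[int(index)] = float(value)
    return payloads


def score(term_vector, query, cosine=False):
    """Dot product of a stored document vector with a query vector.

    With cosine=True the result is divided by both vector norms.
    """
    payloads = parse_payloads(term_vector)
    dot = 0.0
    for i, payload in payloads.items():
        dot += payload * query[i]
    if not cosine:
        return dot
    doc_norm = math.sqrt(sum(p * p for p in payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (doc_norm * query_norm)


print(score("0|1.0 1|2.0", [3.0, 4.0]))  # 11.0
```

In the real system this loop runs per matching document inside the search engine, which is why filters and other query clauses can be combined with the vector score.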
Can we use the same machinery?
Recommendation
Pipeline stages: Analysis (delimited payload filter), Term vectors, Scoring (custom scoring function), Ranking
User (or item) vector:
1.2 ⋯ −0.2 0.3
Document vectors:
−1.1 1.3 ⋯ 0.4
1.2 −0.2 ⋯ 0.3
0.5 0.7 ⋯ −1.3
0.9 1.4 ⋯ −0.8
Sort results
We get search engine functionality for free!
Elasticsearch Scoring
Deploying to Elasticsearch
Alternating Least Squares
Monitoring & Feedback
Logging Recommendations Served
System Events
Logging Recommendation Actions
System Events
Tracking Performance
Kafka
Spark Streaming
Impression capping / fatigue
Performance monitoring & alerts
Event store
Dashboards
Scaling Model Scoring
Scoring Performance
[Chart: Scoring time per query, by factor dimension & number of items. x-axis: size of item set (100,000; 1,000,000); y-axis: time (ms), 0–600; series: k=20, k=50, k=100. *3x nodes, 30x shards]
Scoring Performance
Increasing number of shards
[Chart: Scoring time per query, by number of shards & number of items. x-axis: size of item set (100,000; 1,000,000); y-axis: time (ms), 0–500; series: 10 shards, 30 shards, 60 shards, 90 shards. *3x nodes, k=50]
Scoring Performance
Locality Sensitive Hashing
• LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
• Standard for cosine similarity is Sign Random Projections
• At indexing time, create a “bucket” by combining hash table id and hash signature
• Store buckets as part of item model metadata
• At scoring time, filter candidate set using term filter on buckets of query item
• Tune LSH parameters to trade off speed / accuracy
• LSH coming soon to Spark ML – SPARK-5992
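The bucketing scheme in the bullets above can be sketched with sign random projections: each of the L tables holds k random hyperplanes, and the signature is the string of sign bits of the vector's projections onto them. This is a minimal illustration, not the talk's implementation; the Gaussian hyperplanes, the `tableId_signature` bucket format, and all function names are assumptions.

```python
import random


def make_hyperplanes(num_tables, num_hashes, dim, seed=42):
    """Draw num_tables sets of num_hashes random Gaussian hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]


def lsh_buckets(vector, hyperplanes):
    """Return one bucket per table: table id combined with the sign-bit signature."""
    buckets = []
    for table_id, table in enumerate(hyperplanes):
        bits = "".join(
            "1" if sum(h * v for h, v in zip(plane, vector)) >= 0 else "0"
            for plane in table
        )
        buckets.append(f"{table_id}_{bits}")
    return buckets


def candidates(query_buckets, item_buckets):
    """Keep items sharing at least one bucket with the query (the 'term filter')."""
    wanted = set(query_buckets)
    return [item for item, buckets in item_buckets.items() if wanted & set(buckets)]


planes = make_hyperplanes(num_tables=3, num_hashes=4, dim=3)
index = {"item_a": lsh_buckets([1.2, -0.2, 0.3], planes)}
print(candidates(lsh_buckets([2.4, -0.4, 0.6], planes), index))  # ['item_a']
```

Because only the sign of each projection matters, a vector and any positive multiple of it land in the same buckets, which is what makes this family suitable for cosine similarity. Raising k makes buckets more selective (faster, lower recall); raising L adds more chances to collide (slower, higher recall).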
Scoring Performance
Locality Sensitive Hashing
[Chart: Scoring time per query, brute force vs LSH. y-axis: time (ms), 0–250; bars: Brute force, LSH. *3x nodes, 30x shards, k=50, 1,000,000 items]
Scoring Performance
Comparison to “score then search”
[Chart: Scoring time per query, LSH vs score-then-search. y-axis: time (ms), 0–250; bars: Brute force, LSH, Score-then-search; components: Score, Sort, Search. *3x nodes, 30x shards, k=50, 1,000,000 items]
Demo
Future Work
• Apache Solr version of scoring plugin (any takers?)
• Investigate ways to improve Elasticsearch scoring performance
§ Performance for LSH-filtered scoring should be better!
§ Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
• Investigate more complex models
§ Factorization machines & other contextual recommender models
§ Scoring performance
• Spark Structured Streaming with Kafka, Elasticsearch & Kibana
§ Continuous recommender application including data, model training, analytics & monitoring
References
• Elasticsearch
• Elasticsearch Spark Integration
• Spark ML ALS for Collaborative Filtering
• Collaborative Filtering for Implicit Feedback Datasets
• Factorization Machines
• Elasticsearch Term Vectors & Payloads
• Delimited Payload Filter
• Vector Scoring Plugin
• Kafka & Spark Streaming
• Kibana
Thanks!
https://github.com/MLnick/elasticsearch-vector-scoring