REAL-TIME RECOMMENDATIONS FOR RETAIL: ARCHITECTURE, ALGORITHMS, AND DESIGN
Juliet Hougland and Jonathan Natkins
Who Are We?
Jonathan Natkins
  Field Engineer at WibiData
  Before that, Cloudera Software Engineer
  Before that, Vertica Software/Field Engineer
Juliet Hougland
  Data Scientist, previously at WibiData
  MS in Applied Math
  BA in Math-Physics
Recommendations in Retail
Personalized versus Non-Personalized
Recommender Contexts
Taste History
  Based on everything you know about a user
  Interests over months/years
Current Taste
  Based on a user’s immediate history
  Interests over minutes/hours
Ephemeral
  Extreme version of current taste
  For example, location
Demographic*
  Similar to taste history, but less subjective
  Geographic region, age bracket, etc.
Why Does Real-Time Matter?
Relevancy
I am a Special Snowflake
Natty
Requirements for a Real-Time System
General System Requirements
  Handle millions of customers/users
  Support collection and storage of complex data
    Static and event-series
Real-Time System Requirements
  Quickly retrieve subsets of data for a single user
  Aggregate/derive new, first-class data per user
What is Kiji?
The Kiji project is a modular, open-source framework for building real-time applications that collect, store, and analyze entity-centric data
kiji.org
github.com/kijiproject
Three Challenges
Developing models for use in real-time
Scoring models in real-time
Deploying models into a production environment
How Can We Make Real-Time Models?
Population interests change slowly
Individual interests change quickly
Models don’t need to be retrained frequently
Application of a model should be fast
A Common Workflow
Train a model over the entire dataset
Save fitted model parameters to a file or another table
Access the model parameters when generating new recommendations based on new data
This is EXPENSIVE
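The workflow above can be sketched as follows. This is a minimal illustration, not the speakers' implementation: the "model" is a toy popularity count, the parameters are saved as JSON, and all function names and data are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter

def train(events):
    """Batch step: fit a toy popularity model over the entire dataset."""
    counts = Counter(item for _user, item in events)
    return {"popularity": dict(counts)}

def save_model(params, path):
    """Persist fitted parameters so scoring does not require retraining."""
    with open(path, "w") as f:
        json.dump(params, f)

def load_model(path):
    """Read the fitted parameters back when recommendations are needed."""
    with open(path) as f:
        return json.load(f)

def recommend(params, seen_items, n=2):
    """Scoring step: rank items the user has not seen by stored popularity."""
    candidates = [(count, item) for item, count in params["popularity"].items()
                  if item not in seen_items]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _count, item in candidates[:n]]

# Hypothetical (user, item) purchase events.
events = [("beaker", "banana slicer"), ("bunsen", "banana slicer"),
          ("bunsen", "pineapple slicer"), ("beaker", "test tube"),
          ("kermit", "pineapple slicer"), ("kermit", "banana slicer")]

model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train(events), model_path)
recommendations = recommend(load_model(model_path), seen_items={"banana slicer"})
```

The expensive part is `train`, which touches every event. `recommend` only reads the stored parameters, which is why the training and scoring steps can run on very different schedules.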
Developing Models
KijiExpress
  Scala interface for interacting with Kiji data
  Uses Scalding for designing complex dataflows
Model Lifecycle
  Allows analysts and data scientists to break apart a model into phases
Scoring Models in Real-Time
Batch isn’t real-time
[Figure: number of users versus number of interactions. A few users have many interactions; a lot of users have few interactions.]
Fresheners Compute Lazily
[Diagram components: Client, KijiScoring Server, HBase]
Client: read a column
Get from HBase
Freshness Policy: is the stored value fresh?
  Yes: return to client
  No: run the Scorer, return the result to the client, and write back for next time
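The read path above can be sketched in a few lines. This is a toy stand-in, not the KijiScoring API: the class `FreshenedTable`, the age-based freshness policy, and the in-memory dictionary standing in for HBase are all assumptions made for illustration.

```python
import time

class FreshenedTable:
    """Toy table whose reads are freshened lazily (hypothetical, not Kiji)."""

    def __init__(self, scorer, max_age_seconds, clock=time.time):
        self._cells = {}              # (row, column) -> (value, written_at)
        self._scorer = scorer         # computes a new value for a row
        self._max_age = max_age_seconds
        self._clock = clock

    def _is_fresh(self, cell):
        # Freshness policy (assumed): the cell exists and is recent enough.
        if cell is None:
            return False
        _value, written_at = cell
        return self._clock() - written_at <= self._max_age

    def read(self, row, column):
        cell = self._cells.get((row, column))        # get from the store
        if self._is_fresh(cell):
            return cell[0]                           # fresh: return to client
        value = self._scorer(row)                    # stale or missing: score now
        self._cells[(row, column)] = (value, self._clock())  # write back
        return value

# Usage with a fake clock so the behavior is deterministic.
calls = []

def scorer(row):
    calls.append(row)
    return f"recs-for-{row}-v{len(calls)}"

now = [0.0]
table = FreshenedTable(scorer, max_age_seconds=60, clock=lambda: now[0])
first = table.read("beaker", "recommendations")    # missing, so scored
second = table.read("beaker", "recommendations")   # fresh, served from the store
now[0] = 120.0
third = table.read("beaker", "recommendations")    # stale, so rescored
```

Work happens only for rows that are actually read, so users who never come back cost nothing to score.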
Kiji Application Stack
Deployment Challenges
Kiji Model Repository
Link between application and models
Stores Freshener metadata
  FreshnessPolicy, Scorer, attached column
  Location of trained model
Stores Scorer code
  Code repository makes model scoring code available to the application from a central location
New models can be deployed to the Model Repository and made immediately available to the application
Kiji Model Repository
Retail Recommendation
Types of Recommenders
Recommendation Algorithms
  Collaborative Filtering Methods
    Memory-Based
    Model-Based
  Content-Based Methods
Content-Based Recommenders
Orange-Nosed
Lab Assistant
Meeps a lot
Build models around entities using features that we think reflect inherent characteristics
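A minimal sketch of this idea, assuming hand-picked item features and a user profile built by averaging the features of liked items. The feature names, items, and values are hypothetical, and cosine similarity is just one reasonable way to compare a profile with an item.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical item features: [is_fruit_tool, is_sharp, is_lab_equipment]
item_features = {
    "banana slicer":    [1.0, 1.0, 0.0],
    "pineapple slicer": [1.0, 1.0, 0.0],
    "chef's knife":     [0.0, 1.0, 0.0],
    "test tube":        [0.0, 0.0, 1.0],
}

def user_profile(liked_items):
    """Average the feature vectors of the items a user liked."""
    vectors = [item_features[item] for item in liked_items]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked_items, n=2):
    """Rank items the user has not liked yet by similarity to the profile."""
    profile = user_profile(liked_items)
    scored = [(cosine(profile, features), item)
              for item, features in item_features.items()
              if item not in liked_items]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _score, item in scored[:n]]

recs = recommend(["banana slicer"])
```

No other user's behavior is consulted here. The quality of the result depends entirely on how well the chosen features describe the items.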
Content-Based Recommenders
safer
faster knife
Pandora: Content-Based
Expertly-Characterized Music
Collaborative Filtering
Represent user-item affinities as a sparse matrix
Beaker
Banana Slicer
Pineapple Slicer
Users ≈ Rows
Items ≈ Columns
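One way to hold such a matrix without storing the empty cells is a dictionary of dictionaries, indexed both by user (rows) and by item (columns). The class below is a sketch under that assumption. Its name, methods, and the sample ratings are hypothetical.

```python
from collections import defaultdict

class SparseRatings:
    """User-item affinities stored only where a rating exists."""

    def __init__(self):
        self.by_user = defaultdict(dict)   # row view: user -> {item: rating}
        self.by_item = defaultdict(dict)   # column view: item -> {user: rating}

    def add(self, user, item, rating):
        self.by_user[user][item] = rating
        self.by_item[item][user] = rating

    def get(self, user, item, default=None):
        # .get on the outer dict avoids creating empty rows on lookup.
        return self.by_user.get(user, {}).get(item, default)

    def density(self):
        """Fraction of user-item cells that actually hold a rating."""
        stored = sum(len(items) for items in self.by_user.values())
        total = len(self.by_user) * len(self.by_item)
        return stored / total if total else 0.0

ratings = SparseRatings()
ratings.add("beaker", "banana slicer", 5.0)
ratings.add("beaker", "pineapple slicer", 4.0)
ratings.add("bunsen", "banana slicer", 3.0)
```

Keeping both views costs twice the memory but makes "what did this user rate?" and "who rated this item?" equally cheap, and item-item methods need the second question often.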
Aspirational Ratings
I put in my queue… I actually watch
Collaborative Filtering
Simple aggregate predictors
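The slide does not say which aggregates are meant. One common reading, assumed here, is predicting a rating from an average: the global mean, or the mean rating of the item with a fallback to the global mean. The function names and ratings below are hypothetical.

```python
def global_mean(ratings):
    """Average of every stored rating."""
    values = [r for items in ratings.values() for r in items.values()]
    return sum(values) / len(values)

def item_mean(ratings, item):
    """Average rating of one item; falls back to the global mean if unrated."""
    values = [items[item] for items in ratings.values() if item in items]
    return sum(values) / len(values) if values else global_mean(ratings)

# Hypothetical sparse ratings: user -> {item: rating}
ratings = {
    "beaker": {"banana slicer": 5.0, "pineapple slicer": 4.0},
    "bunsen": {"banana slicer": 2.0},
}

banana_prediction = item_mean(ratings, "banana slicer")
unknown_prediction = item_mean(ratings, "apple corer")
```

Predictors like these are not personalized, but they are cheap and give a baseline that a collaborative filtering model should beat.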
Collaborative Filtering: How It Works
Similar Users
Similar Products
Similar Entities
What do we mean by similar?
  Jaccard Index: a measure of set similarity
  Cosine Similarity: the cosine of the angle between two vectors
  Pearson Correlation: statistical measure, similar to cosine
Naively, we could compare every entity to every other entity
…But that would not scale well with increasing numbers of entities
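The three measures named above can be written directly from their standard definitions. The sketch below uses plain Python lists and sets, and the function names are chosen for illustration.

```python
import math

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([p - mean_x for p in x], [q - mean_y for q in y])
```

Writing Pearson as cosine on mean-centered vectors makes the "similar to cosine" remark concrete. As for the scaling concern, comparing every pair of n entities takes n(n-1)/2 comparisons, so a million items would need roughly 5 × 10^11 of them.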
Building the Similarity Matrix
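The slide's diagram did not survive extraction, so the following is an assumption about one common way to avoid all-pairs comparison: iterate over users and only touch item pairs that co-occur in at least one user's history. Similarity here is cosine over binary "user interacted with item" vectors. The function name and data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def build_item_similarity(user_items):
    """Item-item cosine similarity from co-occurrence in user histories."""
    item_counts = defaultdict(int)   # number of users per item
    co_counts = defaultdict(int)     # number of users per co-occurring item pair
    for items in user_items.values():
        unique = sorted(set(items))
        for item in unique:
            item_counts[item] += 1
        for a, b in combinations(unique, 2):
            co_counts[(a, b)] += 1

    similarity = defaultdict(dict)
    for (a, b), together in co_counts.items():
        # Cosine of two binary vectors: |A ∩ B| / sqrt(|A| * |B|)
        score = together / math.sqrt(item_counts[a] * item_counts[b])
        similarity[a][b] = score
        similarity[b][a] = score
    return similarity

user_items = {
    "beaker": ["banana slicer", "pineapple slicer"],
    "bunsen": ["banana slicer", "pineapple slicer", "test tube"],
    "kermit": ["banana slicer"],
}
sim = build_item_similarity(user_items)
```

Pairs of items that no user has in common never appear in `co_counts`, so the resulting matrix stays sparse and the work grows with the number of co-occurrences rather than with the square of the catalog size.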
Collaborative Filtering: Is This Useful?
Problem: Too much data!
  Tracking user preferences and all their events generates huge amounts of data
Problem: Too little data!
  Dimensions of user-space and item-space are usually very large
  More variables make it more difficult to generate user preferences
Problem: Cold start
  If you don’t know anything about a user, what should you recommend?
Problem: More ratings means slower computations
  Identifying neighborhoods of entities is expensive
Collaborative Filtering: Why Is It Useful?
Because it works
Content-agnostic
  All that matters is co-occurrence of events
Amazon: Item-Item Collaborative Filtering
Used for personalized recommendations
Fill screen real estate with related items
Produces specific, but non-creepy recommendations
Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003
Item-Item Collaborative Filtering
Beaker buys a banana slicer
Then:
  Generate list of candidate items to predict ratings for
  Predict ratings for candidate items
  Select Top-N items
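The three steps above can be sketched as follows, assuming an item-item similarity table has already been computed. The similarity values, the second rated item, and the weighted-average predictor are assumptions for illustration; the slides do not specify them.

```python
import heapq

# Hypothetical precomputed item-item similarities (stored symmetrically).
similarity = {
    "banana slicer":    {"pineapple slicer": 0.8, "apple corer": 0.5,
                         "safety goggles": 0.1},
    "test tube":        {"safety goggles": 0.9, "apple corer": 0.1},
    "pineapple slicer": {"banana slicer": 0.8},
    "apple corer":      {"banana slicer": 0.5, "test tube": 0.1},
    "safety goggles":   {"banana slicer": 0.1, "test tube": 0.9},
}

def candidate_items(user_ratings):
    """Step 1: neighbors of the user's rated items, minus items already rated."""
    candidates = set()
    for item in user_ratings:
        candidates.update(similarity.get(item, {}))
    return candidates - set(user_ratings)

def predict(user_ratings, candidate):
    """Step 2: similarity-weighted average of the user's own ratings."""
    numerator = denominator = 0.0
    for item, rating in user_ratings.items():
        s = similarity.get(candidate, {}).get(item, 0.0)
        numerator += s * rating
        denominator += abs(s)
    return numerator / denominator if denominator else 0.0

def top_n(user_ratings, n):
    """Step 3: keep the n candidates with the highest predicted rating."""
    scored = [(predict(user_ratings, c), c) for c in candidate_items(user_ratings)]
    return [item for _score, item in heapq.nlargest(n, scored)]

# Beaker buys a banana slicer (rated 5) and earlier rated a test tube 2.
beaker = {"banana slicer": 5.0, "test tube": 2.0}
recs = top_n(beaker, n=2)
```

Only items that are neighbors of something Beaker has already rated are ever scored, so the cost depends on the size of his history and of the neighbor lists, not on the size of the catalog.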
Accessing External Data
KeyValueStore API enables external data access when applying a model
External data might be…
  Trained model parameters
  Hierarchical/Taxonomic data
  Geo-lookup
Store external data flexibly
  Text files, sequence files, Kiji tables, etc.
  Data access is decoupled from use during execution
If the data doesn’t fit in memory, put it in a table
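The decoupling described above can be illustrated with a small interface and two interchangeable backends. This is not the Kiji KeyValueStore API. The classes `KeyValueReader`, `InMemoryReader`, and `JsonLinesReader`, the file format, and the geo-lookup data are hypothetical stand-ins.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod

class KeyValueReader(ABC):
    """What model code sees: a lookup by key, nothing about storage."""

    @abstractmethod
    def get(self, key):
        """Return the value stored under key, or None if absent."""

class InMemoryReader(KeyValueReader):
    """Backend for data small enough to hold in memory."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key):
        return self._data.get(key)

class JsonLinesReader(KeyValueReader):
    """Backend for a text file with one {"key": ..., "value": ...} per line."""

    def __init__(self, path):
        self._path = path

    def get(self, key):
        # Scans the file on each lookup instead of loading it into memory.
        with open(self._path) as f:
            for line in f:
                record = json.loads(line)
                if record["key"] == key:
                    return record["value"]
        return None

def region_for(zip_code, geo_lookup):
    """Model-side code: uses the lookup without knowing where the data lives."""
    return geo_lookup.get(zip_code) or "unknown"

data = {"94107": "US-West", "10001": "US-East"}
path = os.path.join(tempfile.mkdtemp(), "geo.jsonl")
with open(path, "w") as f:
    for key, value in data.items():
        f.write(json.dumps({"key": key, "value": value}) + "\n")

from_memory = region_for("94107", InMemoryReader(data))
from_file = region_for("94107", JsonLinesReader(path))
```

Because `region_for` depends only on the interface, the backing store can be swapped (for example, to a table when the data outgrows memory) without touching the scoring code.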
How Much Less Work Can We Do?
We can choose a predictor that allows us to truncate a sum
There are two ways terms in the sum of our predictor can be small
  No rating: ignore unrated items
  Small similarity: ignore dissimilar items
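The slides do not show the predictor itself. A common choice in item-based collaborative filtering, assumed here for illustration, is a similarity-weighted average of the user's own ratings. The notation is introduced for this sketch: s_{ij} is the similarity between items i and j, r_{u,j} is user u's rating of item j, R_u is the set of items u has rated, and N_k(i) is the set of the k items most similar to i.

```latex
\hat{r}_{u,i}
  = \frac{\sum_{j \in R_u} s_{ij}\, r_{u,j}}{\sum_{j \in R_u} \lvert s_{ij} \rvert}
  \;\approx\;
  \frac{\sum_{j \in R_u \cap N_k(i)} s_{ij}\, r_{u,j}}
       {\sum_{j \in R_u \cap N_k(i)} \lvert s_{ij} \rvert}
```

Under this assumption both shortcuts are visible in the index sets. Unrated items contribute no term because the sum runs only over R_u, and restricting the sum to N_k(i) drops the terms whose similarity is small.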
How Much Less Work Can We Do?
If we only present a few recommendations, we don’t need to predict ratings for all items
Choose wisely the candidate set for which to estimate ratings, or infer it from nearest neighbors
Organizing Data in Item-Item CF
Accessing Data During Freshening
Want to Know More?
The Kiji Project
  kiji.org
  github.com/kijiproject
Questions about this presentation?
  Twitter: @JulietHougland or @nattyice
  Email: [email protected]