Real-time, streaming advanced analytics, approximations, and recommendations using Apache Spark...

Flux Capacitor AI: Bringing AI Back to the Future! advancedspark.com

Upload: hadoop-summit

Post on 07-Jan-2017


Category: Technology



TRANSCRIPT


Who Am I?

Streaming Data Engineer, Netflix OSS Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center
Meetup Organizer, Advanced Apache Meetup
Book Author, Advanced . (Due 2016)

Advanced Apache Spark Meetup
http://advancedspark.com

Meetup Metrics
  Top 10 Most-active Spark Meetup!
  3200+ Members in just 9 months!!
  3700+ Docker downloads (demos)
Meetup Mission
  Code deep-dive into Spark and related open source projects
  Surface key patterns and idioms
  Focus on distributed systems, scale, and performance

Live, Interactive Demo!

Audience Participation Required!!
Cell Phone Compatible!!!
demo.advancedspark.com

http://demo.advancedspark.com

(Demo architecture diagram. Labeled components: End User, Kafka, ElasticSearch, Spark Streaming, Spark ML, Cassandra, Redis, Data Scientist, Zeppelin, iPython)

Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Scaling with Parallelism

(Figure labels: Peter, O(log n), Worker Nodes)

Parallelism with Composability

Worker 1, Worker 2
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)

What about Division and Average? Collect at Driver
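A minimal sketch of the idea above, assuming two hypothetical workers that each hold half of the data. The helper name `combine_partitions` is illustrative and not part of Spark. Because the operations are associative, reducing each partition locally and then combining the partial results gives the same answer as a single pass.

```python
from functools import reduce


def combine_partitions(partitions, op):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions]  # one partial per worker
    return reduce(op, partials)                            # final combine step


values = [3, 5, 5, 7]
worker_1, worker_2 = values[:2], values[2:]  # hypothetical split across two workers

total = combine_partitions([worker_1, worker_2], lambda a, b: a + b)
largest = combine_partitions([worker_1, worker_2], max)
union = combine_partitions([[{3}, {5}], [{5}, {7}]], lambda a, b: a | b)
```

The same helper would give wrong answers for division, which is why non-composable operations have to be restructured or collected at the driver.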

What about Division?

Division: (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857

Not Composable
What were the Egyptians thinking?! “Divide like an Egyptian”

What about Average?

Overall AVG, as (value, count) pairs:
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5

Pairwise AVG:
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

Divide, Add, Divide? Not Composable
Single-Node Divide at the End? Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Add, Add, Add? Composable!
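The (value, count) trick can be sketched as follows. This is an illustrative example with a hypothetical two-worker split, not code from the talk: each worker only adds pairs, and the single division happens once at the end.

```python
from functools import reduce


def to_pair(value):
    """Wrap a value as (sum, count) so that averaging reduces to addition."""
    return (value, 1)


def add_pairs(a, b):
    """Associative combine step: add sums and add counts."""
    return (a[0] + b[0], a[1] + b[1])


values = [3, 5, 5, 7]
worker_1 = reduce(add_pairs, map(to_pair, values[:2]))  # partial (sum, count)
worker_2 = reduce(add_pairs, map(to_pair, values[2:]))  # partial (sum, count)

total, count = add_pairs(worker_1, worker_2)
average = total / count  # single divide at the end, on one node
```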

Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Similarities

Euclidean Similarity

Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude

Cosine Similarity

Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
Used with real-valued vectors (versus binary)
org.jblas.DoubleMatrix
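To make the magnitude bias concrete, here is a small pure-Python sketch (the talk itself points at org.jblas.DoubleMatrix; the vectors and function names below are made up for illustration). Two users with the same taste but different rating intensity are far apart by Euclidean distance yet identical by cosine similarity.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


light_rater = [1.0, 2.0, 1.0]
heavy_rater = [2.0, 4.0, 2.0]  # same direction, twice the magnitude

distance = euclidean_distance(light_rater, heavy_rater)
similarity = cosine_similarity(light_rater, heavy_rater)
```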

Jaccard Similarity

Set similarity measurement
Set intersection / set union
Bias towards popularity
Works with binary vectors
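A minimal sketch of Jaccard similarity over tag sets. The tags below are hypothetical and only serve to show the intersection-over-union arithmetic.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


top_gun_tags = {"action", "jets", "navy", "80s"}  # hypothetical tags
taps_tags = {"drama", "military", "80s"}          # hypothetical tags

similarity = jaccard_similarity(top_gun_tags, taps_tags)  # 1 shared tag out of 6 distinct
```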

Log Likelihood Similarity

Adjusts for popularity bias
Netflix “Shawshank” problem

Word Similarity

Edit Distance
  Misspellings and autocorrect
Word2Vec
  Similar words are defined by similar contexts in vector space
  (Figure labels: English, Spanish)
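Edit distance can be sketched with the classic Levenshtein dynamic program. This is one common definition (insert, delete, substitute, each at cost 1); the talk does not say which variant it has in mind.

```python
def edit_distance(source, target):
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete from source
                current[j - 1] + 1,      # insert into source
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


typo_distance = edit_distance("spakr", "spark")
```

An autocorrect feature could, for example, suggest dictionary words within a small edit distance of the typed word.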

Demo! Find Synonyms with Word2Vec

Find Synonyms using Word2Vec

Document Similarity

TF/IDF
  Term Frequency / Inverse Document Frequency
  Used by most search engines
Doc2Vec
  Similar documents are determined by similar contexts
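A rough sketch of TF/IDF weighting, assuming whitespace tokenization, length-normalized term frequency, and an unsmoothed log inverse document frequency. Libraries differ in the exact smoothing they apply, so treat the numbers as illustrative; the documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["top gun navy jets", "a few good men navy trial", "taps military school"]
weights = tf_idf(docs)
```

Terms that appear in many documents ("navy") get a lower weight than rarer terms ("jets"), which is what lets the weights act as a document fingerprint for similarity.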

Bonus! Text Rank Document Summary

Text Rank (aka Sentence Rank)
  Surface summary sentences
  TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
  TF/IDF + Similarity Graph
Most influential sentences
  PageRank

Similarity Pathways (Recommendations)

Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Heaviest Path

Demo! Similarity Pathway for Movie Recommendations

Load Movies with Tags into DataFrame
(Screenshot labels: My Choice, Their Choice)

Item-to-Item Tag Jaccard Similarity

Based on Tags
Calculate Jaccard Similarity (Tag Set Similarity)
Must be Above the Given Jaccard Similarity Threshold

Item-to-Item Tag Similarity Graph

Edge Weights == Jaccard Similarity (Based on Tag Sets)

Use Dijkstra to Find Heaviest Pathway
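One way to read "heaviest pathway" is the path whose product of edge similarities is largest. That reading is an assumption, as are the graph and its weights below. Under it, Dijkstra still applies if each similarity s in (0, 1] is turned into the non-negative cost -log(s), since minimizing the summed cost maximizes the product.

```python
import heapq
import math


def heaviest_path(graph, start, goal):
    """Return (path, score) maximizing the product of edge similarities."""
    best_cost = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost - math.log(similarity)  # similarity must be in (0, 1]
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if goal not in best_cost:
        return None, 0.0
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path, math.exp(-best_cost[goal])


# Hypothetical item-to-item similarity graph (edge weight = Jaccard similarity)
graph = {
    "Mad Max": {"The Road Warrior": 0.8, "Waterworld": 0.5},
    "The Road Warrior": {"Waterworld": 0.6, "Message in a Bottle": 0.1},
    "Waterworld": {"Message in a Bottle": 0.4},
}
path, score = heaviest_path(graph, "Mad Max", "Message in a Bottle")
```

The intermediate nodes on the returned path are the candidate recommendations that sit between the two people's choices.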

Calculating Exact Similarity

Brute-Force Similarity
  Cartesian Product
  O(n^2) shuffle and compute
  aka All-pairs, Pair-wise, Similarity Join

Calculating Approximate Similarity

Goal: Reduce Shuffle
Approximate Similarity
  Sampling
  Bucketing or Clustering
  Ignore low-similarity probability
Locality Sensitive Hashing
  Twitter Algebird MinHash
(Figure label: Bucket By Genre)

Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Recommendations

Basic Terminology

User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: user knows they are rating or liking, can choose to dislike
Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc)
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Loss Function: Function we’re trying to minimize such as least-squared error for Linear Regression
Cross Entropy: Loss function used for classification algorithms such as Logistic Regression
Optimizer: Technique to optimize loss function such as Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD)

Optimizes Loss Function
  Least Squared Error b/w predicted and actual value
  Cross Entropy Log Likelihood b/w predicted and actual probability
(Figure labels: 2-Dimensional, 3-Dimensional)

Features

Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription

Feature Engineering

Dimension Reduction
  Reduce number of features in feature space
Principal Component Analysis (PCA)
  Find principal features that best describe data variance
  Peel dimensional layers back
One-Hot Encoding
  Convert nominal categorical feature values into 0’s and 1’s
  Remove any numerical relationship between categories

Bears    -> 1         Bears    -> [1.0, 0.0, 0.0]
49’ers   -> 2   -->   49’ers   -> [0.0, 1.0, 0.0]
Steelers -> 3         Steelers -> [0.0, 0.0, 1.0]

Convert Each Item to Binary Vector with Single 1.0 Column
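The team example can be reproduced with a few lines of Python. This is a sketch only: the column order follows first appearance in the input, which is an arbitrary choice made here so that the output matches the vectors above.

```python
def one_hot_encode(values):
    """Map each categorical value to a vector with a single 1.0 column."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0.0] * len(categories)
        vector[index[value]] = 1.0
        vectors.append(vector)
    return categories, vectors


teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, vectors = one_hot_encode(teams)
```

Unlike the integer codes 1, 2, 3, the vectors are all equally far apart, so a model cannot infer that Steelers is "greater than" Bears.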

Feature Normalization & Standardization

Goal
  Scale features to standard size
  Prevent boundless features
  Helps avoid overfitting
  Required by many ML algos
Normalize Features
  Calculate L1 (or L2, etc) norm, then divide each element by it
Standardize Features
  Apply standard normal transformation (mean->0, stddev->1)
org.apache.spark.ml.feature.[Normalizer, StandardScaler]
http://www.mathsisfun.com/data/standard-normal-distribution.html
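The two transformations can be sketched in plain Python as follows. This is illustrative arithmetic rather than the Spark transformers themselves; it uses the population standard deviation, and a library may use the sample version instead.

```python
import math


def l2_normalize(vector):
    """Divide each element of one feature vector by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def standardize(column):
    """Shift and scale one feature column to mean 0 and stddev 1."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)  # population variance
    stddev = math.sqrt(variance)
    if stddev == 0.0:
        return [0.0] * len(column)
    return [(x - mean) / stddev for x in column]


unit_vector = l2_normalize([3.0, 4.0])
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, stddev 2
```

Normalization works per row (one instance), while standardization works per column (one feature across all instances).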

Non-Personalized Recommendations

Cold Start Problem

“Cold Start” problem
  New user, don’t know their preferences, must show something!
Movies with highest-rated actors
  Top K aggregations
Facebook social graph
  Friend-based recommendations
Most desirable singles
  PageRank of likes and dislikes

Demo! GraphFrame PageRank

Example: Dating Site “Like” Graph

PageRank of Top Influencers

Personalized Recommendations

Demo! Personalized PageRank

Personalized PageRank: Outbound Links

0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
(Figure label: 85% Among Outbound Network)

Personalized PageRank: No Outbound

0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
(Figure label: 85% Among No Outbound Network!!)
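The damping-factor behaviour on the last two slides can be sketched with a small power-iteration PageRank. This is not GraphFrames code; the "like" graph and the handling of nodes with no outbound links (their rank is redistributed via the teleport distribution) are assumptions made for illustration. Passing `source` turns it into a personalized PageRank that always teleports back to that node.

```python
def pagerank(links, damping=0.85, iterations=100, source=None):
    """Power-iteration PageRank over {node: [outbound nodes]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    if source is None:
        teleport = {node: 1.0 / len(nodes) for node in nodes}  # jump to a random node
    else:
        teleport = {node: 1.0 if node == source else 0.0 for node in nodes}  # jump to self
    ranks = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes with no outbound links has nowhere to go
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        new_ranks = {
            node: ((1.0 - damping) + damping * dangling) * teleport[node]
            for node in nodes
        }
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                # 85% of the time, follow one of the outbound links
                new_ranks[target] += damping * ranks[node] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical dating-site "like" graph: who likes whom
likes = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"], "dave": []}

global_ranks = pagerank(likes)
bob_ranks = pagerank(likes, source="bob")
```

In the personalized run, nodes that cannot be reached from the source keep a rank of zero, which is what makes the result specific to one user.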

User-to-User Clustering

User Similarity
  Time-based
    Pattern of viewing (binge or casual)
    Time of viewing (am or pm)
  Ratings-based
    Content ratings or number of views
    Average rating relative to others (critical or lenient)
  Search-based
    Search terms

Item-to-Item Clustering

Item Similarity
  Profile text (TF/IDF, Word2Vec, NLP)
  Categories, tags, interests (Jaccard Similarity, LSH)
  Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example…
  Cluster Similar Eigen-faces
  Cluster Similar Profiles
  Cluster Similar Categories

Bonus: NLP Conversation Starter Bot

“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis

Bonus: Top 100 Country Song Sentiment

Bonus: Surprising Results…?!

Item-to-Item Based Recommendations

Based on Metadata: Genre, Description, Cast, City

Demo! Item-to-Item-based Recommendations

One-Hot Encoding + K-Means Clustering

One-Hot Encode Tag Feature Vectors

Cluster Movie Tag Feature Vectors
  Hyperparameter Tuning (K Clusters?)

Analyze Movie Tag Clusters

User-to-Item Collaborative Filtering

Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
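The four steps can be sketched with a tiny matrix factorization trained by stochastic gradient descent. Note the substitution: the demos in this talk use ALS, while this sketch uses SGD because it fits in a few lines of plain Python. The ratings, the rank, and every hyperparameter value below are made up for illustration.

```python
import random


def factorize(ratings, rank=2, epochs=500, learning_rate=0.01,
              regularization=0.02, seed=42):
    """Learn rank-k user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})
    user_factors = {u: [rng.uniform(0.0, 0.1) for _ in range(rank)] for u in users}
    item_factors = {i: [rng.uniform(0.0, 0.1) for _ in range(rank)] for i in items}
    for _ in range(epochs):
        for user, item, rating in ratings:
            p, q = user_factors[user], item_factors[item]
            error = rating - sum(pk * qk for pk, qk in zip(p, q))
            for k in range(rank):
                pk, qk = p[k], q[k]
                # Gradient step on squared error with L2 regularization
                p[k] += learning_rate * (error * qk - regularization * pk)
                q[k] += learning_rate * (error * pk - regularization * qk)
    return user_factors, item_factors


def predict(user_factors, item_factors, user, item):
    """Dot product of latent factors; also defined for unseen (user, item) pairs."""
    return sum(pk * qk for pk, qk in zip(user_factors[user], item_factors[item]))


# Hypothetical explicit ratings; most user-item cells are missing
ratings = [
    ("alice", "Top Gun", 5.0), ("alice", "Taps", 4.0),
    ("bob", "Top Gun", 4.0), ("bob", "A Few Good Men", 5.0),
    ("carol", "Taps", 2.0), ("carol", "A Few Good Men", 1.0),
]
user_factors, item_factors = factorize(ratings)
missing_value = predict(user_factors, item_factors, "alice", "A Few Good Men")
```

The prediction for a pair that never appeared in the input is the "fill in the missing values" step; the factor vectors themselves are the latent features.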

Item-to-Item Collaborative Filtering

Famous Amazon Paper circa 2003
Problem
  As users grew, user-to-item collaborative filtering didn’t scale
Solution
  Item-to-item similarity, nearest neighbors
  Offline (Batch)
    Generate itemId->List[userId] vectors
  Online (Real-time)
    From cart, recommend nearest-neighbors in vector space

Demo! Collaborative Filtering-based Recommendations

Fitting the Matrix Factorization Model

Show ItemFactors Matrix from ALS

Show UserFactors Matrix from ALS

Generating Individual Recommendations

Generating Batch Recommendations

Clustering + Collaborative Filtering Recs

Cluster matrix output from Matrix Factorization
Latent features derived from user-item interaction
Item-to-Item Similarity
  Cluster item-factor matrix
User-to-User Similarity
  Cluster user-factor matrix

Demo! Clustering + Collaborative Filtering-based Recommendations

Show ItemFactors Matrix from ALS

Convert to Item Factors -> mllib.Vector
  Required by K-Means Clustering Algorithm

Fit and Evaluate K-Means Cluster Model
  Measures Closeness Of Points Within Clusters
  K = 5 Clusters

Netflix Genres and Clusters

Typical Genres
  Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

When to Approximate?

Memory or time constrained queries
  Relative vs. exact counts are OK (approx # errors after a release)
Using machine learning or graph algos
  Inherently probabilistic and approximate
Streaming aggregations
  Inherently sloppy collection (exactly once?)

Approximate as much as you can get away with!
Ask for forgiveness later !!

When NOT to Approximate?

If you’ve ever heard the term… “Sarbanes-Oxley” …at the office.

A Few Good Algorithms

You can’t handle the approximate!

Common to These Algos & Data Structs

Low, fixed size in memory
Store large amount of data
Known error bounds
Tunable tradeoff between size and error
Less memory than Java/Scala collections
Rely on multiple hash functions or operations
Size of hash range defines error

Bloom Filter

Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”

Bloom Filter

Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
  Never updated or removed

Bloom Filter in Action

set(key)
contains(key): Boolean
Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
Images by @avibryant
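A minimal Bloom filter sketch in Python. The class and method names, the bit-array size, and the use of seeded SHA-256 as the family of hash functions are all illustrative assumptions, not the Algebird implementation shown in the talk.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        """Hash the key several times; each hash picks one bit position."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Flip the bits wherever the hashes land. Bits are never cleared."""
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all(self.bits[position] for position in self._positions(key))


seen_users = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    seen_users.add(user)
```

Memory stays fixed at `num_bits` regardless of how many keys are added; the cost is a false-positive rate that grows as the bit array fills up.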

CountMin Sketch

Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”

CountMin Sketch (CMS)

Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter
(Figure labels: Matei Zaharia, Martin Odersky, Donald Trump)

CountMin Sketch In Action (TopK Count)

Use Case: TopK movies using total views
add(Top Gun, 2)  (x 2 occurrences of “Top Gun” for slightly additional complexity)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
  Find minimum of all rows
  Can overestimate, but never underestimate
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
(Figure: sketch table with cells labeled Top Gun (x 2), A Few Good Men, and Taps; shared cells are marked “Overlap Top Gun” and “Overlap A Few Good Men”)
Images derived from @avibryant
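The add and getCount operations above can be sketched as follows. The table dimensions and the seeded SHA-256 hashing are assumptions chosen for illustration; this is not the Algebird API.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=64):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        """Add the count wherever each row's hash lands."""
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        """Minimum across rows: collisions can inflate it, never deflate it."""
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")
```

Taking the minimum works because every row's counter for a key includes that key's true count plus whatever collided with it, so the least-collided row gives the tightest estimate.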

HyperLogLog

Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”

HyperLogLog (HLL)

Approximate count distinct
Slight twist
  Special hash function creates uniform distribution
  Hash subsets of data with single, special hash func
Error estimate
  14 bits for size of range
  m = 2^14 = 16,384 hash slots
  error = 1.04 / sqrt(16,384) = 0.81%
(Figure label: Not many of these)
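The error estimate is easy to check numerically. The memory figure below additionally assumes 6-bit registers, which is how the 12KB size quoted for Redis later in the talk comes about; the function names are illustrative.

```python
import math


def hll_standard_error(index_bits):
    """Relative standard error for an HLL with 2^index_bits registers."""
    registers = 2 ** index_bits
    return 1.04 / math.sqrt(registers)


def hll_register_bytes(index_bits, bits_per_register=6):
    """Memory for the registers alone, assuming 6 bits per register."""
    return (2 ** index_bits) * bits_per_register // 8


error = hll_standard_error(14)      # 1.04 / 128
memory = hll_register_bytes(14)     # 16,384 registers * 6 bits
```

Each extra index bit doubles the memory and shrinks the error by a factor of sqrt(2), which is the tunable size-versus-error tradeoff mentioned earlier.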

HyperLogLog In Action (Count Distinct)

Use Case: Number of distinct users who view a movie
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales
(Figure: hashed user IDs spread across a range.
  Top Gun: Hour 1, range 0 to 16: user1001, user2009, user3005, user3003, user3001, user7009
  Top Gun: Hour 2, range 0 to 32: user2001, user4009, user3002, user7002, user1005, user6001, User8001, User8002
  Top Gun: Hour 1 + 2, range 0 to 32: all of the users above)

Locality Sensitive Hashing

Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”

Locality Sensitive Hashing (LSH)

Approximate set similarity
Pre-process m rows into b buckets
  b << m; b = buckets, m = rows
Hash items multiple times
  ** Similar items hash to overlapping buckets
  ** Hash designed to cluster similar items
Compare just contents of buckets
  Much smaller cartesian compare
  ** Compare in parallel !!
Avoids huge cartesian all-pairs compare
Chapter 3: LSH
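One common way to realize this for set similarity is MinHash signatures with banding. The sketch below assumes that approach; the function names, the number of hashes and bands, and the tag sets are all illustrative, and each item set must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def candidate_pairs(item_sets, num_hashes=20, bands=10):
    """Bucket signatures band by band; only bucket-mates become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in item_sets.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical movie tag sets
movie_tags = {
    "Top Gun": {"action", "jets", "navy"},
    "Top Gun (Remastered)": {"action", "jets", "navy"},
    "Message in a Bottle": {"romance", "letters", "beach"},
}
pairs = candidate_pairs(movie_tags)
```

Only the candidate pairs need an exact Jaccard comparison afterwards, which is how the all-pairs cartesian compare is avoided. More bands with fewer rows each makes the filter more permissive; fewer bands makes it stricter.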

DIMSUM

Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”

DIMSUM

“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
  RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
  40% efficiency gain over brute-force Cosine Sim

Common Tools to Approximate

Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing

Twitter Algebird

Algebraic Fundamentals
  Parallel
  Associative
  Composable
Examples
  Min, Max, Avg
  BloomFilter (Set.contains(key))
  HyperLogLog (Count Distinct)
  CountMin Sketch (TopK Count)

Redis

Implementation of HyperLogLog (Count Distinct)
  12KB per item count
  2^64 max # of items
  0.81% error
Add user views for given movie
  PFADD TopGun_Hour1_HLL user1001 user2009 user3005
  PFADD TopGun_Hour1_HLL user3003 user1001
Get distinct count (cardinality) of set
  PFCOUNT TopGun_Hour1_HLL
  Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
  PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL
(Slide annotations: ignore duplicates; Tunable)

Approximations in Spark Libraries

Spark Core
  countByKeyApprox(timeout: Long, confidence: Double)
  PartialResult
Spark SQL
  approxCountDistinct(column: Column, targetResidual: Float)
  approxQuantile(column: Column, quantiles: Seq[Float], targetResidual: Float)
Spark ML
  Stratified sampling
    sampleByKey(fractions: Map[K, Double])
  DIMSUM sampling
    Probabilistic sampling reduces amount of shuffle
    RowMatrix.columnSimilarities(threshold: Double)

Demo! Exact Count vs. Approximate HLL and CMS Count

HashSet vs. HyperLogLog (Memory)

HashSet vs. CountMin Sketch (Memory)

Demo! Exact Similarity vs. Approximate LSH Similarity

Brute Force Cartesian All Pair Similarity: 47 seconds

Locality Sensitive Hash All Pair Similarity: 6 seconds

Many More Demos!

Download Docker or Clone on Github
http://advancedspark.com

Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Netflix Recommendations

From Ratings to Real-time

Netflix Has a Lot of Data

Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.

My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
Summary: Buy NFLX Stock!

Netflix Data Pipeline - Then

(Diagram labels: v1.0, v2.0)

Netflix Data Pipeline - Now (Keystone)

v3.0
9 million events per second
22 GB per second!!
EC2 D2XL
  Disk: 6 TB, 475 MB/s
  RAM: 30 G
  Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority

Netflix Recommendation Data Pipeline

Throw away batch user factors (U)
Keep batch video factors (V)

Netflix Trending Now (Time-based Recs)

Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
(Diagram labels: “VHS”, Number of Plays, Number of Impressions, Calculate Take Rate)

Bonus: Pandora Time-based Recs

Work Days
  Play familiar music
  User is less likely to accept new music
Evenings and Weekends
  Play new music
  More likely to accept new music

$1 Million Netflix Prize (2006-2009)

Goal
  Improve movie predictions by 10% (Root Mean Squared Error)
  Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
  (userId, movieId, rating, timestamp)
Winning algorithm(s)
  10.06% improvement (RMSE)
  Ensemble of 500+ ML models combined with GBDTs
  Computationally impractical
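For reference, RMSE and the relative improvement over a baseline can be computed as below. The ratings and predictions are invented for illustration and are not Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction; 0.10 would mean a 10% improvement."""
    return (baseline_rmse - new_rmse) / baseline_rmse


actual = [5.0, 3.0, 3.0]              # hypothetical held-out ratings
baseline_predictions = [4.0, 3.0, 5.0]
model_predictions = [4.5, 3.0, 4.0]

baseline = rmse(baseline_predictions, actual)
candidate = rmse(model_predictions, actual)
gain = improvement(baseline, candidate)
```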

Secrets to the Winning Algorithms

Adjust for the following human bias…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather

Netflix Common ML Algorithms

Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering

Ensembles!

Netflix Genres and Clusters

Typical Genres
  Documentaries, Romance Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Netflix Social Integration

Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem

Netflix Search

No results? No problem… Show similar results!
  Utilize extensive DVD Catalog
  Metadata search (ElasticSearch)
  Named entity recognition (NLP)
Empty searches are opportunity!
  Explicit feedback for future recommendations
  Content to buy and produce!

Netflix A/B Tests

Users tend to click on images featuring…
  Faces with strong emotional expressions
  Villains over heroes
  Small number of cast members

Netflix Recommendation Serving Layer

Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes latent!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States
  Closed: Service OK
  Open: Service DOWN
    Fallback to Static
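The two circuit states can be sketched with a deliberately simplified breaker. This is not the Hystrix API: the class, the failure threshold, and the function names are hypothetical, and a production breaker would typically also handle timeouts and periodically retry the primary call so the circuit can close again.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures of the primary call

    @property
    def is_open(self):
        """Open circuit: the primary service is treated as DOWN."""
        return self.failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()          # skip the failing service entirely
        try:
            result = primary()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0              # success keeps the circuit closed
        return result


attempts = []


def fetch_from_cache():
    """Stand-in for a cache lookup that is currently failing."""
    attempts.append("cache")
    raise RuntimeError("cache cluster unavailable")


def static_recommendations():
    """Stand-in for a static, non-personalized fallback list."""
    return ["Top Gun", "A Few Good Men", "Taps"]


breaker = CircuitBreaker(failure_threshold=3)
responses = [breaker.call(fetch_from_cache, static_recommendations) for _ in range(5)]
```

After the third failure the breaker stops calling the cache at all, so the last two requests return the static list without adding load to the struggling service.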

Why Higher Average Ratings 2004+?

2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③

Thank You, Everyone!!

Chris Fregly @cfregly
Research Scientist @ Flux Capacitor AI
San Francisco, California, USA
http://fluxcapacitor.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me on LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
