USF Seminar Series: Apache Spark, Machine Learning, Recommendations (Feb 05, 2016)
Post on 15-Apr-2017
Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc
Spark and Recommendations
Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP
USF Seminar Series Thanks, USF!!
Feb 5th, 2016
Chris Fregly Principal Data Solutions Engineer
We’re Hiring! (Only Nice People) advancedspark.com!
Who Am I?
Streaming Data Engineer Netflix OSS Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer IBM Technology Center
Meetup Organizer Advanced Apache Spark Meetup
Book Author Advanced .
Due 2016
Advanced Apache Spark Meetup
http://advancedspark.com
  Meetup Metrics
    Top 5 Most-active Spark Meetup!
    2400+ Members in just 6 mos!!
    2500+ Docker image downloads
  Meetup Mission
    Deep-dive into Spark and related open source projects
    Surface key patterns and idioms
    Focus on distributed systems, scale, and performance
Live, Interactive Demo!! Audience Participation Required
(cell phone or laptop)
demo.advancedspark.com End User ->
ElasticSearch ->
Spark ML ->
Data Scientist ->
<- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython
Presentation Outline
  Scaling with Parallelism and Composability
  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures
  Common Libraries and Tools
Scaling with Parallelism
Peter O(log n)
O(log n)
Scaling with Composability
Max (a max b max c max d) == (a max b) max (c max d)
Set Union (a U b U c U d) == (a U b) U (c U d)
Addition (a + b + c + d) == (a + b) + (c + d)
Multiply (a * b * c * d) == (a * b) * (c * d)
Division??
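The composability identities above can be checked with a small sketch. This is an illustration only, not code from the talk: `combine_in_partitions` is a hypothetical helper that splits a list into partitions, reduces each one independently (as parallel workers would), and then reduces the partial results.

```python
from functools import reduce
import operator

def combine_in_partitions(op, values, num_partitions=2):
    """Reduce each partition independently, then reduce the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

values = [3, 4, 7, 8]
sets = [{1}, {1, 2}, {3}, {2, 4}]

# Associative operations: the partitioned result matches the sequential one.
assert combine_in_partitions(max, values) == reduce(max, values)
assert combine_in_partitions(operator.add, values) == reduce(operator.add, values)
assert combine_in_partitions(operator.mul, values) == reduce(operator.mul, values)
assert combine_in_partitions(operator.or_, sets) == reduce(operator.or_, sets)
```

Because the grouping does not change the answer, each partition can be reduced on a different machine and only the small partial results need to be shuffled.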
What about Division?
  Division (a / b / c / d) != (a / b) / (c / d)
  (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
  (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
  0.0134 != 0.857
What were the Egyptians thinking?! Not Composable
“Divide like an Egyptian”
What about Average?
AVG (3, 5, 5, 7) == 5
Pairwise AVG
  (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
  Divide, Add, Divide? Not Composable
Overall AVG over [value, count] pairs
  AVG([3, 1], [5, 1], [5, 1], [7, 1]) == ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5
  Add, Add, Add? Composable!
  Single Divide at the End? Doesn’t need to be Composable!
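A minimal sketch of the composable average, assuming each value is carried as a (sum, count) pair. The helper names `to_pair`, `merge`, and `finish` are made up for this illustration.

```python
from functools import reduce

def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)

def merge(a, b):
    """Associative merge: add the sums and add the counts."""
    return (a[0] + b[0], a[1] + b[1])

def finish(pair):
    """The single divide, performed once at the very end."""
    total, count = pair
    return total / count

left = reduce(merge, map(to_pair, [3, 5]))    # partition 1 -> (8, 2)
right = reduce(merge, map(to_pair, [5, 7]))   # partition 2 -> (12, 2)
overall = finish(merge(left, right))          # (20, 4) -> 5.0
```

Only `merge` runs inside the distributed reduction, so only `merge` has to be associative; the division happens once on the final pair.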
Similarity
Euclidean Similarity
  Exists in Euclidean, flat space
  Based on Euclidean distance
  Linear measure
  Bias towards magnitude
Cosine Similarity
  Angular measure
  Adjusts for Euclidean magnitude bias
  Normalizes to unit vectors
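The magnitude bias can be seen in a short sketch. The two rating vectors below are hypothetical: they point in the same direction but differ tenfold in magnitude.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

light_viewer = [1.0, 2.0]     # hypothetical ratings vector
heavy_viewer = [10.0, 20.0]   # same taste, ten times the magnitude

print(euclidean_distance(light_viewer, heavy_viewer))  # large: magnitude dominates
print(cosine_similarity(light_viewer, heavy_viewer))   # close to 1.0: same direction
```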
Jaccard Similarity
  Set similarity measurement
  Set intersection / set union
  Based on Jaccard distance
  Bias towards popularity
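As a sketch of the set intersection / set union ratio, assuming each user is represented by the set of titles they watched (the two example users are invented):

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|; defined here as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

alice = {"Top Gun", "Taps", "A Few Good Men"}
bob = {"Top Gun", "A Few Good Men", "Inception"}
print(jaccard_similarity(alice, bob))  # 2 shared / 4 total = 0.5
```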
Log Likelihood Similarity
  Adjusts for popularity bias
  Netflix “Shawshank” problem
Word Similarity
  Edit Distance
    Calculate char differences between words
    Deletes, transposes, replaces, inserts
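One way to count deletes, transposes, replaces, and inserts is the optimal string alignment variant of edit distance. The sketch below picks that variant as an assumption; the slide does not say which edit distance was meant.

```python
def edit_distance(s, t):
    """Optimal string alignment distance: inserts, deletes, replaces,
    and transposes of adjacent characters each cost 1."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace (or match)
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]

print(edit_distance("spark", "sprak"))  # 1: a single adjacent transpose
```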
Document Similarity
  TF/IDF
    Term Freq / Inverse Document Freq
    Used by most search engines
  Word2Vec
    Words embedded in vector space near similar words
Similarity Pathway
  i.e. closest recommendations between 2 people
Calculating Similarity
  Exact Brute-Force
    “All-pairs similarity” aka “Pair-wise similarity”, “Similarity join”
    Cartesian O(n^2) shuffle and comparison
  Approximate
    Sampling
    Bucketing (aka “Partitioning”, “Clustering”)
    Remove data with low probability of similarity
    Reduce shuffle and comparisons
Bonus: Document Summary
  Text Rank aka “Sentence Rank”
    TF/IDF + Similarity Graph + PageRank
  Intuition
    Surface summary sentences (abstract)
    Most similar to all others (TF/IDF + Similarity Graph)
    Most influential sentences (PageRank)
Similarity Graph
  Vertex is movie, tag, actor, plot summary, etc.
  Edges are relationships and weights
Topic-Sensitive PageRank
  Graph diffusion algorithm
  Pre-process graph, add vector of probabilities to each vertex
  Probability of ending up at this vertex from every other vertex
Recommendations
Basic Terminology
  User: User seeking recommendations
  Item: Item being recommended
  Explicit User Feedback: like or rating
  Implicit User Feedback: search, click, hover, view, scroll
  Instances: Rows of user feedback/input data
  Overfitting: Training a model too closely to the training data & hyperparameters
  Hold Out Split: Holding out some of the instances to avoid overfitting
  Features: Columns of instance rows (of feedback/input data)
  Cold Start Problem: Not enough data to personalize (new)
  Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
  Model Evaluation: Compare predictions to actual values of hold out split
  Feature Engineering: Modify, reduce, combine features
Feature Engineering
  Dimension Reduction
    Reduce number of features (aka “feature space”)
  Principal Component Analysis (PCA)
    Find principal features that describe the data in terms of variance
    Peel the dimensional layers back until you describe the data
  Example: One-Hot Encoding
    Convert categorical feature values to 0’s, 1’s
    Remove any hint of a relationship between the categories
    1 binary column per category
      Bears    -> 1         Bears    -> [1,0,0]
      49’ers   -> 2   -->   49’ers   -> [0,1,0]
      Steelers -> 3         Steelers -> [0,0,1]
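The one-hot example can be sketched in a few lines. `one_hot_encoder` is a hypothetical helper written for this illustration, not a library call.

```python
def one_hot_encoder(categories):
    """Return a function mapping a category to a 0/1 vector, one column per category."""
    index = {category: i for i, category in enumerate(categories)}

    def encode(value):
        vector = [0] * len(index)
        vector[index[value]] = 1
        return vector

    return encode

encode = one_hot_encoder(["Bears", "49ers", "Steelers"])
print(encode("49ers"))  # [0, 1, 0]
```

Unlike the integer codes 1, 2, 3, the vectors carry no ordering, so a model cannot infer that Steelers is somehow "more" than Bears.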
Features
Binary Features: True or False
Numeric Discrete Features: Integers
Numeric Features: Real values
Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)
Temporal Features: Time-based (Time of Day, Binge Watching)
Categorical Features: Finite, unique set of categories (NFL teams)
Non-Personalized Recommendations
Cold Start Problem
  New user, don’t know their pref, must show them something!
  Movies with highest-rated actors: Top K Aggregations
  Most desirable singles: PageRank of like activity
  Facebook social graph: Recommend friend activity
Personalized Recommendations
Clustering (aka Nearest Neighbors)
  User-to-User Clustering
    Similar movies watched or rated
    Similar viewing pattern (i.e. binge or casual)
  Item-to-Item Clustering
    Similar tags/genres on movies
    Similar textual description (TF/IDF, Word2Vec, NLP, Image)
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
My OKCupid Profile! My Hinge Profile!
User-to-Item Collaborative Filtering
  Matrix Factorization
    ① Factor the large matrix (left) into 2 smaller matrices (right)
    ② Fill in the missing values in the large matrix
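The two steps can be sketched with plain stochastic gradient descent on a tiny, made-up ratings dictionary. This is an assumption-laden illustration: the function names, rank, learning rate, and regularization are arbitrary choices, and production systems typically use other solvers (Spark MLlib, for example, ships an ALS implementation).

```python
import random

def predict(U, V, u, i):
    """Predicted rating: dot product of a user factor row and an item factor row."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def factorize(ratings, num_users, num_items, rank=2, epochs=2000,
              lr=0.01, reg=0.02, seed=42):
    """ratings: dict {(user, item): rating} holding only the observed cells."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(num_users)]
    V = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(num_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - predict(U, V, u, i)
            for k in range(rank):
                u_k, v_k = U[u][k], V[i][k]
                U[u][k] += lr * (err * v_k - reg * u_k)
                V[i][k] += lr * (err * u_k - reg * v_k)
    return U, V

def rmse(ratings, U, V):
    squared = sum((r - predict(U, V, u, i)) ** 2 for (u, i), r in ratings.items())
    return (squared / len(ratings)) ** 0.5

# Step 1: factor the sparse 3 x 3 matrix into U (3 x 2) and V (3 x 2).
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
U, V = factorize(ratings, num_users=3, num_items=3)
# Step 2: fill in a missing cell by multiplying the factors back together.
print(predict(U, V, 0, 2))
```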
Item-to-Item Collaborative Filtering
  Made famous by Amazon Paper ~2003
  Problem
    As # of users grew, Matrix Factorization couldn’t scale
  Solution
    Offline/Batch: Generate itemId -> List[customerId] vectors
    Online/Real-time: For each item in cart, recommend similar items from vector space
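A rough sketch of the offline/online split, assuming each item is stored as the set of customer ids who bought it and that cosine similarity over those binary vectors is the similarity measure. The data and function names are invented for illustration.

```python
import math

def item_similarity(customers_a, customers_b):
    """Cosine similarity between two binary item vectors stored as customer-id sets."""
    if not customers_a or not customers_b:
        return 0.0
    shared = len(customers_a & customers_b)
    return shared / math.sqrt(len(customers_a) * len(customers_b))

def build_similar_items(item_customers, top_n=2):
    """Offline/batch step: precompute the most similar items for every item."""
    table = {}
    for item, customers in item_customers.items():
        scored = [(item_similarity(customers, others), other)
                  for other, others in item_customers.items() if other != item]
        scored = [(score, other) for score, other in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[item] = [other for _, other in scored[:top_n]]
    return table

def recommend(cart, similar_items):
    """Online step: look up precomputed neighbors for each item in the cart."""
    seen, recs = set(cart), []
    for item in cart:
        for other in similar_items.get(item, []):
            if other not in seen:
                seen.add(other)
                recs.append(other)
    return recs

item_customers = {
    "Top Gun": {1, 2, 3},
    "Taps": {2, 3},
    "A Few Good Men": {1, 2, 3, 4},
    "Inception": {5},
}
similar_items = build_similar_items(item_customers)
print(recommend(["Taps"], similar_items))
```

The expensive all-pairs comparison happens offline; the online path is only dictionary lookups, which is what lets it keep up with a live shopping cart.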
When to Approximate?
  Memory or time constrained queries
    Relative vs. exact counts are OK (# errors between then and now)
  Using machine learning or graph algos
    Inherently probabilistic and approximate
    Finding topics in documents (LDA)
    Finding similar pairs of users, items, words at scale (LSH)
    Finding top influencers (PageRank)
  Streaming aggregations (distinct count or top k)
    Inherently sloppy means of collecting (at least once delivery)
Approximate as much as you can get away with! Ask for forgiveness later !!
When NOT to Approximate? If you’ve ever heard the term…
“Sarbanes-Oxley”
…in-that-order, at the office, after 2002.
A Few Good Algorithms
You can’t handle the approximate!
Common to These Algos & Data Structs
  Low, fixed size in memory
  Known error bounds
  Store large amount of data
  Less memory than Java/Scala collections
  Tunable tradeoff between size and error
  Rely on multiple hash functions or operations
  Size of hash range defines error
Bloom Filter Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
  Approximate set membership for key
  False positive: expect contains(), actual !contains()
  True negative: expect !contains(), actual !contains()
  Elements only added, never removed
Bloom Filter in Action
set(key) contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain
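"Hash multiple times and flip the bits wherever you land" can be sketched as follows. The sizes are arbitrary, and deriving the hash functions by salting SHA-256 with an index is an assumption made for this sketch, not how any particular library does it.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several hash functions by salting one hash with the function index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def contains(self, key):
        # True -> maybe contains; False -> definitely does not contain.
        return all(self.bits[pos] for pos in self._positions(key))

seen = BloomFilter()
seen.add("user1001")
print(seen.contains("user1001"))  # True: added keys are always reported
print(seen.contains("user9999"))  # usually False; True would be a false positive
```

Bits are only ever switched on, which is why elements can be added but never removed without risking false negatives for other keys.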
CountMin Sketch Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
  Approximate frequency count and TopK for key
  i.e. “Heavy Hitters” on Twitter
Johnny Hallyday Martin Odersky Donald Trump
CountMin Sketch In Action
[Figure: CountMin Sketch counter table. Images derived from @avibryant]
  Use Case: TopK movies using total views
  Multiple hash functions (1 hash function per row)
  Binary hash output (1 element per column)
  add(Top Gun, 2)  (x 2 occurrences of “Top Gun” for slightly additional complexity)
  add(A Few Good Men, 1)
  add(Taps, 1)
  getCount(Top Gun): Long
    Find minimum of all rows
    Can overestimate, but never underestimate
  Overlap Top Gun
  Overlap A Few Good Men
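The add and getCount operations from the figure can be sketched like this. Table dimensions are arbitrary and the salted SHA-256 hashing is an assumption for illustration.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest available estimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))  # at least 2; higher only if every row collides
```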
HyperLogLog Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
  Approximate count distinct
  Slight twist
    Special hash function creates uniform distribution
  Error estimate
    14 bits for size of range
    m = 2^14 = 16,384 slots
    error = 1.04/(sqrt(16,384)) = 0.81%
Not many of these
HyperLogLog In Action Use Case: Distinct number of views per movie
[Figure: hashed user IDs (user1001, user2009, user3005, ...) spread over two hash ranges, 0 to 16 and 0 to 32, for “Top Gun: Hour 1” and “Top Gun: Hour 2”]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Composable: Hour 1 + 2 (lose a bit of precision)
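A compact HyperLogLog sketch, under stated assumptions: SHA-256 truncated to 64 bits stands in for the "special hash function", the default precision of 10 bits (1,024 registers) is arbitrary, and only the small-range correction is implemented. It is meant to show the register update, the merge, and the 1.04/sqrt(m) error formula, not to match Redis or Spark internals.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                    # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches with the same precision: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:        # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)

def standard_error(precision):
    return 1.04 / math.sqrt(1 << precision)

hour1, hour2 = HyperLogLog(), HyperLogLog()
for user in ["user3001", "user7009", "user1001"]:
    hour1.add(user)
for user in ["user2001", "user4009", "user1001"]:   # user1001 appears in both hours
    hour2.add(user)
both_hours = hour1.merge(hour2)
print(both_hours.count())      # approximate distinct users across both hours
print(standard_error(14))      # 1.04 / sqrt(2^14), roughly 0.0081
```

In this sketch, merging two sketches of equal precision yields the same registers as feeding both streams into a single sketch.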
Locality Sensitive Hashing Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
  Approximate set similarity
  Hash designed to cluster similar items
  Avoids cartesian all-pairs comparison
  Pre-process m rows into b buckets
    b << m
  Hash items multiple times
  Similar items hash to overlapping buckets
  Compare just contents of buckets
    Much smaller cartesian … and parallel !!
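One common way to build such buckets for sets is MinHash signatures with banding. The sketch below assumes that technique (the slide does not name a specific hash family); the number of hashes and bands are arbitrary, and input sets must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=20):
    """One minimum per salted hash function; similar sets share many minima."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def candidate_pairs(sets_by_id, num_hashes=20, bands=10):
    """Bucket signatures band by band; only ids sharing a bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for set_id, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(set_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

watched = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "bob": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Inception", "Interstellar"},
}
print(candidate_pairs(watched))  # alice and bob share every bucket; carol shares none
```

Only the candidate pairs go on to an exact comparison such as Jaccard similarity, which replaces the full cartesian join with many small, independent ones.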
DIMSUM Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
DIMSUM
  “Dimension Independent Matrix Square Using MR”
  Remove vectors with low probability of similarity
  RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
  40% efficiency gain over brute-force cosine sim
Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
Twitter Algebird
  Rooted in Algebraic Fundamentals!
    Parallel
    Associative
    Composable
  Examples
    Min, Max, Avg
    BloomFilter (Set.contains(key))
    HyperLogLog (Count Distinct)
    CountMin Sketch (TopK Count)
Redis
  Implementation of HyperLogLog (Count Distinct)
    12KB per item count
    2^64 max # of items
    0.81% error (Tunable)
  Add user views for given movie (ignore duplicates)
    PFADD TopGun_HLL user1001 user2009 user3005
    PFADD TopGun_HLL user3003 user1001
  Get distinct count (cardinality) of set
    PFCOUNT TopGun_HLL
    Returns: 4 (distinct users viewed this movie)
  Union 2 HyperLogLog Data Structures
    PFMERGE TopGun_HLL Taps_HLL
Spark Approximations
  Spark Core
    RDD.count*Approx()
    PartialResult
  Spark SQL
    HyperLogLogPlus
    approxCountDistinct(column)
  Spark ML
    Stratified sampling
      PairRDD.sampleByKey(fractions: Double[ ])
    DIMSUM sampling
      Probabilistic sampling reduces amount of comparison shuffle
      RowMatrix.columnSimilarities(threshold)
  Spark Streaming
    A/B testing
      StreamingTest.setTestMethod("welch").registerStream(dstream)
Demos!
Counting Exact Count vs. Approx HyperLogLog, CountMin Sketch
HashSet vs. HyperLogLog
HashSet vs. CountMin Sketch
Set Similarity Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing
Brute Force Cartesian All Pair Similarity
90 mins!
All Pairs & Locality Sensitive Hashing
<< 90 mins!
Many More Demos Available! http://advancedspark.com
Download Docker or Clone Github
Bonus: Netflix Recommendations From Offline DVD Ratings to Real-time Trending Now
$1 Million Netflix Prize (2006-2009)
  Goal
    Improve movie predictions by 10% (RMSE)
  Dataset
    (userId, movieId, rating, timestamp)
    Test data withheld to calculate RMSE upon submission
  Winning algorithm
    10.06% improvement (RMSE)
    Ensemble of 500+ ML models
    Combined using GBDTs
    Computationally impractical
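For reference, RMSE and a relative improvement over a baseline can be computed as below. The numbers in the usage lines are made up for illustration and are not Netflix Prize results.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and held-out ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def improvement(baseline_rmse, new_rmse):
    """Relative RMSE improvement over a baseline, as a fraction."""
    return (baseline_rmse - new_rmse) / baseline_rmse

held_out = [4, 2, 5, 3]               # hypothetical withheld ratings
baseline = [3, 3, 3, 3]               # hypothetical baseline predictions
candidate = [4, 3, 4, 3]              # hypothetical improved predictions
print(improvement(rmse(baseline, held_out), rmse(candidate, held_out)))
```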
Secret to the Winning Algorithms
  Adjust for the following…
  Human bias
    “Alice effect”: Alice tends to rate lower than average user
    “Inception effect”: Inception is rated higher than average
    “Alice-Inception effect”: Combo of Alice and Inception
  Time-based bias
    Number of days since a user’s first rating
    Number of days since a movie’s first rating
    Number of people who have rated a movie
    A movie’s overall mean rating
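The "Alice effect" and "Inception effect" can be approximated with a simple baseline predictor: global mean plus a per-user offset plus a per-movie offset. This is a simplified sketch with invented ratings, not the prize-winning method, and it ignores the time-based effects.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, rating).
    Returns the global mean plus per-user and per-movie offsets from it."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mu)
        by_movie[movie].append(r - mu)
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}
    return mu, user_bias, movie_bias

def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a zero offset.
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

ratings = [("alice", "Inception", 4), ("alice", "Taps", 2),
           ("bob", "Inception", 5), ("bob", "Taps", 5)]
mu, user_bias, movie_bias = fit_baseline(ratings)
print(user_bias["alice"])                                        # -1.0: rates below average
print(movie_bias["Inception"])                                   # 0.5: rated above average
print(predict(mu, user_bias, movie_bias, "alice", "Inception"))  # 3.5
```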
Current Netflix Recommendations
Throw away offline-generated user factors (U)
Netflix Common ML Algorithms
  Logistic Regression
  Linear Regression
  Gradient Boosted Decision Trees
  Random Forest
  Matrix Factorization
  SVD
  Restricted Boltzmann Machines
  Deep Neural Nets
  Markov Models
  LDA
  Clustering
  …
Bonus: Netflix Search
  No results? No problem… Show similar results!
  Used as implicit feedback for future decision making
Netflix and Data
  Netflix has a lot of data about a lot of users and a lot of movies.
  Netflix can use this data to buy new movies.
  Netflix is global.
  Netflix can use this data to choose original programming.
  Netflix knows that a lot of people like Politics and Kevin Spacey.
My favorite movie, “Harold and Kumar Go to White Castle”
The UK doesn’t have any White Castles. So they renamed my favourite movie, “Harold and Kumar Get the Munchies”
(This broke all of my unit tests.)
Summary: Buy NFLX Stock!
Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center http://spark.tc
San Francisco, California, USA
http://advancedspark.com
  Sign up for the Meetup and Book
  Contribute to Github Repo
  Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/