paris data-geeks-2013-03-28

Practical Machine Learning with Mahout

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

A quick BOF talk I gave in Paris at Devoxx

TRANSCRIPT

Page 1: Paris data-geeks-2013-03-28

Practical Machine Learning

with Mahout

Page 2: Paris data-geeks-2013-03-28

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill

(we’re hiring)

• Contact me: [email protected] or @ted_dunning

Page 3: Paris data-geeks-2013-03-28

Agenda

• What works at scale
• Recommendation
• Unsupervised - Clustering

Page 6: Paris data-geeks-2013-03-28

What Works at Scale

• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more complex than these

• These are harder than they look

Page 7: Paris data-geeks-2013-03-28

Recommendations

Page 8: Paris data-geeks-2013-03-28

Recommendations

• Special case of reflected intelligence
• Traditionally “people who bought x also bought y”

• But soooo much more is possible

Page 9: Paris data-geeks-2013-03-28

Examples

• Customers buying books (Linden et al.)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s

Page 10: Paris data-geeks-2013-03-28

Dyadic Structure

• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions

• Predict missing observations
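
The matrix view above can be sketched in a few lines. This is a minimal illustration, assuming the log is a sequence of (actor, item) pairs; the function names `interaction_matrix` and `to_dense` are made up for this example and are not Mahout APIs.

```python
from collections import Counter

def interaction_matrix(log):
    """Sparse actor x item matrix: maps (actor, item) to the interaction count.
    Pairs that never occur are implicit zeros (the "missing observations")."""
    return Counter((actor, item) for actor, item in log)

def to_dense(counts):
    """Expand the sparse counts into rows indexed by actor, columns by item."""
    actors = sorted({actor for actor, _ in counts})
    items = sorted({item for _, item in counts})
    rows = [[counts.get((actor, item), 0) for item in items] for actor in actors]
    return actors, items, rows
```

At scale the dense form is impractical; only the sparse counts would normally be kept.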

Page 11: Paris data-geeks-2013-03-28

Recommendations Analysis

• R(x,y) = number of people who bought x and also bought y

select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
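
One way to try the cooccurrence count is against an in-memory SQLite table. This is a sketch under assumptions: the table layout `log(user_id, item_id)` follows the slide, the query joins the two de-duplicated copies of the log on `user_id`, and the helper name `cooccurrence` is hypothetical. The diagonal entries (x = y) are kept and simply count the buyers of x.

```python
import sqlite3

COOCCURRENCE_SQL = """
select A.item_id as x, B.item_id as y, count(*) as n
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
order by x, y
"""

def cooccurrence(rows):
    """rows: iterable of (user_id, item_id). Returns a list of (x, y, count)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table log (user_id text, item_id text)")
        conn.executemany("insert into log values (?, ?)", rows)
        return conn.execute(COOCCURRENCE_SQL).fetchall()
    finally:
        conn.close()
```

The self-join is quadratic in the number of items per user, which is one reason counting is harder than it looks at scale.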

Page 17: Paris data-geeks-2013-03-28

Recommendations Analysis

Page 18: Paris data-geeks-2013-03-28

Fundamental Algorithmic Structure

• Cooccurrence

• Matrix approximation by factoring

• LLR
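
The slide does not expand "LLR". Assuming it means the log-likelihood ratio test applied to a 2x2 cooccurrence table, a minimal sketch looks like the following. The count names are labels chosen for this example: `k11` is the number of users with both items, `k12` and `k21` the users with only one of them, `k22` the users with neither.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table.
    Near zero when the two items occur independently, large when they are associated."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

Scoring each cooccurrence count this way, and keeping only the high-scoring pairs, is one plausible way to turn raw counts into a sparse indicator matrix.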

Page 19: Paris data-geeks-2013-03-28

But Wait!

• Cooccurrence

• Cross occurrence

Page 20: Paris data-geeks-2013-03-28

For example

• Users enter queries (A)
– (actor = user, item = query)
• Users view videos (B)
– (actor = user, item = video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”

Page 21: Paris data-geeks-2013-03-28

The punch-line

• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
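
A toy version of the three products, using plain nested lists so it runs without extra libraries. The data is invented for illustration: three users, two queries, three videos. `videos_for_query` is a hypothetical helper that reads one column of B’A.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

# Rows are users in both matrices (toy data, not from the talk).
A = [[1, 0], [1, 0], [0, 1]]            # users x queries
B = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]   # users x videos

AtA = matmul(transpose(A), A)  # queries x queries: query recommendation
BtB = matmul(transpose(B), B)  # videos x videos: video recommendation
BtA = matmul(transpose(B), A)  # videos x queries: videos in response to a query

def videos_for_query(query, top=2):
    """Rank videos by how often users who issued `query` also viewed them."""
    scores = [(row[query], video) for video, row in enumerate(BtA)]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [video for score, video in ranked if score > 0][:top]
```

Nothing here inspects video titles or descriptions; the ranking comes entirely from user behavior.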

Page 22: Paris data-geeks-2013-03-28

Real-life example

• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Page 23: Paris data-geeks-2013-03-28

Real-life example

Page 24: Paris data-geeks-2013-03-28

Hypothetical Example

• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping

• After several users click, results are whatever users think they should be

Page 25: Paris data-geeks-2013-03-28

Super-fast k-means Clustering

Page 26: Paris data-geeks-2013-03-28

RATIONALE

Page 27: Paris data-geeks-2013-03-28

What is Quality?

• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue

Page 28: Paris data-geeks-2013-03-28

An Example

Page 29: Paris data-geeks-2013-03-28

An Example

Page 30: Paris data-geeks-2013-03-28

Diagonalized Cluster Proximity

Page 31: Paris data-geeks-2013-03-28

Clusters as Distribution Surrogate

Page 32: Paris data-geeks-2013-03-28

Clusters as Distribution Surrogate

Page 33: Paris data-geeks-2013-03-28

THEORY

Page 34: Paris data-geeks-2013-03-28

For Example

Grouping these two clusters seriously hurts squared distance

Page 35: Paris data-geeks-2013-03-28

ALGORITHMS

Page 36: Paris data-geeks-2013-03-28

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm

Result is that these two clusters get glued together

Page 37: Paris data-geeks-2013-03-28

Ball k-means

• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
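
The pseudocode above can be sketched as follows. This is not the Mahout implementation; several choices are assumptions made for the example: seeding uses k-means++-style sampling as one reading of "distance maximizing tendency", "much closer" is taken to mean within `trim` times the distance from a centroid to its closest other centroid, and the default `trim=0.33` is arbitrary. It needs at least two centroids.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def seed_centroids(points, k, rng):
    """Pick k seeds, favoring points far from the seeds chosen so far (assumed strategy)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, centroids, iterations=3, trim=0.33):
    """A few Lloyd-style iterations that ignore points outside each cluster's core ball."""
    centroids = list(centroids)
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            d = [dist(p, c) for c in centroids]
            nearest = min(range(len(centroids)), key=d.__getitem__)
            # Distance from the assigned centroid to its closest other centroid.
            gap = min(dist(centroids[nearest], c)
                      for j, c in enumerate(centroids) if j != nearest)
            if d[nearest] <= trim * gap:
                members[nearest].append(p)
        # Keep the old centroid if no point fell inside its ball.
        centroids = [mean(m) if m else c for m, c in zip(members, centroids)]
    return centroids
```

A typical call would be `ball_kmeans(points, seed_centroids(points, k, random.Random(0)))`. Because distant points are excluded from the update, a stray outlier between two clusters does not drag either centroid toward it.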

Page 39: Paris data-geeks-2013-03-28

Still Not a Win

• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

• But for big data, k gets large

Page 40: Paris data-geeks-2013-03-28

Surrogate Method

• Start with sloppy clustering into lots of clusters
– κ = k log n clusters

• Use this sketch as a weighted surrogate for the data

• Results are provably good for highly clusterable data

Page 42: Paris data-geeks-2013-03-28

Algorithm Costs

• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
    O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
    O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Page 43: Paris data-geeks-2013-03-28

Algorithm Costs

• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 520,000
– d (log k + log log n) = 10(11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal

Page 45: Paris data-geeks-2013-03-28

How It Works

• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new cluster, where u is a uniform random draw from [0, 1)
– Else add to nearest centroid
• If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
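
A single-pass sketch of these steps, under stated assumptions rather than as the Mahout code: a new cluster is started with probability proportional to distance (a uniform draw falling below distance/threshold), the nearest centroid is found by exact search instead of an approximate one, and the threshold is multiplied by an arbitrary `growth` factor each time the sketch is collapsed. All names are illustrative. `threshold` must be positive.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _absorb(centroids, point, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid, or start a new centroid."""
    if centroids:
        i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j][0]))
        d = dist(point, centroids[i][0])
        if d <= threshold and rng.random() >= d / threshold:
            c, cw = centroids[i]
            total = cw + weight
            merged = [(a * cw + b * weight) / total for a, b in zip(c, point)]
            centroids[i] = (merged, total)
            return
    centroids.append((list(point), weight))

def streaming_sketch(points, threshold, max_centroids, seed=0, growth=1.5):
    """Return a list of (centroid, weight) pairs summarizing the points."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _absorb(centroids, p, 1.0, threshold, rng)
        # Too many centroids: re-cluster the centroids themselves with a looser threshold.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for c, w in old:
                _absorb(centroids, c, w, threshold, rng)
    return centroids
```

The weights always add up to the number of points seen, so the result can be handed to a weighted in-memory clustering step as a surrogate for the full data.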

Page 46: Paris data-geeks-2013-03-28

IMPLEMENTATION

Page 47: Paris data-geeks-2013-03-28

But Wait, …

• Finding nearest centroid is inner loop

• This could take O( d κ ) per point and κ can be big

• Happily, approximate nearest centroid works fine

Page 48: Paris data-geeks-2013-03-28

Projection Search

total ordering!
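
The slide gives only the title and the remark about total ordering. One plausible reading, sketched here as an assumption: projecting every centroid onto a random direction yields a sortable scalar, so candidates near a query can be found by binary search, and a handful of projections plus an exact distance check on the candidates gives an approximate nearest neighbor. The class name, the number of projections and the window size are all invented for this example.

```python
import bisect
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ProjectionSearch:
    def __init__(self, vectors, n_projections=3, window=4, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.window = window
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(n_projections)]
        # One sorted table of (projection, index) per direction: a total ordering.
        self.tables = [sorted((dot(u, v), i) for i, v in enumerate(vectors))
                       for u in self.directions]

    def nearest(self, query):
        """Index of the approximately nearest vector to `query`."""
        candidates = set()
        for u, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(u, query),))
            lo = max(0, pos - self.window)
            hi = min(len(table), pos + self.window)
            candidates.update(i for _, i in table[lo:hi])
        # Exact distance only on the few candidates near the query in some projection.
        return min(candidates, key=lambda i: dist(query, self.vectors[i]))
```

The answer can be wrong when the true nearest vector is not close to the query along any of the sampled directions; a larger window or more projections trades speed for accuracy.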

Page 49: Paris data-geeks-2013-03-28

LSH Bit-match Versus Cosine

Page 50: Paris data-geeks-2013-03-28

RESULTS

Page 51: Paris data-geeks-2013-03-28

Parallel Speedup?

Page 52: Paris data-geeks-2013-03-28

Quality

• Ball k-means implementation appears significantly better than simple k-means

• Streaming k-means + ball k-means appears to be about as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance to nearest cluster
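
The figure of merit is simple to compute. A minimal sketch, assuming held-out points and centroids are plain coordinate tuples; the function names are illustrative.

```python
import statistics

def squared_distance_to_nearest(point, centroids):
    return min(sum((x - y) ** 2 for x, y in zip(point, c)) for c in centroids)

def clustering_quality(held_out, centroids):
    """Mean and median squared distance from held-out points to their nearest centroid."""
    d2 = [squared_distance_to_nearest(p, centroids) for p in held_out]
    return statistics.mean(d2), statistics.median(d2)
```

Reporting the median alongside the mean keeps a few far-away held-out points from dominating the comparison.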

Page 53: Paris data-geeks-2013-03-28

Contact Me!

• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at [email protected] or @ted_dunning

• Share news with @apachemahout