Practical Machine Learning with Mahout

DESCRIPTION

A quick BOF talk given by Ted Dunning in Paris at Devoxx

TRANSCRIPT

Page 1: Paris Data Geeks

Practical Machine Learning with Mahout

Page 2: Paris Data Geeks

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill
(we’re hiring)

• Contact me: [email protected], [email protected], @ted_dunning

Page 3: Paris Data Geeks

Agenda

• What works at scale
• Recommendation
• Unsupervised - Clustering

Page 4: Paris Data Geeks

What Works at Scale

• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more complex than these

• These are harder than they look

Page 7: Paris Data Geeks

Recommendations

Page 8: Paris Data Geeks

Recommendations

• Special case of reflected intelligence
• Traditionally “people who bought x also bought y”
• But soooo much more is possible

Page 9: Paris Data Geeks

Examples

• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl et al), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s

Page 10: Paris Data Geeks

Dyadic Structure

• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors × Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
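As a minimal illustration of this dyadic matrix (hypothetical data; real systems hold it as a large sparse matrix), counting interactions from a log in Python:

    from collections import Counter

    # hypothetical log of (actor, item) interaction events
    log = [("alice", "book1"), ("alice", "book2"),
           ("bob", "book1"), ("alice", "book1")]

    # sparse dyadic matrix: rows indexed by actor, columns by item,
    # value is the count of interactions
    counts = Counter(log)
    print(counts[("alice", "book1")])  # 2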

Page 11: Paris Data Geeks

Recommendations Analysis

• R(x, y) = # of people who bought x who also bought y

select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id

Page 17: Paris Data Geeks

Recommendations Analysis

Page 18: Paris Data Geeks

Fundamental Algorithmic Structure

• Cooccurrence

• Matrix approximation by factoring

• LLR
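Since LLR (the log-likelihood ratio, or G² test) is what sparsifies raw cooccurrence counts into meaningful ones, here is a small Python sketch of the score on a 2×2 contingency table; this follows the standard entropy formulation, not Mahout's Java code:

    import math

    def entropy(*counts):
        # Shannon entropy (in nats) of a set of counts
        total = sum(counts)
        return -sum(k / total * math.log(k / total) for k in counts if k > 0)

    def llr(k11, k12, k21, k22):
        # k11: x and y together, k12: x without y,
        # k21: y without x, k22: neither
        n = k11 + k12 + k21 + k22
        row = entropy(k11 + k12, k21 + k22)
        col = entropy(k11 + k21, k12 + k22)
        mat = entropy(k11, k12, k21, k22)
        return 2 * n * (row + col - mat)

    # large score: x and y cooccur far more often than chance predicts
    print(llr(100, 1000, 1000, 100000))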

Page 19: Paris Data Geeks

But Wait!

• Cooccurrence

• Cross occurrence

Page 20: Paris Data Geeks

For example

• Users enter queries (A)
– (actor = user, item = query)
• Users view videos (B)
– (actor = user, item = video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”

Page 21: Paris Data Geeks

The punch-line

• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
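A minimal numpy sketch of these products on made-up data (real systems use large sparse matrices and sparsify the raw counts with LLR):

    import numpy as np

    # hypothetical interaction matrices for 4 users
    A = np.array([[1, 0],     # user x query: user 0 issued query 0
                  [1, 0],
                  [0, 1],
                  [0, 1]])
    B = np.array([[1, 1, 0],  # user x video: user 0 watched videos 0, 1
                  [1, 0, 0],
                  [0, 0, 1],
                  [0, 1, 1]])

    query_to_query = A.T @ A   # A'A: queries that share users
    video_to_video = B.T @ B   # B'B: videos that share users
    video_for_query = B.T @ A  # B'A: rows are videos, columns are queries

    print(video_for_query)     # column j scores every video against query j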

Page 22: Paris Data Geeks

Real-life example

• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Page 23: Paris Data Geeks

Real-life example

Page 24: Paris Data Geeks

Hypothetical Example

• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever users think they should be

Page 25: Paris Data Geeks

Super-fast k-means Clustering

Page 26: Paris Data Geeks

RATIONALE

Page 27: Paris Data Geeks

What is Quality?

• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue

Page 28: Paris Data Geeks

An Example

Page 29: Paris Data Geeks

An Example

Page 30: Paris Data Geeks

Diagonalized Cluster Proximity

Page 31: Paris Data Geeks

Clusters as Distribution Surrogate

Page 32: Paris Data Geeks

Clusters as Distribution Surrogate

Page 33: Paris Data Geeks

THEORY

Page 34: Paris Data Geeks

For Example

Grouping these two clusters seriously hurts squared distance

Page 35: Paris Data Geeks

ALGORITHMS

Page 36: Paris Data Geeks

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm

Result is that these two clusters get glued together

Page 37: Paris Data Geeks

Ball k-means

• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

initialize centroids randomly with a distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to any other
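A compact Python sketch of the ball step (my illustration, not Mahout's implementation; the trim factor of 0.5 is an assumed constant):

    import numpy as np

    def ball_kmeans(X, centroids, iterations=3, trim=0.5):
        # Lloyd-style iterations over X (n, d) and centroids (k, d), but
        # each centroid is recomputed only from points well inside its
        # own "ball", which keeps outliers out of the mean
        for _ in range(iterations):
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for j in range(len(centroids)):
                dj = d[nearest == j]
                mine = X[nearest == j]
                if len(mine) == 0:
                    continue
                own = dj[:, j]                                # distance to own centroid
                other = np.delete(dj, j, axis=1).min(axis=1)  # nearest other centroid
                core = mine[own < trim * other]               # keep only core points
                if len(core) > 0:
                    centroids[j] = core.mean(axis=0)
        return centroids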

Page 38: Paris Data Geeks

Still Not a Win

• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(n k d + k³ d) time

• But for big data, k gets large

Page 40: Paris Data Geeks

Surrogate Method

• Start with sloppy clustering into lots of clusters, κ = k log n

• Use this sketch as a weighted surrogate for the data

• Results are provably good for highly clusterable data

Page 41: Paris Data Geeks

Algorithm Costs

• Surrogate methods
– fast, sloppy single-pass clustering with κ = k log n
– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids:
O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Page 43: Paris Data Geeks

Algorithm Costs

• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000
– k d log n = 2000 × 10 × 26 ≈ 500,000
– d (log k + log log n) = 10 × (11 + 5) = 170
– 3,000 times faster is a bona fide big deal


Page 45: Paris Data Geeks

How It Works

• For each point
– find approximately nearest centroid (at distance d)
– if (d > threshold), start a new centroid
– else if (u < d/threshold), for u uniform random in [0, 1], start a new cluster
– else add to nearest centroid
• If centroids > κ ≈ C log N
– recursively cluster centroids with a higher threshold
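A single-machine Python sketch of this loop under the stated rules (my illustration with assumed constants, not Mahout's StreamingKMeans; the probabilistic test makes far points more likely to seed new clusters):

    import numpy as np

    def streaming_sketch(points, kappa, threshold, growth=1.5, seed=0):
        # single-pass sloppy clustering: keep ~kappa weighted centroids
        # as a surrogate for the full data set
        rng = np.random.default_rng(seed)
        centroids = [np.asarray(points[0], dtype=float)]
        weights = [1.0]

        def absorb(x, w):
            d = np.array([np.linalg.norm(x - c) for c in centroids])
            j = int(d.argmin())
            if d[j] > threshold or rng.uniform() < d[j] / threshold:
                centroids.append(np.asarray(x, dtype=float))  # new cluster
                weights.append(w)
            else:  # fold the point into the nearest weighted centroid
                centroids[j] = (weights[j] * centroids[j] + w * x) / (weights[j] + w)
                weights[j] += w

        for x in points[1:]:
            absorb(np.asarray(x, dtype=float), 1.0)
            if len(centroids) > kappa:
                # over budget: raise threshold, recluster the centroids
                threshold *= growth
                cs, ws = centroids[:], weights[:]
                centroids, weights = [cs[0]], [ws[0]]
                for c, w in zip(cs[1:], ws[1:]):
                    absorb(c, w)
        return np.array(centroids), np.array(weights)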

Page 46: Paris Data Geeks

IMPLEMENTATION

Page 47: Paris Data Geeks

But Wait, …

• Finding nearest centroid is inner loop

• This could take O(d κ) per point, and κ can be big

• Happily, approximate nearest centroid works fine

Page 48: Paris Data Geeks

Projection Search

total ordering!
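The figure's point is that projecting onto one direction induces a total ordering, so nearest-neighbor candidates can be found by binary search in a sorted list of projections. A hedged numpy sketch (the single projection and fixed candidate window are simplifying assumptions; real implementations use several projections):

    import numpy as np

    def build_index(points, seed=0):
        # project onto a random unit vector and sort: the projection
        # gives a total ordering that can be binary-searched
        rng = np.random.default_rng(seed)
        u = rng.normal(size=points.shape[1])
        u /= np.linalg.norm(u)
        proj = points @ u
        order = np.argsort(proj)
        return u, proj[order], order

    def approx_nearest(q, points, index, window=8):
        # check only the few points whose projection is closest to q's
        u, sorted_proj, order = index
        pos = np.searchsorted(sorted_proj, q @ u)
        lo, hi = max(0, pos - window), min(len(order), pos + window)
        candidates = order[lo:hi]
        d = np.linalg.norm(points[candidates] - q, axis=1)
        return candidates[int(d.argmin())]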

Page 49: Paris Data Geeks

LSH Bit-match Versus Cosine
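The slide's figure compared LSH bit-match counts against true cosine similarity. The underlying fact: for random-hyperplane sign sketches, the expected fraction of mismatched bits is angle/π, so bit-match approximates cosine. A small numpy check (all names here are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d, bits = 20, 256
    planes = rng.normal(size=(bits, d))   # random hyperplanes

    def sketch(v):
        return np.signbit(planes @ v)     # one sign bit per hyperplane

    x, y = rng.normal(size=d), rng.normal(size=d)
    mismatch = np.mean(sketch(x) != sketch(y))
    cos_true = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    cos_est = np.cos(np.pi * mismatch)    # E[mismatch] = angle / pi
    print(cos_true, cos_est)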

Page 50: Paris Data Geeks

RESULTS

Page 51: Paris Data Geeks

Parallel Speedup?

Page 52: Paris Data Geeks

Quality

• Ball k-means implementation appears significantly better than simple k-means

• Streaming k-means + ball k-means appears to be about as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance to nearest cluster

Page 53: Paris Data Geeks

Contact Me!

• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at [email protected] or @ted_dunning

• Share news with @apachemahout