paris data-geeks-2013-03-28

Practical Machine Learning with Mahout

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill
(we’re hiring)

• Contact me at [email protected] or @ted_dunning



Agenda

• What works at scale
• Recommendation
• Unsupervised - Clustering

What Works at Scale

• Logging
• Counting
• Session grouping

What Works at Scale


• Really. Don’t bet on anything much more complex than these

• These are harder than they look

Recommendations

Recommendations

• Special case of reflected intelligence
• Traditionally “people who bought x also bought y”

• But soooo much more is possible

Examples

• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s

Dyadic Structure

• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions

• Predict missing observations
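The three views above describe the same data. As a minimal sketch (the log contents and all variable names here are made up for illustration), one interaction log can be read as a function, a relation, or a sparse count matrix:

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: one (actor, item) pair per observed event.
log = [
    ("alice", "book"), ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("carol", "lamp"), ("carol", "desk"),
]

# Functional view: actor -> item* (every item the actor touched, in order).
by_actor = defaultdict(list)
for actor, item in log:
    by_actor[actor].append(item)

# Relational view: the set of observed pairs, a subset of Actors x Items.
relation = set(log)

# Matrix view: sparse counts keyed by (row = actor, column = item).
matrix = Counter(log)

# Cells with no entry are the "missing observations" a recommender predicts.
actors = sorted({a for a, _ in log})
items = sorted({i for _, i in log})
missing = [(a, i) for a in actors for i in items if (a, i) not in matrix]
```

A dictionary of counts stands in for the matrix because most actor/item cells are empty in practice.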

Recommendations Analysis

• R(x,y) = number of people who bought x and also bought y

select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
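The same self-join can be sketched outside SQL. This is an illustrative version only, assuming the log is a list of (user_id, item_id) pairs; the function name is hypothetical:

```python
from collections import Counter, defaultdict
from itertools import product

def cooccurrence(log):
    """Return R[(x, y)] = number of users who have both item x and item y."""
    # Equivalent of 'select distinct user_id, item_id': one set per user.
    items_by_user = defaultdict(set)
    for user_id, item_id in log:
        items_by_user[user_id].add(item_id)

    # Equivalent of the self-join on user_id followed by 'group by x, y'.
    counts = Counter()
    for items in items_by_user.values():
        for x, y in product(items, repeat=2):
            counts[(x, y)] += 1
    return counts
```

Like the join, this keeps the diagonal: R(x, x) is simply the number of distinct buyers of x. The inner loop is quadratic in the number of items per user, which is one reason very active users are usually down-sampled at scale.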

Fundamental Algorithmic Structure

• Cooccurrence

• Matrix approximation by factoring

• LLR (log-likelihood ratio)
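The slide only names LLR. The sketch below assumes it refers to the log-likelihood ratio (G²) test applied to a 2x2 table of cooccurrence counts, used to decide which cooccurrences are anomalous enough to keep. The argument names k11, k12, k21, k22 are illustrative: k11 counts users with both items, k12 and k21 users with only one of them, and k22 users with neither.

```python
import math

def x_log_x(x):
    # By convention 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))
```

A score near zero means the two items cooccur about as often as independence predicts; large scores flag pairs worth keeping as recommendation indicators.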

But Wait!

• Cooccurrence

• Cross occurrence

For example

• Users enter queries (A)
– (actor = user, item = query)
• Users view videos (B)
– (actor = user, item = video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”

The punch-line

• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
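A small sketch of the cross-occurrence product B’A, under the assumption that A and B are binary (a user either issued a query or viewed a video, counts ignored). The function names and the tie-breaking rule are illustrative, not taken from the talk:

```python
from collections import Counter, defaultdict

def cross_occurrence(query_log, view_log):
    """Return counts[(video, query)] = users who issued the query and viewed the video.

    This is the (video, query) entry of B'A for binary A and B.
    """
    queries_by_user = defaultdict(set)
    for user, query in query_log:
        queries_by_user[user].add(query)
    videos_by_user = defaultdict(set)
    for user, video in view_log:
        videos_by_user[user].add(video)

    counts = Counter()
    for user, videos in videos_by_user.items():
        for video in videos:
            for query in queries_by_user.get(user, ()):
                counts[(video, query)] += 1
    return counts

def recommend(counts, query, n=3):
    """Videos most often viewed by users who issued this query."""
    scored = [(c, video) for (video, q), c in counts.items() if q == query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [video for _, video in scored[:n]]
```

Nothing here inspects the video's title or description; the ranking comes entirely from user behavior, which is the difference from a conventional search engine.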

Real-life example

• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff


Hypothetical Example

• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping

• After several users click, results are whatever users think they should be

Super-fast k-means Clustering

RATIONALE

What is Quality?

• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue

An Example

Diagonalized Cluster Proximity

Clusters as Distribution Surrogate

THEORY

For Example

Grouping these two clusters seriously hurts squared distance

ALGORITHMS

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm

Result is that these two clusters get glued together

Ball k-means

• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
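The pseudocode above can be turned into a small runnable sketch. Several details are assumptions rather than the talk's method: seeding uses k-means++-style sampling as the "distance maximizing tendency", "much closer" is interpreted as lying within a ball whose radius is a fraction `trim` of the distance to the nearest other centroid, and the input must contain at least k distinct points.

```python
import math
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def seed_centroids(points, k, rng):
    """Pick seeds at random, favoring points far from the seeds chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):  # "a very few iterations"
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute each centroid from the core of its cluster only.
        updated = []
        for j, members in enumerate(clusters):
            others = [sq_dist(centroids[j], c)
                      for i, c in enumerate(centroids) if i != j]
            radius_sq = trim ** 2 * min(others) if others else math.inf
            core = [p for p in members if sq_dist(p, centroids[j]) <= radius_sq]
            updated.append(mean(core) if core else centroids[j])
        centroids = updated
    return centroids
```

Trimming to the core keeps stray points between clusters from dragging a centroid toward its neighbor, which is the outlier protection the slide mentions.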

Still Not a Win

• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time

• But for big data, k gets large

Surrogate Method

• Start with sloppy clustering into lots of clusters
– κ = k log n clusters

• Use this sketch as a weighted surrogate for the data

• Results are provably good for highly clusterable data

Algorithm Costs

• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
  O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
  O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Algorithm Costs

• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000 (log₂ n ≈ 26)
– k d log n = 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) = 10 (11 + 5) = 160
– 3,000 times faster is a bona fide big deal
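The arithmetic can be checked directly. This sketch assumes base-2 logarithms and a value of n near 10⁸, which is what a log n of about 26 implies; the variable names are illustrative.

```python
import math

k, d = 2000, 10
n = 100_000_000  # assumption: chosen so that log2(n) is about 26

log_n = math.log2(n)

# Exhaustive search over kappa = k log n sketch centroids, per point.
exhaustive_cost = k * d * log_n

# Fast approximate search, d (log k + log log n), per point.
fast_cost = d * (math.log2(k) + math.log2(log_n))

speedup = exhaustive_cost / fast_cost
```

With these inputs the ratio comes out at a few thousand, in line with the "3,000 times faster" on the slide.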

How It Works

• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new cluster, where u is uniform random in [0, 1)
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
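A minimal sketch of the single-pass loop described above. It rests on several assumptions that are not from the talk: a far-away point starts a new cluster with probability d/threshold, the nearest centroid is found by exhaustive search rather than an approximate one, merged centroids are weighted means, and the threshold grows by a fixed factor `growth` whenever the sketch is collapsed.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def absorb(centroids, vec, weight, threshold, rng):
    """Add one weighted point to the sketch, in place."""
    if centroids:
        nearest = min(centroids, key=lambda c: distance(c[0], vec))
        dist = distance(nearest[0], vec)
    else:
        nearest, dist = None, math.inf
    if dist > threshold or rng.random() < dist / threshold:
        centroids.append([tuple(vec), weight])  # start a new cluster
    else:
        total = nearest[1] + weight             # fold into nearest centroid
        nearest[0] = tuple((a * nearest[1] + b * weight) / total
                           for a, b in zip(nearest[0], vec))
        nearest[1] = total

def streaming_kmeans(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    rng = random.Random(seed)
    centroids = []  # each entry is [vector, weight]
    for p in points:
        absorb(centroids, p, 1.0, threshold, rng)
        # Too many centroids: re-cluster the centroids with a higher threshold.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vec, weight in old:
                absorb(centroids, vec, weight, threshold, rng)
    return centroids, threshold
```

The returned weighted centroids are the sketch: their weights sum to the number of input points, so they can be handed to an in-memory clustering step in place of the raw data.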

IMPLEMENTATION

But Wait, …

• Finding the nearest centroid is the inner loop

• This could take O( d κ ) per point and κ can be big

• Happily, approximate nearest centroid works fine

Projection Search

total ordering!
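The slide text only says that a projection gives a total ordering. One way to exploit that, sketched here under stated assumptions, is to project every vector onto a few random directions, keep each projection sorted, and compare the query only against its neighbors in those orderings. The class name, the number of projections, and the `search_size` window are all illustrative choices, not the talk's implementation.

```python
import bisect
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

class ProjectionSearch:
    """Approximate nearest-neighbor search via sorted random projections."""

    def __init__(self, dim, num_projections=3, search_size=4, seed=0):
        rng = random.Random(seed)
        self.directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                           for _ in range(num_projections)]
        # One sorted list of (projection value, vector index) per direction.
        self.tables = [[] for _ in self.directions]
        self.vectors = []
        self.search_size = search_size

    def add(self, vec):
        index = len(self.vectors)
        self.vectors.append(tuple(vec))
        for direction, table in zip(self.directions, self.tables):
            bisect.insort(table, (dot(direction, vec), index))

    def nearest(self, query):
        """Return (index, squared distance); requires at least one stored vector."""
        candidates = set()
        for direction, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(direction, query), -1))
            lo = max(0, pos - self.search_size)
            hi = min(len(table), pos + self.search_size)
            candidates.update(index for _, index in table[lo:hi])
        # Exact distances are computed only for the few candidates.
        best = min(candidates, key=lambda i: sq_dist(self.vectors[i], query))
        return best, sq_dist(self.vectors[best], query)
```

Points that are close in space are close in every projection, so the true nearest neighbor is likely, though not guaranteed, to appear among the candidates.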

LSH Bit-match Versus Cosine

RESULTS

Parallel Speedup?

✓

Quality

• Ball k-means implementation appears significantly better than simple k-means

• Streaming k-means + ball k-means appears to be about as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance to nearest cluster
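The figure of merit is simple to compute. A minimal sketch, assuming held-out points and centroids are plain coordinate tuples (the function names are illustrative):

```python
import statistics

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    nearest = [min(sq_dist(p, c) for c in centroids) for p in held_out]
    return statistics.mean(nearest), statistics.median(nearest)
```

Scoring on held-out data measures generalization rather than fit to the training points, and reporting the median alongside the mean shows whether a few distant points dominate the average.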

Contact Me!

• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at [email protected] or @ted_dunning

• Share news with @apachemahout

