Which Algorithms Really Matter?

1 ©MapR Technologies 2013 Which Algorithms Really Matter?

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

This is the position talk that I gave at CIKM. Included are 4 algorithms that I feel don't get much academic attention, but which are very important industrially. It isn't necessarily true that these algorithms *should* get academic attention, but I do feel that it is true that they are quite important pragmatically speaking.

TRANSCRIPT

Page 1: Which Algorithms Really Matter

Which Algorithms Really Matter?

Page 2: Which Algorithms Really Matter

Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer, PMC member: Mahout, Zookeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard APIs

Info
– Hash tag: #mapr
– See also: @ApacheMahout @ApacheDrill
– @ted_dunning and @mapR

Page 3: Which Algorithms Really Matter

Topic For Today

What is important?

What is not?

Why?

What is the difference from academic research?

Some examples

Page 4: Which Algorithms Really Matter

What is Important?

Deployable

Robust

Transparent

Skillset and mindset matched?

Proportionate

Page 5: Which Algorithms Really Matter

What is Important?

Deployable
– Clever prototypes don’t count if they can’t be standardized

Robust

Transparent

Skillset and mindset matched?

Proportionate

Page 6: Which Algorithms Really Matter

What is Important?

Deployable
– Clever prototypes don’t count

Robust
– Mishandling is common

Transparent
– Will degradation be obvious?

Skillset and mindset matched?

Proportionate

Page 7: Which Algorithms Really Matter

What is Important?

Deployable
– Clever prototypes don’t count

Robust
– Mishandling is common

Transparent
– Will degradation be obvious?

Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?

Proportionate
– Where is the highest value per minute of effort?

Page 8: Which Algorithms Really Matter

Academic Goals vs Pragmatics

Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems

Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves, exploration and exploitation are both important
– Engineering constraints on budget and schedule

Page 9: Which Algorithms Really Matter

Example 1: Making Recommendations Better

Page 10: Which Algorithms Really Matter

Recommendation Advances

What are the most important algorithmic advances in recommendations over the last 10 years?

Cooccurrence analysis?

Matrix completion via factorization?

Latent factor log-linear models?

Temporal dynamics?

Page 11: Which Algorithms Really Matter

The Winner – None of the Above

What are the most important algorithmic advances in recommendations over the last 10 years?

1. Result dithering
2. Anti-flood

Page 12: Which Algorithms Really Matter

The Real Issues

Exploration

Diversity

Speed

Not the last fraction of a percent

Page 14: Which Algorithms Really Matter

Result Dithering

Dithering is used to re-order recommendation results
– Re-ordering is done randomly

Dithering is guaranteed to make off-line performance worse

Dithering also has a near perfect record of making actual performance much better

“Made more difference than any other change”

Page 15: Which Algorithms Really Matter

Simple Dithering Algorithm

Generate synthetic score from log rank plus Gaussian

Pick noise scale to provide desired level of mixing

Typically

Oh… use floor(t/T) as seed
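
The steps on this slide can be sketched in a few lines of Python. The formulas did not survive in the transcript, so this assumes the synthetic score is log(rank) plus Gaussian noise with standard deviation ε, with results re-sorted by that score. The names `dither`, `epsilon`, `t`, and `period` are illustrative, not from the talk. Seeding with floor(t/T) is taken here to mean that the ordering stays fixed within a time window of length T, so a page reload does not reshuffle the list.

```python
import math
import random


def dither(ranked_items, epsilon=0.5, t=0.0, period=600.0):
    """Re-order an already ranked list by a noisy synthetic score.

    Assumed form: score = log(rank) + Normal(0, epsilon), lowest score first.
    The seed is floor(t / period), so the order is stable inside one window.
    """
    rng = random.Random(math.floor(t / period))
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        score = math.log(rank) + rng.gauss(0.0, epsilon)
        scored.append((score, rank, item))
    scored.sort()  # ties fall back to the original rank
    return [item for _, _, item in scored]
```

With `epsilon=0.0` the original order comes back unchanged. Larger values pull more items up from deep in the list, which is the mixing the slide refers to.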

Page 16: Which Algorithms Really Matter

Example … ε = 0.5

Page 17: Which Algorithms Really Matter

Example … ε = log 2 = 0.69

Page 18: Which Algorithms Really Matter

Exploring The Second Page

Page 19: Which Algorithms Really Matter

Lesson 1: Exploration is good

Page 20: Which Algorithms Really Matter

Example 2: Bayesian Bandits

Page 21: Which Algorithms Really Matter

Bayesian Bandits

Based on Thompson sampling

Very general sequential test

Near optimal regret

Trade-off exploration and exploitation

Possibly best known solution for exploration/exploitation

Incredibly simple

Page 22: Which Algorithms Really Matter

Thompson Sampling

Select each shell according to the probability that it is the best

Probability that it is the best can be computed using posterior

But I promised a simple answer

Page 23: Which Algorithms Really Matter

Thompson Sampling – Take 2

Sample θ

Pick i to maximize reward

Record result from using i
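
The three steps above can be made concrete for the simplest case. This sketch assumes each option pays a 0/1 reward with an unknown rate and starts from a uniform Beta(1, 1) prior. Neither assumption is stated on the slide, and the class and function names are made up for illustration. Sampling θ becomes one Beta draw per option, picking i becomes taking the largest draw, and recording the result becomes a counter update.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling for 0/1 rewards with Beta(1, 1) priors (assumed)."""

    def __init__(self, n_arms, seed=None):
        self.successes = [1.0] * n_arms  # Beta alpha parameters
        self.failures = [1.0] * n_arms   # Beta beta parameters
        self.rng = random.Random(seed)

    def choose(self):
        # Sample theta: one plausible payoff rate per arm from its posterior.
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.successes, self.failures)]
        # Pick i to maximize reward under the sampled theta.
        return max(range(len(draws)), key=draws.__getitem__)

    def record(self, arm, reward):
        # Record the result: update the posterior of the arm that was used.
        if reward:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


def simulate(true_rates, steps, seed=0):
    """Run the bandit against hypothetical true rates; return pulls per arm."""
    bandit = BetaBernoulliBandit(len(true_rates), seed=seed)
    world = random.Random(seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        arm = bandit.choose()
        bandit.record(arm, world.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls
```

Calling `simulate([0.1, 0.5], 2000)` should show most pulls going to the second option after a short exploration phase.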

Page 24: Which Algorithms Really Matter

Fast Convergence

Page 25: Which Algorithms Really Matter

Thompson Sampling on Ads

An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011

Page 26: Which Algorithms Really Matter

Bayesian Bandits versus Result Dithering

Many useful systems are difficult to frame in fully Bayesian form

Thompson sampling cannot be applied without posterior sampling

Can still do useful exploration with dithering

But better to use Thompson sampling if possible

Page 27: Which Algorithms Really Matter

Lesson 2: Exploration is pretty easy to do and pays big benefits.

Page 28: Which Algorithms Really Matter

Example 3: On-line Clustering

Page 29: Which Algorithms Really Matter

The Problem

K-means clustering is useful for feature extraction or compression

At scale and at high dimension, the desirable number of clusters increases

Very large number of clusters may require more passes through the data

Super-linear scaling is generally infeasible

Page 30: Which Algorithms Really Matter

The Solution

Sketch-based algorithms produce a sketch of the data

Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution

The size of the sketch grows very slowly with increasing data size

Many operations such as clustering are well behaved on sketches

Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.

Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.

Page 31: Which Algorithms Really Matter

An Example

Page 32: Which Algorithms Really Matter

An Example

Page 33: Which Algorithms Really Matter

The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation

Or we can increase the number of clusters (n fold increase adds log n bits per point, decreases error by sqrt(n))
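
As a rough illustration of the encoding described above, the sketch below looks up the two nearest centroids for a point and returns their indices and distances. The function names are invented for this example, and the sign bit mentioned on the slide is not modeled. The 4.3 bits figure is consistent with about 20 clusters (log2 of 20 is roughly 4.32), although the slide does not state the cluster count.

```python
import math


def proximity_features(point, centroids):
    """Return (nearest index, second index, nearest dist, second dist)."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, i2, d1, d2


def bits_per_index(n_clusters):
    """Bits needed to name one cluster out of n_clusters."""
    return math.log2(n_clusters)
```

For centroids at (0, 0), (10, 0) and (0, 10), the point (1, 0) is described by clusters 0 and 1 at distances 1.0 and 9.0.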

Page 34: Which Algorithms Really Matter

Diagonalized Cluster Proximity

Page 35: Which Algorithms Really Matter

Lots of Clusters Are Fine

Page 36: Which Algorithms Really Matter

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd's

Result is that these two clusters get glued together

Page 37: Which Algorithms Really Matter

Streaming k-means Ideas

By using a sketch with lots (k log N) of centroids, we avoid pathological cases

We still get a very good result if the sketch is created
– in one pass
– with approximate search

In fact, adaptive dp-means works just fine

In the end, the sketch can be used for clustering or …
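
A one-pass sketch builder in this spirit might look like the following. It is a simplified, deterministic dp-means-style variant written for illustration, not the Mahout implementation: a point farther than a threshold from every centroid starts a new centroid, otherwise it is merged into the nearest one. When the sketch grows past its budget, the threshold is raised and the centroids are re-clustered among themselves. The function names, the exact linear nearest-neighbour search, and the growth factor are all assumptions.

```python
import math


def _absorb(centroids, point, weight, threshold):
    """Merge a weighted point into the nearest centroid or start a new one."""
    best_i, best_d = -1, math.inf
    for i, (c, _) in enumerate(centroids):
        d = math.dist(c, point)
        if d < best_d:
            best_i, best_d = i, d
    if best_i < 0 or best_d > threshold:
        centroids.append((list(point), weight))
    else:
        c, cw = centroids[best_i]
        total = cw + weight
        merged = [(x * cw + y * weight) / total for x, y in zip(c, point)]
        centroids[best_i] = (merged, total)


def build_sketch(points, max_centroids, threshold=0.1, growth=1.5):
    """Single pass over points; returns a list of (centroid, weight) pairs."""
    if threshold <= 0 or growth <= 1 or max_centroids < 1:
        raise ValueError("need threshold > 0, growth > 1, max_centroids >= 1")
    centroids = []
    for p in points:
        _absorb(centroids, p, 1.0, threshold)
        # Too many centroids: loosen the threshold and collapse the sketch.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for c, w in old:
                _absorb(centroids, c, w, threshold)
    return centroids
```

The weights always sum to the number of points seen, so the result can be handed to a weighted k-means run in memory as the final clustering step.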

Page 38: Which Algorithms Really Matter

Lesson 3: Sketches make big data small.

Page 39: Which Algorithms Really Matter

Example 4: Search Abuse

Page 40: Which Algorithms Really Matter

Recommendations

Alice got an apple and a puppy

Charles got a bicycle

Alice

Charles

Page 41: Which Algorithms Really Matter

Recommendations

Alice got an apple and a puppy

Charles got a bicycle

Bob got an apple

Alice

Bob

Charles

Page 42: Which Algorithms Really Matter

Recommendations

What else would Bob like?

Alice

Bob

Charles

Page 43: Which Algorithms Really Matter

Log Files

[Figure: log file entries for Alice, Bob, and Charles]

Page 44: Which Algorithms Really Matter

History Matrix: Users by Items

Alice

Bob

Charles

✔ ✔ ✔

✔ ✔

✔ ✔

Page 45: Which Algorithms Really Matter

Co-occurrence Matrix: Items by Items

[Figure: item-by-item co-occurrence counts]

How do you tell which co-occurrences are useful?
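
Going from the users-by-items history to item-by-item co-occurrence counts can be sketched as below. The function name and the input shape (a dict from user to the items they interacted with) are assumptions made for this example. Each unordered pair of distinct items is counted once per user who has both.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count, for each item pair, how many users interacted with both."""
    counts = defaultdict(int)
    for items in histories.values():
        # Sort so each pair has one canonical key; set() drops repeats.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

Raw counts like these are dominated by popular items, which is why the question on the slide matters.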

Page 46: Which Algorithms Really Matter

Co-occurrence Binary Matrix

1

1not

not

1

Page 47: Which Algorithms Really Matter

Indicator Matrix: Anomalous Co-Occurrence

✔✔

Result: The marked row will be added to the indicator field in the item document…
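
The transcript does not say which test marks a co-occurrence as anomalous. One common choice, assumed here, is the log-likelihood ratio (G²) statistic on the 2 x 2 table of "A and B", "A without B", "B without A", and "neither". The function names and argument order below are illustrative.

```python
import math


def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    """Unnormalized entropy: N log N minus the sum of k log k."""
    return _xlogx(sum(counts)) - sum(_xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G^2 score for a 2x2 contingency table of co-occurrence counts.

    k11: both items, k12: first only, k21: second only, k22: neither.
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A table with no association, such as `llr(10, 10, 10, 10)`, scores about zero. A strongly associated one scores high, and pairs above some chosen cutoff would be kept as indicators.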

Page 48: Which Algorithms Really Matter

Indicator Matrix

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

That one row from indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.

Note: data for the indicator field is added directly to meta-data for a document in Solr index. You don’t need to create a separate index for the indicators.
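
To show why an ordinary search index is enough, here is a toy in-memory stand-in. It is not Solr and uses no Solr API. Documents are plain dicts with an `indicators` list, the query is the user's recent item ids, and the score is simply the number of matching indicators. A real engine would apply its own relevance weighting instead. All names are hypothetical.

```python
def recommend(docs, history, limit=10):
    """Rank items whose indicator field matches something in the history."""
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        score = len(seen & set(doc.get("indicators", ())))
        if score > 0:
            scored.append((score, doc["id"]))
    # Highest score first; id breaks ties so the output is deterministic.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:limit]]
```

With the puppy document above (id `t4`, indicators `t1`), a user whose history contains `t1` gets `t4` back.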

Page 49: Which Algorithms Really Matter

Internals of the Recommender Engine

Page 50: Which Algorithms Really Matter

Internals of the Recommender Engine

Page 51: Which Algorithms Really Matter

Looking Inside LucidWorks

What to recommend if new user listened to 2122: Fats Domino & 303: Beatles?

Recommendation is “1710 : Chuck Berry”

Real-time recommendation query and results: Evaluation

Page 52: Which Algorithms Really Matter

Real-life example

Page 53: Which Algorithms Really Matter

Lesson 4: Recursive search abuse pays

Search can implement recs

Which can implement search

Page 54: Which Algorithms Really Matter

Summary

Page 55: Which Algorithms Really Matter
