Recommendation and graph algorithms in Hadoop and SQL

Recommendation and graph algorithms in Hadoop and SQL. David F. Gleich, Assistant Professor, Computer Science, Purdue University. Code: github.com/dgleich/matrix-hadoop-tutorial. Ancestry.com. @dgleich

Upload: david-gleich

Post on 06-May-2015


DESCRIPTION

A talk I gave at Ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.

TRANSCRIPT

Page 1: Recommendation and graph algorithms in Hadoop and SQL

Recommendation and graph algorithms in Hadoop and SQL

DAVID F. GLEICH, ASSISTANT PROFESSOR, COMPUTER SCIENCE, PURDUE UNIVERSITY

David Gleich · Purdue 1

Code github.com/dgleich/matrix-hadoop-tutorial

Ancestry.com

@dgleich

Page 2: Recommendation and graph algorithms in Hadoop and SQL

Matrix computations

$$A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}$$

Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$

Page 3: Recommendation and graph algorithms in Hadoop and SQL

Outcomes

Recognize relationships between matrix methods and things you’ve already been doing.
    Example: SQL queries as matrix computations

See how to work with big graphs as large edge lists in Hadoop and SQL.
    Example: Connected components

Understand how to use Hadoop to compute these matrix methods at scale for BigData.
    Example: Recommenders with social network info

Page 4: Recommendation and graph algorithms in Hadoop and SQL

matrix computations ≠ linear algebra

Page 5: Recommendation and graph algorithms in Hadoop and SQL

World’s simplest recommendation system.

Suggest the average rating.


Page 6: Recommendation and graph algorithms in Hadoop and SQL

A SQL statement as a matrix computation

How do I find the average rating for each product?

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

Page 7: Recommendation and graph algorithms in Hadoop and SQL

A SQL statement as a matrix computation

How do I find the average rating for each product?

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

    SELECT
      p.product_id,
      p.name,
      AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
    ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC

Page 8: Recommendation and graph algorithms in Hadoop and SQL

This SQL statement is a matrix computation!

Image from rockysprings, deviantart, CC share-alike

Page 9: Recommendation and graph algorithms in Hadoop and SQL

    SELECT
      ...
      AVG(pr.rating)
    ...
    GROUP BY p.product_id

product_ratings

    pid  | uid  | rating
    pid8 | uid2 | 4
    pid9 | uid9 | 1
    pid2 | uid9 | 5
    pid9 | uid5 | 5
    pid6 | uid8 | 4
    pid1 | uid2 | 4
    pid3 | uid4 | 4
    pid5 | uid9 | 2
    pid9 | uid8 | 4
    pid9 | uid9 | 1

Is a matrix!

pid1 pid2 pid3 pid4 pid5 pid6 pid7 pid8 pid9

Page 10: Recommendation and graph algorithms in Hadoop and SQL

product_ratings is a matrix!

But it’s a weird matrix: missing entries!

Page 11: Recommendation and graph algorithms in Hadoop and SQL

product_ratings is a matrix, but it’s a weird matrix.

Matrix → SELECT AVG(r) ... GROUP BY pid → Vector (average of ratings)

[Figure: the ratings placed as entries of a matrix with columns pid1 … pid9; most entries are missing.]

Page 12: Recommendation and graph algorithms in Hadoop and SQL

But it’s a weird matrix and not a linear operator

product_ratings is a matrix!

$$A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}$$

$$\operatorname{avg}(A) = \begin{bmatrix}
\sum_j A_{1,j} \,/\, \sum_j [A_{1,j} \neq 0] \\
\sum_j A_{2,j} \,/\, \sum_j [A_{2,j} \neq 0] \\
\vdots \\
\sum_j A_{m,j} \,/\, \sum_j [A_{m,j} \neq 0]
\end{bmatrix}$$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise.
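A small sketch of why avg is not a linear operator. It assumes, as the formula does, that a zero entry means "missing"; the function names and the tiny matrices are made up for illustration.

```python
def avg_nonzero(A):
    """Row-wise average over nonzero entries; 0 marks a missing rating."""
    out = []
    for row in A:
        present = [a for a in row if a != 0]
        out.append(sum(present) / len(present) if present else 0.0)
    return out


def add(A, B):
    """Entry-wise sum of two matrices stored as lists of rows."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical one-row matrices: each has a single rating in a different column.
A = [[4, 0, 0]]
B = [[0, 2, 0]]

lhs = avg_nonzero(add(A, B))                                   # avg(A + B)
rhs = [x + y for x, y in zip(avg_nonzero(A), avg_nonzero(B))]  # avg(A) + avg(B)
is_linear_here = (lhs == rhs)
```

For a linear operator the two sides would agree; here they do not, because the denominator depends on which entries are present.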

Page 13: Recommendation and graph algorithms in Hadoop and SQL

matrix computations ≠ linear algebra

Page 14: Recommendation and graph algorithms in Hadoop and SQL

Hadoop, MapReduce, and Matrix Methods


Page 15: Recommendation and graph algorithms in Hadoop and SQL

MapReduce

[Figure: blocks of data feed Map tasks, which emit (key, value) pairs; the Shuffle groups all values that share a key; each Reduce task receives a key with its list of values and writes data.]

Page 16: Recommendation and graph algorithms in Hadoop and SQL

The MapReduce Framework

Originated at Google for indexing web pages and computing PageRank.

Express algorithms in “data-local operations”.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.

Data scalable

Fault-tolerance by design
    Input stored in triplicate
    Map output persisted to disk before shuffle
    Reduce input/output on disk

[Figure: input blocks 1 to 5 are assigned to Map tasks (M); the shuffle routes their output to Reduce tasks (R).]

Page 17: Recommendation and graph algorithms in Hadoop and SQL

wordcount is a matrix computation too

    map(document) :
        for word in document
            emit (word, 1)

    reduce(word, counts) :
        emit (word, sum(counts))

[Figure: documents D stored in five blocks are mapped to pairs such as (matrix, 1), (bigdata, 1), and (hadoop, 1), which are then grouped by word.]
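The pseudocode above can be tried locally with a minimal in-memory stand-in for the framework. This is only a sketch: `shuffle` here is a hypothetical helper that groups values by key in one process, which is what Hadoop does across machines.

```python
from collections import defaultdict


def map_doc(document):
    # map(document): emit (word, 1) for every word
    for word in document.split():
        yield (word, 1)


def shuffle(pairs):
    # Stand-in for the shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(word, counts):
    # reduce(word, counts): emit (word, sum(counts))
    yield (word, sum(counts))


def wordcount(documents):
    pairs = (kv for doc in documents for kv in map_doc(doc))
    result = {}
    for word, counts in shuffle(pairs).items():
        for key, total in reduce_word(word, counts):
            result[key] = total
    return result


# Hypothetical toy documents.
docs = ["matrix bigdata hadoop", "bigdata hadoop", "bigdata"]
counts = wordcount(docs)
```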

Page 18: Recommendation and graph algorithms in Hadoop and SQL

wordcount is a matrix computation too

$$A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}$$

The rows of A are the documents doc1, doc2, …, docm.

word count = colsum(A) = $A^T e$, where $e$ is the vector of all ones

Page 19: Recommendation and graph algorithms in Hadoop and SQL

inverted index is a matrix computation too

$$A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}$$

The rows of A are the documents doc1, doc2, …, docm.

Page 20: Recommendation and graph algorithms in Hadoop and SQL

inverted index is a matrix computation too

$$A^T = \begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix}$$

The rows of $A^T$ are the terms term1, term2, …, termn.
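As a rough illustration, an inverted index is the transpose of a sparse document-by-term matrix, and the word count is the sum along each row of that transpose. The dictionary-of-dictionaries storage and the toy data are assumptions for this sketch, not the layout used in the tutorial code.

```python
from collections import defaultdict


def transpose_sparse(rows):
    """{row_id: {col_id: value}} -> {col_id: {row_id: value}}."""
    cols = defaultdict(dict)
    for r, entries in rows.items():
        for c, v in entries.items():
            cols[c][r] = v
    return dict(cols)


# Hypothetical document-term matrix A: A[doc][term] = occurrences of term in doc.
A = {
    "doc1": {"hadoop": 2, "matrix": 1},
    "doc2": {"hadoop": 1},
}

index = transpose_sparse(A)  # A^T: for each term, the documents containing it
word_count = {term: sum(docs.values()) for term, docs in index.items()}  # A^T e
```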

Page 21: Recommendation and graph algorithms in Hadoop and SQL

A recommender system with social info

product_ratings

    pid  | uid  | rating
    pid8 | uid2 | 4
    pid9 | uid9 | 1
    pid2 | uid9 | 5
    pid9 | uid5 | 5
    pid6 | uid8 | 4
    pid1 | uid2 | 4
    pid3 | uid4 | 4
    pid5 | uid9 | 2
    pid9 | uid8 | 4
    pid9 | uid9 | 1

friends_links

    uid6 | uid1
    uid8 | uid9
    uid7 | uid7
    uid7 | uid4
    uid6 | uid2
    uid7 | uid1
    uid3 | uid1
    uid1 | uid8
    uid7 | uid3
    uid9 | uid1

Page 22: Recommendation and graph algorithms in Hadoop and SQL

A recommender system with social info

[Figure: product_ratings drawn as a matrix with rows pid1, pid2, …; friends_links drawn as a matrix with rows uid1, uid2, ….]

Page 23: Recommendation and graph algorithms in Hadoop and SQL

A recommender system with social info

product_ratings is the matrix R; friends_links is the matrix S.

Page 24: Recommendation and graph algorithms in Hadoop and SQL

A recommender system with social info

Recommend each item based on the average rating of all trusted users.

“$X = S R^T$” with something that is almost a matrix-matrix product:

$$X_{uid,pid} = \Big( \sum_{uid_2} S_{uid,uid_2} R_{uid_2,pid} \Big) \cdot \Big( \sum_{uid_2} [\, S_{uid,uid_2} \text{ and } R_{uid_2,pid} \neq 0 \,] \Big)^{-1}$$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise, and $R_{uid_2,pid}$ denotes the rating user $uid_2$ gave product $pid$.

[Figure: R drawn as a matrix with rows pid1, pid2, …; S drawn as a matrix with rows uid1, uid2, ….]
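A direct, single-machine sketch of this formula may help before the Hadoop version. It assumes trust is given as (uid, uid2) pairs meaning "uid trusts uid2" with unit weight, and ratings as (pid, uid2, rating) triples; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def social_average(trust, ratings):
    """Return {(uid, pid): average rating of pid over the users uid trusts}."""
    rated_by = defaultdict(list)
    for pid, uid2, rating in ratings:
        rated_by[uid2].append((pid, rating))

    total = defaultdict(float)  # numerator: sum of S * R terms
    count = defaultdict(int)    # denominator: number of nonzero terms
    for uid, uid2 in trust:
        for pid, rating in rated_by[uid2]:
            total[(uid, pid)] += rating
            count[(uid, pid)] += 1
    return {key: total[key] / count[key] for key in total}


# Hypothetical toy data.
trust = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
ratings = [("p1", "u2", 4), ("p1", "u3", 2), ("p2", "u3", 5)]
X = social_average(trust, ratings)
```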

Page 25: Recommendation and graph algorithms in Hadoop and SQL

Tools I like

hadoop streaming, dumbo, mrjob, hadoopy, C++

Page 26: Recommendation and graph algorithms in Hadoop and SQL

Tools I don’t use but other people seem to like …

pig, java, hbase, mahout, Eclipse, Cassandra

Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.

Page 27: Recommendation and graph algorithms in Hadoop and SQL

hadoop streaming

the map function is a program
    (key,value) pairs are sent via stdin
    output (key,value) pairs go to stdout

the reduce function is a program
    (key,value) pairs are sent via stdin
    keys are grouped
    output (key,value) pairs go to stdout
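A sketch of what such a pair of programs does, written as functions over lines so it can be run locally. In a real streaming job each function would read `sys.stdin` and print to stdout; here `sorted` stands in for the framework's grouping of keys, and tab-separated key/value lines are assumed.

```python
from itertools import groupby


def mapper(lines):
    # Emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # Input lines arrive grouped by key, so consecutive lines share a key.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"


# Local stand-in for: cat input | mapper | sort | reducer
output = list(reducer(sorted(mapper(["hadoop matrix", "hadoop"]))))
```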

Page 28: Recommendation and graph algorithms in Hadoop and SQL

mrjob (from Yelp)

a wrapper around hadoop streaming for map and reduce functions in python

    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()

Page 29: Recommendation and graph algorithms in Hadoop and SQL

Connected components in SQL and Hadoop


Page 30: Recommendation and graph algorithms in Hadoop and SQL

Connected components

3 “components” in this graph. How can we find them algorithmically … on a huge network?

Page 31: Recommendation and graph algorithms in Hadoop and SQL

Connected components

Algorithm
Assign each node a random component id.
For each node, take the minimum component id of itself and all neighbors.
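A minimal single-machine sketch of this algorithm. Two details are assumptions of the sketch rather than statements from the slide: the random ids are made distinct (a random permutation), and the update is repeated until no id changes. The graph below is hypothetical.

```python
import random


def connected_components(nodes, edges, seed=0):
    rng = random.Random(seed)
    # Assign each node a distinct random component id.
    ids = rng.sample(range(len(nodes)), len(nodes))
    comp = dict(zip(nodes, ids))

    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    changed = True
    while changed:
        # Each node takes the minimum id of itself and all neighbors.
        new = {n: min([comp[n]] + [comp[m] for m in neighbors[n]]) for n in nodes}
        changed = new != comp
        comp = new
    return comp


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
comp = connected_components(nodes, edges)
n_components = len(set(comp.values()))
```

Nodes that end with the same id are in the same component; the number of rounds needed grows with the diameter of the largest component.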

Page 32: Recommendation and graph algorithms in Hadoop and SQL

DEMO


Page 33: Recommendation and graph algorithms in Hadoop and SQL

Computing Connected Components in SQL

Graph
    Edges : id | head | tail

“Vector”
    v : id | comp
    initialized to random component

    CREATE TABLE v2 AS (
      SELECT
        e.tail AS id,
        MIN(v.comp) AS comp
      FROM edges e
      INNER JOIN v
      ON e.head = v.id
      GROUP BY e.tail
    );
    -- Note: a node's own id enters the minimum only if edges
    -- contains a self-loop (head = tail) for that node.

    DROP TABLE v;
    ALTER TABLE v2
      RENAME TO v;

    ... Repeat ...
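The repeat loop can be sketched with Python's built-in `sqlite3` module. Several choices here are assumptions of this sketch, not part of the talk: the edges table stores both directions plus a self-loop per node so each node sees its own id, the initial component id is the node id instead of a random number (to keep the run reproducible), and the loop stops once the table no longer changes.

```python
import sqlite3


def connected_components_sql(edge_list):
    nodes = sorted({n for e in edge_list for n in e})
    con = sqlite3.connect(":memory:")

    # Edges in both directions, plus self-loops (assumption of this sketch).
    rows = set()
    for a, b in edge_list:
        rows.update([(a, b), (b, a)])
    rows.update((n, n) for n in nodes)
    con.execute("CREATE TABLE edges (head INTEGER, tail INTEGER)")
    con.executemany("INSERT INTO edges VALUES (?, ?)", sorted(rows))

    con.execute("CREATE TABLE v (id INTEGER, comp INTEGER)")
    con.executemany("INSERT INTO v VALUES (?, ?)", [(n, n) for n in nodes])

    while True:
        before = dict(con.execute("SELECT id, comp FROM v"))
        con.execute(
            """CREATE TABLE v2 AS
               SELECT e.tail AS id, MIN(v.comp) AS comp
               FROM edges e INNER JOIN v ON e.head = v.id
               GROUP BY e.tail"""
        )
        con.execute("DROP TABLE v")
        con.execute("ALTER TABLE v2 RENAME TO v")
        after = dict(con.execute("SELECT id, comp FROM v"))
        if after == before:  # no id changed: done
            return after


comp_sql = connected_components_sql([(1, 2), (2, 3), (4, 5)])
```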

Page 34: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product and connected components in Hadoop

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$
    Google’s PageRank
    Word count, average rating

“$A^T x = y$”, with $y_i = \min(x_i, \min_k A_{ki} x_k)$
    Connected components

See example: matrix-hadoop/codes/smatvec.py
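One way to read the two formulas side by side is as the same loop with a different "combine" step: a sum for the ordinary product, a min for connected components. The sketch below is a hypothetical single-machine illustration (the function and argument names are not from the tutorial code), and for the min version it treats each stored edge as simply passing along the neighbor's label.

```python
def generalized_matvec(A, x, combine, multiply, transpose=False):
    """A is a list of (i, k, value) nonzeros.
    Returns y with y[i] = combine over k of multiply(A[i][k], x[k])."""
    y = {}
    for i, k, a in A:
        if transpose:
            i, k = k, i
        term = multiply(a, x[k])
        y[i] = term if i not in y else combine(y[i], term)
    return y


# Ordinary product: combine with +, multiply with *.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
x = {0: 10.0, 1: 20.0}
y_sum = generalized_matvec(A, x, combine=lambda s, t: s + t,
                           multiply=lambda a, v: a * v)

# Connected-components flavor: combine with min, then include x_i itself.
labels = {0: 5, 1: 2, 2: 7}
edges = [(0, 1, 1), (1, 0, 1)]  # nodes 0 and 1 are linked; node 2 is isolated
y_min = generalized_matvec(edges, labels, combine=min,
                           multiply=lambda a, v: v, transpose=True)
new_labels = {i: min(labels[i], y_min.get(i, labels[i])) for i in labels}
```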

Page 35: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$

A is stored by “node”

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

v initially random

    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080

Follow along: matrix-hadoop/codes/smatvec.py

Page 36: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product (in pictures)

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$

Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i
Reduce 2: Output sum($A_{ik} x_k$), which gives y

Page 37: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product (in pictures)

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$

Input: A and x
Map 1: Align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                  # row
                   (float(vals[1]),))        # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],              # column
                       (row,                 # i, Aij
                        float(vals[i + 1])))

Page 38: Recommendation and graph algorithms in Hadoop and SQL

“Matrix-vector” for connected components

“$A^T x = y$”, with $y_i = \min(x_i, \min_k A_{ki} x_k)$

Input: A and x
Map 1: Align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                  # row
                   (float(vals[1]),))        # vi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                # The slide's tuple is cut off after the tail. Keeping the
                # edge value is an assumption, so that matrix entries have
                # length 2 and differ from the length-1 vector entries.
                yield (row,                  # head
                       (vals[i],             # tail
                        float(vals[i + 1])))

Page 39: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product (in pictures)

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$

Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note that you should use a secondary sort to avoid reading both in memory.

Page 40: Recommendation and graph algorithms in Hadoop and SQL

“Matrix-vector” for connected components

“$A^T x = y$”, with $y_i = \min(x_i, \min_k A_{ki} x_k)$

Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], vecval)

Note that you should use a secondary sort to avoid reading both in memory.

Page 41: Recommendation and graph algorithms in Hadoop and SQL

Matrix-vector product (in pictures)

$Ax = y$, with $y_i = \sum_k A_{ik} x_k$

Input: A and x
Map 1: Align on columns
Reduce 1: Output $A_{ik} x_k$ keyed on row i
Reduce 2: Output sum($A_{ik} x_k$), which gives y

    def sumred(self, key, vals):
        yield (key, sum(vals))
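The whole two-stage job can be simulated in memory to check the logic before running it on Hadoop. In this sketch `run_stage` and `identity` are hypothetical stand-ins for what mrjob and Hadoop provide, the three step functions are plain-function versions of the methods on the slides, and the input is the sample matrix and vector shown earlier in the deck.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """Tiny stand-in for one MapReduce step: map, group by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                       # vector entry: "k x_k"
        yield (vals[0], (float(vals[1]),))
    else:                                    # matrix row: "i k A_ik k A_ik ..."
        row = vals[0]
        for j in range(1, len(vals), 2):
            yield (vals[j], (row, float(vals[j + 1])))


def joinred(key, vals):
    # All values here share column k: one x_k and the entries A_ik.
    vecval = sum(v[0] for v in vals if len(v) == 1)
    for v in vals:
        if len(v) == 2:
            yield (v[0], v[1] * vecval)      # A_ik * x_k keyed on row i


def identity(pair):
    yield pair


def sumred(key, vals):
    yield (key, sum(vals))


matrix = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
          "3 4 -1.45", "4 2 0.061"]
vector = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]

partial = run_stage(matrix + vector, joinmap, joinred)
y = dict(run_stage(partial, identity, sumred))
```

Keys stay strings, as they would when read from text files. Unlike a job with a secondary sort, this version holds each group in memory.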

Page 42: Recommendation and graph algorithms in Hadoop and SQL

Our social recommender

$R^T S$

Follow along: matrix-hadoop/recsys/recsys.py

R is stored entry-wise (columns: Object ID, User ID, Rating)

    $ gunzip -c data/rating.txt.gz
    139431556 591156 5
    139431556 1312460676 5
    139431556 204358 4
    139431556 368725 5

S is stored entry-wise (columns: My ID, Other ID, Trust)

    $ gunzip -c data/rating.txt.gz
    3287060356 232085 -1
    3288305540 709420 1
    3290337156 204418 -1
    3294138244 269243 -1

Page 43: Recommendation and graph algorithms in Hadoop and SQL

Matrix-matrix product

$AB = C$, with $C_{ij} = \sum_k A_{ik} B_{kj}$

Follow along: matrix-hadoop/codes/matmat.py

Conceptually, the first step is the same as the matrix-vector product with a block of vectors.

Page 44: Recommendation and graph algorithms in Hadoop and SQL

Matrix-matrix product (in pictures)

$AB = C$, with $C_{ij} = \sum_k A_{ik} B_{kj}$

Input: A and B
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$), which gives C

Page 45: Recommendation and graph algorithms in Hadoop and SQL

Social recommender (in code)

Input: A and B
Map 1: Align on columns

    def joinmap(self, key, line):
        parts = line.split('\t')
        if len(parts) == 8:      # ratings
            objid = parts[0].strip()
            uid = parts[1].strip()
            rat = int(parts[2])
            yield (uid, (objid, rat))
        elif len(parts) == 4:    # trust
            myid = parts[0].strip()
            otherid = parts[1].strip()
            value = int(parts[2])
            if value > 0:
                yield (otherid, (myid,))

Page 46: Recommendation and graph algorithms in Hadoop and SQL

Matrix-matrix product (in pictures)

Input: A and B
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)

    def joinred(self, key, vals):
        tusers = []   # uids that trust key
        ratobjs = []  # objs rated by uid=key
        for val in vals:
            if len(val) == 1:
                tusers.append(val[0])
            else:
                ratobjs.append(val)

        for (objid, rat) in ratobjs:
            for uid in tusers:
                yield ((uid, objid), rat)

Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.

Page 47: Recommendation and graph algorithms in Hadoop and SQL

Matrix-matrix product (in pictures)

$AB = C$, with $C_{ij} = \sum_k A_{ik} B_{kj}$

Input: A and B
Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$), which gives C

    def avgred(self, key, vals):
        s = 0.
        n = 0
        for val in vals:
            s += val
            n += 1
        # the smoothed average of ratings
        yield (key,
               (s + self.options.avg) / float(n + 1))
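The smoothing in `avgred` can be seen in isolation: the option value is added as if it were one extra rating, so items with few trusted ratings are pulled toward it. The sketch below assumes that value is something like a global average rating (the slide only shows it as `self.options.avg`); the function name and numbers are hypothetical.

```python
def smoothed_average(ratings, prior):
    """(sum of ratings + prior) / (count + 1): the prior acts as one extra rating."""
    s = 0.0
    n = 0
    for r in ratings:
        s += r
        n += 1
    return (s + prior) / (n + 1)


one_rating = smoothed_average([5], prior=3.0)        # a single 5 is pulled toward 3
many_ratings = smoothed_average([5] * 9, prior=3.0)  # nine 5s are barely affected
```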

Page 48: Recommendation and graph algorithms in Hadoop and SQL

Better ways to store matrices in Hadoop

Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.

No need for “integer” keys that fall between 1 and n!

Page 49: Recommendation and graph algorithms in Hadoop and SQL

Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion)
A few columns (under 10,000)

Used in
    regression and general linear models with many samples
    block iterative methods
    panel factorizations
    simulation data analysis
    big-data SVD/PCA

[Image: a tall-and-skinny matrix A, from the tinyimages collection]

Page 50: Recommendation and graph algorithms in Hadoop and SQL

Questions?

Image from rockysprings, deviantart, CC share-alike