Recommendation and graph algorithms in Hadoop and SQL

David F. Gleich, Assistant Professor, Computer Science, Purdue University


Code github.com/dgleich/matrix-hadoop-tutorial

Ancestry.com

@dgleich dgleich@purdue.edu

Matrix computations

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
$$

Operations: $Ax$

Linear systems: $Ax = b$

Least squares: $\min \|Ax - b\|$

Eigenvalues: $Ax = \lambda x$

Outcomes

Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations.

See how to work with big graphs as large edge lists in Hadoop and SQL. Example: connected components.

Understand how to use Hadoop to compute these matrix methods at scale for big data. Example: recommenders with social network info.


matrix computations ≠ linear algebra

World’s simplest recommendation system.

Suggest the average rating.

A SQL statement as a matrix computation

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

How do I find the average rating for each product?


    SELECT
      p.product_id,
      p.name,
      AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
      ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC


This SQL statement is a matrix computation!

Image from rockysprings, deviantart, CC share-alike

    SELECT ... AVG(pr.rating) ... GROUP BY p.product_id

product_ratings

    pid8 uid2 4
    pid9 uid9 1
    pid2 uid9 5
    pid9 uid5 5
    pid6 uid8 4
    pid1 uid2 4
    pid3 uid4 4
    pid5 uid9 2
    pid9 uid8 4
    pid9 uid9 1

Is a matrix!

[Figure: product_ratings drawn as a matrix with one row per product, pid1 through pid9.]

But it’s a weird matrix: missing entries!

Matrix → SELECT AVG(r) ... GROUP BY pid → Vector (average of ratings)

But it’s a weird matrix and not a linear operator

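$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\qquad
\operatorname{avg}(A) = \begin{bmatrix}
\sum_j A_{1,j} \,/\, \sum_j [A_{1,j} \neq 0] \\
\sum_j A_{2,j} \,/\, \sum_j [A_{2,j} \neq 0] \\
\vdots \\
\sum_j A_{m,j} \,/\, \sum_j [A_{m,j} \neq 0]
\end{bmatrix}
$$

Here $[\,\cdot\,]$ is 1 when the condition holds and 0 otherwise.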
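
To make the "not a linear operator" point concrete, here is a minimal Python sketch. It assumes the ratings are held as (product, user, rating) triples, which play the role of the nonzero matrix entries; the function name `avg_by_product` and the tiny data sets are illustrative only.

```python
from collections import defaultdict

def avg_by_product(triples):
    """Row-wise average over the stored (nonzero) entries, like AVG ... GROUP BY pid."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, _uid, rating in triples:
        sums[pid] += rating
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}

# A few of the entries from the product_ratings table.
ratings = [("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
           ("pid9", "uid5", 5), ("pid9", "uid8", 4)]
averages = avg_by_product(ratings)

# Two hypothetical matrices with non-overlapping entries, so their sum is
# just the union of their triples.
a = [("pid1", "uid1", 4)]
b = [("pid1", "uid2", 2)]
avg_of_sum = avg_by_product(a + b)["pid1"]                           # 3.0
sum_of_avgs = avg_by_product(a)["pid1"] + avg_by_product(b)["pid1"]  # 6.0
```

Because `avg_of_sum` differs from `sum_of_avgs`, the average is not linear: the missing entries change the denominator.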

matrix computations ≠ linear algebra

Hadoop, MapReduce, and Matrix Methods


MapReduce

[Figure: MapReduce data flow. Map reads data and emits (key, value) pairs; Shuffle groups the values by key; each Reduce receives a key with all of its values and writes data.]

The MapReduce Framework

Originated at Google for indexing web pages and computing PageRank.

Express algorithms in “data-local operations”.

Implement one type of communication: shuffle.

Shuffle moves all data with the same key to the same reducer.

Input stored in triplicate

Map output persisted to disk before shuffle

Reduce input/output on disk

[Figure: input blocks 1 to 4 feed map tasks (M), whose output is shuffled to a reduce task (R).]

Data scalable

Fault-tolerance by design

wordcount is a matrix computation too

    map(document):
        for word in document:
            emit (word, 1)

    reduce(word, counts):
        emit (word, sum(counts))

[Figure: documents D1 to D4 are mapped to pairs such as (matrix, 1), (bigdata, 1) and (hadoop, 1); the shuffle groups the pairs by word.]

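$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
$$

$$
\text{word count} = \operatorname{colsum}(A) = A^T e
$$

where $e$ is the vector of all ones.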
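
A small Python sketch of word count as a column sum. It assumes a document-term layout, with one row of A per document and one column per word; the three toy documents are made up for illustration.

```python
# Toy corpus (hypothetical); each string is one document.
docs = ["matrix bigdata hadoop", "bigdata hadoop", "matrix bigdata"]
vocab = sorted({word for doc in docs for word in doc.split()})

# A[i][j] = number of times word j appears in document i.
A = [[doc.split().count(word) for word in vocab] for doc in docs]

# e is the vector of all ones, one entry per document.
e = [1] * len(docs)

# (A^T e)_j = sum_i A[i][j] * e[i], which is the column sum for word j.
colsum = [sum(A[i][j] * e[i] for i in range(len(docs)))
          for j in range(len(vocab))]
word_count = dict(zip(vocab, colsum))
```

The MapReduce version computes the same sums without ever forming A: each emitted (word, 1) pair is one unit of a nonzero entry.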

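inverted index is a matrix computation too

$$
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\quad\Longrightarrow\quad
\begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix} = A^T
$$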
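
As a sketch of the same idea in Python, assume the document-term matrix is stored sparsely as a dict of dicts (documents map to word counts). Transposing it gives, for each word, the documents that contain it. The names `transpose` and `inverted_index` and the data are illustrative.

```python
from collections import defaultdict

# Sparse document-term matrix: A[doc][word] = count (hypothetical data).
A = {
    "d1": {"matrix": 2, "hadoop": 1},
    "d2": {"hadoop": 3},
    "d3": {"matrix": 1, "sql": 1},
}

def transpose(sparse):
    """Swap the row and column index of every stored entry."""
    out = defaultdict(dict)
    for row, cols in sparse.items():
        for col, val in cols.items():
            out[col][row] = val
    return dict(out)

# A^T: each word maps to the documents that contain it.
inverted_index = transpose(A)
```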

A recommender system with social info

product_ratings

friends_links

    uid6 uid1
    uid8 uid9
    uid7 uid7
    uid7 uid4
    uid6 uid2
    uid7 uid1
    uid3 uid1
    uid1 uid8
    uid7 uid3
    uid9 uid1

[Figure: product_ratings and friends_links, each drawn as a matrix.]

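Recommend each item based on the average rating of all trusted users

“$X = S R^T$” with something that is almost a matrix-matrix product

[Figure: R, the ratings matrix (rows labeled by product, starting with pid1), and S, the trust matrix (rows labeled by user, starting with uid1).]

$$
X_{\mathit{uid},\mathit{pid}} =
\frac{\sum_{\mathit{uid2}} S_{\mathit{uid},\mathit{uid2}} \, (R^T)_{\mathit{uid2},\mathit{pid}}}
     {\sum_{\mathit{uid2}} [\, S_{\mathit{uid},\mathit{uid2}} \text{ and } (R^T)_{\mathit{uid2},\mathit{pid}} \neq 0 \,]}
$$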
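
The formula can be checked on a toy example. The sketch below assumes S is a 0/1 trust matrix stored as a dict of sets and R is stored as (pid, uid) → rating; the function name `recommend` and all data are hypothetical.

```python
from collections import defaultdict

# trust[u] = users that u trusts (the nonzeros in row u of S).
trust = {"uid1": {"uid2", "uid3"}, "uid2": {"uid3"}}
# ratings[(pid, uid)] = rating (the nonzeros of R).
ratings = {("pid1", "uid2"): 4, ("pid1", "uid3"): 2, ("pid2", "uid3"): 5}

def recommend(trust, ratings):
    """X[(uid, pid)] = average rating of pid over the users that uid trusts."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (pid, rater), rating in ratings.items():
        for user, trusted in trust.items():
            if rater in trusted:
                sums[(user, pid)] += rating   # numerator: S * R^T
                counts[(user, pid)] += 1      # denominator: count of nonzero terms
    return {key: sums[key] / counts[key] for key in sums}

X = recommend(trust, ratings)
```

The numerator alone is an ordinary matrix-matrix product; dividing by the count of contributing terms is what makes it only "almost" one.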

Tools I like

hadoop streaming, dumbo, mrjob, hadoopy, C++

Tools I don’t use but other people seem to like …

pig, java, hbase, mahout, Eclipse, Cassandra

Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.

hadoop streaming

The map function is a program: (key, value) pairs are sent via stdin, and output (key, value) pairs go to stdout.

The reduce function is a program: (key, value) pairs are sent via stdin, keys are grouped, and output (key, value) pairs go to stdout.
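
A minimal sketch of what such programs look like, written as plain Python functions over lines of text so it can be run without a cluster. It assumes tab-separated `key<TAB>value` lines (the Hadoop streaming default) and uses a local sort as a stand-in for the shuffle. The names `mapper`, `reducer` and `run_locally` are illustrative.

```python
import itertools

def mapper(lines):
    """Map step: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

def reducer(sorted_lines):
    """Reduce step: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(value) for _, value in group))

def run_locally(lines):
    """Stand-in for the shuffle: sort the mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["Hadoop and SQL", "sql on hadoop"])
```

In a real streaming job the mapper and reducer would be two separate scripts that read `sys.stdin` and print to stdout; the framework does the sorting and grouping between them.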

mrjob from

a wrapper around hadoop streaming for map and reduce functions in python

    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            # emit (word, 1) for every word on the line
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            # all counts for the same word arrive together
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()

Connected components in SQL and Hadoop


Connected components

3 “components” in this graph

How can we find them algorithmically … on a huge network?

Algorithm: Assign each node a random component id. For each node, take the minimum component id of itself and all neighbors. Repeat until no component id changes.
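
A minimal in-memory sketch of this algorithm in Python. It assumes the random ids are distinct (a shuffled range), since two components that start with the same id could not be told apart. The function name and the toy graph are illustrative.

```python
import random

def connected_components(nodes, edges, seed=0):
    """Min-label propagation: repeat until no component id changes."""
    rng = random.Random(seed)
    ids = list(range(len(nodes)))
    rng.shuffle(ids)                      # distinct random component ids
    comp = dict(zip(nodes, ids))

    neighbors = {v: set() for v in nodes}
    for a, b in edges:                    # treat edges as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)

    changed = True
    while changed:
        # each node takes the minimum id of itself and all neighbors
        new = {v: min([comp[v]] + [comp[u] for u in neighbors[v]])
               for v in nodes}
        changed = new != comp
        comp = new
    return comp

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
comp = connected_components(nodes, edges)
```

Nodes end with the same id exactly when they are in the same component. The number of rounds grows with the diameter of the largest component.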

Computing Connected Components in SQL

Graph: edges : id | head | tail

“Vector”: v : id | comp, initialized to a random component

    CREATE TABLE v2 AS (
      SELECT
        e.tail AS id,
        MIN(v.comp) AS comp
      FROM edges e
      INNER JOIN v
        ON e.head = v.id
      GROUP BY e.tail
    );

    DROP TABLE v;
    ALTER TABLE v2 RENAME TO v;

    ... Repeat ...
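
To try the iteration end to end, here is a sketch that drives the same statements from Python with the standard-library `sqlite3` module. Several choices are assumptions made for the sketch: the edges table has only head and tail columns, each undirected edge is stored in both directions, a self-loop is added per node so that a node's own id enters the MIN, and the initial ids are node positions rather than random values, to keep the result reproducible. SQLite takes `CREATE TABLE ... AS SELECT` without the outer parentheses.

```python
import sqlite3

STEP = """
CREATE TABLE v2 AS
SELECT e.tail AS id, MIN(v.comp) AS comp
FROM edges e
INNER JOIN v ON e.head = v.id
GROUP BY e.tail
"""

def components_sql(nodes, edges):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (head TEXT, tail TEXT)")
    con.execute("CREATE TABLE v (id TEXT, comp INTEGER)")

    rows = []
    for a, b in edges:
        rows += [(a, b), (b, a)]            # both directions
    rows += [(n, n) for n in nodes]         # self-loops keep every node in v
    con.executemany("INSERT INTO edges VALUES (?, ?)", rows)
    con.executemany("INSERT INTO v VALUES (?, ?)",
                    [(n, i) for i, n in enumerate(nodes)])

    while True:
        before = dict(con.execute("SELECT id, comp FROM v"))
        con.execute(STEP)
        con.execute("DROP TABLE v")
        con.execute("ALTER TABLE v2 RENAME TO v")
        after = dict(con.execute("SELECT id, comp FROM v"))
        if after == before:                 # converged: no id changed
            return after

result = components_sql(["a", "b", "c", "d", "e"],
                        [("a", "b"), ("b", "c"), ("d", "e")])
```

Without the self-loops, a node with no incoming edge would drop out of `v` after the first pass, because the inner join only produces rows for edge tails.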

Matrix-vector product and connected components in Hadoop

$Ax = y$: Google’s PageRank, word count, average rating

“$Ax = y$” with $y_i = \min\big(x_i, \min_{k:\,A_{ik} \neq 0} x_k\big)$: connected components

See example: matrix-hadoop/codes/smatvec.py

Matrix-vector product

Ax = y

A is stored by “node”

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

v initially random

    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080

Follow along: matrix-hadoop/codes/smatvec.py

Matrix-vector product (in pictures)

$Ax = y$

Input: A and x

Map 1: Align on columns

Reduce 1: Output $A_{ik} x_k$ keyed on row $i$

Reduce 2: Output $\sum_k A_{ik} x_k$

Ax = y

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],              # row
                   (float(vals[1]),))    # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],          # column
                       (row,             # i, Aij
                        float(vals[i+1])))

“Matrix-vector” for connected components

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],              # row
                   (float(vals[1]),))    # vi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (row,              # head
                       (vals[i],         # tail
                        float(vals[i+1])))  # weight, kept so this is not a 1-tuple like the vector value

“$Ax = y$” with $y_i = \min\big(x_i, \min_{k:\,A_{ik} \neq 0} x_k\big)$

$Ax = y$

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note: use a secondary sort so the vector value arrives first and the matrix values need not be buffered in memory.

“Matrix-vector” for connected components

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], vecval)

“$Ax = y$” with $y_i = \min\big(x_i, \min_{k:\,A_{ik} \neq 0} x_k\big)$

$Ax = y$

Reduce 2: Output $\sum_k A_{ik} x_k$

    def sumred(self, key, vals):
        yield (key, sum(vals))
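
The two stages can be simulated locally to see how they fit together. The sketch below reuses the logic of joinmap, joinred and sumred as plain functions (no mrjob, no `self`) and replaces the shuffle with an in-memory grouping. The helper names `shuffle` and `matvec` are made up for this sketch, and the input lines are the sample matrix and vector shown earlier.

```python
from collections import defaultdict

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                        # vector entry: "k xk"
        yield (vals[0], (float(vals[1]),))
    else:                                     # matrix row: "i k Aik k Aik ..."
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i], (row, float(vals[i + 1])))

def joinred(key, vals):
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for row, aik in matvals:
        yield (row, aik * vecval)             # Aik * xk keyed on row i

def sumred(key, vals):
    yield (key, sum(vals))

def shuffle(pairs):
    """In-memory stand-in for the shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def matvec(matrix_lines, vector_lines):
    stage1 = shuffle(kv for line in matrix_lines + vector_lines
                     for kv in joinmap(line))
    products = [kv for key, vals in stage1.items() for kv in joinred(key, vals)]
    stage2 = shuffle(products)
    return dict(kv for key, vals in stage2.items() for kv in sumred(key, vals))

A_lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
           "3 4 -1.45", "4 2 0.061"]
x_lines = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]
y = matvec(A_lines, x_lines)
```

Swapping `sum` for `min` in the last stage, and passing the vector value through instead of a product, gives the connected-components variant.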

Our social recommender

Follow along: matrix-hadoop/recsys/recsys.py

R is stored entry-wise (columns: Object ID, User ID, Rating)

    $ gunzip -c data/rating.txt.gz
    139431556 591156 5
    139431556 1312460676 5
    139431556 204358 4
    139431556 368725 5

S is stored entry-wise (columns: My ID, Other ID, Trust)

    $ gunzip -c data/rating.txt.gz
    3287060356 232085 -1
    3288305540 709420 1
    3290337156 204418 -1
    3294138244 269243 -1

Matrix-matrix product

Follow along: matrix-hadoop/codes/matmat.py

$$
AB = C, \qquad C_{ij} = \sum_k A_{ik} B_{kj}
$$

Conceptually, the first step is the same as the matrix-vector product with a block of vectors.

Matrix-matrix product (in pictures)

$$
AB = C, \qquad C_{ij} = \sum_k A_{ik} B_{kj}
$$

Input: A and B

Map 1: Align on columns

Reduce 1: Output $A_{ik} B_{kj}$ keyed on $(i,j)$

Reduce 2: Output $\sum_k A_{ik} B_{kj}$
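
A local sketch of the same two stages for a matrix-matrix product. It assumes both matrices are stored entry-wise as (row, column, value) triples and tags each value with its source matrix so the first reducer can tell them apart. All names and the 2×2 example are illustrative and not taken from matmat.py.

```python
from collections import defaultdict

def joinmap(name, i, j, val):
    """Map 1: key both matrices on the shared index k."""
    if name == "A":
        yield (j, ("A", i, val))      # A_ik keyed on its column k
    else:
        yield (i, ("B", j, val))      # B_kj keyed on its row k

def joinred(k, vals):
    """Reduce 1: output A_ik * B_kj keyed on (i, j)."""
    a_entries = [(i, v) for tag, i, v in vals if tag == "A"]
    b_entries = [(j, v) for tag, j, v in vals if tag == "B"]
    for i, a in a_entries:
        for j, b in b_entries:
            yield ((i, j), a * b)

def sumred(key, vals):
    """Reduce 2: output the sum over k."""
    yield (key, sum(vals))

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def matmat(A, B):
    mapped = [kv for (i, j, v) in A for kv in joinmap("A", i, j, v)]
    mapped += [kv for (i, j, v) in B for kv in joinmap("B", i, j, v)]
    stage1 = shuffle(mapped)
    products = [kv for k, vals in stage1.items() for kv in joinred(k, vals)]
    stage2 = shuffle(products)
    return dict(kv for key, vals in stage2.items() for kv in sumred(key, vals))

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], nonzeros only.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
C = matmat(A, B)
```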

Social recommender (in code)

    def joinmap(self, key, line):
        parts = line.split('\t')
        if len(parts) == 8: # ratings
            objid = parts[0].strip()
            uid = parts[1].strip()
            rat = int(parts[2])
            yield (uid, (objid, rat))
        elif len(parts) == 4: # trust
            myid = parts[0].strip()
            otherid = parts[1].strip()
            value = int(parts[2])
            if value > 0:
                yield (otherid, (myid,))

    def joinred(self, key, vals):
        tusers = []   # uids that trust key
        ratobjs = []  # objs rated by uid=key
        for val in vals:
            if len(val) == 1:
                tusers.append(val[0])
            else:
                ratobjs.append(val)

        for (objid, rat) in ratobjs:
            for uid in tusers:
                yield ((uid, objid), rat)

Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.

$$
AB = C, \qquad C_{ij} = \sum_k A_{ik} B_{kj}
$$

Reduce 2: Output $\sum_k A_{ik} B_{kj}$

    def avgred(self, key, vals):
        s = 0.
        n = 0
        for val in vals:
            s += val
            n += 1
        # the smoothed average of ratings
        yield key, (s + self.options.avg) / float(n + 1)
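
The smoothing in avgred can be read as adding one extra rating equal to a prior value. A standalone sketch, where the parameter `prior` is a hypothetical stand-in for `self.options.avg`:

```python
def smoothed_average(ratings, prior):
    """(sum of ratings + prior) / (count + 1): the prior counts as one extra rating."""
    s = 0.0
    n = 0
    for rating in ratings:
        s += rating
        n += 1
    return (s + prior) / float(n + 1)

one_rating = smoothed_average([5], 3.0)            # pulled halfway toward the prior
four_ratings = smoothed_average([5, 5, 5, 5], 3.0) # the prior matters less
no_ratings = smoothed_average([], 3.0)             # falls back to the prior
```

With few trusted ratings the estimate stays close to the prior; with many, it approaches the plain average.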

Better ways to store matrices in Hadoop

Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.

No need for “integer” keys that fall between 1 and n!

Tall-and-Skinny matrices (m ≫ n)

Many rows (like a billion)

A few columns (under 10,000)

Used in

regression and general linear models with many samples

block iterative methods

panel factorizations

simulation data analysis

big-data SVD/PCA

Image from tinyimages collection

Questions?
