algorithms and abstractions for stream-and-sort. announcements thursday: there will be a short quiz...

36
Algorithms and Abstractions for Stream-and-Sort

Upload: elvin-blair

Post on 05-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Algorithms and Abstractions for Stream-and-Sort

Page 2: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Announcements

Thursday: • there will be a short quiz• quiz will close at midnight Thursday,

but probably best to do it soon after class

Page 3: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Recap – Last week• Algorithms that use only stream and sort primitives:– Naïve Bayes training (event counting)– Inverted indices, preprocessing for distributional

clustering, …– Naïve Bayes classification (with event counts out of

memory)• Phrase finding task– Score all the phrases in a corpus for “phrasiness” and

“informativeness”– Statistics based on BLRT, Pointwise KL-divergence

Page 4: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Today• Phrase finding task – the implementation– Score all the phrases in a corpus for

“phrasiness” and “informativeness”– Statistics based on BLRT, Pointwise KL-

divergence• Phrase finding algorithm• Abstractions for Map-Reduce• Guest lecture: Manik Varma, MSR India

Page 5: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Phrase Finding

William W. Cohen

Page 6: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

scoring

Page 7: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

“Phraseness”1 – based on BLRT–Define• pi=ki /ni, p=(k1+k2)/(n1+n2),

• L(p,k,n) = pk(1-p)n-k

comment

k1 C(W1=x ^ W2=y)

how often bigram x y occurs in corpus C

n1 C(W1=x) how often word x occurs in corpus C

k2 C(W1≠x^W2=y)

how often y occurs in C after a non-x

n2 C(W1≠x) how often a non-x occurs in C

Phrase x y: W1=x ^ W2=y

Does y occur at the same frequency after x as in other positions?

Page 8: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

“Informativeness”1 – based on BLRT

–Define• pi=ki /ni, p=(k1+k2)/(n1+n2),

• L(p,k,n) = pk(1-p)n-k

Phrase x y: W1=x ^ W2=y and two corpora, C and B

comment

k1 C(W1=x ^ W2=y)

how often bigram x y occurs in corpus C

n1 C(W1=* ^ W2=*)

how many bigrams in corpus C

k2 B(W1=x^W2=y) how often x y occurs in background corpus

n2 B(W1=* ^ W2=*)

how many bigrams in background corpus

Does x y occur at the same frequency in both corpora?

Page 9: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

The breakdown: what makes a good phrase

– To compare distributions, use KL-divergence

“Pointwise KL divergence”

Phraseness: difference between bigram and unigram language model in foreground

Bigram model: P(x y)=P(x)P(y|x)

Unigram model: P(x y)=P(x)P(y)

Page 10: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

The breakdown: what makes a good phrase

– To compare distributions, use KL-divergence

“Pointwise KL divergence”

Informativeness: difference between foreground and background models

Bigram model: P(x y)=P(x)P(y|x)

Unigram model: P(x y)=P(x)P(y)

w would be a phrase here

Page 11: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

The breakdown: what makes a good phrase

– To compare distributions, use KL-divergence

“Pointwise KL divergence”

Combined: difference between foreground bigram model and background unigram model

Bigram model: P(x y)=P(x)P(y|x)

Unigram model: P(x y)=P(x)P(y)

Page 12: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Implementation• Request-and-answer pattern

– Main data structure: tables of key-value pairs• key is a phrase x y • value is a mapping from a attribute names (like phraseness,

freq-in-B, …) to numeric values.

– Keys and values are just strings– We’ll operate mostly by sending messages to this data

structure and getting results back, or else streaming thru the whole table

– For really big data: we’d also need tables where key is a word and val is set of attributes of the word (freq-in-B, freq-in-C, …)

Page 13: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Generating and scoring phrases: 1

• Stream through foreground corpus and count events “W1=x ^ W2=y” the same way we do in training naive Bayes: stream-and sort and accumulate deltas (a “sum-reduce”)– Don’t bother generating “boring” phrases (e.g., crossing a

sentence, contain a stopword, …)• Then stream through the output and convert to phrase, attributes-

of-phrase records with one attribute: freq-in-C=n• Stream through foreground corpus and count events “W1=x” in a

(memory-based) hashtable….• This is enough* to compute phrasiness:

– ψp(x y) = f( freq-in-C(x), freq-in-C(y), freq-in-C(x y))

• …so you can do that with a scan through the phrase table that adds an extra attribute (holding word frequencies in memory).

* actually you also need total # words and total #phrases….

Page 14: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Generating and scoring phrases: 2

• Stream through background corpus and count events “W1=x ^ W2=y” and convert to phrase, attributes-of-phrase records with one attribute: freq-in-B=n

• Sort the two phrase-tables: freq-in-B and freq-in-C and run the output through another “reducer” that– appends together all the attributes associated

with the same key, so we now have elements like

Page 15: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Generating and scoring phrases: 3

• Scan the through the phrase table one more time and add the informativeness attribute and the overall quality attribute

Summary, assuming word vocabulary nW is small:• Scan foreground corpus C for phrases: O(nC) producing mC

phrase records – of course mC << nC

• Compute phrasiness: O(mC) • Scan background corpus B for phrases: O(nB) producing mB • Sort together and combine records: O(m log m), m=mB +

mC

• Compute informativeness and combined quality: O(m)

Assumes word counts fit in memory

Page 16: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Is there a stream-and-sort analog of this request-and-answer pattern?

id1 found an aardvark in zynga’s farmville today!id2 …id3 ….id4 …id5 …..

Test data Record of all event counts for each word

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Classification logic

found ~ctr to id1

aardvark ~ctr to id2

…today ~ctr to idi

Counter records

requests

Combine and sort

Recap

Page 17: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

A stream-and-sort analog of the request-and-answer pattern…

Record of all event counts for each word

w Countsaardvark C[w^Y=sports]

=2

agent …

zynga …

found ~ctr to id1

aardvark ~ctr to id1

…today ~ctr to id1

Counter records

requests

Combine and sort

w Counts

aardvark C[w^Y=sports]=2

aardvark ~ctr to id1

agent C[w^Y=sports]=…

agent ~ctr to id345

agent ~ctr to id9854

… ~ctr to id345

agent ~ctr to id34742

zynga C[…]

zynga ~ctr to id1

Request-handling logic

Recap

Page 18: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

A stream-and-sort analog of the request-and-answer pattern…

requests

Combine and sort

w Counts

aardvark C[w^Y=sports]=2

aardvark ~ctr to id1

agent C[w^Y=sports]=…

agent ~ctr to id345

agent ~ctr to id9854

… ~ctr to id345

agent ~ctr to id34742

zynga C[…]

zynga ~ctr to id1

Request-handling logic

•previousKey = somethingImpossible• For each (key,val) in input:• If key==previousKey

• Answer(recordForPrevKey,val)• Else

• previousKey = key• recordForPrevKey = val

define Answer(record,request):• find id where “request = ~ctr to id”• print “id ~ctr for request is record”

Recap

Page 19: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

A stream-and-sort analog of the request-and-answer pattern…

requests

Combine and sort

w Counts

aardvark C[w^Y=sports]=2

aardvark ~ctr to id1

agent C[w^Y=sports]=…

agent ~ctr to id345

agent ~ctr to id9854

… ~ctr to id345

agent ~ctr to id34742

zynga C[…]

zynga ~ctr to id1

Request-handling logic

•previousKey = somethingImpossible• For each (key,val) in input:• If key==previousKey

• Answer(recordForPrevKey,val)• Else

• previousKey = key• recordForPrevKey = val

define Answer(record,request):• find id where “request = ~ctr to id”• print “id ~ctr for request is record”

Output:id1 ~ctr for aardvark is C[w^Y=sports]=2…id1 ~ctr for zynga is ….…

Recap

Page 20: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

A stream-and-sort analog of the request-and-answer pattern…

w Counts

aardvark C[w^Y=sports]=2

aardvark ~ctr to id1

agent C[w^Y=sports]=…

agent ~ctr to id345

agent ~ctr to id9854

… ~ctr to id345

agent ~ctr to id34742

zynga C[…]

zynga ~ctr to id1

Request-handling logic

Output:id1 ~ctr for aardvark is C[w^Y=sports]=2…id1 ~ctr for zynga is ….…id1 found an aardvark in zynga’s farmville today!id2 …id3 ….id4 …id5 …..

Combine and sort ????

Recap

Page 21: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Key Value

id1 found aardvark zynga farmville today

~ctr for aardvark is C[w^Y=sports]=2

~ctr for found is

C[w^Y=sports]=1027,C[w^Y=worldNews]=564

id2 w2,1 w2,2 w2,3 ….

~ctr for w2,1 is …

… …

What we ended up with

Recap

Page 22: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Ramping it up – keeping word counts out of memory

• Goal: records for xy with attributes freq-in-B, freq-in-C, freq-of-x-in-C, freq-of-y-in-C, …

• Assume I have built built phrase tables and word tables….how do I incorporate the word attributes into the phrase records?

• For each phrase xy, request necessary word frequencies:– Print “x ~request=freq-in-C,from=xy”– Print “y ~request=freq-in-C,from=xy”

• Sort all the word requests in with the word tables• Scan through the result and generate the answers: for each

word w, a1=n1,a2=n2,….

– Print “xy ~request=freq-in-C,from=w”• Sort the answers in with the xy records• Scan through and augment the xy records appropriately

Page 23: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Generating and scoring phrases: 3

Summary1. Scan foreground corpus C for phrases, words: O(nC)

producing mC phrase records, vC word records2. Scan phrase records producing word-freq requests:

O(mC )producing 2mC requests

3. Sort requests with word records: O((2mC + vC )log(2mC + vC))

= O(mClog mC) since vC < mC

4. Scan through and answer requests: O(mC)5. Sort answers with phrase records: O(mClog mC) 6. Repeat 1-5 for background corpus: O(nB + mBlogmB)7. Combine the two phrase tables: O(m log m), m = mB

+ mC

8. Compute all the statistics: O(m)

Page 24: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

24

ABSTRACTIONS FOR STREAM AND SORTAND MAP-REDUCE

Page 25: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

25

Abstractions On Top Of Map-Reduce

• We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps)–naive Bayes training–naïve Bayes testing–phrase scoring

• How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse?

Page 26: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

26

Abstractions On Top Of Map-Reduce

• Some obvious streaming processes: – for each row in a table• Transform it and

output the result

• Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:(“aardvarks”,1) (“aardvark”,1)

Proposed syntax:

table2 = MAP table1 TO λ row : f(row)) f(row)row’

Example: apply stop words(“aardvark”,1) (“aardvark”,1)(“the”,1) deleted

Proposed syntax:

table2 = FILTER table1 BY λ row : f(row)) f(row) {true,false}

Page 27: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

27

Abstractions On Top Of Map-Reduce

• A non-obvious? streaming processes: – for each row in a table• Transform it to a list

of items• Splice all the lists

together to get the output table (flatten)Example: tokenizing a line

“I found an aardvark” [“i”, “found”,”an”,”aardvark”]“We love zymurgy” [“we”,”love”,”zymurgy”]..but final table is one word per row

“i”“found”“an”“aardvark”“we”“love”…

Proposed syntax:

table2 = FLATMAP table1 TO λ row : f(row)) f(row)list of rows

Page 28: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

28

Abstractions On Top Of Map-Reduce

• Another example from the Naïve Bayes test program…

Page 29: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

NB Test Step

X=w1^Y=sportsX=w1^Y=worldNewsX=..X=w2^Y=…X=……

524510542120

373

Event counts

How:• Stream and sort:• for each C[X=w^Y=y]=n

• print “w C[Y=y]=n”• sort and build a list of values

associated with each key wLike an inverted index

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Page 30: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

NB Test Step

X=w1^Y=sportsX=w1^Y=worldNewsX=..X=w2^Y=…X=……

524510542120

373

Event counts

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

The general case:We’re taking rows from a table• In a particular format

(event,count)Applying a function to get a new value• The word for the eventAnd grouping the rows of the table by this new value

Grouping operationSpecial case of a map-reduce

Proposed syntax:

GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, …

f(row)field

Page 31: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

NB Test StepThe general case:We’re taking rows from a table• In a particular format

(event,count)Applying a function to get a new value• The word for the eventAnd grouping the rows of the table by this new value

Grouping operationSpecial case of a map-reduce

Proposed syntax:

GROUP table BY λ row : f(row) Could define f via: a function, a field of a defined record structure, …

f(row)field

Aside: you guys know how to implement this, right?

1. Output pairs (f(row),row) with a map/streaming process

2. Sort pairs by key – which is f(row)

3. Reduce and aggregate by appending together all the values associated with the same key

Page 32: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

32

Abstractions On Top Of Map-Reduce

• And another example from the Naïve Bayes test program…

Page 33: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Request-and-answer

id1 w1,1 w1,2 w1,3 …. w1,k1

id2 w2,1 w2,2 w2,3 …. id3 w3,1 w3,2 …. id4 w4,1 w4,2 …id5 w5,1 w5,2 …...

Test data Record of all event counts for each word

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Step 2: stream through and for each test case

idi wi,1 wi,2 wi,3 …. wi,ki

request the event counters needed to classify idi from the event-count DB, then classify using the answers

Classification logic

Page 34: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

Request-and-answer

• Break down into stages– Generate the data being requested (indexed by

key, here a word)• Eg with group … by

– Generate the requests as (key, requestor) pairs• Eg with flatmap … to

– Join these two tables by key• Join defined as (1) cross-product and (2) filter out pairs

with different values for keys • This replaces the step of concatenating two different

tables of key-value pairs, and reducing them together

– Postprocess the joined result

Page 35: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

w Counters

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

w Counters Requests

aardvark C[w^Y=sports]=2 ~ctr to id1

agent C[w^Y=sports]=…

~ctr to id345

agent C[w^Y=sports]=…

~ctr to id9854

agent C[w^Y=sports]=…

~ctr to id345

… C[w^Y=sports]=…

~ctr to id34742

zynga C[…] ~ctr to id1

zynga C[…] …

w Request

found ~ctr to id1

aardvark ~ctr to id1

zynga ~ctr to id1

… ~ctr to id2

Page 36: Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably

w Counters

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

w Counters Requests

aardvark C[w^Y=sports]=2 id1

agent C[w^Y=sports]=…

id345

agent C[w^Y=sports]=…

id9854

agent C[w^Y=sports]=…

id345

… C[w^Y=sports]=…

id34742

zynga C[…] id1

zynga C[…] …

w Request

found id1

aardvark id1

zynga id1

… id2

Proposed syntax:

JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

Examples:

JOIN wordInDoc BY word, wordCounters BY word --- if word(row) defined correctlyJOIN wordInDoc BY lambda (word,docid):word, wordCounters BY lambda (word,counters):word – using python syntax for functions