workflows and abstractions for map-reducewcohen/10-605/workflows-2.pdfguineapig: pig in python •...

63
Workflows and Abstractions for Map-Reduce 1

Upload: others

Post on 07-Jul-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Workflows and Abstractions for Map-Reduce

1

Page 2: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Recap

• Map-reduce✔• Algorithmswithmultiplemap-reducesteps

–Naïvebayestestroutineforlargedatasetsandlargemodels

• Cleanlydescribingthesealgorithms–workflow(ordataflow)languages– abstractoperations:map,filter,flatten,group,join,…

–PIG:onesuchlanguage2

Page 3: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

PIG• Releasedin2008• Wordcount program:

3

Page 4: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

GUINEA PIG

4

Page 5: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

GuineaPig: PIG in Python• PurePython(<1500lines)• StreamsPythondatastructures

– strings,numbers,tuples (a,b),lists[a,b,c]– Norecords:operationsdefinedfunctionally

• CompilestoHadoop streamingpipeline– OptimizessequencesofMAPs

• RunslocallywithoutHadoop– compilestostream-and-sortpipeline– intermediateresultscanbeviewed

• Caneasilyrunpartsofapipeline• http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig

5

Page 6: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

GuineaPig: PIG in Python• PurePython,streamsPythondatastructures

– nottoomuchnewtolearn(eg field/recordnotation,specialstringoperations,UDFs,…)

– codebaseissmallandreadable• CompilestoHadoop orstream-and-sort,caneasilyrunpartsofapipeline– intermediateresultsoftenare(andalwayscanbe)storedandinspected

– planisfairlyvisible• Syntaxincludeshigh-leveloperationsbutalsofairlydetaileddescriptionofanoptimizedmap-reducestep– Flatten|Group(by=…,retaining=…,reducingTo=…)

6

Page 7: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

7

A wordcount example

class variables in the planner are data structures

Page 8: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Wordcount example ….• Aprogramisconvertedtoadatastructure• Thedatastructurecanbeconvertedtoaseriesof“abstractmap-reducetasks”andthenshellcommands

8

steps in the compiled plan

invoke your script with special args

Page 9: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Wordcount example ….

• Datastructurecanbeconvertedtocommandsforstreaminghadoop

9

Page 10: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Wordcount example ….

• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs

10

Page 11: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Wordcount example ….

• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs

11

Page 12: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

More examples of GuineaPig

12

Join syntax, macros, Format command

Incremental debugging, when intermediate views are stored:

% python wrdcmp.py –store result…% python wrdcmp.py –store result –reuse cmp

Page 13: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

More examples of GuineaPig

13

Full Syntax for Group

Group(wc, by=lambda (word,count):word[:k], retaining=lambda (word,count):count, combiningTo=ReduceToSum(),reducingTo=ReduceToSum())

equiv to:

Group(wc, by=lambda (word,count):word[:k], reducingTo=

ReduceTo(int, lambda accum,word,count): accum+count))

Page 14: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

ANOTHER EXAMPLE:COMPUTINGTFIDF IN GUINEA PIG

14

Page 15: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

15

Page 16: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

16

docId term

d123 found

d123 aardvark

… …

(d123,found)(d123,aardvark)…

Page 17: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

17

docId term

d123 found

d123 aardvark

key value

found (d123,found),(d134,found),… 2456

aardvark (d123,aardvark),… 7

Page 18: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

18

docId term

d123 found

d123 aardvark

key value

found 2456

aardvark 7

('1', 'quite') ('1', 'a') ('1', 'difference.’)…(’3’, ‘alcohol’)…

("'94", 1) ("'94,", 1) ("'a", 1) ("'alcohol", 1) …

(('2', "'alcohol"), ("'alcohol", 1)) (('550', "'cause"), ("'cause", 1))…

Page 19: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

19

docId term df

d123 found 2456

d123 aardvark 7

('2', "'confabulation'.", 2) ('3', "'confabulation'.", 2) ('209', "'controversy'", 1) ('181', "'em", 3) ('434', "'em", 3) ('452', "'em", 3) ('113', "'fancy", 1) ('212', "'franchise',", 1) ('352', "'honest,", 1)

Page 20: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation: Map-side join

20

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b)

docId term df

d123 found 2456 Arbitrary python object

d123 aardvark 7 Arbitrary python object

Page 21: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

21

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object

docId term df

d123 found 2456 ptr

d123 aardvark 7 ptr

… …

Arbitrary python object

(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))

Page 22: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object

(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))

This looks like a join. But it’s different.• It’s a single map, not a map-shuffle/sort-reduce• The loaded object is paired with every a, not just ones where the join keys

match (but you can use it for a map-side join!)• The loaded object has to be distributed to every mapper (so, copied!)

22

Page 23: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Implementation

23

Gotcha: if you store an augment, it’s printed on disk, and Python writes the object pointed to, not the pointer. So when you store you make a copy of the object for every row.

docId term df

d123 found 2456 ptr

d123 aardvark 7 ptr

… …

Arbitrary python object

(('2', "'confabulation'.", 2), printed-object) (('3', "'confabulation'.", 2), printed-object) (('209', "'controversy'", 1), printed-object) (('181', "'em", 3), printed-object) (('434', "'em", 3), printed-object)

Page 24: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Full Implementation

24

Page 25: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

TFIDF with map-side joins

25

Page 26: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

TFIDF with map-side joins

26

Page 27: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

SOFT JOINS

Anotherproblemtoattackwithdataflow

27

Page 28: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:

The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…

This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.

28

Page 29: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Outline: Soft Joins with TFIDF

• Whysimilarityjoinsareimportant• Usefulsimilaritymetricsforsetsandstrings• FastmethodsforK-NNandsimilarityjoins

–Blocking– Indexing–Short-cutalgorithms–Parallelimplementation

29

Page 30: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

SOFT JOINS WITH TFIDF:WHY AND WHAT

30

Page 31: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Motivation• Integratingdataisimportant• Datafromdifferentsourcesmaynothaveconsistentobjectidentifiers– Especiallyautomatically-constructedones

• Butdatabaseswillhavehuman-readablenamesfortheobjects

• Butnamesaretricky….

31

Page 32: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

32

Page 33: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Sim Joins on Product Descriptions

• Similarity can be high for descriptions of distinct items:

o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...

o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..

• Similarity can be low for descriptions of identical items:

o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...

o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

33

Page 34: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

One solution: Soft (Similarity) joins

• AsimilarityjoinoftwosetsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat

• ai isfromA• bj isfromB• sij isthesimilarity ofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

34

Page 35: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Example: soft joins/similarity joins

… …

Input: Two Different Lists of Entity Names

35

Page 36: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Example: soft joins/similarity joinsOutput: Pairs of Names Ranked by Similarity

identical

similar

less similar

36

Page 37: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Softjoin Example - 1

A useful scalable similarity metric: IDF weighting plus cosine distance!37

Page 38: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

How well does TFIDF work?

38

Page 39: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

39

Page 40: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)

40

Page 41: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Semantic Joiningwith Multiscale Statistics

WilliamCohenKatieRivard,DanaAttias-Moshevitz

CMU

41

Page 42: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

42

Page 43: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

SOFT JOINS WITH TFIDF:HOW?

43

Page 44: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Rocchio’s algorithm

DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d

IDF(w) = |D |DF(w)

u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)

u(y) =α 1|Cy |

u(d)||u(d) ||2d∈Cy

∑ −β1

|D−Cy |u(d ')

||u(d ') ||2d '∈D−Cy

f (d) = argmaxyu(d)

||u(d) ||2

⋅u(y)

||u(y) ||2

Many variants of these formulae

…as long as u(w,d)=0 for words not in d!

Store only non-zeros in u(d), so size is O(|d| )

But size of u(y) is O(|nV| )

u2= ui

2

i∑

44

Page 45: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

TFIDF similarity

DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d

IDF(w) = |D |DF(w)

u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)

v(d) = u(d)||u(d) ||2

sim(v(d1),v(d2 )) = v(d1) ⋅v(d2 ) = u(w,d1)||u(d1) ||2w

∑ u(w,d2 )||u(d2 ) ||2

45

Page 46: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Soft TFIDF joins• AsimilarityjoinoftwosetsofTFIDF-weightedvectorsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat

• ai isfromA• bj isfromB• sij isthedotproductofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

46

Page 47: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

PARALLEL SOFT JOINS

47

Page 48: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

SIGMOD 2010

loosely adapted

48

Page 49: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Parallel Inverted Index Softjoin - 1

want this to work for long documents or short ones…and keep the relations simple

Statistics for computing TFIDF with IDFs local to each relation49

Page 50: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Parallel Inverted Index Softjoin - 2

What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples

• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))

• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum

50

Page 51: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

An alternative TFIDF pipeline

51

Page 52: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Parallel versus serial

• NamesoftagsfromStackoverflow vsWikipediaconcepts:– input612M,7.1Mentities– docvec 1050M– softjoin 67M,1.5Mpairs– wallclock time24min

• 25processesonin-memorymap-reduce• called”mrs_gp”

– wallclocktimeforSecondString• 3-4days

– (preliminaryexperiments)

52

Page 53: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

The same code in PIG

53

Page 54: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Inverted Index Softjoin – PIG 1/3

54

Page 55: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Inverted Index Softjoin – 2/3

55

Page 56: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Inverted Index Softjoin – 3/3

56

Page 57: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Results…..

57

Page 58: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

58

Page 59: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Making the algorithm smarter….

59

Page 60: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Inverted Index Softjoin - 2

• thisjoiniswhereitcangetexpensive• ifatermappearsinNdocsinrel1andMdocsinrel2thenyougetN*Mtuplesinthejoin

• butfrequenttermsdon’tcountmuch…• weshouldmakeasmart choiceaboutwhichtermstouse

60

Page 61: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Adding heuristics to the soft join - 1

61

Page 62: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Adding heuristics to the soft join - 2

62

Page 63: Workflows and Abstractions for Map-Reducewcohen/10-605/workflows-2.pdfGuineaPig: PIG in Python • Pure Python (< 1500 lines) • Streams Python data structures –strings, numbers,

Adding heuristics

• Parks:– input40k–data60k–docvec 102k– softjoin

• 539ktokens• 508kdocuments• 0errorsintop50

• w/heuristics:– input40k–data60k–docvec 102k– softjoin

• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms

63