workflows and abstractions for map-reducewcohen/10-605/workflows-2.pdfguineapig: pig in python •...

Post on 07-Jul-2020

12 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Workflows and Abstractions for Map-Reduce

1

Recap

• Map-reduce✔• Algorithmswithmultiplemap-reducesteps

–Naïvebayestestroutineforlargedatasetsandlargemodels

• Cleanlydescribingthesealgorithms–workflow(ordataflow)languages– abstractoperations:map,filter,flatten,group,join,…

–PIG:onesuchlanguage2

PIG• Releasedin2008• Wordcount program:

3

GUINEA PIG

4

GuineaPig: PIG in Python• PurePython(<1500lines)• StreamsPythondatastructures

– strings,numbers,tuples (a,b),lists[a,b,c]– Norecords:operationsdefinedfunctionally

• CompilestoHadoop streamingpipeline– OptimizessequencesofMAPs

• RunslocallywithoutHadoop– compilestostream-and-sortpipeline– intermediateresultscanbeviewed

• Caneasilyrunpartsofapipeline• http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig

5

GuineaPig: PIG in Python• PurePython,streamsPythondatastructures

– nottoomuchnewtolearn(eg field/recordnotation,specialstringoperations,UDFs,…)

– codebaseissmallandreadable• CompilestoHadoop orstream-and-sort,caneasilyrunpartsofapipeline– intermediateresultsoftenare(andalwayscanbe)storedandinspected

– planisfairlyvisible• Syntaxincludeshigh-leveloperationsbutalsofairlydetaileddescriptionofanoptimizedmap-reducestep– Flatten|Group(by=…,retaining=…,reducingTo=…)

6

7

A wordcount example

class variables in the planner are data structures

Wordcount example ….• Aprogramisconvertedtoadatastructure• Thedatastructurecanbeconvertedtoaseriesof“abstractmap-reducetasks”andthenshellcommands

8

steps in the compiled plan

invoke your script with special args

Wordcount example ….

• Datastructurecanbeconvertedtocommandsforstreaminghadoop

9

Wordcount example ….

• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs

10

Wordcount example ….

• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs

11

More examples of GuineaPig

12

Join syntax, macros, Format command

Incremental debugging, when intermediate views are stored:

% python wrdcmp.py –store result…% python wrdcmp.py –store result –reuse cmp

More examples of GuineaPig

13

Full Syntax for Group

Group(wc, by=lambda (word,count):word[:k], retaining=lambda (word,count):count, combiningTo=ReduceToSum(),reducingTo=ReduceToSum())

equiv to:

Group(wc, by=lambda (word,count):word[:k], reducingTo=

ReduceTo(int, lambda accum,word,count): accum+count))

ANOTHER EXAMPLE:COMPUTINGTFIDF IN GUINEA PIG

14

Implementation

15

Implementation

16

docId term

d123 found

d123 aardvark

… …

(d123,found)(d123,aardvark)…

Implementation

17

docId term

d123 found

d123 aardvark

key value

found (d123,found),(d134,found),… 2456

aardvark (d123,aardvark),… 7

Implementation

18

docId term

d123 found

d123 aardvark

key value

found 2456

aardvark 7

('1', 'quite') ('1', 'a') ('1', 'difference.’)…(’3’, ‘alcohol’)…

("'94", 1) ("'94,", 1) ("'a", 1) ("'alcohol", 1) …

(('2', "'alcohol"), ("'alcohol", 1)) (('550', "'cause"), ("'cause", 1))…

Implementation

19

docId term df

d123 found 2456

d123 aardvark 7

('2', "'confabulation'.", 2) ('3', "'confabulation'.", 2) ('209', "'controversy'", 1) ('181', "'em", 3) ('434', "'em", 3) ('452', "'em", 3) ('113', "'fancy", 1) ('212', "'franchise',", 1) ('352', "'honest,", 1)

Implementation: Map-side join

20

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b)

docId term df

d123 found 2456 Arbitrary python object

d123 aardvark 7 Arbitrary python object

Implementation

21

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object

docId term df

d123 found 2456 ptr

d123 aardvark 7 ptr

… …

Arbitrary python object

(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))

Implementation

Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object

(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))

This looks like a join. But it’s different.• It’s a single map, not a map-shuffle/sort-reduce• The loaded object is paired with every a, not just ones where the join keys

match (but you can use it for a map-side join!)• The loaded object has to be distributed to every mapper (so, copied!)

22

Implementation

23

Gotcha: if you store an augment, it’s printed on disk, and Python writes the object pointed to, not the pointer. So when you store you make a copy of the object for every row.

docId term df

d123 found 2456 ptr

d123 aardvark 7 ptr

… …

Arbitrary python object

(('2', "'confabulation'.", 2), printed-object) (('3', "'confabulation'.", 2), printed-object) (('209', "'controversy'", 1), printed-object) (('181', "'em", 3), printed-object) (('434', "'em", 3), printed-object)

Full Implementation

24

TFIDF with map-side joins

25

TFIDF with map-side joins

26

SOFT JOINS

Anotherproblemtoattackwithdataflow

27

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:

The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…

This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.

28

Outline: Soft Joins with TFIDF

• Whysimilarityjoinsareimportant• Usefulsimilaritymetricsforsetsandstrings• FastmethodsforK-NNandsimilarityjoins

–Blocking– Indexing–Short-cutalgorithms–Parallelimplementation

29

SOFT JOINS WITH TFIDF:WHY AND WHAT

30

Motivation• Integratingdataisimportant• Datafromdifferentsourcesmaynothaveconsistentobjectidentifiers– Especiallyautomatically-constructedones

• Butdatabaseswillhavehuman-readablenamesfortheobjects

• Butnamesaretricky….

31

32

Sim Joins on Product Descriptions

• Similarity can be high for descriptions of distinct items:

o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...

o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..

• Similarity can be low for descriptions of identical items:

o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...

o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

33

One solution: Soft (Similarity) joins

• AsimilarityjoinoftwosetsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat

• ai isfromA• bj isfromB• sij isthesimilarity ofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

34

Example: soft joins/similarity joins

… …

Input: Two Different Lists of Entity Names

35

Example: soft joins/similarity joinsOutput: Pairs of Names Ranked by Similarity

identical

similar

less similar

36

Softjoin Example - 1

A useful scalable similarity metric: IDF weighting plus cosine distance!37

How well does TFIDF work?

38

39

There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)

40

Semantic Joiningwith Multiscale Statistics

WilliamCohenKatieRivard,DanaAttias-Moshevitz

CMU

41

42

SOFT JOINS WITH TFIDF:HOW?

43

Rocchio’s algorithm

DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d

IDF(w) = |D |DF(w)

u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)

u(y) =α 1|Cy |

u(d)||u(d) ||2d∈Cy

∑ −β1

|D−Cy |u(d ')

||u(d ') ||2d '∈D−Cy

f (d) = argmaxyu(d)

||u(d) ||2

⋅u(y)

||u(y) ||2

Many variants of these formulae

…as long as u(w,d)=0 for words not in d!

Store only non-zeros in u(d), so size is O(|d| )

But size of u(y) is O(|nV| )

u2= ui

2

i∑

44

TFIDF similarity

DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d

IDF(w) = |D |DF(w)

u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)

v(d) = u(d)||u(d) ||2

sim(v(d1),v(d2 )) = v(d1) ⋅v(d2 ) = u(w,d1)||u(d1) ||2w

∑ u(w,d2 )||u(d2 ) ||2

45

Soft TFIDF joins• AsimilarityjoinoftwosetsofTFIDF-weightedvectorsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat

• ai isfromA• bj isfromB• sij isthedotproductofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

46

PARALLEL SOFT JOINS

47

SIGMOD 2010

loosely adapted

48

Parallel Inverted Index Softjoin - 1

want this to work for long documents or short ones…and keep the relations simple

Statistics for computing TFIDF with IDFs local to each relation49

Parallel Inverted Index Softjoin - 2

What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples

• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))

• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum

50

An alternative TFIDF pipeline

51

Parallel versus serial

• NamesoftagsfromStackoverflow vsWikipediaconcepts:– input612M,7.1Mentities– docvec 1050M– softjoin 67M,1.5Mpairs– wallclock time24min

• 25processesonin-memorymap-reduce• called”mrs_gp”

– wallclocktimeforSecondString• 3-4days

– (preliminaryexperiments)

52

The same code in PIG

53

Inverted Index Softjoin – PIG 1/3

54

Inverted Index Softjoin – 2/3

55

Inverted Index Softjoin – 3/3

56

Results…..

57

58

Making the algorithm smarter….

59

Inverted Index Softjoin - 2

• thisjoiniswhereitcangetexpensive• ifatermappearsinNdocsinrel1andMdocsinrel2thenyougetN*Mtuplesinthejoin

• butfrequenttermsdon’tcountmuch…• weshouldmakeasmart choiceaboutwhichtermstouse

60

Adding heuristics to the soft join - 1

61

Adding heuristics to the soft join - 2

62

Adding heuristics

• Parks:– input40k–data60k–docvec 102k– softjoin

• 539ktokens• 508kdocuments• 0errorsintop50

• w/heuristics:– input40k–data60k–docvec 102k– softjoin

• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms

63

top related