workflows and abstractions for map-reducewcohen/10-605/workflows-2.pdfguineapig: pig in python •...
TRANSCRIPT
Workflows and Abstractions for Map-Reduce
1
Recap
• Map-reduce✔• Algorithmswithmultiplemap-reducesteps
–Naïvebayestestroutineforlargedatasetsandlargemodels
• Cleanlydescribingthesealgorithms–workflow(ordataflow)languages– abstractoperations:map,filter,flatten,group,join,…
–PIG:onesuchlanguage2
PIG• Releasedin2008• Wordcount program:
3
GUINEA PIG
4
GuineaPig: PIG in Python• PurePython(<1500lines)• StreamsPythondatastructures
– strings,numbers,tuples (a,b),lists[a,b,c]– Norecords:operationsdefinedfunctionally
• CompilestoHadoop streamingpipeline– OptimizessequencesofMAPs
• RunslocallywithoutHadoop– compilestostream-and-sortpipeline– intermediateresultscanbeviewed
• Caneasilyrunpartsofapipeline• http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig
5
GuineaPig: PIG in Python• PurePython,streamsPythondatastructures
– nottoomuchnewtolearn(eg field/recordnotation,specialstringoperations,UDFs,…)
– codebaseissmallandreadable• CompilestoHadoop orstream-and-sort,caneasilyrunpartsofapipeline– intermediateresultsoftenare(andalwayscanbe)storedandinspected
– planisfairlyvisible• Syntaxincludeshigh-leveloperationsbutalsofairlydetaileddescriptionofanoptimizedmap-reducestep– Flatten|Group(by=…,retaining=…,reducingTo=…)
6
7
A wordcount example
class variables in the planner are data structures
Wordcount example ….• Aprogramisconvertedtoadatastructure• Thedatastructurecanbeconvertedtoaseriesof“abstractmap-reducetasks”andthenshellcommands
8
steps in the compiled plan
invoke your script with special args
Wordcount example ….
• Datastructurecanbeconvertedtocommandsforstreaminghadoop
9
Wordcount example ….
• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs
10
Wordcount example ….
• Ofcourseyouwon’taccesslocalfileswithHadoop,soyouneedtospecifyanHDFSlocationforinputsandoutputs
11
More examples of GuineaPig
12
Join syntax, macros, Format command
Incremental debugging, when intermediate views are stored:
% python wrdcmp.py –store result…% python wrdcmp.py –store result –reuse cmp
More examples of GuineaPig
13
Full Syntax for Group
Group(wc, by=lambda (word,count):word[:k], retaining=lambda (word,count):count, combiningTo=ReduceToSum(),reducingTo=ReduceToSum())
equiv to:
Group(wc, by=lambda (word,count):word[:k], reducingTo=
ReduceTo(int, lambda accum,word,count): accum+count))
ANOTHER EXAMPLE:COMPUTINGTFIDF IN GUINEA PIG
14
Implementation
15
Implementation
16
docId term
d123 found
d123 aardvark
… …
(d123,found)(d123,aardvark)…
Implementation
17
docId term
d123 found
d123 aardvark
key value
found (d123,found),(d134,found),… 2456
aardvark (d123,aardvark),… 7
Implementation
18
docId term
d123 found
d123 aardvark
key value
found 2456
aardvark 7
('1', 'quite') ('1', 'a') ('1', 'difference.’)…(’3’, ‘alcohol’)…
("'94", 1) ("'94,", 1) ("'a", 1) ("'alcohol", 1) …
(('2', "'alcohol"), ("'alcohol", 1)) (('550', "'cause"), ("'cause", 1))…
Implementation
19
docId term df
d123 found 2456
d123 aardvark 7
('2', "'confabulation'.", 2) ('3', "'confabulation'.", 2) ('209', "'controversy'", 1) ('181', "'em", 3) ('434', "'em", 3) ('452', "'em", 3) ('113', "'fancy", 1) ('212', "'franchise',", 1) ('352', "'honest,", 1)
Implementation: Map-side join
20
Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b)
docId term df
d123 found 2456 Arbitrary python object
d123 aardvark 7 Arbitrary python object
…
Implementation
21
Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object
docId term df
d123 found 2456 ptr
d123 aardvark 7 ptr
… …
Arbitrary python object
(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))
Implementation
Augment: loads a preloaded object b at mapper initialization time,cycles thru the input, and generates pairs (a,b), where b points to the preloaded object
(('2', "'confabulation'.", 2), ('ndoc', 964)) (('3', "'confabulation'.", 2), ('ndoc', 964)) (('209', "'controversy'", 1), ('ndoc', 964)) (('181', "'em", 3), ('ndoc', 964)) (('434', "'em", 3), ('ndoc', 964))
This looks like a join. But it’s different.• It’s a single map, not a map-shuffle/sort-reduce• The loaded object is paired with every a, not just ones where the join keys
match (but you can use it for a map-side join!)• The loaded object has to be distributed to every mapper (so, copied!)
22
Implementation
23
Gotcha: if you store an augment, it’s printed on disk, and Python writes the object pointed to, not the pointer. So when you store you make a copy of the object for every row.
docId term df
d123 found 2456 ptr
d123 aardvark 7 ptr
… …
Arbitrary python object
(('2', "'confabulation'.", 2), printed-object) (('3', "'confabulation'.", 2), printed-object) (('209', "'controversy'", 1), printed-object) (('181', "'em", 3), printed-object) (('434', "'em", 3), printed-object)
Full Implementation
24
TFIDF with map-side joins
25
TFIDF with map-side joins
26
SOFT JOINS
Anotherproblemtoattackwithdataflow
27
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
28
Outline: Soft Joins with TFIDF
• Whysimilarityjoinsareimportant• Usefulsimilaritymetricsforsetsandstrings• FastmethodsforK-NNandsimilarityjoins
–Blocking– Indexing–Short-cutalgorithms–Parallelimplementation
29
SOFT JOINS WITH TFIDF:WHY AND WHAT
30
Motivation• Integratingdataisimportant• Datafromdifferentsourcesmaynothaveconsistentobjectidentifiers– Especiallyautomatically-constructedones
• Butdatabaseswillhavehuman-readablenamesfortheobjects
• Butnamesaretricky….
31
32
Sim Joins on Product Descriptions
• Similarity can be high for descriptions of distinct items:
o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...
o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..
• Similarity can be low for descriptions of identical items:
o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...
o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.
33
One solution: Soft (Similarity) joins
• AsimilarityjoinoftwosetsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat
• ai isfromA• bj isfromB• sij isthesimilarity ofai andbj• thetriplesareindescendingorder
• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….
34
Example: soft joins/similarity joins
… …
Input: Two Different Lists of Entity Names
35
Example: soft joins/similarity joinsOutput: Pairs of Names Ranked by Similarity
…
…
identical
similar
less similar
36
Softjoin Example - 1
A useful scalable similarity metric: IDF weighting plus cosine distance!37
How well does TFIDF work?
38
39
There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)
40
Semantic Joiningwith Multiscale Statistics
WilliamCohenKatieRivard,DanaAttias-Moshevitz
CMU
41
42
SOFT JOINS WITH TFIDF:HOW?
43
Rocchio’s algorithm
DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d
IDF(w) = |D |DF(w)
u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)
u(y) =α 1|Cy |
u(d)||u(d) ||2d∈Cy
∑ −β1
|D−Cy |u(d ')
||u(d ') ||2d '∈D−Cy
∑
f (d) = argmaxyu(d)
||u(d) ||2
⋅u(y)
||u(y) ||2
Many variants of these formulae
…as long as u(w,d)=0 for words not in d!
Store only non-zeros in u(d), so size is O(|d| )
But size of u(y) is O(|nV| )
u2= ui
2
i∑
44
TFIDF similarity
DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d
IDF(w) = |D |DF(w)
u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)
v(d) = u(d)||u(d) ||2
sim(v(d1),v(d2 )) = v(d1) ⋅v(d2 ) = u(w,d1)||u(d1) ||2w
∑ u(w,d2 )||u(d2 ) ||2
45
Soft TFIDF joins• AsimilarityjoinoftwosetsofTFIDF-weightedvectorsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat
• ai isfromA• bj isfromB• sij isthedotproductofai andbj• thetriplesareindescendingorder
• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….
46
PARALLEL SOFT JOINS
47
SIGMOD 2010
loosely adapted
48
Parallel Inverted Index Softjoin - 1
want this to work for long documents or short ones…and keep the relations simple
Statistics for computing TFIDF with IDFs local to each relation49
Parallel Inverted Index Softjoin - 2
What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples
• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))
• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum
50
An alternative TFIDF pipeline
51
Parallel versus serial
• NamesoftagsfromStackoverflow vsWikipediaconcepts:– input612M,7.1Mentities– docvec 1050M– softjoin 67M,1.5Mpairs– wallclock time24min
• 25processesonin-memorymap-reduce• called”mrs_gp”
– wallclocktimeforSecondString• 3-4days
– (preliminaryexperiments)
52
The same code in PIG
53
Inverted Index Softjoin – PIG 1/3
54
Inverted Index Softjoin – 2/3
55
Inverted Index Softjoin – 3/3
56
Results…..
57
58
Making the algorithm smarter….
59
Inverted Index Softjoin - 2
• thisjoiniswhereitcangetexpensive• ifatermappearsinNdocsinrel1andMdocsinrel2thenyougetN*Mtuplesinthejoin
• butfrequenttermsdon’tcountmuch…• weshouldmakeasmart choiceaboutwhichtermstouse
60
Adding heuristics to the soft join - 1
61
Adding heuristics to the soft join - 2
62
Adding heuristics
• Parks:– input40k–data60k–docvec 102k– softjoin
• 539ktokens• 508kdocuments• 0errorsintop50
• w/heuristics:– input40k–data60k–docvec 102k– softjoin
• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms
63