spark vs hadoop - carnegie mellon school of computer...
TRANSCRIPT
![Page 1: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/1.jpg)
Spark vs Hadoop
1
![Page 2: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/2.jpg)
Spark • Too much typing
– programs are not concise • Too low level
– missing abstractions – hard to specify a work:low
• Not well suited to iterative operations – E.g., E/M, k-means clustering, … – Work:low and memory-loading issues
2
Set of concise dataflow operations (“transformation”) Dataflow operations are embedded in an API together with “actions”
Sharded files are replaced by “RDDs” – resiliant distributed datasets RDDs can be cached in cluster memory and recreated to recover from error
![Page 3: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/3.jpg)
Spark examples
3
spark is a spark context object
![Page 4: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/4.jpg)
Spark examples
4
errors is a transformation, and
thus a data strucure that explains HOW to
do something count() is an action: it will actually execute the plan for errors and return a value.
errors.filter() is a transformation
collect() is an action
everything is sharded, like in Hadoop and GuineaPig
![Page 5: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/5.jpg)
Spark examples
5
# modify errors to be stored in cluster memory
subsequent actions will be much faster
everything is sharded … and the shards are stored in memory of worker machines not local disk (if possible)
You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark’s not smart about persisting data.
![Page 6: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/6.jpg)
Spark examples: wordcount
6
the action transformation on (key,value) pairs , which are special
![Page 7: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/7.jpg)
Spark examples: batch logistic regression
7
reduce is an action – it produces a numpy
vector
p.x and w are vectors, from the numpy package
p.x and w are vectors, from the numpy package.
Python overloads operations like * and +
for vectors.
![Page 8: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/8.jpg)
Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just “syntactic sugar”. • They are much more compact than something like a list of python
floats. • numpy operations like dot, *, + are calls to optimized C code • a little python logic around a lot of numpy calls is pretty efficient
8
![Page 9: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/9.jpg)
Spark examples: batch logistic regression
9
w is defined outside the lambda function,
but used inside it So: python builds a closure – code
including the current value of w – and Spark ships it off to each worker. So
w is copied, and must be read-only.
![Page 10: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/10.jpg)
Spark examples: batch logistic regression
10
dataset of points is cached in cluster
memory to reduce i/o
![Page 11: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/11.jpg)
Spark logistic regression example
11
![Page 12: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/12.jpg)
Spark
12
![Page 13: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/13.jpg)
Spark details: broadcast
13
So: python builds a closure – code including the current value of w – and Spark ships it off to each worker. So
w is copied, and must be read-only.
![Page 14: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/14.jpg)
Spark details: broadcast
14
alternative: create a broadcast variable, e.g., • w_broad = spark.broadcast(w) which is accessed by the worker via • w_broad.value()
what’s sent is a small pointer to w (e.g., the name of a file containing a serialized version of w) and when value is called, some clever all-reduce like machinery is used to reduce network load.
little penalty for distributing something that’s not used by all workers
![Page 15: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/15.jpg)
Spark details: mapPartitions
15
Common issue: • map task requires loading in some small shared value • more generally, map task requires some sort of initialization before
processing a shard • GuineaPig:
• special Augment … sideview … pattern for shared values • can kludge up any initializer using Augment
• Raw Hadoop: mapper.configure() and mapper.close() methods
![Page 16: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/16.jpg)
Spark details: mapPartitions
16
Spark: • rdd.mapPartitions(f): will call f(iteratorOverShard) once per
shard, and return an iterator over the mapped values.
• f() can do any setup/close steps it needs
Also: • there are transformations to partition an RDD with a user-selected
function, like in Hadoop. Usually you partition and persist/cache.
![Page 17: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/17.jpg)
Other Map-Reduce (ish) Frameworks
William Cohen
17
![Page 18: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/18.jpg)
MAP-REDUCE ABSTRACTIONS: CASCADING, PIPES, SCALDING
18
![Page 19: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/19.jpg)
Y:Y=Hadoop+X
• Cascading – Java library for map-reduce work:lows – Also some library operations for common
mappers/reducers
19
![Page 20: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/20.jpg)
Cascading WordCount Example
20
Input format
Output format: pairs
Bind to HFS path
Bind to HFS path A pipeline of map-reduce jobs
Append a step: apply function to the “line” field
Append step: group a (flattened) stream of “tuples”
Replace line with bag of words
Append step: aggregate grouped values
Run the pipeline
![Page 21: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/21.jpg)
Cascading WordCount Example
Is this inefficient? We explicitly form a group for each word, and then count the elements…?
We could be saved by careful optimization: we know we don’t need the GroupBy intermediate result when we run the assembly….
Many of the Hadoop abstraction levels have a similar flavor: • Define a pipeline of tasks declaratively • Optimize it automatically • Run the final result
The key question: does the system successfully hide the details from you?
21
![Page 22: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/22.jpg)
Y:Y=Hadoop+X • Cascading
– Java library for map-reduce work:lows • expressed as “Pipe”s, to which you add Each, Every,
GroupBy, … – Also some library operations for common mappers/
reducers • e.g. RegexGenerator
– Turing-complete since it’s an API for Java • Pipes
– C++ library for map-reduce work:lows on Hadoop • Scalding
– More concise Scala library based on Cascading
22
![Page 23: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/23.jpg)
MORE DECLARATIVE LANGUAGES
23
![Page 24: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/24.jpg)
Hive and PIG: word count
• Declarative ….. Fairly stable
PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc allowed. 24
![Page 25: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/25.jpg)
FLINK
• Recent Apache Project – formerly Stratosphere
25
….
![Page 26: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/26.jpg)
FLINK
• Apache Project – just getting started
26
….
Java API
![Page 27: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/27.jpg)
FLINK
27
![Page 28: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/28.jpg)
FLINK
• Like Spark, in-memory or on disk • Everything is a Java object • Unlike Spark, contains operations for iteration
– Allowing query optimization • Very easy to use and install in local model
– Very modular – Only needs Java
28
![Page 29: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/29.jpg)
One more algorithm to discuss as a Map-reduce implementation….
29
![Page 30: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/30.jpg)
30
![Page 31: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/31.jpg)
ACL Workshop 2003 31
![Page 32: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/32.jpg)
32
![Page 33: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/33.jpg)
Why phrase-finding?
• There are lots of phrases • There’s not supervised data • It’s hard to articulate
– What makes a phrase a phrase, vs just an n-gram? • a phrase is independently meaningful (“test
drive”, “red meat”) or not (“are interesting”, “are lots”)
– What makes a phrase interesting?
33
![Page 34: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/34.jpg)
The breakdown: what makes a good phrase • Two properties:
– Phraseness: “the degree to which a given word sequence is considered to be a phrase” • Statistics: how often words co-occur together vs
separately – Informativeness: “how well a phrase captures or
illustrates the key ideas in a set of documents” – something novel and important relative to a domain • Background corpus and foreground corpus; how
often phrases occur in each
34
![Page 35: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/35.jpg)
“Phraseness”1 – based on BLRT • Binomial Ratio Likelihood Test (BLRT):
– Draw samples: • n1 draws, k1 successes • n2 draws, k2 successes • Are they from one binominal (i.e., k1/n1 and k2/n2 were
different due to chance) or from two distinct binomials? – De:ine
• p1=k1 / n1, p2=k2 / n2, p=(k1+k2)/(n1+n2), • L(p,k,n) = pk(1-p)n-k
BLRT (n1,k1,n2,k2 ) =L(p1,k1 ,n1)L(p2,k2,n2 )L(p,k1 ,n1)L(p,k2,n2 )
35
![Page 36: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/36.jpg)
“Phraseness”1 – based on BLRT • Binomial Ratio Likelihood Test (BLRT):
– Draw samples: • n1 draws, k1 successes • n2 draws, k2 successes • Are they from one binominal (i.e., k1/n1 and k2/n2 were
different due to chance) or from two distinct binomials? – De:ine
• pi=ki/ni, p=(k1+k2)/(n1+n2), • L(p,k,n) = pk(1-p)n-k
BLRT (n1,k1,n2,k2 ) = 2 logL(p1,k1 ,n1)L(p2,k2,n2 )L(p,k1 ,n1)L(p,k2,n2 )
36
![Page 37: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/37.jpg)
“Informativeness”1 – based on BLRT
– De:ine • pi=ki /ni, p=(k1+k2)/(n1+n2), • L(p,k,n) = pk(1-p)n-k
Phrase x y: W1=x ^ W2=y and two corpora, C and B
comment
k1 C(W1=x ^ W2=y) how often bigram x y occurs in corpus C
n1 C(W1=* ^ W2=*) how many bigrams in corpus C
k2 B(W1=x^W2=y) how often x y occurs in background corpus
n2 B(W1=* ^ W2=*) how many bigrams in background corpus
Does x y occur at the same frequency in both corpora?
ϕi (n1,k1,n2,k2 ) = 2 logL(p1,k1 ,n1)L(p2,k2,n2 )L(p,k1 ,n1)L(p,k2,n2 )
37
![Page 38: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/38.jpg)
“Phraseness”1 – based on BLRT – De:ine
• pi=ki /ni, p=(k1+k2)/(n1+n2), • L(p,k,n) = pk(1-p)n-k
ϕ p(n1,k1,n2,k2 ) = 2 logL(p1,k1 ,n1)L(p2,k2,n2 )L(p,k1 ,n1)L(p,k2,n2 )
comment
k1 C(W1=x ^ W2=y) how often bigram x y occurs in corpus C
n1 C(W1=x) how often word x occurs in corpus C
k2 C(W1≠x^W2=y) how often y occurs in C after a non-x
n2 C(W1≠x) how often a non-x occurs in C
Phrase x y: W1=x ^ W2=y
Does y occur at the same frequency after x as in other positions? 38
![Page 39: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/39.jpg)
The breakdown: what makes a good phrase • “Phraseness” and “informativeness” are then combined
with a tiny classi:ier, tuned on labeled data.
• Background corpus: 20 newsgroups dataset (20k messages, 7.4M words)
• Foreground corpus: rec.arts.movies.current-films June-Sep 2002 (4M words)
• Results?
€
logp
1− p= s
#
$ %
&
' ( ⇔ p =
11+ es
#
$ %
&
' (
39
![Page 40: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/40.jpg)
40
![Page 41: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/41.jpg)
The breakdown: what makes a good phrase • Two properties:
– Phraseness: “the degree to which a given word sequence is considered to be a phrase”
• Statistics: how often words co-occur together vs separately – Informativeness: “how well a phrase captures or illustrates the
key ideas in a set of documents” – something novel and important relative to a domain
• Background corpus and foreground corpus; how often phrases occur in each
– Another intuition: our goal is to compare distributions and see how different they are:
• Phraseness: estimate x y with bigram model or unigram model
• Informativeness: estimate with foreground vs background corpus
41
![Page 42: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/42.jpg)
The breakdown: what makes a good phrase
– Another intuition: our goal is to compare distributions and see how different they are:
• Phraseness: estimate x y with bigram model or unigram model
• Informativeness: estimate with foreground vs background corpus
– To compare distributions, use KL-divergence
“Pointwise KL divergence”
42
![Page 43: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/43.jpg)
The breakdown: what makes a good phrase
– To compare distributions, use KL-divergence
“Pointwise KL divergence”
Phraseness: difference between bigram and unigram language model in foreground
Bigram model: P(x y)=P(x)P(y|x) Unigram model: P(x y)=P(x)P(y)
43
![Page 44: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/44.jpg)
The breakdown: what makes a good phrase
– To compare distributions, use KL-divergence
“Pointwise KL divergence”
Informativeness: difference between foreground and background models
Bigram model: P(x y)=P(x)P(y|x) Unigram model: P(x y)=P(x)P(y)
44
![Page 45: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/45.jpg)
The breakdown: what makes a good phrase
– To compare distributions, use KL-divergence
“Pointwise KL divergence”
Combined: difference between foreground bigram model and background unigram model
Bigram model: P(x y)=P(x)P(y|x) Unigram model: P(x y)=P(x)P(y)
45
![Page 46: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/46.jpg)
The breakdown: what makes a good phrase
– To compare distributions, use KL-divergence
Combined: difference between foreground bigram model and background unigram model
Subtle advantages: • BLRT scores “more frequent in
foreground” and “more frequent in background” symmetrically, pointwise KL does not.
• Phrasiness and informativeness scores are more comparable – straightforward combination w/o a classifier is reasonable.
• Language modeling is well-studied: • extensions to n-grams, smoothing
methods, … • we can build on this work in a
modular way 46
![Page 47: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/47.jpg)
Pointwise KL, combined
47
![Page 48: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/48.jpg)
Why phrase-finding? • Phrases are where the standard supervised “bag
of words” representation starts to break. • There’s not supervised data, so it’s hard to see
what’s “right” and why • It’s a nice example of using unsupervised signals
to solve a task that could be formulated as supervised learning
• It’s a nice level of complexity, if you want to do it in a scalable way.
48
![Page 49: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/49.jpg)
Phrase Finding in Guinea Pig
49
![Page 50: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/50.jpg)
Phrase Finding 1 – counting words
background corpus
50
![Page 51: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/51.jpg)
Phrase Finding 2 – counting phrases
51
![Page 52: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/52.jpg)
Phrase Finding 3 – collecting info
dictionary: {‘statistic name’:value}
returns copy with a new key,value pair
52
![Page 53: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/53.jpg)
Phrase Finding 3 – collecting info
join fg and bg phrase counts and output a dict
join fg and bg count for first word “x” in “x y”
53
![Page 54: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/54.jpg)
Phrase Finding 3 – collecting info
join fg and bg count for word “y” in “x y”
54
![Page 55: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/55.jpg)
Phrase Finding 4 – totals
MAP
REDUCE
(‘const’,6743324)
55
![Page 56: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/56.jpg)
Phrase Finding 4 – totals
MAP COMBINE
56
![Page 57: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/57.jpg)
Phrase Finding 4 – totals
57
![Page 58: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/58.jpg)
Phrase Finding 4 – totals (map-side)
58
![Page 59: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/59.jpg)
Phrase Finding 5 – collect totals
59
![Page 60: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/60.jpg)
Phrase Finding 6 – compute
….
60
![Page 61: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/61.jpg)
Phrase Finding results Overall Phrasiness Only Top 100 phraseiness,
lo informativeness
61
![Page 62: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/62.jpg)
Phrase Finding results Overall
Top 100 informativeness, lo phraseiness
62
![Page 63: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/63.jpg)
The full phrase-finding pipeline
63
![Page 64: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/64.jpg)
The full phrase-finding pipeline
64
![Page 65: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/65.jpg)
The full phrase-finding pipeline
65
![Page 66: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/66.jpg)
Phrase Finding in PIG
66
![Page 67: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/67.jpg)
Phrase Finding 1 - loading the input
67
![Page 68: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/68.jpg)
…
68
![Page 69: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/69.jpg)
PIG Features
• comments -- like this /* or like this */ • ‘shell-like’ commands:
– fs -ls … -- any hadoop fs … command – some shorter cuts: ls, cp, … – sh ls -al -- escape to shell
69
![Page 70: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/70.jpg)
…
70
![Page 71: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/71.jpg)
PIG Features • comments -- like this /* or like this */ • ‘shell-like’ commands:
– fs -ls … -- any hadoop fs … command – some shorter cuts: ls, cp, … – sh ls -al -- escape to shell
• LOAD ‘hdfs-path’ AS (schema) – schemas can include int, double, … – schemas can include complex types: bag, map, tuple, …
• FOREACH alias GENERATE … AS …, … – transforms each row of a relation – operators include +, -, and, or, … – can extend this set easily (more later)
• DESCRIBE alias -- shows the schema • ILLUSTRATE alias -- derives a sample tuple
71
![Page 72: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/72.jpg)
Phrase Finding 1 - word counts
72
![Page 73: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/73.jpg)
73
![Page 74: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/74.jpg)
PIG Features • LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, … – transforms each row of a relation
• DESCRIBE alias/ ILLUSTRATE alias -- debugging • GROUP r BY x
– like a shufHle-sort: produces relation with Hields group and r, where r is a bag
74
![Page 75: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/75.jpg)
PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce
75
![Page 76: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/76.jpg)
PIG Features • LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, … • FOREACH alias GENERATE … AS …, …
– transforms each row of a relation • DESCRIBE alias/ ILLUSTRATE alias -- debugging • GROUP alias BY … • FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
– aggregates: COUNT, SUM, AVERAGE, MAX, MIN, … – you can write your own
76
![Page 77: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/77.jpg)
PIG parses and optimizes a sequence of commands before it executes them It’s smart enough to turn GROUP … FOREACH… SUM … into a map-reduce
77
![Page 78: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/78.jpg)
Phrase Finding 3 - assembling phrase- and word-level statistics
78
![Page 79: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/79.jpg)
79
![Page 80: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/80.jpg)
80
![Page 81: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/81.jpg)
PIG Features • LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, … • FOREACH alias GENERATE … AS …, …
– transforms each row of a relation • DESCRIBE alias/ ILLUSTRATE alias -- debugging • GROUP alias BY … • FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY Hield, s BY Hield, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
81
![Page 82: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/82.jpg)
Phrase Finding 4 - adding total frequencies
82
![Page 83: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/83.jpg)
83
![Page 84: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/84.jpg)
How do we add the totals to the phraseStats relation?
STORE triggers execution of the query plan….
it also limits optimization
84
![Page 85: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/85.jpg)
Comment: schema is lost when you store…. 85
![Page 86: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/86.jpg)
PIG Features • LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, … • FOREACH alias GENERATE … AS …, …
– transforms each row of a relation • DESCRIBE alias/ ILLUSTRATE alias -- debugging • GROUP alias BY … • FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY Hield, s BY Hield, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, … – use with care unless all but one of the relations are singleton – newer pigs allow singleton relation to be cast to a scalar
86
![Page 87: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/87.jpg)
Phrase Finding 5 - phrasiness and informativeness
87
![Page 88: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/88.jpg)
How do we compute some complicated function? With a “UDF”
88
![Page 89: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/89.jpg)
89
![Page 90: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/90.jpg)
PIG Features • LOAD ‘hdfs-path’ AS (schema)
– schemas can include int, double, bag, map, tuple, … • FOREACH alias GENERATE … AS …, …
– transforms each row of a relation • DESCRIBE alias/ ILLUSTRATE alias -- debugging • GROUP alias BY … • FOREACH alias GENERATE group, SUM(….)
– GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY Hield, s BY Hield, … – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, … – use with care unless all but one of the relations are singleton
• User de:ined functions as operators – also for loading, aggregates, …
90
![Page 91: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/91.jpg)
The full phrase-finding pipeline in PIG
91
![Page 92: Spark vs Hadoop - Carnegie Mellon School of Computer Sciencewcohen/10-605/workflow-4-and-phrases.pdf · 2016-09-22 · Spark details: broadcast 14 alternative: create a broadcast](https://reader033.vdocuments.us/reader033/viewer/2022042219/5ec583c0c00321700758d8d5/html5/thumbnails/92.jpg)
92