apache pig -- pittsburghhug
DESCRIPTION
Presentation on Apache Pig for the Pittsburgh Hadoop User Group. Intro to language, join algorithm descriptions, upcoming features, pie-in-the-sky research ideas.TRANSCRIPT
![Page 1: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/1.jpg)
Apache Pig
Pittsburgh Hadoop User Group 11/3/2009
Dmitriy RyaboyAshutosh Chauhan
![Page 2: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/2.jpg)
In This Talk
• What is Pig and why it’s needed– This part is going to be brief.– It’s a Hadoop User Group, after all.
• Examples– As well as some extra motivation
• Advanced Features• Improvements currently in development• Interesting research problems
– Want to get involved?
![Page 3: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/3.jpg)
What is Pigor, “duality of Pig”
Pig compiles data
analysis tasks into
Map-Reduce jobs and
runs them on Hadoop.
Pig Latin is a language
for expressing data
transformation flows.
Pig can be made to understand other languages, too.
There is an SQL prototype. Just a question of compiling a language into internal operator tree.
See: Smith, Agent. The Matrix Trilogy, Warner Bros.
![Page 4: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/4.jpg)
Who Uses Pig
• Yahoo – “40% of all Hadoop jobs are run with Pig”
• Twitter – “Some Java-based MapReduce, some
Hadoop Streaming”– “Most analysis, and most interesting
analysis, done in Pig”
• LinkedIn, AOL, CoolIris, Ning, eBuddy …
![Page 5: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/5.jpg)
Why use Pig
• Express data transformation tasks in a few lines.• “Reach out and touch a Petabyte.”
Image credit: Michelangelo, with apologies.
![Page 6: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/6.jpg)
Pig Latin Example
![Page 7: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/7.jpg)
Why not plain Map/Reduce?
Brevity.
![Page 8: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/8.jpg)
Corresponding Pig Script
![Page 9: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/9.jpg)
Why Not SQL
• Pig Latin allows expressing transformations as a sequence of steps. SQL describes desired outcome.
• Writing down a sequence of steps is intuitive to developers.– Allows expressing more complex data flows– But still no loops or conditionals.
• Support for nested structures (arrays, maps)• Much easier to work with groups
– Let me show you…
![Page 10: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/10.jpg)
Top 5 scores for each player
• Data: < playerId, score, date >• For each player
• Best 5 scores• Date each of these scores was achieved
• Common task; this particular one lifted from http://stackoverflow.com/questions/1467898
• Classically painful in SQL
![Page 11: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/11.jpg)
SQL
• Self-join• Data explosion for same values of tblAbc.aa• Readability?• I like SQL. But this is far from straightforward.
![Page 12: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/12.jpg)
SQL
We can fix these problems, by going procedural.
There goes the brevity and declarativity.
![Page 13: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/13.jpg)
![Page 14: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/14.jpg)
Pig Latin
![Page 15: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/15.jpg)
Diving Deeper
http://www.flickr.com/photos/lollaping/2573664394/
![Page 16: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/16.jpg)
User-Defined Functions
• Java Interfaces– Reading / Writing Data
• Allows reading from DBs, HBase, custom file formats• Significant API overhaul underway.
– Transformation / Evaluation of Tuples– Group Aggregation
• See Piggybank for exampleshttp://svn.apache.org/viewvc/hadoop/pig/trunk/contrib/piggybank/
• Support for other languages in the workshttps://issues.apache.org/jira/browse/PIG-928
![Page 17: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/17.jpg)
Streaming
![Page 18: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/18.jpg)
• IO Cost of scanning the data dominates most jobs
• Share data scans • This example
requires separate pipelines for state and demographics
• Or does it?
Multiquery Optimizationload usersload users
filter out botsfilter out bots
group by stategroup by state group by demographic
group by demographic
apply UDFsapply UDFs apply UDFsapply UDFs
store into ‘bystate’store into ‘bystate’
store into ‘bydemo’store into ‘bydemo’
Slide credit: Alan Gates et al, VLDB 2009
![Page 19: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/19.jpg)
Multiquery Optimization
• Tag each record with the pipeline it belongs to
• Regular Hadoop grouping on [pipeline #, key]
• Multiplexer in reduce stage sends records to appropriate pipeline
• Use a single script to compute many things from shared sources.
mapmap filterfilter
local rearrangelocal rearrange
splitsplit
local rearrangelocal rearrange
reducereduce
multiplexmultiplexpackagepackage packagepackage
foreachforeach foreachforeach
Slide credit: Alan Gates et al, VLDB 2009
![Page 20: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/20.jpg)
Hash Join
• Default algorithm • Mapper emits ([key, relid], tuple)• Reducer sees all tuples from each relation with the same
key, performs join– easy, due to sorted order
• Optimization: entries in the last relation can be streamed through (not accumulated in memory for each key)
• Can join multiple tables, supports outer joins• If one relation has many duplicate keys, put it last!
![Page 21: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/21.jpg)
Fragment-Replicate Join
• Join large table A with 1 or more small tables (B, C)
• Fragment large table across mappers• Replicate smaller tables in entirety to all
mappers
• Supports multiple tables• Use when all small tables together can
fit in memory on a single map task
Map
3
M
ap2
Map
1
A B
C
B
B
C
C
![Page 22: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/22.jpg)
Merge Join
• 2 inputs, both sorted on join key • Build sparse index into B• Partition A across mappers• Use index to read B from
corresponding block on each mapper
• Bounded buffering on both sides – least memory intensive
• Significant speed gains, especially on skewed data
Index
M
ap3
M
ap2
M
ap1
![Page 23: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/23.jpg)
Skewed Join
• 1+ keys with more tuples per key than fit in reducer memory, in both relations
• Build histogram based on sample
• Round-robin skewed keys into multiple reducers
• Stream other table, sending each record to all reducers responsible for key
http://wiki.apache.org/pig/PigSkewedJoinSpec
![Page 24: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/24.jpg)
In Development
• Release 0.5 : Hadoop 20 compatibility– Or apply a patch, or get Cloudera distro– Imminent
• Release 0.6 (or later)– Load/Store redesign. See PIG-966– UDFs in other languages, see PIG-923– Columnar Storage, see Zebra in /contrib– Stored Schemas, see PIG-760– Metadata service (“Owl”), see PIG-823– SQL compiler, see PIG-824– Auto selection of joins when stats are known, see us
![Page 25: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/25.jpg)
Research Problemsaka, when pigs fly
• Extending the language– Functions– Conditional logic– Loops
• Smarter data storage– Learn from workload
• auto-selected column families• different sort orders• overlapping projections
– Go Faster• pushing queries to storage• lazy decompression
![Page 26: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/26.jpg)
Research Problemsaka, when pigs fly
• Adaptive optimization– Change plan mid-flight based on observation of current
environment– Can we do this without waiting for MR stage to finish?
• Continuous Queries, Approximate Early Results– Initial work at Berkeley: Map-Reduce Online *
• Cost-Based Optimization– Appropriate cost model?– balance of disk IO, network traffic, task overhead, memory
limitations on individual tasks…– Different from, but similar to, distributed DBs
* http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html
![Page 27: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/27.jpg)
Further Reading
• Gates et al, Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience” VLDB 2009http://infolab.stanford.edu/~olston/publications/vldb09.pdf
• Olston et al, “Generating example data for dataflow programs” SIGMOD 2009 (best paper) http://infolab.stanford.edu/~olston/publications/sigmod09.pdf
• Olston et al, “Pig Latin: A not-so-foreign language for data processing” SIGMOD 2008http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
• Olston et al, “Automatic optimization of parallel dataflow programs” USENIX 2008 http://infolab.stanford.edu/~olston/publications/usenix08.pdf
• More: http://wiki.apache.org/pig/PigTalksPapers
![Page 28: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/28.jpg)
Performanceaverage over 12 benchmark queries
Pig speed as factor of Hadoop time (average runtime)
7.6
2.5
1.81.6 1.5 1.4
1.221
alpha 11/21/2008 1/20/2009 2/23/2009 5/28/2009 7/28/2009 8/27/2009 10/18/2009
![Page 29: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/29.jpg)
Performanceweighted average over 12 benchmark queries
Pig speed as factor of Hadoop time (weighted average)
11.20
3.26
2.201.97 1.83 1.68 1.53
1.04
alpha 11/21/2008 1/20/2009 2/23/2009 5/28/2009 7/28/2009 8/27/2009 10/18/2009
![Page 30: Apache Pig -- PittsburghHug](https://reader036.vdocuments.us/reader036/viewer/2022081719/54694c81af79593b558b4b6c/html5/thumbnails/30.jpg)
Queries?
http://hadoop.apache.org/pig
@squarecog