pig, a high level data processing system on hadoop gang luo nov. 1, 2010
TRANSCRIPT
![Page 1: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/1.jpg)
Pig, a high level data processing system on Hadoop
Gang Luo
Nov. 1, 2010
![Page 2: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/2.jpg)
2
Agenda
Recap on MapReduce Introduction to Pig View of Pig from outside
Pig Latin View of Pig from inside
Plan generation Future work
![Page 3: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/3.jpg)
Recap on MapReduce
![Page 4: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/4.jpg)
4
MapReduce
Large scale data processing framework Map phase + Reduce phase Proposed at 2004 by Google Variance and extension in open source
community (Hadoop, Pig, Hive, etc.)
![Page 5: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/5.jpg)
5
MapReduce Programming Model Programmers think in a data-centric fashion
Apply transformations on data sets The MR framework handles the Hard Stuff
Fault tolerance Distributed execution, scheduling,
concurrency Coordination Network communication
![Page 6: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/6.jpg)
6
MapReduce System Model
Designed for batch-oriented computations over large data sets
Each operator runs to completion before producing any output
Operator output is written to stable storage Map output to local disk, reduce output to HDFS
Simple, elegant fault tolerance model: operator restart
Critical for large clusters
![Page 7: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/7.jpg)
7
'Hello World'
Word Count
map(doc_id, text) → list(word, count)
reduce(word, list(count)) → list(sum_count)
Combiner is optional
![Page 8: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/8.jpg)
8
'Hello World'
![Page 9: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/9.jpg)
9
Hadoop Hadoop MapReduce Execution Engine
Single master node, many worker nodes Client submits a job to master node Master splits each job into tasks (map/reduce), and
assigns tasks to worker nodes Hadoop Distributed File System (HDFS)
Single name node, many data nodes Files stored as large, fixed-size (e.g. 64MB) blocks HDFS typically holds map input and reduce output
![Page 10: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/10.jpg)
Introduction to Pig
![Page 11: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/11.jpg)
11
MapReduce not Good Enough? Restrict programming model
Only two phases Job chain for long data flow
Put the logic at the right phase Programmers are responsible for this
Too many lines of code even for simple logic How many lines do you have for word count?
![Page 12: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/12.jpg)
12
Pig to Rescure
High level dataflow language (Pig Latin) Much simpler than Java Simplify the data processing
Put the operations at the apropriate phases Chains multiple MR jobs
![Page 13: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/13.jpg)
13
Motivation by Example
Suppose we have user data in one file, website data in another file.
We need to find the top 5 most visited pages by users aged 18-25
![Page 14: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/14.jpg)
14
In MapReduce
![Page 15: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/15.jpg)
15
In Pig Latin
![Page 16: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/16.jpg)
16
Pig runs over Hadoop
![Page 17: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/17.jpg)
17
How Pig is used in Industry
At Yahoo, 70% MapReduce jobs are written in Pig
Used to Process web log Build user behavior models Process images Data mining
Also used by Twitter, Linkin, Ebay, AOL, etc.
![Page 18: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/18.jpg)
View of Pig from outside
![Page 19: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/19.jpg)
19
MapReduce vs. Pig
MaxTemperature
Year Temperature
Air Quality …
1998 87 2 …1983 93 4 ..2008 90 3 …2001 89 5 …1965 97 4 …
SELECT Year, MAX(Temperature)
FROM Table1
WHERE AirQuality = 0|1|4|5|9
GROUPBY Year
Table1
![Page 20: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/20.jpg)
20
In MapReduce
![Page 21: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/21.jpg)
21
In Pig
![Page 22: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/22.jpg)
22
Wait a mimute
How to map the data to records By default, one line → one record User can customize the loading process
How to identify attributes and map them to schema
Delimitor to separate different attributes By default, delimitor is tab. Customizable.
![Page 23: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/23.jpg)
23
MapReduce vs. Pig cont.
Join in MapReduce Various algorithms. None of them are easy to
implement in MapReduce Multi-way join more complicated Hard to integrate into SPJA workflow
![Page 24: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/24.jpg)
24
MapReduce vs. Pig cont.
Join in Pig Various algorithms already available. Some of them are generic to support multi-way join No need to consider integration into SPJA workflow. Pig
does that for you!
A = LOAD 'input/join/A';
B = LOAD 'input/join/B';
C = JOIN A BY $0, B BY $1;
DUMP C;
![Page 25: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/25.jpg)
25
Pig Latin
Data flow language User specify a sequence of operations to
process data More control on the process, compared with
declarative language Various data types supports Schema supports User defined functions supports
![Page 26: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/26.jpg)
26
Statement
A statement represents an operation, or a stage in the data flow.
Usually a variable is used to represent the result of the statement
Not limited to data processing operations, but also contains filesystem operations
![Page 27: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/27.jpg)
27
Schema
User can optionally define the schema of the input data
Onec the schema of the source data is given, all the schema of the intermediate relation will be induced by Pig
![Page 28: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/28.jpg)
28
Schema cont.
Why schema? Scripts are more readable (by alias) Help system validate the input
Similar to Database? Yes. But schema here is optional Schema is not fixed for a particular dataset,
but changable
![Page 29: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/29.jpg)
29
Schema cont. Schema 1
A = LOAD 'input/A' as (name:chararray, age:int);
B = FILTER A BY age != 20;
Schema 2A = LOAD 'input/A' as (name:chararray, age:chararray);
B = FILTER A BY age != '20';
No SchemaA = LOAD 'input/A' ;
B = FILTER A BY A.$1 != '20';
![Page 30: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/30.jpg)
30
Date Types
Every attribute can always interpreted as bytearray, without further type definition
Simple data types For each attribute Defined by user in the schema Int, double, chararray ...
Complex data types Usually contructed by relational operations Tuple, bag, map
![Page 31: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/31.jpg)
31
Date Types cont.
Type casting Pig will try to cast data types when meets type
inconsistency. Warning will be thrown if casting fails. Process
still goes on Validation
Null will replace the inconvertable data type in type casting
User can tell a corrupted record by detecting whether a particular attribute is null
![Page 32: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/32.jpg)
32
Date Types cont.
![Page 33: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/33.jpg)
33
Operators
Relational Operators Represent an operation that will be added to
the logical plan LOAD, STORE, FILTER, JOIN,
FOREACH...GENERATE
![Page 34: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/34.jpg)
34
Operators
Diagnostic Operators Show the status/metadata of the relations Used for debugging Will not be integrated into execution plan DESCRIBE, EXPLAIN, ILLUSTRATE.
![Page 35: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/35.jpg)
35
Functions Eval Functions
Record transformation Filter Functions
Test whether a record satisfy particular predicate Comparison Functions
Impose ordering between two records. Used by ORDER operation
Load Functions Specify how to load data into relations
Store Functions Specify how to store relations to external storage
![Page 36: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/36.jpg)
36
Functions
Built in Functions Hard-coded routines offered by Pig.
User Defined Function (UDF) Supports customized functionalities Piggy Bank, a warehouse for UDFs. Re-think about Word Count in Pig
![Page 37: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/37.jpg)
View of Pig from inside
![Page 38: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/38.jpg)
38
Pig Execution Modes
Local mode Launch single JVM Access local file system No MR job running
Hadoop mode Execute a sequence of MR jobs Pig interacts with Hadoop master node
![Page 39: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/39.jpg)
39
CompilationCompilation
![Page 40: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/40.jpg)
4004/13/10
Parsing
Type checking with schema
References verification
Logic plan generatingOne-to-one fashionIndependent of execution platformLimited optimizationNo execution until DUMP or STORE
Parsing
![Page 41: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/41.jpg)
4104/13/10
Logic Plan
A=LOAD 'file1' AS (x, y, z);
B=LOAD 'file2' AS (t, u, v);
C=FILTER A by y > 0;
D=JOIN C BY x, B BY u;
E=GROUP D BY z;
F=FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
LOAD
FILTER
LOAD
JOIN
GROUP
FOREACH
STORE
Logical Plan
![Page 42: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/42.jpg)
4204/13/10
Physical Plan
1:1 correspondence with most logical operators
Except for:DISTINCT(CO)GROUPJOINORDER
Physical Plan
![Page 43: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/43.jpg)
4304/13/10
LOAD
FILTER
LOAD
JOIN
GROUP
FOREACH
STORE
LOAD
FILTER
LOAD
LOCAL REARRANGE
PACKAGE
FOREACH
STORE
GLOBAL REARRANGE
LOCAL REARRANGE
PACKAGE
FOREACH
GLOBAL REARRANGE
![Page 44: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/44.jpg)
4404/13/10
Physical Optimization
Always use combiner for pre-aggregation
Insert SPLIT to re-use intermediate result
Early projection
Physical Oprimizations
![Page 45: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/45.jpg)
4504/13/10
MapReduce Plan
Determine MapReduce boundaries
GLOBAL REARRANGE
STORE/LOAD
Some operations are done by MapReduce framework
Coalesce other operators into Map & Reduce stages
Generate job jar file
MapReduce Plan
![Page 46: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/46.jpg)
4604/13/10
LOAD
FILTER
LOAD
LOCAL REARRANGE
PACKAGE
FOREACH
STORE
GLOBAL REARRANGE
LOCAL REARRANGE
PACKAGE
FOREACH
GLOBAL REARRANGE
FILTER
LOCAL REARRANGE
Map
Reduce
Map
Reduce
PACKAGE
FOREACH
LOCAL REARRANGE
PACKAGE
FOREACH
![Page 47: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/47.jpg)
47
Execution in Hadoop Mode
The MR jobs depending on nothing in the MR plan will be submitted for execution
MR jobs will be removed from MR plan after completion Depending jobs are now available for execution
Currently, no supports for inter-job fault-tolerance
![Page 48: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/48.jpg)
Performance and future improvement
![Page 49: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/49.jpg)
49
Pig Performance
Images from http://wiki.apache.org/pig/PigTalksPapers
![Page 50: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/50.jpg)
50
Future Improvements
Query optimization Currently rule-based optimizer for plan rearrangement
and join selection Cost-based in the future
Non-Java UDFs Grouping and joining on pre-partitioned/sorted data
Avoid data shuffling for grouping and joining Building metadata facilities to keep track of data layout
Skew handling For load balancing
![Page 51: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/51.jpg)
51
Get more information at Pig website You can touch the source code to implement
something new in Pig Also take a look at Hive, a similar system
from Facebook
![Page 52: Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010](https://reader036.vdocuments.us/reader036/viewer/2022062323/5697bfad1a28abf838c9bd94/html5/thumbnails/52.jpg)
52
References
Some of the content come from the following presentations: Introduction to data processing using Hadoop
and Pig, by Ricardo Varela Pig, Making Hadoop Easy, by Alan F. Gates Large-scale social media analysis with Hadoop,
by Jake Hofman Getting Started on Hadoop, by Paco Nathan MapReduce Online, by Tyson Condie and Neil
Conway