alan f. gates, olga natkovich, shubham chopra, pradeep kamath, shravan m. narayanamurthy,...
TRANSCRIPT
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh
Srinivasan, Utkarsh Srivastava
Building a High-LevelDataflow System on top of Map-Reduce:The Pig Experience
- 2 -
What is Pig?
• Procedural dataflow language (Pig Latin) for Map-Reduce• Provides standard relational transforms (group, join, filter,
sort, etc.)• Schemas are optional; if used, can be part of data or
specified at run time• User defined functions are first class citizens of the language
- 3 -
An Example
You have a dataset urls: (url, category, pagerank)
You want to know the top 10 urls per category as measured by pagerank for sufficiently large categories:urls = load ‘dataset’
as (url, category, pagerank);grps = group urls by category;bgrps = filter grps by
COUNT(urls) > 1000000;rslt = foreach bgrps
generate group, top10(urls);store rslt into ‘myOutput’;
- 4 -
Pig Latin = Sweet Spot between SQL & Map-Reduce
SQL Pig Map-Reduce
Programming style
Large blocks of declarative constraints
“Plug together pipes”
Built-in data manipulations
Group-by, Sort, Join, Filter, Aggregate, Top-k, etc...
Group-by, Sort
Execution model Fancy; trust the query optimizer
Simple, transparent
Opportunities for automatic optimization
Many
Few (logic buried in map() and reduce())
Data Schema Must be known at table creation
Not required, may be defined at runtime
- 5 -
Building Pig
• Type System and Type Inference• Compilation to Map-Reduce Jobs• Plan Execution• Streaming; supporting user provided executables• Performance Measurements• Project Experience
- 6 -
Map Reduce Overview
- 7 -
From Pig Latin to Map Reduce
Parser
ScriptA = loadB = filterC = groupD = foreach
Logical PlanSemanticChecks
Logical PlanLogicalOptimizer
Logical Plan
Logical toPhysicalTranslatorPhysical Plan
PhysicalTo MRTranslator
MapReduceLauncher
Jar tohadoop
Map-Reduce Plan
Logical Plan ≈ relational algebra
Plan standard optimizations
Physical Plan = physical operators to be executed
Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages
- 8 -
Pig Latin to Logical Plan
A = load ‘users’ as (user, age);B = load ‘pageviews’ as (user, url);C = filter A by age < 18;D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);store F into ‘scored_urls’;
Pig Latin Logical Plan
load users
load pageviews
filter
join
group
foreach
store
- 9 -
Group
(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)
D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);
join
group
foreach
(yahoo.com, )(tim, 17),(joe, 15)
(ebay.com, (tim, 17) )
(yahoo.com, 0.95)(ebay.com, 0.90)
- 10 -
Join
join cogroup
foreach(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)
(tim, yahoo.com)(tim, ebay.com)(joe, yahoo.com)
(tim, 17)(joe, 15)(bob, 11)
(tim, (17) )
(joe, (15) , (yahoo.com) )
(bob, (11) , )
(yahoo.com)(ebay.com)
load pageviews
filter
load pageviews
filter
- 11 -
Join Implementations
• Default is symmetric hash join• Fragment-replicate for joining large and small inputs• Merge join for joining inputs sorted on join key• Skew join for handling inputs with significant skew in the join
key
- 12 -
Logical to Physical Plan
Logical Plan
load users
load pageviews
filter
join
group
foreach
store
Physical Planload users load pageviews
filter
local rearrange
global rearrange
foreach
local rearrange
global rearrange
package
foreach
package
store
- 13 -
Physical to Map-Reduce Plan
Physical Planload users load pageviews
filter
local rearrange
global rearrange
foreach
local rearrange
global rearrange
package
foreach
package
store
filter
local rearrange
foreach
package
local rearrange
package
foreach
Map-Reduce Plan
map
map
reduce
reduce
- 14 -
Sharing Scans
load users
filter out bots
group by stategroup by
demographic
apply UDFs apply UDFs
store into ‘bystate’
store into ‘bydemo’
- 15 -
Multiple Group Map-Reduce Plan
map filter
local rearrange
split
local rearrange
reduce
multiplexpackage package
foreach foreach
- 16 -
Performance
- 17 -
Questions