alan f. gates, olga natkovich, shubham chopra, pradeep kamath, shravan m. narayanamurthy,...

Post on 01-Apr-2015

212 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh

Srinivasan, Utkarsh Srivastava

Building a High-LevelDataflow System on top of Map-Reduce:The Pig Experience

- 2 -

What is Pig?

• Procedural dataflow language (Pig Latin) for Map-Reduce• Provides standard relational transforms (group, join, filter,

sort, etc.)• Schemas are optional; if used, can be part of data or

specified at run time• User defined functions are first class citizens of the language

- 3 -

An Example

You have a dataset urls: (url, category, pagerank)

You want to know the top 10 urls per category as measured by pagerank for sufficiently large categories:urls = load ‘dataset’

as (url, category, pagerank);grps = group urls by category;bgrps = filter grps by

COUNT(urls) > 1000000;rslt = foreach bgrps

generate group, top10(urls);store rslt into ‘myOutput’;

- 4 -

Pig Latin = Sweet Spot between SQL & Map-Reduce

SQL Pig Map-Reduce

Programming style

Large blocks of declarative constraints

“Plug together pipes”

Built-in data manipulations

Group-by, Sort, Join, Filter, Aggregate, Top-k, etc...

Group-by, Sort

Execution model Fancy; trust the query optimizer

Simple, transparent

Opportunities for automatic optimization

Many

Few (logic buried in map() and reduce())

Data Schema Must be known at table creation

Not required, may be defined at runtime

- 5 -

Building Pig

• Type System and Type Inference• Compilation to Map-Reduce Jobs• Plan Execution• Streaming; supporting user provided executables• Performance Measurements• Project Experience

- 6 -

Map Reduce Overview

- 7 -

From Pig Latin to Map Reduce

Parser

ScriptA = loadB = filterC = groupD = foreach

Logical PlanSemanticChecks

Logical PlanLogicalOptimizer

Logical Plan

Logical toPhysicalTranslatorPhysical Plan

PhysicalTo MRTranslator

MapReduceLauncher

Jar tohadoop

Map-Reduce Plan

Logical Plan ≈ relational algebra

Plan standard optimizations

Physical Plan = physical operators to be executed

Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

- 8 -

Pig Latin to Logical Plan

A = load ‘users’ as (user, age);B = load ‘pageviews’ as (user, url);C = filter A by age < 18;D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);store F into ‘scored_urls’;

Pig Latin Logical Plan

load users

load pageviews

filter

join

group

foreach

store

- 9 -

Group

(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)

D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);

join

group

foreach

(yahoo.com, )(tim, 17),(joe, 15)

(ebay.com, (tim, 17) )

(yahoo.com, 0.95)(ebay.com, 0.90)

- 10 -

Join

join cogroup

foreach(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)

(tim, yahoo.com)(tim, ebay.com)(joe, yahoo.com)

(tim, 17)(joe, 15)(bob, 11)

(tim, (17) )

(joe, (15) , (yahoo.com) )

(bob, (11) , )

(yahoo.com)(ebay.com)

load pageviews

filter

load pageviews

filter

- 11 -

Join Implementations

• Default is symmetric hash join• Fragment-replicate for joining large and small inputs• Merge join for joining inputs sorted on join key• Skew join for handling inputs with significant skew in the join

key

- 12 -

Logical to Physical Plan

Logical Plan

load users

load pageviews

filter

join

group

foreach

store

Physical Planload users load pageviews

filter

local rearrange

global rearrange

foreach

local rearrange

global rearrange

package

foreach

package

store

- 13 -

Physical to Map-Reduce Plan

Physical Planload users load pageviews

filter

local rearrange

global rearrange

foreach

local rearrange

global rearrange

package

foreach

package

store

filter

local rearrange

foreach

package

local rearrange

package

foreach

Map-Reduce Plan

map

map

reduce

reduce

- 14 -

Sharing Scans

load users

filter out bots

group by stategroup by

demographic

apply UDFs apply UDFs

store into ‘bystate’

store into ‘bydemo’

- 15 -

Multiple Group Map-Reduce Plan

map filter

local rearrange

split

local rearrange

reduce

multiplexpackage package

foreach foreach

- 16 -

Performance

- 17 -

Questions

top related