alan f. gates, olga natkovich, shubham chopra, pradeep kamath, shravan m. narayanamurthy,...

17
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Upload: karlee-ivy

Post on 01-Apr-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh

Srinivasan, Utkarsh Srivastava

Building a High-LevelDataflow System on top of Map-Reduce:The Pig Experience

Page 2: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 2 -

What is Pig?

• Procedural dataflow language (Pig Latin) for Map-Reduce• Provides standard relational transforms (group, join, filter,

sort, etc.)• Schemas are optional; if used, can be part of data or

specified at run time• User defined functions are first class citizens of the language

Page 3: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 3 -

An Example

You have a dataset urls: (url, category, pagerank)

You want to know the top 10 urls per category as measured by pagerank for sufficiently large categories:urls = load ‘dataset’

as (url, category, pagerank);grps = group urls by category;bgrps = filter grps by

COUNT(urls) > 1000000;rslt = foreach bgrps

generate group, top10(urls);store rslt into ‘myOutput’;

Page 4: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 4 -

Pig Latin = Sweet Spot between SQL & Map-Reduce

SQL Pig Map-Reduce

Programming style

Large blocks of declarative constraints

“Plug together pipes”

Built-in data manipulations

Group-by, Sort, Join, Filter, Aggregate, Top-k, etc...

Group-by, Sort

Execution model Fancy; trust the query optimizer

Simple, transparent

Opportunities for automatic optimization

Many

Few (logic buried in map() and reduce())

Data Schema Must be known at table creation

Not required, may be defined at runtime

Page 5: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 5 -

Building Pig

• Type System and Type Inference• Compilation to Map-Reduce Jobs• Plan Execution• Streaming; supporting user provided executables• Performance Measurements• Project Experience

Page 6: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 6 -

Map Reduce Overview

Page 7: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 7 -

From Pig Latin to Map Reduce

Parser

ScriptA = loadB = filterC = groupD = foreach

Logical PlanSemanticChecks

Logical PlanLogicalOptimizer

Logical Plan

Logical toPhysicalTranslatorPhysical Plan

PhysicalTo MRTranslator

MapReduceLauncher

Jar tohadoop

Map-Reduce Plan

Logical Plan ≈ relational algebra

Plan standard optimizations

Physical Plan = physical operators to be executed

Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

Page 8: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 8 -

Pig Latin to Logical Plan

A = load ‘users’ as (user, age);B = load ‘pageviews’ as (user, url);C = filter A by age < 18;D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);store F into ‘scored_urls’;

Pig Latin Logical Plan

load users

load pageviews

filter

join

group

foreach

store

Page 9: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 9 -

Group

(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)

D = join A by user, B by user;E = group D by url;F = foreach E generate group, CalcScore(url);

join

group

foreach

(yahoo.com, )(tim, 17),(joe, 15)

(ebay.com, (tim, 17) )

(yahoo.com, 0.95)(ebay.com, 0.90)

Page 10: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 10 -

Join

join cogroup

foreach(tim, 17, yahoo.com)(tim, 17, ebay.com)(joe, 15, yahoo.com)

(tim, yahoo.com)(tim, ebay.com)(joe, yahoo.com)

(tim, 17)(joe, 15)(bob, 11)

(tim, (17) )

(joe, (15) , (yahoo.com) )

(bob, (11) , )

(yahoo.com)(ebay.com)

load pageviews

filter

load pageviews

filter

Page 11: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 11 -

Join Implementations

• Default is symmetric hash join• Fragment-replicate for joining large and small inputs• Merge join for joining inputs sorted on join key• Skew join for handling inputs with significant skew in the join

key

Page 12: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 12 -

Logical to Physical Plan

Logical Plan

load users

load pageviews

filter

join

group

foreach

store

Physical Planload users load pageviews

filter

local rearrange

global rearrange

foreach

local rearrange

global rearrange

package

foreach

package

store

Page 13: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 13 -

Physical to Map-Reduce Plan

Physical Planload users load pageviews

filter

local rearrange

global rearrange

foreach

local rearrange

global rearrange

package

foreach

package

store

filter

local rearrange

foreach

package

local rearrange

package

foreach

Map-Reduce Plan

map

map

reduce

reduce

Page 14: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 14 -

Sharing Scans

load users

filter out bots

group by stategroup by

demographic

apply UDFs apply UDFs

store into ‘bystate’

store into ‘bydemo’

Page 15: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 15 -

Multiple Group Map-Reduce Plan

map filter

local rearrange

split

local rearrange

reduce

multiplexpackage package

foreach foreach

Page 16: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 16 -

Performance

Page 17: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh

- 17 -

Questions