Alan Gates Becoming a Pig Developer

Post on 17-Jan-2016


- 2 -

Who Am I?

• Pig committer
• Hadoop PMC Member
• Yahoo! architect for Pig

- 3 -

Current Status

• Release 0.3, June 2009
  – Multi-store queries
• Pig added to Amazon Elastic MapReduce, August 2009
• Release 0.4, September 2009
  – Added skew and merge join
  – Added outer join (for default hash join only)
• Release 0.5, November 2009
  – Hadoop 0.20

- 4 -

Components

[Diagram: user machine and Hadoop cluster]

• Pig resides on the user machine
• The job executes on the cluster
• No need to install anything extra on your Hadoop cluster

- 5 -

How It Works

Script:
A = load
B = filter
C = group
D = foreach

Pipeline: Script → Parser → Logical Plan → Semantic Checks → Logical Plan → Logical Optimizer → Logical Plan → Logical to Physical Translator → Physical Plan → Physical to MR Translator → Map-Reduce Plan → MapReduce Launcher → Jar to Hadoop

• Logical Plan ≈ relational algebra
• Logical Optimizer = standard plan optimizations
• Physical Plan = physical operators to be executed
• Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

- 6 -

Fragment Replicate Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "replicated";

[Diagram: Pages is split into blocks. Map 1 reads Pages block 1 and Map 2 reads Pages block 2; each map also holds a full copy of Users.]
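The diagram above can be mimicked in a few lines of Python. This is only a sketch of the idea (a hash table built from the small input, probed while streaming one block of the large input), not Pig's actual implementation; the function name `replicated_join` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def replicated_join(pages_block, users):
    """Join one block of Pages against an in-memory copy of Users (map side only)."""
    # Build a hash table on the small input; every map task would hold its own copy.
    users_by_name = defaultdict(list)
    for name, age in users:
        users_by_name[name].append((name, age))
    # Stream the large input and probe the table; no shuffle or reduce is needed.
    joined = []
    for user, url in pages_block:
        for match in users_by_name.get(user, []):
            joined.append((user, url) + match)
    return joined

# Hypothetical sample data: Users is small, Pages arrives in two blocks.
users = [("fred", 25), ("jane", 31)]
pages_blocks = [
    [("fred", "a.com"), ("jane", "b.com")],  # block read by map 1
    [("fred", "c.com"), ("zach", "d.com")],  # block read by map 2
]
# Each "map task" joins its own block independently of the others.
jnd = [row for block in pages_blocks for row in replicated_join(block, users)]
```

Because Users has to fit in memory on every map task, this strategy only suits joins where one input is small.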

- 7 -

Hash Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;

[Diagram: Map 1 reads Pages block n and Map 2 reads Users block m. Map outputs are tagged by input, (1, user) and (2, name), and shuffled on the join key. Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
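As a rough illustration of the flow in the diagram, the following Python sketch tags each record with its input, routes it to a reducer by hashing the join key, and crosses the two inputs per key on the reduce side. The helper names (`map_phase`, `shuffle`, `reduce_phase`), the choice of which input gets tag 1, and the sample rows are assumptions for this example, not Pig internals.

```python
from collections import defaultdict

def map_phase(records, tag, key_index):
    """Emit (join key, (input tag, record)) pairs, as a map task would."""
    return [(rec[key_index], (tag, rec)) for rec in records]

def shuffle(pairs, num_reducers):
    """Route every pair to a reducer chosen by hashing the join key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[hash(key) % num_reducers][key].append(value)
    return reducers

def reduce_phase(groups):
    """For each key, cross the records tagged 1 with the records tagged 2."""
    out = []
    for key in sorted(groups):
        left = [rec for tag, rec in groups[key] if tag == 1]
        right = [rec for tag, rec in groups[key] if tag == 2]
        out.extend(l + r for l in left for r in right)
    return out

# Hypothetical sample data; in this sketch Users is tagged 1 and Pages is tagged 2.
users = [("fred", 25), ("jane", 31)]
pages = [("fred", "a.com"), ("fred", "b.com"), ("jane", "c.com"), ("zach", "d.com")]

pairs = map_phase(users, 1, 0) + map_phase(pages, 2, 0)
jnd = sorted(row for groups in shuffle(pairs, 2) for row in reduce_phase(groups))
```

All records for a given key land on the same reducer, which is what makes the per-key cross product correct, and also what makes a single very frequent key a bottleneck.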

- 8 -

Skew Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "skewed";

[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Each map output passes through SP. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
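The diagram shows the rows for one heavy key (fred) on the Pages side being divided between two reducers, with the matching Users row sent to both. A minimal Python sketch of that idea follows. It counts keys exactly where a real system would sample, and the function name `skew_join`, the `hot_threshold` parameter, and the toy `home` partitioner are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import count

def skew_join(pages, users, num_reducers, hot_threshold):
    """Join Pages with Users, spreading hot Pages keys over all reducers."""
    # Step 1: find hot keys (exact counts here; sampling would be used at scale).
    key_counts = Counter(user for user, _ in pages)
    hot = {k for k, n in key_counts.items() if n >= hot_threshold}

    reducers = [{"pages": defaultdict(list), "users": defaultdict(list)}
                for _ in range(num_reducers)]
    round_robin = count()

    def home(key):
        # Toy deterministic partitioner for keys that are not hot.
        return sum(key.encode()) % num_reducers

    # Step 2: hot Pages rows go round-robin; other rows go to their home reducer.
    for row in pages:
        key = row[0]
        r = next(round_robin) % num_reducers if key in hot else home(key)
        reducers[r]["pages"][key].append(row)

    # Step 3: Users rows with a hot key are copied to every reducer.
    for row in users:
        key = row[0]
        targets = range(num_reducers) if key in hot else [home(key)]
        for r in targets:
            reducers[r]["users"][key].append(row)

    # Step 4: each reducer joins what it received; record how many rows it produced.
    out, load = [], []
    for red in reducers:
        rows = [p + u for key, ps in red["pages"].items()
                for p in ps for u in red["users"].get(key, [])]
        load.append(len(rows))
        out.extend(rows)
    return sorted(out), load

# Hypothetical sample data with one heavy key.
pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 31)]
jnd, load = skew_join(pages, users, num_reducers=2, hot_threshold=3)
```

The join result is the same as a plain hash join would give; only the distribution of work changes, at the cost of replicating the Users rows for hot keys.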

- 9 -

Merge Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "merge";

[Diagram: Pages and Users are both sorted on the join key (aaron ... zach). Map 1 reads the Pages range aaron…amr and reads Users starting at aaron…; Map 2 reads the Pages range amy…barb and reads Users starting at amy….]
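When both inputs are already sorted on the join key, the join reduces to a single forward pass over each. The Python sketch below shows that pass for one pair of sorted ranges; it is an illustration of the general sort-merge technique rather than Pig's code, and `merge_join` and the sample rows are invented for the example.

```python
def merge_join(left, right):
    """Join two lists of tuples that are already sorted on their first field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # no match for this left row; advance the left side
        elif lkey > rkey:
            j += 1  # no match for this right row; advance the right side
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lkey:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out

# Hypothetical sample data, sorted on the join key as the slide requires.
pages = [("aaron", "a.com"), ("amr", "b.com"), ("amy", "c.com"),
         ("amy", "d.com"), ("zach", "e.com")]
users = [("aaron", 20), ("amy", 30), ("barb", 40)]
jnd = merge_join(pages, users)
```

Neither input is held in memory or shuffled, which is why this approach depends entirely on the inputs arriving sorted.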

- 10 -

Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';

[Diagram: load users → filter nulls, then two branches:
  group by age, gender → apply UDFs → store into 'bydemo'
  group by state → apply UDFs → store into 'bystate']
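The point of a multi-store script is that the shared prefix (load and filter) runs once and feeds both branches. A small Python sketch of that single-scan behavior, with a made-up function name `multi_store` and made-up rows, might look like this:

```python
from collections import Counter

def multi_store(users):
    """Scan the input once and compute both groupings from the same pass."""
    by_demo = Counter()   # plays the role of the 'bydemo' output
    by_state = Counter()  # plays the role of the 'bystate' output
    for name, age, gender, city, state in users:
        if name is None:
            continue  # shared filter: applied once, before the branches split
        by_demo[(age, gender)] += 1
        by_state[state] += 1
    return dict(by_demo), dict(by_state)

# Hypothetical sample data; the third row is dropped by the null filter.
users = [
    ("fred", 25, "m", "SF", "CA"),
    ("jane", 31, "f", "LA", "CA"),
    (None, 40, "m", "NY", "NY"),
    ("amy", 25, "m", "NYC", "NY"),
]
by_demo, by_state = multi_store(users)
```

Running the two stores as separate scripts would read and filter the input twice; sharing the scan is where the savings come from.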

- 11 -

Multi-Store Map-Reduce Plan

map: filter → split → local rearrange, local rearrange

reduce: multiplex → package, package → foreach, foreach

- 12 -

Basic User Defined Functions

A = load 'users';
B = group A all;
C = foreach B generate COUNT(A);

long exec(bag b) {
    return b.size();
}

Executes in: Reduce

- 13 -

Algebraic User Defined Functions

A = load 'users';
B = group A all;
C = foreach B generate COUNT(A);

Initial (Map):
long exec(tuple t) {
    return 1;
}

Intermediate (Combine):
long exec(bag b) {
    long sum = 0;
    for (long s : b) {
        sum += s;
    }
    return sum;
}

Final (Reduce):
long exec(bag b) {
    long sum = 0;
    for (long s : b) {
        sum += s;
    }
    return sum;
}
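To see why splitting COUNT into three stages gives the same answer as counting in the reducer, here is a short Python simulation. The functions `initial`, `intermediate`, and `final` mirror the three stages on the slide; `algebraic_count`, the `combine_batch` parameter, and the sample splits are assumptions added for the demonstration.

```python
def initial(record):
    """Map stage: each record contributes a partial count of 1."""
    return 1

def intermediate(partials):
    """Combine stage: add up partial counts; may run on any subset."""
    return sum(partials)

def final(partials):
    """Reduce stage: add up whatever partial counts arrive."""
    return sum(partials)

def algebraic_count(map_splits, combine_batch):
    """Simulate map, combine, and reduce over several input splits."""
    reducer_input = []
    for split in map_splits:
        ones = [initial(rec) for rec in split]
        # The combiner sees arbitrary batches of map output from one task.
        for start in range(0, len(ones), combine_batch):
            reducer_input.append(intermediate(ones[start:start + combine_batch]))
    return final(reducer_input)

# Hypothetical sample data: two map splits holding five records in total.
splits = [["fred", "jane", "amy"], ["barb", "zach"]]
total = algebraic_count(splits, combine_batch=2)
```

The result does not depend on how the combiner batches its input, which is the property that lets most of the work move out of the reducer.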

- 14 -

Accumulative User Defined Functions

A = load 'users' as (name, url, timestamp);
B = group A by name;
C = foreach B {
    D = order A by timestamp;
    generate SessionAnalysis(D);
}

public interface Accumulator <T> {
    public void accumulate(List<Tuple> b);
    public T getValue();
}

Executes in: Reduce
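The slide does not say what SessionAnalysis computes, so the Python sketch below invents a stand-in: a class that counts sessions, where a new session starts whenever the gap between consecutive timestamps exceeds a limit. The class name `SessionCount`, the `gap` parameter, and the helper `feed_in_batches` are all hypothetical. What the sketch is meant to show is the accumulate/getValue pattern: records arrive in batches, and only a small amount of state is kept between batches.

```python
class SessionCount:
    """Hypothetical accumulator: counts sessions in timestamp-ordered records."""

    def __init__(self, gap):
        self.gap = gap        # largest allowed pause inside one session
        self.last_ts = None   # timestamp of the previous record seen
        self.sessions = 0

    def accumulate(self, batch):
        """Consume one batch of (name, url, timestamp) records."""
        for _name, _url, ts in batch:
            if self.last_ts is None or ts - self.last_ts > self.gap:
                self.sessions += 1  # pause too long: a new session starts
            self.last_ts = ts

    def get_value(self):
        """Return the result once every batch has been consumed."""
        return self.sessions

def feed_in_batches(acc, rows, batch_size):
    """Hand the rows to the accumulator a few at a time, never all at once."""
    for start in range(0, len(rows), batch_size):
        acc.accumulate(rows[start:start + batch_size])
    return acc.get_value()

# Hypothetical sample data for one user, already ordered by timestamp.
rows = [("fred", "a", 0), ("fred", "b", 10), ("fred", "c", 100),
        ("fred", "d", 105), ("fred", "e", 500)]
sessions = feed_in_batches(SessionCount(gap=30), rows, batch_size=2)
```

Because the whole bag for a key never has to be materialized, a function written this way can handle groups that are too large to hold in memory.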

- 15 -

Performance Tips

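• Project early and often
• Use the parallel clause to set the number of reducers
• Filter out nulls before a join
• For integer arithmetic, declare types in the load schema (untyped numeric fields are treated as doubles)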

- 16 -

Performance

[Chart: performance by release; x-axis: 0.1, 0.2, 0.3, 0.4/0.5, trunk]

- 17 -

Upcoming Features

• Redesign of load and store function interfaces
• Adding outer join to all join types
• UDFs in Python and Ruby
• Changing spilling strategy to avoid running out of memory
• Adding Accumulator interface

- 18 -

Learn More

• Read the online documentation: http://hadoop.apache.org/pig/
• Online tutorials
  – From Yahoo!: http://developer.yahoo.com/hadoop/tutorial/
  – From Cloudera: http://www.cloudera.com/hadoop-training
• A couple of Hadoop books include chapters on Pig; search at your favorite bookstore
• Join the mailing lists:
  – [email protected] for user questions
  – [email protected] for developer issues
• Contribute your work back; over 40 people have contributed so far

- 19 -

Questions