Pig vs. MapReduce By Donald Miner NYC Pig User Group August 21, 2013

Post on 15-Sep-2014

DESCRIPTION

This was a talk given to the Pig meetup group in NYC on August 22nd. We talked about reasons why you would use Pig over Hadoop and vice versa, plus just some random thoughts and gripes. Audio/video recording here: http://vimeo.com/73211764

TRANSCRIPT

Page 1: Pig vs mapreduce

Pig vs. MapReduce

By Donald Miner

NYC Pig User Group

August 21, 2013

Page 2: Pig vs mapreduce

About Don

@donaldpminer

[email protected]

Page 3: Pig vs mapreduce

I’ll be talking about

What is Java MapReduce good for?

Why is Pig better in some ways?

When should I use which?

Page 4: Pig vs mapreduce

When do I use Pig??

Can I use Pig to do this?

YES NO

Let’s get to the point

Page 5: Pig vs mapreduce

When do I use Pig??

Can I use Pig to do this?

YES NO

USE PIG!

Page 6: Pig vs mapreduce

When do I use Pig??

Can I use Pig to do this?

YES NO

TRY TO USE PIG ANYWAYS!

Page 7: Pig vs mapreduce

When do I use Pig??

Can I use Pig to do this?

YES NO

TRY TO USE PIG ANYWAYS!

Did that work?

YES NO

Page 8: Pig vs mapreduce

When do I use Pig??

Can I use Pig to do this?

YES NO

TRY TO USE PIG ANYWAYS!

Did that work?

YES NO

OK… use Java MapReduce

Page 9: Pig vs mapreduce

Why?

• If you can do it with Pig, save yourself the pain
• Almost always, developer time is worth more than machine time
• Trying something out in Pig is not risky (time-wise) – you might learn something about your problem
  – Ok, so it turned out to look a bit like a hack, but who cares?
  – Ok, so it ended up being slow, but who cares?

Page 10: Pig vs mapreduce

Use the right tool for the job

Pig

Java MapReduce

HTML

Get the job done faster and better

Big Data Problem TM

Page 11: Pig vs mapreduce

Which is faster, Pig or Java MapReduce?

Hypothetically, any Pig job could be rewritten using MapReduce… so Java MR can only be faster.

The TRUE battle is the Pig optimizer vs. the developer

VS

Are you better than the Pig optimizer at figuring out how to string multiple jobs together (and other things)?

Page 12: Pig vs mapreduce

Things that are hard to express in Pig

• When something is hard to express succinctly in Pig, you are going to end up with a slow job, i.e., one built up out of several primitives
• Some examples:
  – Tricky groupings or joins
  – Combining lots of data sets
  – Tricky usage of the distributed cache (replicated join)
  – Tricky cross products
  – Doing crazy stuff in nested FOREACH
• In these cases, Pig is going to spawn off a bunch of MapReduce jobs when the work could have been done with fewer

This is a change in “speed” that doesn’t just have to do with cost-of-abstraction

Page 13: Pig vs mapreduce

The Fancy MAPREDUCE keyword!

Pig has a relational operator called MAPREDUCE that allows you to plug in a Java MapReduce job!

Use this to only replace the tricky things … don’t throw out all the stuff Pig is good at

B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;

Have the best of both worlds!

To the rescue…

Page 14: Pig vs mapreduce

Somewhat related: Is developer time worthless?

Does speed really matter?

Time spent writing Pig job
Runtime of Pig job × number of times the job is run
Time spent maintaining Pig job

Time spent writing MR job
Runtime of MR job × number of times the job is run
Time spent maintaining MR job

When does the scale tip in one direction or the other? Will the job run many times? Or once? Are your Java programmers sloppy? Is the Java MR significantly faster in this case? Is 14 minutes really that different from 20 minutes?
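The scale on this slide can be put into rough numbers. The sketch below is a back-of-the-envelope model, not something from the talk: the function names and the developer-time figures are invented assumptions, and only the 14-versus-20-minute runtimes come from the question above. It adds up writing time, runtime times number of runs, and maintenance time for each side, then finds how many runs it takes before the faster MapReduce job pays back its extra development cost.

```python
def total_minutes(write_min, runtime_min, runs, maintain_min):
    """Total cost of a job: writing + (runtime x times run) + maintenance."""
    return write_min + runtime_min * runs + maintain_min


def break_even_runs(pig, mr):
    """Smallest number of runs at which the MR job is strictly cheaper overall.

    pig and mr are (write_min, runtime_min, maintain_min) tuples.
    Returns None if the MR job is not faster per run, so it never catches up.
    """
    pig_fixed = pig[0] + pig[2]
    mr_fixed = mr[0] + mr[2]
    saved_per_run = pig[1] - mr[1]
    if saved_per_run <= 0:
        return None
    extra_fixed = mr_fixed - pig_fixed
    if extra_fixed < 0:
        return 0  # MR is already cheaper before the first run
    # MR wins once runs * saved_per_run exceeds the extra fixed cost.
    return int(extra_fixed // saved_per_run) + 1


# Hypothetical inputs (assumptions, in minutes):
# Pig: 2 h to write, 20 min per run, 1 h of maintenance.
# MR: 16 h to write, 14 min per run, 4 h of maintenance.
PIG = (120, 20, 60)
MR = (960, 14, 240)

print(break_even_runs(PIG, MR))  # 171
```

With these made-up inputs the Java job only comes out ahead from the 171st run onward, so a job that runs once or a handful of times never gets there. Change the assumptions and the scale tips differently, which is the point of the questions above.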

Page 15: Pig vs mapreduce

Why is development so much faster in Pig?

• Fewer Java-level bugs to work out … but bugs might be harder to figure out
• Fewer lines of code simply means less typing
• Compilation and deployment can significantly slow down incremental improvements
• Easier to read: The purpose of the analytic is more straightforward (the context is self-evident)

Page 16: Pig vs mapreduce

Avoiding Java!

• Not everyone is a Java expert … especially all those SQL guys you are repurposing
• The higher level of abstraction makes Pig easier to learn and read
  – I’ve had both software engineers and SQL developers become productive in Pig in <4 days

Oh, you want to learn Hadoop? Read this first!

Page 17: Pig vs mapreduce

But can I really? Not really.

Pig is good at moving data sets between states … but not so good at manipulating the data itself

examples: advanced string operations, math, complex aggregates, dates, NLP, model building

You need user-defined functions (UDFs)

I’ve seen too many people try to avoid UDFs

UDFs are powerful:
• Manipulate bags after a GROUP BY
• Plug into external libraries like NLTK or OpenNLP
• Loaders for complex custom data types
• Exploiting the order of data
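To make "manipulate bags after a GROUP BY" and "exploiting the order of data" concrete, here is a minimal sketch of a Python UDF. It is an illustration rather than code from the talk: the function name `max_gap`, the file name in the comment, and the assumption that the first field of each tuple is a numeric timestamp are all made up. As far as I know, Pig's Jython runtime supplies the `outputSchema` decorator itself, so the fallback definition is only there to let the file run as plain Python too.

```python
# Inside Pig's Jython runtime the outputSchema decorator is provided for you.
# This stand-in only kicks in when the file is run as ordinary Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.output_schema = schema
            return func
        return decorate


@outputSchema("max_gap:long")
def max_gap(events):
    """Largest gap between consecutive timestamps in a bag.

    `events` is the bag a GROUP BY hands to the UDF: a list of tuples.
    Assumption: the first field of every tuple is a numeric timestamp.
    """
    if events is None:
        return None
    stamps = sorted(e[0] for e in events)
    if len(stamps) < 2:
        return 0
    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))


# From a Pig script this would be used roughly like so (file name is hypothetical):
#   REGISTER 'gaps.py' USING jython AS gaps;
#   grouped = GROUP events BY user;
#   result = FOREACH grouped GENERATE group, gaps.max_gap(events);
print(max_gap([(100,), (160,), (130,), (400,)]))  # 240
```

The function is a few lines long and has no Pig-specific logic in it, which is why a UDF like this makes a good bite-size task.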

Page 18: Pig vs mapreduce

Ok, so I still want to avoid Java

Do you work by yourself???

Give someone else the task of writing you a UDF! (they are bite-size little projects)

Current UDF support in 0.11.1: Java, Python, JavaScript, Ruby, Groovy

These can help you avoid Java if you simply don’t like it (me)

Page 19: Pig vs mapreduce

Why did you write a book on MR Design Patterns if you think you should do stuff in Pig??

Good question!

• I’ve seen plenty of devs do DUMB stuff in Pig just because there is a keyword for it, e.g., silly joins, ordering, using the PARALLEL keyword wrong
• Knowing how MapReduce works will result in you writing better Pig
• In particular: how do Pig optimizations and relational keywords translate into MapReduce design patterns?

Page 20: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

RING RING RING RING

A STORY ABOUT MAINTAINABILITY

Page 21: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

IT guy here. Your MapReduce job is blowing up the cluster, how do I fix this thing?

Page 22: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

Ah, that’s pretty easy to fix. Just comment out that first line in the mapper function.

Page 23: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

Ok, how do I do that?

Page 24: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

Oh, that’s easy

Page 31: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

Oh, that’s easy

First, check the code out of git

Then, download, install and configure Eclipse. Don’t forget to set your CLASSPATH!

Ok, now comment out line # 851 in /home/itguy/java/src/com/hadooprus/hadoop/hadoop/mapreducejobs/jobs/codes/analytic/mymapreducejob/mapper.java

. . . . . . . . . . . . . . . .

Now, compile the .jar

And ship the .jar to the cluster, replacing the old one

Ok, now run the hadoop jar command. Don’t forget the CLASSPATH!

Did that work?

Page 32: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

No

Page 33: Pig vs mapreduce

SCENARIO #1: JUST CHANGE THAT ONE LITTLE LINE

. . .

Ah, let’s try something else and do that again!

Page 34: Pig vs mapreduce

SCENARIO #2: JUST CHANGE THAT ONE LITTLE LINE

(this time with Pig)

RING RING RING RING

Page 35: Pig vs mapreduce

SCENARIO #2: JUST CHANGE THAT ONE LITTLE LINE

(this time with Pig)

IT guy here. Your MapReduce job is blowing up the cluster, how do I fix this thing?

Page 36: Pig vs mapreduce

SCENARIO #2: JUST CHANGE THAT ONE LITTLE LINE

(this time with Pig)

Ah, that’s pretty easy to fix. Just comment out that line that says “FILTER blah blah” and save the file.

Page 37: Pig vs mapreduce

SCENARIO #2: JUST CHANGE THAT ONE LITTLE LINE

(this time with Pig)

Ok, thanks!

Page 38: Pig vs mapreduce

Pig: Deployment & Maintainability

• Don’t have to worry about version mismatch (for the most part)
• You can have multiple Pig client libraries installed at once
• Takes compilation out of the build and deployment process
• Can make changes to scripts in place if you have to
• Iteratively tweaking scripts during development and debugging
• Fewer chances for the developer to write Java-level bugs

Page 39: Pig vs mapreduce

Some Caveats

• Hadoop Streaming provides some of these same benefits

• Big problems in both are still going to take time

• If you are using Java UDFs, you still need to compile them (which is why I use Python)
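For the Hadoop Streaming caveat, a rough sketch of what a no-compile job looks like in Python: Streaming runs any executable that reads records on stdin and writes tab-separated key/value lines to stdout, so the mapper and reducer are just scripts you can edit in place. The word-count task, the function names and the `map`/`reduce` command-line convention below are illustrative assumptions, not anything shown in the talk.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Hypothetical convention: pass "map" or "reduce" to pick the role,
        # then read records from stdin the way a streaming task would.
        step = map_lines if sys.argv[1] == "map" else reduce_lines
        for record in step(sys.stdin):
            print(record)
    else:
        # Local dry run: sorted() stands in for Hadoop's shuffle and sort.
        sample = ["Pig is easy", "pig is fun"]
        for record in reduce_lines(sorted(map_lines(sample))):
            print(record)
```

Fixing a line here means editing the script and rerunning, much like the Pig scenario, although you are back to writing the map and reduce logic by hand.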

Page 40: Pig vs mapreduce

Unstructured Data

• Delimited data is pretty easy
• Pig has issues dealing with these out of the box:
  – Media: images, videos, audio
  – Time series: utilizing order of data, lists
  – Ambiguously delimited text
  – Log data: rows with different context/meaning/format

You can write custom loaders and tons of UDFs… but what’s the point?

OH NO!

Page 41: Pig vs mapreduce

What about semi-structured data?

• Some forms are more natural than others
  – Well-defined JSON/XML schemas are usually OK
• Pig has trouble dealing with:
  – Complex operations on unbounded lists of objects (e.g., bags)
  – Very flexible schemas (think BigTable/HBase)
  – Poorly designed JSON/XML

Sometimes, it’s just more pain than it’s worth to try to do in Pig

Page 42: Pig vs mapreduce

Pig vs. Hive vs. MapReduce

• Same arguments apply for Hive vs. Java MR
• Using Pig or Hive doesn’t make that big of a difference … but pick one because UDFs/Storage functions aren’t easily interchangeable
• I think you’ll like Pig better than Hive (just like everyone likes emacs more than vi)

Page 43: Pig vs mapreduce

WRAP UP: AN ANALOGY (#1)

Pig is a scripting language, Hadoop’s MapReduce is a compiled language.

Pig : MapReduce :: Python : C

Page 44: Pig vs mapreduce

WRAP UP: AN ANALOGY (#2)

Pig is a higher level of abstraction, Hadoop’s MapReduce is a lower level of abstraction.

Pig : MapReduce :: SQL : C

Page 45: Pig vs mapreduce

A lot of the same arguments apply!

• Compilation
  – Don’t have to compile Pig
• Efficiency of code
  – Pig will be a bit less efficient (but…)
• Lines of code and verbosity
  – Pig will have fewer lines of code
• Optimization
  – Pig has more opportunities to do automatic optimization of queries
• Code portability
  – The same Pig script will work across versions (for the most part)
• Code readability
  – It should be easier to understand a Pig script
• Underlying bugs
  – Underlying bugs in Pig can cause frustrating problems (thanks be to God for open source)
• Amount of control and space of possibilities
  – There are fewer things you CAN do in Pig