
Introduction to Apache Pig

Pelle Jakovits

23 September, 2013, Tartu

Outline

• MapReduce recollection

• Why Pig?

– Advantages

• Running Pig

• Pig Latin

• Examples

• Architecture

• Disadvantages

Pelle Jakovits 2/18

You already know MapReduce

• MapReduce = Map, GroupBy, Sort, Reduce

• Designed for huge-scale data processing

• Provides

– Distributed file system

– High scalability

– Automatic parallelisation

– Automatic fault recovery: data is replicated, and failed tasks are re-executed on other nodes
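To recall how the Map, GroupBy, Sort and Reduce phases fit together, here is a minimal single-machine sketch in plain Python. It is only a model of the idea, not Hadoop code; the word-count task and the names `map_phase`, `reduce_phase` and `run_job` are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: combine all values that share the same key.
    return (word, sum(counts))

def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    # Sort + GroupBy: bring equal keys next to each other.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    return [reduce_phase(word, [c for _, c in group]) for word, group in grouped]

result = run_job(["pig hadoop pig", "hadoop pig"])
print(result)  # [('hadoop', 2), ('pig', 3)]
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the sort and group step happens while data moves between them.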


But is MapReduce enough?

• Hadoop MapReduce is one of the most used frameworks for large scale data processing

• However:

– Writing low-level MapReduce code is slow

– A lot of expertise is needed to optimize MapReduce code

– Prototyping is slow

– A lot of custom code is required, even for the simplest tasks

– More complex MapReduce job chains are hard to manage


Apache Pig

• A data flow framework on top of Hadoop MapReduce

– Retains all its advantages

– And some of its disadvantages

• Models a scripting language

– Fast prototyping

• Uses the Pig Latin language

– Similar to declarative SQL

– Easier to get started with

• Pig Latin statements are automatically translated into MapReduce jobs


Advantages of Pig

• Easy to program

– Roughly 5% of the code and 5% of the development time that plain MapReduce requires

• Self-optimizing

– Pig Latin statement optimizations

– Generated MapReduce code optimizations

• Can manage more complex data flows

– Easy to use and join multiple separate inputs, transformations and outputs

• Extensible

– Can be extended with User Defined Functions (UDF) to provide more functionality


Running Pig

• Local mode

– Everything installed locally on one machine

• Distributed mode

– Everything runs in a MapReduce cluster

• Interactive mode

– Grunt shell

• Batch mode

– Pig scripts


Pig Latin

• Write complex MapReduce transformations using a much simpler scripting language

• Not quite SQL, but similar

• Lazy evaluation

• Compiling is hidden from the user
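Lazy evaluation means that Pig only records the statements as a plan and runs nothing until output is requested with DUMP or STORE. The sketch below models that behaviour in plain Python. The class `LazyRelation` and the helper `load` are hypothetical names invented here, and the code makes no claim about how Pig builds its plans internally.

```python
class LazyRelation:
    """Records transformation steps; nothing runs until dump() is called."""

    def __init__(self, source, steps=()):
        self.source = source          # callable that reads the input
        self.steps = tuple(steps)     # recorded plan

    def foreach(self, fn):
        return LazyRelation(self.source, self.steps + (("foreach", fn),))

    def filter(self, predicate):
        return LazyRelation(self.source, self.steps + (("filter", predicate),))

    def dump(self):
        rows = list(self.source())    # the input is read only here
        for kind, fn in self.steps:
            if kind == "foreach":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

reads = []

def load():
    # Hypothetical loader that logs every time the input is actually read.
    reads.append("read")
    return [(1, 2), (3, 4), (5, 6)]

plan = LazyRelation(load).filter(lambda t: t[0] > 1).foreach(lambda t: (t[0] + t[1],))
reads_before_dump = len(reads)  # still 0: only the plan exists so far
out = plan.dump()
print(reads_before_dump, out)   # 0 [(7,), (11,)]
```

Because the whole plan is known before anything runs, a system like Pig has the chance to optimize it as a whole.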


Pig Latin Example

I = load '/mydata/images' using ImageParser() as (id, image);

F = foreach I generate id, detectFaces(image);

store F into '/mydata/faces';

• Input and output are HDFS folders or files

– /mydata/images

– /mydata/faces

• I and F are relations

• The right-hand side contains Pig expressions


Relations, Bags, Tuples, Fields

• Relation

– Can have nested relations

– Similar to a table in a relational database

– Consists of a Bag

• Bag

– An unordered collection of tuples

• Tuple

– An ordered set of fields

– Similar to a row in a relational database

– Can contain any number of fields; the count does not have to match other tuples

• Field

– A piece of data


Fields

• Consists of either:

– Data atoms - int, long, float, double, chararray, boolean, datetime, etc.

– Complex data - Bag, Map, Tuple

• Assigning types to fields

– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

• Referencing fields

– By order - $0, $1, $2

– By name - assigned by user schemas, e.g. A = LOAD 'in.txt' AS (age, name, occupation);


Complex data types

• Looking into complex, nested data

– client.$0

– author.age
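The two ways of referencing fields, and the dotted access into nested data, can be mimicked with Python named tuples. This is only an analogy under assumed data: the `Book` and `Author` types and the sample values are invented for illustration and do not come from the slides.

```python
from collections import namedtuple

# Hypothetical schemas, invented for this sketch.
Author = namedtuple("Author", ["name", "age"])
Book = namedtuple("Book", ["title", "author", "client"])

# One tuple whose second and third fields are themselves tuples.
row = Book("Pig basics", Author("Ann", 41), ("acme", 7))

first_field = row[0]           # by position, like $0
title = row.title              # by name, taken from the schema
author_age = row.author.age    # nested field by name, like author.age
client_first = row.client[0]   # nested field by position, like client.$0

print(first_field, author_age, client_first)  # Pig basics 41 acme
```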


Loading and storing data

• LOAD

– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

– User defines data loader and delimiters

• STORE

– STORE A INTO 'output_1.txt' USING PigStorage(',');

– STORE B INTO 'output_2.txt' USING PigStorage('*');

• Other load and store functions

– BinStorage

– PigDump

– TextLoader

– Or create a custom one
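What a delimiter-based loader and storer do can be sketched in a few lines of Python: split each line on the delimiter, cast the fields according to the schema, and later join the fields back with another delimiter. The functions `load` and `store` below are hypothetical stand-ins written for this sketch, not PigStorage itself, and they work on strings instead of HDFS files.

```python
def load(text, delimiter="\t", schema=(int, int, int)):
    # Split every line on the delimiter and cast each field per the schema.
    rows = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        rows.append(tuple(cast(field) for cast, field in zip(schema, fields)))
    return rows

def store(rows, delimiter=","):
    # Write one line per tuple, joining the fields with the delimiter.
    return "\n".join(delimiter.join(str(field) for field in row) for row in rows)

A = load("1\t2\t3\n4\t2\t1")
out = store(A, ",")
print(A)    # [(1, 2, 3), (4, 2, 1)]
print(out)  # 1,2,3 and 4,2,1 on separate lines
```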


FOREACH … GENERATE

• General data transformation statement

• Used to:

– Change the structure of data

– Apply functions to data

– Flatten complex data to remove nesting

• X = FOREACH C GENERATE FLATTEN (A.(a1, a2)), FLATTEN(B.$1);
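Flattening removes one level of nesting: every tuple inside a bag becomes its own output row, combined with the other fields of the enclosing tuple. The plain-Python sketch below shows this for a single bag field. The function name `flatten_bag` and the sample data are assumptions made for illustration.

```python
def flatten_bag(rows, bag_index):
    # For each row, emit one output row per tuple inside the bag field.
    out = []
    for row in rows:
        for inner in row[bag_index]:
            out.append(row[:bag_index] + tuple(inner) + row[bag_index + 1:])
    return out

# Hypothetical grouped data: (age, bag of (name, gpa) tuples).
grouped = [(18, [("John", 4.0), ("Joe", 3.8)]), (19, [("Mary", 3.8)])]
flat = flatten_bag(grouped, 1)
print(flat)  # [(18, 'John', 4.0), (18, 'Joe', 3.8), (19, 'Mary', 3.8)]
```

A row whose bag is empty produces no output rows in this sketch.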


GROUP … BY

• A = load 'student' AS (name:chararray, age:int, gpa:float);

• DUMP A;

– (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• DUMP B;

– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})

– (19, {(Mary, 19, 3.8F)})

– (20, {(Bill, 20, 3.9F)})
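The shape of the grouped output, one tuple per key with a bag of the original tuples, can be reproduced in plain Python. This is a sketch of the result only, not of how Pig executes the grouping; `group_by` is a name chosen here, and the final sort is added just to make the printed order predictable.

```python
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by(rows, key_index):
    # Collect all tuples that share the same key into one bag (a list here).
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    # Sorting by key is only for readable output.
    return sorted(groups.items())

B = group_by(A, 1)
for key, bag in B:
    print(key, bag)
```

Note that each bag still holds the complete original tuples, including the age field that was used as the key.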


JOIN

• A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

• B = LOAD 'data2' AS (b1:int,b2:int);

• X = JOIN A BY a1, B BY b1;


DUMP A;
(1,2,3)
(4,2,1)

DUMP B;
(1,3)
(2,7)
(4,6)

DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
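The join result above can be reproduced with a small hash join in plain Python. This sketch shows the semantics of an inner join on one key only; `inner_join` is a hypothetical helper written for this example and says nothing about how Pig schedules the join as MapReduce jobs.

```python
def inner_join(left, right, left_key, right_key):
    # Index the right relation by its join key.
    index = {}
    for b_row in right:
        index.setdefault(b_row[right_key], []).append(b_row)
    # Concatenate every left tuple with each matching right tuple.
    return [a_row + b_row
            for a_row in left
            for b_row in index.get(a_row[left_key], [])]

A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (2, 7), (4, 6)]
X = inner_join(A, B, 0, 0)
print(X)  # [(1, 2, 3, 1, 3), (4, 2, 1, 4, 6)]
```

The tuple (2,7) from B has no partner in A, so it does not appear in X.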

UNION

• A = LOAD 'data' AS (a1:int, a2:int, a3:int);

• B = LOAD 'data' AS (b1:int, b2:int);

• X = UNION A, B;


DUMP A;
(1,2,3)
(4,2,1)

DUMP B;
(2,4)
(8,9)

DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
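In plain Python, a union of two relations is simply the concatenation of their tuples. The sketch below uses a hypothetical `union` helper to show two properties visible in the example: tuples of different lengths can be mixed, and duplicates are kept.

```python
def union(*relations):
    # Concatenate the tuples of all relations; nothing is deduplicated.
    return [row for relation in relations for row in relation]

A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9)]
X = union(A, B)
print(X)            # [(1, 2, 3), (4, 2, 1), (2, 4), (8, 9)]
print(union(A, A))  # duplicates stay: four tuples
```

The list order here is just the concatenation order of this sketch; a bag has no defined order.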

Functions

• SAMPLE

– A = LOAD 'data' AS (f1:int,f2:int,f3:int);

– X = SAMPLE A 0.01;

– X will contain approximately 1% of the tuples in A (sampling is random, so the exact count varies)

• FILTER

– A = LOAD 'data' AS (a1:int, a2:int, a3:int);

– X = FILTER A BY a3 == 3;
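Both operators keep a subset of the tuples: SAMPLE keeps each tuple with a given probability, FILTER keeps the tuples that satisfy a condition. The plain-Python sketch below models this with hypothetical helpers `sample` and `filter_by` and generated test data; the `seed` parameter is an addition of this sketch to make runs repeatable.

```python
import random

def sample(rows, fraction, seed=None):
    # Keep each tuple independently with probability `fraction`.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def filter_by(rows, predicate):
    # Keep only the tuples for which the condition holds.
    return [row for row in rows if predicate(row)]

# Hypothetical data: 10,000 tuples (f1, f2, f3).
A = [(i, i % 5, i % 7) for i in range(10000)]

X = sample(A, 0.01, seed=42)
Y = filter_by(A, lambda t: t[2] == 3)

print(len(X))  # close to 100, but not exactly predictable
print(len(Y))  # 1429
```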


Functions

• DISTINCT – removes duplicate tuples

– X = DISTINCT A;

• LIMIT – limits the number of output tuples

– X = LIMIT B 3;

• SPLIT – partitions a relation into two or more relations

– SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
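The three operators can be modelled in plain Python as follows. The helper names and the sample tuples are assumptions made for this sketch. It uses the same three conditions as the SPLIT statement above and shows that one tuple can land in several outputs.

```python
def distinct(rows):
    # Drop duplicate tuples (this sketch keeps the first occurrence).
    return list(dict.fromkeys(rows))

def limit(rows, n):
    # Keep at most n tuples.
    return rows[:n]

def split(rows, **conditions):
    # Route tuples into named outputs; a tuple may match several or none.
    return {name: [row for row in rows if cond(row)]
            for name, cond in conditions.items()}

# Hypothetical tuples (f1, f2, f3).
A = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 2, 3)]

parts = split(
    A,
    X=lambda t: t[0] < 7,
    Y=lambda t: t[1] == 5,
    Z=lambda t: t[2] < 6 or t[2] > 6,
)
print(distinct(A))  # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(limit(A, 3))  # the first three tuples of the list
print(parts["Y"])   # [(4, 5, 6)]
```

Here (4,5,6) ends up in both X and Y, while (7,8,9) ends up only in Z.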


Pig Example 1

• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);

• DUMP A;

– (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• C = FOREACH B GENERATE group, AVG(A.gpa);
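The example computes the average GPA per age. A plain-Python equivalent of the group and average steps, using the four student tuples shown above, might look like this sketch (the variable names are chosen here for illustration):

```python
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

# GROUP A BY age: collect the gpa values of each age group.
groups = {}
for name, age, gpa in A:
    groups.setdefault(age, []).append(gpa)

# FOREACH group: emit (age, average gpa); sorted only for readable output.
C = sorted((age, sum(gpas) / len(gpas)) for age, gpas in groups.items())

for age, avg_gpa in C:
    print(age, round(avg_gpa, 2))
```

The two 18-year-olds are averaged together, so the result has three tuples, one per distinct age.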


Pig Example 2

• batting = load 'Batting.csv' using PigStorage(',');

• runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;

• grp_data = GROUP runs by (year);

• max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

• join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);

• join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;

• dump join_data;
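The script finds, for every year, the player or players with the most runs: it computes the maximum per year and then joins that back to the original rows. The same logic in plain Python is sketched below. The sample rows are invented for illustration and are not taken from Batting.csv.

```python
# Hypothetical rows (playerID, year, runs), made up for this sketch.
runs = [("p1", 2000, 50), ("p2", 2000, 80), ("p3", 2001, 70), ("p4", 2001, 70)]

# GROUP by year + MAX(runs): the highest run count per year.
max_runs = {}
for _, year, r in runs:
    max_runs[year] = max(r, max_runs.get(year, r))

# JOIN on (year, runs): keep the rows that reach the yearly maximum.
join_data = [(year, player, r) for player, year, r in runs if max_runs[year] == r]

print(join_data)  # [(2000, 'p2', 80), (2001, 'p3', 70), (2001, 'p4', 70)]
```

When two players tie for the maximum in a year, both are kept, which is also what joining on the (year, runs) pair produces.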


User Defined Functions (UDF)

• DEFINE alias function

– Assigns an alias to a UDF

• DEFINE alias command input output

– For streaming


Pig workflow


Pig disadvantages

• Slow start-up and clean-up of MapReduce jobs

– It takes time for Hadoop to schedule MR jobs

• Not suitable for interactive OLAP Analytics

– When results are expected in < 1 sec

• Complex applications may require many UDFs

– Pig loses its simplicity advantage over MapReduce


Other noteworthy Hadoop projects

• HBase

– Open-source distributed database on top of HDFS

• Hive™

– A data warehouse infrastructure that provides data summarization and ad hoc querying

– Developed by Facebook

• Mahout™

– A scalable machine learning and data mining library

• ZooKeeper™

– A high-performance coordination service for distributed applications

– Centralised configuration and synchronization


That's all

• This week's practice session

– Processing data with Pig

– Similar exercise to last week's, but this time using Pig

• Next lecture: Hive

– What is Hive

– HiveQL language

– Hive vs Pig

