Introduction to Pig


Upload: karthika-karthi

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Introduction to pig

I EAT BIG!!! I HANDLE BIG!!!

Presented by J. Ramsingh, M.C.A., M.Phil.,
Ph.D Research Scholar
Department of Computer Applications
Bharathiar University

Page 2: Introduction to pig

Contents
• Why PIG?
• Overview of PIG
• PIG Installation
• PIG Latin Basics
• Developing PIG Scripts

J.Ram Singh , Ph.d Research Scholar , Dr.V.Bhuvaneswari, Asst.Professor, Dept. of Comp. Appll., Bharathiar University,- WDABT 2016


Page 3: Introduction to pig

WHY PIG?
• Hand-written MapReduce is not preferred for data analytics
• About 200 LOC of MapReduce = 10 LOC of Pig Latin
• MapReduce is not rich in built-in functions

Page 4: Introduction to pig


Page 5: Introduction to pig

Overview of PIG

Is a

Can Handle Large Data Sets

I LOVE TO EAT MORE N MORE

Page 6: Introduction to pig

PIG Vs MAPREDUCE


Page 7: Introduction to pig

Running Environment

LOCAL MODE

HADOOP MODE


Page 8: Introduction to pig

Pig Execution in Hadoop Cluster


Page 9: Introduction to pig

Pig Execution in Hadoop Cluster


Page 10: Introduction to pig

Components of PIG

• Pig Latin: command-based language
• Grunt: execution environment
• Pig Server: compiler strives to optimize execution

Page 11: Introduction to pig

Compilation
• Pig system does two tasks:

Logical Plan
– Builds a Logical Plan from a Pig Latin script
– Supports execution platform independence
– No processing of data is performed at this stage

Physical Plan
– Compiles the Logical Plan to a Physical Plan and executes it
– Converts the Logical Plan into a series of Map-Reduce jobs to be executed by Hadoop Map-Reduce

Page 12: Introduction to pig

Building a logical plan

A = LOAD 'dataset 1.dat' AS (name, dob, designation);
B = GROUP A BY designation;
C = FOREACH B GENERATE group AS designation, COUNT(A) AS total;
D = FILTER C BY designation == 'XXX' OR designation == 'yyy';
STORE D INTO 'result.dat';

LOAD DATA

Page 13: Introduction to pig

Building a logical plan (cont.)

LOAD DATA -> GROUP DATA

Page 14: Introduction to pig

Building a logical plan (cont.)

LOAD DATA -> GROUP DATA -> FOREACH

Page 15: Introduction to pig

Building a logical plan (cont.)

LOAD DATA -> GROUP DATA -> FOREACH -> FILTER

Page 16: Introduction to pig

Building a logical plan (cont.)

LOAD DATA -> FILTER -> GROUP -> FOREACH (the FILTER is moved ahead of the GROUP)

Page 17: Introduction to pig

Building a logical plan (cont.)

LOAD DATA -> FILTER -> GROUP -> FOREACH

Execution only happens when output is specified by STORE or DUMP
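The deferred behaviour described on these slides can be modelled in a few lines of plain Python. This is only a toy sketch, not Pig's real API: the Relation class, its method names and the sample rows are all made up for illustration. It shows that chaining operators only records plan steps, and that rows are touched only when dump() is called.

```python
class Relation:
    """Toy stand-in for a Pig alias: records plan steps, runs them on dump()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self.steps = list(steps)  # list of (operator name, function) pairs

    def _then(self, name, fn):
        # Return a new relation with one more step; nothing is executed here.
        return Relation(self._rows, self.steps + [(name, fn)])

    def filter(self, predicate):
        return self._then("FILTER", lambda rows: [r for r in rows if predicate(r)])

    def group_count(self, key_index):
        def run(rows):
            counts = {}
            for r in rows:
                counts[r[key_index]] = counts.get(r[key_index], 0) + 1
            return sorted(counts.items())
        return self._then("GROUP+FOREACH", run)

    def plan(self):
        return ["LOAD"] + [name for name, _ in self.steps]

    def dump(self):
        rows = self._rows
        for _, fn in self.steps:
            rows = fn(rows)
        return rows


executed = []  # records every row the filter actually looks at


def is_manager(row):
    executed.append(row)
    return row[2] == "manager"


# Hypothetical (name, dob, designation) rows.
a = Relation([("asha", "1990", "manager"),
              ("ravi", "1985", "clerk"),
              ("mala", "1992", "manager")])
d = a.filter(is_manager).group_count(2)

touched_before_dump = len(executed)  # the plan exists, but no row was processed
result = d.dump()                    # only now do the steps run
```

Calling d.plan() lists the recorded operators without executing any of them, which is loosely what the logical plan represents.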

Page 18: Introduction to pig

Building a Physical plan
Step 1: Create a map-reduce job for each COGROUP

Logical plan: LOAD DATA -> FILTER -> GROUP -> FOREACH
MapReduce job: Load(user.dat) -> Filter -> Group -> Foreach

Page 19: Introduction to pig

Building a Physical plan
Step 1: Create a map-reduce job for each COGROUP
Step 2: Push other commands into the map and reduce functions where possible
Step 3: Certain commands may require their own map-reduce job (e.g. ORDER needs separate map-reduce jobs)

Logical plan: LOAD DATA -> FILTER -> GROUP -> FOREACH
Map and Reduce stages: Load(user.dat) -> Filter -> Group -> Foreach
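As a rough illustration of Steps 1 and 2, the sketch below packs a load-filter-group-count pipeline into one simulated map-reduce job in plain Python. It is an assumption-laden model, not what Pig actually generates: the function names, the choice to filter on designation, and the sample rows are all hypothetical. The split shown (filter on the map side, counting on the reduce side) is one plausible placement.

```python
from collections import defaultdict


def map_phase(rows, wanted):
    """Map side: read the loaded rows, apply the FILTER, emit (designation, 1)."""
    for name, dob, designation in rows:
        if designation in wanted:
            yield designation, 1


def shuffle(pairs):
    """Between map and reduce: collect all values per key (the GROUP)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce side: the FOREACH ... COUNT over each group."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical (name, dob, designation) rows.
rows = [
    ("asha", "1990", "manager"),
    ("ravi", "1985", "clerk"),
    ("mala", "1992", "manager"),
    ("john", "1979", "director"),
]
result = reduce_phase(shuffle(map_phase(rows, {"manager", "clerk"})))
```

An operator such as ORDER does not fit into this single pass, which is why Step 3 notes that it needs map-reduce jobs of its own.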

Page 20: Introduction to pig

I Need all These

• Linux above 10
• Java above 6
• Hadoop
• Pig

Page 21: Introduction to pig

Execution of Pig


Page 22: Introduction to pig

PIG Latin Basics
• Pig Latin is a data flow language rather than procedural or declarative, in which the program consists of a collection of statements.
• A statement can be thought of as an operation, or a command.

Building blocks (Complex Data Types)
• Fields - A field is a piece of data [e.g. student_id = 01]
• Tuples - A tuple is an ordered set of fields [e.g. (01, Raja, MCA, C++)]
• Bags - A bag is a collection of tuples [e.g. {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]

Page 23: Introduction to pig

PIG Data Types

Simple Type    Description
int            Signed 32-bit integer
long           Signed 64-bit integer
float          32-bit floating point
double         64-bit floating point
chararray      Character array (string) in Unicode UTF-8 format
bytearray      Byte array (blob)
boolean        boolean

Page 24: Introduction to pig

PIG Basic Commands

Statement          Description
Load               Read data from the file system
Store              Write data to the file system
Dump               Generate output
Foreach            Apply expression to each record and generate one or more records
Filter             Apply predicate to each record and remove records where false
Group / Cogroup    Collect records with the same key from one or more inputs
Join               Join two or more inputs based on a key
Order              Sort records based on a key
Distinct           Remove duplicate records
Union              Merge two datasets
Limit              Limit the number of records
Split              Split data into 2 or more sets, based on filter conditions

Page 25: Introduction to pig

Load Command
LOAD 'data' [USING function] [AS schema];

• data – name of the directory or file
  Must be in single quotes
• USING – specifies the load function to use
  By default uses PigStorage(), which parses each line into fields using a delimiter
  Default delimiter is tab ('\t')
• AS – assigns a schema to incoming data
  Assigns names to fields
  Declares types of fields

Page 26: Introduction to pig

LOAD Command Example

• data = load '$dir/age.csv' using PigStorage(',') as (name:chararray, age:chararray);

Callouts: '$dir/age.csv' is the data; (name:chararray, age:chararray) is the schema.
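To make the roles of the delimiter and the schema concrete, here is a small plain-Python model of what a delimiter-based loader does. It is a sketch only: the load function and its parameters are hypothetical and do not reproduce every PigStorage behaviour (for example, handling of missing fields), and the sample lines are invented.

```python
def load(lines, delimiter="\t", schema=None):
    """Split each line on the delimiter; name and cast fields if a schema is given."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if schema is None:
            # No schema: fields stay positional, like $0, $1, ...
            rows.append(tuple(fields))
        else:
            # Schema: a list of (field name, cast function) pairs.
            rows.append({name: cast(value)
                         for (name, cast), value in zip(schema, fields)})
    return rows


lines = ["Ram,22\n", "Raj,52\n"]  # hypothetical contents of age.csv
data = load(lines, delimiter=",", schema=[("name", str), ("age", str)])
positional = load(lines, delimiter=",")
```

With the schema, fields can be referred to by name; without it, only by position.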

Page 27: Introduction to pig

DUMP and STORE statements

• No action is taken until DUMP or STORE commands are encountered
  Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results to the screen
• STORE – saves results (typically to a file)

data = load '$dir/newfine.csv' using PigStorage(',') as (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
:::::::::::::
DUMP data;
Ram,22....

Callout: Nothing is executed before the DUMP; Pig will optimize this entire chunk of script.

Page 28: Introduction to pig

FOREACH
• FOREACH <bag> GENERATE <data>
  Iterates over each element in the bag and produces a result
  result = FOREACH data GENERATE name;

Page 29: Introduction to pig

FOREACH with Functions

FOREACH B GENERATE group, FUNCTION(A);

• Pig comes with many functions including COUNT, FLATTEN, CONCAT, etc...
• Can implement a custom function

Example
groupme = GROUP data BY name;
counts = FOREACH groupme GENERATE group, COUNT(data);
Dump counts
(Ram,3)
(Raj,4)
(Sam,2)
(Mani,1)

Page 30: Introduction to pig

Diagnostic Tools

DESCRIBE
  Displays the structure of the bag
  DESCRIBE <bag_name>;

EXPLAIN
  Displays the execution plan; produces various reports
  • Logical Plan
  • MapReduce Plan
  EXPLAIN <bag_name>;

ILLUSTRATE
  Illustrates how the Pig engine transforms the data
  ILLUSTRATE <bag_name>;

Page 31: Introduction to pig

FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it’s an operator
  Re-arranges output

E.g.
grunt> dump data
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH data GENERATE flatten($0);
grunt> dump flatBag
(this)
(is)
(a)
......

Callouts: the input is a nested structure, a bag of bags of tuples. Each row is flattened, resulting in a bag of simple tokens.
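The effect of flattening a bag can be mimicked in plain Python. This is only a model of the behaviour shown on the slide, with Python lists standing in for bags and one-element tuples standing in for Pig tuples; the function name flatten_bags is made up.

```python
def flatten_bags(rows):
    """For each row holding a single bag, emit one output row per tuple in the bag."""
    for (bag,) in rows:
        for tup in bag:
            yield tup


# Each row has one field ($0), which is a bag of one-field tuples.
data = [
    ([("this",), ("is",), ("a",), ("line",), ("of",), ("text",)],),
    ([("yet",), ("another",), ("line",), ("of",), ("text",)],),
    ([("third",), ("line",), ("of",), ("words",)],),
]
flat_bag = list(flatten_bags(data))
```

Three input rows become fifteen output rows, one per word, so the nesting level introduced by the bag disappears.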

Page 32: Introduction to pig

Group
• Groups the data in one or multiple relations.
• The GROUP operator groups together tuples that have the same group key (key field).
• The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key.

Example
groupme = group data by name;
Dump groupme
(Ram,{(Ram, 30),(Ram, 22),(Ram, 25)})
(Raj,{(Raj, 5),(Raj, 22),(Raj, 52),(Raj, 62)})
(Sam,{(Sam, 15),(Sam, 22)})
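The shape of a grouped relation, a key followed by a bag of the original tuples, can be reproduced in plain Python. The helper group_by below is a hypothetical stand-in, ages are treated as integers for convenience, and the output order follows first appearance in the input, which Pig itself does not promise.

```python
def group_by(rows, key_index):
    """Return (key, bag) pairs, where each bag holds the full original tuples."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    return list(groups.items())


data = [
    ("Ram", 30), ("Ram", 22), ("Ram", 25),
    ("Raj", 5), ("Raj", 22), ("Raj", 52), ("Raj", 62),
    ("Sam", 15), ("Sam", 22),
]
groupme = group_by(data, 0)

# Counting the tuples in each bag corresponds to FOREACH ... GENERATE group, COUNT(...).
counts = [(key, len(bag)) for key, bag in groupme]
```

Note that every tuple inside a bag still carries the name field, just as in the dump output on the slide.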

Page 33: Introduction to pig

Co-Group
• COGROUP is the same as GROUP.
• Groups two datasets together by a common attribute.
• Groups data into nested bags
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved"

Example
Data1 = load '$dir/data.csv' using PigStorage(',') as (name:chararray, age:chararray);
Data2 = load '$dir/data2.csv' using PigStorage(',') as (name:chararray, address:chararray);
X = COGROUP Data1 BY name, Data2 BY name;

Page 34: Introduction to pig

cont …
Dump X
(Ram,{(Ram, 30),(Ram, 22),(Ram, 25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj, 5),(Raj, 22),(Raj, 52),(Raj, 62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
(Sam,{(Sam, 15),(Sam, 22)},{})

Cogroup by default is an OUTER JOIN. You can remove records with empty bags by performing INNER on each bag:

X = COGROUP Data1 BY name INNER, Data2 BY name INNER;
Dump X
(Ram,{(Ram, 30),(Ram, 22),(Ram, 25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj, 5),(Raj, 22),(Raj, 52),(Raj, 62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})

Callout: the first field is the group key; the first bag comes from Data1 (the first dataset) and the second bag comes from Data2 (the second dataset).
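The difference between the default (outer) behaviour and INNER can be seen in a small plain-Python model. The cogroup function below is hypothetical; in particular, its single inner flag stands for writing INNER on both relations, whereas Pig lets INNER be set per relation. The sample rows are a shortened version of the slide data.

```python
def cogroup(left, right, inner=False):
    """Return (key, left_bag, right_bag) for every key seen in either input."""
    keys = []
    for row in left + right:
        if row[0] not in keys:
            keys.append(row[0])
    result = []
    for key in keys:
        left_bag = [r for r in left if r[0] == key]
        right_bag = [r for r in right if r[0] == key]
        # With inner=True, drop keys for which either bag is empty.
        if inner and (not left_bag or not right_bag):
            continue
        result.append((key, left_bag, right_bag))
    return result


data1 = [("Ram", 30), ("Sam", 15), ("Sam", 22)]
data2 = [("Ram", "Cbe"), ("Ram", "Che")]

outer = cogroup(data1, data2)
inner = cogroup(data1, data2, inner=True)
```

In the outer result Sam is kept with an empty second bag; in the inner result Sam is dropped.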

Page 35: Introduction to pig

Filtering
• Select a subset of the tuples in a bag
  FILTER bag BY expression;
• Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)

Example
filterdata = filter data by age > 20;
Dump filterdata
(Ram, 30)
(Ram, 22)
(Ram, 25)
(Raj, 22)
(Raj, 52)
(Raj, 62)
(Sam, 22)

Page 36: Introduction to pig

Ordering
• Sorts a relation based on one or more fields.
  alias = ORDER alias BY { * [ASC|DESC]}

Example
orderddata = order filterdata by age DESC;
Dump orderddata
(Raj, 62)
(Raj, 52)
(Ram, 30)
(Ram, 25)
(Ram, 22)
(Raj, 22)
(Sam, 22)
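Filtering followed by ordering can be checked against a plain-Python equivalent. This is a sketch under assumptions: ages are stored as integers here (the slides load them as chararray), the sample rows come from the GROUP example, and the order of rows with equal age simply follows Python's stable sort rather than anything Pig guarantees.

```python
data = [
    ("Ram", 30), ("Ram", 22), ("Ram", 25),
    ("Raj", 5), ("Raj", 22), ("Raj", 52), ("Raj", 62),
    ("Sam", 15), ("Sam", 22),
]

# FILTER data BY age > 20: keep only the rows where the predicate is true.
filterdata = [row for row in data if row[1] > 20]

# ORDER filterdata BY age DESC: sort on the age field, largest first.
orderddata = sorted(filterdata, key=lambda row: row[1], reverse=True)

ages = [age for _, age in orderddata]
```

Seven of the nine rows survive the filter, and the descending sort puts (Raj, 62) first.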

Page 37: Introduction to pig

Join

• Joins two datasets together by a common attribute.

• By default JOIN operator always performs an inner join.

• Inner joins ignore null keys, so it makes sense to filter them out before the join.

Note : The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records while COGROUP creates a nested set of output records

(Diagram: Data 1, Data 2, Join)

Page 38: Introduction to pig

Outer Join
• Records which will not join with the ‘other’ record-set are included using outer join
• Left Outer
  Records from the first data-set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null.

(Diagram: Data 1, Data 2)

Page 39: Introduction to pig

cont …
• Right Outer
  The opposite of Left Outer Join: records from the second data-set are included no matter what. Fields from the unmatched (first) bag are set to null.
• Full Outer
  Records from both sides are included. For unmatched records the fields from the ‘other’ bag are set to null.

(Diagrams: Data 1, Data 2)
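The four join flavours can be compared side by side with a small plain-Python model. Everything here is illustrative: the join function, its how parameter and the sample rows are hypothetical, None stands in for Pig's null, and both inputs are assumed to be non-empty so that the row widths can be read from the first row.

```python
def join(left, right, how="inner"):
    """Join on the first field. how is 'inner', 'left', 'right' or 'full'."""
    rows = []
    matched_right = set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            # Null keys never match, mirroring the note that joins ignore null keys.
            if l[0] is not None and l[0] == r[0]:
                rows.append(l + r)  # flat output record: left fields then right fields
                matched_right.add(i)
                hit = True
        if not hit and how in ("left", "full"):
            rows.append(l + (None,) * len(right[0]))
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                rows.append((None,) * len(left[0]) + r)
    return rows


data1 = [("Ram", 30), ("Sam", 15)]
data2 = [("Ram", "Cbe"), ("Mani", "Mdu")]

inner = join(data1, data2)
left_outer = join(data1, data2, how="left")
right_outer = join(data1, data2, how="right")
full_outer = join(data1, data2, how="full")
```

Unlike COGROUP, each result is a flat record rather than a key with nested bags.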

Page 40: Introduction to pig


Page 41: Introduction to pig

EVAL FUNCTIONS
• AVG
• CONCAT
• COUNT
• IsEmpty
• MAX
• MIN
• SUM

Page 42: Introduction to pig

UDFs
• User Defined Functions
• Are a way to operate on fields
• But not on groups
• Can be called from the Pig script

Page 43: Introduction to pig

UDF to Rescue [Embedded Mode]

• Easy to use
• Easy to code
• Keeps the power of PIG
• You are free to write in

Page 44: Introduction to pig

PIG a WRAPPER


Page 45: Introduction to pig

Does Whatever You Want
• Image Feature Extraction
• Geo Computations
• Data Cleaning
• Retrieve Web Pages
• NLP ………
• Even more…….

Page 46: Introduction to pig

Why PIG Is Faster?

• Few bugs
• Few LOC
• Easier to read (purpose of analytics is straightforward)

Page 47: Introduction to pig

SQL Vs PIG LATIN

SQL:        SELECT name, SUM(FineAmount) FROM orders GROUP BY name
PIG Latin:  A = GROUP orders BY name;
            B = FOREACH A GENERATE $0 AS name, SUM($1.FineAmount) AS orderTotal;

SQL:        ……. HAVING SUM(FineAmount) > 500 …
PIG Latin:  C = FILTER B BY orderTotal > 500;

SQL:        ….. ORDER BY name ASC;
PIG Latin:  D = ORDER C BY name ASC;

SQL:        SELECT DISTINCT name FROM Dataset 1;
PIG Latin:  names = FOREACH users GENERATE name;
            uniqueNames = DISTINCT names;

SQL:        SELECT name, COUNT(DISTINCT age) FROM Dataset 1 GROUP BY name
PIG Latin:  UserBYNAme = GROUP users BY name;

Page 48: Introduction to pig

Pitfalls
• Version match

Page 49: Introduction to pig

Pitfalls
• Bugs in older versions require registering jars

Page 50: Introduction to pig

Conclusion
• Pig is a data processing environment in Hadoop which targets procedural programmers who do large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.


Page 52: Introduction to pig

QUERIES PLZZ


Page 53: Introduction to pig

Thank you !!!!!
