Introduction to Pig
I EAT BIG!!! I HANDLE BIG!!!
Presented by J.Ramsingh, M.C.A., M.Phil.,
Ph.D. Research Scholar, Department of Computer Applications
Bharathiar University
Contents
• Why PIG?
• Overview of PIG
• PIG Installation
• PIG Latin Basics
• Developing PIG Scripts
J.Ram Singh, Ph.D. Research Scholar, Dr.V.Bhuvaneswari, Asst. Professor, Dept. of Comp. Appl., Bharathiar University - WDABT 2016
Why PIG?
• MapReduce is not preferred for data analytics
• 200 LOC of MapReduce ≈ 10 LOC of Pig Latin
• MapReduce is not rich in built-in functions
Overview of PIG
Is a
Can Handle Large Data Sets
I LOVE TO EAT MORE N MORE
PIG Vs MAPREDUCE
Running Environment
LOCAL MODE
HADOOP MODE
Pig Execution in Hadoop Cluster
Components of PIG
• Pig Latin: command based language
• Grunt: execution environment
• PIG Server: compiler that strives to optimize execution
Compilation
• The Pig system does two tasks:
Logical Plan
–Builds a Logical Plan from a Pig Latin script
–Supports execution platform independence
–No processing of data is performed at this stage
Physical Plan
–Compiles the Logical Plan to a Physical Plan and executes it
–Converts the Logical Plan into a series of Map-Reduce jobs to be executed by Hadoop Map-Reduce
Building a logical plan
A = LOAD 'dataset 1.dat' AS (name, dob, designation);
B = GROUP A BY designation;
C = FOREACH B GENERATE group AS designation, COUNT(A);
D = FILTER C BY designation == 'XXX' OR designation == 'yyy';
STORE D INTO 'result.dat';
Logical plan in script order: LOAD DATA -> GROUP DATA -> FOREACH -> FILTER
Logical plan after reordering (FILTER moved ahead of GROUP): LOAD DATA -> FILTER -> GROUP -> FOREACH
The plan is executed only when output is specified by STORE or DUMP.
Building a Physical plan
Step 1: Create a map-reduce job for each COGROUP
Step 2: Push other commands into the map and reduce functions where possible
Step 3: Certain commands may require their own map-reduce job (e.g., ORDER needs separate map-reduce jobs)
[Diagram: the logical plan (LOAD DATA, FILTER, GROUP, FOREACH) mapped onto the Map and Reduce phases of one job: Load(user.dat), Filter, Group, Foreach]
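The three steps above can be explored with a small script. This is only a sketch: 'user.dat' is taken from the slide, while the schema, the aliases and the filter values are assumptions made for illustration. EXPLAIN prints the plans Pig builds, so the number of map-reduce jobs it actually chooses can be inspected rather than guessed.

```pig
-- 'user.dat' comes from the slide; schema and aliases are assumed.
users   = LOAD 'user.dat' AS (name:chararray, dob:chararray, designation:chararray);
picked  = FILTER users BY designation == 'XXX' OR designation == 'yyy';
-- GROUP is the statement that anchors a map-reduce job (Step 1).
grouped = GROUP picked BY designation;
-- FOREACH is a candidate for being pushed into that job (Step 2).
counts  = FOREACH grouped GENERATE group AS designation, COUNT(picked) AS n;
-- ORDER is the kind of statement that may need jobs of its own (Step 3).
sorted  = ORDER counts BY n DESC;
-- Print the logical, physical and map-reduce plans without running the script.
EXPLAIN sorted;
```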
I Need all These
• Linux above 10
• Java above 6
• Hadoop
• Pig
Execution of Pig
PIG Latin Basics
• Pig Latin is a data flow language rather than a procedural or declarative one; a program consists of a collection of statements.
• A statement can be thought of as an operation, or a command.
Building blocks (Complex Data Types)
• Fields - A field is a piece of data [e.g.: student_id = 01]
• Tuples - A tuple is an ordered set of fields [e.g.: (01, Raja, MCA, C++)]
• Bags - A bag is a collection of tuples [e.g.: {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]
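As a rough illustration of how fields, tuples and bags appear in a schema, the sketch below declares one of each. The file name 'students.txt' and all field names are hypothetical and are not part of the original slides.

```pig
-- Hypothetical input file and field names, for illustration only.
students = LOAD 'students.txt' AS (
    student_id:int,                                   -- a simple field
    profile:tuple(name:chararray, degree:chararray),  -- a tuple of two fields
    courses:bag{c:tuple(course:chararray)}            -- a bag of one-field tuples
);
-- Show the declared structure.
DESCRIBE students;
-- Reach into the tuple with '.', and unnest the bag with FLATTEN.
per_course = FOREACH students GENERATE student_id, profile.name, FLATTEN(courses);
```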
PIG Data Types
Simple Type    Description
int            Signed 32-bit integer
long           Signed 64-bit integer
float          32-bit floating point
double         64-bit floating point
chararray      Character array (string) in Unicode UTF-8 format
bytearray      Byte array (blob)
boolean        Boolean
PIG Basic Commands
Statement          Description
Load               Read data from the file system
Store              Write data to the file system
Dump               Generate output
Foreach            Apply expression to each record and generate one or more records
Filter             Apply predicate to each record and remove records where false
Group / Cogroup    Collect records with the same key from one or more inputs
Join               Join two or more inputs based on a key
Order              Sort records based on a key
Distinct           Remove duplicate records
Union              Merge two datasets
Limit              Limit the number of records
Split              Split data into 2 or more sets, based on filter conditions
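A few of the commands in the table (Split, Union, Distinct, Limit) have no example elsewhere in this deck, so here is a minimal sketch of them together. The file 'age.csv', the int type for age and the threshold 18 are assumptions.

```pig
-- Assumed input: comma-separated name,age rows.
data = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- Split: route each record into one of two relations by a condition.
SPLIT data INTO minors IF age < 18, adults IF age >= 18;
-- Union: merge the two relations back together.
both = UNION minors, adults;
-- Distinct: drop duplicate names.
names = FOREACH both GENERATE name;
uniq  = DISTINCT names;
-- Limit: keep at most five records.
first5 = LIMIT uniq 5;
DUMP first5;
```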
Load Command
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file
   Must be in single quotes
• USING – specifies the load function to use
   By default uses PigStorage(), which parses each line into fields using a delimiter
   Default delimiter is tab ('\t')
• AS – assigns a schema to incoming data
   Assigns names to fields
   Declares types of fields
LOAD Command Example
data = load '$dir/age.csv' using PigStorage(',') as (name:chararray, age:chararray);
• '$dir/age.csv' is the data
• (name:chararray, age:chararray) is the schema
DUMP and STORE statements
• No action is taken until a DUMP or STORE command is encountered
   Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results on the screen
• STORE – saves results (typically to a file)
data = load '$dir/newfine.csv' using PigStorage(',') as (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
:
: (nothing is executed; Pig will optimize this entire chunk of script)
:
DUMP data;
Ram,22
....
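The slide shows DUMP only; a STORE version might look like the sketch below. The output name 'returned_out' and the filter step are assumptions. Note that STORE writes a directory of part files, and Pig reports an error if that output location already exists.

```pig
-- Same columns as the slide; the relative path is an assumption.
data = LOAD 'newfine.csv' USING PigStorage(',')
       AS (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
-- Still nothing runs here: this only extends the plan.
returned = FILTER data BY ReturnDate IS NOT NULL;
-- STORE triggers execution and writes comma-separated output.
STORE returned INTO 'returned_out' USING PigStorage(',');
```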
FOREACH
• FOREACH <bag> GENERATE <data>
   Iterate over each element in the bag and produce a result
   result = FOREACH data GENERATE name;
FOREACH with Functions
FOREACH B GENERATE group, FUNCTION(A);
• Pig comes with many built-in functions including COUNT, CONCAT, etc., and operators such as FLATTEN
• Can implement a custom function
Example
groupme = GROUP data BY name;
counts = FOREACH groupme GENERATE group, COUNT(data);
Dump counts
(Ram,3)
(Raj,4)
(Sam,2)
(Mani,1)
Diagnostic Tools
DESCRIBE
   Display the structure of the bag
   DESCRIBE <bag_name>;
EXPLAIN
   Display the execution plan
   Produces various reports
   • Logical Plan
   • MapReduce Plan
   EXPLAIN <bag_name>;
ILLUSTRATE
   Illustrate how the Pig engine transforms the data
   ILLUSTRATE <bag_name>;
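To make the three tools concrete, here is a short sketch that applies each to the same relation. The input file and the int type for age are assumptions. For a grouped relation, DESCRIBE reports a `group` field followed by a bag named after the input alias.

```pig
-- Assumed input: comma-separated name,age rows.
data    = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
grouped = GROUP data BY name;
-- Schema of the relation: the group key plus a bag called 'data'.
DESCRIBE grouped;
-- Logical, physical and map-reduce plans for producing 'grouped'.
EXPLAIN grouped;
-- Step-by-step sample rows showing how each statement transforms the data.
ILLUSTRATE grouped;
```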
FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it's an operator
   Re-arranges output
E.g.
grunt> dump data
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
Nested structure: bag of bags of tuples
grunt> flatBag = FOREACH data GENERATE flatten($0);
grunt> dump flatBag
(this)
(is)
(a)
......
Each row is flattened, resulting in a bag of simple tokens
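A common use of FLATTEN is word counting, where each line is tokenized into a bag and the bag is then unnested into one word per record. The sketch below assumes a plain-text file named 'input.txt'; the aliases are illustrative.

```pig
-- Assumed input: one line of text per record.
lines   = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE yields a bag of words; FLATTEN turns it into one record per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```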
Group
• Groups the data in one or multiple relations.
• The GROUP operator groups together tuples that have the same group key (key field).
• The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as that of the group key.
Example
groupme = group data by name;
Dump groupme
(Ram,{(Ram, 30),(Ram, 22), (Ram, 25)})
(Raj,{(Raj, 5),(Raj, 22), (Raj, 52), (Raj, 62)})
(Sam,{(Sam, 15),(Sam, 22)})
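The point about the key becoming a tuple can be seen by grouping on two fields. This is a sketch with an assumed input file and an assumed int type for age; the parts of the key are then read back with `group.<field>`.

```pig
-- Assumed input: comma-separated name,age rows.
data    = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- Two key fields: 'group' is now a tuple (name, age).
by_both = GROUP data BY (name, age);
-- Pull the key parts back out and count the records in each group.
pairs   = FOREACH by_both GENERATE group.name AS name, group.age AS age, COUNT(data) AS n;
DUMP pairs;
```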
Co-Group
• COGROUP is the same as GROUP.
• Groups two datasets together by a common attribute.
• Groups data into nested bags
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved"
Example
Data1 = load '$dir/data.csv' using PigStorage(',') as (name:chararray, age:chararray);
Data2 = load '$dir/data2.csv' using PigStorage(',') as (name:chararray, address:chararray);
X = COGROUP Data1 BY name, Data2 BY name;
cont …
Dump X
(Ram,{(Ram, 30),(Ram, 22), (Ram, 25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj, 5),(Raj, 22), (Raj, 52), (Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
(Sam,{(Sam, 15),(Sam, 22)},{})
After the group key, the first bag comes from Data1 (first dataset) and the second bag comes from Data2 (second dataset).
COGROUP by default is an OUTER JOIN. You can remove records with empty bags by applying INNER to each bag:
X = COGROUP Data1 BY name INNER, Data2 BY name INNER;
Dump X
(Ram,{(Ram, 30),(Ram, 22), (Ram, 25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj, 5),(Raj, 22), (Raj, 52), (Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
Filtering
• Select a subset of the tuples in a bag
   FILTER bag BY expression;
• Expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, NOT, OR)
Example
filterdata = filter data by age > 20;
Dump filterdata
(Ram, 30)
(Ram, 22)
(Ram, 25)
(Raj, 22)
(Raj, 52)
(Raj, 62)
(Sam, 22)
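The logical connectors mentioned above can be combined as in the following sketch. It assumes age is loaded as int so that numeric comparison applies; the file name and aliases are illustrative.

```pig
-- Assumed input: comma-separated name,age rows, with age typed as int.
data = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- AND: both conditions must hold.
ram_adults = FILTER data BY (name == 'Ram') AND (age > 20);
-- OR: either condition may hold.
ram_or_raj = FILTER data BY (name == 'Ram') OR (name == 'Raj');
-- NOT: negate a condition.
not_sam = FILTER data BY NOT (name == 'Sam');
DUMP ram_adults;
```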
Ordering
• Sorts a relation based on one or more fields.
   alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] };
Example
orderddata = order filterdata by age DESC;
Dump orderddata
(Raj, 62)
(Raj, 52)
(Ram, 30)
(Ram, 25)
(Ram, 22)
(Raj, 22)
(Sam, 22)
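Sorting on more than one field, and keeping only the top few rows, might look like this sketch. The input file, the int type for age and the limit of 3 are assumptions.

```pig
-- Assumed input: comma-separated name,age rows, with age typed as int.
data   = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- Sort by name ascending, then by age descending within each name.
sorted = ORDER data BY name ASC, age DESC;
-- Keep the first three records of the sorted relation.
top3   = LIMIT sorted 3;
DUMP top3;
```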
Join
• Joins two datasets together by a common attribute.
• By default, the JOIN operator performs an inner join.
• Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
[Diagram: Data 1 joined with Data 2]
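An inner join of the two datasets used in the Co-Group slide could be sketched as follows. The relative file paths, the int type for age and the intermediate aliases are assumptions; null keys are filtered first, as suggested above.

```pig
-- Same two inputs as the Co-Group example; paths are assumed.
Data1 = LOAD 'data.csv'  USING PigStorage(',') AS (name:chararray, age:int);
Data2 = LOAD 'data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
-- Drop records with null keys before joining.
D1 = FILTER Data1 BY name IS NOT NULL;
D2 = FILTER Data2 BY name IS NOT NULL;
-- Inner join: one flat output record per matching pair.
J = JOIN D1 BY name, D2 BY name;
-- Both inputs have a 'name' field, so it is disambiguated with '::'.
flat = FOREACH J GENERATE D1::name AS name, age, address;
DUMP flat;
```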
Outer Join
• Records which will not join with the 'other' record-set are included using outer join
• Left Outer
   Records from the first data-set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null.
[Diagram: left outer join of Data 1 and Data 2]
cont …
• Right Outer
   The opposite of Left Outer Join: records from the second data-set are included no matter what. Fields from the unmatched (first) bag are set to null.
• Full Outer
   Records from both sides are included. For unmatched records, the fields from the 'other' bag are set to null.
[Diagrams: right outer and full outer joins of Data 1 and Data 2]
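The three outer-join variants differ only in one keyword. The sketch below reuses the two datasets from the Co-Group slide; file paths and the int type for age are assumptions. Declaring schemas on both inputs is advisable here, since Pig needs to know how many null fields to produce for an unmatched side.

```pig
-- Same two inputs as the Co-Group example; paths are assumed.
Data1 = LOAD 'data.csv'  USING PigStorage(',') AS (name:chararray, age:int);
Data2 = LOAD 'data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
-- Keep every Data1 record; Data2 fields are null when there is no match.
L = JOIN Data1 BY name LEFT OUTER, Data2 BY name;
-- Keep every Data2 record; Data1 fields are null when there is no match.
R = JOIN Data1 BY name RIGHT OUTER, Data2 BY name;
-- Keep records from both sides.
F = JOIN Data1 BY name FULL OUTER, Data2 BY name;
DUMP L;
```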
EVAL FUNCTIONS
• AVG
• CONCAT
• COUNT
• IsEmpty
• MAX
• MIN
• SUM
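The sketch below shows one possible use of each listed function. File names, the int type for age, the '_user' suffix and all aliases are assumptions made for illustration. The aggregate functions are applied to the bag produced by GROUP, and IsEmpty is applied to a bag produced by COGROUP.

```pig
-- Assumed inputs: name,age rows and name,address rows.
data    = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
addr    = LOAD 'data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
grouped = GROUP data BY name;
-- Aggregates over the bag of records in each group.
stats   = FOREACH grouped GENERATE
              group         AS name,
              COUNT(data)   AS n,
              MIN(data.age) AS youngest,
              MAX(data.age) AS oldest,
              AVG(data.age) AS mean_age,
              SUM(data.age) AS total_age;
-- CONCAT works on individual fields rather than on bags.
labels  = FOREACH data GENERATE CONCAT(name, '_user') AS label;
-- IsEmpty: keep only names that have no address records.
both    = COGROUP data BY name, addr BY name;
no_addr = FILTER both BY IsEmpty(addr);
DUMP stats;
```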
UDFs
• User Defined Functions
• A way to operate on fields
• But not on groups
• Can be called from a Pig script
UDF to Rescue [Embedded Mode]
• Easy to use
• Easy to code
• Keeps the power of PIG
• You are free to write in
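Calling a UDF from a Pig script generally follows a register, define, call pattern. In the sketch below, the jar 'myudfs.jar' and the class com.example.pig.ToUpper are hypothetical placeholders for a user-written function; only REGISTER and DEFINE themselves are Pig statements.

```pig
-- Hypothetical jar containing a user-written function.
REGISTER 'myudfs.jar';
-- Hypothetical class name; DEFINE gives it a short alias.
DEFINE ToUpper com.example.pig.ToUpper();
-- Assumed input: comma-separated name,age rows.
data  = LOAD 'age.csv' USING PigStorage(',') AS (name:chararray, age:int);
-- The UDF is applied to a field like any built-in function.
upper = FOREACH data GENERATE ToUpper(name) AS name, age;
DUMP upper;
```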
PIG a WRAPPER
Does What Ever You Want
• Image Feature Extraction
• Geo Computations
• Data Cleaning
• Retrieve Web Pages
• NLP ………
• Even more…….
Why PIG IS Faster?
• Fewer bugs
• Fewer LOC
• Easier to read (the purpose of the analytics is straightforward)
SQL Vs PIG LATIN
SQL:       SELECT name, SUM(FineAmount) FROM orders GROUP BY name
PIG Latin: A = GROUP orders BY name;
           B = FOREACH A GENERATE $0 AS name, SUM($1.FineAmount) AS orderTotal;

SQL:       ... HAVING SUM(FineAmount) > 500 ...
PIG Latin: C = FILTER B BY orderTotal > 500;

SQL:       ... ORDER BY name ASC;
PIG Latin: D = ORDER C BY name ASC;

SQL:       SELECT DISTINCT name FROM Dataset 1;
PIG Latin: names = FOREACH users GENERATE name;
           UniqueNames = DISTINCT names;

SQL:       SELECT name, COUNT(DISTINCT age) FROM Dataset 1 GROUP BY name
PIG Latin: UserBYNAme = GROUP users BY name;
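The last SQL query stops after the GROUP step on the Pig side. One way to finish a COUNT(DISTINCT ...) in Pig Latin is a nested FOREACH, sketched below. The input file, the int type for age and the aliases are assumptions, not the authors' original continuation.

```pig
-- Assumed input: comma-separated name,age rows.
users      = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
UserByName = GROUP users BY name;
-- Nested FOREACH: de-duplicate ages inside each group, then count them.
DistinctAges = FOREACH UserByName {
    ages = DISTINCT users.age;
    GENERATE group AS name, COUNT(ages) AS distinct_ages;
};
DUMP DistinctAges;
```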
Pit Falls
• Version match
• Bugs in older versions require registering jars
Conclusion
• Pig is a data processing environment in Hadoop which targets procedural programmers who do large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.
QUERIES PLZZ
Thank you !!!!!