APACHE PIG Presented by Priagung Khusumanegara Prof. Kyungbaek Kim

Post on 03-Jan-2016

Agenda

• Introducing Pig
  - Pig Characteristics
  - Pig Elements

• Pig Latin Foundation
  - Data Flow
  - Pig Features
  - Data Types

• Pig Operators and Functions

Pig Characteristics

• A platform for analyzing large data sets that runs on top of Hadoop
• Provides a high-level language for expressing data analysis
• Uses both HDFS (to read and write files) and MapReduce (to execute jobs)

Pig Elements

Pig Latin
- High-level scripting language
- Designed specifically for data transformation and flow expression

Grunt
- The environment in which Pig Latin commands are executed
- Currently supports Local and Hadoop modes

Pig Interpreter
- Converts Pig Latin to MapReduce jobs

Pig Latin Data Flow

• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the data.
• A DUMP statement to view results or a STORE statement to save the results.

LOAD -> TRANSFORM -> DUMP or STORE
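
The three stages above can be sketched as a short Pig Latin script. This is only an illustration, not code from the slides: the file name students.txt, the output directory passed_out, the field names and the threshold of 60 are all assumptions.

```pig
-- LOAD: read comma-delimited text from the file system (hypothetical file)
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- TRANSFORM: keep only the rows whose score is at least 60 (assumed threshold)
passed = FILTER data BY score >= 60;

-- DUMP prints the result to the console
DUMP passed;

-- STORE writes the result into an output directory (hypothetical name)
STORE passed INTO 'passed_out' USING PigStorage(',');
```

In practice a script usually ends with either DUMP (while exploring) or STORE (to keep the output), rather than both.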

Running Pig

• Script
  - Execute commands in a file
  - $ pig scriptFile.pig

• Grunt
  - Interactive shell for executing Pig commands
  - Started when a script file is NOT provided

Running Modes

• Local
  - Executes in a single JVM
  - Works exclusively with the local file system
  - Great for development, experimentation and prototyping

• Hadoop Mode
  - Also known as MapReduce mode
  - Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
  - Can execute against a pseudo-distributed or fully distributed Hadoop installation

Running Modes
- $ pig -x local
- $ pig -x mapreduce

Hadoop Mode

Pig Relation

Pig Latin statements work with relations.
• A field is a piece of data, e.g. 19
• A tuple is an ordered set of fields, e.g. (19,2)
• A bag is an unordered collection of tuples, e.g. {(19,2), (18,1)}
• A relation is a bag

[Figure: diagram labeling Field, Tuple and Bag]
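
To make the nesting concrete, here is a small sketch that loads two-field tuples and then groups them so that an inner bag appears. The file pairs.txt and the field names f1 and f2 are assumptions chosen to match the (19,2) and (18,1) examples above.

```pig
-- 'pairs.txt' is a hypothetical file with lines such as 19,2 and 18,1
A = LOAD 'pairs.txt' USING PigStorage(',') AS (f1:int, f2:int);

-- A is a relation: an outer bag whose tuples each hold two fields
-- GROUP ... ALL collects every tuple of A into a single group
B = GROUP A ALL;

-- B has one tuple: the group key plus an inner bag holding the tuples of A
DESCRIBE B;
DUMP B;
```

DESCRIBE prints the schema, which is a convenient way to see where bags and tuples sit inside a relation.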

Data Type

Data Type   Description                          Example
int         Signed 32-bit integer                10
long        Signed 64-bit integer                Data: 10L or 10l
                                                 Display: 10L
float       32-bit floating point                Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F
                                                 Display: 10.5F or 1050.0F
double      64-bit floating point                Data: 10.5 or 10.5e2 or 10.5E2
                                                 Display: 10.5 or 1050.0
chararray   Character array (string) in          hello world
            Unicode UTF-8 format
boolean     boolean                              true/false (case insensitive)
datetime    datetime                             1970-01-01T00:00:00.000+00:00
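
These simple types are typically declared in the schema of a LOAD statement. The sketch below assumes a hypothetical comma-delimited file records.txt and made-up field names, with one field per type from the table.

```pig
-- hypothetical input file and field names; one field per simple type
records = LOAD 'records.txt' USING PigStorage(',')
          AS (id:int, total:long, ratio:float, price:double,
              label:chararray, active:boolean, created:datetime);

-- print the declared schema to confirm the types
DESCRIBE records;
```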

LOAD operator

Load the contents of text files into a bag named data. The schema names the fields and gives their types.
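
The original code for this slide is not in the transcript, so the following is only a sketch of what such a LOAD might look like. The file name student.txt, the comma delimiter and the name field are assumptions; score and status are the field names used on later slides.

```pig
-- load a hypothetical text file into a bag named data
-- the schema follows the AS keyword: field names with their types
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- show the schema attached to data
DESCRIBE data;
```

If the USING clause is left out, Pig falls back to PigStorage with a tab delimiter.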

DUMP and STORE operator

• No action is taken until a DUMP or STORE command is encountered
  - Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results on screen
• STORE – saves the results to a file

DUMP and STORE operator

DUMP Example

STORE Example
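
The example code from the slide is not preserved in this transcript. A minimal sketch of both commands, assuming the same hypothetical student.txt input and an output directory named student_out, could look like this:

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);
-- up to this point Pig has only parsed and validated the statement

-- DUMP triggers execution and prints each tuple to the screen
DUMP data;

-- STORE triggers execution and writes tab-delimited files
-- into the (hypothetical) output directory
STORE data INTO 'student_out' USING PigStorage('\t');
```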

FILTER and GROUP operator

Filter the data bag

Group the filtered bag by score
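
A sketch of the two steps, since the slide's own code is not in the transcript. The input file, the name field and the filter condition (score of at least 60) are assumptions; the alias filtered follows the slide's wording.

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- keep only the tuples that satisfy the condition (assumed threshold)
filtered = FILTER data BY score >= 60;

-- one output tuple per distinct score, each holding a bag of matching tuples
grouped = GROUP filtered BY score;
DUMP grouped;
```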

ORDER operator

Note: for descending order
Sorted = ORDER data BY score DESC;
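
For comparison, this sketch shows ascending and descending sorts side by side. The input file and schema are assumptions, as are the alias names other than the ORDER statement quoted in the note.

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- ascending is the default when no direction is given
sorted_asc = ORDER data BY score;

-- DESC reverses the order, as in the note above
sorted_desc = ORDER data BY score DESC;
DUMP sorted_desc;
```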

FOREACH operator

For each row, emit the score and status fields
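
A minimal sketch of this projection, assuming the same hypothetical student.txt input with a name field alongside score and status:

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- project two of the three fields for every tuple in data
projected = FOREACH data GENERATE score, status;
DUMP projected;
```

GENERATE also accepts expressions, for example score + 1, so FOREACH can compute new fields as well as select existing ones.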

DISTINCT operator

Remove duplicate tuples from a bag
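
A short sketch, with the input file and schema assumed as before. DISTINCT compares whole tuples, so two rows count as duplicates only if every field matches.

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- drop tuples that are identical in every field
unique = DISTINCT data;
DUMP unique;
```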

UNION operator

Merge the contents of two or more bags
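
A sketch of a two-way union. The file names a.txt and b.txt and the shared schema are assumptions made for illustration.

```pig
-- two hypothetical inputs with the same schema
data1 = LOAD 'a.txt' USING PigStorage(',') AS (name:chararray, score:int);
data2 = LOAD 'b.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- concatenate the tuples of both bags
merged = UNION data1, data2;
DUMP merged;
```

UNION keeps duplicates and does not guarantee tuple order; apply DISTINCT or ORDER afterwards if either matters.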

JOIN operator

Bags data1 and data2 are joined on their first fields.
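
The join code from the slide is not in the transcript; the following sketch assumes two hypothetical files whose first field is a shared id.

```pig
-- hypothetical inputs; the first field of each is the join key
data1 = LOAD 'data1.txt' USING PigStorage(',') AS (id:int, name:chararray);
data2 = LOAD 'data2.txt' USING PigStorage(',') AS (id:int, score:int);

-- inner join on the first field of each bag
-- (the positional form is equivalent: JOIN data1 BY $0, data2 BY $0)
joined = JOIN data1 BY id, data2 BY id;
DUMP joined;
```

Each output tuple contains all fields of data1 followed by all fields of data2.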

SUM, MIN, AVG Function

Note:
- find min value: MIN
- find sum value: SUM
- find average value: AVG
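
These functions operate on a bag, so they are normally applied after a GROUP. The sketch below assumes the hypothetical student.txt input and computes all three values over the score field for the whole data set.

```pig
data = LOAD 'student.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- put every tuple into a single group so the functions see the whole bag
all_rows = GROUP data ALL;

-- data.score is the bag of score values inside the group
stats = FOREACH all_rows GENERATE
            MIN(data.score) AS min_score,
            SUM(data.score) AS sum_score,
            AVG(data.score) AS avg_score;
DUMP stats;
```

Replacing GROUP data ALL with GROUP data BY status would give one row of statistics per status value instead.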