pig power tools_by_viswanath_gangavaram

Upload: viswanath-gangavaram

Post on 06-May-2015

Page 1: Pig power tools_by_viswanath_gangavaram

04/11/2023 1

Pig Power Tools: a quick tour

By Viswanath Gangavaram

Data Scientist, R&D, DSG, Ilabs, [24]7 Inc

“Pig provides a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data processing applications in low-level Java code (MapReduce code).” From the preface of “Programming Pig”

Page 2: Pig power tools_by_viswanath_gangavaram

What we are going to cover

• A very short introduction to Apache Pig
• Using the Grunt shell to work with the Hadoop Distributed File System
• Advanced Pig operators (relational)
• Pig macros and modularity features
• Embedding Pig Latin in Python for iterative processing and other advanced tasks (SIMOD golden Journeys)
• JSON parsing
• XML parsing
• UDFs (Jython)
• Pig Streaming
• UDFs vs. Streaming
• Custom load and store functions to handle data formats and storage mechanisms
• Single-row relations
• Python in Pig (bringing nltk, numpy, scipy, pandas into Pig)
• Lipstick
• Hue
• Performance tips
• External libraries
  – Piggybank, DataFu, DataFu Hourglass, SimpleJson, ElephantBird

Note: This is a general Pig tutorial and will have minimal references to any particular data set.

Page 3: Pig power tools_by_viswanath_gangavaram

A short introduction to “Apache Pig” in five minutes

• Apache Pig is a high-level platform for executing data flows in parallel on Hadoop. The language for this platform is called Pig Latin, which includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.

• What does it mean to be Pig?
  – Pigs Fly
    • Pig processes data quickly. Designers want to consistently improve its performance, and not implement features in ways that weigh Pig down so it can't fly.
  – Pigs Eat Everything
    • Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
  – Pigs Live Everywhere
    • Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. See Pig on Tez.
  – Pigs Are Domestic Animals
    • Pig is designed to be easily controlled and modified by its users.
    • Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
    • Pig supports user-provided load and store functions.
    • It supports external executables via its stream command and MapReduce jars via its mapreduce command.
    • It allows users to provide a custom partitioner for their jobs in some circumstances and to set the level of reduce parallelism for their jobs.
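To make the stream command mentioned above more concrete, here is a minimal sketch of an external executable that Pig could stream tuples through. It assumes Pig's default streaming format (one record per line, fields separated by tabs). The function names and the choice to upper-case the first field are hypothetical, picked only for illustration.

```python
import io
import sys


def transform(line):
    """Upper-case the first tab-separated field; leave the other fields unchanged."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = fields[0].upper()
    return "\t".join(fields)


def main(stdin, stdout):
    """Read one record per line and write one transformed record per line."""
    for line in stdin:
        stdout.write(transform(line) + "\n")


# Local dry run with in-memory streams instead of a real Pig job.
# In a real streaming script, replace these lines with: main(sys.stdin, sys.stdout)
out = io.StringIO()
main(io.StringIO("pig\t1\nhadoop\t2\n"), out)
print(out.getvalue(), end="")
```

If this were saved as a file (say, a hypothetical upper_first.py), it could be wired into a script along the lines of DEFINE upper_cmd `python upper_first.py` SHIP('upper_first.py'); followed by B = STREAM A THROUGH upper_cmd; where A, B and upper_cmd are placeholder names.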

Page 4: Pig power tools_by_viswanath_gangavaram

Apache Pig: “Word counting is the hello world of MapReduce”

    -- Load each line of the file as a single field
    inputFile = LOAD 'mary' AS (line);
    -- Split each line into words, one word per row
    words = FOREACH inputFile GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- Collect identical words together
    grpd = GROUP words BY word;
    -- Count the rows in each group
    cntd = FOREACH grpd GENERATE group, COUNT(words);
    DUMP cntd;

Output:
    (This,2)
    (is,2)
    (my,2)
    (first,2)
    (apache,2)
    (pig,2)
    (program,2)

“mary” file content:
    This is my first apache pig program
    This is my first apache pig program
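For readers who think in Python, the same data flow can be sketched step by step in plain Python. This is only an illustration of what the script computes, not how Pig executes it. It also simplifies tokenization to whitespace splitting, whereas Pig's TOKENIZE recognizes a few more delimiters. The function name word_count is made up for this sketch.

```python
from collections import Counter


def word_count(lines):
    """Mirror the Pig script: tokenize and flatten, then group and count."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per output row.
    words = [word for line in lines for word in line.split()]
    # GROUP words BY word, then COUNT(words) for each group.
    return Counter(words)


# Contents of the "mary" file from the slide.
mary = [
    "This is my first apache pig program",
    "This is my first apache pig program",
]

counts = word_count(mary)
for word, n in sorted(counts.items()):
    print((word, n))
```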

Page 5: Pig power tools_by_viswanath_gangavaram

Apache Pig Latin: a data flow language

• Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
• To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
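The DAG view can be made tangible with a small Python sketch. It models a hypothetical script with two inputs that are joined, grouped, and stored, then asks the standard-library graphlib module (Python 3.9+) for an order in which the operators could run. All alias names here are invented for illustration, and this is not how Pig itself represents its plans.

```python
from graphlib import TopologicalSorter

# Each key is an operator node (named by its alias); each value is the set of
# aliases it reads from, i.e. its incoming data-flow edges.
dag = {
    "users": set(),                # LOAD
    "clicks": set(),               # LOAD
    "joined": {"users", "clicks"}, # JOIN
    "grouped": {"joined"},         # GROUP
    "stored": {"grouped"},         # STORE
}

# Any valid execution order must place every operator after its inputs.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the graph has two independent inputs, more than one valid order exists; the only guarantee is that each operator appears after everything it reads from.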

Comparing query languages (Hive/SQL) and data flow languages (Pig)

• After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to form queries. It allows users to describe what question they want answered, but not how they want it answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.

• Another major difference is that SQL is oriented around answering one question. When users want to do several data operations together, they must either write separate queries, storing the intermediate data into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also, using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query.

• Pig, however, is designed with a long series of data operations in mind, so there is no need to write the data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables.

• SQL is the English of data processing. It has the nice feature that everyone and every tool knows it, which means the barrier to adoption is very low. Our goal is to make Pig Latin the native language of parallel data-processing systems such as Hadoop. It may take some learning, but it will allow users to utilize the power of Hadoop much more fully.

Extracted from “Programming Pig”

Page 6: Pig power tools_by_viswanath_gangavaram

Pig’s data types

Scalar types
• int, long, float, double, chararray, bytearray

Complex types
• Map
  – A map in Pig is a chararray to data element mapping, where that element can be any Pig type, including a complex type.
  – The chararray is called a key and is used as an index to find the element, referred to as the value.
  – Map constants are formed using brackets to delimit the map, a hash between keys and values, and a comma between key-value pairs.
    » ['dept'#'dsg', 'team'#'r&d']
• Tuple
  – A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type.
  – Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple.
    » ('boss', 55)
• Bag
  – A bag is an unordered collection of tuples.
  – Bag constants are constructed using braces, with the tuples in the bag separated by commas.
    » {('a', 20), ('b', 20), ('c', 30)}
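As a rough mental model, the three complex types line up with familiar Python structures. The sketch below reuses the constants from the slide. The mapping is an analogy chosen for this example rather than anything Pig defines, and the helper same_bag is hypothetical.

```python
from collections import Counter

# Map: chararray keys to values of any type, modeled as a dict with str keys.
pig_map = {"dept": "dsg", "team": "r&d"}

# Tuple: fixed-length, ordered fields, modeled as a Python tuple.
pig_tuple = ("boss", 55)

# Bag: unordered collection of tuples, modeled as a list whose order is ignored.
pig_bag = [("a", 20), ("b", 20), ("c", 30)]


def same_bag(left, right):
    """Treat two bags as equal when they hold the same tuples in any order."""
    return Counter(left) == Counter(right)


print(pig_map["dept"])      # look up a map value by its key
print(pig_tuple[1])         # access a tuple field by position
print(same_bag(pig_bag, list(reversed(pig_bag))))
```

Comparing bags through Counter rather than list equality reflects the point that a bag carries no ordering.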

Page 7: Pig power tools_by_viswanath_gangavaram

• Nulls
  – Pig includes the concept of a data element being null. Data of any type can be null. A null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc.

• Schemas
  – Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig’s philosophy of eating anything.

• Casts

Page 8: Pig power tools_by_viswanath_gangavaram

Basic operators
1. LOAD
2. STORE
3. LIMIT
4. DEFINE
5. FOREACH
6. FILTER
7. DISTINCT
8. (CO)GROUP
9. JOIN
10. UNION
11. CROSS
12. ORDER BY

Page 9: Pig power tools_by_viswanath_gangavaram

Grunt shell

• Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell to interact with HDFS.
• Command-line history, editing, and tab completion.
• No pipes, no redirection, and no background execution.
• Grunt shell commands
  – Shell for HDFS*
    • fs -ls, fs -du, fs -stat, etc.
  – Shell for Unix commands (working in the local directory)
    • sh ls, sh cat
  – exec
  – run
  – kill jobid
  – set
  – dump
  – explain
  – describe

• *: fs is default

Page 10: Pig power tools_by_viswanath_gangavaram

Advanced operators
1. ASSERT
2. CUBE
3. IMPORT
4. MAPREDUCE
5. ORDER BY
6. RANK
7. SAMPLE
8. SPLIT
9. STREAM

Page 11: Pig power tools_by_viswanath_gangavaram

Pig’s debugging tools

• Use the DUMP operator to display results on your terminal screen.
• Use the DESCRIBE operator to review the schema of a relation.
• Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans used to compute a relation.
• Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

Shortcuts for debugging operators
• \d alias: shortcut for DUMP. If alias is omitted, the last defined alias is used.
• \de alias: shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.
• \e alias: shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.
• \i alias: shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.
• \q: quit the Grunt shell.

Page 12: Pig power tools_by_viswanath_gangavaram

JSON Parsing

Page 13: Pig power tools_by_viswanath_gangavaram

XML Parsing

Page 14: Pig power tools_by_viswanath_gangavaram

User Defined Functions
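Since the agenda lists Jython UDFs, here is a minimal sketch of what one might look like. It assumes that Pig supplies the outputSchema decorator when the file is registered as a Jython script; the fallback definition exists only so the file also runs outside Pig. The function name normalize and its behavior are hypothetical.

```python
try:
    outputSchema  # assumed to be provided by Pig for registered Jython scripts
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file can be run and tested outside Pig."""
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def normalize(word):
    """Strip surrounding whitespace and lower-case a word; pass nulls through."""
    if word is None:
        return None
    return word.strip().lower()


print(normalize("  Pig "))
```

Saved under a hypothetical name such as udfs.py, it could be used from a script roughly as REGISTER 'udfs.py' USING jython AS myfuncs; and then words2 = FOREACH words GENERATE myfuncs.normalize(word); where the aliases are placeholders.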

Page 15: Pig power tools_by_viswanath_gangavaram

Pig Streaming

Page 16: Pig power tools_by_viswanath_gangavaram

UDFs vs. Pig Streaming

Page 17: Pig power tools_by_viswanath_gangavaram

Python in Pig (bringing nltk, numpy, scipy, pandas into Pig)

Page 18: Pig power tools_by_viswanath_gangavaram

Lipstick:- Let’s add some color to Pig

Page 19: Pig power tools_by_viswanath_gangavaram

Hue:- Hadoop and its ecosystem in the browser

Page 20: Pig power tools_by_viswanath_gangavaram

Piggybank

Page 21: Pig power tools_by_viswanath_gangavaram

DataFu

Page 22: Pig power tools_by_viswanath_gangavaram

DataFu Hourglass

Page 23: Pig power tools_by_viswanath_gangavaram

SimpleJson

Page 24: Pig power tools_by_viswanath_gangavaram

Elephant Bird

Page 25: Pig power tools_by_viswanath_gangavaram

So what is Pig?

Pig is a champion