The analytics stack Hadoop & Pig

TRANSCRIPT

Page 1: Hadoop pig

The analytics stack Hadoop & Pig

Page 2: Hadoop pig

Outline of the presentation

Hadoop: motivations. What is it? High-level concepts

The ecosystem. The MapReduce model & framework, and HDFS

Programming with Hadoop

Pig: What is it? Motivations

Model & components

Integration with Cassandra

Page 3: Hadoop pig


Please interrupt and ask questions!

Page 4: Hadoop pig

Traditional HPC systems

CPU-intensive computations

Relatively small amounts of data

Tightly-coupled applications

Highly concurrent I/O requirements

Complex message-passing paradigms such as MPI, PVM…

Developers might need to spend some time designing for failure

Page 5: Hadoop pig

Challenges

Data and storage

Locality, computation close to the data

In large-scale systems, nodes fail

Mean time between failures: if one node fails every 3 years, a 1000-node cluster sees roughly one failure per day

Built-in fault-tolerance

Distributed programming is complex

Need a simple data-parallel programming model. Users would structure the application in high-level functions, the system distributes the data & jobs and handles communications and faults


Page 6: Hadoop pig

What is required?

A simple data-parallel programming model, designed for high scalability and resiliency

Scalability to large-scale data volumes

Automated fault-tolerance at application level rather than relying on high-availability hardware

Simplified I/O and tasks monitoring

All based on cost-efficient commodity machines (cheap, but unreliable), and commodity network


Page 7: Hadoop pig

Hadoop’s core concepts

Data is spread in advance, persistent (in terms of locality), and replicated

No inter-dependencies / shared-nothing architecture

Applications are written as two pieces of code (a map and a reduce)

Developers do not have to worry about the underlying issues of networking, job interdependencies, scheduling, etc.

Page 8: Hadoop pig

Where does it come from?

Hadoop originated from Apache Nutch, an open-source web search engine

After the publication of the GFS and MapReduce papers, in 2003 & 2004, the Nutch developers decided to implement open-source versions

In February 2006, it became Hadoop, with a dedicated team at Yahoo!

September 2007: release 0.14.1

The latest release, 1.0.3, came out the week before this presentation

Used by a large number of companies including Facebook, LinkedIn, Twitter, and Hulu, among many others

Page 9: Hadoop pig

The model

A map function processes a key/value pair to generate a set of intermediate key/value pairs

This divides the problem into smaller 'intermediate key/value' pairs

The reduce function merges all intermediate values associated with the same intermediate key

The run-time system takes care of:

Partitioning the input data across nodes (blocks/chunks, typically 64 MB to 128 MB)

Scheduling the data and execution. Maps operate on a single block

Managing node failures, replication, re-submissions…

Page 10: Hadoop pig

Simple Word Count

# key: offset, value: line
def mapper():
    for line in open("doc"):
        for word in line.split():
            output(word, 1)

# key: a word, value: iterator over counts
def reducer():
    output(key, sum(values))

Page 11: Hadoop pig

The Combiner

A combiner is a local aggregation function for repeated keys produced by the map

Works for associative functions like sum, count, max

Decreases the size of intermediate data / communications

Map-side aggregation for word count:

def combiner():
    output(key, sum(values))

Page 12: Hadoop pig

Some other basic examples…

Distributed Grep: the map function emits a line if it matches a supplied pattern; the reduce function is an identity that copies the supplied intermediate data to the output

Count of URL accesses: the map function processes logs of web-page requests and outputs <URL, 1>; the reduce function adds together all values for the same URL, emitting <URL, total count> pairs

Reverse Web-Link graph: the map function outputs <tgt, src> for each link to a tgt found in a page named src; reduce concatenates the list of all src URLs associated with a given tgt URL and emits the pair <tgt, list(src)>

Inverted Index: the map function parses each document, emitting a sequence of <word, doc_ID>; reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair

Page 13: Hadoop pig

Hadoop Ecosystem

[Figure: the Hadoop ecosystem world map, from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/]

Page 14: Hadoop pig

Core components

Hadoop consists of two core components:

The MapReduce framework, and

The Hadoop Distributed File System (HDFS)

MapReduce layer: JobTracker, TaskTrackers

HDFS layer: Namenode, Secondary namenode, Datanodes

[Figure: example of a typical physical distribution of these daemons within a Hadoop cluster]

Page 15: Hadoop pig

HDFS

Scalable and fault-tolerant. Based on Google’s GFS

Single namenode stores metadata (file names, block locations, etc.).

Files split into chunks, replicated across several datanodes (typically 3+). It is rack-aware

Optimised for large files and sequential streaming reads, rather than random access

Files written once, no append

[Figure: File1 is split into blocks 1–4; the namenode stores the metadata while each block is replicated across three of the datanodes]

Page 16: Hadoop pig

HDFS API / HDFS FS Shell for the command line*

> hadoop fs -copyFromLocal local_dir hdfs_dir

> hadoop fs -copyToLocal hdfs_dir local_dir

Tools

Flume: collects, aggregates and moves log data from application servers to HDFS

Sqoop: imports and exports data between HDFS and SQL databases


HDFS

*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

Page 17: Hadoop pig

In Hadoop, a Job (full program) is a set of tasks

Each task (mapper or reducer) is attempted at least once, or multiple times if it crashes. Multiple attempts may also occur in parallel

Each task runs inside a separate JVM on a tasktracker

All the class files are assembled into a jar file, which will be uploaded into HDFS, before notifying the tasktracker
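For example, a compiled job is typically launched from the client with the hadoop jar command (the jar and class names here are illustrative, matching the WordCount example later in the deck):

> hadoop jar wordcount.jar WordCount input_dir output_dir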


MapReduce execution

Page 18: Hadoop pig

MapReduce execution

[Figure: a MapReduce job. The master coordinates the workers: map workers read the input splits (Split 0–4), write intermediate files to local disk, and reduce workers remote-read those files and write the output files]

Page 19: Hadoop pig

Getting Started…

Multiple choices: the vanilla Apache version, or one of the numerous existing distros

hadoop.apache.org

www.cloudera.com [a set of VMs is also provided]

http://www.karmasphere.com/

Three ways to write jobs in Hadoop:

Java API

Hadoop Streaming (for Python, Perl, etc.)

Pipes API (C++)

Page 20: Hadoop pig

Word Count in Java


public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);

    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
}

Page 21: Hadoop pig

public class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.collect(new Text(itr.nextToken()), ONE);
        }
    }
}

Word Count in Java – mapper


Page 22: Hadoop pig

Word Count in Java – reducer

public class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}


Page 23: Hadoop pig

Getting keys and values

[Figure: an InputFormat divides the input file into input splits; a RecordReader turns each split into key/value pairs for a Mapper; reducer output goes through a RecordWriter, governed by the OutputFormat, into the output files]

Page 24: Hadoop pig

Hadoop Streaming

Mapper.py:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t%s" % (word, 1)

Reducer.py:

#!/usr/bin/env python
import sys

counts = {}
for line in sys.stdin:
    word, count = line.split("\t", 1)
    counts[word] = counts.get(word, 0) + int(count)

for word, count in counts.items():
    print "%s\t%s" % (word.lower(), count)

You can locally test your code on the command line: $> cat data | mapper | sort | reducer
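On a cluster, the same scripts are run through the streaming jar; a sketch, assuming the 1.x-era contrib layout (the jar path varies by version):

> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input in_dir -output out_dir \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py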

Page 25: Hadoop pig

High-level tools

MapReduce is fairly low-level: you must think about keys, values, partitioning, etc.

You must work out how to express parallel algorithms as a series of MapReduce jobs

It can be hard to capture common job building blocks

Different use cases require different tools!

Page 26: Hadoop pig

Pig

Apache Pig is a platform that raises the level of abstraction for processing large datasets. Its language, Pig Latin, is a simple query algebra for expressing data transformations and applying functions to records

Started at Yahoo! Research; >60% of Hadoop jobs within Yahoo! are Pig jobs

[Figure: Pig job submission — Pig compiles a script into MapReduce jobs that run on Hadoop / HDFS]

Page 27: Hadoop pig

Motivations

MapReduce requires a Java programmer

The solution was to abstract it into a system that users of common scripting languages find familiar

For anything other than very trivial applications, MapReduce requires multiple stages, leading to long development cycles

Rapid prototyping. Increased productivity

In MapReduce, users have to reinvent common functionality (join, filter, etc.)

Pig provides them

Page 28: Hadoop pig

Used for

Rapid prototyping of algorithms for processing large datasets

Log analysis

Ad hoc queries across various large datasets

Analytics (including through sampling)

Pig Mix provides a set of performance and scalability benchmarks; Pig currently runs at about 1.1 times the runtime of hand-written MapReduce

Page 29: Hadoop pig

Using Pig

Grunt, the Pig shell

Executing scripts directly

Embedding Pig in Java (using PigServer, similar to using SQL via JDBC), or Python

A range of tools, including Eclipse plug-ins: PigPen, Pig Editor…

Page 30: Hadoop pig

Execution modes

Pig has two execution types or modes: local mode and Hadoop mode

Local mode: Pig runs in a single JVM and accesses the local filesystem. Starting from v0.7 it uses the Hadoop job runner

Hadoop mode: Pig runs on a Hadoop cluster (you need to tell Pig about the version and point it to your namenode and jobtracker)
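For instance, the mode can be selected with the -x flag when starting Pig (the script name here is illustrative):

> pig -x local wordcount.pig

> pig -x mapreduce wordcount.pig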

Page 31: Hadoop pig

Running Pig

Pig resides on the user’s machine and can be independent of the Hadoop cluster

Pig is written in Java and is portable

The Pig client compiles scripts into MapReduce jobs and submits them to the cluster

No need to install anything extra on the cluster

Page 32: Hadoop pig

How does it work

Pig defines a DAG: a step-by-step set of operations, each performing a transformation

Pig builds a logical plan for these transformations:

A = LOAD 'file' AS (line);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

Pig then parses, checks, and optimises the plan, plans the execution, generates the maps & reduces, passes the jar to Hadoop, and monitors progress

Page 33: Hadoop pig

Data types & expressions

Scalar types: int, long, float, double, chararray, bytearray

Complex types representing nested structures:

Tuple: a sequence of fields of any type

Bag: an unordered collection of tuples

Map: a set of key-value pairs. Keys must be atoms; values may be any type

Expressions are used in Pig as parts of statements: field names, positions ($), arithmetic, conditional, comparison, Boolean, etc.
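A short sketch of these types and expressions in use (the schema and field names here are hypothetical):

grunt> A = LOAD 'data' AS (t:tuple(a:int, b:int), bg:bag{pair:tuple(x:int)}, m:map[]);
grunt> B = FOREACH A GENERATE t.a + t.b, m#'city';

t.a projects a field out of a tuple, and m#'city' looks up a key in a map.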

Page 34: Hadoop pig

Functions

Load / Store: data loaders — PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump

Evaluation: many built-in functions — MAX, COUNT, SUM, DIFF, SIZE…

Filter: a special type of eval function used by the FILTER operator; IsEmpty is a built-in function

Comparison: functions used in the ORDER statement; ASC | DESC
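A sketch combining GROUP with built-in eval functions, reusing the records alias defined on the following slides (assumed schema: id:int, date:chararray):

grunt> grouped = GROUP records BY date;
grunt> stats   = FOREACH grouped GENERATE group, COUNT(records), MAX(records.id);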

Page 35: Hadoop pig

Schemas

Schemas let you associate names and types with the fields in a relation

Schemas are optional but recommended whenever possible; type declarations result in better parse-time error checking and more efficient code execution

They are defined using the AS keyword with operators

Schema definition for simple data types:

> records = LOAD 'input/data' AS (id:int, date:chararray);

Page 36: Hadoop pig

Statements and aliases

Each statement, defining a data-processing operator / relation, produces a dataset with an alias

grunt> records = LOAD 'input/data' AS (id:int, date:chararray);

LOAD returns a relation of tuples, whose fields can be referenced by position or by name

Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
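For example, DESCRIBE prints the schema of an alias (output format approximate):

grunt> DESCRIBE records;
records: {id: int, date: chararray}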

Page 37: Hadoop pig

Filtering data

FILTER is used to work with tuples or rows of data: select the data you want, or remove the data you are not interested in

Filtering early in the processing pipeline minimises the amount of data flowing through the system, which can improve efficiency

grunt> filtered_records = FILTER records BY id == 234;

Page 38: Hadoop pig

Foreach .. Generate

FOREACH .. GENERATE acts on columns of every row in a relation

grunt> ids = FOREACH records GENERATE id;

Positional references can be used too; this statement has the same output:

grunt> ids = FOREACH records GENERATE $0;

The elements of ids, however, are not named id unless you add AS id at the end of the statement:

grunt> ids = FOREACH records GENERATE $0 AS id;

Page 39: Hadoop pig

Grouping and joining

GROUP .. BY makes an output bag containing grouped fields with the same schema, using a grouping key

JOIN performs an inner equijoin of two or more relations based on common field values. You can also perform outer joins using the keywords LEFT, RIGHT and FULL

COGROUP is similar to GROUP but operates on multiple relations, creating a nested set of output tuples

A short sketch of these operators follows.
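A sketch on the running records example (the pages alias and its schema are hypothetical):

grunt> pages   = LOAD 'input/pages' AS (id:int, url:chararray);
grunt> joined  = JOIN records BY id, pages BY id;
grunt> grouped = GROUP joined BY url;
grunt> cg      = COGROUP records BY id, pages BY id;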

Page 40: Hadoop pig

Ordering, combining, splitting…

ORDER imposes an order on the output, sorting a relation by one or more fields

The LIMIT statement limits the number of results

SPLIT partitions a relation into two or more relations

The SAMPLE operator selects a random data sample of the stated size

The UNION operator merges the contents of two or more relations

A sketch of these operators follows.
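Again using the hypothetical records schema:

grunt> sorted = ORDER records BY date DESC;
grunt> top10  = LIMIT sorted 10;
grunt> SPLIT records INTO low IF id < 500, high IF id >= 500;
grunt> pct    = SAMPLE records 0.01;
grunt> merged = UNION low, high;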

Page 41: Hadoop pig

Stream

The STREAM operator lets you transform data in a relation using an external program or script

grunt> C = STREAM A THROUGH `cut -f 2`;

extracts the second field of A using cut

Scripts are shipped to the cluster using DEFINE:

grunt> DEFINE script `script.py` SHIP ('script.py');

grunt> D = STREAM C THROUGH script AS (…);

Page 42: Hadoop pig

User defined functions

Pig supports user-defined functions (UDFs), backed by a community that shares them

UDFs can encapsulate a user's processing logic for filtering, comparison, evaluation, grouping, or storage

Filter functions, for instance, are all subclasses of FilterFunc, which itself is a subclass of EvalFunc

PiggyBank: the Pig community sharing their UDFs

DataFu: LinkedIn's collection of Pig UDFs

Page 43: Hadoop pig

A simple eval UDF example

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
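Once compiled and packaged, the UDF is registered and invoked by its fully-qualified name (the jar and field names here are illustrative):

grunt> REGISTER myudfs.jar;
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);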

Page 44: Hadoop pig

An Example

Let’s find the top 5 most-visited pages by users aged 18–25. Input: a user data file and a page-view data file.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5


Page 45: Hadoop pig

A simple script

Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
STORE Top5 INTO 'top5sites';

Page 46: Hadoop pig

In MapReduce!

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
            new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
            new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
            new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100,
            new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

Page 47: Hadoop pig

Ease of Translation

Users = LOAD …
Filtered = FILTER …
Pages = LOAD …
Joined = JOIN …
Grouped = GROUP …
Summed = … COUNT() …
Sorted = ORDER …
Top5 = LIMIT …

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5


Page 48: Hadoop pig

Cassandra has gained some significant integration points with Hadoop and its analytics tools

To achieve Hadoop’s data locality, Cassandra nodes must be part of the Hadoop cluster, each running a tasktracker process. The namenode and jobtracker, however, can reside outside the Cassandra cluster

The Hadoop/Pig/Cassandra stack

[Figure: a three-node Cassandra/Hadoop cluster with an external namenode / jobtracker]

Page 49: Hadoop pig

Hadoop jobs

Cassandra has a Java source package for Hadoop integration: org.apache.cassandra.hadoop

ColumnFamilyInputFormat extends InputFormat

ColumnFamilyOutputFormat extends OutputFormat

ConfigHelper: a helper class to configure Cassandra-specific information

Hadoop output streaming was introduced in 0.7 but removed from 0.8

Page 50: Hadoop pig

Pig alongside Cassandra

The Pig integration: CassandraStorage() (a LoadFunc implementation) allows Pig to load/store data from/to Cassandra

grunt> rows = LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();

The pig_cassandra script, shipped with the Cassandra source, performs the necessary initialisation (Pig environment variables still need to be set)

Pygmalion is a set of scripts and UDFs to facilitate the use of Pig alongside Cassandra

Page 51: Hadoop pig

Workflow

A workflow system provides an infrastructure to set up & manage a sequence of interdependent jobs or sets of jobs

The Hadoop ecosystem includes a set of workflow tools for running applications composed of MapReduce processes or high-level language scripts

Cascading (http://www.cascading.org/): a Java library for defining data processing workflows and rendering them to MapReduce jobs

Oozie (http://yahoo.github.com/oozie/)

Page 52: Hadoop pig

Some links

http://hadoop.apache.org

http://pig.apache.org/

https://cwiki.apache.org/confluence/display/PIG/Index

PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank

DataFu: https://github.com/linkedin/datafu

Pygmalion: https://github.com/jeromatron/pygmalion

http://code.google.com/edu/parallel/mapreduce-tutorial.html

Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

Interesting papers: http://bit.ly/rskJho - Original MapReduce paper

http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of MapReduce: The Pig Experience’


Page 53: Hadoop pig


A simple data flow

Load checkins data

Keep only the two ids

Group by user/loc id & Order

Limit to top 50

Top 50 users / locations [same script, different group key]
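A sketch of what this script could look like (field and file names are hypothetical; swapping the group key yields the locations variant):

checkins = LOAD 'checkins' AS (user_id, loc_id, dt);
ids      = FOREACH checkins GENERATE user_id, loc_id;
grouped  = GROUP ids BY user_id;
counts   = FOREACH grouped GENERATE group, COUNT(ids) AS n;
sorted   = ORDER counts BY n DESC;
top50    = LIMIT sorted 50;
STORE top50 INTO 'top50';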

Page 54: Hadoop pig


Another data flow

Load checkins data

Split_date

Group by date

Count the tuples

Group by weeks using Stream

[Chart: all the checkins, plotted over weeks]
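A sketch of this flow (field names and the external week-bucketing script are hypothetical; as on the slide, the grouping by weeks is delegated to STREAM):

checkins = LOAD 'checkins' AS (user_id, loc_id, dt:chararray);
dated    = FOREACH checkins GENERATE SUBSTRING(dt, 0, 10) AS day;
by_day   = GROUP dated BY day;
daily    = FOREACH by_day GENERATE group AS day, COUNT(dated) AS n;
weekly   = STREAM daily THROUGH `python to_weeks.py` AS (week:chararray, n:long);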