Hadoop Is A Batch: Pig, Hive, Cascading …
Paris JUG, May 2013, Florian Douetteau


Page 1: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop Is A Batch: Pig, Hive, Cascading …

Paris JUG, May 2013, Florian Douetteau

Page 2: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku Training – Hadoop for Data Science

Florian Douetteau <[email protected]>

CEO at Dataiku
Freelance at Criteo (Online Ads)
CTO at IsCool Ent. (#1 French Social Gamer)
VP R&D at Exalead (Search Engine Technology)

About me

Page 3: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku - Pig, Hive and Cascading

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 4: Dataiku - Paris JUG 2013 - Hadoop is a batch

CHOOSE TECHNOLOGY

[Figure: a "fantasy map" of the data technology landscape. Regions: Scalability Central, Machine Learning Mystery Land, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House. Technologies placed on it: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Pandas, QlikView, Tableau, SpotFire, HTML5/D3, InfiniDB, Vertica, GreenPlum, Impala, Netezza, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Cascading, Talend, R]

Page 5: Dataiku - Paris JUG 2013 - Hadoop is a batch

How do I (pre)process data?

[Figure: preprocessing flow. Input sources with volumes: Implicit User Data (Views, Searches, …) ~500TB; Content Data (Title, Categories, Price, …) ~50TB; Explicit User Data (Click, Buy, …) ~1TB; User Information (Location, Graph, …) ~200GB. Derived datasets: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity, A/B Test Data. Served at runtime: Predictor Runtime, Online User Information]

Page 6: Dataiku - Paris JUG 2013 - Hadoop is a batch

Analyse Raw Logs (Trackers, Web Logs)

Extract IP, Page, …
Detect and remove robots
Build Statistics
◦ Number of page views, per product
◦ Best Referrers
◦ Traffic Analysis
◦ Funnel
◦ SEO Analysis
◦ …

Typical Use Case 1: Web Analytics Processing

Page 7: Dataiku - Paris JUG 2013 - Hadoop is a batch

Extract Query Logs
Perform query normalization
Compute Ngrams
Compute Search "Sessions"
Compute Log-Likelihood Ratio for ngrams across sessions

Typical Use Case 2: Mining Search Logs for Synonyms

Page 8: Dataiku - Paris JUG 2013 - Hadoop is a batch

Compute User – Product Association Matrix

Compute different similarity ratios (Ochiai, Cosine, …)

Filter out bad predictions

For each user, select the best recommendable products

Typical Use Case 3: Product Recommender
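The similarity step can be sketched in a few lines of Python (an illustrative toy, not the talk's actual pipeline; the tiny user-product matrix is invented). On binary user-sets, the Ochiai coefficient coincides with cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts: user -> weight)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def ochiai(a, b):
    """Ochiai coefficient of two user sets: |A∩B| / sqrt(|A|·|B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

# Toy user-product association matrix: product -> set of users who interacted with it
matrix = {
    "p1": {"u1", "u2", "u3"},
    "p2": {"u2", "u3"},
    "p3": {"u4"},
}
print(ochiai(matrix["p1"], matrix["p2"]))  # 2 / sqrt(3 * 2)
```

In a real recommender these pairwise scores would be computed per product pair in a MapReduce job, then thresholded to filter out bad predictions.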

Page 9: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 10: Dataiku - Paris JUG 2013 - Hadoop is a batch

Yahoo! Research in 2006, inspired by Sawzall, a 2003 Google paper; became an Apache project in 2007

Initial motivation
◦ Search Log Analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? …

Pig History

words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
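For readers newer to Pig, the same top-10 logic in plain Python (the word counts are invented for illustration):

```python
# Hypothetical (word, count) pairs, standing in for the Pig LOAD step
word_counts = [("the", 120), ("pig", 17), ("a", 99), ("hadoop", 42)]

# ORDER words BY count DESC; LIMIT sorted_words 10; DUMP first_words;
first_words = sorted(word_counts, key=lambda wc: wc[1], reverse=True)[:10]
print(first_words)  # [('the', 120), ('a', 99), ('hadoop', 42), ('pig', 17)]
```

The difference is that Pig compiles this into MapReduce jobs that sort terabytes across a cluster, not an in-memory list.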

Page 11: Dataiku - Paris JUG 2013 - Hadoop is a batch

Developed by Facebook in January 2007

Open sourced in August 2008

Initial motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates

Hive History

create external table wordcounts (
  word string,
  count int)
row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';

Page 12: Dataiku - Paris JUG 2013 - Hadoop is a batch

Authored by Chris Wensel in 2008

Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter, 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading

Cascading History

Page 13: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 14: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku - Innovation Services

MapReduce
Simplicity is a complexity

Page 15: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig & Hive
Mapping to MapReduce jobs

* VAT excluded

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

events_filtered = FILTER events BY type IS NOT NULL;

by_user = GROUP events_filtered BY user;

price_by_user = FOREACH by_user GENERATE group, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

Job 1 Mapper: LOAD, FILTER, GROUP (map side)
Shuffle and sort by user
Job 1 Reducer: FOREACH, FILTER

Page 16: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig & Hive
Mapping to MapReduce jobs

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

events_filtered = FILTER events BY type IS NOT NULL;

by_user = GROUP events_filtered BY user;

price_by_user = FOREACH by_user GENERATE group, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

recent_high = ORDER high_pbu BY max_ts DESC;

STORE recent_high INTO '/output';

Job 1 Mapper: LOAD, FILTER, GROUP (map side)
Shuffle and sort by user
Job 1 Reducer: FOREACH, FILTER

Job 2 Mapper: LOAD (from tmp)
Shuffle and sort by max_ts
Job 2 Reducer: STORE
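The first job's group-and-aggregate can be mimicked in a few lines of plain Python (a toy simulation of the map, shuffle and reduce phases; the event data is invented, and the map-side filter stands in for whatever predicate the real script uses):

```python
from collections import defaultdict

# Toy event log: (type, user, price, timestamp), as the Pig script would LOAD it
events = [
    ("buy", "alice", 700, 1), ("view", "alice", 0, 2),
    ("buy", "alice", 600, 3), ("buy", "bob", 100, 4),
]

# Map phase: FILTER, then emit (user, record) -- user is the GROUP key
mapped = [(user, (price, ts)) for (etype, user, price, ts) in events if etype == "buy"]

# Shuffle: collect all values under their key (what Hadoop's sort/shuffle does)
shuffled = defaultdict(list)
for user, value in mapped:
    shuffled[user].append(value)

# Reduce phase: FOREACH ... GENERATE SUM/MAX, then FILTER total_price > 1000
high_pbu = {
    user: (sum(p for p, _ in vals), max(t for _, t in vals))
    for user, vals in shuffled.items()
    if sum(p for p, _ in vals) > 1000
}
print(high_pbu)  # {'alice': (1300, 3)}
```

The second MapReduce job exists only because the subsequent ORDER BY needs another shuffle, this time keyed on max_ts rather than user.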

Page 17: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig
How does it work

Data execution plan, compiled into 10 MapReduce jobs executed in parallel (or not)

Page 18: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Joins
How to join with MapReduce?

Input table 1 (tbl_idx 1): uid, name
  1  Dupont
  2  Durand
Input table 2 (tbl_idx 2): uid, type
  1  Type1
  1  Type2
  2  Type1

Mappers output each row tagged with its tbl_idx; shuffle by uid, sort by (uid, tbl_idx)

Reducer 1 input (uid 1): (tbl_idx 1, Dupont), (tbl_idx 2, Type1), (tbl_idx 2, Type2)
Reducer 1 output: (uid 1, Dupont, Type1), (uid 1, Dupont, Type2)
Reducer 2 input (uid 2): (tbl_idx 1, Durand), (tbl_idx 2, Type1)
Reducer 2 output: (uid 2, Durand, Type1)
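The tagged reduce-side join can be sketched in Python (a toy simulation of the tbl_idx trick, using the slide's example rows):

```python
from collections import defaultdict

# The slide's two tables, keyed by uid
users = [(1, "Dupont"), (2, "Durand")]             # tbl_idx = 1
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]  # tbl_idx = 2

# Map phase: tag each row with its table index and emit under the join key
mapped = [(uid, (1, name)) for uid, name in users] + \
         [(uid, (2, t)) for uid, t in types]

# Shuffle: group rows by uid (each uid lands on exactly one reducer)
shuffled = defaultdict(list)
for uid, tagged in mapped:
    shuffled[uid].append(tagged)

# Reduce phase: sort by tbl_idx so table-1 rows come first, then cross them
joined = []
for uid, rows in sorted(shuffled.items()):
    rows.sort()
    names = [v for idx, v in rows if idx == 1]
    vals = [v for idx, v in rows if idx == 2]
    joined += [(uid, n, v) for n in names for v in vals]

print(joined)  # [(1, 'Dupont', 'Type1'), (1, 'Dupont', 'Type2'), (2, 'Durand', 'Type1')]
```

Sorting by (uid, tbl_idx) is what lets a real reducer buffer only the smaller table's rows before streaming the larger one.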

Page 19: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 20: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 21: Dataiku - Paris JUG 2013 - Hadoop is a batch

Transformation as a sequence of operations (procedural)
Transformation as a set of formulas (declarative)

Procedural vs Declarative

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using (ipaddr)
group by dma;

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Page 22: Dataiku - Paris JUG 2013 - Hadoop is a batch

All three extend the basic data model with extended data types
◦ array-like: [event1, event2, event3]
◦ map-like: {type1: value1, type2: value2, …}

Different approaches
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing

Data Type and Model: Rationale

Page 23: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive
Data Type and Schema

Simple type                     Details
TINYINT, SMALLINT, INT, BIGINT  1, 2, 4 and 8 bytes
FLOAT, DOUBLE                   4 and 8 bytes
BOOLEAN
STRING                          Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type  Details
ARRAY         Array of typed items (0-indexed)
MAP           Associative map
STRUCT        Complex class-like objects

CREATE TABLE visit (
  user_name STRING,
  user_id INT,
  user_details STRUCT<age:INT, zipcode:INT>
);

Page 24: Dataiku - Paris JUG 2013 - Hadoop is a batch

Data Types and Schema
Pig

rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type               Details
int, long, float, double  32 and 64 bits, signed
chararray                 A string
bytearray                 An array of … bytes
boolean                   A boolean

Complex type  Details
tuple         a tuple is an ordered fieldname:value map
bag           a bag is a set of tuples

Page 25: Dataiku - Paris JUG 2013 - Hadoop is a batch

Data Type and Schema
Cascading

Support for any Java types, provided they can be serialized in Hadoop

No support for typing

Simple type               Details
Int, Long, Float, Double  32 and 64 bits, signed
String                    A string
byte[]                    An array of … bytes
Boolean                   A boolean

Complex type  Details
Object        Object must be « Hadoop serializable »

Page 26: Dataiku - Paris JUG 2013 - Hadoop is a batch

Style Summary

           Style        Typing                                        Data Model                              Metadata store
Pig        Procedural   Static + Dynamic                              scalar + tuple + bag (fully recursive)  No (HCatalog)
Hive       Declarative  Static + Dynamic, enforced at execution time  scalar + list + map                     Integrated
Cascading  Procedural   Weak                                          scalar + Java objects                   No

Page 27: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment

Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 28: Dataiku - Paris JUG 2013 - Hadoop is a batch

Does debugging the tool lead to bad headaches?

Headachability: Motivation

Page 29: Dataiku - Paris JUG 2013 - Hadoop is a batch

Out Of Memory Error (Reducer)
Exceptions in building extended functions (handling of null)
Null vs ""
Nested FOREACH and scoping
Date Management (Pig 0.10)
Field implicit ordering

Headaches: Pig

Page 30: Dataiku - Paris JUG 2013 - Hadoop is a batch


A Pig Error

Page 31: Dataiku - Paris JUG 2013 - Hadoop is a batch

Out of Memory Errors in Reducers
Few debugging options
Null vs ""
No builtin "first"

Headaches: Hive

Page 32: Dataiku - Paris JUG 2013 - Hadoop is a batch

Weak typing errors (comparing Int and String …)
Illegal operation sequences (group after group …)
Field implicit ordering

Headaches: Cascading

Page 33: Dataiku - Paris JUG 2013 - Hadoop is a batch

How to perform unit tests?
How to have different versions of the same script (parameters)?

Testing: Motivation

Page 34: Dataiku - Paris JUG 2013 - Hadoop is a batch

System variables
Comment to test
No metaprogramming
pig -x local to execute on local files

Testing: Pig

Page 35: Dataiku - Paris JUG 2013 - Hadoop is a batch

JUnit tests are possible
Ability to use code to actually comment out some variables

Testing / Environment: Cascading

Page 36: Dataiku - Paris JUG 2013 - Hadoop is a batch

Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …

Checkpointing: Motivation

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output. A step FAILs: FIX and relaunch]

Page 37: Dataiku - Paris JUG 2013 - Hadoop is a batch

STORE command to manually store files

Pig: Manual Checkpointing

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output. // COMMENT the beginning of the script and relaunch]

Page 38: Dataiku - Paris JUG 2013 - Hadoop is a batch


Ability to re-run a flow automatically from the last saved checkpoint

Cascading Automated Checkpointing

addCheckpoint(…)

Page 39: Dataiku - Paris JUG 2013 - Hadoop is a batch

Check each intermediate file's timestamp
Execute only if more recent

Cascading: Topological Scheduler

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output]

Page 40: Dataiku - Paris JUG 2013 - Hadoop is a batch

Productivity Summary

           Headaches                           Checkpointing / Replay              Testing / Metaprogramming
Pig        Lots                                Manual save                         Difficult metaprogramming, easy local testing
Hive       Few, but without debugging options  None (that's SQL)                   None (that's SQL)
Cascading  Weak typing complexity              Checkpointing, partial updates      Possible

Page 41: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 42: Dataiku - Paris JUG 2013 - Hadoop is a batch

Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (binary Hadoop format)
◦ Avro, Thrift …
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Formats Integration: Motivation

Format                       Size on Disk (GB)  Hive processing time (24 cores)
Text File, uncompressed      18.7               1m32s
1 Text File, Gzipped         3.89               6m23s (no parallelization)
JSON compressed              7.89               2m42s
multiple text files gzipped  4.02               43s
Sequence File, Block, Gzip   5.32               1m18s
Text File, LZO Indexed       7.03               1m22s

Format impact on size and performance

Page 43: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive: SerDe (Serializer/Deserializer)
Pig: Storage
Cascading: Tap

Format Integration

Page 44: Dataiku - Paris JUG 2013 - Hadoop is a batch

No support for "UPDATE" patterns; any increment is performed by adding or deleting a partition

Common partition schemes on Hadoop
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above

Partitions: Motivation

Page 45: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Partitioning
Partitioned tables

CREATE TABLE event (
  user_id INT,
  type STRING,
  message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure

/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
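The directory layout above is mechanical: each partition column becomes a `name=value` path segment. A small Python helper (hypothetical, for illustration only) makes the mapping explicit:

```python
from datetime import date

# Builds a partition directory following the slide's layout:
# /hive/event/day=YYYY-MM-DD/server_id=sX
def partition_path(table_root, day, server_id):
    return f"{table_root}/day={day.isoformat()}/server_id={server_id}"

print(partition_path("/hive/event", date(2013, 1, 27), "s1"))
# /hive/event/day=2013-01-27/server_id=s1
```

Because the values are encoded in the path, Hive can prune whole directories from a query on `day` or `server_id` without reading a single file.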

Page 46: Dataiku - Paris JUG 2013 - Hadoop is a batch

No direct support for partitions

Support for "glob" taps, to read from files using patterns

You can code your own custom or virtual partition schemes

Cascading Partition

Page 47: Dataiku - Paris JUG 2013 - Hadoop is a batch

External Code Integration: Simple UDF

[Slide shows side-by-side UDF code examples for Pig, Hive and Cascading]

Page 48: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Complex UDF (Aggregators)

Page 49: Dataiku - Paris JUG 2013 - Hadoop is a batch


Cascading Direct Code Evaluation

Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO

Page 50: Dataiku - Paris JUG 2013 - Hadoop is a batch

Allows calling a Cascading flow from a Spring Batch

Spring Batch Cascading Integration

No full integration with Spring MessageSource or MessageHandler yet (only for local flows)

Page 51: Dataiku - Paris JUG 2013 - Hadoop is a batch

Integration Summary

           Partition / Incremental Updates  External Code                                           Format Integration
Pig        No direct support                Simple                                                  Doable, and rich community
Hive       Fully integrated, SQL-like       Very simple, but complex dev setup                      Doable, and existing community
Cascading  With coding                      Complex UDFs, but regular; Java expressions embeddable  Doable, and growing community
Page 52: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 53: Dataiku - Paris JUG 2013 - Hadoop is a batch

Several common MapReduce optimization patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism

Different support per framework
◦ Fully automatic
◦ Pragmas / directives / options
◦ Coding style / code to write

Optimization

Page 54: Dataiku - Paris JUG 2013 - Hadoop is a batch

SELECT date, COUNT(*) FROM product GROUP BY date

Combiner: Perform Partial Aggregate at Mapper Stage

[Figure: without a combiner, each mapper emits one record per input row (dates 2012-02-14, 2012-02-15, …); the reducers receive every record and produce the counts: 2012-02-14 -> 20, 2012-02-15 -> 35, 2012-02-16 -> 1]

Page 55: Dataiku - Paris JUG 2013 - Hadoop is a batch

SELECT date, COUNT(*) FROM product GROUP BY date

Combiner: Perform Partial Aggregate at Mapper Stage

[Figure: with a combiner, each mapper pre-aggregates its own rows (one mapper emits 2012-02-14 -> 12, 2012-02-15 -> 23; another emits 2012-02-14 -> 8, 2012-02-15 -> 12, 2012-02-16 -> 1); the reducers only sum the partial counts: 2012-02-14 -> 20, 2012-02-15 -> 35, 2012-02-16 -> 1]

Reduced network bandwidth. Better parallelism.
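The combiner effect can be simulated in a few lines of Python (toy data chosen to match the figure's partial counts):

```python
from collections import Counter

# Toy partitions of the product table, one list of date values per mapper
mapper_inputs = [
    ["2012-02-14"] * 12 + ["2012-02-15"] * 23,
    ["2012-02-14"] * 8 + ["2012-02-15"] * 12 + ["2012-02-16"],
]

# Combiner: each mapper pre-aggregates locally before the shuffle
partials = [Counter(rows) for rows in mapper_inputs]

# Only a handful of (date, partial_count) pairs cross the network
# instead of all 56 raw records
records_shuffled = sum(len(p) for p in partials)

# Reducer: merge the partial counts into the final totals
totals = sum(partials, Counter())
print(dict(totals))  # {'2012-02-14': 20, '2012-02-15': 35, '2012-02-16': 1}
```

This works for COUNT and SUM because they are associative; an aggregate like MEDIAN cannot be combined this way.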

Page 56: Dataiku - Paris JUG 2013 - Hadoop is a batch

Join Optimization: Map Join

Hive: set hive.auto.convert.join = true;

Pig: [replicated-join example shown on slide]

Cascading: HashJoin (no aggregation support after HashJoin)

Page 57: Dataiku - Paris JUG 2013 - Hadoop is a batch

Critical for performance

Estimated from the size of the input file
◦ Hive: divide size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: divide size by pig.exec.reducers.bytes.per.reducer (default 1GB)

Number of Reducers
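The estimate is simple arithmetic; a Python sketch (the function name and the rounding-up are my own, the 1GB default is from the slide):

```python
import math

def estimated_reducers(input_bytes, bytes_per_reducer=1 << 30):
    """Default heuristic: input size divided by bytes-per-reducer, rounded up."""
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

print(estimated_reducers(18_700_000_000))  # the 18.7 GB table from earlier -> 18 reducers
```

The defaults are tunable: a skewed key distribution or heavy per-record work often warrants setting the reducer count by hand.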

Page 58: Dataiku - Paris JUG 2013 - Hadoop is a batch

Performance & Optimization Summary

           Combiner Optimization  Join Optimization     Number of reducers optimization
Pig        Automatic              Option                Estimate or DIY
Cascading  DIY                    HashJoin              DIY
Hive       Partial DIY            Automatic (Map Join)  Estimate or DIY

Page 59: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 60: Dataiku - Paris JUG 2013 - Hadoop is a batch

Follow the Flow

[Figure: end-to-end flow across systems.
Sync in: Tracker Log, Syslog / Apache Logs, MongoDB (Product Catalog), MySQL (Order), Partner FTP, S3, Search Logs.
Hadoop processing, mixing Pig and Hive jobs: Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Product Recommender, (External) Search Engine Optimization, (Internal) Search Ranking.
Sync out: MongoDB, MySQL, ElasticSearch]

Page 61: Dataiku - Paris JUG 2013 - Hadoop is a batch

E.g. Product Recommender

[Figure: recommendation pipeline. Inputs: Page Views, Orders, Catalog. Page Views are filtered against Bots and Special Users into Filtered Page Views; Orders feed an Order Summary. Intermediate datasets: User Affinity, Product Popularity, User Similarity (Per Category), User Similarity (Per Brand). A Machine Learning step produces the Recommendation Graph, which yields the final Recommendation]

Page 62: Dataiku - Paris JUG 2013 - Hadoop is a batch

Schema maintenance between tools

Proper incremental and efficient synchronization between tools, NoSQL stores and log systems

Proper "management" partitions (daily jobs, …)

Job sequence and management
◦ How to properly handle a new field? missing data? recompute everything?

Pain Points: On Large Projects

Page 63: Dataiku - Paris JUG 2013 - Hadoop is a batch

HCatalog provides interoperability between Hive and Pig in terms of schema

Integration Option: HCatalog

[Diagram: Hive and Pig both access table schemas through HCatalog]

Page 64: Dataiku - Paris JUG 2013 - Hadoop is a batch

Build tools: 1970 Shell script, 1977 Makefile, 1980 Makedeps, 1999 Cons/CMake, 2001 Maven, 2004 Ivy, 2008 Gradle
Hadoop workflow tools: Shell Script, 2008 HaMake, 2009 Oozie, … ETL Hadoop Next … ?

Similar to "Build"

Page 65: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 66: Dataiku - Paris JUG 2013 - Hadoop is a batch

Want to keep close to SQL?
◦ Hive

Want to write large flows?
◦ Pig

Want to integrate in large-scale programming projects?
◦ Cascading (Cascalog / Scalding)

Presentation available on http://www.slideshare.net/Dataiku