Hadoop Is A Batch: Pig, Hive, Cascading …
Paris JUG, May 2013, Florian Douetteau


Page 1: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop Is A Batch: Pig, Hive, Cascading …

Paris JUG, May 2013, Florian Douetteau

Page 2: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku Training – Hadoop for Data Science

Florian Douetteau <[email protected]>

CEO at Dataiku
Freelance at Criteo (Online Ads)
CTO at IsCool Ent. (#1 French Social Gamer)
VP R&D at Exalead (Search Engine Technology)

About me

Page 3: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku - Pig, Hive and Cascading

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 4: Dataiku - Paris JUG 2013 - Hadoop is a batch

CHOOSE TECHNOLOGY

[Figure: a "fantasy map" of the data technology landscape. Regions: Scalability Central, Machine Learning Mystery Land, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House. Technologies placed on it: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Pandas, QlikView, Tableau, SpotFire, HTML5/D3, InfiniDB, Vertica, GreenPlum, Impala, Netezza, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Cascading, Talend, R]

Page 5: Dataiku - Paris JUG 2013 - Hadoop is a batch

How do I (pre)process data?

[Figure: preprocessing flow. Input sources with volumes: Implicit User Data (Views, Searches, …) ~500TB; Content Data (Title, Categories, Price, …) ~50TB; Explicit User Data (Click, Buy, …) ~1TB; User Information (Location, Graph, …) ~200GB. Derived datasets: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity, A/B Test Data. Served at runtime: Predictor Runtime, Online User Information]

Page 6: Dataiku - Paris JUG 2013 - Hadoop is a batch

Analyse Raw Logs (Trackers, Web Logs)

Extract IP, Page, …
Detect and remove robots
Build Statistics
◦ Number of page views, per product
◦ Best Referrers
◦ Traffic Analysis
◦ Funnel
◦ SEO Analysis
◦ …

Typical Use Case 1: Web Analytics Processing

Page 7: Dataiku - Paris JUG 2013 - Hadoop is a batch

Extract Query Logs
Perform query normalization
Compute Ngrams
Compute Search "Sessions"
Compute Log-Likelihood Ratio for ngrams across sessions

Typical Use Case 2: Mining Search Logs for Synonyms

Page 8: Dataiku - Paris JUG 2013 - Hadoop is a batch

Compute User – Product Association Matrix

Compute different similarity ratios (Ochiai, Cosine, …)

Filter out bad predictions

For each user, select the best recommendable products

Typical Use Case 3: Product Recommender
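The similarity step can be sketched in a few lines of Python (an illustrative toy, not the talk's actual pipeline; the tiny user-product matrix is invented). On binary user-sets, the Ochiai coefficient coincides with cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts: user -> weight)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def ochiai(a, b):
    """Ochiai coefficient of two user sets: |A∩B| / sqrt(|A|·|B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

# Toy user-product association matrix: product -> set of users who interacted with it
matrix = {
    "p1": {"u1", "u2", "u3"},
    "p2": {"u2", "u3"},
    "p3": {"u4"},
}
print(ochiai(matrix["p1"], matrix["p2"]))  # 2 / sqrt(3 * 2)
```

In a real recommender these pairwise scores would be computed per product pair in a MapReduce job, then thresholded to filter out bad predictions.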

Page 9: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 10: Dataiku - Paris JUG 2013 - Hadoop is a batch

Yahoo! Research in 2006, inspired by Sawzall, a 2003 Google paper; became an Apache project in 2007

Initial motivation
◦ Search Log Analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? …

Pig History

words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
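For readers newer to Pig, the same top-10 logic in plain Python (the word counts are invented for illustration):

```python
# Hypothetical (word, count) pairs, standing in for the Pig LOAD step
word_counts = [("the", 120), ("pig", 17), ("a", 99), ("hadoop", 42)]

# ORDER words BY count DESC; LIMIT sorted_words 10; DUMP first_words;
first_words = sorted(word_counts, key=lambda wc: wc[1], reverse=True)[:10]
print(first_words)  # [('the', 120), ('a', 99), ('hadoop', 42), ('pig', 17)]
```

The difference is that Pig compiles this into MapReduce jobs that sort terabytes across a cluster, not an in-memory list.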

Page 11: Dataiku - Paris JUG 2013 - Hadoop is a batch

Developed by Facebook in January 2007

Open sourced in August 2008

Initial motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates

Hive History

create external table wordcounts (
  word string,
  count int)
row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';

Page 12: Dataiku - Paris JUG 2013 - Hadoop is a batch

Authored by Chris Wensel in 2008

Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter, 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading

Cascading History

Page 13: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 14: Dataiku - Paris JUG 2013 - Hadoop is a batch

Dataiku - Innovation Services

MapReduce
Simplicity is a complexity

Page 15: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig & Hive
Mapping to MapReduce jobs

* VAT excluded

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

events_filtered = FILTER events BY type IS NOT NULL;

by_user = GROUP events_filtered BY user;

price_by_user = FOREACH by_user GENERATE group, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

Job 1 Mapper: LOAD, FILTER, GROUP (map side)
Shuffle and sort by user
Job 1 Reducer: FOREACH, FILTER

Page 16: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig & Hive
Mapping to MapReduce jobs

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

events_filtered = FILTER events BY type IS NOT NULL;

by_user = GROUP events_filtered BY user;

price_by_user = FOREACH by_user GENERATE group, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

recent_high = ORDER high_pbu BY max_ts DESC;

STORE recent_high INTO '/output';

Job 1 Mapper: LOAD, FILTER, GROUP (map side)
Shuffle and sort by user
Job 1 Reducer: FOREACH, FILTER

Job 2 Mapper: LOAD (from tmp)
Shuffle and sort by max_ts
Job 2 Reducer: STORE
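The first job's group-and-aggregate can be mimicked in a few lines of plain Python (a toy simulation of the map, shuffle and reduce phases; the event data is invented, and the map-side filter stands in for whatever predicate the real script uses):

```python
from collections import defaultdict

# Toy event log: (type, user, price, timestamp), as the Pig script would LOAD it
events = [
    ("buy", "alice", 700, 1), ("view", "alice", 0, 2),
    ("buy", "alice", 600, 3), ("buy", "bob", 100, 4),
]

# Map phase: FILTER, then emit (user, record) -- user is the GROUP key
mapped = [(user, (price, ts)) for (etype, user, price, ts) in events if etype == "buy"]

# Shuffle: collect all values under their key (what Hadoop's sort/shuffle does)
shuffled = defaultdict(list)
for user, value in mapped:
    shuffled[user].append(value)

# Reduce phase: FOREACH ... GENERATE SUM/MAX, then FILTER total_price > 1000
high_pbu = {
    user: (sum(p for p, _ in vals), max(t for _, t in vals))
    for user, vals in shuffled.items()
    if sum(p for p, _ in vals) > 1000
}
print(high_pbu)  # {'alice': (1300, 3)}
```

The second MapReduce job exists only because the subsequent ORDER BY needs another shuffle, this time keyed on max_ts rather than user.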

Page 17: Dataiku - Paris JUG 2013 - Hadoop is a batch

Pig
How does it work

Data execution plan, compiled into 10 MapReduce jobs executed in parallel (or not)

Page 18: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Joins
How to join with MapReduce?

Input table 1 (tbl_idx 1): uid, name
  1  Dupont
  2  Durand
Input table 2 (tbl_idx 2): uid, type
  1  Type1
  1  Type2
  2  Type1

Mappers output each row tagged with its tbl_idx; shuffle by uid, sort by (uid, tbl_idx)

Reducer 1 input (uid 1): (tbl_idx 1, Dupont), (tbl_idx 2, Type1), (tbl_idx 2, Type2)
Reducer 1 output: (uid 1, Dupont, Type1), (uid 1, Dupont, Type2)
Reducer 2 input (uid 2): (tbl_idx 1, Durand), (tbl_idx 2, Type1)
Reducer 2 output: (uid 2, Durand, Type1)
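The tagged reduce-side join can be sketched in Python (a toy simulation of the tbl_idx trick, using the slide's example rows):

```python
from collections import defaultdict

# The slide's two tables, keyed by uid
users = [(1, "Dupont"), (2, "Durand")]             # tbl_idx = 1
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]  # tbl_idx = 2

# Map phase: tag each row with its table index and emit under the join key
mapped = [(uid, (1, name)) for uid, name in users] + \
         [(uid, (2, t)) for uid, t in types]

# Shuffle: group rows by uid (each uid lands on exactly one reducer)
shuffled = defaultdict(list)
for uid, tagged in mapped:
    shuffled[uid].append(tagged)

# Reduce phase: sort by tbl_idx so table-1 rows come first, then cross them
joined = []
for uid, rows in sorted(shuffled.items()):
    rows.sort()
    names = [v for idx, v in rows if idx == 1]
    vals = [v for idx, v in rows if idx == 2]
    joined += [(uid, n, v) for n in names for v in vals]

print(joined)  # [(1, 'Dupont', 'Type1'), (1, 'Dupont', 'Type2'), (2, 'Durand', 'Type1')]
```

Sorting by (uid, tbl_idx) is what lets a real reducer buffer only the smaller table's rows before streaming the larger one.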

Page 19: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 20: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 21: Dataiku - Paris JUG 2013 - Hadoop is a batch

Transformation as a sequence of operations (procedural)
Transformation as a set of formulas (declarative)

Procedural vs Declarative

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using (ipaddr)
group by dma;

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Page 22: Dataiku - Paris JUG 2013 - Hadoop is a batch

All three extend the basic data model with extended data types
◦ array-like: [event1, event2, event3]
◦ map-like: {type1: value1, type2: value2, …}

Different approaches
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing

Data Type and Model: Rationale

Page 23: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive
Data Type and Schema

Simple type                     Details
TINYINT, SMALLINT, INT, BIGINT  1, 2, 4 and 8 bytes
FLOAT, DOUBLE                   4 and 8 bytes
BOOLEAN
STRING                          Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type  Details
ARRAY         Array of typed items (0-indexed)
MAP           Associative map
STRUCT        Complex class-like objects

CREATE TABLE visit (
  user_name STRING,
  user_id INT,
  user_details STRUCT<age:INT, zipcode:INT>
);

Page 24: Dataiku - Paris JUG 2013 - Hadoop is a batch

Data Types and Schema
Pig

rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type               Details
int, long, float, double  32 and 64 bits, signed
chararray                 A string
bytearray                 An array of … bytes
boolean                   A boolean

Complex type  Details
tuple         a tuple is an ordered fieldname:value map
bag           a bag is a set of tuples

Page 25: Dataiku - Paris JUG 2013 - Hadoop is a batch

Data Type and Schema
Cascading

Support for any Java types, provided they can be serialized in Hadoop

No support for typing

Simple type               Details
Int, Long, Float, Double  32 and 64 bits, signed
String                    A string
byte[]                    An array of … bytes
Boolean                   A boolean

Complex type  Details
Object        Object must be « Hadoop serializable »

Page 26: Dataiku - Paris JUG 2013 - Hadoop is a batch

Style Summary

           Style        Typing                                        Data Model                              Metadata store
Pig        Procedural   Static + Dynamic                              scalar + tuple + bag (fully recursive)  No (HCatalog)
Hive       Declarative  Static + Dynamic, enforced at execution time  scalar + list + map                     Integrated
Cascading  Procedural   Weak                                          scalar + Java objects                   No

Page 27: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment

Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 28: Dataiku - Paris JUG 2013 - Hadoop is a batch

Does debugging the tool lead to bad headaches?

Headachability: Motivation

Page 29: Dataiku - Paris JUG 2013 - Hadoop is a batch

Out Of Memory Error (Reducer)
Exceptions in building extended functions (handling of null)
Null vs ""
Nested FOREACH and scoping
Date Management (Pig 0.10)
Field implicit ordering

Headaches: Pig

Page 30: Dataiku - Paris JUG 2013 - Hadoop is a batch


A Pig Error

Page 31: Dataiku - Paris JUG 2013 - Hadoop is a batch

Out of Memory Errors in Reducers
Few debugging options
Null vs ""
No builtin "first"

Headaches: Hive

Page 32: Dataiku - Paris JUG 2013 - Hadoop is a batch

Weak typing errors (comparing Int and String …)
Illegal operation sequences (group after group …)
Field implicit ordering

Headaches: Cascading

Page 33: Dataiku - Paris JUG 2013 - Hadoop is a batch

How to perform unit tests?
How to have different versions of the same script (parameters)?

Testing: Motivation

Page 34: Dataiku - Paris JUG 2013 - Hadoop is a batch

System variables
Comment to test
No metaprogramming
pig -x local to execute on local files

Testing: Pig

Page 35: Dataiku - Paris JUG 2013 - Hadoop is a batch

JUnit tests are possible
Ability to use code to actually comment out some variables

Testing / Environment: Cascading

Page 36: Dataiku - Paris JUG 2013 - Hadoop is a batch

Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …

Checkpointing: Motivation

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output. A step FAILs: FIX and relaunch]

Page 37: Dataiku - Paris JUG 2013 - Hadoop is a batch

STORE command to manually store files

Pig: Manual Checkpointing

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output. // COMMENT the beginning of the script and relaunch]

Page 38: Dataiku - Paris JUG 2013 - Hadoop is a batch


Ability to re-run a flow automatically from the last saved checkpoint

Cascading Automated Checkpointing

addCheckpoint(…)

Page 39: Dataiku - Paris JUG 2013 - Hadoop is a batch

Check each intermediate file's timestamp
Execute only if more recent

Cascading: Topological Scheduler

[Flow: Parse Logs -> Filtering -> Page User Correlation -> Per Page Stats -> Output]

Page 40: Dataiku - Paris JUG 2013 - Hadoop is a batch

Productivity Summary

           Headaches                           Checkpointing / Replay              Testing / Metaprogramming
Pig        Lots                                Manual save                         Difficult metaprogramming, easy local testing
Hive       Few, but without debugging options  None (that's SQL)                   None (that's SQL)
Cascading  Weak typing complexity              Checkpointing, partial updates      Possible

Page 41: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 42: Dataiku - Paris JUG 2013 - Hadoop is a batch

Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (binary Hadoop format)
◦ Avro, Thrift …
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Formats Integration: Motivation

Format                       Size on Disk (GB)  Hive processing time (24 cores)
Text File, uncompressed      18.7               1m32s
1 Text File, Gzipped         3.89               6m23s (no parallelization)
JSON compressed              7.89               2m42s
multiple text files gzipped  4.02               43s
Sequence File, Block, Gzip   5.32               1m18s
Text File, LZO Indexed       7.03               1m22s

Format impact on size and performance

Page 43: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive: SerDe (Serializer/Deserializer)
Pig: Storage
Cascading: Tap

Format Integration

Page 44: Dataiku - Paris JUG 2013 - Hadoop is a batch

No support for "UPDATE" patterns; any increment is performed by adding or deleting a partition

Common partition schemes on Hadoop
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above

Partitions: Motivation

Page 45: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Partitioning
Partitioned tables

CREATE TABLE event (
  user_id INT,
  type STRING,
  message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure

/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
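The directory layout above is mechanical: each partition column becomes a `name=value` path segment. A small Python helper (hypothetical, for illustration only) makes the mapping explicit:

```python
from datetime import date

# Builds a partition directory following the slide's layout:
# /hive/event/day=YYYY-MM-DD/server_id=sX
def partition_path(table_root, day, server_id):
    return f"{table_root}/day={day.isoformat()}/server_id={server_id}"

print(partition_path("/hive/event", date(2013, 1, 27), "s1"))
# /hive/event/day=2013-01-27/server_id=s1
```

Because the values are encoded in the path, Hive can prune whole directories from a query on `day` or `server_id` without reading a single file.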

Page 46: Dataiku - Paris JUG 2013 - Hadoop is a batch

No direct support for partitions

Support for "glob" taps, to read from files using patterns

You can code your own custom or virtual partition schemes

Cascading Partition

Page 47: Dataiku - Paris JUG 2013 - Hadoop is a batch

External Code Integration: Simple UDF

[Slide shows side-by-side UDF code examples for Pig, Hive and Cascading]

Page 48: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hive Complex UDF (Aggregators)

Page 49: Dataiku - Paris JUG 2013 - Hadoop is a batch


Cascading Direct Code Evaluation

Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO

Page 50: Dataiku - Paris JUG 2013 - Hadoop is a batch

Allows calling a Cascading flow from a Spring Batch

Spring Batch Cascading Integration

No full integration with Spring MessageSource or MessageHandler yet (only for local flows)

Page 51: Dataiku - Paris JUG 2013 - Hadoop is a batch

Integration Summary

           Partition / Incremental Updates  External Code                                           Format Integration
Pig        No direct support                Simple                                                  Doable, and rich community
Hive       Fully integrated, SQL-like       Very simple, but complex dev setup                      Doable, and existing community
Cascading  With coding                      Complex UDFs, but regular; Java expressions embeddable  Doable, and growing community
Page 52: Dataiku - Paris JUG 2013 - Hadoop is a batch

Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema

Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment

Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration

Performance and optimization

Comparing without Comparable

Page 53: Dataiku - Paris JUG 2013 - Hadoop is a batch

Several common MapReduce optimization patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism

Different support per framework
◦ Fully automatic
◦ Pragmas / directives / options
◦ Coding style / code to write

Optimization

Page 54: Dataiku - Paris JUG 2013 - Hadoop is a batch

SELECT date, COUNT(*) FROM product GROUP BY date

Combiner: Perform Partial Aggregate at Mapper Stage

[Figure: without a combiner, each mapper emits one record per input row (dates 2012-02-14, 2012-02-15, …); the reducers receive every record and produce the counts: 2012-02-14 -> 20, 2012-02-15 -> 35, 2012-02-16 -> 1]

Page 55: Dataiku - Paris JUG 2013 - Hadoop is a batch

SELECT date, COUNT(*) FROM product GROUP BY date

Combiner: Perform Partial Aggregate at Mapper Stage

[Figure: with a combiner, each mapper pre-aggregates its own rows (one mapper emits 2012-02-14 -> 12, 2012-02-15 -> 23; another emits 2012-02-14 -> 8, 2012-02-15 -> 12, 2012-02-16 -> 1); the reducers only sum the partial counts: 2012-02-14 -> 20, 2012-02-15 -> 35, 2012-02-16 -> 1]

Reduced network bandwidth. Better parallelism.
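The combiner effect can be simulated in a few lines of Python (toy data chosen to match the figure's partial counts):

```python
from collections import Counter

# Toy partitions of the product table, one list of date values per mapper
mapper_inputs = [
    ["2012-02-14"] * 12 + ["2012-02-15"] * 23,
    ["2012-02-14"] * 8 + ["2012-02-15"] * 12 + ["2012-02-16"],
]

# Combiner: each mapper pre-aggregates locally before the shuffle
partials = [Counter(rows) for rows in mapper_inputs]

# Only a handful of (date, partial_count) pairs cross the network
# instead of all 56 raw records
records_shuffled = sum(len(p) for p in partials)

# Reducer: merge the partial counts into the final totals
totals = sum(partials, Counter())
print(dict(totals))  # {'2012-02-14': 20, '2012-02-15': 35, '2012-02-16': 1}
```

This works for COUNT and SUM because they are associative; an aggregate like MEDIAN cannot be combined this way.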

Page 56: Dataiku - Paris JUG 2013 - Hadoop is a batch

Join Optimization: Map Join

Hive: set hive.auto.convert.join = true;

Pig: [replicated-join example shown on slide]

Cascading: HashJoin (no aggregation support after HashJoin)

Page 57: Dataiku - Paris JUG 2013 - Hadoop is a batch

Critical for performance

Estimated from the size of the input file
◦ Hive: divide size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: divide size by pig.exec.reducers.bytes.per.reducer (default 1GB)

Number of Reducers
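The estimate is simple arithmetic; a Python sketch (the function name and the rounding-up are my own, the 1GB default is from the slide):

```python
import math

def estimated_reducers(input_bytes, bytes_per_reducer=1 << 30):
    """Default heuristic: input size divided by bytes-per-reducer, rounded up."""
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

print(estimated_reducers(18_700_000_000))  # the 18.7 GB table from earlier -> 18 reducers
```

The defaults are tunable: a skewed key distribution or heavy per-record work often warrants setting the reducer count by hand.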

Page 58: Dataiku - Paris JUG 2013 - Hadoop is a batch

Performance & Optimization Summary

           Combiner Optimization  Join Optimization     Number of reducers optimization
Pig        Automatic              Option                Estimate or DIY
Cascading  DIY                    HashJoin              DIY
Hive       Partial DIY            Automatic (Map Join)  Estimate or DIY

Page 59: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 60: Dataiku - Paris JUG 2013 - Hadoop is a batch

Follow the Flow

[Figure: end-to-end flow across systems.
Sync in: Tracker Log, Syslog / Apache Logs, MongoDB (Product Catalog), MySQL (Order), Partner FTP, S3, Search Logs.
Hadoop processing, mixing Pig and Hive jobs: Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Product Recommender, (External) Search Engine Optimization, (Internal) Search Ranking.
Sync out: MongoDB, MySQL, ElasticSearch]

Page 61: Dataiku - Paris JUG 2013 - Hadoop is a batch

E.g. Product Recommender

[Figure: recommendation pipeline. Inputs: Page Views, Orders, Catalog. Page Views are filtered against Bots and Special Users into Filtered Page Views; Orders feed an Order Summary. Intermediate datasets: User Affinity, Product Popularity, User Similarity (Per Category), User Similarity (Per Brand). A Machine Learning step produces the Recommendation Graph, which yields the final Recommendation]

Page 62: Dataiku - Paris JUG 2013 - Hadoop is a batch

Schema maintenance between tools

Proper incremental and efficient synchronization between tools, NoSQL stores and log systems

Proper "management" partitions (daily jobs, …)

Job sequence and management
◦ How to properly handle a new field? missing data? recompute everything?

Pain Points: On Large Projects

Page 63: Dataiku - Paris JUG 2013 - Hadoop is a batch

HCatalog provides interoperability between Hive and Pig in terms of schema

Integration Option: HCatalog

[Diagram: Hive and Pig both access table schemas through HCatalog]

Page 64: Dataiku - Paris JUG 2013 - Hadoop is a batch

Build tools: 1970 Shell script, 1977 Makefile, 1980 Makedeps, 1999 Cons/CMake, 2001 Maven, 2004 Ivy, 2008 Gradle
Hadoop workflow tools: Shell Script, 2008 HaMake, 2009 Oozie, … ETL Hadoop Next … ?

Similar to "Build"

Page 65: Dataiku - Paris JUG 2013 - Hadoop is a batch

Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)

Agenda

Page 66: Dataiku - Paris JUG 2013 - Hadoop is a batch

Want to keep close to SQL?
◦ Hive

Want to write large flows?
◦ Pig

Want to integrate in large-scale programming projects?
◦ Cascading (Cascalog / Scalding)

Presentation available on http://www.slideshare.net/Dataiku