Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop
Gwen Shapira
@[email protected]
Coming soon to a bookstore near you…
• Hadoop Application Architectures
How to build end-to-end solutions using Apache Hadoop and related tools
@hadooparchbook
www.hadooparchitecturebook.com
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
• Many, many ways to process data in parallel at scale
The Ecosystem
• High-level languages and abstractions
• File, relational, and streaming data integration
• Process orchestration and scheduling
• Libraries for data wrangling
• Low-latency query language
Why ETL with Hadoop?
Data Has Changed in the Last 30 Years
[Figure: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices, and sophisticated machines. Structured data is now roughly 10% of the total; unstructured data is roughly 90%.]
Volume, Variety, Velocity Cause Problems
[Figure: the traditional pipeline. OLTP systems and enterprise applications feed Extract, Transform, and Load steps into the data warehouse, which serves queries for business intelligence. Three pain points:]
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive, and archived data can't provide value.
Got unstructured data?
• Traditional ETL:
  • Text
  • CSV
  • XLS
  • XML
• Hadoop:
  • HTML
  • XML, RSS
  • JSON
  • Apache logs
  • Avro, Protocol Buffers, ORC, Parquet
  • Compression
  • Office, OpenDocument, iWork
  • PDF, EPUB, RTF
  • MIDI, MP3
  • JPEG, TIFF
  • Java classes
  • mbox, RFC 822
  • AutoCAD
  • TrueType
  • HDF / NetCDF
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant, and scalable.

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
What I often see:
• ETL cluster
• ELT in DWH
• ETL in Hadoop
Moving your transformations from the DWH to Hadoop?
Let's do it right.
Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?
Or at least, let's avoid the worst mistakes.
Extract
Let me count the ways
1. From databases: Sqoop
2. Log data: Flume
3. Copy data to HDFS
Data Loading Mistake #1
Hadoop is scalable. Let's run as many Sqoop mappers as possible, to get the data from our DB faster!
— Famous last words
Result:
Lesson:
• Start with 2 mappers and add more slowly (see the sketch below)
• Watch DB load and network utilization
• Use the FairScheduler to limit the number of mappers
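A minimal sketch of such a conservative import; the connection string, credentials, table, and paths here are all hypothetical placeholders:

    # Start with 2 mappers; raise --num-mappers gradually while
    # watching database load and network utilization.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table ORDERS \
      --target-dir /etl/staging/orders \
      --num-mappers 2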
Data Loading Mistake #2
Database-specific connectors are complicated and scary. Let's just use the default JDBC connector.
— Famous last words
Result:
Lesson:
1. There are dedicated connectors for Oracle, Netezza, and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is unclear
5. Follow the installation instructions
6. Use Sqoop with the connectors (see the sketch below)
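As a hedged sketch (same hypothetical connection details as before): once a database-specific direct connector is installed, recent Sqoop versions can route the job through it via the --direct flag; exact setup and behavior vary by vendor, so follow the connector's own documentation:

    # Same import, but using the installed direct connector's
    # fast path instead of generic JDBC.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table ORDERS \
      --target-dir /etl/staging/orders \
      --direct \
      --num-mappers 2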
Data Loading Mistake #3
Just copying files? This sounds too simple. We probably need some cool whizzbang tool.
— Famous last words
Result
Lessons:
• Copying files is a legitimate solution (see the sketch below)
• In general, simple is good
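A sketch of how plain that solution can be; the paths are made up:

    # Land raw log files in HDFS with the stock CLI -- no extra tooling.
    hdfs dfs -mkdir -p /etl/landing/weblogs
    hdfs dfs -put /var/log/app/access-20131101.log /etl/landing/weblogs/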
Transform
Endless Possibilities
• MapReduce
• Crunch / Cascading
• Spark
• Hive (i.e., SQL)
• Pig
• R
• Shell scripts
• Plain old Java
Data Processing Mistake #0
Data Processing Mistake #1
This system must be ready in 12 months. We have to convert 100 data sources and 5,000 transformations to Hadoop. Let's spend 2 days planning a schedule and budget for the entire year and then just go and implement it.
Prototype? Who needs that?
— Famous last words
Result
Lessons
• Take the learning curve into account
• You don't know what you don't know
• Hadoop will be difficult and frustrating for at least 3 months.
Data Processing Mistake #2
Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.
— Famous last words
Result:
Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard. Hard things are possible.
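To make that concrete: a grouped count that needs a mapper, a reducer, and a driver class in raw MapReduce is a single statement in Hive. A sketch against a hypothetical orders table:

    # One line of Hive replaces a page of MapReduce Java.
    hive -e "SELECT product_id, COUNT(*) AS cnt
             FROM orders
             GROUP BY product_id"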
Data Processing Mistake #3
I got 5,000 tiny XML files, and Hadoop is great at processing unstructured data. So I'll just leave the data like that and parse the XML in every job.
— Famous last words
Result
Lessons
1. Consolidate small files
2. Don't argue about #1
3. Convert files to easy-to-query formats (see the sketch below)
4. De-normalize
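A sketch of points 1 and 3 in one step, assuming the parsed XML already sits in a Hive table named orders_raw and a Hive version with native Parquet support: a single CREATE TABLE AS SELECT both compacts the many small files and rewrites them in a columnar format:

    # Rewrite thousands of tiny files as a handful of Parquet files
    # that every downstream job can query cheaply.
    hive -e "CREATE TABLE orders_parquet STORED AS PARQUET
             AS SELECT * FROM orders_raw"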
Data Processing Mistake #4
Partitions are for relational databases
— Famous last words
Result
Lessons
1. Without partitions, every query is a full table scan
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don't perform
4. Cheap storage allows you to store the same dataset, partitioned multiple ways.
5. Use partitions for fast data loading (see the sketch below)
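A sketch of a date-partitioned Hive table (all names hypothetical); queries that filter on the partition column read only the partitions they need, and each load writes a single new partition:

    # Each day's data lands in its own partition directory.
    hive -e "CREATE TABLE orders (order_id BIGINT, amount DOUBLE)
             PARTITIONED BY (order_date STRING)
             STORED AS PARQUET"

    # Loading one day touches one partition; querying one day reads one.
    hive -e "INSERT OVERWRITE TABLE orders
             PARTITION (order_date='20131101')
             SELECT order_id, amount FROM orders_staging
             WHERE load_date = '20131101'"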
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle connectors
• Just copy files
• Query Hadoop
Data Loading Mistake #1
All of the data must end up in a relational DWH.
— Famous last words
Result
Lessons:
• Use relational:
  • To maintain tool compatibility
  • For DWH enrichment
• Stay in Hadoop for:
  • Text search
  • Graph analysis
  • Reducing time in the pipeline
  • Big data & small network
  • A congested database
Data Loading Mistake #2
We used Sqoop to get data out of Oracle. Let's use Sqoop to get it back in.
— Famous last words
Result
Lesson
Use the Oracle direct connectors if you can afford them. They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
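For scale, a hedged sketch of the load-back step with plain Sqoop (connection details and table names are hypothetical); a direct connector typically replaces this row-by-row JDBC path with the database's bulk-load machinery, so check the connector's documentation for the exact invocation:

    # Export validated results from HDFS back into the DWH.
    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table DWH_ORDERS \
      --export-dir /etl/pharmacy/fraud/orders/validated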
Workflow Management
Tools
• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, cron
Workflow Mistake #1
Workflow management is easy. I'll just write a few scripts.
— Famous last words
"Writing a workflow engine is the software engineering equivalent of getting involved in a land war in Asia."
— Josh Wills
Lesson:
A workflow management tool should enable:
• Keeping track of metadata, components, and integrations
• Scheduling and orchestration
• Restarts and retries
• A cohesive system view
• Instrumentation, measurement, and monitoring
• Reporting
Workflow Mistake #2
Schema? This is Hadoop. Why would we need a schema?
— Famous last words
Result
Lesson
/user/…
  /user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
  /data/<biz unit>/<app>/<dataset>/<partition>
  /data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
  /etl/pharmacy/fraud/orders/validated
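A sketch of standing up that layout with the stock HDFS CLI, using the example paths from the slide:

    # Agreed-upon zones: /data for shared, query-ready datasets,
    # /etl for intermediate pipeline stages, /user for sandboxes.
    hdfs dfs -mkdir -p /data/pharmacy/fraud/orders/date=20131101
    hdfs dfs -mkdir -p /etl/pharmacy/fraud/orders/validated
    hdfs dfs -mkdir -p /user/gshapira/testdata/orders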
Workflow Mistake #3
Oozie was written for Hadoop, so the right solution will always use Oozie
— Famous last words
Result
Lessons:
• Oozie has advantages
• Use the tool that works for you
Hue + Oozie
"I hope that in this year to come, you make mistakes."
— Neil Gaiman
"Always make new mistakes."
— Esther Dyson
Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big data == business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don't need to learn Java
Beginner Projects
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • The complete works of Shakespeare
  • The MovieLens database
• Find the 10 most common words in Shakespeare (see the sketch below)
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• An actual use case: XML ingestion, ETL process, DWH history
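For instance, the Shakespeare exercise fits in one Hive query, assuming the text was loaded into a hypothetical one-column table shakespeare(line STRING):

    # Split each line into words, then count and rank them.
    hive -e "SELECT word, COUNT(*) AS freq
             FROM (SELECT explode(split(line, ' ')) AS word
                   FROM shakespeare) words
             GROUP BY word
             ORDER BY freq DESC
             LIMIT 10"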
Books
More Books