Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop
Gwen Shapira
@[email protected]
Coming soon to a bookstore near you…
• Hadoop Application Architectures
How to build end-to-end solutions using Apache Hadoop and related tools
@hadooparchbook
www.hadooparchitecturebook.com
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
• Many, many ways to process data in parallel at scale
The Ecosystem
• High-level languages and abstractions
• File, relational, and streaming data integration
• Process orchestration and scheduling
• Libraries for data wrangling
• Low-latency query language
Why ETL with Hadoop?
Data Has Changed in the Last 30 Years
[Figure: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices, and sophisticated machines. Structured data is now roughly 10% of the total; unstructured data is roughly 90%.]
Volume, Variety, Velocity Cause Problems
[Figure: the traditional pipeline. OLTP systems and enterprise applications feed Extract, Transform, and Load steps into the data warehouse, which serves queries for business intelligence. Three pain points:]
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive, and archived data can't provide value.
Got unstructured data?
• Traditional ETL:
  • Text
  • CSV
  • XLS
  • XML
• Hadoop:
  • HTML
  • XML, RSS
  • JSON
  • Apache logs
  • Avro, Protocol Buffers, ORC, Parquet
  • Compression
  • Office, OpenDocument, iWork
  • PDF, EPUB, RTF
  • MIDI, MP3
  • JPEG, TIFF
  • Java classes
  • mbox, RFC 822
  • AutoCAD
  • TrueType
  • HDF / NetCDF
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant, and scalable.

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
What I often see:
• ETL cluster
• ELT in DWH
• ETL in Hadoop
Moving your transformations from the DWH to Hadoop?
Let's do it right.
Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When is it not applicable?
Or at least, let's avoid the worst mistakes.
Extract
Let me count the ways
1. From databases: Sqoop
2. Log data: Flume
3. Copy data to HDFS
Data Loading Mistake #1
Hadoop is scalable. Let's run as many Sqoop mappers as possible, to get the data from our DB faster!
— Famous last words
Result:
Lesson:
• Start with 2 mappers and add more slowly (see the sketch below)
• Watch DB load and network utilization
• Use the FairScheduler to limit the number of mappers
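A minimal sketch of such a conservative import; the connection string, credentials, table, and paths here are all hypothetical placeholders:

    # Start with 2 mappers; raise --num-mappers gradually while
    # watching database load and network utilization.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table ORDERS \
      --target-dir /etl/staging/orders \
      --num-mappers 2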
Data Loading Mistake #2
Database-specific connectors are complicated and scary. Let's just use the default JDBC connector.
— Famous last words
Result:
Lesson:
1. There are dedicated connectors for Oracle, Netezza, and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is unclear
5. Follow the installation instructions
6. Use Sqoop with the connectors (see the sketch below)
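As a hedged sketch (same hypothetical connection details as before): once a database-specific direct connector is installed, recent Sqoop versions can route the job through it via the --direct flag; exact setup and behavior vary by vendor, so follow the connector's own documentation:

    # Same import, but using the installed direct connector's
    # fast path instead of generic JDBC.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table ORDERS \
      --target-dir /etl/staging/orders \
      --direct \
      --num-mappers 2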
Data Loading Mistake #3
Just copying files? This sounds too simple. We probably need some cool whizzbang tool.
— Famous last words
Result
Lessons:
• Copying files is a legitimate solution (see the sketch below)
• In general, simple is good
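A sketch of how plain that solution can be; the paths are made up:

    # Land raw log files in HDFS with the stock CLI -- no extra tooling.
    hdfs dfs -mkdir -p /etl/landing/weblogs
    hdfs dfs -put /var/log/app/access-20131101.log /etl/landing/weblogs/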
Transform
Endless Possibilities
• MapReduce
• Crunch / Cascading
• Spark
• Hive (i.e., SQL)
• Pig
• R
• Shell scripts
• Plain old Java
Data Processing Mistake #0
Data Processing Mistake #1
This system must be ready in 12 months. We have to convert 100 data sources and 5,000 transformations to Hadoop. Let's spend 2 days planning a schedule and budget for the entire year and then just go and implement it.
Prototype? Who needs that?
— Famous last words
Result
Lessons
• Take the learning curve into account
• You don't know what you don't know
• Hadoop will be difficult and frustrating for at least 3 months.
Data Processing Mistake #2
Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.
— Famous last words
Result:
Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard. Hard things are possible.
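To make that concrete: a grouped count that needs a mapper, a reducer, and a driver class in raw MapReduce is a single statement in Hive. A sketch against a hypothetical orders table:

    # One line of Hive replaces a page of MapReduce Java.
    hive -e "SELECT product_id, COUNT(*) AS cnt
             FROM orders
             GROUP BY product_id"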
Data Processing Mistake #3
I got 5,000 tiny XML files, and Hadoop is great at processing unstructured data. So I'll just leave the data like that and parse the XML in every job.
— Famous last words
Result
Lessons
1. Consolidate small files
2. Don't argue about #1
3. Convert files to easy-to-query formats (see the sketch below)
4. De-normalize
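A sketch of points 1 and 3 in one step, assuming the parsed XML already sits in a Hive table named orders_raw and a Hive version with native Parquet support: a single CREATE TABLE AS SELECT both compacts the many small files and rewrites them in a columnar format:

    # Rewrite thousands of tiny files as a handful of Parquet files
    # that every downstream job can query cheaply.
    hive -e "CREATE TABLE orders_parquet STORED AS PARQUET
             AS SELECT * FROM orders_raw"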
Data Processing Mistake #4
Partitions are for relational databases
— Famous last words
Result
Lessons
1. Without partitions, every query is a full table scan
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don't perform
4. Cheap storage allows you to store the same dataset, partitioned multiple ways.
5. Use partitions for fast data loading (see the sketch below)
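A sketch of a date-partitioned Hive table (all names hypothetical); queries that filter on the partition column read only the partitions they need, and each load writes a single new partition:

    # Each day's data lands in its own partition directory.
    hive -e "CREATE TABLE orders (order_id BIGINT, amount DOUBLE)
             PARTITIONED BY (order_date STRING)
             STORED AS PARQUET"

    # Loading one day touches one partition; querying one day reads one.
    hive -e "INSERT OVERWRITE TABLE orders
             PARTITION (order_date='20131101')
             SELECT order_id, amount FROM orders_staging
             WHERE load_date = '20131101'"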
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle connectors
• Just copy files
• Query Hadoop
Data Loading Mistake #1
All of the data must end up in a relational DWH.
— Famous last words
Result
Lessons:
• Use relational:
  • To maintain tool compatibility
  • For DWH enrichment
• Stay in Hadoop for:
  • Text search
  • Graph analysis
  • Reducing time in the pipeline
  • Big data & small network
  • A congested database
Data Loading Mistake #2
We used Sqoop to get data out of Oracle. Let's use Sqoop to get it back in.
— Famous last words
Result
Lesson
Use the Oracle direct connectors if you can afford them. They:
1. Are faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
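For scale, a hedged sketch of the load-back step with plain Sqoop (connection details and table names are hypothetical); a direct connector typically replaces this row-by-row JDBC path with the database's bulk-load machinery, so check the connector's documentation for the exact invocation:

    # Export validated results from HDFS back into the DWH.
    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.db-password \
      --table DWH_ORDERS \
      --export-dir /etl/pharmacy/fraud/orders/validated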
Workflow Management
Tools
• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, cron
Workflow Mistake #1
Workflow management is easy. I'll just write a few scripts.
— Famous last words
"Writing a workflow engine is the software engineering equivalent of getting involved in a land war in Asia."
— Josh Wills
Lesson:
A workflow management tool should enable:
• Keeping track of metadata, components, and integrations
• Scheduling and orchestration
• Restarts and retries
• A cohesive system view
• Instrumentation, measurement, and monitoring
• Reporting
Workflow Mistake #2
Schema? This is Hadoop. Why would we need a schema?
— Famous last words
Result
Lesson
/user/…
  /user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
  /data/<biz unit>/<app>/<dataset>/<partition>
  /data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
  /etl/pharmacy/fraud/orders/validated
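A sketch of standing up that layout with the stock HDFS CLI, using the example paths from the slide:

    # Agreed-upon zones: /data for shared, query-ready datasets,
    # /etl for intermediate pipeline stages, /user for sandboxes.
    hdfs dfs -mkdir -p /data/pharmacy/fraud/orders/date=20131101
    hdfs dfs -mkdir -p /etl/pharmacy/fraud/orders/validated
    hdfs dfs -mkdir -p /user/gshapira/testdata/orders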
Workflow Mistake #3
Oozie was written for Hadoop, so the right solution will always use Oozie
— Famous last words
Result
Lessons:
• Oozie has advantages
• Use the tool that works for you
Hue + Oozie
"I hope that in this year to come, you make mistakes."
— Neil Gaiman
"Always make new mistakes."
— Esther Dyson
Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by the DWH team
• Big data == business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don't need to learn Java
Beginner Projects
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • The complete works of Shakespeare
  • The MovieLens database
• Find the 10 most common words in Shakespeare (see the sketch below)
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• An actual use case: XML ingestion, ETL process, DWH history
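For instance, the Shakespeare exercise fits in one Hive query, assuming the text was loaded into a hypothetical one-column table shakespeare(line STRING):

    # Split each line into words, then count and rank them.
    hive -e "SELECT word, COUNT(*) AS freq
             FROM (SELECT explode(split(line, ' ')) AS word
                   FROM shakespeare) words
             GROUP BY word
             ORDER BY freq DESC
             LIMIT 10"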
Books
More Books