Data Management on Hadoop at Yahoo!

Data Management on Hadoop @ Y! Seetharam Venkatesh ([email protected]) Hadoop Data Infrastructure Lead/Architect

Upload: venkatesh-seetharam

Post on 24-May-2015


DESCRIPTION

Data Management is a suite of services that manage the lifecycle of data on the Hadoop clusters.

TRANSCRIPT

Page 1: Data Management on Hadoop at Yahoo!

Data Management on Hadoop @ Y!

Seetharam Venkatesh ([email protected])

Hadoop Data Infrastructure Lead/Architect

Page 2: Data Management on Hadoop at Yahoo!

Agenda

1 Introduction

2 Challenging Data Landscape

3 The Solution

4 Future opportunities

Page 3: Data Management on Hadoop at Yahoo!

Introduction

Replication

Data Export

Anonymization

Retention

Archival

Acquisition

Page 4: Data Management on Hadoop at Yahoo!

Why is Data Management Critical?

Page 5: Data Management on Hadoop at Yahoo!

Challenging Data Landscape

[Diagram labels: Data Warehouse, Database, NAS, Hadoop Clusters, Data center, Tape]

Steady growth in data movement volumes/day

SLA requirements (Minutes to day)

BCP requirements (Hot-Hot, Hot-Warm)

Feeds with varying periodicity (Minutes to Monthly)

Page 6: Data Management on Hadoop at Yahoo!

Data Acquisition

Challenge: Steady growth in data volumes
Solution: Heavy lifting delegated to map-only jobs

Challenge: Diverse interfaces, API contracts
Solution: Pluggable interfaces, adaptors for specific APIs

Challenge: Data sources have diverse serving capacity
Solution: Throttling enables uniform load on the data source

Challenge: Data source isolation
Solution: Asynchronous scheduling, progress monitored per source

Challenge: Varying data formats, file sizes and long tails, failures
Solution: Conversion as map-reduce job; coalesce, chunking, checkpoint

Challenge: Data quality
Solution: Pluggable validations

Challenge: BCP
Solution: Supports Hot-Hot, Hot-Warm
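The slide does not say how throttling is implemented. One common way to keep load on a data source uniform is a token bucket, sketched below under that assumption. The class name `SourceThrottle`, its parameters, and the injectable `clock` are all hypothetical and are not taken from the Yahoo! system.

```python
import time


class SourceThrottle:
    """Token-bucket limiter, one instance per data source (hypothetical design)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.capacity = float(burst)      # maximum tokens that can accumulate
        self.tokens = float(burst)        # start with a full bucket
        self.clock = clock                # injectable so the limiter is testable
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Return True if a request of the given cost may be sent to the source now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A fetch task would call `try_acquire()` before each request and back off when it returns False, so a slow source is never hit faster than its configured rate regardless of how many feeds pull from it.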

Page 7: Data Management on Hadoop at Yahoo!

Data Replication

Challenge: Steady growth in data volumes
Solution: Heavy lifting delegated to map-only jobs (DistCp v2)

Challenge: Cluster proximity, availability
Solution: Tree of copies with at most one cross-datacenter copy

Challenge: Long tails
Solution: Dynamic split assignment, each map picks up only one file at a time (DistCp v2)

Challenge: Data Export
Solution: Export as replication target (push); ad hoc uses HDFS Proxy (pull)
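The "tree of copies" idea can be illustrated with a small planner. This is a sketch only: it assumes "at most one cross-datacenter copy" means one cross-datacenter transfer per remote datacenter, with the first cluster reached there acting as the local source for its neighbors. The function name `plan_copy_tree` and the cluster and datacenter labels are hypothetical.

```python
def plan_copy_tree(source, targets, datacenter_of):
    """Return (from_cluster, to_cluster) copy edges forming a tree rooted at source.

    Assumption: each remote datacenter receives exactly one cross-datacenter
    copy; remaining clusters there copy from that first local replica.
    """
    # Group targets by datacenter, preserving the order they were given in.
    by_dc = {}
    for cluster in targets:
        by_dc.setdefault(datacenter_of[cluster], []).append(cluster)

    edges = []
    source_dc = datacenter_of[source]
    for dc, clusters in by_dc.items():
        if dc == source_dc:
            # Same datacenter as the source: copy directly.
            edges.extend((source, c) for c in clusters)
        else:
            gateway, rest = clusters[0], clusters[1:]
            edges.append((source, gateway))            # the single cross-datacenter edge
            edges.extend((gateway, c) for c in rest)   # local fan-out inside the remote datacenter
    return edges
```

For a source `a1` in `dc1` and targets `a2` (in `dc1`) plus `b1`, `b2`, `b3` (in `dc2`), the plan copies `a1` to `a2` and `a1` to `b1`, then `b1` to `b2` and `b1` to `b3`, so only one transfer crosses datacenters.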

Page 8: Data Management on Hadoop at Yahoo!

Data Lifecycle Management

Challenge: Aging data expires
Solution: Retention to remove old data (as required for legal compliance and for capacity purposes)

Challenge: Data privacy
Solution: Anonymization of personally identifiable information

Challenge: SOX compliance & audit
Solution: Archival/restoration to/from tape (13 months)

Challenge: SEC compliance & audit
Solution: Archival/restoration to/from tape (7 years)
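Retention can be pictured as a periodic sweep that selects date partitions older than a per-feed window. The sketch below assumes feeds are partitioned by calendar date and that the window is expressed in days; neither detail is stated on the slide, and the function name `expired_partitions` is hypothetical.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, retention_days, today):
    """Return the partition dates that fall outside the retention window.

    A partition dated exactly `retention_days` ago is still retained;
    only strictly older partitions are selected for removal.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)
```

The same selection could feed an archival step before deletion when a feed is subject to the tape archival rules listed above.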

Page 9: Data Management on Hadoop at Yahoo!

Operability, Manageability

Challenge: Monitor and administer data loading across clusters, colos
Solution: Central dashboard for monitoring and administration; integrated view of jobs running across clusters, colos

Challenge: Interoperability across incompatible Hadoop versions
Solution: Support various Hadoop versions using Reverse Class loader; one data loading instance per colo that can work across clusters

Challenge: Maintenance windows, failures
Solution: Partial copy + auto resume

Challenge: System shutdown
Solution: Automatic resume upon restart

SLA management + introspection via metrics
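"Partial copy + auto resume" suggests that completed work is recorded durably so a restart skips it. A minimal sketch of that pattern follows, assuming a per-feed checkpoint file holding the names of files already copied. `CopyCheckpoint`, `run_copy`, and the JSON layout are hypothetical and not the actual mechanism used at Yahoo!.

```python
import json
import os


class CopyCheckpoint:
    """Durable record of files already copied (hypothetical JSON layout)."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path) as f:
                self.done = set(json.load(f))

    def mark_done(self, name):
        self.done.add(name)
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp, self.path)  # atomic rename, so a crash never leaves a torn file


def run_copy(files, checkpoint, copy_one):
    """Copy every file not yet checkpointed; return the names copied in this run."""
    copied = []
    for name in files:
        if name in checkpoint.done:
            continue                # finished in an earlier run, skip on resume
        copy_one(name)              # may raise; the checkpoint is only written on success
        checkpoint.mark_done(name)
        copied.append(name)
    return copied
```

After a failure or shutdown, constructing a new `CopyCheckpoint` on the same path and calling `run_copy` again continues from the first file that was not completed.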

Page 10: Data Management on Hadoop at Yahoo!

Architecture

Page 11: Data Management on Hadoop at Yahoo!

Highlights

• “Workflows” abstraction over MR jobs
  • More workflows than Oozie within Y!
  • Amounts to >30% of jobs launched on the clusters
  • Occupies less than 10% of cluster capacity (slots)

• Solves recurring batch data transfers
  • 2300+ feeds with varying periodicity (5m to monthly)
  • 100+ TB/day of data movement

• SLAs
  • Central dashboard
  • SLA monitoring with ETA on feeds
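The slides do not describe how the ETA on a feed is computed. One simple possibility, shown here purely as an assumption, is to extrapolate from the average transfer rate so far and flag a feed whose projected finish time exceeds its SLA. The function names and the byte-based progress measure are hypothetical.

```python
def eta_seconds(bytes_total, bytes_done, elapsed_seconds):
    """Estimate seconds remaining, assuming the average rate so far continues."""
    if bytes_done <= 0 or elapsed_seconds <= 0:
        return None  # no progress yet, so there is no rate to extrapolate from
    rate = bytes_done / elapsed_seconds
    return (bytes_total - bytes_done) / rate


def sla_at_risk(bytes_total, bytes_done, elapsed_seconds, sla_seconds):
    """Return True if the feed is projected to finish after its SLA deadline."""
    eta = eta_seconds(bytes_total, bytes_done, elapsed_seconds)
    if eta is None:
        # Without an estimate, only report a miss once the SLA has already passed.
        return elapsed_seconds >= sla_seconds
    return elapsed_seconds + eta > sla_seconds
```

A dashboard could evaluate `sla_at_risk` for each running feed and surface the ones that return True.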

Page 12: Data Management on Hadoop at Yahoo!

Highlights

Page 13: Data Management on Hadoop at Yahoo!

Future

HCatalog/Oozie integration

Self service

Support for Event-level data

Storage Efficiency

Page 14: Data Management on Hadoop at Yahoo!

Data Management on Hadoop @ Y!

Thank You

Q&A