Distcp-ng: Replicating massive datasets with Gobblin. Issac Buenrostro, Gobblin Meetup, Jun 2016.


TRANSCRIPT

Page 1: Distcp gobblin

Distcp-ng: Replicating massive datasets with Gobblin

Issac Buenrostro, Gobblin Meetup, Jun 2016

Page 2: Distcp gobblin

Outline

1 Motivation

2 Architecture

3 Features

4 Hive Copy

5 Future

Page 3: Distcp gobblin

Motivation

Page 4: Distcp gobblin

What is Distcp?

Distributed Copy: copy files between Hadoop-compatible file systems.

Page 5: Distcp gobblin

Motivation

1. Continuous replication of datasets.

2. Efficient file listing.

• Reduce file system RPC calls.
• Make alternate listing services first-class citizens.

3. Dataset awareness: prioritization, notification, etc.

4. Failure isolation.

5. Operational metrics, notifications, data availability triggers.

6. Portability.

Page 6: Distcp gobblin

Architecture

Page 7: Distcp gobblin

Distcp Gobblin Architecture

Page 8: Distcp gobblin

Copyable Dataset- Basic

Page 9: Distcp gobblin

Copy Entities - Advanced

Pre / post publish steps: copy entities that run before or after publish (they are a NOOP in the task).

Page 10: Distcp gobblin

Features

Page 11: Distcp gobblin

Recursive Copy

Most similar to distcp

Copy all files under an input path

Accepts path filter
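The recursive-copy listing described above can be sketched in a few lines of Python. This is a minimal illustration, not the Gobblin implementation; the function name and glob-style filter are assumptions for the example.

```python
import fnmatch
from pathlib import Path

def list_copyable_files(root, path_filter="*"):
    """Recursively list files under `root` whose path relative to `root`
    matches `path_filter`, mirroring a recursive copy's file listing."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and fnmatch.fnmatch(p.relative_to(root).as_posix(), path_filter)
    )
```

A run with `path_filter="*.avro"` would keep only Avro files anywhere under the input path, which is the role the path filter plays on this slide.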

Page 12: Distcp gobblin

Features

Source → Converter → Target

Supported endpoints:

• Hadoop file system
• SFTP
• Apache server filer
• Hive

Byte-level stream transformations:

• Encrypt / Decrypt

• (Un) Gzip

• Untar

Atomic publishing

Data availability notification

Hive Registration

File deletion / sync
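The byte-level stream transformations above (encrypt/decrypt, gzip, untar) can be pictured as wrappers chained over an input stream, each consuming bytes from the previous one without buffering whole files. A minimal sketch, with a gunzip transform standing in for the "(Un) Gzip" step; the function names here are illustrative, not Gobblin API:

```python
import gzip
import io

def transform_stream(stream, transforms):
    """Apply byte-level transformations in order, each one wrapping the
    stream produced by the previous transform."""
    for wrap in transforms:
        stream = wrap(stream)
    return stream

def ungzip(stream):
    """Transparent gunzip over a file-like byte stream."""
    return gzip.GzipFile(fileobj=stream)
```

A decrypt step would be another `wrap` callable in the same list, so transforms compose in whatever order the copy route needs.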

Page 13: Distcp gobblin

File Sets

The file set is distcp's atomic unit; a single dataset can be split into multiple file sets.

1. All-or-nothing publish*

2. Isolation: failed file set does not affect other file sets

3. Event emitted on publish per file and file set

* Best-effort. Future: use a write-ahead log for a stronger guarantee.
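The two guarantees above, best-effort all-or-nothing publish and per-file-set isolation, can be sketched as a loop that stages every file in a set before publishing it, and records a failure without aborting the other sets. This is an illustrative skeleton, not Gobblin's publisher; the callback names are assumptions.

```python
def publish_file_sets(file_sets, copy_one, publish):
    """Copy and publish each file set independently.

    A failure anywhere in a file set marks only that set as failed
    (isolation), and `publish` is called only after every file in the
    set has been staged (best-effort all-or-nothing)."""
    results = {}
    for name, files in file_sets.items():
        try:
            staged = [copy_one(f) for f in files]  # stage all files first
            publish(name, staged)                  # then publish the whole set
            results[name] = "published"
        except Exception:
            results[name] = "failed"               # other sets keep going
    return results
```

An event per file and per file set, as the slide notes, would be emitted inside `publish`.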

Page 14: Distcp gobblin

Smart file limits

Limit the number of files copied in a single run

1. File sets are never split

2. Soft limit: stop processing new file sets, currently running file sets can finish

3. Hard limit: do not accept any more files

4. Prioritize file sets (Future)
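One way to read the soft/hard limit rules above is as a planner that admits whole file sets (they are never split): once the running total reaches the soft limit no new set is started, and a set that would push the total past the hard limit is never admitted. A minimal sketch under that interpretation; the function and its counting unit (files per set) are assumptions:

```python
def select_file_sets(file_sets, soft_limit, hard_limit):
    """Pick whole file sets for one run from (name, file_count) pairs.

    Soft limit: stop admitting new file sets once the total reaches it.
    Hard limit: never admit a set that would exceed it."""
    chosen, total = [], 0
    for name, size in file_sets:
        if total >= soft_limit:         # soft limit reached: no new sets
            break
        if total + size > hard_limit:   # would blow the hard limit: skip
            continue
        chosen.append(name)
        total += size
    return chosen, total
```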

Page 15: Distcp gobblin

Unpublished File Persistence

1. Files that were copied successfully but not published (file set failure, permission failure, etc.) are persisted in a private directory.

2. A future run identifies persisted files and reuses them instead of re-copying.

3. Time-based automatic retention on the persist directory.
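The reuse path in step 2 can be sketched as a check against the persist directory before copying. Keying on the file name alone is a simplification for the example; the function name is an assumption, not Gobblin API.

```python
import shutil
from pathlib import Path

def copy_with_persist(src, dst, persist_dir):
    """If a previous failed run left this file in the persist directory,
    move it into place instead of re-copying the bytes."""
    persisted = Path(persist_dir) / Path(src).name  # simplified key: file name
    if persisted.exists():
        shutil.move(str(persisted), dst)            # reuse staged bytes
        return "reused"
    shutil.copy(src, dst)                           # normal copy path
    return "copied"
```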

Page 16: Distcp gobblin

Hive Copy

Page 17: Distcp gobblin

Hive Copy

Copy Hive tables between Hive metastores

1. Determine files under each table / partition

2. Diff files in source / target

3. Copy necessary files

4. Register tables / partitions on target

5. Deregister partitions missing in source

6. (Optional) Delete files for deregistered partitions
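Step 2 of the flow above, diffing files between source and target, can be sketched as a comparison of per-partition listings keyed by path with (size, mtime) metadata. The listing shape is an assumption for illustration:

```python
def diff_partition(source_files, target_files):
    """Compare {path: (size, mtime)} listings for one table/partition.

    Returns (to_copy, to_delete): files missing or stale on the target,
    and target files no longer present in the source."""
    to_copy = [p for p, meta in source_files.items()
               if target_files.get(p) != meta]
    to_delete = [p for p in target_files if p not in source_files]
    return sorted(to_copy), sorted(to_delete)
```

Steps 3 to 6 then act on these two lists: copy `to_copy`, register the touched partitions, and deregister (optionally deleting) what `to_delete` implies.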

Page 18: Distcp gobblin

Hive Copy Configuration

job.name=distcpNgExample

# Source and target metastores
hive.dataset.hive.metastore.uri=thrift://mysource.hive:9000
hive.dataset.copy.target.metastore.uri=thrift://mytarget:9000

# Preserve attributes
gobblin.copy.preserved.attributes=rgbp

# Databases and tables to copy
hive.dataset.whitelist=events.loginEvent|logoutEvent,metrics

# Skip hidden paths
hive.dataset.copy.locations.listing.skipHiddenPaths=true

# Use registration time to determine whether a partition should be skipped
hive.dataset.copy.fast.partition.skip.predicate=gobblin.data.management.copy.predicates.RegistrationTimeSkipPredicate

# Partition filter
hive.dataset.copy.partition.filter.generator=gobblin.data.management.copy.hive.filter.LookbackPartitionFilterGenerator
hive.dataset.partition.filter.datetime.column=datepartition
hive.dataset.partition.filter.datetime.lookback=P7D
hive.dataset.partition.filter.datetime.format=YYYY-MM-dd-HH
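The lookback filter configured above (column `datepartition`, lookback `P7D`, Joda-style format `YYYY-MM-dd-HH`) can be sketched as a predicate on the partition value. This is an illustration of the filter's behavior, not the `LookbackPartitionFilterGenerator` code; the Joda pattern `YYYY-MM-dd-HH` is mapped to Python's `%Y-%m-%d-%H` here.

```python
from datetime import datetime, timedelta

def within_lookback(partition_value, lookback_days, now, fmt="%Y-%m-%d-%H"):
    """Keep a partition (e.g. datepartition=2016-06-01-00) only if its
    timestamp falls inside the lookback window ending at `now`."""
    ts = datetime.strptime(partition_value, fmt)
    return ts >= now - timedelta(days=lookback_days)
```

With a 7-day lookback this reproduces the `P7D` setting: only the last week of hourly partitions is considered for copy.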

Page 19: Distcp gobblin

Hive Copy

Candidate files: existing files at the expected target location.

• Different location
• Schema incompatible
• …

Page 20: Distcp gobblin

Hive Copy - Numbers

100+ tables
3000+ partitions
20,000+ new files per hour
2TB+ new data per hour

File listing, 30k files: < 30 s
Copy of 30k files, 5TB: ~20 min

Page 21: Distcp gobblin

Current bottlenecks

Work unit serialization
• ~100 work units / second

Bad nodes in the Hadoop cluster
• Need speculative execution

Serial publishing of file sets
• Solution in progress

Page 22: Distcp gobblin

Gobblin Distcp vs ReAir

ReAir: Hive warehouse data replication from Airbnb; offers batch and incremental replication.

Gobblin Distcp vs. ReAir:

• Incremental changes: Gobblin Distcp uses file listing and modification times; ReAir uses a MySQL store and an audit log hook.
• Execution: Gobblin Distcp runs as a portable Gobblin job (MR, thread based, Helix); ReAir runs as an MR job.
• Scope: the same Gobblin framework can copy non-Hive data.
• Monitoring / Web UI: in progress for Gobblin.

Page 23: Distcp gobblin

Future

Page 24: Distcp gobblin

Distcp continuous service

Page 25: Distcp gobblin

Next Steps

1 Simple CLI launcher

2 Dataset / file set prioritization

3 Global network throttling

4 Large file splitting

5 Least-congested path optimization

Page 26: Distcp gobblin

Find out more:

©2015 LinkedIn Corporation. All Rights Reserved.

Gobblin Distcp