Distcp-ng: Replicating massive datasets with Gobblin
Issac Buenrostro
Gobblin Meetup, Jun 2016
Outline
1 Motivation
2 Architecture
3 Features
4 Hive Copy
5 Future
Motivation
What is Distcp?
Distributed Copy: copy files between Hadoop-compatible file systems.
Motivation
1.Continuous replication of datasets.
2.Efficient file listing.
• Reduce file system RPC calls
• Alternate listing services as first-class citizens
3.Dataset awareness: prioritization, notification, etc.
4.Failure isolation.
5.Operational metrics, notifications, data availability triggers.
6.Portability.
Architecture
Distcp Gobblin Architecture
Copyable Dataset- Basic
Copy Entities - Advanced
Pre / Post publish steps
Copy Entities – Run before or after publish (NOOP in Task)
Features
Recursive Copy
Most similar to distcp
Copy all files under an input path
Accepts path filter
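The recursive copy described above can be sketched on a local file system as follows. This is only an illustration of "copy all files under an input path, subject to a path filter"; the function name `recursive_copy` and the glob-style filter are assumptions and not part of Gobblin.

```python
import fnmatch
import os
import shutil
import tempfile


def recursive_copy(src_root, dst_root, path_filter="*"):
    """Copy every file under src_root whose relative path matches path_filter."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            if not fnmatch.fnmatch(rel, path_filter):
                continue  # rejected by the path filter
            dst = os.path.join(dst_root, rel)
            # Recreate the directory layout of the source under the target.
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel)
    return sorted(copied)
```

For example, `recursive_copy(src, dst, "*.avro")` would copy only the Avro files while preserving their relative paths.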
Features
Source → Converter → Target
Sources:
• Hadoop File System
• SFTP
• Apache server filer
• Hive
• …
Byte-level stream transformations:
• Encrypt / Decrypt
• (Un) Gzip
• Untar
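A byte-level stream transformation can be pictured as a wrapper around the input stream that changes the bytes as they are copied. The sketch below shows the "(Un) Gzip" case only; `copy_stream` and `gunzip` are hypothetical names chosen for illustration, not Gobblin converter classes.

```python
import gzip
import io
import shutil


def copy_stream(source, target, transform=None):
    """Copy bytes from source to target, optionally wrapping source in a transform."""
    reader = transform(source) if transform else source
    shutil.copyfileobj(reader, target)


def gunzip(stream):
    # Hypothetical converter: decompress bytes as they are read from the stream.
    return gzip.GzipFile(fileobj=stream, mode="rb")
```

Because the transform wraps the stream instead of the file, the copy never needs to hold the whole file in memory.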
Atomic publishing
Data availability notification
Hive Registration
File deletion / sync
File Sets
The atomic unit of Distcp. A single dataset can be split into multiple file sets.
1. All-or-nothing publish*
2. Isolation: failed file set does not affect other file sets
3. Event emitted on publish per file and file set
* best-effort. Future: use write-ahead log for better guarantee.
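The file set semantics above (all-or-nothing publish, isolation between file sets) can be sketched with a staging directory that is renamed into place only when every file in the set was copied. This is a simplified local-file-system illustration; `publish_file_sets` and the `copy_file` callback are assumed names, and event emission is left out.

```python
import os
import shutil
import tempfile


def publish_file_sets(file_sets, staging_dir, final_dir, copy_file):
    """Copy each file set to staging; publish it only if every file copied."""
    results = {}
    for name, files in file_sets.items():
        stage = os.path.join(staging_dir, name)
        os.makedirs(stage, exist_ok=True)
        try:
            for f in files:
                copy_file(f, os.path.join(stage, os.path.basename(f)))
        except Exception as err:
            # Isolation: record the failure and move on to the next file set.
            results[name] = f"failed: {err}"
            continue
        os.makedirs(final_dir, exist_ok=True)
        # Best-effort all-or-nothing: one rename publishes the whole set.
        os.rename(stage, os.path.join(final_dir, name))
        results[name] = "published"
    return results
```

A failed file set leaves nothing in the final directory, while the other file sets are published normally.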
Smart file limits
Limit the number of files copied in a single run
1. File sets are never split
2. Soft limit: stop processing new file sets, currently running file sets can finish
3. Hard limit: do not accept any more files
4. Prioritize file sets (Future)
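One possible reading of the soft and hard limits is sketched below. It assumes that file sets are considered in order, that reaching the soft limit stops new file sets from being started, and that a file set which would push the total past the hard limit is skipped as a whole. The function name and this exact interpretation are assumptions, not a description of Gobblin's implementation.

```python
def select_file_sets(file_sets, soft_limit, hard_limit):
    """Pick whole file sets for this run; file sets are never split.

    file_sets is a list of (name, file_count) pairs.
    """
    selected, total = [], 0
    for name, file_count in file_sets:
        if total >= soft_limit:
            break  # soft limit reached: do not start new file sets
        if total + file_count > hard_limit:
            continue  # would exceed the hard limit: skip the whole set
        selected.append(name)
        total += file_count
    return selected, total
```

With file sets of 40, 70, 30, and 10 files, a soft limit of 60, and a hard limit of 100, this sketch selects the first and third sets (70 files in total).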
Unpublished File Persistence
1. Files that were copied successfully but not published are persisted in a private directory (file set failure, permission failure, etc.).
2. A future run identifies persisted files and reuses them instead of re-copying.
3. Time-based automatic retention on persist directory.
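The persistence behavior can be illustrated with a small sketch: before copying, look for a previously persisted copy, reuse it if it is still within the retention window, and otherwise copy again. Keying persisted files by base name and using the modification time for retention are simplifying assumptions made for this example only.

```python
import os
import shutil
import tempfile
import time


def copy_with_persistence(src, staging_path, persist_dir, retention_seconds, now=None):
    """Reuse a previously persisted copy of src when one exists and is fresh."""
    now = time.time() if now is None else now
    persisted = os.path.join(persist_dir, os.path.basename(src))
    if os.path.exists(persisted):
        if now - os.path.getmtime(persisted) <= retention_seconds:
            # Reuse the earlier copy instead of transferring the file again.
            shutil.move(persisted, staging_path)
            return "reused"
        os.remove(persisted)  # retention expired: discard the stale copy
    shutil.copy(src, staging_path)
    return "copied"
```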
Hive Copy
Hive Copy
Copy Hive tables between Hive metastores
1. Determine files under each table / partition
2. Diff files in source / target
3. Copy necessary files
4. Register tables / partitions on target
5. Deregister partitions missing in source
6. (Optional) Delete files for deregistered partitions
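Steps 2 through 5 amount to a diff between the source and target listings. The sketch below plans that diff over plain dictionaries; the data shape (`{partition: {file_path: modification_time}}`) and the function name `plan_hive_copy` are assumptions for illustration, and no metastore calls are made.

```python
def plan_hive_copy(source, target):
    """Diff two listings shaped as {partition: {file_path: modification_time}}."""
    to_copy = {}
    for partition, files in source.items():
        existing = target.get(partition, {})
        # A file needs copying if it is missing or its modification time differs.
        changed = sorted(p for p, mtime in files.items() if existing.get(p) != mtime)
        if changed:
            to_copy[partition] = changed
    to_register = sorted(set(source) - set(target))    # new on the source
    to_deregister = sorted(set(target) - set(source))  # missing in the source
    return to_copy, to_register, to_deregister
```

Partitions that exist on both sides with identical files produce no work, which is what keeps repeated runs cheap.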
Hive Copy Configuration
job.name=distcpNgExample
# Source and target metastores
hive.dataset.hive.metastore.uri=thrift://mysource.hive:9000
hive.dataset.copy.target.metastore.uri=thrift://mytarget:9000
# Preserve attributes
gobblin.copy.preserved.attributes=rgbp
# Databases and tables to copy
hive.dataset.whitelist=events.loginEvent|logoutEvent,metrics
# Skip hidden paths
hive.dataset.copy.locations.listing.skipHiddenPaths=true
# Use registration time to determine whether a partition should be skipped
hive.dataset.copy.fast.partition.skip.predicate=gobblin.data.management.copy.predicates.RegistrationTimeSkipPredicate
# Partition filter
hive.dataset.copy.partition.filter.generator=gobblin.data.management.copy.hive.filter.LookbackPartitionFilterGenerator
hive.dataset.partition.filter.datetime.column=datepartition
hive.dataset.partition.filter.datetime.lookback=P7D
hive.dataset.partition.filter.datetime.format=YYYY-MM-dd-HH
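To make the lookback settings concrete, the sketch below builds a partition filter string from a column name, a lookback in days, and a date format. It assumes the filter takes the form `column >= "formatted date"`, and it uses Python's `strftime` pattern `%Y-%m-%d-%H` as a stand-in for the `YYYY-MM-dd-HH` format in the configuration. The function name is hypothetical and this is not the Gobblin class itself.

```python
from datetime import datetime, timedelta


def lookback_partition_filter(column, lookback_days, fmt, now):
    """Build a filter keeping partitions at or after (now - lookback)."""
    limit = now - timedelta(days=lookback_days)
    return f'{column} >= "{limit.strftime(fmt)}"'
```

With `now = datetime(2016, 6, 15, 12)` and a 7-day lookback (the `P7D` value), the result is `datepartition >= "2016-06-08-12"`.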
Hive Copy
Candidate Files
Existing files at expected target location.
• Different location
• Schema incompatible
• …
Hive Copy - Numbers
100+ tables
3000+ partitions
20,000+ new files per hour
2TB+ new data per hour
File listing 30k files: < 30s
Copy 30k files, 5TB: ~20 min
Current bottlenecks
Work unit serialization
• ~100 work units / second
Bad nodes in Hadoop cluster
• Need speculation
Serial publishing of file sets
• Solution in progress
Gobblin Distcp vs ReAir
ReAir: Hive warehouse data replication (Airbnb)
Offers batch and incremental replication
Gobblin Distcp:
• File listing and modification times for incremental changes
• Portable Gobblin job (MR, thread based, Helix)
• Same framework can copy non-Hive data
ReAir:
• MySQL and audit log hook store for incremental changes
• MR job
• Monitoring / Web UI (in progress for Gobblin)
Future
Distcp continuous service
Next Steps
1 Simple CLI launcher
2 Dataset / file set prioritization
3 Global network throttling
4 Large file splitting
5 Least-congested path optimization
Find out more:
Gobblin Distcp