Jan 2013 HUG: DistCp v2 (2013-01-16)


DistCp (v.2) and the DynamicInputFormat

Mithun RK (mithunr@yahoo-inc.com)

2013-01-16

Me

- Yahoo: HCatalog, Hive, GDM
- Firmware Engineer at Hewlett Packard
- Fluent Hindi
- Gold medal at the nationals, last year


Prelude

“Legacy” DistCp

- Inter-cluster file copy using Map/Reduce
- Command-line:

hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target

Algorithm:

1. Enumerate the source files:
   for (FileStatus f : FileSystem.globStatus(sourcePath)) { recurse(f); }

2. Write the listing to ~/_distCp_WIP_201301161600/file.list

3. InputSplit calculation (see the sketch below):
   1. Divide the paths into 'm' groups ("splits"), one per mapper.
   2. Total file size in each split is roughly equal to the others'.

4. Launch the MR job:
   1. Each Map task copies the files specified in its InputSplit.

Source: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/tools/org/apache/hadoop/tools/DistCp.java
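
Step 3 boils down to bin-packing the listed files into 'm' groups of roughly equal total size. Below is a minimal Java sketch of that idea, assuming a greedy "add to the smallest split" strategy and a made-up FileEntry holder; it is illustrative only, not the actual DistCp source.

// Illustrative sketch (not the actual DistCp code): divide a file listing into
// 'm' splits whose total byte-counts are roughly equal, by always assigning the
// next file to the split with the least data so far.
import java.util.ArrayList;
import java.util.List;

public class UniformSizeSplitter {

    // Minimal stand-in for one (path, length) entry from file.list.
    public static class FileEntry {
        final String path;
        final long length;
        FileEntry(String path, long length) { this.path = path; this.length = length; }
    }

    public static List<List<FileEntry>> split(List<FileEntry> files, int m) {
        List<List<FileEntry>> splits = new ArrayList<>();
        long[] splitSizes = new long[m];
        for (int i = 0; i < m; ++i) {
            splits.add(new ArrayList<FileEntry>());
        }
        for (FileEntry file : files) {
            int smallest = 0;                       // find the split with the least data so far
            for (int i = 1; i < m; ++i) {
                if (splitSizes[i] < splitSizes[smallest]) {
                    smallest = i;
                }
            }
            splits.get(smallest).add(file);
            splitSizes[smallest] += file.length;
        }
        return splits;                              // one group of files per mapper
    }
}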


The (Unfortunate) Beginning

Data Management on Y! Clusters

Grid Data Management (GDM):
- Life-cycle management for data on Hadoop clusters
- 1+ Petabytes per day

Facets:
1. Acquisition: Warehouse -> Cluster

2. Replication: Cluster -> Cluster

3. “Retention”: Eviction of old data

4. Workflow tracking, Metrics, Monitoring, User-dashboards, Configuration management, etc.

GDM Replication:
1. Use DistCp!

2. Don’t re-implement MR-job for file-copy.


A marriage doomed to fail

Poor programmatic use:
- DistCp.main("-m", "20", "hftp://source_nn1:50070/source",
              "hdfs://target_nn1:8020/target");
- Blocking call
- Equal-size copy-distribution: can't be overridden

Long setup-times:
- Optimization: file.list contains only files that are changed/absent on the target
- Compare checksums
- E.g. experiment with a 200 GB dataset: 14 minutes of setup time

Atomic commit:
- Example: Oozie workflow-launch on data availability
- Premature consumption
- Workarounds: _DONE_ markers (see the sketch after this list):
  - NameNode pressure
  - Hacks to ignore the marker at the source

Others
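
As a concrete illustration of the _DONE_-marker workaround mentioned above, a consumer only treats a dataset as complete once the copy job has dropped a marker file into it. This is a hedged sketch: the marker name matches the slide, but the class and method names are invented for the example.

// Sketch of the _DONE_-marker workaround: a consumer (e.g. an Oozie workflow
// launcher) checks for a marker file before reading the dataset. The class and
// method names here are illustrative, not part of any real API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DoneMarkerCheck {
    public static boolean isDatasetReady(Configuration conf, Path dataset) throws Exception {
        FileSystem fs = dataset.getFileSystem(conf);
        // Each marker is one more object for the NameNode to track, and every
        // reader needs a hack to skip it when listing the dataset's files.
        return fs.exists(new Path(dataset, "_DONE_"));
    }
}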


DistCp Redux (Available in Hadoop 0.23/2.0)

Changes

(Almost) identical command-line:
hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US/ \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target/

Reduced setup-times:
- Postpone everything to the MR job
- E.g. experiment with a 200 GB dataset:
  - Old: 14 minutes of setup time
  - New: 7 seconds

Improved copy-times:
- Large-dataset copy test: time cut from 17 hours to 7 hours

Atomic commit:
hadoop distcp -atomic -tmp /home/mithunr/tmp /source /target

Improved programmatic use (see the sketch after this list):
- options = new DistCpOptions(srcPaths, destPath).preserve(BLOCKSIZE).setBlocking(false);
- Job job = new DistCp(hadoopConf, options).execute();

Others:
- Bandwidth throttling, asynchronous mode, configurable copy-strategies
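
Put together, the programmatic usage above looks roughly like the following. This is a sketch built around the calls shown on the slide; the DistCpOptions constructor and setters follow the Hadoop 0.23/2.x API as presented, but exact signatures differ between Hadoop releases (later versions use a builder), so treat it as illustrative.

// Sketch of programmatic DistCp v2 use, following the calls shown above.
// Exact DistCpOptions signatures vary across Hadoop releases.
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DistCpDriver {
    public static void main(String[] args) throws Exception {
        Configuration hadoopConf = new Configuration();

        List<Path> srcPaths = Arrays.asList(
                new Path("hftp://source_nn:50070/datasets/search/20120523/US"));
        Path destPath = new Path("hdfs://target_nn:8020/home/mithunr/target");

        DistCpOptions options = new DistCpOptions(srcPaths, destPath);
        options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);  // keep source block-sizes
        options.setBlocking(false);                               // don't block on job completion

        Job job = new DistCp(hadoopConf, options).execute();      // submit and return the Job
        System.out.println("Submitted DistCp job: " + job.getJobID());
    }
}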


Cost of copy

Copy-time is directly proportional to file-size (all else being equal).

Long-tailed MR jobs:
- Copy twenty 2 GB files between clusters. Why does one take longer than the rest?
- Hint: sometimes a file is slow initially, and then speeds up after a "block boundary".

Are data-nodes equivalent?
- Slower hard-drives
- Failing NICs
- Misconfiguration

Take a closer look at the command-line:
hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US/ \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target/


Reads over hdfs://


[Diagram: a User Program's DFS Client reading from several Data Nodes]

Reads over hftp://


[Diagram: the same User Program, DFS Client, and Data Nodes, with reads over hftp://]

Long-tails


/datasets/search/US/20130101/part-00000.gz
/datasets/search/US/20130101/part-00001.gz
/datasets/search/US/20130101/part-00002.gz
/datasets/search/US/20130101/part-00003.gz
/datasets/search/US/20130101/part-00004.gz
/datasets/search/US/20130101/part-00005.gz
/datasets/search/US/20130101/part-00006.gz
/datasets/search/US/20130101/part-00007.gz
/datasets/search/US/20130101/part-00008.gz
/datasets/search/US/20130101/part-00009.gz
...

[Diagram: Input Split #1 covering the files above; STUCK!; SPLIT!]

Mitigation

Break static binding between InputSplits and Mappers

E.g. Consider DistCp of N files with 10 mappers:

1. Don’t create 10 InputSplits. Create 20 instead.

2. Store each InputSplit as a separate file.

   - e.g. hdfs://home/mithunr/_distcp_20130116/work-pool/

3. Mapper consumes one InputSplit and checks for more (see the sketch after this slide).

4. Mappers quit when no more InputSplits are left.

Single file per InputSplit?
- NameNode pressure

DynamicInputFormat:
- Separate library

Perf:
- Worst case is no worse than UniformSizeInputFormat
- Best case: 17 hours -> 7 hours
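
The heart of the dynamic approach is the claim loop in steps 3 and 4 above: each map task keeps grabbing chunk files from the shared work-pool until the pool is empty. The following is a minimal sketch of that idea only; it is not the actual DynamicInputFormat implementation, and the WorkPoolConsumer/ChunkProcessor names, the claimed-directory convention, and the rename-based claiming are assumptions for the example.

// Sketch of the work-pool idea described above: a map task repeatedly "claims"
// a chunk file by atomically renaming it into its own directory, copies the
// files listed in it, and quits once the pool is empty. Illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WorkPoolConsumer {

    // Stand-in for the code that copies the files named in a claimed chunk.
    public interface ChunkProcessor {
        void copyFilesListedIn(Path chunkFile) throws Exception;
    }

    public static void consume(Configuration conf, Path workPool, Path claimedDir,
                               ChunkProcessor processor) throws Exception {
        FileSystem fs = workPool.getFileSystem(conf);
        while (true) {
            FileStatus[] chunks = fs.listStatus(workPool);
            if (chunks.length == 0) {
                return;                                   // no InputSplits left: quit
            }
            boolean claimedAny = false;
            for (FileStatus chunk : chunks) {
                Path claimed = new Path(claimedDir, chunk.getPath().getName());
                // rename() is atomic on HDFS, so only one mapper "wins" each chunk.
                if (fs.rename(chunk.getPath(), claimed)) {
                    claimedAny = true;
                    processor.copyFilesListedIn(claimed);
                }
            }
            if (!claimedAny) {
                return;                                   // remaining chunks were won by other mappers
            }
        }
    }
}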


Future

Block-level parallelism:
- Stream blocks individually
- Stitch at the end: metadata

YARN:
- Master-worker paradigm


_DONE_
