Jan 2013 HUG: DistCp v2 (2013-01-16)


DistCp (v.2) and the DynamicInputFormat

Mithun RK (mithunr@yahoo-inc.com)

2013-01-16

Me

- Yahoo: HCatalog, Hive, GDM
- Firmware Engineer at Hewlett Packard
- Fluent Hindi
- Gold medal at the nationals, last year


Prelude

“Legacy” DistCp

- Inter-cluster file copy using Map/Reduce
- Command-line:

hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target

Algorithm:

1. Enumerate the source files:
   for (FileStatus f : FileSystem.globStatus(sourcePath)) { recurse(f); }

2. Write the listing to ~/_distCp_WIP_201301161600/file.list

3. InputSplit calculation (see the sketch below):
   1. Divide the paths into 'm' groups ("splits"), one per mapper.
   2. Total file size in each split is roughly equal to the others'.

4. Launch the MR job:
   1. Each Map task copies the files specified in its InputSplit.

Source: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/tools/org/apache/hadoop/tools/DistCp.java
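
Step 3 boils down to bin-packing the listed files into 'm' groups of roughly equal total size. Below is a minimal Java sketch of that idea, assuming a greedy "add to the smallest split" strategy and a made-up FileEntry holder; it is illustrative only, not the actual DistCp source.

// Illustrative sketch (not the actual DistCp code): divide a file listing into
// 'm' splits whose total byte-counts are roughly equal, by always assigning the
// next file to the split with the least data so far.
import java.util.ArrayList;
import java.util.List;

public class UniformSizeSplitter {

    // Minimal stand-in for one (path, length) entry from file.list.
    public static class FileEntry {
        final String path;
        final long length;
        FileEntry(String path, long length) { this.path = path; this.length = length; }
    }

    public static List<List<FileEntry>> split(List<FileEntry> files, int m) {
        List<List<FileEntry>> splits = new ArrayList<>();
        long[] splitSizes = new long[m];
        for (int i = 0; i < m; ++i) {
            splits.add(new ArrayList<FileEntry>());
        }
        for (FileEntry file : files) {
            int smallest = 0;                       // find the split with the least data so far
            for (int i = 1; i < m; ++i) {
                if (splitSizes[i] < splitSizes[smallest]) {
                    smallest = i;
                }
            }
            splits.get(smallest).add(file);
            splitSizes[smallest] += file.length;
        }
        return splits;                              // one group of files per mapper
    }
}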


The (Unfortunate) Beginning

Data Management on Y! Clusters

Grid Data Management (GDM):
- Life-cycle management for data on Hadoop clusters
- 1+ Petabytes per day

Facets:
1. Acquisition: Warehouse -> Cluster

2. Replication: Cluster -> Cluster

3. “Retention”: Eviction of old data

4. Workflow tracking, Metrics, Monitoring, User-dashboards, Configuration management, etc.

GDM Replication:
1. Use DistCp!

2. Don’t re-implement MR-job for file-copy.


A marriage doomed to fail

Poor programmatic use:
- DistCp.main("-m", "20", "hftp://source_nn1:50070/source",
              "hdfs://target_nn1:8020/target");
- Blocking call
- Equal-size copy-distribution: can't be overridden

Long setup-times:
- Optimization: file.list contains only files that are changed/absent on the target
- Compare checksums
- E.g. experiment with a 200 GB dataset: 14 minutes of setup time

Atomic commit:
- Example: Oozie workflow-launch on data availability
- Premature consumption
- Workarounds: _DONE_ markers (see the sketch after this list):
  - NameNode pressure
  - Hacks to ignore the marker at the source

Others
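
As a concrete illustration of the _DONE_-marker workaround mentioned above, a consumer only treats a dataset as complete once the copy job has dropped a marker file into it. This is a hedged sketch: the marker name matches the slide, but the class and method names are invented for the example.

// Sketch of the _DONE_-marker workaround: a consumer (e.g. an Oozie workflow
// launcher) checks for a marker file before reading the dataset. The class and
// method names here are illustrative, not part of any real API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DoneMarkerCheck {
    public static boolean isDatasetReady(Configuration conf, Path dataset) throws Exception {
        FileSystem fs = dataset.getFileSystem(conf);
        // Each marker is one more object for the NameNode to track, and every
        // reader needs a hack to skip it when listing the dataset's files.
        return fs.exists(new Path(dataset, "_DONE_"));
    }
}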


DistCp Redux (Available in Hadoop 0.23/2.0)

Changes

(Almost) identical command-line:
hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US/ \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target/

Reduced setup-times:
- Postpone everything to the MR job
- E.g. experiment with a 200 GB dataset:
  - Old: 14 minutes of setup time
  - New: 7 seconds

Improved copy-times:
- Large-dataset copy test: time cut from 17 hours to 7 hours

Atomic commit:
hadoop distcp -atomic -tmp /home/mithunr/tmp /source /target

Improved programmatic use (see the sketch after this list):
- options = new DistCpOptions(srcPaths, destPath).preserve(BLOCKSIZE).setBlocking(false);
- Job job = new DistCp(hadoopConf, options).execute();

Others:
- Bandwidth throttling, asynchronous mode, configurable copy-strategies
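
Put together, the programmatic usage above looks roughly like the following. This is a sketch built around the calls shown on the slide; the DistCpOptions constructor and setters follow the Hadoop 0.23/2.x API as presented, but exact signatures differ between Hadoop releases (later versions use a builder), so treat it as illustrative.

// Sketch of programmatic DistCp v2 use, following the calls shown above.
// Exact DistCpOptions signatures vary across Hadoop releases.
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DistCpDriver {
    public static void main(String[] args) throws Exception {
        Configuration hadoopConf = new Configuration();

        List<Path> srcPaths = Arrays.asList(
                new Path("hftp://source_nn:50070/datasets/search/20120523/US"));
        Path destPath = new Path("hdfs://target_nn:8020/home/mithunr/target");

        DistCpOptions options = new DistCpOptions(srcPaths, destPath);
        options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);  // keep source block-sizes
        options.setBlocking(false);                               // don't block on job completion

        Job job = new DistCp(hadoopConf, options).execute();      // submit and return the Job
        System.out.println("Submitted DistCp job: " + job.getJobID());
    }
}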


Cost of copy

Copy-time is directly proportional to file-size (all else being equal).

Long-tailed MR jobs:
- Copy twenty 2 GB files between clusters. Why does one take longer than the rest?
- Hint: sometimes a file is slow initially, and then speeds up after a "block boundary".

Are data-nodes equivalent?
- Slower hard-drives
- Failing NICs
- Misconfiguration

Take a closer look at the command-line:
hadoop distcp -m 20 \

hftp://source_nn:50070/datasets/search/20120523/US/ \

hftp://source_nn:50070/datasets/search/20120523/UK \

hdfs://target_nn:8020/home/mithunr/target/


Reads over hdfs://


[Diagram: a User Program's DFS Client reading from several Data Nodes]

Reads over hftp://


[Diagram: the same User Program, DFS Client, and Data Nodes, with reads over hftp://]

Long-tails


/datasets/search/US/20130101/part-00000.gz
/datasets/search/US/20130101/part-00001.gz
/datasets/search/US/20130101/part-00002.gz
/datasets/search/US/20130101/part-00003.gz
/datasets/search/US/20130101/part-00004.gz
/datasets/search/US/20130101/part-00005.gz
/datasets/search/US/20130101/part-00006.gz
/datasets/search/US/20130101/part-00007.gz
/datasets/search/US/20130101/part-00008.gz
/datasets/search/US/20130101/part-00009.gz
...

[Diagram: Input Split #1 covering the files above; STUCK!; SPLIT!]

Mitigation

Break static binding between InputSplits and Mappers

E.g. Consider DistCp of N files with 10 mappers:

1. Don’t create 10 InputSplits. Create 20 instead.

2. Store each InputSplit as a separate file.

   - e.g. hdfs://home/mithunr/_distcp_20130116/work-pool/

3. Mapper consumes one InputSplit and checks for more (see the sketch after this slide).

4. Mappers quit when no more InputSplits are left.

Single file per InputSplit?
- NameNode pressure

DynamicInputFormat:
- Separate library

Perf:
- Worst case is no worse than UniformSizeInputFormat
- Best case: 17 hours -> 7 hours
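
The heart of the dynamic approach is the claim loop in steps 3 and 4 above: each map task keeps grabbing chunk files from the shared work-pool until the pool is empty. The following is a minimal sketch of that idea only; it is not the actual DynamicInputFormat implementation, and the WorkPoolConsumer/ChunkProcessor names, the claimed-directory convention, and the rename-based claiming are assumptions for the example.

// Sketch of the work-pool idea described above: a map task repeatedly "claims"
// a chunk file by atomically renaming it into its own directory, copies the
// files listed in it, and quits once the pool is empty. Illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WorkPoolConsumer {

    // Stand-in for the code that copies the files named in a claimed chunk.
    public interface ChunkProcessor {
        void copyFilesListedIn(Path chunkFile) throws Exception;
    }

    public static void consume(Configuration conf, Path workPool, Path claimedDir,
                               ChunkProcessor processor) throws Exception {
        FileSystem fs = workPool.getFileSystem(conf);
        while (true) {
            FileStatus[] chunks = fs.listStatus(workPool);
            if (chunks.length == 0) {
                return;                                   // no InputSplits left: quit
            }
            boolean claimedAny = false;
            for (FileStatus chunk : chunks) {
                Path claimed = new Path(claimedDir, chunk.getPath().getName());
                // rename() is atomic on HDFS, so only one mapper "wins" each chunk.
                if (fs.rename(chunk.getPath(), claimed)) {
                    claimedAny = true;
                    processor.copyFilesListedIn(claimed);
                }
            }
            if (!claimedAny) {
                return;                                   // remaining chunks were won by other mappers
            }
        }
    }
}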


Future

Block-level parallelism:
- Stream blocks individually
- Stitch at the end: metadata

YARN:
- Master-worker paradigm


_DONE_
