
Page 1: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sriram Rao

November 2, 2011

Page 2: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Joint Work With…

Raghu Ramakrishnan, Adam Silberstein: Yahoo Labs Mike Ovsiannikov, Damian Reeves: Quantcast

Page 3: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Motivation

Massive growth in online advertising (read: display ads). Companies are reacting to this opportunity via behavioral ad-targeting:
› Collect click-stream logs, mine the data, build models, show ads

“Petabyte scale data mining” using computational frameworks (such as Hadoop, Dryad) is commonplace

Analysis of Hadoop job history logs shows:
› Over 95% of jobs are small (run for a few minutes, process small data)
› About 5% of jobs are large (run for hours, process big data)

Page 4: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Where have my cycles gone?

5% of jobs take 90% of cycles!

Page 5: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Who is using my network?

5% of jobs account for 99% of network traffic!

Page 6: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

So…

Analysis shows 5% of the jobs are “big”:
› 5% of jobs use 90% of cluster compute cycles
› 5% of jobs shuffle 99% of the data (i.e., 99% of network bandwidth)

To improve cluster performance, improve M/R performance for large jobs

Faster, faster, faster: a virtuous cycle
› Cluster throughput goes up
› Users will run bigger jobs

Our work: Focus on handling intermediate data at scale in parallel dataflow graphs

Page 7: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Handling Intermediate Data in M/R

In an M/R computation, map output is intermediate data. For transferring intermediate data from map to reduce:
› Map tasks generate data, write to disk
› When a reduce task pulls map output,
  • Data has to be read from disk
  • Transferred over the network
  – Cannot assume that mappers/reducers can be scheduled concurrently

Transporting intermediate data:
› Intermediate data size < RAM size: RAM masks disk I/O
› Intermediate data size > RAM size: Cache hit rate masks disk I/O
› Intermediate data size >> RAM size: Disk overheads affect performance

Page 8: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Intermediate Data Transfer: Distributed Merge Sort

› # of disk seeks for transferring intermediate data ∝ M * R
› Avg. amount of data a reducer pulls from a mapper ∝ 1 / R

[Figure: handling intermediate data at scale — map tasks write intermediate data to the distributed file system; reduce tasks pull it back, incurring on the order of M * R disk seeks.]

Page 9: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Disk Overheads (More Detail)

“Fix” the amount of data generated by a map task

• Size RAM such that the map output fits in-memory and can be sorted in 1-pass

– For example, use 1GB

“Fix” the amount of data consumed by a reduce task

• Size RAM for a 1-pass merge

– For example, use 1GB

Now…

• For a job with 1TB of data, 1024 mappers generate 1G each; 1024 reducers consume 1G each

– On average, data generated by a map for a given reducer = 1G / 1024 = 1M

• For a job with 16TB of data, 16K mappers generate 1G each; 16K reducers consume 1G each

– On average, data generated by a map for a given reducer = 1G / 16K = 64K

With scale, # of seeks increases; data read/seek decreases
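
A quick sketch of this arithmetic (illustrative only, using the ~1GB-per-task sizing from the slide) makes the scaling concrete:

    // Back-of-the-envelope arithmetic from the slide: with ~1 GB of output per
    // map task and ~1 GB of input per reduce task, the data a single mapper
    // produces for a single reducer shrinks as the job grows, while the number
    // of map-to-reduce transfers (seeks) grows as M * R.
    public class DiskOverheadMath {
        static void show(long totalBytes) {
            long perTask = 1L << 30;              // 1 GB per mapper and per reducer
            long m = totalBytes / perTask;        // number of map tasks
            long r = totalBytes / perTask;        // number of reduce tasks
            long perPair = perTask / r;           // avg data one mapper generates for one reducer
            System.out.printf("%d TB: M = R = %d, seeks ~ M*R = %d, data per seek ~ %d KB%n",
                    totalBytes >> 40, m, m * r, perPair >> 10);
        }
        public static void main(String[] args) {
            show(1L << 40);    // 1 TB  -> ~1 MB per (mapper, reducer) pair
            show(16L << 40);   // 16 TB -> ~64 KB per (mapper, reducer) pair
        }
    }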

Page 10: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Disk Overheads

As the volume of intermediate data scales:
› Amount of data read per seek decreases
› # of disk seeks increases non-linearly

Net result: Job performance will be affected by the disk overheads in handling intermediate data
› When intermediate data increases by 2x, job-run time increases by 2.5x

Page 11: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

What is new?

[Figure: map tasks append their output to I-files in the distributed file system; reduce tasks perform a network-wide merge — fewer seeks!]

One intermediate file per reducer, instead of one per mapper

Page 12: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Our work

New approach for efficient handling of intermediate data at large scale

• Minimize the number of seeks

• Maximize the amount of data read/written per seek

• Primarily geared towards LARGE M/R jobs:
  – 10’s of TB of intermediate data
  – 1000’s of mapper/reducer tasks

I-files: Filesystem support for intermediate data
› Atomic record append primitive that allows write parallelism at scale
› Network-wide batching of intermediate data

Built Sailfish (by modifying Hadoop-0.20.2), where intermediate data is transported using I-files

Page 13: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

How did we do? (Benchmark job)

Page 14: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

How did we do? (Actual Job)

Page 15: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Talk Outline

Properties of Intermediate data

I-files implementation

Sailfish: M/R implementation that uses I-files for intermediate data

Experimental Evaluation

Summary

Page 16: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Organizing Intermediate Data

Hadoop organizes intermediate data in a format convenient for the mapper

What if we went the opposite way: organize it in a format convenient for the reducer?
› Mappers write their output to a per-partition I-file (sketched below)
› Data destined for a reducer is in a single file
› Build the intermediate data file in a manner that is suitable for the reader rather than the writer
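
A minimal sketch of that routing idea, assuming a hypothetical IFileAppender interface that stands in for the filesystem's atomic record append client (this is not the actual KFS/Sailfish API):

    // Hypothetical sketch: route each map output record to a per-partition I-file
    // (one file per reducer) instead of a per-mapper spill file.
    import java.util.HashMap;
    import java.util.Map;

    interface IFileAppender {
        void recordAppend(byte[] key, byte[] value);   // offset chosen by the server
    }

    public class PerPartitionRouter {
        private final Map<Integer, IFileAppender> appenders = new HashMap<>();
        private final int numPartitions;

        PerPartitionRouter(int numPartitions, java.util.function.IntFunction<IFileAppender> open) {
            this.numPartitions = numPartitions;
            for (int p = 0; p < numPartitions; p++) appenders.put(p, open.apply(p));
        }

        // Same hash partitioning the reduce side would use; all mappers writing
        // partition p share the single I-file p.
        void emit(byte[] key, byte[] value) {
            int p = (java.util.Arrays.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
            appenders.get(p).recordAppend(key, value);
        }

        public static void main(String[] args) {
            PerPartitionRouter r = new PerPartitionRouter(4,
                    p -> (k, v) -> System.out.println("append to I-file " + p + ": " + new String(k)));
            r.emit("user42".getBytes(), "click".getBytes());
        }
    }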

Page 17: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Reducer input is generated by multiple mappers

File is a container into which mapper output needs to be stored

› Write order is k1, k2, k3, k4

› Processing order is k3, k1, k4, k2

Because reducer imposes processing order, writer does not care where the output is stored in the file

Once a mapper emits a record, the output is committed

› There is no “withdraw”

[Figure: four map tasks (M) emit keys into a single intermediate-data file consumed by a reducer (R); the file holds the keys as k3, k1, k4, k2 regardless of the order in which the mappers wrote them.]

Page 18: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Properties of Intermediate data file

Multiple mappers generate data that will be consumed by a single reducer
› Need low-latency multi-writer support

Writers are doing append-only writes
› Contents of the I-file are never overwritten

Arbitrary interleaving of data is OK:
› Writer does not care where the data goes in the file
› Any ordering we need can be done post-facto

No ordering guarantees for the writes from a single client
• Follows from arbitrary interleaving of writes

Page 19: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Atomic Record Append

Multi-writer support => need an atomic primitive
› Intermediate data is append-only… so, need atomic append

With the atomic record append primitive, clients provide just the data and the server chooses the offset, with arbitrary interleaving
› In contrast, in a traditional write, clients provide data+offset

Since the server is choosing the offset, the design is lock-free. To scale atomic record append with the number of writers, allow:
› Multiple writers to append to a single block of the file
› Multiple blocks of the file to be concurrently appended to

Page 20: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Atomic Record Append

[Figure: two clients issue atomic record append (ARA) requests — <A, offset = -1> and <B, offset = -1> — without specifying an offset; the server places the records in the chunk (e.g., B at offset 300, A at 350) and returns the chosen offset to each client.]
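
A minimal sketch of the server side of this exchange (illustrative, not the KFS code): the client sends only the record, and the chunk master picks the offset with an atomic fetch-and-add, which is what makes the design lock-free.

    import java.util.concurrent.atomic.AtomicLong;

    public class AraServerSketch {
        private final AtomicLong nextOffset = new AtomicLong(0);

        // Client sends <record, offset = -1>; the server returns the offset it chose.
        long recordAppend(byte[] record) {
            long off = nextOffset.getAndAdd(record.length);
            // ... write the record bytes at 'off' and forward them to the replicas ...
            return off;
        }

        public static void main(String[] args) {
            AraServerSketch chunk = new AraServerSketch();
            // Two clients appending concurrently each get a distinct, server-chosen offset.
            System.out.println("A placed at offset " + chunk.recordAppend(new byte[350]));
            System.out.println("B placed at offset " + chunk.recordAppend(new byte[300]));
        }
    }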

Page 21: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Implementing I-files

Have implemented I-files in the context of the Kosmos distributed filesystem (KFS)
› Why KFS?
  • KFS has multi-writer support
  • We have designed/implemented/deployed KFS to manage PB’s of storage

KFS is similar to GFS/HDFS
› Chunks are striped across nodes and replicated for fault-tolerance
  – Chunk master serializes all writes to a chunk
› For atomic append, the chunk master assigns the offset
› With KFS I-files, multiple chunks of the I-file can be concurrently modified

Page 22: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Atomic Record Append

Writers are routed to a chunk that is open for writing
› For scale, limit the # of concurrent writers to a chunk

When a client gets an ack back from the chunk master, the data is replicated in volatile memory at all the replicas
› Chunkservers are free to commit data to disk asynchronously

Eventually, the chunk is made stable
› Data is committed to disk at all the replicas
› Replicas are byte-wise identical

Stable chunks are not appended to again
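
The sketch below illustrates how writers could be routed to open chunks while bounding per-chunk fan-in. The names are hypothetical and the 128-writer limit is the figure quoted later in the talk; this is not the KFS allocation code.

    import java.util.ArrayList;
    import java.util.List;

    public class ChunkRouterSketch {
        static final int MAX_APPENDERS_PER_CHUNK = 128;   // limit cited later in the talk

        static class Chunk {
            final int id;
            int appenders;
            Chunk(int id) { this.id = id; }
        }

        private final List<Chunk> openChunks = new ArrayList<>();
        private int nextChunkId = 0;

        // Route a new writer to an open chunk, opening a fresh one when all
        // current chunks are at their concurrency limit.
        synchronized Chunk assignWriter() {
            for (Chunk c : openChunks) {
                if (c.appenders < MAX_APPENDERS_PER_CHUNK) { c.appenders++; return c; }
            }
            Chunk c = new Chunk(nextChunkId++);
            c.appenders = 1;
            openChunks.add(c);
            return c;
        }

        public static void main(String[] args) {
            ChunkRouterSketch router = new ChunkRouterSketch();
            for (int w = 0; w < 130; w++) {
                Chunk c = router.assignWriter();
                if (w == 0 || w == 129) System.out.println("writer " + w + " -> chunk " + c.id);
            }
        }
    }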

Page 23: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Talk Outline

Properties of Intermediate data

I-files implementation

Sailfish: M/R implementation that uses I-files for intermediate data

Experimental Evaluation

Summary

Page 24: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

The Elephant Can Dance…

[Figure: the Hadoop shuffle pipeline vs. the Sailfish shuffle pipeline, from map() through (de)serialization to reduce().]

Page 25: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sailfish Overview

Modify Hadoop-0.20.2 to use I-files for MapReduce
› Mappers write their output to a per-partition I-file
  • Replication factor of 1 for all the chunks of an I-file
  • At-least-once semantics for append; filter dups on the reduce side
› Data destined for a reducer is in a single file
› Build the intermediate data file in a manner that is suitable for the reader rather than the writer

Automatically parallelize execution of the reduce phase: set the number of reduce tasks and work assignment dynamically
› Assign key-ranges to reduce tasks rather than whole partitions
› Extend I-files to support key-based retrieval

Page 26: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sailfish Map Phase Execution

Page 27: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sailfish Reduce Phase Execution

Page 28: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Atomic “Record” Append For M/R

M/R computations are about processing records
› Intermediate data consists of key/value pairs

Extend atomic append to support “records”
› Mappers emit <key, record>
  • Per-record framing that identifies the mapper task that generated a record
› System stores a per-chunk index
  • After a chunk is stable, the chunk is sorted and an index is built by the sorter
    – Sorting is a completely local operation: read a block from disk, sort in RAM, and write back to disk
› Reducers can retrieve data by <key>
  • Use per-record framing to discard data from dead mappers (see the sketch below)
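
A sketch of how per-record framing enables that filtering (the frame layout and names are hypothetical, not the Sailfish wire format): each record carries the id of the map attempt that produced it, so a reducer keeps only records from attempts the job tracker committed, which also drops the duplicates allowed by at-least-once append.

    import java.util.Set;

    public class FramedRecord {
        final int mapTaskId;       // which map attempt emitted this record
        final byte[] key;
        final byte[] value;

        FramedRecord(int mapTaskId, byte[] key, byte[] value) {
            this.mapTaskId = mapTaskId; this.key = key; this.value = value;
        }

        // Keep a record only if its producer is in the set of map attempts the
        // job tracker declared successful.
        static boolean accept(FramedRecord r, Set<Integer> committedMapTasks) {
            return committedMapTasks.contains(r.mapTaskId);
        }

        public static void main(String[] args) {
            Set<Integer> committed = Set.of(1, 2);
            FramedRecord fromLiveMapper = new FramedRecord(1, "k".getBytes(), "v".getBytes());
            FramedRecord fromDeadMapper = new FramedRecord(7, "k".getBytes(), "v".getBytes());
            System.out.println(accept(fromLiveMapper, committed));  // true
            System.out.println(accept(fromDeadMapper, committed));  // false
        }
    }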

Page 29: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sailfish Architecture

[Architecture diagram: a job is submitted to the Hadoop JT; mapper tasks write via an I-appender into KFS I-files; reducer tasks read/merge via an I-merger; a workbuilder tells each reducer what to do (e.g., “I-file 5, key range [a, d)”).]

Page 30: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Handling Failures

Whenever a chunk of an I-file is lost, need to re-generate lost data

With I-files, we have multiple mappers writing to a block. For fault-tolerance:
› Workbuilder tracks the set of chunks modified by each mapper task (see the sketch below)
› Whenever a chunk is lost, the workbuilder notifies the JT of the set of map tasks that have to be re-run
› Reducers reading from the I-file with the lost chunk wait until the data is re-generated

For fault-containment, Sailfish uses per-rack I-files
› Mappers running in a rack write to chunks of the I-file stored in that rack
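
A sketch of that bookkeeping (names are hypothetical): the workbuilder records which map tasks appended to each chunk, so it can hand the JT the exact re-run set when a chunk is lost.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class LostChunkRecovery {
        // chunk id -> set of map task ids that appended records to that chunk
        private final Map<Long, Set<Integer>> writersByChunk = new HashMap<>();

        void recordAppend(long chunkId, int mapTaskId) {
            writersByChunk.computeIfAbsent(chunkId, c -> new HashSet<>()).add(mapTaskId);
        }

        // On chunk loss, these map tasks must be re-executed to regenerate the data.
        Set<Integer> mapTasksToRerun(long lostChunkId) {
            return writersByChunk.getOrDefault(lostChunkId, Set.of());
        }

        public static void main(String[] args) {
            LostChunkRecovery wb = new LostChunkRecovery();
            wb.recordAppend(42L, 3);
            wb.recordAppend(42L, 8);
            wb.recordAppend(43L, 5);
            System.out.println("re-run after losing chunk 42: " + wb.mapTasksToRerun(42L));
        }
    }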

Page 31: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Fault-tolerance With Sailfish

An alternate option is to replicate map output
› Use atomic record append to write to two chunkservers
  • Probability of data loss due to (concurrent) double failure is low
› Performance hit for replicating data is low
  • Data is replicated using RAM and written to disk asynchronously
› However, network traffic increases substantially
  • Sailfish causes network traffic to double compared to stock Hadoop
    – Map output is written to the network and reduce input is read over the network
  • With replication, data traverses the network three times
    – An alternate strategy is to selectively replicate map output:
      Replicate in response to data loss
      Replicate output that was generated the earliest

Page 32: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Sailfish Reduce Phase

The # of reducers per job and their task assignment is determined by the workbuilder in a data-dependent manner
› Dynamically set the # of reducers per job after the map phase execution is complete

# of reducers/I-file = (size of I-file) / (work per reducer)
› Work per reducer is set based on RAM (in experiments, use 1GB per reduce task)
› If the data assigned to a task exceeds the size of RAM, the merger does a network-wide merge by appropriately streaming the data

Workbuilder uses the per-chunk index to determine split points
› Each reduce task is assigned a range of keys within an I-file
  • Data for a reduce task is in multiple chunks and requires a merge
  • Since chunks are sorted, data read by a reducer from a chunk is all sequential I/O

Skew in reduce input is handled seamlessly (see the sketch below)
› I-file with more data has more tasks assigned to it
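
A sketch of the planning formula above, assuming ~1 GB of reduce input per task as in the experiments: an I-file with more data (skew) simply gets proportionally more reduce tasks, each covering a key range derived from the per-chunk index.

    public class ReducePlanSketch {
        static final long WORK_PER_REDUCER = 1L << 30;   // ~1 GB of reduce input per task

        static int reducersFor(long iFileBytes) {
            return (int) Math.max(1, (iFileBytes + WORK_PER_REDUCER - 1) / WORK_PER_REDUCER);
        }

        public static void main(String[] args) {
            long[] iFileSizes = { 2L << 30, 2L << 30, 17L << 30 };   // last I-file is skewed
            for (int i = 0; i < iFileSizes.length; i++) {
                System.out.println("I-file " + i + ": " + reducersFor(iFileSizes[i]) + " reduce task(s)");
            }
        }
    }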

Page 33: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Experimental Evaluation

Cluster comprises ~150 machines
› 6 map tasks, 6 reduce tasks per node
  • With Hadoop M/R tasks, a JVM is given 1.5GB RAM for a one-pass sort/merge
› 8 cores, 16GB RAM, 4 x 750GB drives, 1Gbps between any pair of nodes
› Job uses all the nodes in the cluster

Evaluate with a benchmark as well as a real M/R job
› Simple benchmark that generates its own data (similar to terasort)
  • Measure only the overhead of transporting intermediate data
  • Job generates records with a random 10-byte key, 90-byte value (see the sketch below)
› Experiments vary the size of intermediate data (1TB – 64TB)
  • Mappers generate 1GB of data and reducers consume ~1GB of data
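
For concreteness, a tiny sketch of the kind of synthetic record the benchmark describes (random 10-byte key, 90-byte value, similar to terasort); this is illustrative only, not the benchmark's actual generator.

    import java.util.Random;

    public class SyntheticRecords {
        public static void main(String[] args) {
            Random rng = new Random(0);
            byte[] key = new byte[10];
            byte[] value = new byte[90];
            for (int i = 0; i < 3; i++) {           // a mapper would emit ~1 GB of these
                rng.nextBytes(key);
                rng.nextBytes(value);
                System.out.println("record " + i + ": " + (key.length + value.length) + " bytes");
            }
        }
    }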

Page 34: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

I-files in practice

150 map tasks/rack

128 map tasks concurrently appending to a block of an I-file

2 blocks of an I-file are concurrently appended to in a rack

512 I-files per job
› Beyond 512 I-files we hit system limitations in the cluster (too many open files, too many connections)

KFS chunkservers use direct I/O with the disk subsystem, bypassing the buffer cache

Page 35: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

How did we do? (Benchmark job)

Page 36: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

How many seeks?

With stock Hadoop, the number of seeks is ∝ M * R. With Sailfish, it is the product of:
› # of chunks per I-file (c)
› # of reduce tasks per I-file (R / I)
› # of I-files (I)

We get: c * I * (R / I) = c * R

# of chunks per I-file: 64TB of intermediate data split over 512 I-files, where the chunksize is 128MB
› c = 64TB / (512 * 128MB) = 1024

# of map tasks at 64TB: 65536 (64TB / 1GB per mapper), so c << M
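
Writing out the arithmetic on this slide (assuming, as on the earlier slides, one reduce task per ~1 GB of data):

    public class SeekCountComparison {
        public static void main(String[] args) {
            long data = 64L << 40;                 // 64 TB of intermediate data
            long m = data / (1L << 30);            // 65,536 map tasks at 1 GB each
            long r = m;                            // reducers consume ~1 GB each
            long iFiles = 512;
            long chunk = 128L << 20;               // 128 MB chunksize
            long c = data / (iFiles * chunk);      // chunks per I-file = 1024
            System.out.println("stock Hadoop seeks ~ M*R = " + m * r);
            System.out.println("Sailfish seeks    ~ c*R = " + c * r);
        }
    }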

Page 37: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Why does Sailfish work?

Where are the gains coming from?
› Write-path is low-latency and is able to keep as many disk arms and NICs busy as possible
› Read-path:
  • Lowered the number of disk seeks
  • Reads are large/sequential

Compared to Hadoop, the read path in Sailfish is very efficient
› The efficient disk read path leads to better network utilization

Page 38: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Data read per seek

Page 39: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Disk Throughput (during Reduce phase)

Page 40: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Using Sailfish In Practice

Use a job+data from one of the behavioral ad-targeting pipelines at Yahoo
› BT-Join: Build a sliding N-day model of user behavior
  • Take 1 day of clickstream logs, join with the previous N days, and produce a new N-day model

Input datasets compressed using bz2:
› Dataset A: 1000 files, 50MB apiece (10:1 compression)
› Dataset B: 1000 files, 1.2GB apiece (10:1 compression)

Extended Sailfish to support compression for intermediate data (see the sketch below)
• Mappers generate up to 256K of records, compress, and “append record”
• Sorters read compressed data, decompress, sort, and recompress
• Mergers read compressed data, decompress, merge, and pass to the reducer
• For performance, use LZO from the Intel IPP package
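
A sketch of the batch-compress-append path on the mapper side. The talk uses LZO from the Intel IPP package; java.util.zip.Deflater stands in here only to keep the example self-contained, and reading “256K of records” as a 256 KB byte budget is an assumption.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class CompressedAppendSketch {
        static final int BATCH_BYTES = 256 * 1024;     // assumed 256 KB batch budget
        private final ByteArrayOutputStream batch = new ByteArrayOutputStream();

        void emit(byte[] record) {
            batch.write(record, 0, record.length);
            if (batch.size() >= BATCH_BYTES) flush();
        }

        void flush() {
            byte[] raw = batch.toByteArray();
            Deflater d = new Deflater();               // stand-in codec, not LZO/IPP
            d.setInput(raw);
            d.finish();
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!d.finished()) compressed.write(buf, 0, d.deflate(buf));
            d.end();
            System.out.println("append compressed batch: " + raw.length + " -> " + compressed.size() + " bytes");
            batch.reset();                             // the compressed bytes would be record-appended
        }

        public static void main(String[] args) {
            CompressedAppendSketch m = new CompressedAppendSketch();
            byte[] record = new byte[100];
            for (int i = 0; i < 4000; i++) m.emit(record);   // ~400 KB -> one flushed batch
            m.flush();                                        // flush the remainder
        }
    }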

Page 41: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

How did we do? (BT-Join)

Page 42: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

BT-Join Analysis

Speedup in the Reduce phase is due to better batching

Speedup in the Map phase:
› Stock Hadoop: if map output doesn’t fit in RAM, mappers do an external sort
› Sailfish: sorting is outside the map task and hence there is no limit on the amount of map output generated by a map task

Net result: the job with Sailfish is about 2x faster when compared to stock Hadoop

Page 43: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Related Work

Atomic append was introduced in the GFS paper (SOSP ’03)
› GFS, however, seems to have moved away from atomic append, as they say it is not usable (at-least-once semantics and replicas can diverge)

Balanced systems: TritonSort
› Stage-based sort engine in which the hardware is balanced
  • 8 cores, 24GB RAM, 10Gig NIC, 16 drives/box
  • Software is then constructed in a way that balances hardware use
› Follow-on work on building an M/R on top of TritonSort
  • Not clear how general their M/R engine is (seems specific to sort)
› Sailfish tries to achieve balance via software and is a general M/R engine

Page 44: I-Files: Handling Intermediate Data In Parallel Dataflow Graphs

Summary

Designed I-files for intermediate data and built Sailfish for doing large-scale M/R
› Sailfish will be released as open-source

Future work:
› Build Sailfish on top of YARN
› Utilize the per-chunk index:
  • Improve reduce task planning based on key distributions
  • “Checkpoint” reduce tasks on key-based boundaries and allow better resource sharing
  • Support aggregation trees
› Having the intermediate data outside an M/R job allows new debugging possibilities
  • Debug just the reduce phase