Crossing the Chasm: Sneaking a parallel file system into Hadoop

Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson

PARALLEL DATA LABORATORY, Carnegie Mellon University


In this work …

• Compare and contrast large storage system architectures

• Internet services

• High performance computing

• Can we use a parallel file system for Internet service applications?

• Hadoop, an Internet service software stack

• HDFS, an Internet service file system for Hadoop

• PVFS, a parallel file system

Wittawat Tantisiriroj © November 08 http://www.pdl.cmu.edu/


Today’s Internet services

• Applications are becoming data-intensive

• Large input data set (e.g. the entire web)

• Distributed, parallel application execution

• Distributed file system is a key component

• Define new semantics for anticipated workloads

– Atomic append in Google FS

– Write-once in HDFS

• Commodity hardware and network

– Handle failures through replication


The HPC world

• Equally large applications

• Large input data set (e.g. astronomy data)

• Parallel execution on large clusters

• Use parallel file systems for scalable I/O

• e.g. IBM’s GPFS, Sun’s Lustre FS, PanFS, and Parallel Virtual File System (PVFS)


Why use parallel file systems?

• Handle a wide variety of workloads

• Highly concurrent reads and writes

• Small file support, scalable metadata

• Offer performance vs. reliability tradeoff

• RAID-5 (e.g., PanFS)

• Mirroring

• Failover (e.g., LustreFS)

• Standard Unix FS interface & POSIX semantics

• pNFS standard (NFS v4.1)


Outline

A basic shim layer & preliminary evaluation

• Three add-on features in a shim layer

• Evaluation


HDFS & PVFS: high level design

• Metadata servers

• Store all file system metadata

• Handle all metadata operations

• Data servers

• Store actual file system data

• Handle all read and write operations

• Files are divided into chunks

• Chunks of a file are distributed across servers


PVFS shim layer under Hadoop

[Diagram: software stack, built up incrementally on the slide]

Client:

• Hadoop applications

• Hadoop framework

• Extensible file system API

• HDFS client library, alongside the PVFS shim layer on top of the unmodified PVFS client library (C)

Server:

• HDFS servers, alongside unmodified PVFS servers

• The PVFS shim layer forwards requests to, and returns responses from, the PVFS client library using the Java Native Interface (JNI)


Preliminary Evaluation

• Text search (“grep”)

• A common workload in Internet service applications

• Search for a rare pattern in 100-byte records

• 64GB data set

• 32 nodes

• Each node serves as both a storage node and a compute node


Vanilla PVFS is disappointing …

[Figure: bar chart of completion time (sec) for Grep (64GB, 32 nodes, no replication), comparing PVFS and HDFS. Annotation: PVFS is 2.5 times slower.]


Outline

• A basic shim layer & preliminary evaluation

Three add-on features in a shim layer

Readahead buffer

• File layout information

• Replication

• Evaluation


Read operation in Hadoop

• Typical read workload:

• Small (less than 128 KB)

• Sequential through an entire chunk

• HDFS prefetches an entire chunk

• No cache coherence issue with its write-once semantics


Readahead buffer

• PVFS has no client buffer cache

• Avoid a cache coherence issue with concurrent writes

• Readahead buffer can be added to PVFS shim layer

• In Hadoop, a file can become immutable after it is closed

• No need for cache coherence mechanism
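A readahead buffer of this kind can be sketched in a few lines. The Python below is only an illustration under stated assumptions, not the shim layer's code (which is Java calling into C): `ReadaheadReader`, `buffer_size`, and `backend_reads` are hypothetical names, and an in-memory `io.BytesIO` stands in for the PVFS client library. It relies on the file being immutable, so the buffer is never invalidated.

```python
import io


class ReadaheadReader:
    """Serve small sequential reads from one large buffered read (sketch)."""

    def __init__(self, raw, buffer_size=4 * 1024 * 1024):
        self.raw = raw                  # any seekable binary file-like object
        self.buffer_size = buffer_size  # 4 MB, as in the slide title that follows
        self.buf = b""                  # currently buffered bytes
        self.buf_start = 0              # file offset of self.buf[0]
        self.pos = 0                    # logical read position
        self.backend_reads = 0          # number of reads sent to the backend

    def read(self, n):
        out = bytearray()
        while n > 0:
            offset_in_buf = self.pos - self.buf_start
            if not (0 <= offset_in_buf < len(self.buf)):
                # Miss: fetch one large block starting at the current position.
                self.raw.seek(self.pos)
                self.buf = self.raw.read(self.buffer_size)
                self.buf_start = self.pos
                self.backend_reads += 1
                offset_in_buf = 0
                if not self.buf:
                    break               # end of file
            piece = self.buf[offset_in_buf:offset_in_buf + n]
            out += piece
            self.pos += len(piece)
            n -= len(piece)
        return bytes(out)


# Usage: many 100-byte reads are served by a handful of backend reads.
reader = ReadaheadReader(io.BytesIO(b"x" * 10_000), buffer_size=4096)
first_record = reader.read(100)
```

Because no writer can change a closed file, the sketch needs no invalidation path, which is the point the slide makes about cache coherence.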


PVFS with 4MB buffer

[Figure: bar chart of completion time (sec) for Grep (64GB, 32 nodes, no replication), comparing PVFS with no buffer, PVFS with buffer, and HDFS. Annotation: still quite slow.]


Outline

• A basic shim layer & preliminary evaluation

Three add-on features in a shim layer

• Readahead buffer

File layout information

• Replication

• Evaluation


Collocation in Hadoop

• File layout information

• Describe where chunks are located

• Collocate computation and data

• Ship computation to where data is located

• Reduce network traffic
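As a rough illustration of shipping computation to the data, the sketch below greedily assigns one task per chunk to a node that already stores the chunk. It is not Hadoop's scheduler: `assign_tasks`, `chunk_locations`, and `free_slots` are hypothetical names, and the sketch assumes there are at least as many free slots as chunks.

```python
def assign_tasks(chunk_locations, free_slots):
    """Greedy locality-aware assignment of one task per chunk (sketch).

    chunk_locations: dict mapping chunk name -> list of nodes storing it
    free_slots:      dict mapping node name -> number of free task slots
    Returns a dict mapping chunk name -> (node, is_local).
    """
    slots = dict(free_slots)
    assignment = {}
    remote = []
    for chunk, nodes in chunk_locations.items():
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            node = local[0]
            slots[node] -= 1
            assignment[chunk] = (node, True)    # no network transfer needed
        else:
            remote.append(chunk)
    for chunk in remote:
        # Fall back to any node with a free slot; the chunk crosses the network.
        node = next(n for n, s in slots.items() if s > 0)
        slots[node] -= 1
        assignment[chunk] = (node, False)
    return assignment


# Hypothetical three-node example: every task can run where its chunk lives.
plan = assign_tasks(
    {"chunk1": ["node1"], "chunk2": ["node2"], "chunk3": ["node3"]},
    {"node1": 1, "node2": 1, "node3": 1},
)
```

Without layout information, `chunk_locations` would be empty lists and every task would fall into the remote branch.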


Hadoop without collocation

[Diagram: three nodes (A, B, C), each acting as a compute node and a storage node, and three chunks (Chunk 1, Chunk 2, Chunk 3). Each computation runs on a node that does not store its chunk, so there are 3 data transfers over the network.]


Hadoop with collocation

[Diagram: the same three nodes (A, B, C) and three chunks. Each computation runs on the node that stores its chunk, so there is no data transfer over the network.]


Expose file layout information

• File layout information in PVFS

• Stored as extended attributes

• Different format from Hadoop format

• A shim layer converts file layout information from PVFS format to Hadoop format

• Enable Hadoop to collocate computation and data
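The conversion can be pictured as turning a striping description into a list of per-chunk locations. The sketch below is an assumption-laden illustration: the slides do not show the PVFS attribute format, so the inputs (`stripe_size` and an ordered `servers` list with round-robin striping) are hypothetical, and `BlockLocation` is a stand-in loosely modeled on the offset/length/hosts triple that Hadoop block locations carry.

```python
from dataclasses import dataclass


@dataclass
class BlockLocation:
    """Stand-in for a Hadoop-style block location (hypothetical)."""
    offset: int
    length: int
    hosts: list


def stripes_to_block_locations(file_size, stripe_size, servers):
    """Map a round-robin striped file to per-stripe host lists (sketch)."""
    locations = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        host = servers[index % len(servers)]    # round-robin placement
        locations.append(BlockLocation(offset, length, [host]))
        offset += length
        index += 1
    return locations


# Hypothetical 10-byte file striped in 4-byte units over two servers.
layout = stripes_to_block_locations(10, 4, ["server1", "server2"])
```

With such a list in hand, a scheduler can match each input split to a host that stores it.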


PVFS with file layout information

[Figure: bar chart of completion time (sec) for Grep (64GB, 32 nodes, no replication), comparing PVFS with no buffer and no file layout; PVFS with buffer and no file layout; PVFS with buffer and file layout; and HDFS. Annotation: comparable performance.]


Outline

• A basic shim layer & preliminary evaluation

Three add-on features in a shim layer

• Readahead buffer

• File layout information

Replication

• Evaluation


Replication in HDFS

• Rack-aware replication

• By default, 3 copies for each file (triplication)

1. Write to a local storage node

2. Write to a storage node in the local rack

3. Write to a storage node in another rack
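The three placement steps above can be sketched as a small selection function. This follows the order described on the slide only; it is not HDFS code. `choose_replica_nodes` and the `topology` dictionary are hypothetical, and the sketch picks the first eligible node deterministically where a real system would presumably choose among candidates.

```python
def choose_replica_nodes(writer, topology):
    """Pick three replica targets in the order the slide lists (sketch).

    writer:   name of the node issuing the write
    topology: dict mapping rack name -> list of node names
    Returns [local node, node in the local rack, node in another rack].
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    same_rack = next(n for n in topology[local_rack] if n != writer)
    other_rack = next(
        n for r, nodes in topology.items() if r != local_rack for n in nodes
    )
    return [writer, same_rack, other_rack]


# Hypothetical two-rack cluster.
targets = choose_replica_nodes(
    "n2", {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
)
```

The first copy costs no network transfer, the second stays inside the rack, and only the third crosses racks.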


Replication in PVFS

• No replication in the public release of PVFS

• Rely on hardware-based reliability solutions

• Per-server RAID inside logical storage devices

• Replication can be added in a shim layer

• Write each file to three servers

• No reconstruction/recovery in the prototype


PVFS with replication

[Diagram, built up incrementally: Hadoop applications on top of the Hadoop framework, the extensible file system API, the PVFS shim layer, and the unmodified PVFS client library (C), which writes to three unmodified PVFS servers.]


PVFS shim layer under Hadoop

[Diagram: on the client, Hadoop applications, the Hadoop framework, and the extensible file system API sit above the HDFS client library and, alongside it, the PVFS shim layer on top of the unmodified PVFS client library (C). On the server, HDFS servers sit alongside unmodified PVFS servers. The PVFS shim layer contains three features: readahead buffer, file layout info, and replication.]

• The shim layer is ~1,700 lines of code


Outline

• A basic shim layer & preliminary evaluation

• Three add-on features in a shim layer

Evaluation

Micro-benchmark (non MapReduce)

• MapReduce benchmark


Micro-benchmark

• Cluster configuration

• 16 nodes

• Pentium D dual-core 3.0GHz

• 4 GB Memory

• One 7200 rpm SATA 160 GB (8 MB buffer)

• Gigabit Ethernet

• Use file system API directly without Hadoop involvement


N clients, each reads 1/N of a single file

• Round-robin file layout in PVFS helps avoid contention

[Figure: aggregate read throughput (MB/s) versus number of clients (1 to 16), comparing PVFS (no replication) and HDFS (no replication).]


Why is PVFS better in this case?

• Without scheduling, clients read in a uniform pattern

• Client1 reads A1 then A4

• Client2 reads A2 then A5

• Client3 reads A3 then A6

• PVFS

• Round-robin placement

• HDFS

• Random placement

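[Diagram: chunks A1 to A6 on three storage servers. With PVFS round-robin placement the servers hold (A1, A4), (A2, A5), and (A3, A6); with HDFS random placement, in the example shown, they hold (A1, A3), (A2, A5), and (A4, A6), which is marked "Contention".]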
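The contention argument can be checked with a small calculation: at each step, count how many clients hit the same server. The sketch below is illustrative only. `max_load_per_step` is a hypothetical helper, and the two layouts are assumed example placements of A1 to A6 on three servers (round-robin for PVFS, one possible random outcome for HDFS), not measured data.

```python
from collections import Counter


def max_load_per_step(layout, schedule):
    """Largest number of clients hitting one server at each step (sketch).

    layout:   dict mapping chunk -> server index
    schedule: one list of chunks per client, read in lockstep
    """
    steps = len(schedule[0])
    result = []
    for step in range(steps):
        load = Counter(layout[client[step]] for client in schedule)
        result.append(max(load.values()))
    return result


# Assumed example placements on three servers (0, 1, 2).
round_robin = {"A1": 0, "A2": 1, "A3": 2, "A4": 0, "A5": 1, "A6": 2}
random_like = {"A1": 0, "A3": 0, "A2": 1, "A5": 1, "A4": 2, "A6": 2}

# Uniform pattern from the slide: client i reads chunk i, then chunk i + 3.
uniform = [["A1", "A4"], ["A2", "A5"], ["A3", "A6"]]

pvfs_load = max_load_per_step(round_robin, uniform)   # [1, 1]
hdfs_load = max_load_per_step(random_like, uniform)   # [2, 2]
```

A load of 2 means two clients share one server's disk and network link during that step while another server sits idle.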


HDFS with Hadoop’s scheduling

• Example 1:

• Client1 reads A1 then A4

• Client2 reads A2 then A5

• Client3 reads A6 then A3

• Example 2:

• Client1 reads A1 then A3

• Client2 reads A2 then A5

• Client3 reads A4 then A6

[Diagram: the HDFS placement of chunks A1 to A6 across storage servers, repeated for the reads in each example.]


Read with Hadoop’s scheduling

• Hadoop’s scheduling can mask a problem with a non-uniform file layout in HDFS

[Figure: bar chart of completion time (sec) for Read (16GB, 16 nodes), comparing PVFS and HDFS.]


N clients write to N distinct files

• By writing one of three copies locally, HDFS write throughput grows linearly

[Figure: aggregate write throughput (MB/s) versus number of clients (1 to 16), comparing PVFS (no replication) and HDFS (no replication).]


Concurrent writes to a single file

• By allowing concurrent writes in PVFS, “copy” completes faster by using multiple writers
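The idea of several writers filling one destination file can be sketched on a local file system. This is not the benchmark's code: `parallel_copy` is a hypothetical helper that uses threads and ordinary file handles, where the experiment used separate clients writing through PVFS. Each writer copies one disjoint byte range, so no two writers touch the same bytes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_copy(src_path, dst_path, writers=16):
    """Copy a file with several writers, each on a disjoint range (sketch)."""
    size = os.path.getsize(src_path)
    part = -(-size // writers)          # ceiling division: bytes per writer

    # Create the destination at its final size so writers can seek into it.
    with open(dst_path, "wb") as f:
        f.truncate(size)

    def copy_range(start):
        length = min(part, size - start)
        # Each writer uses its own handles, so file positions are independent.
        with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
            src.seek(start)
            dst.seek(start)
            dst.write(src.read(length))

    if size:
        with ThreadPoolExecutor(max_workers=writers) as pool:
            list(pool.map(copy_range, range(0, size, part)))
```

A file system that allows only one writer per file would have to run these ranges one after another, which is the single-writer case in the comparison.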

[Figure: bar chart of completion time (sec) for Parallel Copy (16GB, 16 nodes), comparing PVFS (16 writers) and HDFS (1 writer).]


Outline

• A basic shim layer & preliminary evaluation

• Three add-on features in a shim layer

Evaluation

• Micro-benchmark (non MapReduce)

MapReduce benchmark


MapReduce benchmark setting

• Yahoo! M45 cluster

• Use 50-100 nodes

• Xeon quad-core 1.86 GHz with 6GB Memory

• One 7200 rpm SATA 750 GB (8 MB buffer)

• Gigabit Ethernet

• Use Hadoop framework for MapReduce processing


MapReduce benchmark

• Grep: Search for a rare pattern in a hundred million 100-byte records (100GB)

• Sort: Sort a hundred million 100-byte records (100GB)

• Never-Ending Language Learning (NELL): (J. Betteridge, CMU) Count the occurrences of selected phrases in a 37GB data set


Read-Intensive Benchmark

• PVFS’s performance is similar to HDFS

[Figure: bar chart of completion time (sec) for Grep (100GB, 50 nodes), comparing PVFS and HDFS.]

[Figure: bar chart of completion time (sec) for NELL (37GB, 100 nodes), comparing PVFS and HDFS.]


Write-Intensive Benchmark

• By writing one of three copies locally, HDFS does better than PVFS

[Figure: bar chart of completion time (sec) for Sort (100GB, 50 nodes), comparing PVFS, HDFS, and PVFS (2 copies).]

[Figure: bar chart of network traffic (GB) for Sort (100GB, 50 nodes), comparing PVFS, HDFS, and PVFS (2 copies).]


Summary

• PVFS can be tuned to deliver promising performance for Hadoop applications

• Simple shim layer in Hadoop

• No modification to PVFS

• PVFS can expose file layout information

• Enable Hadoop to collocate computation and data

• Hadoop applications can benefit from concurrent writing supported by parallel file systems


Acknowledgements

• Sam Lang and Rob Ross for help with PVFS internals

• Yahoo! for the M45 cluster

• Julio Lopez for help with M45 and Hadoop

• Justin Betteridge, Le Zhao, Jamie Callan, Shay Cohen, Noah Smith, U Kang and Christos Faloutsos for their scientific applications
