qing zheng lin xiao, kai ren, garth gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “dir lookup...

27
Qing Zheng Lin Xiao, Kai Ren, Garth Gibson

Upload: vandung

Post on 11-Jul-2019

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Qing Zheng Lin Xiao, Kai Ren, Garth Gibson

Page 2: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

File System Architecture

Low-level Storage Infrastructure

Metadata Service APP APP APP APP

APP APP APP APP

I/O operations

metadata operations

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 2

Parallel data path with decoupled metadata path

[flat namespace]

Page 3: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Metadata = 1 + 2 + 3

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 3

Metadata Representation

File System App Library

Namespace Indexing

Small File Storage

Large File Block Indexing

2 1 3

Page 4: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Decoupled != Scalable Single metadata server HDFS, Lustre 1.x

Statically partitioned metadata servers PVFS, Federated HDFS, NFS v4.1

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 4

Many existing metadata service don’t scale

Page 5: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Our Goal is To Have Really

Scalable Metadata

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 5

Page 6: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Our Goal is To Have Really

Scalable Metadata

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 6

Outline 1. Pathname lookup

important limitation on scalability

2. Client-side caching represented by IndexFS

3. Replicated state represented by ShardFS

4. Experimental results

Page 7: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Path Resolution Hierarchical permission checking

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 7

In order to resolve /a/b/c , need to test /a , a/b , b/c , … /a/b/c/… /a a/b b/c …

1) Permissions to lookup names under an intermediate directory 2) The existence of the name 3) The name represents a directory

A set of recursive tests starting from the root

Page 8: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

2 Categories of Ops

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 8

Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata

mkdir rename on dirs

rmdir

chmod on dirs

chown on dirs

mknod/create unlink

rename on files chmod/chown on files

utime/utimes

“Dir lookup state mutation ops” “Non-conflicting ops”

Page 9: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Naive Implementation 1 lookup RPC to server for each intermediate dir

Parallel Data Lab - http://www.pdl.cmu.edu/

Client

/a/b/c/… Server 0

/a

Client Side Server Side

Server 1

a/b Server 2

b/c

Client

/a/d/… Other Clients

1. Repeated RPCs

2. Hot spot

servers holding

names at the top

of the tree

BOTTLENECKS

PDL Group Meeting 9

Page 10: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Outline 1. Pathname lookup

important limitation on scalability

2. Client-side caching represented by IndexFS

3. Replicated state represented by ShardFS

4. Experimental results

PDL Group Meeting 10

Page 11: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Design Choice #1 Outline 1. Pathname lookup

important limitation on scalability

2. Client-side caching represented by IndexFS

3. Replicated state represented by ShardFS

4. Experimental results

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 11

Lease and cache dir lookup states at clients Block mutation ops until all leases have expired

Expiration-time Cache-Entry

client-side cache table

Max-expiration-time Cache-Id

server-side cache table

Fewer repeated RPCs & simple server states

Page 12: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

IndexFS Design Distributes namespace on a per-dir partition basic Path resolution conducted by clients with an consistent lease-based lookup cache

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 12

c

/

a

b d

e 1 2 3

/a

a/c

c/2

1

ROOT

3 2

a

d

b e

c

client cache

Page 13: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Outline 1. Pathname lookup

important limitation on scalability

2. Client-side caching represented by IndexFS

3. Replicated state represented by ShardFS

4. Experimental results

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 13

Page 14: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Design Choice #2 Outline 1. Pathname lookup

important limitation on scalability

2. Client-side caching represented by IndexFS

3. Replicated state represented by ShardFS

4. Experimental results

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 14

Replicates dir lookup states to all servers & broadcasts mutation ops to all servers

mkdir

Server 1 Server 2 Server 3 Server n

Principally a better decision if #client >> #server

Page 15: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

ShardFS Design Distributes namespace on a per-file basic (sharding) All metadata servers can accept new files and perform path resolution

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 15

c

/

a

b d

e 1 2 3

ROOT

2

ROOT

ROOT ROOT

1

3

client /a/b/1

/a/c/3

a

b

d

c

e

Page 16: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

RPC Amplification

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 16

path resolution mknod

unlink/getattr mkdir*

rmdir*/readdir chmod/chown on file

chmod*/chown* on dir utime on file/dir

0 1 1

#metadata_servers #metadata_servers

1 #metadata_server

1

0 ~ #path_depth 1 1

1 + 1 1 + #partitions

1 1 1

#path_lookups +

IndexFS ShardFS * dir lookup state mutation op

Page 17: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Micro Benchmark YCSB++ [SoCC11 ]

Balanced tree 1280 files per leaf dir

Zipfian tree files distributed unevenly, creating static hot spots

Uniform/zipfian stat’s random getattrs on files, creating dynamic hot spots

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 17

5 levels

fan out=10

Root dir

Leaf dirs

111K dirs

128M files

Page 18: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Th

rou

gh

pu

t

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 18

10,829

4,728

8,819 8,507

0

4,000

8,000

12,000

16,000

20,000

Balanced Tree Zipfian Tree

IndexFS ShardFS

3,071 3,635

14,654 15,502

Balanced Tree Zipfian Tree

IndexFS ShardFS

2,022 2,105

6,529 6,356

Balanced Tree Zipfian Tree

IndexFS ShardFS

Creation Phase Uniform Stat’s Zipfian Stat’s op/s

metadata footprint

dir splitting load variance

path resolution

hot spots on files

Page 19: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Macro Benchmark YCSB++ [SoCC11 ]

Strong scaling fixed namespace and fixed fs ops

Weak scaling fixed dir structure and fixed dir lookup state mutation ops

with scaling number of files and scaling number of other fs ops

LinkedIn trace replay 1-day HDFS trace with 1.9M dirs & 11.4M files

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 19

s s s s s s s s s s s s

strong scaling 1 2 3 1 2 3

chmod /a open /a/1

chmod /a open /a/1 a a

s s s s s s s s s s s s

weak scaling 1 2 3

chmod /a open /a/1

chmod /a open /a/1 open /a/α

1 β γ α 2 2

a a

Page 20: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 20

80

160

320

640

1280

8 16 32 64 128 8 16 32 64 128

IndexFS-1m IndexFS-250ms

IndexFS-50ms ShardFS

IndexFS-1m IndexFS-250ms

IndexFS-50ms ShardFS

number of metadata servers

Th

rou

gh

pu

t Kop/s Strong Scaling Weak Scaling

Page 21: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

References IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion. Kai Ren, Qing Zheng, Swapnil Patil and Garth Gibson. SC 2014

TableFS: Enhancing metadata efficiency in local file systems. Kai Ren and Garth Gibson. USENIX ATC 2013

Scale and Concurrency in GIGA+: File System Directories with Millions of Files. Swapnil Patil and Garth Gibson. FAST 2009

YCSB++: Benchmarking and Performance Debugging Advanced Features in Scalable Table Stores. Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs, Billie Rinaldi. SoCC 2011

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 21

Page 22: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

BACKUP SLIDES

Page 23: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Target Namespace 90% dirs are small (less than 128 entries)

Large dirs are really huge

90% dirs are of depth 16 or more

Median file size smaller then 64KB for many fs’es

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 23

Page 24: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Distribution of FS Ops

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 24

0%

20%

40%

60%

80%

100%

LinkedIn OpenCloud Yahoo! PROD Yahoo! R&D

Mknod Open Chmod Rename Mkdir Readdir Remove

Page 25: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

ShardFS/IndexFS Overview

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 25

Application

Metadata Server (ShardFS/IndexFS Server)

DFS Metadata Server

DFS Data Server

Metadata Client (ShardFS/IndexFS Client)

metadata storage

metadata operations DFS internal operations

data storage

block operations

Underlying Distributed File System ShardFS or IndexFS

namespace indexing

small file storage

large file block indexing

Page 26: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Metadata Representation Log-structured and indexed data structure

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 26

[(k,v), (k,v), ..., (k,v)]

mkdir

In-mem buffer

SSTable1 SSTable2 SSTable3

Key-Value Store

ShardFS/IndexFS Servers

TableFS [ATC13] IndexFS [SC14]

Page 27: Qing Zheng Lin Xiao, Kai Ren, Garth Gibsonqingzhen/talk/shardindexfs_talk_pdl.pdf · “Dir lookup state mutation ops” “Non-conflicting ops” ... Kai Ren, Qing Zheng, Swapnil

Namespace Metadata = dir index + object attributes + file data for small files

PDL Group Meeting Parallel Data Lab - http://www.pdl.cmu.edu/ 27

Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata

Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata

Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata

Parent Dir Id Obj Name ObjSize ObjMode UserId GroupId Times Embedded File Data ObjId Other Metadata

Key Value

Sorte

d