Shark: Scaling File Servers via Cooperative Caching

DESCRIPTION

Network file systems offer a powerful, transparent interface for accessing remote data. Unfortunately, in current network file systems like NFS, clients fetch data from a central file server, inherently limiting the system's ability to scale to many clients. While recent distributed (peer-to-peer) systems have managed to eliminate this scalability bottleneck, they are often exceedingly complex and provide non-standard models for administration and accountability. We present Shark, a novel system that retains the best of both worlds -- the scalability of distributed systems with the simplicity of central servers. Shark is a distributed file system designed for large-scale, wide-area deployment, while also providing a drop-in replacement for local-area file systems. Shark introduces a novel cooperative-caching mechanism, in which mutually distrustful clients can exploit each other's file caches to reduce load on an origin file server. Using a distributed index, Shark clients find nearby copies of data, even when files originate from different servers. Performance results show that Shark can greatly reduce server load and improve client latency for read-heavy workloads, both in the wide and local areas, while still remaining competitive for single clients in the local area. Thus, Shark enables modestly provisioned file servers to scale to hundreds of read-mostly clients while retaining traditional usability, consistency, security, and accountability.

TRANSCRIPT
Shark: Scaling File Servers via Cooperative Caching
Siddhartha Annapureddy, Michael J. Freedman, David Mazières
New York University
Networked Systems Design and Implementation, 2005
Motivation
Scenario – Want to test a new application on a hundred nodes
[Diagram: many workstations spread across the network]

Problem – Need to push software to all nodes and collect results
Current Approaches
Distribute from a central server
[Diagram: clients on Network A and Network B all fetching from a central server]

rsync, unison, scp
– Server's network uplink saturates
  • Operation takes a long time to finish
– Wastes bandwidth along bottleneck links
Current Approaches
File distribution mechanisms
BitTorrent, Bullet
+ Scales by offloading the server's burden onto clients
– Clients may download data from halfway across the world
Inherent Problems with Copying Files
Users have to decide a priori what to ship
  • Ship too much – wastes bandwidth and takes a long time
  • Ship too little – hassle to work in a poor environment

Idle environments consume disk space
  • Users are loath to clean up ⇒ redistribution
  • Need a low-cost solution for refetching files

Manual management of software versions

Goal – Illusion of having the full development environment
  • Programs fetched transparently on demand
Networked File Systems
+ Know how to deploy these systems
+ Know how to administer such systems
+ Simple accountability mechanisms
E.g., NFS, AFS, SFS, etc.
Problem: Scalability
[Plot: time to finish a 40 MB read (sec) vs. number of unique nodes. SFS alone: very slow, ≈ 775 s]
Problem: Scalability
[Plot: same 40 MB read with Shark added. Shark ≈ 88 s vs. SFS ≈ 775 s – about 9x better]
Peer-to-Peer Filesystems
P2P Filesystems (Pond, Ivy)
+ Scalability
– Non-standard administrative models
– New filesystem semantics
– Haven’t been widely deployed yet
Goal – Best of Both Worlds
  • Convenience of central servers
  • Scalability of peer-to-peer
Let’s Design this New Filesystem
[Diagram: a single client talking to the server]

Vanilla SFS
  • Central server model
Scalability – Client-side Caching
[Diagram: client with a large local cache, backed by the server]

Large client cache à la AFS
  • Scales by avoiding redundant data transfers
  • Leases to ensure NFS-style cache consistency (sketched below)
  • With multiple clients, must address bandwidth concerns
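As a rough illustration of the lease idea, here is a minimal Python sketch; this is not Shark's actual code, and the lease length, CacheEntry class, and fetch_attrs/fetch_data helpers are hypothetical:

    import time

    LEASE_DURATION = 30.0   # hypothetical lease length, in seconds

    class CacheEntry:
        """A cached file body plus its version and lease expiry."""
        def __init__(self, data: bytes, version: int):
            self.data = data
            self.version = version
            self.lease_expiry = time.monotonic() + LEASE_DURATION

        def fresh(self) -> bool:
            return time.monotonic() < self.lease_expiry

    def read(cache: dict, fh, fetch_attrs, fetch_data) -> bytes:
        """Serve reads from cache while the lease holds; after it expires,
        revalidate with the server and refetch only if the file changed."""
        entry = cache.get(fh)
        if entry is not None and entry.fresh():
            return entry.data                 # lease valid: no server round trip
        attrs = fetch_attrs(fh)               # lease expired: fetch attributes
        if entry is not None and attrs.version == entry.version:
            entry.lease_expiry = time.monotonic() + LEASE_DURATION
            return entry.data                 # unchanged: renew lease, reuse cache
        data = fetch_data(fh)                 # changed or not cached: fetch data
        cache[fh] = CacheEntry(data, attrs.version)
        return data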
Scalability – Cooperative Caching

Clients fetch data from each other and offload burden from the server

[Diagram: clients exchanging chunks among themselves, with the origin server behind them]

  • Shark clients maintain a distributed index
  • Fetch a file from multiple other clients in parallel
  • LBFS-style chunks – variable-sized blocks (sketched below)
    • Chunks are better – exploit commonalities across files
    • Chunks are preserved across file versions and concatenations
  • Locality awareness – preferentially fetch chunks from nearby clients
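A minimal Python sketch of LBFS-style content-defined chunking. A simple polynomial rolling hash stands in for LBFS's Rabin fingerprints; the 48-byte window and the 2 KiB / 8 KiB / 64 KiB minimum / average / maximum chunk sizes follow the LBFS paper, but the other constants here are illustrative assumptions:

    WINDOW = 48                          # bytes in the sliding window
    BASE, PRIME = 257, 1_000_000_007     # polynomial hash parameters (illustrative)
    TOPW = pow(BASE, WINDOW - 1, PRIME)  # weight of the byte leaving the window
    BOUNDARY_MASK = 0x1FFF               # boundary when low 13 bits are 0 -> ~8 KiB average
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

    def chunks(data: bytes):
        """Split data into variable-sized, content-defined chunks. Boundaries
        depend only on nearby bytes, so an edit early in a file shifts at most
        a few chunks instead of re-aligning every fixed-size block."""
        start, h = 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:                                  # slide the window:
                h = (h - data[i - WINDOW] * TOPW) % PRIME    # drop the oldest byte
            h = (h * BASE + b) % PRIME                       # add the newest byte
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or size >= MAX_CHUNK:
                yield data[start : i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]                               # final partial chunk

Because boundaries are content-defined, identical regions in two different files, or in two versions of one file, split into the same chunks and hence the same tokens, which is what makes the cross-file and cross-version sharing above work.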
Cross-filesystem Sharing
Global cooperative cache regardless of origin servers
[Diagram: two groups of clients, one accessing server A and one accessing server B, joined into a single cache]

  • Two groups of clients accessing servers A and B
  • Client groups share a large amount of software
  • Such clients automatically form a global cache
Application
Linux distribution with LiveCDs
[Diagram: Knoppix and Slax LiveCD clients running sharkcd, fetching from a Knoppix mirror through the shared cache]

  • LiveCD – run an entire OS without using the hard disk
  • But all your programs must fit on a CD-ROM
  • Downloading dynamically from a server raises scalability problems
  • Knoppix and Slax users form a global cache – relieves the mirror servers
Cooperative Caching – Roadmap

[Diagram: client fetches metadata from the server, looks up peers in the overlay, and downloads chunks from them]

  • Fetch metadata from the server
  • Look up clients caching the needed chunks in the overlay
  • Connect to multiple clients and download chunks in parallel
  • Check the integrity of fetched chunks
    (clients are mutually distrustful)
File Metadata

[Diagram: client sends GET_TOKEN (fh, offset, count) to the server; the server returns chunk tokens]

  • Possession of a token implies server permission to read
  • Tokens are a shared secret between authorized clients
  • Tokens can be used to check the integrity of fetched data

Given a chunk B...
  • Chunk token TB = H(B)
  • H is a collision-resistant hash function
Discovering Clients Caching Chunks

[Diagram: client queries the distributed index to discover other clients caching a chunk]

  • For every chunk B, there is an indexing key IB
  • IB is used to index clients caching B
  • Cannot set IB = TB, as TB is a secret
  • Instead, IB = MAC(TB, “Indexing Key”) – sketched below
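A minimal sketch of both derivations. SHA-1 as H and HMAC-SHA-1 as the MAC are assumptions about the concrete primitives; the slides only require a collision-resistant hash and a MAC:

    import hashlib, hmac

    def chunk_token(chunk: bytes) -> bytes:
        """TB = H(B): knowing the token proves the server authorized a read of B."""
        return hashlib.sha1(chunk).digest()   # assumed hash function

    def indexing_key(token: bytes) -> bytes:
        """IB = MAC(TB, "Indexing Key"): safe to publish in the distributed
        index, since the MAC cannot be inverted to recover the secret TB."""
        return hmac.new(token, b"Indexing Key", hashlib.sha1).digest()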
Locality Awareness

[Diagram: clients clustered by network, Network A and Network B]

  • Overlay organized into clusters based on latency
  • The indexing infrastructure preferentially returns sources in the same cluster as the client
  • Hence, chunks are usually transferred from nearby clients
Final Steps

Download chunk
  • Security issues discussed later

Register as a source
  • Client now becomes a source for the downloaded chunk
  • Client registers in the distributed index – PUT(IB, Addr) (sketched below)

Chunk reconciliation
  • Reuse the connection to download more chunks
  • Exchange mutually needed chunks without indexing overhead
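The distributed index behaves like a multi-writer key-to-value map; Shark builds it on Coral. The put/get interface below is a hypothetical in-memory stand-in to show the pattern, not Coral's API:

    import random

    class DistributedIndex:
        """Hypothetical stand-in for the indexing layer: many clients may
        register under the same indexing key, and lookups return a few of them."""
        def __init__(self):
            self.table: dict[bytes, list[str]] = {}

        def put(self, key: bytes, addr: str) -> None:
            self.table.setdefault(key, []).append(addr)   # register a source

        def get(self, key: bytes, n: int = 3) -> list[str]:
            sources = self.table.get(key, [])
            return random.sample(sources, min(n, len(sources)))

After downloading and verifying chunk B, a client calls put(IB, its own address), so every completed download creates a new source; the real index spreads this table across the overlay and biases get() toward the caller's latency cluster.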
Security Issues – Client Authentication

[Diagram: the server authenticates clients traditionally; between peers, an authenticator is derived from the chunk token]

  • Traditionally, the server authenticated read requests using uids
  • Challenge – how does a client know when to send chunks to another client?
  • The chunk token lets a client recognize authorized clients
Security Issues – Client Communication

  • A client should be able to check the integrity of a downloaded chunk
  • A client should not send chunks to unauthorized clients
  • An eavesdropper should not be able to obtain chunk contents
Security Protocol

[Protocol diagram: receiver C sends nonce RC to source P; P replies with nonce RP; C sends IB and AuthC; P responds with EK(B)]

  • RC, RP – random nonces to ensure freshness
  • AuthC – authenticator proving the receiver holds the token
    AuthC = MAC(TB, “Auth C”, C, P, RC, RP)
  • K – key used to encrypt the chunk contents
    K = MAC(TB, “Encryption”, C, P, RC, RP)
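A sketch of the two MAC derivations. HMAC-SHA-1 is assumed as the MAC, and the byte encoding of the fields (plain concatenation with separators) is also an assumption, since any unambiguous encoding works:

    import hashlib, hmac

    def _mac(token: bytes, *fields: bytes) -> bytes:
        """MAC keyed on the secret chunk token TB."""
        return hmac.new(token, b"|".join(fields), hashlib.sha1).digest()

    def authenticator(TB: bytes, C: bytes, P: bytes, RC: bytes, RP: bytes) -> bytes:
        """AuthC: proves the receiver knows TB, bound to both parties'
        identities and both nonces so it cannot be replayed later."""
        return _mac(TB, b"Auth C", C, P, RC, RP)

    def session_key(TB: bytes, C: bytes, P: bytes, RC: bytes, RP: bytes) -> bytes:
        """K: per-session encryption key; fresh nonces yield a fresh key."""
        return _mac(TB, b"Encryption", C, P, RC, RP)

An eavesdropper without TB can compute neither AuthC nor K, and since the receiver checks H(chunk) = TB after decrypting, a malicious source cannot pass off bogus data.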
Security Properties

[Protocol diagram as above]

A client can check the integrity of a downloaded chunk
  • The client checks H(downloaded chunk) =? TB

A source does not send chunks to unauthorized clients
  • Malicious clients cannot produce a correct AuthC

An eavesdropper cannot obtain chunk contents
  • All communication is encrypted with K
Security Properties

[Protocol diagram as above]

Privacy limitations for world-readable files
  • An eavesdropper can track clients' lookups
  • An eavesdropper can hash known data and learn exactly what a client downloaded

For private files, a solution is described in the paper
  • It sacrifices cross-FS sharing for better privacy

Forward secrecy is not guaranteed
Implementation

[Diagram: on each client, application syscalls pass through the kernel NFSv3 client to sharkcd, which consults corald and fetches from peers or from sharksd on the server; sharksd exports the file system over NFSv3]

  • sharkcd – client daemon; incorporates the source/receiver client functionality
  • sharksd – server daemon; incorporates the chunking mechanisms
  • corald – a node in the indexing infrastructure
Evaluation

  • How does Shark compare with SFS? With NFS?
  • How scalable is the server?
  • How fair is Shark across clients?
  • Which fetch order is better, random or sequential?
  • What are the benefits of set reconciliation?
Emulab – 100 Nodes on LAN

[Plot: percentage of clients completing a 40 MB read within a given time, for Shark (random order, with negotiation), NFS, and SFS]

  • Shark – 88 s
  • SFS – 775 s (Shark ≈ 9x better); NFS – 350 s (≈ 4x better)
  • SFS is less fair because of TCP backoffs
PlanetLab – 185 Nodes

[Plot: percentage of clients completing a 40 MB read within a given time, Shark vs. SFS]

  • Shark ≈ 7 min at the 95th percentile
  • SFS ≈ 39 min at the 95th percentile (Shark ≈ 5x better)
  • NFS – triggered kernel panics on the server
Data pushed by Server

[Bar chart: normalized bandwidth pushed by the server – SFS ≈ 7400 MB vs. Shark ≈ 954 MB]

  • Shark vs. SFS – the server sent ≈ 23 copies of the file instead of 185 (≈ 8x less data)
Data served by Clients

[Plot: total bandwidth transmitted (MB) by each unique Shark proxy across the PlanetLab hosts, 40 MB file]

  • Maximum contribution ≈ 3.5 copies
  • Median contribution ≈ 1.5 copies
  • Minimum contribution ≈ 0.75 copies
Fetching Chunks – Order Matters

  • In what order should we fetch the chunks of a file?
  • Natural choices – random or sequential (see the sketch after this list)

Intuitively, when many clients start simultaneously:
  • Random
    • All clients fetch independent chunks
    • More chunks become available in the cooperative cache
  • Sequential
    • Better disk I/O scheduling on the server
    • Only the client furthest ahead adds new chunks to the cache
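A toy sketch of the two policies (the function name is hypothetical; the real client interleaves fetching with the index lookups described earlier):

    import random

    def fetch_order(num_chunks: int, policy: str) -> list[int]:
        """Return the order in which a client requests a file's chunks."""
        order = list(range(num_chunks))
        if policy == "random":
            random.shuffle(order)   # clients diverge immediately, so the
                                    # cooperative cache fills with distinct chunks
        return order                # "sequential": everyone asks in file order

With the random policy, two clients that start together almost surely request different chunks first and can then trade them, which is why random wins on the next slide.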
Emulab – 100 Nodes on LAN

[Plot: completion CDF for a 40 MB read, Shark random vs. sequential fetch order]

  • Random – 133 s
  • Sequential – 203 s
  • Random wins – 35% better
Emulab – 100 Nodes on LAN

[Plot: completion CDF for a 40 MB read, Shark random vs. random with reconciliation]

  • Random + reconciliation – 88 s
  • Random alone – 133 s
  • Reconciliation is crucial – a 34% improvement
Conclusions
  • Networked file systems offer a convenient interface
  • Current networked file systems like NFS are not scalable
    • This forces users to resort to inconvenient tools like rsync
  • Shark offers a file system that scales to hundreds of clients
    • Locality-aware cooperative cache
    • Supports cross-FS sharing, enabling novel applications
Questions
http://www.scs.cs.nyu.edu/shark