Beyond the File System
Designing Large Scale File Storage and Serving
Cal Henderson
Web 2.0 Expo, 17 April 2007
Hello!
Big file systems?
• Too vague!
• What is a file system?
• What constitutes big?
• Some requirements would be nice
1. Scalable – looking at storage and serving infrastructures
2. Reliable – looking at redundancy, failure rates, on-the-fly changes
3. Cheap – looking at upfront costs, TCO and lifetimes
Four buckets
• Storage
• Serving
• BCP
• Cost
Storage
The storage stack
• File protocol – NFS, CIFS, SMB
• File system – ext, reiserFS, NTFS
• Block protocol – SCSI, SATA, FC
• RAID – Mirrors, Stripes
• Hardware – Disks and stuff
Hardware overview
The storage scale, from lower to higher: Internal → DAS → SAN → NAS
Internal storage
• A disk in a computer
  – SCSI, IDE, SATA
• 4 disks in 1U is common
• 8 for half depth boxes
DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30 – 14 disks in 3U
SAN
• Storage Area Network
• Dumb disk shelves
• Clients connect via a ‘fabric’
• Fibre Channel, iSCSI, InfiniBand
  – Low-level protocols
NAS
• Network Attached Storage
• Intelligent disk shelf
• Clients connect via a network
• NFS, SMB, CIFS
  – High-level protocols
Of course, it’s more confusing than that
Meet the LUN
• Logical Unit Number
• A slice of storage space
• Originally for addressing a single drive:
  – c1t2d3
  – Controller, Target, Disk (Slice)
• Now means a virtual partition/volume
  – LVM, Logical Volume Management
NAS vs SAN
With a SAN, a single host (the initiator) owns a single LUN/volume
With NAS, multiple hosts share a single LUN/volume
NAS head – NAS access to a SAN
SAN Advantages
Virtualization within a SAN offers some nice features:
• Real-time LUN replication
• Transparent backup
• SAN booting for host replacement
Some Practical Examples
• There are a lot of vendors
• Configurations vary
• Prices vary wildly
• Let’s look at a couple
  – Ones I happen to have experience with
  – Not an endorsement ;)
NetApp Filers
Heads and shelves, up to 500TB in 6 Cabs
FC SAN with 1 or 2 NAS heads
Isilon IQ
• 2U Nodes, 3-96 nodes/cluster, 6-600 TB
• FC/InfiniBand SAN with NAS head on each node
Scaling
Vertical vs Horizontal
Vertical scaling
• Get a bigger box
• Bigger disk(s)
• More disks
• Limited by current tech – size of each disk and total number in appliance
Horizontal scaling
• Buy more boxes
• Add more servers/appliances
• Scales forever*
*sort of
Storage scaling approaches
• Four common models:
• Huge FS
• Physical nodes
• Virtual nodes
• Chunked space
Huge FS
• Create one giant volume with growing space
  – Sun’s ZFS
  – Isilon IQ
• Expandable on-the-fly?
• Upper limits
  – Always limited somewhere
Huge FS
• Pluses
  – Simple from the application side
  – Logically simple
  – Low administrative overhead
• Minuses
  – All your eggs in one basket
  – Hard to expand
  – Has an upper limit
Physical nodes
• Application handles distribution to multiple physical nodes
  – Disks, boxes, appliances, whatever
• One ‘volume’ per node
• Each node acts by itself
• Expandable on-the-fly – add more nodes
• Scales forever
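A minimal sketch of the app-side distribution described above, with hypothetical node names: hash the file's ID so every client independently agrees on which physical node holds it, with no central lookup.

```python
import hashlib

# Hypothetical list of physical storage nodes; in practice these would be
# mount points or hostnames of independent disks, boxes, or appliances.
NODES = ["node01", "node02", "node03", "node04"]

def node_for(file_id: str) -> str:
    """Map a file ID to one physical node with a stable hash."""
    digest = hashlib.md5(file_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

The catch: adding a node changes the modulus and remaps most files, which is part of why the virtual-node model below is more flexible.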
Physical Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
• Minuses
  – Many ‘mounts’ to manage
  – More administration
Virtual nodes
• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
• Multiple volumes per node
• Flexible
• Expandable on-the-fly – add more nodes
• Scales forever
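The virtual-node idea can be sketched as a level of indirection (all names hypothetical): files are addressed by virtual volume number, and a small mapping table says where each volume currently lives, so a volume can be moved without changing any file addresses.

```python
# Hypothetical mapping: many virtual volumes spread across a few physical
# nodes. Files are addressed as (volume, file), never by physical node.
volume_to_node = {1: "node01", 2: "node01", 3: "node02", 4: "node02"}

def locate(volume: int) -> str:
    """Logical address -> current physical node, via the mapping table."""
    return volume_to_node[volume]

# Rebalancing: move volume 2 to a new node by updating one table entry;
# no file address anywhere has to change.
volume_to_node[2] = "node03"
```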
Virtual Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
  – Addressing is logical, not physical
  – Flexible volume sizing and consolidation
• Minuses
  – Many ‘mounts’ to manage
  – More administration
Chunked space
• Storage layer writes parts of files to different physical nodes
• A higher-level RAID striping
• High performance for large files
  – Read multiple parts simultaneously
Chunked space
• Pluses
  – High performance
  – Limitless size
• Minuses
  – Conceptually complex
  – Can be hard to expand on the fly
  – Can’t manually poke it
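Chunked space in miniature — a toy sketch (tiny chunk size and made-up node names) of striping a file's bytes across nodes so the parts can later be fetched in parallel:

```python
CHUNK_SIZE = 4  # bytes, for illustration; real systems use much more

# Hypothetical storage nodes the chunks are striped across.
NODES = ["n1", "n2", "n3"]

def chunk_placement(data: bytes):
    """Split data into fixed-size chunks, striped round-robin over nodes."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [(NODES[i % len(NODES)], chunk) for i, chunk in enumerate(chunks)]
```

Reassembly is just concatenating the chunks back in order, which is also why a reader can pull the parts from several nodes at once.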
Real Life
Case Studies
GFS – Google File System
• Developed by … Google
• Proprietary
• Everything we know about it is based on talks they’ve given
• Designed to store huge files for fast access
GFS – Google File System
• Single ‘Master’ node holds metadata
  – SPF – a shadow master allows warm swap
• Grid of ‘chunkservers’
  – 64-bit filenames
  – 64 MB file chunks
GFS – Google File System
[diagram: master node above chunkservers holding replicated chunks 1(a), 1(b), 2(a)]
GFS – Google File System
• Client reads metadata from master then file parts from multiple chunkservers
• Designed for big files (>100MB)
• Master server allocates access leases
• Replication is automatic and self-repairing
  – Synchronous, for atomicity
GFS – Google File System
• Reading is fast (parallelizable)
  – But requires a lease
• Master server is required for all reads and writes
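The read path above can be sketched as a toy (all metadata and chunk data here are invented): one round-trip to the master for metadata, then the file's bytes come straight from the chunkservers.

```python
# Hypothetical master metadata: file name -> ordered (chunk id, chunkserver).
master_metadata = {"bigfile": [("chunk-1", "cs-a"), ("chunk-2", "cs-b")]}

# Hypothetical chunkservers: only they hold actual file bytes.
chunkservers = {"cs-a": {"chunk-1": b"hello "}, "cs-b": {"chunk-2": b"world"}}

def read_file(name: str) -> bytes:
    chunks = master_metadata[name]          # one metadata round-trip
    return b"".join(chunkservers[cs][cid]   # then chunkservers directly;
                    for cid, cs in chunks)  # parallelizable in real life
```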
MogileFS – OMG Files
• Developed by Danga / SixApart
• Open source
• Designed for scalable web app storage
MogileFS – OMG Files
• Single metadata store (MySQL)
  – MySQL Cluster avoids SPF
• Multiple ‘tracker’ nodes locate files
• Multiple ‘storage’ nodes store files
MogileFS – OMG Files
[diagram: multiple tracker nodes backed by a shared MySQL metadata store]
MogileFS – OMG Files
• Replication of file ‘classes’ happens transparently
• Storage nodes are not mirrored – replication is piecemeal
• Reading and writing go through trackers, but are performed directly upon storage nodes
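A toy sketch of that tracker flow (keys, paths, and bytes all hypothetical): the tracker only looks up replica paths; the client then talks to the storage nodes directly, trying replicas in turn.

```python
# Hypothetical metadata: key -> paths of its replicas on storage nodes.
metadata = {"photo:1": ["http://store01/1.fid", "http://store02/1.fid"]}

# Hypothetical storage nodes, modeled as path -> bytes.
storage = {"http://store01/1.fid": b"jpeg-bytes"}

def tracker_get_paths(key: str):
    return metadata[key]            # tracker does lookup only, no bytes

def client_read(key: str) -> bytes:
    for path in tracker_get_paths(key):
        if path in storage:         # try each replica until one answers
            return storage[path]
    raise IOError("no replica available")
```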
Flickr File System
• Developed by Flickr
• Proprietary
• Designed for very large scalable web app storage
Flickr File System
• No metadata store
  – Deal with it yourself
• Multiple ‘StorageMaster’ nodes
• Multiple storage nodes with virtual volumes
Flickr File System
[diagram: multiple ‘StorageMaster’ (SM) nodes in front of the storage nodes]
Flickr File System
• Metadata stored by the app
  – Just a virtual volume number
  – App chooses a path
• Virtual nodes are mirrored
  – Locally and remotely
Flickr File System
• StorageMaster nodes only used for write operations
• Reading and writing can scale separately
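The write/read split above, as a hedged toy (volume numbers, paths, and data structures are all invented for illustration): a StorageMaster picks a volume for a write and the app remembers that number; reads never touch the StorageMaster.

```python
# Hypothetical volumes: number -> node plus the files it holds.
volumes = {7: {"node": "store-a", "files": {}}}

def storagemaster_write(path: str, data: bytes) -> int:
    vol = 7                                  # SM chooses a volume for the write
    volumes[vol]["files"][path] = data
    return vol                               # app stores this volume number

def read(vol: int, path: str) -> bytes:
    return volumes[vol]["files"][path]       # straight to the node, no SM
```

Because reads bypass the StorageMasters entirely, read and write capacity can be grown independently, as the slide says.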
Amazon S3
• A big disk in the sky
• Multiple ‘buckets’
• Files have user-defined keys
• Data + metadata
Amazon S3
[diagram: your servers storing files into Amazon]
Amazon S3
[diagram: your servers store into Amazon; users are served directly from Amazon]
The cost
• Fixed price, by the GB
• Store: $0.15 per GB per month
• Serve: $0.20 per GB
The cost
[graphs: S3’s per-GB cost grows linearly, compared with stepped regular-bandwidth pricing]
End costs
• ~$2k to store 1 TB for a year
• ~$63 a month to serve 1 Mb/sec
• ~$65k a month to serve 1 Gb/sec
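A back-of-envelope check of those end costs from the 2007 prices above ($0.15/GB/month to store, $0.20/GB to serve). Binary units (1 TB = 1024 GB, 1 Mb = 2^20 bits) and a 30-day month are my assumptions, not the slide's:

```python
STORE_PER_GB_MONTH = 0.15   # $ per GB stored, per month
SERVE_PER_GB = 0.20         # $ per GB transferred out

def store_cost_per_year(tb: float) -> float:
    """Yearly storage bill for tb terabytes."""
    return tb * 1024 * STORE_PER_GB_MONTH * 12

def serve_cost_per_month(mbit_per_sec: float) -> float:
    """Monthly serving bill for a sustained rate (assumes 30-day month)."""
    gb = mbit_per_sec * (2**20 / 8) * 30 * 86400 / 2**30
    return gb * SERVE_PER_GB

# store_cost_per_year(1)  -> ~1843, the slide's "~$2k"
# serve_cost_per_month(1) -> ~63
```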
Serving
Serving files
Serving files is easy!
[diagram: Apache serving straight off a disk]
Serving files
Scaling is harder
[diagram: many independent Apache + disk pairs]
Serving files
• This doesn’t scale well
• Primary storage is expensive
  – And takes a lot of space
• In many systems, we only access a small number of files most of the time
Caching
• Insert caches between the storage and serving nodes
• Cache frequently accessed content to reduce reads on the storage nodes
• Software (Squid, mod_cache)
• Hardware (Netcache, Cacheflow)
Why it works
• Keep a smaller working set
• Use faster hardware
  – Lots of RAM
  – SCSI
  – Outer edge of disks (ZCAV)
• Use more duplicates
  – Cheaper, since they’re smaller
Two models
• Layer 4
  – ‘Simple’ balanced cache
  – Objects in multiple caches
  – Good for few objects requested many times
• Layer 7
  – URL-balanced cache
  – Objects in a single cache
  – Good for many objects requested a few times
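The Layer-7 model can be sketched in a few lines (cache names hypothetical): hash the URL so each object lives in exactly one cache, maximizing the total number of unique objects held across the tier.

```python
import hashlib

# Hypothetical cache tier; a Layer-7 balancer would route by URL hash.
CACHES = ["cache01", "cache02", "cache03"]

def cache_for(url: str) -> str:
    """Every request for the same URL lands on the same cache."""
    h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return CACHES[h % len(CACHES)]
```

Layer 4, by contrast, would pick a cache by connection rather than URL, so a hot object ends up duplicated in every cache — good when a few objects dominate traffic.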
Replacement policies
• LRU – Least recently used
• GDSF – Greedy dual size frequency
• LFUDA – Least frequently used with dynamic aging
• All have advantages and disadvantages
• Performance varies greatly with each
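Of the three, LRU is the simplest to sketch; GDSF and LFUDA additionally weigh object size and fetch cost. A minimal LRU cache over Python's ordered dictionary:

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used object when the cache is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion order == recency order

    def get(self, key):
        if key not in self.data:
            return None             # cache miss
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
```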
Cache Churn
• How long do objects typically stay in cache?
• If it gets too short, we’re doing badly
  – But it depends on your traffic profile
• Make the cached object store larger
Problems
• Caching has some problems:
  – Invalidation is hard
  – Replacement is dumb (even LFUDA)
• Avoiding caching makes your life (somewhat) easier
CDN – Content Delivery Network
• Akamai, Savvis, Mirror Image Internet, etc
• Caches operated by other people
  – Already in place
  – In lots of places
• GSLB/DNS balancing
Edge networks
[diagram: a single origin serving everyone directly]
Edge networks
[diagram: the origin surrounded by many edge caches]
CDN Models
• Simple model
  – You push content to them, they serve it
• Reverse proxy model
  – You publish content on an origin, they proxy and cache it
CDN Invalidation
• You don’t control the caches
  – Just like those awful ISP ones
• Once something is cached by a CDN, assume it can never change
  – Nothing can be deleted
  – Nothing can be modified
Versioning
• When you start to cache things, you need to care about versioning
  – Invalidation & expiry
  – Naming & sync
Cache Invalidation
• If you control the caches, invalidation is possible
• But remember ISP and client caches
• Remove deleted content explicitly
  – Avoid users finding old content
  – Save cache space
Cache versioning
• Simple rule of thumb:
  – If an item is modified, change its name (URL)
• This can be independent of the file system!
Virtual versioning
• Database indicates version 3 of the file (“Version 3”)
• Web app writes the version number into the URL (example.com/foo_3.jpg)
• Request comes through the cache and is cached under the versioned URL (cached: foo_3.jpg)
• mod_rewrite converts the versioned URL to a path (foo_3.jpg → foo.jpg)
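The two ends of that flow can be sketched in a few lines (the domain and filenames follow the slide's example; the regex stands in for what an actual mod_rewrite rule would do on the server):

```python
import re

def versioned_url(name: str, version: int) -> str:
    """Web-app side: embed the current version number in the URL."""
    stem, ext = name.rsplit(".", 1)
    return f"example.com/{stem}_{version}.{ext}"

def rewrite_to_path(url_path: str) -> str:
    """Server side: strip the version to find the one real file on disk."""
    return re.sub(r"_\d+(\.\w+)$", r"\1", url_path)
```

The version exists only in the URL namespace; on disk there is a single foo.jpg, which is the sense in which versioning is independent of the file system.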
Authentication
• Authentication inline layer
  – Apache / perlbal
• Authentication sideline
  – ICP (CARP/HTCP)
• Authentication by URL
  – FlickrFS
Auth layer
• Authenticator sits between client and storage
• Typically built into the cache software
[diagram: client → cache (with built-in authenticator) → origin]
Auth sideline
• Authenticator sits beside the cache
• Lightweight protocol used for authenticator
[diagram: the cache consults a sideline authenticator before fetching from the origin]
Auth by URL
• Someone else performs authentication and gives URLs to client (typically the web app)
• URLs hold the ‘keys’ for accessing files
[diagram: web server hands authenticated URLs to the client; the cache fetches from the origin]
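One common way to put the ‘keys’ in the URL is an expiring HMAC signature — a sketch under my own assumptions (the secret, parameter names, and URL shape are all hypothetical, not how FlickrFS necessarily did it):

```python
import hashlib
import hmac

SECRET = b"hypothetical-shared-secret"  # known to the web app and the origin

def signed_url(path: str, expires: int) -> str:
    """Web app side: embed an expiry time and a signature in the URL."""
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?expires={expires}&sig={sig}"

def check(path: str, expires: int, sig: str, now: int) -> bool:
    """Origin side: recompute the signature and reject stale URLs."""
    good = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, good) and now < expires
```

No auth infrastructure sits in the serving path at all: anyone holding a valid URL can fetch, and the URL itself stops working after it expires.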
BCP
Business Continuity Planning
• How can I deal with the unexpected?
  – The core of BCP
• Redundancy
• Replication
Reality
• On a long enough timescale, anything that can fail, will fail
• Of course, everything can fail
• True reliability comes only through redundancy
Reality
• Define your own SLAs
• How long can you afford to be down?
• How manual is the recovery process?
• How far can you roll back?
• How many $node boxes can fail at once?
Failure scenarios
• Disk failure
• Storage array failure
• Storage head failure
• Fabric failure
• Metadata node failure
• Power outage
• Routing outage
Reliable by design
• RAID avoids disk failures, but not head or fabric failures
• Duplicated nodes avoid host and fabric failures, but not routing or power failures
• Dual-colo avoids routing and power failures, but may need duplication too
Tend to all points in the stack
• Going dual-colo: great
• Taking a whole colo offline because of a single failed disk: bad
• We need a combination of these
Recovery times
• BCP is not just about continuing when things fail
• How can we restore after they come back?
• Host- and colo-level syncing
  – Replication queuing
• Host- and colo-level rebuilding
Reliable Reads & Writes
• Reliable reads are easy
  – 2 or more copies of files
• Reliable writes are harder
  – Write 2 copies at once
  – But what do we do when we can’t write to one?
Dual writes
• Queue up data to be written
  – Where?
  – Needs itself to be reliable
• Queue up a journal of changes
  – And then read data from the disk whose write succeeded
• Duplicate the whole volume after failure
  – Slow!
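The journal approach can be sketched as follows (two in-memory dicts stand in for the two disks; a real journal would itself need reliable storage, as the slide notes): write both copies, record any miss, and replay it from the surviving copy when the node returns.

```python
# Hypothetical pair of mirrored stores and a journal of missed writes.
stores = {"a": {}, "b": {}}
journal = []   # (node, key) pairs to replay; must itself live somewhere reliable

def write(key: str, data: bytes, down=()):
    """Write to both copies; journal any write that can't happen now."""
    for node in stores:
        if node in down:
            journal.append((node, key))   # remember the miss for later
        else:
            stores[node][key] = data

def replay():
    """After the failed node returns: copy missed writes from the good disk."""
    while journal:
        node, key = journal.pop()
        other = "b" if node == "a" else "a"
        stores[node][key] = stores[other][key]
```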
Cost
Judging cost
• Per GB?
• Per GB upfront and per year
• Not as simple as you’d hope
  – How about an example?
Hardware costs
(cost of hardware) ÷ (usable GB) — a single cost
Power costs
(cost of power per year) ÷ (usable GB) — a recurring cost
Power costs
(power installation cost) ÷ (usable GB) — a single cost
Space costs
(cost per U × U’s needed, inc. network) ÷ (usable GB) — a recurring cost
Network costs
(cost of network gear) ÷ (usable GB) — a single cost
Misc costs
(support contracts + spare disks + bus adaptors + cables) ÷ (usable GB) — single & recurring costs
Human costs
(admin cost per node × node count) ÷ (usable GB) — a recurring cost
TCO
• Total cost of ownership in two parts
  – Upfront
  – Ongoing
• Architecture plays a huge part in costing
  – Don’t get tied to hardware
  – Allow heterogeneity
  – Move with the market
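The per-GB formulas above can be pulled into one small calculator (every number you pass in is hypothetical; amortizing single costs over a straight-line lifetime is my simplifying assumption):

```python
def tco_per_gb_per_year(usable_gb, hardware, power_install, network,
                        lifetime_years, power_per_year, space_per_year,
                        admin_per_year):
    """Upfront costs amortized over the lifetime, plus recurring costs,
    all divided by usable capacity."""
    upfront = (hardware + power_install + network) / lifetime_years
    ongoing = power_per_year + space_per_year + admin_per_year
    return (upfront + ongoing) / usable_gb
```

Splitting upfront from ongoing this way makes the slide's point concrete: two systems with the same sticker price can have very different totals once power, space, and admin recur every year.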
(fin)
Photo credits
• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/
You can find these slides online:
iamcal.com/talks/