Beyond the File System
Designing Large Scale File Storage and Serving
Cal Henderson
Web Builder 2.0 2
Hello!
Big file systems?
• Too vague!
• What is a file system?
• What constitutes big?
• Some requirements would be nice
1. Scalable
Looking at storage and serving infrastructures
2. Reliable
Looking at redundancy, failure rates, on-the-fly changes
3. Cheap
Looking at upfront costs, TCO and lifetimes
Four buckets
Storage
Serving
BCP
Cost
Storage
The storage stack
File protocol: NFS, CIFS, SMB
File system: ext, reiserFS, NTFS
Block protocol: SCSI, SATA, FC
RAID: Mirrors, Stripes
Hardware: Disks and stuff
Hardware overview
The storage scale, lower to higher: Internal, DAS, SAN, NAS
Internal storage
• A disk in a computer
  – SCSI, IDE, SATA
• 4 disks in 1U is common
• 8 for half depth boxes
DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30 – 14 disks in 3U
SAN
• Storage Area Network
• Dumb disk shelves
• Clients connect via a ‘fabric’
• Fibre Channel, iSCSI, InfiniBand
  – Low level protocols
NAS
• Network Attached Storage
• Intelligent disk shelf
• Clients connect via a network
• NFS, SMB, CIFS
  – High level protocols
Of course, it’s more confusing than that
Meet the LUN
• Logical Unit Number
• A slice of storage space
• Originally for addressing a single drive:
  – c1t2d3
  – Controller, Target, Disk (Slice)
• Now means a virtual partition/volume
  – LVM, Logical Volume Management
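As a small illustration of the older controller/target/disk addressing, here is a sketch that splits a name such as c1t2d3 into its parts. The function name and the optional `s<slice>` suffix are assumptions made for the example, not something the slides specify.

```python
import re

# Assumed naming pattern: c<controller>t<target>d<disk>, optionally s<slice>.
DEVICE_NAME = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s(\d+))?$")


def parse_device_name(name):
    """Split a controller/target/disk name into integer fields."""
    match = DEVICE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a controller/target/disk name: {name!r}")
    controller, target, disk, slice_ = match.groups()
    return {
        "controller": int(controller),
        "target": int(target),
        "disk": int(disk),
        # The slice is optional in this sketch.
        "slice": int(slice_) if slice_ is not None else None,
    }
```

A virtual LUN carries no such physical meaning, which is the point of the second bullet: the number names a volume, not a drive.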
NAS vs SAN
With SAN, a single host (initiator) owns a single LUN/volume
With NAS, multiple hosts own a single LUN/volume
NAS head – NAS access to a SAN
SAN Advantages
Virtualization within a SAN offers some nice features:
• Real-time LUN replication
• Transparent backup
• SAN booting for host replacement
Some Practical Examples
• There are a lot of vendors
• Configurations vary
• Prices vary wildly
• Let’s look at a couple
  – Ones I happen to have experience with
  – Not an endorsement ;)
NetApp Filers
Heads and shelves, up to 500TB in 260U
FC SAN with 1 or 2 NAS heads
Isilon IQ
• 2U Nodes, 3-96 nodes/cluster, 6-600 TB
• FC/InfiniBand SAN with NAS head on each node
Scaling
Vertical vs Horizontal
Vertical scaling
• Get a bigger box
• Bigger disk(s)
• More disks
• Limited by current tech – size of each disk and total number in appliance
Horizontal scaling
• Buy more boxes
• Add more servers/appliances
• Scales forever*
*sort of
Storage scaling approaches
• Four common models:
• Huge FS
• Physical nodes
• Virtual nodes
• Chunked space
Huge FS
• Create one giant volume with growing space
  – Sun’s ZFS
  – Isilon IQ
• Expandable on-the-fly?
• Upper limits
  – Always limited somewhere
Huge FS
• Pluses
  – Simple from the application side
  – Logically simple
  – Low administrative overhead
• Minuses
  – All your eggs in one basket
  – Hard to expand
  – Has an upper limit
Physical nodes
• Application handles distribution to multiple physical nodes
  – Disks, Boxes, Appliances, whatever
• One ‘volume’ per node
• Each node acts by itself
• Expandable on-the-fly – add more nodes
• Scales forever
Physical Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
• Minuses
  – Many ‘mounts’ to manage
  – More administration
Virtual nodes
• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
• Multiple volumes per node
• Flexible
• Expandable on-the-fly – add more nodes
• Scales forever
Virtual Nodes
• Pluses
  – Limitless expansion
  – Easy to expand
  – Unlikely to all fail at once
  – Addressing is logical, not physical
  – Flexible volume sizing, consolidation
• Minuses
  – Many ‘mounts’ to manage
  – More administration
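A minimal sketch of the virtual node idea, assuming the application stores a virtual volume number with each file and resolves it to a physical node through a lookup table. The host names, path layout and function names are hypothetical.

```python
import hashlib

# Hypothetical mapping of virtual volumes to physical nodes. Moving a volume
# means editing this table; the addresses stored with each file do not change.
VOLUME_TO_NODE = {
    0: "storage01",
    1: "storage01",
    2: "storage02",
    3: "storage03",
}


def pick_volume(file_key, volume_count=len(VOLUME_TO_NODE)):
    """Choose a virtual volume for a NEW file by hashing its key."""
    digest = hashlib.sha256(file_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % volume_count


def locate(volume, file_key):
    """Turn a logical address (volume, key) into a physical location."""
    node = VOLUME_TO_NODE[volume]
    return f"{node}:/vol{volume}/{file_key}"
```

Because the application records the chosen volume number, `pick_volume` is only consulted at write time. Consolidating volume 2 onto a new box changes one table entry and nothing else.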
Chunked space
• Storage layer writes parts of files to different physical nodes
• A higher-level RAID striping
• High performance for large files
  – Read multiple parts simultaneously
Chunked space
• Pluses
  – High performance
  – Limitless size
• Minuses
  – Conceptually complex
  – Can be hard to expand on the fly
  – Can’t manually poke it
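To make the "higher-level RAID striping" idea concrete, here is a toy sketch that splits a file into fixed-size chunks and spreads them round-robin across nodes. Round-robin placement and in-memory lists standing in for nodes are assumptions for illustration; real systems choose placement and handle replication in their own ways.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks, nodes):
    """Round-robin placement: chunk i goes to node i mod len(nodes)."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append((index, chunk))
    return placement


def reassemble(placement):
    """Gather chunks from every node and join them in index order."""
    indexed = [item for stored in placement.values() for item in stored]
    return b"".join(chunk for _, chunk in sorted(indexed))
```

No single node holds a readable file, which is why this model "can't be manually poked".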
Real Life
Case Studies
GFS – Google File System
• Developed by … Google
• Proprietary
• Everything we know about it is based on talks they’ve given
• Designed to store huge files for fast access
GFS – Google File System
• Single ‘Master’ node holds metadata
  – Single point of failure (SPOF); a shadow master allows warm swap
• Grid of ‘chunkservers’
  – 64-bit chunk identifiers
  – 64 MB file chunks
GFS – Google File System
[Diagram: Master node; chunkservers holding chunks labeled 1(a), 1(b), 2(a)]
GFS – Google File System
• Client reads metadata from master then file parts from multiple chunkservers
• Designed for big files (>100MB)
• Master server allocates access leases
• Replication is automatic and self-repairing
  – Synchronously for atomicity
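Assuming fixed 64 MB chunks as described above, a client can work out which chunk indexes a byte range touches before asking the master where those chunks live. This is only a sketch of that arithmetic; the function name is hypothetical and nothing here reflects Google's actual client code.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def chunks_for_range(offset, length):
    """Return the chunk indexes covered by bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A read that spans several indexes can be issued to several chunkservers at once, which is where the parallelism comes from.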
GFS – Google File System
• Reading is fast (parallelizable)
  – But requires a lease
• Master server is required for all reads and writes
MogileFS – OMG Files
• Developed by Danga / SixApart
• Open source
• Designed for scalable web app storage
MogileFS – OMG Files
• Single metadata store (MySQL)
  – MySQL Cluster avoids a single point of failure (SPOF)
• Multiple ‘tracker’ nodes locate files
• Multiple ‘storage’ nodes store files
MogileFS – OMG Files
[Diagram: two Tracker nodes and a MySQL metadata store]
MogileFS – OMG Files
• Replication of file ‘classes’ happens transparently
• Storage nodes are not mirrored – replication is piecemeal
• Reading and writing go through trackers, but are performed directly upon storage nodes
Flickr File System
• Developed by Flickr
• Proprietary
• Designed for very large scalable web app storage
Flickr File System
• No metadata store
  – Deal with it yourself
• Multiple ‘StorageMaster’ nodes
• Multiple storage nodes with virtual volumes
Flickr File System
[Diagram: three StorageMaster (SM) nodes]
Flickr File System
• Metadata stored by app
  – Just a virtual volume number
  – App chooses a path
• Virtual nodes are mirrored
  – Locally and remotely
• Reading is done directly from nodes
Flickr File System
• StorageMaster nodes only used for write operations
• Reading and writing can scale separately
Serving
Serving files
Serving files is easy!
[Diagram: Apache serving from a Disk]
Serving files
Scaling is harder
[Diagram: three Apache and Disk pairs]
Serving files
• This doesn’t scale well
• Primary storage is expensive
  – And takes a lot of space
• In many systems, we only access a small number of files most of the time
Caching
• Insert caches between the storage and serving nodes
• Cache frequently accessed content to reduce reads on the storage nodes
• Software (Squid, mod_cache)
• Hardware (Netcache, Cacheflow)
Why it works
• Keep a smaller working set
• Use faster hardware
  – Lots of RAM
  – SCSI
  – Outer edge of disks (ZCAV)
• Use more duplicates
  – Cheaper, since they’re smaller
Two models
• Layer 4
  – ‘Simple’ balanced cache
  – Objects in multiple caches
  – Good for few objects requested many times
• Layer 7
  – URL balanced cache
  – Objects in a single cache
  – Good for many objects requested a few times
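The Layer 7 model can be sketched as a hash of the URL choosing the cache, so each object lives in exactly one place. The cache names and the plain modulo scheme are assumptions for illustration; a real balancer might use consistent hashing so that adding a cache moves fewer objects.

```python
import hashlib

CACHES = ["cache1", "cache2", "cache3"]  # hypothetical cache hosts


def cache_for_url(url, caches=CACHES):
    """Pick the single cache responsible for this URL."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return caches[int.from_bytes(digest[:8], "big") % len(caches)]
```

A Layer 4 balancer, by contrast, ignores the URL, so a popular object ends up copied into every cache.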
Replacement policies
• LRU – Least recently used
• GDSF – Greedy dual size frequency
• LFUDA – Least frequently used with dynamic aging
• All have advantages and disadvantages
• Performance varies greatly with each
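As a reference point for the simplest of these policies, here is a minimal LRU sketch built on Python's `OrderedDict`. It counts entries rather than bytes, which is a simplification; GDSF and LFUDA additionally weigh object size and request frequency and are not shown.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # A hit makes this entry the most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Drop the entry at the "oldest" end.
            self._items.popitem(last=False)
```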
Cache Churn
• How long do objects typically stay in cache?
• If it gets too short, we’re doing badly
  – But it depends on your traffic profile
• Make the cached object store larger
Problems
• Caching has some problems:
  – Invalidation is hard
  – Replacement is dumb (even LFUDA)
• Avoiding caching makes your life (somewhat) easier
CDN – Content Delivery Network
• Akamai, Savvis, Mirror Image Internet, etc
• Caches operated by other people
  – Already in place
  – In lots of places
• GSLB/DNS balancing
Edge networks
[Diagram: a single Origin]
Edge networks
[Diagram: the Origin surrounded by eight edge Caches]
CDN Models
• Simple model
  – You push content to them, they serve it
• Reverse proxy model
  – You publish content on an origin, they proxy and cache it
CDN Invalidation
• You don’t control the caches
  – Just like those awful ISP ones
• Once something is cached by a CDN, assume it can never change
  – Nothing can be deleted
  – Nothing can be modified
Versioning
• When you start to cache things, you need to care about versioning
  – Invalidation & Expiry
  – Naming & Sync
Cache Invalidation
• If you control the caches, invalidation is possible
• But remember ISP and client caches
• Remove deleted content explicitly
  – Avoid users finding old content
  – Save cache space
Cache versioning
• Simple rule of thumb:
  – If an item is modified, change its name (URL)
• This can be independent of the file system!
Virtual versioning
• Database indicates version 3 of file
• Web app writes version number into URL
• Request comes through cache and is cached with the versioned URL
• mod_rewrite converts versioned URL to path
Version 3
example.com/foo_3.jpg
Cached: foo_3.jpg
foo_3.jpg -> foo.jpg
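The slide does this with mod_rewrite; the same two steps can be sketched in Python to show the logic. The `_<version>` suffix placed before the extension follows the foo_3.jpg example, while the function names and the regular expression are assumptions.

```python
import re

# Matches "<stem>_<digits><.ext>", e.g. "/foo_3.jpg".
VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<version>\d+)(?P<ext>\.[A-Za-z0-9]+)$")


def versioned_url(path, version):
    """Web app side: write the version from the database into the URL."""
    stem, dot, ext = path.rpartition(".")
    if not dot:
        return f"{path}_{version}"
    return f"{stem}_{version}.{ext}"


def strip_version(path):
    """Origin side: map the versioned URL back to the real file path."""
    match = VERSIONED.match(path)
    if match is None:
        return path
    return match.group("stem") + match.group("ext")
```

One caveat with this naive pattern: a file whose real name already ends in `_<digits>` would also be rewritten, so a production rule would want a less ambiguous marker.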
Authentication
• Authentication inline layer
  – Apache / perlbal
• Authentication sideline
  – ICP (CARP/HTCP)
• Authentication by URL
  – FlickrFS
Auth layer
• Authenticator sits between client and storage
• Typically built into the cache software
[Diagram: Cache, Authenticator, Origin]
Auth sideline
• Authenticator sits beside the cache
• Lightweight protocol used for authenticator
[Diagram: Cache with the Authenticator beside it, Origin]
Auth by URL
• Someone else performs authentication and gives URLs to client (typically the web app)
• URLs hold the ‘keys’ for accessing files
[Diagram: Web Server, Cache, Origin]
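One common way to put the "keys" in the URL is an HMAC signature with an expiry time. The sketch below assumes that approach; it is not a description of how FlickrFS does it, and the secret, parameter names and URL layout are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical secret shared by web app and origin


def _token(path, expires, secret):
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_path(path, expires, secret=SECRET):
    """Web app side: hand the client a URL that carries its own key."""
    return f"{path}?expires={expires}&token={_token(path, expires, secret)}"


def is_valid(path, expires, token, now, secret=SECRET):
    """Origin side: accept only unexpired, correctly signed requests."""
    if now > expires:
        return False
    return hmac.compare_digest(_token(path, expires, secret), token)
```

Because the check needs no session lookup, the origin can validate requests without ever talking to the web app.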
BCP
Business Continuity Planning
• How can I deal with the unexpected?
  – The core of BCP
• Redundancy
• Replication
Reality
• On a long enough timescale, anything that can fail, will fail
• Of course, everything can fail
• True reliability comes only through redundancy
Reality
• Define your own SLAs
• How long can you afford to be down?
• How manual is the recovery process?
• How far can you roll back?
• How many node x boxes can fail at once?
Failure scenarios
• Disk failure
• Storage array failure
• Storage head failure
• Fabric failure
• Metadata node failure
• Power outage
• Routing outage
Reliable by design
• RAID avoids disk failures, but not head or fabric failures
• Duplicated nodes avoid host and fabric failures, but not routing or power failures
• Dual-colo avoids routing and power failures, but may need duplication too
Tend to all points in the stack
• Going dual-colo: great
• Taking a whole colo offline because of a single failed disk: bad
• We need a combination of these
Recovery times
• BCP is not just about continuing when things fail
• How can we restore after they come back?
• Host and colo level syncing
  – Replication queuing
• Host and colo level rebuilding
Reliable Reads & Writes
• Reliable reads are easy
  – 2 or more copies of files
• Reliable writes are harder
  – Write 2 copies at once
  – But what do we do when we can’t write to one?
Dual writes
• Queue up data to be written
  – Where?
  – Needs itself to be reliable
• Queue up journal of changes
  – And then read data from the disk whose write succeeded
• Duplicate whole volume after failure
  – Slow!
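The journal option can be sketched as follows: when one copy is unavailable, record only which keys it missed, then replay them from the healthy copy on recovery. Two dictionaries stand in for the disks, and the journal lives in memory, which a real system could not afford since the journal itself must be reliable. The class and method names are hypothetical.

```python
from collections import deque


class MirroredStore:
    """Toy two-way mirror that journals keys a failed replica missed."""

    def __init__(self):
        self.replicas = [{}, {}]            # stand-ins for two disks
        self.online = [True, True]
        self.journal = [deque(), deque()]   # per-replica list of missed keys

    def write(self, key, data):
        if not any(self.online):
            raise OSError("no replica available")
        for i in (0, 1):
            if self.online[i]:
                self.replicas[i][key] = data
            else:
                # Record only the key; the data is read back later
                # from the replica whose write succeeded.
                self.journal[i].append(key)

    def recover(self, i):
        """Bring replica i back and replay its journal from the other copy."""
        source = self.replicas[1 - i]
        self.online[i] = True
        while self.journal[i]:
            key = self.journal[i].popleft()
            self.replicas[i][key] = source[key]
```

Replaying a short journal is far cheaper than the third option, copying the whole volume.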
Cost
Judging cost
• Per GB?
• Per GB upfront and per year
• Not as simple as you’d hope
  – How about an example
Hardware costs
Cost of hardware ÷ Usable GB
Single cost
Power costs
Cost of power per year ÷ Usable GB
Recurring cost
Power costs
Power installation cost ÷ Usable GB
Single cost
Space costs
[Cost per U × U’s needed (inc. network)] ÷ Usable GB
Recurring cost
Network costs
Cost of network gear ÷ Usable GB
Single cost
Misc costs
[Support contracts + spare disks + bus adaptors + cables] ÷ Usable GB
Single & recurring costs
Human costs
[Admin cost per node × Node count] ÷ Usable GB
Recurring cost
TCO
• Total cost of ownership in two parts
  – Upfront
  – Ongoing
• Architecture plays a huge part in costing
  – Don’t get tied to hardware
  – Allow heterogeneity
  – Move with the market
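The per-GB fractions from the cost slides can be rolled into one calculation. This sketch assumes recurring figures are per year and leaves out the misc costs for brevity; the parameter names are hypothetical and any numbers fed in are your own, not from the talk.

```python
def cost_per_gb(usable_gb, hardware, network_gear, power_install,
                power_per_year, cost_per_u_per_year, rack_units,
                admin_cost_per_node_per_year, node_count, years):
    """Split TCO per usable GB into upfront and ongoing parts."""
    # Single (upfront) costs.
    upfront = hardware + network_gear + power_install
    # Recurring costs, assumed to be yearly.
    yearly = (power_per_year
              + cost_per_u_per_year * rack_units
              + admin_cost_per_node_per_year * node_count)
    return {
        "upfront_per_gb": upfront / usable_gb,
        "yearly_per_gb": yearly / usable_gb,
        "total_per_gb": (upfront + yearly * years) / usable_gb,
    }
```

Running this for two candidate architectures over the same lifetime makes it easier to see when cheap hardware is offset by space, power or admin costs.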
(fin)
Photo credits
• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/
You can find these slides online:
iamcal.com/talks/