2012 Storage Developer Conference. © Nebula, Inc. All Rights Reserved.
Data integrity in the Cloud
Christoph Hellwig
Nebula, Inc
Data Integrity
Data integrity in brief:
    Writing data reliably to persistent storage
    Retrieving the same data again later
Application writes
[Diagram: Application -> OS Cache -> Disk]
Applications only write to the OS cache
    Reliable ACKs lost
Need fsync to:
    Transfer data to disk
    Get a reliable ACK
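A minimal sketch of the point above, assuming a POSIX-like system: `write()` only hands the data to the OS cache, and `os.fsync()` is what asks the OS to transfer it to the disk and report the result. The helper names `durable_write` and `demo` and the file name are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after fsync has reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Up to here the data may live only in the OS cache.
        # fsync asks the OS to push it to the disk; an OSError raised
        # here means the write must not be treated as acknowledged.
        os.fsync(fd)
    finally:
        os.close(fd)


def demo() -> bytes:
    """Write a record durably to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "record.bin")
        durable_write(path, b"important record\n")
        with open(path, "rb") as handle:
            return handle.read()
```

Whether the data then really is on the platters still depends on the lower layers honoring the flush, which is the subject of the next slide.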
Operating System writes
[Diagram: Operating System -> Disk Cache -> Disk]
Often the OS writes to the disk cache only
    Reliable ACKs lost
Need a cache flush to:
    Transfer data to disk
    Get a reliable ACK
Write granularity
Byte-level writes are not atomic
OS caches operate on page granularity (usually 4096 bytes)
Disk drives operate on sectors (usually 512 bytes)
Striped RAID operates on much larger granularities (but pretends not to)
Write granularity issues
Too small writes need read-modify-write cycles
    May cause untouched data to be corrupted
Too large writes can be torn
    Half-updated data may be on disk
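Both issues can be sketched with a toy in-memory device that only accepts whole 512-byte sectors. Everything here (`SectorDevice`, `write_bytes`, `write_page`, the `fail_after` crash knob) is a hypothetical model for illustration, not a real driver interface.

```python
SECTOR = 512  # bytes, the usual sector size


class SectorDevice:
    """Toy block device that can only read or write whole sectors."""

    def __init__(self, sectors: int) -> None:
        self.data = bytearray(sectors * SECTOR)

    def read_sector(self, n: int) -> bytes:
        return bytes(self.data[n * SECTOR:(n + 1) * SECTOR])

    def write_sector(self, n: int, buf: bytes) -> None:
        if len(buf) != SECTOR:
            raise ValueError("device only accepts whole sectors")
        self.data[n * SECTOR:(n + 1) * SECTOR] = buf


def write_bytes(dev: SectorDevice, offset: int, payload: bytes) -> None:
    """Too-small write: needs a read-modify-write of the whole sector."""
    n, start = divmod(offset, SECTOR)
    if start + len(payload) > SECTOR:
        raise ValueError("sketch only handles writes within one sector")
    buf = bytearray(dev.read_sector(n))            # read
    buf[start:start + len(payload)] = payload      # modify
    # The write-back also rewrites bytes the caller never touched, so a
    # failure at this step can damage them.
    dev.write_sector(n, bytes(buf))                # write


def write_page(dev: SectorDevice, first: int, page: bytes, fail_after=None) -> int:
    """Too-large write: one page spans several sectors and can be torn.

    fail_after simulates a crash after that many sectors were written.
    Returns the number of sectors that actually reached the device.
    """
    if len(page) % SECTOR:
        raise ValueError("page must be a multiple of the sector size")
    count = len(page) // SECTOR
    for i in range(count):
        if fail_after is not None and i >= fail_after:
            return i  # crash: the remaining sectors keep their old contents
        dev.write_sector(first + i, page[i * SECTOR:(i + 1) * SECTOR])
    return count
```

Calling `write_page(dev, 0, b"B" * 4096, fail_after=3)` on a device that previously held a page of `A` bytes leaves three sectors of new data followed by five sectors of old data, which is the half-updated state the slide refers to.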
Reading data
As long as the disk and data are there that's easy, right?
    Silent data corruption happens (a lot)
    Disk might fail entirely and go away
Data protection
Disks use error correction to deal with bit flips
    Reads that succeed should return the right data
To protect against disk failure data is stored in multiple places
    N-way mirroring (e.g. RAID1)
    Erasure encoding (e.g. RAID5/6)
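As an illustration of the erasure encoding idea, here is a sketch of single XOR parity, the scheme underlying RAID5. It ignores parity rotation and the second syndrome of RAID6, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost):
    """Recompute the block at index `lost` from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)
```

With three data blocks and one parity block, any single lost block (data or parity) equals the XOR of the other three, so one disk can fail without losing data. N-way mirroring is the degenerate case where every "parity" block is a full copy.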
I/O architecture complexity
[Diagram: three I/O stacks of increasing complexity: Personal Computer, Enterprise Server, Enterprise Virtualization. Layer labels: Guest, File System, Volume Manager, Multipathing, Hypervisor, Driver, RAID / SAN, Disk]
I/O architecture complexity
[Diagram: four Guests and "The Cloud!"]
The Cloud?
Cloud computing in I/O terms:
    Virtualization as abstraction
    Massive scale distributed systems (Object Storage)
Cloud I/O architectures (1)
[Diagram: I/O stack. Layer labels: Guest, File System, Hypervisor, Multipathing, Driver, RAID]
The “Enterprise” I/O stack
    Just treats “cloud” as a management layer
    Scale-out of compute and disk is handled independently
Cloud I/O architectures (2)
[Diagram: I/O stack. Layer labels: Guest, File System, Hypervisor, Driver, Disk]
“Instance” storage
    Local non-fault-tolerant storage
    Used for instance-local temporary data
    Part of the AWS and OpenStack models
Cloud I/O architectures (3)
[Diagram labels: Guest, Hypervisor, Driver, Distributed storage system, Disk]
Distributed storage
    Access by internal networking
    Data stored on a large number of nodes
Cloud data integrity issues
Mapping between layers
    Complicated I/O stacks
    Mapping block I/O semantics onto file system semantics
Distributed systems
    Replication needs to be location-aware
    Failure is the norm
Without words
I/O layering issues
Caching semantics need to be preserved over all layers
    Including mapping them from block device layers to file systems and back
On distributed systems multiple writers complicate semantics a lot
    This may include live migration
Data placement
[Diagram: three racks, each with one switch and six nodes]
Data placement
[Diagram: three racks, each with one switch and six nodes]
Good placement?
Data placement
[Diagram: three racks, each with one switch and six nodes]
Better placement?
Data placement
[Diagram: three racks, each with one switch and six nodes]
Or maybe this?
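One possible location-aware policy for the rack layout above, sketched under the assumption that a rack (with its single switch) is the failure domain: put each copy in a different rack. This is an illustrative policy, not necessarily the placement the slides favor; `place_replicas` and the hash-based ranking are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def _score(*parts: str) -> bytes:
    """Deterministic pseudo-random rank derived from the given names."""
    return hashlib.sha256("/".join(parts).encode()).digest()


def place_replicas(
    obj_id: str, racks: Dict[str, List[str]], copies: int = 3
) -> List[Tuple[str, str]]:
    """Pick one node in each of `copies` distinct racks for obj_id.

    Ranking racks and nodes by a hash of the object id spreads objects
    over the cluster while keeping the result reproducible.
    """
    if copies > len(racks):
        raise ValueError("not enough racks for one copy per rack")
    chosen = sorted(racks, key=lambda rack: _score(obj_id, rack))[:copies]
    return [
        (rack, min(racks[rack], key=lambda node: _score(obj_id, rack, node)))
        for rack in chosen
    ]
```

With this policy the loss of one rack or one switch takes out at most one copy. The trade-off is that every write crosses rack boundaries, which is presumably part of what the three placement slides are weighing.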
In-flight data protection
Data might get corrupted in-flight
    Lots of potentially buggy I/O layers
    Large clusters → multiple hops through switches
Various protocols offer error checking or correction
    E.g. iSCSI crc32c
But we'd really like to be able to verify integrity through the whole I/O stack
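To make the per-protocol checking concrete, here is a sketch of CRC-32C (the Castagnoli polynomial used by the iSCSI digests), written bit by bit for clarity rather than speed. The `frame` and `unframe` helpers and their trailer layout are hypothetical and are not the iSCSI wire format.

```python
import struct


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C using the reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def frame(payload: bytes) -> bytes:
    """Sender side: append the checksum as a 4-byte trailer."""
    return payload + struct.pack("<I", crc32c(payload))


def unframe(message: bytes) -> bytes:
    """Receiver side: verify the trailer and return the payload."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload = message[:-4]
    (expected,) = struct.unpack("<I", message[-4:])
    if crc32c(payload) != expected:
        raise IOError("checksum mismatch: data corrupted in flight")
    return payload
```

A check like this only covers one hop or one protocol. Each layer that strips and regenerates the checksum leaves a window in which corruption goes unnoticed, which is why the last bullet asks for verification through the whole stack.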
T10 DIF/DIX
End-to-end data protection
    Stores application checksum in extended disk sectors
    Standard format that can be validated by all intermediate layers
Issues:
    Requires disk support, but not part of the ATA standard
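A rough sketch of the extended-sector idea, assuming the common layout of 512 data bytes followed by 8 bytes of protection information: a 16-bit guard CRC (polynomial 0x8BB7), a 16-bit application tag and a 32-bit reference tag taken from the low bits of the block address. The helper names are hypothetical, and the sketch models only the format, not any HBA or kernel interface.

```python
import struct

SECTOR = 512   # data bytes per sector
PI_SIZE = 8    # protection information bytes appended to each sector


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Return the 520-byte extended sector for one 512-byte block."""
    if len(data) != SECTOR:
        raise ValueError("expected exactly one sector of data")
    pi = struct.pack(">HHI", crc16_t10dif(data), app_tag, lba & 0xFFFFFFFF)
    return data + pi


def verify(block: bytes, lba: int) -> bytes:
    """Check guard and reference tag; any layer in the stack can do this."""
    if len(block) != SECTOR + PI_SIZE:
        raise ValueError("expected one extended sector")
    data = block[:SECTOR]
    guard, _app_tag, ref = struct.unpack(">HHI", block[SECTOR:])
    if guard != crc16_t10dif(data):
        raise IOError("guard tag mismatch: data corrupted")
    if ref != lba & 0xFFFFFFFF:
        raise IOError("reference tag mismatch: block from the wrong address")
    return data
```

Because the tuple travels with the data, a block that was corrupted or written to the wrong address fails `verify` at whichever layer checks it first, instead of being discovered (or not) by the application later.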
Questions?