TOWARD QCOW2 DEDUPLICATION
Benoît Canet
_benoit_ on #qemu / oftc
KVM-Forum / October 2013
What is deduplication?
● Factorizes redundant storage blocks
● Saves disk space
● Can be combined with block compression
● Saves money
● Reads identical blocks only once (cached)
● Encourages SSD use as SSD price/MB approaches hard drive price/MB
Possible uses
● File server
● Catia CAD software: 5-fold decrease in disk use
● Factorize guest containers without AUFS
● Archival (when combined with compression)
Why QCOW2?
● QEMU code is simpler than kernel code
● QCOW2 has the required infrastructure
● QCOW2 is transparent for the guest
● Could work later over NFS/Gluster/Ceph
How does it work?
● Volume is divided into data blocks
● Use QCOW2 logical to physical mapping
● Identical logical blocks pointing to same physical block
● Use QCOW2 reference count for physical block lifecycle
How does it look?
[Figure: logical-to-physical block mapping, shown without dedupe and with dedupe]
First iteration architecture
● Use hashes to identify identical blocks
● 256-bit crypto hashes
● Low probability of collision on 1 EB with 4KB clusters: 2.57E-49
● Non-ECC RAM bit flip rate: 1.3e-12 upsets/bit/hour
● Manipulate all hashes in an in-RAM GTree
● Save hashes on disk indexed by physical block offset
● Write at 100MB/s on an Intel 510 SSD
● QCOW2 read path untouched → Read at full speed
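The collision figure quoted above can be checked with the birthday approximation, p ≈ n(n-1)/2 / 2^b, where n is the number of clusters and b the hash width in bits. The sketch below is a back-of-the-envelope check and is not taken from the talk. It assumes that 1 EB means 10^18 bytes and that a 4KB cluster is 4096 bytes, which is the combination that reproduces the quoted value; the function name is illustrative.

```python
def collision_probability(volume_bytes: int, cluster_bytes: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance that any two clusters share a hash."""
    n = volume_bytes // cluster_bytes   # number of clusters to hash
    pairs = n * (n - 1) // 2            # number of distinct cluster pairs
    return pairs / 2 ** hash_bits       # each pair collides with probability 2^-bits


# Assumption: 1 EB = 10**18 bytes, 4KB cluster = 4096 bytes, 256-bit hash.
p = collision_probability(10 ** 18, 4096, 256)
print(f"{p:.2e}")  # 2.57e-49
```

Under these assumptions the collision probability is many orders of magnitude below the hourly bit flip rate quoted for non-ECC RAM.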
Deduplication algorithm
[Figure: an incoming write IO vector made of new blocks (N) and duplicated blocks (D); runs of D blocks are deduplicated, runs of N blocks are written as sub IO vectors]
The code walks through the write IO vector
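The walk over the write IO vector can be sketched as follows. This is a minimal illustration, not the QEMU code: it assumes SHA-256 as the 256-bit hash, uses a Python set as a stand-in for the in-RAM hash index, and the names `split_write` and `known_hashes` are hypothetical.

```python
import hashlib
from itertools import groupby

CLUSTER_SIZE = 4096  # assumption: 4KB clusters, as on the architecture slide


def split_write(data: bytes, known_hashes: set) -> list:
    """Split a write into runs of ("dedup", clusters) and ("write", clusters).

    known_hashes is a hypothetical stand-in for the in-RAM hash index. It is
    updated as new clusters are seen, so a block repeated within the same
    write is also detected as a duplicate.
    """
    tagged = []
    for off in range(0, len(data), CLUSTER_SIZE):
        cluster = data[off:off + CLUSTER_SIZE]
        digest = hashlib.sha256(cluster).digest()  # 256-bit hash (assumed SHA-256)
        if digest in known_hashes:
            tagged.append(("dedup", cluster))
        else:
            known_hashes.add(digest)
            tagged.append(("write", cluster))
    # Merge adjacent clusters of the same kind into one sub IO vector.
    return [(kind, [c for _, c in run])
            for kind, run in groupby(tagged, key=lambda t: t[0])]


a, b, c = (bytes([i]) * CLUSTER_SIZE for i in range(3))
index = {hashlib.sha256(a).digest()}      # cluster "a" is already stored
runs = split_write(a + a + b + c + a, index)
print([(kind, len(clusters)) for kind, clusters in runs])
# [('dedup', 2), ('write', 2), ('dedup', 1)]
```

Grouping adjacent new blocks keeps the number of write requests low: only the two "write" clusters in the middle reach the disk, as a single sub IO vector.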
First iteration shortcomings
● Writes are not at full SSD speed
● Makes random writes
● Crypto hash uses a lot of CPU
● 80 bytes of RAM per 4KB cluster → too much
Second iteration goals
● Building a key-value store into QCOW2
● Need to reduce memory usage
● Need to make memory usage configurable
SSD storage specificity
● Large sequential writes (Speed)
● No random writes (NAND wear-out)
● Can do fast random reads
● Random reads must be done in parallel to go fast
● Limited number of rewrite cycles (3,000)
Hash storage alternatives
● Disk hash table
● B-tree variants
● SILT
● BufferHash
● QCOW2 key value store
Disk hash table
● A collection of buckets containing hashes
[Figure: a row of hash buckets indexed 0 to N]
Disk hash table
● Pro: O(1) lookup, O(1) insertion
● Con: Generates lots of random writes
● Con: Sparse hash table is inefficient
● Con: Disk hash tables don't grow well
● Con: Write amplification
B-tree
[Figure: three-level B-tree]
Root: 7 44
Nodes: 2 4 | 11 32 | 46 66
Leaves: 2 4 7 11 32 44 46 66 77
B-tree
● Pro: Well-known structure (Bayer, 1972)
● Con: O(log(n)) lookup, not O(1)
● Con: Complex locking protocols
● Con: Generates lots of random writes
● Con: Write amplification
SILT
● SILT is a memory-efficient, high-performance key-value store
● Pro: Made for deduplication needs
● Pro: Made for SSD
● Pro: O(1) lookup
● Pro: Amortized insertions
● Con: Complexity → need to simplify
BufferHash
● Another research paper
● Ancestor of SILT
● Pro: Also done for SSD
● Pro: Lots of good ideas
● Combine these two great projects
● Specialize deduplication for SSD usage
QCOW hash store
● Optimized for SSD
● Two simple stages
● Takes only around 4 bytes of RAM per 4KB cluster
● No write amplification
● Amortized writes
● O(1) lookup
● Memory usage can be configurable
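To put the per-cluster RAM figures in perspective, the arithmetic below compares the 80 bytes per 4KB cluster of the first iteration with the roughly 4 bytes per cluster of the hash store. The 1 TiB image size is an arbitrary example chosen for illustration, and the function name is hypothetical.

```python
def index_ram_bytes(image_bytes: int, bytes_per_cluster: int, cluster_bytes: int = 4096) -> int:
    """RAM needed by the hash index when every cluster of the image is allocated."""
    clusters = image_bytes // cluster_bytes
    return clusters * bytes_per_cluster


TIB = 2 ** 40
GIB = 2 ** 30
first = index_ram_bytes(TIB, 80)   # first iteration: 80 bytes per cluster
second = index_ram_bytes(TIB, 4)   # hash store: around 4 bytes per cluster
print(first / GIB, second / GIB)   # 20.0 1.0
```

For this example image the index shrinks from 20 GiB to about 1 GiB of RAM, a 20-fold reduction.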
Inserting into the hash store
● Insertions use only large sequential writes
● No write amplification
Stage 1
● Write new hashes into a log
● Build a hash table of the new hashes in RAM
Stage 1
[Figure: each new hash is written to the on-disk log (the hash table is rebuilt from it on restart) and indexed into the in-RAM hash table]
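Stage 1 can be sketched as an append-only log paired with a RAM hash table that is rebuilt from the log on restart. This is a simplified illustration, not the QCOW2 on-disk format: the record layout (32-byte hash followed by a 64-bit physical offset), the class name `Stage1`, and the log file name are all assumptions.

```python
import hashlib
import os
import struct
import tempfile

# Assumed record layout: 32-byte hash + 64-bit little-endian physical offset.
RECORD = struct.Struct("<32sQ")


class Stage1:
    """Append-only log on disk plus a hash table in RAM (hypothetical sketch)."""

    def __init__(self, log_path: str):
        self.table = {}
        # Rebuild the RAM hash table from the log left by a previous run.
        if os.path.exists(log_path):
            with open(log_path, "rb") as f:
                while True:
                    rec = f.read(RECORD.size)
                    if len(rec) < RECORD.size:
                        break
                    digest, offset = RECORD.unpack(rec)
                    self.table[digest] = offset
        self.log = open(log_path, "ab")

    def insert(self, digest: bytes, offset: int) -> None:
        self.log.write(RECORD.pack(digest, offset))  # sequential append only
        self.table[digest] = offset

    def lookup(self, digest: bytes):
        return self.table.get(digest)

    def close(self) -> None:
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "stage1.log")
h = hashlib.sha256(b"cluster contents").digest()
s = Stage1(path)
s.insert(h, 65536)
s.close()

restarted = Stage1(path)      # simulates a restart: the table comes back from the log
print(restarted.lookup(h))    # 65536
restarted.close()
```

Because the log is only ever appended to, insertions stay sequential, which matches the SSD constraints listed earlier.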
Stage 2
● Convert Stage 1 hash table into an incarnation
● Collect incarnations
Stage 2
[Figure: the Stage 1 RAM hash table is dumped to disk as incarnation #1, incarnation #2, ..., incarnation #n]
Querying
● First query Stage 1
● Next query every Stage 2 incarnation
● Query from newest to oldest
● Queries can be done in O(1) with RAM filters
How to speed up Stage 2 queries
● One filter per incarnation
● Filters loaded into RAM
● A filter is an extract of an incarnation
● Same as the incarnation, only smaller
● Use smaller hashes at the same position
● Smaller hashes are slices of the hashes
A Stage 2 query probe
[Figure: a probe checks the in-RAM incarnation filter (extracts of the hashes) before reading the on-disk hash incarnation #n]
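A filter probe of this kind can be sketched as below. The talk only states that a filter holds smaller slices of the hashes at the same positions as the incarnation; everything else here is an assumption made for illustration: the 8-slot table, the 2-byte slices, linear probing on collisions, and the function names `build_incarnation` and `probe`.

```python
import hashlib

SLOTS = 8   # assumption: tiny table so the example is easy to follow
SLICE = 2   # assumption: the filter keeps the first 2 bytes of each hash


def build_incarnation(entries: dict):
    """Return (incarnation, filter): same slot positions, smaller hashes in the filter."""
    assert len(entries) < SLOTS, "keep at least one empty slot so probes terminate"
    incarnation = [None] * SLOTS                 # stands in for the on-disk table
    for digest, offset in entries.items():
        slot = int.from_bytes(digest[-4:], "big") % SLOTS
        while incarnation[slot] is not None:     # linear probing on collision
            slot = (slot + 1) % SLOTS
        incarnation[slot] = (digest, offset)
    ram_filter = [e[0][:SLICE] if e else None for e in incarnation]
    return incarnation, ram_filter


def probe(digest: bytes, incarnation, ram_filter, stats: dict):
    """Look up digest, touching the incarnation only when the RAM filter matches."""
    slot = int.from_bytes(digest[-4:], "big") % SLOTS
    while ram_filter[slot] is not None:
        if ram_filter[slot] == digest[:SLICE]:
            stats["disk_reads"] += 1             # would be one read from the SSD
            stored, offset = incarnation[slot]
            if stored == digest:
                return offset
        slot = (slot + 1) % SLOTS
    return None


hashes = [hashlib.sha256(bytes([i])).digest() for i in range(5)]
inc, flt = build_incarnation({h: i * 4096 for i, h in enumerate(hashes)})
stats = {"disk_reads": 0}
print(probe(hashes[3], inc, flt, stats))                            # 12288
print(probe(hashlib.sha256(b"absent").digest(), inc, flt, stats))   # None
```

Most misses are resolved entirely in RAM; the incarnation is read only when a hash slice matches, so a false positive costs one extra read but never a wrong answer.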
Store queries
[Figure: a query goes through the Stage 1 hash table, then Filter 1, Filter 2, Filter 3, ..., Filter n; each filter fronts its incarnation 1, 2, 3, ..., n]
Memory usage control
● Oldest in-RAM filters can be unloaded at will
● Memory usage will decrease
● Only the deduplication ratio will be impacted
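The query order and the effect of unloading a filter can be sketched together. This is a simplified model rather than the QCOW2 implementation: incarnations are plain dictionaries instead of on-disk tables, each filter is a set of assumed 4-byte hash slices rather than a positional extract, and the class and method names are hypothetical.

```python
import hashlib


class HashStore:
    """Two-stage store: RAM hash table, then incarnations from newest to oldest."""

    def __init__(self, stage1_capacity: int):
        self.capacity = stage1_capacity
        self.stage1 = {}
        self.incarnations = []   # oldest first; each is {"table": ..., "filter": ...}

    def insert(self, digest: bytes, offset: int) -> None:
        self.stage1[digest] = offset
        if len(self.stage1) >= self.capacity:
            # Stage 2: dump the full RAM table as a new incarnation plus its filter.
            ram_filter = {d[:4] for d in self.stage1}   # assumed 4-byte hash slices
            self.incarnations.append({"table": self.stage1, "filter": ram_filter})
            self.stage1 = {}

    def lookup(self, digest: bytes):
        if digest in self.stage1:                       # first query Stage 1
            return self.stage1[digest]
        for inc in reversed(self.incarnations):         # newest to oldest
            if inc["filter"] is None:                   # filter unloaded: skip
                continue
            if digest[:4] in inc["filter"]:
                offset = inc["table"].get(digest)       # would be a disk read
                if offset is not None:
                    return offset
        return None

    def unload_oldest_filter(self) -> None:
        for inc in self.incarnations:
            if inc["filter"] is not None:
                inc["filter"] = None                    # frees RAM, table stays on disk
                return


store = HashStore(stage1_capacity=2)
hs = [hashlib.sha256(bytes([i])).digest() for i in range(5)]
for i, h in enumerate(hs):
    store.insert(h, i * 4096)
print(store.lookup(hs[0]), store.lookup(hs[4]))   # 0 16384
store.unload_oldest_filter()
print(store.lookup(hs[0]))                        # None
```

After the oldest filter is unloaded, a write of that block is treated as new and stored again. Data stays correct; only the deduplication ratio drops, which is the trade-off the slide describes.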
Current status
● QCOW2 key-value store implemented
● First round of patches needs to be merged
Third iteration (after merge)
● SSDs need parallelization to read fast
● Current algorithm is sequential so it is slow
● Dedupe algorithm code will need a rewrite
● Need a faster 256-bit hash function (cityhash?)
Does it work at all?
Let's do a simple test
Host preparation
● On the host:
● # qemu-img create -f qcow2_dedup test.qcow2 10G
● # qemu … -drive file=test.qcow2,if=virtio,cache=none
On the guest
● root@debian:~# mkfs.ext4 /dev/vdb
● mount /dev/vdb /mnt
● root@debian:~# du -sh /usr/
927M /usr/
● root@debian:~# cp /usr/ /mnt/1 -a
● root@debian:~# cp /usr/ /mnt/2 -a
● root@debian:~# cp /usr/ /mnt/3 -a
● root@debian:~# cp /usr/ /mnt/4 -a
● root@debian:~# du -sh /mnt/
3.6G /mnt/
● root@debian:~# sync
Back to the host
● # du -sh test.qcow2
1.1GB test.qcow2
● 2.5GB of disk space saved out of 3.6GB
References
● SSD: http://en.wikipedia.org/wiki/Solid-state_drive
● B-tree: www.cs.aau.dk/~simas/aalg06/UbiquitBtree.pdf
● SILT: http://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf
● BufferHash: http://pages.cs.wisc.edu/~akella/papers/bufferhash-nsdi10.pdf
● Venti: http://www.cs.bell-labs.com/sys/doc/venti/venti.html