Post on 03-Aug-2020
Wayne State University, Cluster and Internet Computing Laboratory
RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication
Fan Ni* (fan@netapp.com)
ATG, NetApp
Song Jiang (song.jiang@uta.edu)
UT Arlington
* The work was done when he was a Ph.D. student at UT Arlington
Data is Growing Rapidly
§ Much of the data needs to be stored for preservation and processing.
§ Efficient data storage and management has become a big challenge.
From storagenewsletter.com
The Opportunity: Data Duplication is Common
§ Sources of duplicate data:
– The same files are stored in the cloud by multiple users.
– Files are continuously updated, generating multiple versions.
– Use of checkpointing and repeated data archiving.
§ Significant data duplication has been observed for both backup and primary storage workloads.
The Deduplication Technique can Help
[Figure: logical view (File1, File2) vs. physical view. When duplication is detected using fingerprinting, i.e., SHA1( ) = SHA1( ) for two pieces of data, only one copy is stored.]
§ Benefits
– Storage space
– I/O bandwidth
– Network traffic
§ An important feature in commercial storage systems.
– NetApp ONTAP system
– Dell-EMC Data Domain system
§ Two critical issues:
– How to deduplicate more data?
– How to deduplicate faster?
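The fingerprint-based deduplication described above can be sketched as follows. This is a minimal illustration, not the implementation of any of the systems named: the function names are hypothetical, the chunks are given by hand, and the deduplication ratio is assumed to mean logical bytes divided by physical bytes.

```python
import hashlib

def deduplicate(chunks):
    """Store each distinct chunk once, keyed by its SHA-1 fingerprint."""
    store = {}    # fingerprint -> chunk bytes (the single physical copy)
    recipe = []   # ordered fingerprints needed to rebuild the logical data
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)   # keep only the first copy
        recipe.append(fp)
    return store, recipe

def dedup_ratio(chunks, store):
    """Assumed definition: logical bytes / physical bytes."""
    logical = sum(len(c) for c in chunks)
    physical = sum(len(c) for c in store.values())
    return logical / physical

chunks = [b"HOWAREYOU?", b"OK?", b"HOWAREYOU?", b"OK?"]
store, recipe = deduplicate(chunks)
ratio = dedup_ratio(chunks, store)
rebuilt = b"".join(store[fp] for fp in recipe)
```

The recipe keeps the logical view intact while the store holds one copy per fingerprint, which is where the space, I/O, and network savings come from.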
Deduplicate at Smaller Chunks … for higher deduplication ratio
[Figure: deduplication pipeline: chunking and fingerprinting, then removing duplicate chunks.]
§ Two potentially major sources of cost in deduplication:
– Chunking
– Fingerprinting
§ Can chunking be very fast?
Fixed-Size Chunking (FSC)
HOWAREYOU?OK?REALLY?YES?NO File A
HOWAREYOU?OK?REALLY?YES?NO File B (with an extra byte, H, inserted)
§ FSC: partition files (or data streams) into equal, fixed-size chunks.
– Very fast!
§ But the deduplication ratio can be significantly compromised.
– The boundary-shift problem: an inserted byte shifts every later chunk boundary, so chunks after the insertion no longer match.
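The boundary-shift problem can be reproduced with the slide's own strings. The sketch below assumes a chunk size of 4 bytes and assumes the extra byte H is inserted at the front of File B; both choices are for illustration only.

```python
def fixed_size_chunks(data, size):
    """Cut data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a   # assumption: the extra byte is inserted at the front

fsc_a = fixed_size_chunks(file_a, 4)
fsc_b = fixed_size_chunks(file_b, 4)
fsc_shared = set(fsc_a) & set(fsc_b)   # chunks that could be deduplicated
```

With these inputs no chunk of File B matches any chunk of File A, even though the two files differ by a single byte.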
Content-Defined Chunking (CDC)
HOWAREYOU?OK?REALLY?YES?NO File A
HOWAREYOU?OK?REALLY?YES?NO File B (with an extra byte, H, inserted)
§ CDC: determines chunk boundaries according to content (a predefined special marker).
– Variable chunk size.
– Addresses the boundary-shift problem.
§ Assume the special marker is ‘?’.
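A toy version of marker-based chunking, using ‘?’ as the marker as the slide does, shows why CDC survives the inserted byte. The function name is hypothetical and the extra byte is again assumed to sit at the front of File B.

```python
def marker_chunks(data, marker=b"?"):
    """Cut data right after every occurrence of the one-byte marker."""
    chunks, start = [], 0
    for i in range(len(data)):
        if data[i:i + 1] == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):          # trailing bytes after the last marker
        chunks.append(data[start:])
    return chunks

file_a = b"HOWAREYOU?OK?REALLY?YES?NO"
file_b = b"H" + file_a   # assumption: the extra byte is inserted at the front

cdc_a = marker_chunks(file_a)
cdc_b = marker_chunks(file_b)
cdc_shared = set(cdc_a) & set(cdc_b)
```

Only the chunk containing the inserted byte changes; the four chunks after it are identical in both files and can be deduplicated.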
The Advantage of CDC
§ Real-world datasets include two weeks of Google News, Linux kernels, and various Docker images.
§ CDC’s deduplication ratio is much higher than FSC’s.
§ However, CDC can be very expensive.
[Figure: deduplication ratio (y-axis, 0 to 40) of CDC vs. FSC on Google-news, Linux-tar, Cassandra, Redis, Debian-docker, Neo4j, Wordpress, and Nodejs.]
CDC can be Too Expensive!
HOWAREYOU?OK?REALLY?YES?NO File A
HOWAREYOU?OK?REALLY?YES?NO File B (with an extra byte, H, inserted)
Assume the special marker is ‘?’
§ The marker for identifying chunk boundaries must
– be evenly spaced out with a controllable distance in between.
§ In practice, the marker is determined by applying a hash function to a window of bytes.
– E.g., hash(“YOU?”) == pre-defined-value
§ The window rolls forward byte by byte, and the hash is recomputed at every position.
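The rolling-hash boundary test can be sketched with a gear-style hash, one of the hash families the talk mentions later. Everything specific here is an assumption made for illustration: the random table, the seed, the choice of testing the top `avg_bits` bits, and the absence of minimum and maximum chunk sizes. With a 64-bit state and a one-bit shift per byte, the hash depends only on the last 64 bytes, which acts as the rolling window.

```python
import random

_table_rng = random.Random(2019)                   # fixed seed, reproducible table
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def gear_cdc(data, avg_bits=6):
    """Declare a boundary wherever the top `avg_bits` bits of the hash are zero."""
    boundary_mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 64-bit state one bit per step.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h & boundary_mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data_rng = random.Random(7)
file_a = bytes(data_rng.getrandbits(8) for _ in range(4096))
file_b = b"H" + file_a                             # one byte inserted at the front
gear_a = gear_cdc(file_a)
gear_b = gear_cdc(file_b)
```

Because the boundary decision depends only on the preceding 64 bytes, every chunk of `file_a` that starts at offset 64 or later reappears unchanged in `file_b`. The cost is visible in the loop: one hash update and one test for every byte, which is the work RapidCDC tries to skip.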
CDC Chunking Becomes a Bottleneck
§ Chunking time > 60% of the CPU time.
§ I/O bandwidth is not fully utilized.
§ The bottleneck shifts from the disk to the CPU.
[Figure, two panels: breakdown of CPU time (chunking vs. fingerprinting) and breakdown of I/O time (I/O busy vs. I/O idle). y-axis: Time (%), 0 to 100. Datasets: Linux-tar (dr=4.06), Redis (dr=7.17), Neo4j (dr=19.04).]
Efforts on Acceleration of CDC Chunking
§ Make hashing faster
– Example functions: SimpleByte, gear, and AE
– More likely to generate small chunks
• increasing the size of metadata cached in memory for performance
§ Use GPU/multi-core to parallelize the chunking process
– Extra hardware cost
– Substantial effort to deploy
– The speedup is bounded by hardware parallelism.
§ Significant software/hardware effort, but limited performance return
We proposed RapidCDC that …
§ is still sequential and doesn’t require additional cores/threads.
§ makes the hashing speed almost irrelevant.
§ accelerates the CDC chunking often by 10-30 times.
§ has a deduplication ratio the same as regular CDC methods.
§ can be adopted in an existing CDC deduplication system by adding 100~200 LOC in a few functions.
The Path to the Breakthrough
[Animation over eight slides; the drawing is not recoverable. Surviving labels: "Unique Chunks in the Disk"; "Fingerprint Matched!"; "Confirm it!"; example chunk sizes (15KB, 10KB, 9KB, 20KB, 12KB, 7KB, and then 16KB, 7KB, 20KB); predicted boundaries marked "P"; "P almost always happens!"]
Duplicate Locality
§ Duplicate locality: if two chunks are duplicates, their next chunks (in their respective files or data streams) are likely duplicates of each other.
§ Duplicate chunks tend to stay together.
[Figure (Debian): percentage of chunks (%), 0 to 100, vs. # of files (10, 20, 40, 80, 90). Series: all duplicate chunks; duplicate chunks immediately following another duplicate chunk.]
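The two quantities plotted for duplicate locality can be counted from a sequence of chunk fingerprints. The sketch below is a simplification: the function name is hypothetical, and it treats all chunks as one stream rather than tracking file boundaries.

```python
def duplicate_locality(fingerprints):
    """Return (duplicate chunks, duplicates that directly follow a duplicate)."""
    seen = set()
    prev_was_dup = False
    dups = following = 0
    for fp in fingerprints:
        is_dup = fp in seen
        if is_dup:
            dups += 1
            if prev_was_dup:
                following += 1
        seen.add(fp)
        prev_was_dup = is_dup
    return dups, following

# First version: a b c d.  Second version: a b c x d (one new chunk, x).
stream = ["a", "b", "c", "d", "a", "b", "c", "x", "d"]
dups, following = duplicate_locality(stream)
```

A high ratio of `following` to `dups` is what makes the previous chunk's history a useful predictor for the next boundary.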
RapidCDC: Using Next Chunk in History as a Hint
[Figure: File A is chunked into A1, A2, A3, A4, …, and its history is recorded as <FP1, s2> <FP2, s3> <FP3, s4> <FP4, …>. File B has chunks B1, B2, B3, B4, … with boundaries at file offsets P0, P1, P2, P3, P4. When FP(B1) == FP(A1), the next boundaries are hinted as P1 + s2 = P2, P2 + s3 = P3, P3 + s4 = P4.]
§ History recording: whenever a chunk is detected, its size is attached to its previous chunk (fingerprint).
§ Hint-assisted chunking: whenever a duplicate is detected, use the size recorded in the history as a hint for the next chunk boundary.
§ Regular CDC is used for chunking until a duplicate chunk (e.g., B1) is found.
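History recording and hint-assisted chunking can be sketched on top of the toy ‘?’ marker. This is an illustration under stated assumptions, not the authors' prototype: all names are hypothetical, only one size is remembered per fingerprint, and a hint is accepted only if the hinted chunk's fingerprint is already known, which is a simplified stand-in for the validation criteria listed on the next slide.

```python
import hashlib

MARKER = ord("?")   # toy boundary marker, as in the slides' example

def fingerprint(chunk):
    return hashlib.sha1(chunk).digest()

def scan_to_boundary(data, start):
    """Regular CDC step: test bytes one by one until the marker is found."""
    for i in range(start, len(data)):
        if data[i] == MARKER:
            return i + 1
    return len(data)

def rapid_chunk(data, history, known):
    """history: fingerprint -> size of the chunk that followed it last time.
    known: set of fingerprints seen so far. Both are updated in place."""
    chunks, scanned = [], 0
    pos, prev_fp, hint = 0, None, None
    while pos < len(data):
        end = None
        if hint is not None and pos + hint <= len(data):
            if fingerprint(data[pos:pos + hint]) in known:   # hint confirmed
                end = pos + hint                             # skip the scan
        if end is None:                                      # no usable hint
            end = scan_to_boundary(data, pos)
            scanned += end - pos
        chunk = data[pos:end]
        fp = fingerprint(chunk)
        is_dup = fp in known
        known.add(fp)
        if prev_fp is not None:
            history[prev_fp] = len(chunk)                    # history recording
        hint = history.get(fp) if is_dup else None           # hint for next chunk
        chunks.append(chunk)
        pos, prev_fp = end, fp
    return chunks, scanned

history, known = {}, set()
chunks_a, scanned_a = rapid_chunk(b"HOWAREYOU?OK?REALLY?YES?NO", history, known)
chunks_b, scanned_b = rapid_chunk(b"HHOWAREYOU?OK?REALLY?YES?NO", history, known)
```

In this toy run the first file is scanned in full (26 bytes). For the second file, regular scanning is needed only up to the first duplicate chunk (14 of 27 bytes); the remaining three boundaries come from hints, and the resulting chunks are the same ones marker-based chunking would produce for this input.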
More Design Considerations …
§ A chunk may have been followed by chunks of different sizes.
– Maintain a size list.
§ Validation of hinted next chunk boundaries
– Four alternative criteria with different efficiency and confidence:
Ø FF (fast-forwarding only)
Ø FF+RWT (Rolling Window Test)
Ø FF+MT (Marker Test)
Ø FF+RWT+FPT (Fingerprint Test)
§ Please refer to the paper for details.
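The size list mentioned above could be kept per fingerprint roughly as follows. The slide only says that a list is maintained; the class name, the most-recent-first ordering, and the cap on list length are assumptions of this sketch.

```python
from collections import defaultdict

class SizeHints:
    """Per-fingerprint list of next-chunk sizes, most recently recorded first."""

    def __init__(self, max_sizes=4):
        self.max_sizes = max_sizes          # assumed cap on hints per chunk
        self._sizes = defaultdict(list)

    def record(self, fp, next_size):
        sizes = self._sizes[fp]
        if next_size in sizes:
            sizes.remove(next_size)         # move an existing size to the front
        sizes.insert(0, next_size)
        del sizes[self.max_sizes:]          # drop the oldest sizes beyond the cap

    def candidates(self, fp):
        """Sizes to try, in order, as hinted next-chunk boundaries."""
        return list(self._sizes.get(fp, ()))

hints = SizeHints(max_sizes=2)
hints.record("fp1", 16)
hints.record("fp1", 7)
hints.record("fp1", 16)
after_three = hints.candidates("fp1")
hints.record("fp1", 20)
after_four = hints.candidates("fp1")
```

A chunker would try each candidate size in turn, apply whichever validation criterion is configured, and fall back to regular CDC if none passes.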
Evaluation of RapidCDC
§ Prototype: based on a rolling-window-based CDC system.
– Using Rabin/Gear as the rolling function for rolling-window computation.
– Using SHA1 to calculate fingerprints.
§ Three disks with different speeds are tested.
– SATA hard disk: 138 MB/s and 150 MB/s for sequential read/write.
– SATA SSD: 520 MB/s and 550 MB/s for sequential read/write.
– NVMe SSD: 1.2 GB/s and 2.4 GB/s for sequential read/write.
Synthetic Datasets: Insert/Delete
§ Chunking speedup correlates with the deduplication ratio.
§ The deduplication ratio is little affected (except for one very aggressive validation criterion).
[Figure, two panels vs. # of modifications (1000, 2000, 5000, 10000, 20000). Deduplication ratio (0 to 7) for Regular, FF+RWT+FPT, FF+RWT, FF+MT, and FF. Speedup (0 to 6) for FF+RWT+FPT, FF+RWT, FF+MT, and FF.]
Real-world Datasets: Chunking Speed
§ Chunking speedup approaches the deduplication ratio.
§ Negligible deduplication ratio reductions (if any).
[Figure, two panels for Debian, Neo4j, Wordpress, and Nodejs. Speedup (0 to 30) for FF+RWT+FPT, FF+RWT, FF+MT, and FF, annotated "33X faster!". Deduplication ratio (0 to 40) for Regular, FF+RWT+FPT, FF+RWT, FF+MT, and FF.]
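A back-of-the-envelope argument suggests why the speedup tends toward the deduplication ratio. It is an idealization, not a result from the talk: let $L$ be the logical bytes, $U$ the unique bytes, $r = L/U$ the deduplication ratio (assumed definition), and $c$ the per-byte cost of regular scanning, and assume every duplicate byte is skipped through a correct hint at negligible cost.

```latex
\[
T_{\text{regular}} = c\,L, \qquad
T_{\text{hinted}} \approx c\,U
\quad\Longrightarrow\quad
\text{speedup} = \frac{T_{\text{regular}}}{T_{\text{hinted}}}
\approx \frac{L}{U} = r .
\]
```

In practice the speedup stays below $r$, since the first duplicate chunk of each run is still found by regular CDC and validating a hint has a cost of its own.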
Conclusions
§ RapidCDC represents a disruptively new approach to improve CDC chunking speed.
§ It increases chunking speed by up to 33X without loss of deduplication ratio.
§ Its adoption in an existing CDC deduplication system does not require any major change of its current operation flow.
§ Its implementation in an existing CDC deduplication system requires minimal code changes (100-200 lines of C code in our prototype).
§ A prototype implementation is available at https://github.com/moking/rapidcdc