alacc: accelerating restore performance of data
TRANSCRIPT
ALACC: Accelerating Restore Performance of
Data Deduplication Systems Using
Adaptive Look-Ahead Window Assisted Chunk Caching
Zhichao Cao, Hao Wen, Fenggang Wu and David H.C. Du
University of Minnesota, Twin Cities
02/15/2018
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
Deduplication Process [1]
2 5 10 1422 18
Recipe
23 822
2223 822
2119 2018
2322
Container Buffer
133 102
Container Storage
102 518
259 178
5 1414
Indexing Table
……
Byte Stream
Not Found
14
1. Chunk ID
2. Chunk Size
3. Container Address
4. Offset in the container
5. Other meta information
Sliding Window
Recipe
Entry
[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//Fast. 2008, 8: 1-14.
Center for Research in Intelligent Storage
Deduplication Process [1]
2 5 10 1422 18
Recipe
23 822
2223 822
2119 2018
2322
Container Buffer
133 102
Container Storage
102 518
259 178
5
5
1414
Indexing Table
5
……
Byte Stream
Exits
Sliding Window
[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//Fast. 2008, 8: 1-14.
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
• Due to the serious data fragmentation and size mismatching of requested
data and I/O unite, the restore performance is much lower than that of
directly reading out the data which is not deduplicated.
• CPU and memory resources are limited.
Why Improving Restore Performance is Important?
222 1823 822
……
Container Storage
22
23
8 18
2
?
Center for Research in Intelligent Storage
Restore Process with Container-based Caching
2 5 10 14 18 13 22 3 ……22 18
Recipe
23 822 28 23 12 13 32 23 28 6
Container Cache
2119 2018 133 102
259 178 1423 522
Assembling Buffer
102 518133 102
1423 522
142223 822
Restored Data Storage
133 52
……
1314 512
……
2119 201818
Container Storage
Restore Direction
135……
Center for Research in Intelligent Storage
Assembling Buffer
Restore Process with Chunk-based Caching
2 5 10 14 18 13 22 3
……
22 18
Recipe
23 822 28 23 12 13 32 23 28 6
Chunk Cache
2119 2018 133 102
259 178 1423 522
102 518142223 822
Restored Data Storage
133 52 1314 512
……
Container Read Buffer223 182
2313 514133 102
10
18
Container Storage
……
Center for Research in Intelligent Storage
Container-based Caching vs. Chunk-based Caching
Less operating and management
overhead
Relatively higher cache miss ratio,
especially when the caching space is
limited.
Container-based Caching Chunk-based Caching
1. Higher cache hit ratio
2. Even much higher if look-ahead
window is applied
Higher operating and management
overhead
Center for Research in Intelligent Storage
Container-based Caching vs. Chunk-based Caching
0
50
100
150
200
250
300
Con
tain
er-
rea
ds
per 1
00M
B
Rest
ored
Total Cache Size
Container_LRU
Chunk_LRU
00.5
11.5
22.5
33.5
44.5
5
Co
mp
uti
ng
Tim
e
(seco
nd
s/G
B)
Total Cache Size
Container_LRU
Chunk_LRU
Center for Research in Intelligent Storage
2119 2018 133 102
1423 522259 178
Forward Assembly Scheme [1]
8 5 10 14 18 9 22 3
……
22 18
Recipe
23 822 28 23 12 13
2223 822
Container Read buffer
Container Storage
2119 2018
259 178
Forward Assembling Area (FAA)
1814
Look-Ahead Window
……58 2218 18
Restored Data
Storage
822 232
……
1423 522
23 822
[1] Lillibridge M, Eshghi K, Bhagwat D. Improving
restore speed for backup systems that use inline chunk-
based deduplication[C]//FAST. 2013: 183-198.
9
Center for Research in Intelligent Storage
Chunk-based Caching vs. Forward Assembly
1. Highly efficient, when chunks from
the same container are used most in
the FAA range
2. Low operating and management
overhead
Workload sensitive, requires good
workload locality
Chunk-based Caching
Higher operating and management
overhead
Forward Assembly
1. When chunks are re-used in a
relatively long distance (larger than
FAA), caching is more effective
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
• Objective:
– Forward assembly + chunk-based caching + LAW (limited memory space)
• Challenges
– When the total size of available memory for restore is limited and fixed, how to use these
schemes in an efficient way, is unclear.
– How to make better trade-offs to achieve fewer container-reads, but limit the computing
overhead including the LAW, caching and forward assembly overhead.
– How to make the design adapt to the changing workload is very challenging.
Objective and Challenges
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-Ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
Look-ahead Window Assisted Chunk Cache
2 5 10 14 10 15 7 22
……
Look-Ahead Window (LAW) Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
RestoredInformation for Chunk Cache
2 518
FAA
FAB1 FAB2
Chunk CacheContainer Read Buffer
2119 2018
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
252 188
1423 522
101712 1310
10
What’s the caching policy?
1712 1310
Center for Research in Intelligent Storage
Chunks in the Read-in Container10 P-chunk: Probably used chunk
17 U-chunk: Unused chunk
12 13 F-chunk: Future used chunk
Chunk Cache
F-cache P-cache
2 5 10 14 10 15 7 22
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
RestoredInformation for Chunk Cache
2 518
FAA
FAB1 FAB2
Container Read Buffer
2119 2018 1712 1310
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
101712 1310
10 22
18
32
28
21
25
2
5
Center for Research in Intelligent Storage
Caching Priority of F-cache
Chunk Cache
F-cache
2 5 10 14 10 15 7 22
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
RestoredInformation for Chunk Cache
2 518
FAA
FAB1 FAB2
Container Read Buffer
2119 2018 1712 1310
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
101712 1310
10 22
18
32
28
12
13
F-chunks being used in the near
future have higher priority than F-
chunks being used in the far future
High priority end
Low priority end
Center for Research in Intelligent Storage
Caching Priority of P-cache
Chunk Cache
P-cache
2 5 10 14 10 15 7 22
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
RestoredInformation for Chunk Cache
2 518
FAA
FAB1 FAB2
Container Read Buffer
2119 2018 1712 1310
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
101712 1310
10 5
2
8
21
F-cache
……
P-cache is LRU based
caching policy
10
21
High priority end
Low priority end
Center for Research in Intelligent Storage
• What’s the memory space ratio between forward assembly and chunk cache?
• What’s the size of LAW?– Too large: the computing overhead large but the extra information in the LAW is wasting
– Too small: it becomes forward assembly+LRU cache
• What if the workload locality changes?
Good Enough?
Center for Research in Intelligent Storage
Chunk Cache
2 5 10 14 5 2 18 14
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 5 10 13 32 23 28 6
Restored
2 518
FAA
FAB2 FAB3
Container Read Buffer
2119 2018 1712 1310
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
10 5
……FAB1
If most of the chunks (e.g., duplicated
chunks) from the same container can be
covered by FAA, a large FAA size is
preferred
2 18 18 5 10
Large FAA Size (Small Cache)
Cache Range
Center for Research in Intelligent Storage
Chunk Cache
19 20 21 10 21 3 7 21
Look-Ahead Window Covered Range
12 18
Unknown
2 822 18 28 16 22 2 8 2 19
FAA
FAB3
Container Read Buffer
2119 2018 1712 1310
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
If most the chunks reuse distance is
relatively larger, caching chunks is
preferred.
22 2 8
22
Information for Chunk Cache
22 2 8 2
……
……
Large Chunk Cache Size (Small FAA)
FAA Range
Center for Research in Intelligent Storage
LAW Size
……
2 5 10 14 10 15 7 22
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
Restored
2 5 10 14 10 15 7 22
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
Restored
14 10 15 7
With larger LAW, more F-chunks can be potentially identified and cached. However,
once the cache is full of F-chunks, further increasing the LAW size cannot improve
the cache efficiency. In contrast, it will bring in more unnecessary overhead.
With smaller LAW, lower computing overhead can be achieved, but cache will store
more P-chunks, which potentially reduce the caching efficiency.
Center for Research in Intelligent Storage
ALACC
Increase FAA
The re-use distances of most duplicated
data chunks are within the FAA range
The data chunks in the first FAB are
identified mostly as unique data chunks
and these chunks are stored in the same
or close containers
Shrink LAW
P-chunk number is very small
F-chunk added during the restore
cycle is very large
Adjust LAW
Shrink Cache
Not satisfied
Enlarge Cache Shrink FAA
Not satisfied
OR
OR Adjust LAW
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
• Five Caching Designs:
– LRU-based container caching (Container_LRU)
– LRU-based chunk caching (Chunk_LRU)
– Forward assembly (FAA)
– Optimal configuration with fixed forward assembly and chunk-based caching (Fix_Opt)
– ALACC
• Four Traces:
– 2 FSL traces from FSL /home directory snapshots of the year 2014 [1]
– 2 EMC weekly full-backup traces [2]
Experiment Setup
[1] http://tracer.filesystems.org/.
[2] Nohhyun Park and David J Lilja. Characteriz- ing datasets for data deduplication in backup ap- plications. In Workload
Characterization (IISWC), 2010 IEEE International Symposium on, pages 1– 10. IEEE, 2010.
Center for Research in Intelligent Storage
How We Get the Fix_OptThe Fix_Opt configuration of each trace (FAA/chunk
cache/LAW size in container unite)
24
6
8
10
12
14
12
13
14
15
16
17
18
1624
3240
4856
6472
8088
96
FA
AS
ize
Res
tore
Th
rou
gh
pu
t(M
B/S
)
LAW Size
12-13 13-14 14-15 15-16 16-17 17-18
We run all possible configurations
for each trace to discover the optimal
throughput. Notice that we need tens
of experiments to find out the
optimal configurations of Fix_Opt
which is almost impossible to carry
out in a real-world production
scenario.
Chunk cache size = total memory size – FAA size
Center for Research in Intelligent Storage
Restore Throughput
FSL_1 FSL_2
0
5
10
15
20
25
0 1 2 3 4 5
Res
tore
Th
rou
gh
pu
t(M
B/S
)
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
0
10
20
30
40
50
60
70
0 1 2 3 4 5
Res
tore
Th
rou
gh
pu
t(M
B/S
)
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
1. ACS stands for Average Chunk Size
2.DR stands for the Deduplication Ratio.
3. CFL stands for the Chunk Fragmentation
Level
Restore Throughput (MB/S):
original data stream size divided by the
total restore time.
(4/12/56)(6/10/72)
>100%
>10%<15%
Center for Research in Intelligent Storage
Speed Factor
0
0.2
0.4
0.6
0.8
1
1.2
0 1 2 3 4 5
Sp
eed
Fa
ctor
Version Number
Container_LRU Chunk_LRUFAA Fix_OptALACC
0
1
2
3
4
5
0 1 2 3 4 5
Sp
eed
Fa
cto
rVersion Number
Container_LRU Chunk_LRUFAA Fix_OptALACC
FSL_1 FSL_2
1. ACS stands for Average Chunk Size
2.DR stands for the Deduplication Ratio.
3. CFL stands for the Chunk Fragmentation
Level
Speed Factor (MB/container-read):
the mean data size restored per
container read
(4/12/56) (6/10/72)
Center for Research in Intelligent Storage
Computing Cost
Factor
FSL_1 FSL_2
0
4
8
12
16
20
0 1 2 3 4 5
Co
mp
uti
ng
C
ost
Fact
or
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
0
2
4
6
8
10
12
0 1 2 3 4 5
Co
mp
uti
ng
Co
stF
act
or
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
1. ACS stands for Average Chunk Size
2.DR stands for the Deduplication Ratio.
3. CFL stands for the Chunk Fragmentation
Level
Computing Cost Factor (second/GB):
the time spent on computing operations
(subtracting the storage I/O time from
the restore time) per GB data restored
(4/12/56) (6/10/72)
Center for Research in Intelligent Storage
• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work
Agenda
Center for Research in Intelligent Storage
• Studied the effectiveness and the efficiency of different caching mechanisms.
• Designed an adaptive algorithm called ALACC which is able to adaptively
adjust the sizes of the FAA, chunk cache and LAW according to the
workload changing.
• In our future work, duplicated data chunk rewriting, multi-threading
implementation will be investigated and integrated with ALACC to further
improve the restore performance.
Conclusions and Future Work
Center for Research in Intelligent Storage
Look-ahead Window Assisted Chunk Cache
2 5 10 14 10 15 7 22
……
Look-Ahead Window Covered Range
22 18
UnknownFAA Covered Range
2 822 18 23 12 13 32 23 28 6
RestoredInformation for Chunk Cache
2 518
FAA
FAB1 FAB2
Chunk CacheContainer Read Buffer
2119 2018
259 28 1423 522
Container Storage
……
Restored Data
Storage
222 822
……
252 188
1423 522
101712 1310
10
1712 1310
Center for Research in Intelligent Storage
0
0.5
1
1.5
2
2.5
3
0 1 2 3 4 5
Sp
eed
Fact
or
Version Number
Container_LRU Chunk_LRUFAA Fix_OptALACC
0
1
2
3
4
5
0 1 2 3 4 5
Sp
eed
Fa
ctor
Version Number
Container_LRU Chunk_LRUFAA Fix_OptALACC
EMC_1 EMC_2
Center for Research in Intelligent Storage
EMC_1 EMC_2
0
0.5
1
1.5
2
2.5
3
0 1 2 3 4 5
Co
mp
uti
ng
Co
stF
act
or
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
0
0.4
0.8
1.2
1.6
2
0 1 2 3 4 5
Co
mp
uti
ng
Co
stF
act
or
Version Number
Container_LRU Chunk_LRUFAA Fix_OptALACC
Center for Research in Intelligent Storage
EMC_1 EMC_2
0
10
20
30
40
50
60
0 1 2 3 4 5
Res
tore
Th
rou
gh
pu
t(M
B/S
)
Version Number
Container_LRU Chunk_LRU
FAA Fix_Opt
ALACC
0
15
30
45
60
75
90
105
0 1 2 3 4 5
Res
tore
Th
rou
gh
pu
t(M
B/S
)
Version Number
Container_LRU Chunk_LRUFAA Fix_OptALACC