network coding distributed storage...• detect any corrupted data chunks stored on cloud servers...
TRANSCRIPT
![Page 1: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/1.jpg)
Network Coding Distributed Storage
Patrick P. C. Lee
Department of Computer Science and Engineering
The Chinese University of Hong Kong
1
![Page 2: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/2.jpg)
Cloud Storage
Cloud storage is an emerging service model for
remote backup and data synchronization
Is cloud storage fully reliable?
2
![Page 3: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/3.jpg)
Problems in the Cloud
3
![Page 4: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/4.jpg)
Cloud Storage Requirements
Data integrity protection
• Detect any corrupted data chunks stored on cloud
servers
Fault tolerance
• Tolerate any cloud server failures
Efficient recovery
• Recover any lost/corrupted data chunks with minimal
overhead
4
![Page 5: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/5.jpg)
(n, k) MDS codes
5
File encode divide
Nodes
Encode a file of size M into chunks
Distribute encoded chunks into n nodes
Each node stores M/k units of data
MDS property: any k out of n nodes can recover file
n = 4, k = 2
A B C D
A+C B+D
A+D B+C+D
A B
C D
A+C B+D
A+D B+C+D
A B
C D
File of
size M
![Page 6: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/6.jpg)
Repairing a Failure
Q: Can we minimize repair traffic?
6
Node 1
Node 2
Node 3
Node 4
repaired node
Conventional repair: download data from any k nodes
Repair Traffic = = M +
A B
C D
A+C B+D
A+D B+C+D
C D
A+C B+D
A B
File of
size M
A B C D
![Page 7: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/7.jpg)
7
Node 1
Node 2
Node 3
Node 4
Regenerating Codes [Dimakis et al.; ToIT’10]
A B
C D
A+C B+D
A+D B+C+D
C
A+C
A+B+C
A
B
+ + Repair Traffic = = 0.75M
repaired node
File of
size M
A B C D
Repair in regenerating codes:
• Surviving nodes encode chunks (network coding)
• Download one encoded chunk from each node
Minimizing repair traffic minimizing system downtime
![Page 8: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/8.jpg)
Goals
Challenges:
• Mostly theoretical studies; limited empirical studies
• Practical deployment remains unknown
Goals: Study practicality of network coding storage
• To realize network coding data storage in practical
implementation
• To conduct extensive experimental studies and evaluate
the performance in a real storage environment
• To provide insights into deploying network coding data
storage in practice
8
![Page 9: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/9.jpg)
Projects
NCCloud [FAST’12, INFOCOM’13, TC]
• Network coding archival storage for public clouds
FMSR-DIP [SRDS’12, TPDS]
• Data integrity protection for network coding archival
storage
CORE [MSST’13]
• Network coding primary storage for Hadoop file system
NCVFS
• Network coding video file system
9
![Page 10: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/10.jpg)
NCCloud
NCCloud is a proxy-based storage system that applies
regenerating codes in multiple-cloud storage
Design properties:
• Build on functional minimum-storage regenerating (FMSR) codes
• Double-fault tolerance
• Optimal storage efficiency
• Minimum repair bandwidth for single-node failure recovery
• Up to 50% saving compared to conventional repair
• Uncoded repair
Trick: non-systematic codes
• Suited to long-term archival storage whose data is rarely read
10
![Page 11: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/11.jpg)
NCCloud: Overview
11
NCCloud
Cloud 1
Cloud 2
Cloud 3
Cloud 4
Users
file upload
download file
Multiple cloud storage:
• Provide fault tolerance against cloud unavailability
• Avoid vendor lock-ins
![Page 12: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/12.jpg)
NCCloud: Key Idea
Code chunk Pi = linear combination of original data chunks
Repair:
• Download one code chunk from each surviving node
• Reconstruct new code chunks (via linear combination) in new node
12
P1 P2
P3 P4
P5 P6
P7 P8
P3
P5
P7
P1’ P2’
A B C D
P1’ P2’
Node 1
Node 2
Node 3
Node 4
File of
size M
NCCloud
n = 4, k = 2
F-MSR codes
Repair traffic = 0.75M
![Page 13: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/13.jpg)
FMSR-DIP
FMSR-DIP enables data integrity protection, fault
tolerance, and efficient recovery for NC storage
Threat model: Byzantine, mobile adversary [Bowers
et al. ’09]
• exhibits arbitrary behavior
• corrupts different subsets of servers over time
Design properties:
• Preserve advantages of FMSR codes
• Work on thin clouds (i.e., only basic PUT/GET assumed)
• Support byte sampling to minimize cost
13
![Page 14: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/14.jpg)
FMSR-DIP: Overview
FMSR-
DIP Users
file upload
download file
NCCloud
Four operations: Upload, Check, Download and Repair
14
Storage
interface
FMSR
code chunks
FMSR-DIP
code chunks
Servers / clouds
![Page 15: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/15.jpg)
FMSR-DIP: Key Idea
15
Two-level protection:
• Fault tolerance (horizontal) protection by FMSR codes
• Integrity (vertical) protection by adversarial error correcting code
Apply adversarial error-correcting codes to each FMSR
code chunk
Enable tunable parameters to trade between
performance and security
![Page 16: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/16.jpg)
CORE
CORE augments existing optimal regenerating
codes to support both single and concurrent
failure recovery
• Achieves minimum recovery bandwidth for concurrent
failures in most cases
• Retains existing optimal regenerating code
constructions
Implement CORE atop Hadoop HDFS
Enable fault-tolerant, storage-efficient MapReduce
16
![Page 17: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/17.jpg)
CORE: Performance
CORE shows significantly higher throughput
than Reed Solomon codes
• e.g., in (20, 10), for single failure, the gain is 3.45x;
for two failures, it’s 2.33x; for three failures, is 1.75x
0
10
20
30
40
50
60
70
(12, 6) (16, 8) (20, 10)
Reco
very
th
pt
(MB
/s)
CORE t=1
RS t=1
CORE t=2
RS t=2
CORE t=3
RS t=3
17
![Page 18: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/18.jpg)
NCVFS
NCVFS, network coding video file system
• Splits a large file into smaller segments that are striped across
different storage nodes
Flexible coding
• Each segment is independently encoded with erasure coding or
network coding
Decoupling metadata management and data
management
• Metadata updates off the critical path
Lightweight recovery
• Monitor health of storage nodes and trigger recovery if needed
18
![Page 19: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/19.jpg)
NCVFS: Architecture
Key entities:
• Metadata server
(MDS)
• Object Storage
Device (OSD)
• Clients
• Monitors
Master-slave design
19
![Page 20: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/20.jpg)
NCVFS: I/O Path
20
Client
MDS
OSD OSD ...
OSD OSD ... OSD OSD
primary
secondary
segment
block
Encode
1
2
3
4
![Page 21: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/21.jpg)
NCVFS: Performance
Aggregate read/write throughput
• Achieve several hundreds of megabytes per second
• Network bound 21
![Page 22: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/22.jpg)
Research Philosophy
Emphasis spans on wide range of theoretical
and applied topics
Research topics need to be:
• Novel and useful
• Addressed by both
• Rigorous algorithmic design and analysis
• Extensive system implementation, prototyping and
experiments
Our measure of success:
• Visibility in international research community
• Conference/journal papers + software tools 22
![Page 23: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/23.jpg)
Network Coding Storage Research
Research results published in top systems
journals/conferences • e.g., TC, TPDS, FAST, DSN, INFOCOM, MSST, SRDS
Open-source software released
Publications and source code:
• http://www.cse.cuhk.edu.hk/~pclee
23
![Page 24: Network Coding Distributed Storage...• Detect any corrupted data chunks stored on cloud servers Fault tolerance • Tolerate any cloud server failures Efficient recovery • Recover](https://reader036.vdocuments.us/reader036/viewer/2022070819/5f1a2906a0a0a407410286f6/html5/thumbnails/24.jpg)
Thank you!
24