A triple erasure Reed-Solomon code, and fast rebuilding
Mark Manasse, Chandu Thekkath
Microsoft Research - Silicon Valley
Alice Silverberg
Ohio State University
04/10/23
Motivation
Large-scale storage systems can be expensive to build and maintain
Erasure codes reduce the system costs below those of mirroring
Erasure codes increase the complexity of recovering from failure
In this talk, we present
  Construction of a triple-erasure correcting code
  Fast and agile computation for erasure recovery
A triple erasure correcting code
Galois fields
Vandermonde matrices
  Definition and determinant
  Inductive proof of determinant formula
Reed-Solomon Erasure Codes
  Existing practice
  Simplified construction for up to three erasures
    Definition
    Handling 0 or 1 data erasures
    Handling 2 or 3 data erasures
    Why it stops at three erasures, and works only for GF(2^k)
Galois fields
The Galois field of order p^k (for p prime) is formed by considering polynomials in (Z/pZ)[x] modulo a primitive polynomial of degree k.
Facts
  x is a generator of the field's multiplicative group (because of primitivity).
  Any primitive polynomial will do; all the resulting fields are isomorphic.
  We write GF(p^k) to denote one such field.
  Everything you know about algebra is still true.
In practice, we’ll be interested only in GF(2^(8k)), so multiple bytes turn into equivalent-length groups of bytes
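As an illustration of the construction above, here is a minimal Python sketch of GF(2^8) arithmetic. The polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumed choice of primitive polynomial, not one named in the talk, and the helper names `gf_mul` and `powers_of_x` are hypothetical.

```python
# GF(2^8) arithmetic: a byte is a polynomial over GF(2), reduced modulo an
# assumed primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
PRIMITIVE_POLY = 0x11D


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements (shift-and-add with reduction)."""
    product = 0
    while b:
        if b & 1:
            product ^= a          # addition in GF(2^k) is XOR
        a <<= 1
        if a & 0x100:             # degree reached 8, so reduce
            a ^= PRIMITIVE_POLY
        b >>= 1
    return product


def powers_of_x() -> list:
    """Return x^0, x^1, ..., x^254, where x is the byte 0x02."""
    out, value = [], 1
    for _ in range(255):
        out.append(value)
        value = gf_mul(value, 2)
    return out


powers = powers_of_x()
# True when x reaches every nonzero element, i.e. x generates the group.
is_generator = len(set(powers)) == 255
```

With a primitive polynomial, the 255 powers of x are all distinct, which is the property the later slides rely on.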
Vandermonde matrices
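A Vandermonde matrix V_k is of the form
\[
V_k = \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_0 & x_1 & x_2 & \cdots & x_k \\
x_0^2 & x_1^2 & x_2^2 & \cdots & x_k^2 \\
\vdots & \vdots & \vdots & & \vdots \\
x_0^k & x_1^k & x_2^k & \cdots & x_k^k
\end{pmatrix}
\]
and has determinant
\[
\det(V_k) = \prod_{0 \le i < j \le k} (x_j - x_i)
= (x_1 - x_0)(x_2 - x_0)(x_2 - x_1) \cdots (x_k - x_0)(x_k - x_1) \cdots (x_k - x_{k-1})
\]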
Inductive step proving the determinant of a Vandermonde matrix is the product of the differences.
\[
\det(V_k) = \det\left(
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
-x_0 & 1 & 0 & \cdots & 0 & 0 \\
0 & -x_0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & -x_0 & 1
\end{pmatrix}
V_k \right)
\]
\[
= \det \begin{pmatrix}
1 & 1 & \cdots & 1 \\
0 & x_1 - x_0 & \cdots & x_k - x_0 \\
0 & x_1 (x_1 - x_0) & \cdots & x_k (x_k - x_0) \\
\vdots & \vdots & & \vdots \\
0 & x_1^{k-1} (x_1 - x_0) & \cdots & x_k^{k-1} (x_k - x_0)
\end{pmatrix}
= (x_1 - x_0)(x_2 - x_0) \cdots (x_k - x_0) \det(\hat{V}_{k-1})
\]
The lower-bidiagonal matrix multiplying V_k has determinant 1.
Expand on first column; after removing common factors from second through last entries in each column, what’s left is \hat{V}_{k-1}, that is, V_{k-1} with shifted variables x_1, ..., x_k.
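The determinant formula can be spot-checked numerically. The sketch below works over the integers rather than a Galois field, purely for illustration; the function names and the sample values 2, 3, 5, 7 are assumptions.

```python
from itertools import combinations, permutations
from math import prod


def det(matrix):
    """Determinant by the Leibniz permutation expansion (tiny matrices only)."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        inversions = sum(1 for i, j in combinations(range(n), 2) if perm[i] > perm[j])
        sign = -1 if inversions % 2 else 1
        total += sign * prod(matrix[row][perm[row]] for row in range(n))
    return total


def vandermonde(xs):
    """Row r holds the r-th powers: entry (r, j) is xs[j] ** r."""
    return [[x ** r for x in xs] for r in range(len(xs))]


def product_of_differences(xs):
    """Product of (x_j - x_i) over all pairs i < j."""
    return prod(xs[j] - xs[i] for i, j in combinations(range(len(xs)), 2))


xs = [2, 3, 5, 7]
lhs = det(vandermonde(xs))
rhs = product_of_differences(xs)
```

Both sides evaluate to the same integer for these sample values, and the product is nonzero exactly when the x values are distinct.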
Reed-Solomon Erasure Codes
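1. We use an (n+k)×n coding matrix to store data on n data disks and k check disks. (k=3 in our example)
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & \cdots & x^{n-1} \\
1 & x^2 & x^4 & \cdots & x^{2(n-1)}
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]
2. Suppose data disks 2,3 and check disk 3 fail.
3. Omitting failed rows, we get an invertible n×n matrix R.
\[
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & x^3 & x^4 & \cdots & x^{n-1}
\end{pmatrix}}_{R}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_4 \\ d_5 \\ \vdots \\ d_n \\ c_1 \\ c_2 \end{pmatrix}
\]
4. Multiplying both sides by R^{-1}, we recover all the data.
\[
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \\ \vdots \\ d_n \end{pmatrix}
= R^{-1}
\begin{pmatrix} d_1 \\ d_4 \\ d_5 \\ \vdots \\ d_n \\ c_1 \\ c_2 \end{pmatrix}
\]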
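The four steps above can be sketched end to end in Python. This is a minimal illustration, not the authors' implementation: the polynomial 0x11D, the generator x = 2, the value n = 6, the sample data bytes, and all function names are assumptions.

```python
PRIMITIVE_POLY = 0x11D  # assumed primitive polynomial for GF(2^8)


def gf_mul(a, b):
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIMITIVE_POLY
        b >>= 1
    return product


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    return gf_pow(a, 254)  # a^255 = 1 for nonzero a, so a^254 is the inverse


def coding_matrix(n):
    """n identity rows, then three check rows with entries 1, x^j, x^(2j)."""
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    checks = [[gf_pow(gf_pow(2, r), j) for j in range(n)] for r in range(3)]
    return identity + checks


def mat_vec(matrix, vec):
    out = []
    for row in matrix:
        acc = 0
        for coeff, value in zip(row, vec):
            acc ^= gf_mul(coeff, value)   # sums in GF(2^k) are XOR
        out.append(acc)
    return out


def invert(matrix):
    """Gauss-Jordan elimination over GF(2^8); fails if the matrix is singular."""
    n = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


n = 6
data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
matrix = coding_matrix(n)
stored = mat_vec(matrix, data)             # d_1..d_n followed by c_1, c_2, c_3

failed = {1, 2, n + 2}                     # data disks 2, 3 and check disk 3 (0-based rows)
surviving = [r for r in range(n + 3) if r not in failed]
R = [matrix[r] for r in surviving]         # step 3: omit the failed rows
recovered = mat_vec(invert(R), [stored[r] for r in surviving])  # step 4
```

A general matrix inversion is used here for clarity; the later slides argue that for this particular matrix only a small minor ever needs to be inverted.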
Existing practice
The use of the identity in the top of the matrix makes the code systematic, which means that data encodes itself
Typically, one takes a matrix with the right properties for the invertibility of submatrices (like a Vandermonde or Cauchy matrix) and diagonalizes it
This produces a matrix, hard to remember or invert, limited to n+k < 257 in GF(256)
A simple trick extends to n+k < 258
A simple triple-erasure code
The matrix below is simple: an n×n identity matrix for n < 256, and the first three rows of a transposed Vandermonde matrix of size 3×n, using 1, x, and x^2, where x is any generator of the multiplicative group
For k=3, in GF(256), we need n+k < 259
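\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & \cdots & x^{n-1} \\
1 & x^2 & x^4 & \cdots & x^{2(n-1)}
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]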
General invertibility background
Consider the matrix after deleting 3 rows
To check invertibility, test the determinant
To compute the determinant
Most rows will contain all zeroes, except for a one in what used to be the diagonal element
Expanding along such a row, we get (up to sign), that the determinant is the determinant of the minor excluding the one’s row and column
Handling 0 or 1 data erasures
If the 3 deleted rows are the check rows, we know how to compute the check values from the data values
Otherwise, what remains is a minor of the Vandermonde rows
If 2 deleted rows are check rows, the remaining minor is a single element, which is a power of x, hence non-zero.
Handling 2 or 3 data erasures, and beyond
If the deleted rows are data rows a, b, and c, the minor
\[
\begin{pmatrix}
1 & 1 & 1 \\
x^a & x^b & x^c \\
x^{2a} & x^{2b} & x^{2c}
\end{pmatrix}
\]
is a 3×3 Vandermonde matrix, which is invertible
If one deleted row is a check row, and the others are rows a and b, possible minors are displayed:
\[
\begin{pmatrix} 1 & 1 \\ x^a & x^b \end{pmatrix},
\quad
\begin{pmatrix} x^a & x^b \\ x^{2a} & x^{2b} \end{pmatrix},
\quad \text{or} \quad
\begin{pmatrix} 1 & 1 \\ x^{2a} & x^{2b} \end{pmatrix}
\]
  The first is Vandermonde, as is the second, after factoring out x^a and x^b
  The third is Vandermonde, but we need to show that x^(2a) and x^(2b) differ
    In GF(2^k), the order of the multiplicative group is 2^k - 1, relatively prime to 2, so they do
    In other characteristics, 1 has two square roots, so we have to keep b - a small
If we had added more than three check rows, a 3×3 minor generally would not be Vandermonde, and it’s not hard to construct non-invertible minors
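Both claims on this slide can be checked in GF(2^8) with a short Python sketch. The fourth check row built from x^3 is a hypothetical extension used only to exhibit a non-invertible minor; the polynomial 0x11D, the generator x = 2, and the function names are assumptions.

```python
PRIMITIVE_POLY = 0x11D  # assumed primitive polynomial for GF(2^8)


def gf_mul(a, b):
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIMITIVE_POLY
        b >>= 1
    return product


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


x = 2

# Third check row (entries x^(2a)): since 255 is odd, squaring permutes the
# nonzero elements, so all 255 entries are distinct and every minor
# [[1, 1], [x^(2a), x^(2b)]] with a != b is invertible.
squares = [gf_pow(x, 2 * a) for a in range(255)]
squares_distinct = len(set(squares)) == 255

# Hypothetical fourth check row (entries x^(3a)): 3 divides 255, so columns
# a and a + 85 hold the same value and the 2x2 minor with the all-ones row
# is singular.
a, b = 0, 85
cube_a, cube_b = gf_pow(x, 3 * a), gf_pow(x, 3 * b)
det_2x2 = cube_a ^ cube_b      # 1 * x^(3b) - 1 * x^(3a); subtraction is XOR
```

The collision needs n > 85 data disks, so a shorter code could survive with this particular fourth row; the sketch only shows that the general argument breaks.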
Fast and agile computation for erasure recovery
Reed-Solomon reconstruction
For each failed disk, the matrix multiplication resolves to a dot-product
If each data source (data disk or check disk) has an associated processor, the multiplications can be performed locally
Accumulating the sum in GF(2^k) is just exclusive-or
Want high throughput (so disks are rebuilt quickly) and low latency (so blocks can be delivered on demand, when necessary)
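To make the dot-product view concrete, the sketch below rebuilds one lost data block from the surviving data blocks and the check block c_2 (the row of powers of x). Each source scales its own block locally, and the results are combined by XOR. The single-failure scenario, the sample blocks, the polynomial 0x11D, and the function names are all assumptions chosen for illustration.

```python
from functools import reduce

PRIMITIVE_POLY = 0x11D  # assumed primitive polynomial for GF(2^8)


def gf_mul(a, b):
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIMITIVE_POLY
        b >>= 1
    return product


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def scale_block(coeff, block):
    """Work done locally at one source: multiply every byte by its coefficient."""
    return bytes(gf_mul(coeff, byte) for byte in block)


def xor_blocks(left, right):
    """Accumulation step: the sum of two blocks in GF(2^k) is a bytewise XOR."""
    return bytes(p ^ q for p, q in zip(left, right))


data = [b"disk-0..", b"disk-1..", b"disk-2..", b"disk-3.."]
# Check block c_2 = sum over i of x^i * d_i, with x = 2.
c2 = reduce(xor_blocks, (scale_block(gf_pow(2, i), d) for i, d in enumerate(data)))

lost = 2
inv = gf_pow(2, 255 - lost)                # x^(-lost), since x^255 = 1
# Each surviving source contributes its block times its dot-product coefficient.
contributions = [scale_block(gf_mul(inv, gf_pow(2, i)), d)
                 for i, d in enumerate(data) if i != lost]
contributions.append(scale_block(inv, c2))
rebuilt = reduce(xor_blocks, contributions)
```

Because XOR is associative and commutative, the order in which the contributions are combined is free, which is what the communication patterns in the following slides exploit.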
Computational environment
In what follows, we assume that we have a synchronous network of processors, each with an array of data packets
In each time step, each processor can
  Receive one packet, and XOR the contents with a known packet for the same array index
  Send one packet to another processor or to the final destination
High throughput, but high latency
A bucket brigade of n processors has unit throughput, but linear latency
  On step i+k, processor i sends accumulated packet k to processor i+1, and receives packet k+1 from processor i-1, adding the received value to known packet k+1
  Processor 0 only sends
  Processor n is the destination
After n steps of latency, processor n receives one packet per step
[Figure: chain of nodes, Node 0 → Node 1 → Node 2 → … → Node n-2 → Node n-1 → Node n (sink)]
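The bucket-brigade schedule can be simulated in a few lines of Python. This is a sketch under the synchronous model described earlier; packets are modeled as integers, steps are numbered from 0, and the function name `bucket_brigade` is hypothetical.

```python
def bucket_brigade(packets_by_node):
    """Simulate the chain: packets_by_node[i][k] is node i's known packet k.

    Returns {step: (k, value)} for every packet delivered to node n.
    """
    n = len(packets_by_node)
    count = len(packets_by_node[0])
    acc = [list(p) for p in packets_by_node]   # acc[i][k] is the running XOR
    delivered = {}
    for step in range(n - 1 + count):
        sends = []
        for i in range(n):
            k = step - i                       # node i handles packet k on step i + k
            if 0 <= k < count:
                sends.append((i + 1, k, acc[i][k]))
        for dest, k, value in sends:           # all sends in a step happen together
            if dest == n:
                delivered[step] = (k, value)
            else:
                acc[dest][k] ^= value
    return delivered


packets = [[1, 2, 3], [4, 8, 16], [32, 64, 128], [256, 512, 1024]]
delivered = bucket_brigade(packets)
```

With four source nodes, the first packet reaches the destination on the fourth step (step index 3), and one further packet arrives on each step after that.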
Low latency, but low throughput
Build an in-place binary tree
  Let n = 2^k
  For i<k, on step rk+i, node 2^i(2s+1) sends packet r to node 2^(i+1)s
  On step k(r+1), node 0 sends packet r to destination node n
Latency log n + 1, throughput 1/k, i.e. 1/log n
Easy doubling of throughput by sending even blocks down, and odd blocks up (since at least half of nodes only send or only receive at each step)
[Figure: in-place binary tree on Node 0 through Node 7 (n = 8, k = 3); the four lowest-level edges are labeled Step 0,3,…, the two middle edges Step 1,4,…, the top edge Steps 2,5,…, and the edge to the destination Steps 3,6,…]
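The in-place tree schedule can be checked with a small simulation. The sketch below follows the step formulas on this slide, with steps numbered from 0 and integer packets; it assumes n is a power of two with n at least 2, and the function name `in_place_tree` is hypothetical.

```python
def in_place_tree(packets_by_node):
    """Simulate the in-place binary tree for n = 2**k source nodes.

    Returns {step: (r, value)} for every packet delivered to node n.
    """
    n = len(packets_by_node)
    k = n.bit_length() - 1                     # n = 2**k, assumed k >= 1
    count = len(packets_by_node[0])
    acc = [list(p) for p in packets_by_node]
    delivered = {}
    for step in range(k * count + 1):
        r, i = divmod(step, k)                 # step = r*k + i
        moves = []
        if r < count:
            for s in range(n >> (i + 1)):
                src = (1 << i) * (2 * s + 1)
                dst = (1 << (i + 1)) * s
                moves.append((src, dst, r))
        if i == 0 and 1 <= r <= count:         # step k*(r'+1) delivers packet r' = r-1
            moves.append((0, n, r - 1))
        # Read every outgoing value before applying any receive in this step.
        in_flight = [(dst, block, acc[src][block]) for src, dst, block in moves]
        for dst, block, value in in_flight:
            if dst == n:
                delivered[step] = (block, value)
            else:
                acc[dst][block] ^= value
    return delivered


packets = [[(i + 1) * 17 + r for r in range(3)] for i in range(8)]
delivered = in_place_tree(packets)
```

For eight nodes the deliveries land on steps 3, 6, and 9, matching the latency and the 1/k throughput stated above.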
Moderate throughput, moderate latency
Instead of an in-place binary tree, use a rooted binary tree
  On step 2(k+l), node 2^l(4s+1) sends packet k to node 2^l(4s+2)
  On step 2(k+l)+1, node 2^l(4s+3) sends packet k to node 2^l(4s+2)
Throughput ½, latency 2 log n, for n = 2^k - 1
Output every other step, because of input limits
[Figure: rooted binary tree on Node 1 through Node 7 with root Node 4; the edges into Node 2 and Node 6 are labeled Steps 0,2,… and Steps 1,3,…; the edges into Node 4 are labeled Steps 2,4,… and Steps 3,5,…]
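A similar simulation covers the rooted tree. The sketch below applies the two step formulas on this slide to nodes numbered 1 to n with n = 2^m - 1 (m is a name introduced here to avoid clashing with the packet index k), and records the step on which the root has received both children's contributions for each packet. The final hop from the root to the destination is not modeled, since the slide does not spell out its timing.

```python
def rooted_tree(packets_by_node):
    """Simulate the rooted binary tree; packets_by_node[j] belongs to node j+1.

    Returns (completed, totals): completed[k] is the step on which the root
    holds the full XOR for packet k, and totals[k] is that XOR.
    """
    n = len(packets_by_node)
    m = (n + 1).bit_length() - 1               # n = 2**m - 1, assumed m >= 2
    count = len(packets_by_node[0])
    acc = {node: list(packets_by_node[node - 1]) for node in range(1, n + 1)}
    root = 1 << (m - 1)
    completed = {}
    last_step = 2 * (count - 1 + m - 2) + 1
    for step in range(last_step + 1):
        half, odd = divmod(step, 2)
        sends = []
        for level in range(m - 1):
            k = half - level                   # step = 2*(k + level) + odd
            if 0 <= k < count:
                offset = 3 if odd else 1       # right child on odd steps, left on even
                for s in range(1 << (m - 2 - level)):
                    src = (1 << level) * (4 * s + offset)
                    dst = (1 << level) * (4 * s + 2)
                    sends.append((dst, k, acc[src][k]))
        for dst, k, value in sends:
            acc[dst][k] ^= value
            if dst == root and odd:            # second child has now arrived
                completed[k] = step
    return completed, [acc[root][k] for k in range(count)]


packets = [[(j + 1) * 29 + 5 * r for r in range(3)] for j in range(7)]
completed, totals = rooted_tree(packets)
```

For seven nodes the root completes one packet every second step, which is the throughput of ½ stated above.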
General observations
The patterns of communication described so far all combine the values of consecutive nodes
  Statically known if incoming block contains values from higher numbered nodes or lower numbered, so can apply XOR left-to-right
Not interesting for a commutative operator like XOR, but this can apply to non-commutative monoids (which don’t arise in erasure codes, but are cool anyway)
Recursive construction: base case
For one node, on step i, send block i from node 0 to destination node 1; this is G_0
For two nodes, on step i, send block i from node 0 to node 1. On step i+1, send block i from node 1 to node 2
  Denote edges in graph as 4-tuples <step, block, source, destination>
  Graph G_1 is {<i, i, 0, 1>, <i+1, i, 1, 2>}
[Figure: Node 0 → Node 1 → Node 2; the edge from Node 0 to Node 1 carries blocks 0,1,2,… on steps 0,1,2,…; the edge from Node 1 to Node 2 carries blocks 0,1,2,… on steps 1,2,3,…]
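The 4-tuple notation lends itself to a generic schedule runner. The sketch below builds the edges of G_1 for a finite number of blocks and executes them under the receive-and-XOR model; the function names `g1_edges` and `run_schedule` and the sample packets are assumptions.

```python
def g1_edges(block_count):
    """Edges <step, block, source, destination> of G_1 for blocks 0..block_count-1."""
    edges = []
    for i in range(block_count):
        edges.append((i, i, 0, 1))        # node 0 sends block i to node 1 on step i
        edges.append((i + 1, i, 1, 2))    # node 1 forwards block i to node 2 on step i+1
    return edges


def run_schedule(edges, packets_by_node, destination):
    """Execute a schedule; returns {block: (step, value)} seen at the destination."""
    acc = [list(p) for p in packets_by_node]
    delivered = {}
    for step in sorted({edge[0] for edge in edges}):
        # Read every outgoing value before applying any receive in this step.
        active = [(block, dst, acc[src][block])
                  for (s, block, src, dst) in edges if s == step]
        for block, dst, value in active:
            if dst == destination:
                delivered[block] = (step, value)
            else:
                acc[dst][block] ^= value
    return delivered


packets = [[1, 2, 3], [16, 32, 64]]       # node 0 and node 1, three blocks each
delivered = run_schedule(g1_edges(3), packets, destination=2)
```

Each block reaches node 2 one step after it leaves node 0, and a new block arrives on every step.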
Inductive hypotheses for Gk
Nodes from 0 to n = 2^k
Node 0 is only a source; <i, i, 0, 1> is in G_k for all i (recall: step, block, source, dest)
Node n is only a destination; <i+k+1, i, s, n> is in G_k, so log n + 1 delay, full throughput
If <i, j, s, d> is in G_k, for d < n, then for some t and u ≠ d, but ⌊u/2⌋ = ⌊d/2⌋, either
  <i+1, j, t, d> and <i+1, i+1, d, u> are in G_k, or
  <i+1, j, d, t> and <i+1, i+1, u, d> are in G_k
For all blocks, the edges form an unrooted binary tree; the k-level descendants of a node have node numbers matching the first k bits of the node
[Figure: example graph on Node 0, Node 1, Node 2, Node 3, and Node 4]
Recursive construction: doubling up
Given G_k, produce G_{k+1} by doubling the number of nodes to 2n.
Add edges <0, 0, 2s, 2s+1> for s < n
Looping over i, for every edge <i, b, s, d> in G_k, add an edge to G_{k+1}
  For {s1, s2} = {2s, 2s+1} (and similarly d1, d2, d), G_{k+1} includes <i, b, s2, s1> and <i, b, d1, d2> (unless d = n, when d1 = 2n, and d2 is irrelevant)
  Add <i+1, b, s1, d1>, and <i+1, i+1, s2, s1>
Since every node < n is a source in G_k, all pairs will be connected in some direction in step i+1
[Figure: nodes 0, 1, 2, 3 and destination Node 4, drawn once for steps i = 0,2,4,… and once for steps i = 1,3,5,…; edges carry Block i-2 and Block i-1, with Block i inside every bubble]
Chains / step but in-place trees / block
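[Figure, left panels: nodes 0, 1, 2, 3 and destination Node 4, drawn for steps i = 0,2,4,… and steps i = 1,3,5,…; within one step the edges form chains carrying Block i-2 and Block i-1, with Block i inside every bubble]
[Figure, right panels: the same nodes, drawn for blocks i = 0,2,4,… and blocks i = 1,3,5,…; for one block the edges form an in-place tree used on Step i+1 and Step i+2, with Step i inside every bubble]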
High throughput, low latency
From that recursive construction, we’ve doubled the number of nodes
We sometimes have to add on the left and sometimes on the right, but the inputs accumulated on any input step are always a contiguous subset adjacent to the contiguous subset currently known to the destination, so associativity is sufficient
2i and 2i+1 are always linked for block b at step b; if we condense some of these nodes, we can reduce the number of nodes to get non-powers of 2
Further results
Current patterns of communication repeat every 2log log n blocks
We have alternative constructions with slightly worse latency, but full throughput, that are much simpler (repeating patterns every 2 or 3 steps)
  These constructions require commutativity
Generalizations of rooted tree constructions, improving throughput