A triple erasure Reed-Solomon code, and fast rebuilding

Mark Manasse, Chandu Thekkath
Microsoft Research - Silicon Valley

Alice Silverberg
Ohio State University

04/10/23

Motivation

Large-scale storage systems can be expensive to build and maintain

Erasure codes reduce the system costs below those of mirroring

Erasure codes increase the complexity of recovering from failure

In this talk, we present
  Construction of a triple-erasure correcting code
  Fast and agile computation for erasure recovery

A triple erasure correcting code

Galois fields
Vandermonde matrices
  Definition and determinant
  Inductive proof of determinant formula
Reed-Solomon Erasure Codes
  Existing practice
  Simplified construction for up to three erasures
    Definition
    Handling 0 or 1 data erasures
    Handling 2 or 3 data erasures
    Why it stops at three erasures, and works only for GF(2^k)

Galois fields

The Galois field of order p^k (for p prime) is formed by considering polynomials in (Z/pZ)[x] modulo a primitive polynomial of degree k.

Facts
  x is a generator of the multiplicative group of the field (because of primitivity).
  Any primitive polynomial will do; all the resulting fields are isomorphic.
  We write GF(p^k) to denote one such field.
  Everything you know about algebra is still true.

In practice, we’ll be interested only in GF(2^{8k}), so multiple bytes turn into equivalent-length groups of bytes.
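To make the field arithmetic concrete, here is a minimal sketch of multiplication in GF(2^8), with each element stored as one byte. The slides do not name a primitive polynomial, so the choice of x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is an assumption, as are the helper names `gf_mul` and `powers_of_x`.

```python
# Assumption: GF(2^8) is built from the primitive polynomial
# x^8 + x^4 + x^3 + x^2 + 1 (0x11D); the slides do not specify one.
PRIM_POLY = 0x11D

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(256) elements, each stored as one byte."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in characteristic 2 is XOR
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY       # reduce modulo the primitive polynomial
        b >>= 1
    return result

def powers_of_x():
    """Return x^0, x^1, ..., x^254, where x is the byte 0x02."""
    out, value = [], 1
    for _ in range(255):
        out.append(value)
        value = gf_mul(value, 2)
    return out

powers = powers_of_x()
# If x generates the multiplicative group, its 255 powers are all distinct.
print(len(set(powers)))
```

With this polynomial the 255 powers of x are pairwise distinct, which is the generator property listed under Facts.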

Vandermonde matrices

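A Vandermonde matrix V_k is of the form

\[
V_k = \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_0 & x_1 & x_2 & \cdots & x_k \\
x_0^2 & x_1^2 & x_2^2 & \cdots & x_k^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_0^k & x_1^k & x_2^k & \cdots & x_k^k
\end{pmatrix}
\]

and has determinant

\[
\det(V_k) = \prod_{0 \le i < j \le k} (x_j - x_i)
= (x_1 - x_0)(x_2 - x_0)(x_2 - x_1) \cdots (x_k - x_0)(x_k - x_1) \cdots (x_k - x_{k-1})
\]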

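Inductive step proving the determinant of a Vandermonde matrix is the product of the differences.

\[
\det(V_k) = \det\left(
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
-x_0 & 1 & 0 & \cdots & 0 \\
0 & -x_0 & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & -x_0 & 1
\end{pmatrix}
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_0 & x_1 & \cdots & x_k \\
x_0^2 & x_1^2 & \cdots & x_k^2 \\
\vdots & \vdots & \ddots & \vdots \\
x_0^k & x_1^k & \cdots & x_k^k
\end{pmatrix}
\right)
\]

Determinant here is 1 (for the left-hand factor, which is lower triangular with ones on the diagonal).

\[
= \det\begin{pmatrix}
1 & 1 & \cdots & 1 \\
0 & (x_1 - x_0) \cdot 1 & \cdots & (x_k - x_0) \cdot 1 \\
0 & (x_1 - x_0)\, x_1 & \cdots & (x_k - x_0)\, x_k \\
\vdots & \vdots & \ddots & \vdots \\
0 & (x_1 - x_0)\, x_1^{k-1} & \cdots & (x_k - x_0)\, x_k^{k-1}
\end{pmatrix}
= (x_1 - x_0)(x_2 - x_0) \cdots (x_k - x_0) \det(\hat{V}_{k-1})
\]

Expand on first column; after removing common factors from second through last entries in each column, what’s left is \hat{V}_{k-1}, that is, V_{k-1} with shifted variables x_1, ..., x_k.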
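The determinant formula can be checked numerically on a small case. The following sketch works over the integers rather than a Galois field, and the sample values and helper names (`det`, `vandermonde`) are illustrative assumptions, not part of the talk.

```python
from itertools import combinations, permutations
from math import prod

def det(matrix):
    """Determinant by the Leibniz permutation formula (fine for tiny matrices)."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        inversions = sum(1 for i, j in combinations(range(n), 2) if perm[i] > perm[j])
        sign = -1 if inversions % 2 else 1
        total += sign * prod(matrix[row][perm[row]] for row in range(n))
    return total

def vandermonde(xs):
    """Row r holds the r-th powers of x_0, ..., x_k."""
    return [[x ** r for x in xs] for r in range(len(xs))]

xs = [2, 3, 5, 7]   # arbitrary sample values for x_0..x_3
lhs = det(vandermonde(xs))
rhs = prod(xs[j] - xs[i] for i, j in combinations(range(len(xs)), 2))
print(lhs, rhs)
```

Both numbers agree, as the product-of-differences formula predicts; the same identity holds in any field, which is what the erasure code relies on.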

Reed-Solomon Erasure Codes

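1. We use an (n+k)×n coding matrix to store data on n data disks and k check disks (k=3 in our example).

\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & \cdots & x^{n-1} \\
1 & x^2 & x^4 & \cdots & x^{2(n-1)}
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]

2. Suppose data disks 2, 3 and check disk 3 fail.

3. Omitting failed rows, we get an invertible n×n matrix R.

\[
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & x^3 & x^4 & \cdots & x^{n-1}
\end{pmatrix}}_{R}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_4 \\ d_5 \\ \vdots \\ d_n \\ c_1 \\ c_2 \end{pmatrix}
\]

4. Multiplying both sides by R^{-1}, we recover all the data.

\[
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
= R^{-1}
\begin{pmatrix} d_1 \\ d_4 \\ d_5 \\ \vdots \\ d_n \\ c_1 \\ c_2 \end{pmatrix}
\]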
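The four steps can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the authors' implementation: GF(256) is built from the polynomial 0x11D, x is the byte 0x02, each disk holds a single byte, and all helper names are hypothetical. The inverse of R is applied implicitly by Gauss-Jordan elimination.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial for GF(256)

def gf_mul(a, b):
    """Multiply two GF(256) elements stored as bytes."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    return gf_pow(a, 254)   # a^255 = 1 for every nonzero a

def coding_matrix(n):
    """n identity rows, then check rows of 1, x^j, and x^(2j) for j = 0..n-1."""
    identity = [[int(r == c) for c in range(n)] for r in range(n)]
    checks = [[gf_pow(2, m * j) for j in range(n)] for m in range(3)]
    return identity + checks

def mat_vec(rows, vec):
    out = []
    for row in rows:
        acc = 0
        for coeff, value in zip(row, vec):
            acc ^= gf_mul(coeff, value)   # sums in GF(2^k) are XORs
        out.append(acc)
    return out

def solve(rows, rhs):
    """Gauss-Jordan elimination over GF(256): return d with rows * d = rhs."""
    n = len(rows)
    aug = [list(row) + [value] for row, value in zip(rows, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

data = [10, 20, 30, 40, 50, 60]            # one byte per data disk
n = len(data)
matrix = coding_matrix(n)
stored = mat_vec(matrix, data)             # d_1..d_n followed by c_1, c_2, c_3
failed = {1, 2, n + 2}                     # data disks 2, 3 and check disk 3 (0-based rows)
survivors = [r for r in range(n + 3) if r not in failed]
R = [matrix[r] for r in survivors]         # step 3: omit the failed rows
recovered = solve(R, [stored[r] for r in survivors])   # step 4
print(recovered == data)
```

A production system would operate on whole blocks and use table-driven multiplication; the byte-at-a-time loop here is only meant to mirror the matrix picture.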

Existing practice

The use of the identity in the top of the matrix makes the code systematic, which means that data encodes itself

Typically, one takes a matrix with the right properties for the invertibility of submatrices (like a Vandermonde or Cauchy matrix) and diagonalizes it

This produces a matrix, hard to remember or invert, limited to n+k < 257 in GF(256)

A simple trick extends to n+k < 258

A simple triple-erasure code

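The matrix below is simple: an n×n identity matrix for n < 256, and the first three rows of a transposed Vandermonde matrix of size 3×n, using 1, x, and x^2, where x is any generator of the multiplicative group

For k=3, in GF(256), we need n+k < 259

\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 1 & 1 & \cdots & 1 \\
1 & x & x^2 & \cdots & x^{n-1} \\
1 & x^2 & x^4 & \cdots & x^{2(n-1)}
\end{pmatrix}
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]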

General invertibility background

Consider the matrix after deleting 3 rows
To check invertibility, test the determinant
To compute the determinant
  Most rows will contain all zeroes, except for a one in what used to be the diagonal element
  Expanding along such a row, we get (up to sign) that the determinant is the determinant of the minor excluding the one’s row and column

Handling 0 or 1 data erasures

If the 3 deleted rows are the check rows, we know how to compute the check values from the data values

Otherwise, what remains is a minor of the Vandermonde rows

If 2 deleted rows are check rows, the remaining minor is a single element, which is a power of x, hence non-zero.

Handling 2 or 3 data erasures, and beyond

If the deleted rows are data rows a, b, and c, the minor

\[
\begin{pmatrix}
1 & 1 & 1 \\
x^a & x^b & x^c \\
x^{2a} & x^{2b} & x^{2c}
\end{pmatrix}
\]

is a 3×3 Vandermonde matrix, which is invertible

If one deleted row is a check row, and the others are rows a and b, possible minors are displayed:

\[
\begin{pmatrix} 1 & 1 \\ x^a & x^b \end{pmatrix},
\quad
\begin{pmatrix} x^a & x^b \\ x^{2a} & x^{2b} \end{pmatrix},
\quad \text{or} \quad
\begin{pmatrix} 1 & 1 \\ x^{2a} & x^{2b} \end{pmatrix}
\]

  The first is Vandermonde, as is the second, after factoring out x^a and x^b
  The third is Vandermonde, but we need to show that x^{2a} and x^{2b} differ
    In GF(2^k), the order of the multiplicative group is 2^k-1, relatively prime to 2, so they do
    In other characteristics, 1 has two square roots, so we have to keep b - a small

If we had added more than three check rows, a 3×3 minor generally would not be Vandermonde, and it’s not hard to construct non-invertible minors
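The claim about x^{2a} and x^{2b} can be checked directly. The sketch below compares GF(256) (built, by assumption, from the polynomial 0x11D with x = 0x02) against the integers mod 257 with generator 3, chosen here only as a convenient stand-in for a field of odd characteristic.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial for GF(256)

def gf_mul(a, b):
    """Multiply two GF(256) elements stored as bytes."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def squares_in_gf256(count):
    """Return x^(2a) for a = 0..count-1, with x = 0x02."""
    out, value = [], 1
    x_squared = gf_mul(2, 2)
    for _ in range(count):
        out.append(value)
        value = gf_mul(value, x_squared)
    return out

# Characteristic 2: the group order 255 is odd, so squaring is a bijection.
gf256_distinct = len(set(squares_in_gf256(255)))

# Odd characteristic (illustrative): mod 257 the group order 256 is even,
# so x^(2a) repeats once b - a reaches 128.
mod257 = [pow(3, 2 * a, 257) for a in range(256)]
mod257_distinct = len(set(mod257))
print(gf256_distinct, mod257_distinct)
```

In GF(256) all 255 squares are distinct, so the third minor is always invertible; mod 257 only half of them are, which is why b - a must stay small in odd characteristic.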

Fast and agile computation for erasure recovery

Reed-Solomon reconstruction

For each failed disk, the matrix multiplication resolves to a dot-product

If each data source (data disk or check disk) has an associated processor, the multiplications can be performed locally

Accumulating the sum in GF(2^k) is just exclusive-or

Want high throughput (so disks are rebuilt quickly), low-latency (so blocks can be delivered on demand, when necessary)

Computational environment

In what follows, we assume that we have a synchronous network of processors, each with an array of data packets

In each time step, each processor can
  Receive one packet, and XOR the contents with a known packet for the same array index
  Send one packet to another processor or to the final destination

High throughput, but high latency

A bucket brigade of n processors has unit throughput, but linear latency
  On step i+k, processor i sends accumulated packet k to processor i+1, and receives packet k+1 from processor i-1, adding the received value to known packet k+1
  Processor 0 only sends
  Processor n is the destination

After n steps of latency, processor n receives one packet per step

[Figure: chain Node 0 → Node 1 → Node 2 → … → Node n-2 → Node n-1 → Node n (sink)]
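The bucket-brigade schedule is simple enough to simulate. In this sketch, packets are single integers combined with XOR, and `bucket_brigade`, `known`, `arrivals`, and `results` are hypothetical names introduced for illustration.

```python
def bucket_brigade(known, num_packets):
    """known[i][k] is processor i's local contribution to packet k.

    Returns (arrivals, results): the step on which the sink receives each
    packet, and the XOR of all contributions to that packet.
    """
    n = len(known)                       # processors 0..n-1; processor n is the sink
    acc = [list(row) for row in known]   # running XORs, updated as packets arrive
    arrivals, results = {}, {}
    for step in range(n + num_packets):
        sends = []
        for i in range(n):
            k = step - i                 # on step i+k, processor i sends packet k
            if 0 <= k < num_packets:
                sends.append((i, k, acc[i][k]))
        for i, k, value in sends:
            if i + 1 == n:
                arrivals[k], results[k] = step, value
            else:
                acc[i + 1][k] ^= value   # receiver folds it into its known packet
    return arrivals, results

known = [[(7 * i + 3 * k) % 256 for k in range(5)] for i in range(4)]
arrivals, results = bucket_brigade(known, 5)
print(arrivals)
```

With four processors, packet k reaches the sink on step 3 + k: one packet per step once the pipeline has filled, at the cost of latency linear in n.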

Low latency, but low throughput

Build an in-place binary tree
  Let n = 2^k
  For i<k, on step rk+i, node 2^i(2s+1) sends packet r to node 2^{i+1}s
  On step k(r+1), node 0 sends packet r to destination node n

Latency log n+1, throughput 1/k, i.e. 1/log n

Easy doubling of throughput by sending even blocks down, and odd blocks up (since at least half of nodes only send or only receive at each step)

[Figure: in-place binary tree on Node 0 through Node 7. Edges are labeled with the steps on which they are used: Step 0,3,…; Step 1,4,…; Steps 2,5,…; and Steps 3,6,….]
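The in-place tree schedule can be written out as explicit edges and checked against the one-send, one-receive rule of the computational environment. This is a sketch with hypothetical names (`tree_schedule`, `run`), using single integers as packets.

```python
from collections import Counter

def tree_schedule(k, num_packets):
    """Edges (step, packet, source, destination) for n = 2**k nodes plus sink n."""
    n = 1 << k
    edges = []
    for r in range(num_packets):
        for i in range(k):
            # On step r*k + i, node 2^i (2s+1) sends packet r to node 2^(i+1) s.
            for s in range(n >> (i + 1)):
                edges.append((r * k + i, r, (1 << i) * (2 * s + 1), (1 << (i + 1)) * s))
        # On step k*(r+1), node 0 sends packet r to the destination node n.
        edges.append((k * (r + 1), r, 0, n))
    return sorted(edges)

def run(edges, known):
    """Apply the edges in step order, XOR-ing each sent packet into the receiver."""
    n = len(known)
    acc = [list(row) for row in known]
    delivered = {}
    for step, packet, src, dst in edges:
        if dst == n:
            delivered[packet] = (step, acc[src][packet])
        else:
            acc[dst][packet] ^= acc[src][packet]
    return delivered

k = 3
known = [[(5 * node + packet) % 256 for packet in range(4)] for node in range(1 << k)]
edges = tree_schedule(k, 4)
sends = Counter((step, src) for step, _, src, _ in edges)
receives = Counter((step, dst) for step, _, _, dst in edges)
delivered = run(edges, known)
print(max(sends.values()), max(receives.values()), delivered[0][0])
```

No node sends or receives more than one packet in any step, and packet r is delivered on step k(r+1), which matches the stated throughput of one packet every k steps.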

Moderate throughput, moderate latency

Instead of an in-place binary tree, use a rooted binary tree
  On step 2(k+l), node 2^l(4s+1) sends packet k to node 2^l(4s+2)
  On step 2(k+l)+1, node 2^l(4s+3) sends packet k to node 2^l(4s+2)

Throughput ½, latency 2 log n, for n = 2^k-1

Output every other step, because of input limits

[Figure: rooted binary tree on Node 1 through Node 7. Leaf edges are used on Steps 0,2,… and Steps 1,3,…; the edges into the root are used on Steps 2,4,… and Steps 3,5,….]

General observations

The patterns of communication described so far all combine the values of consecutive nodes
  Statically known if incoming block contains values from higher numbered nodes or lower numbered, so can apply XOR left-to-right

Not interesting for a commutative operator like XOR, but this can apply to non-commutative monoids (which don’t arise in erasure codes, but are cool anyway)

Recursive construction: base case

For one node, on step i, send block i from node 0 to destination node 1; this is G_0

For two nodes, on step i, send block i from node 0 to node 1. On step i+1, send block i from node 1 to node 2
  Denote edges in graph as 4-tuples <step, block, source, destination>
  Graph G_1 is {<i, i, 0, 1>, <i+1, i, 1, 2>}

[Figure: Node 0 → Node 1 on Steps 0,1,2,… carrying Blocks 0,1,2,…; Node 1 → Node 2 on Steps 1,2,3,… carrying Blocks 0,1,2,…]

Inductive hypotheses for G_k

Nodes from 0 to n=2^k

Node 0 is only a source; <i, i, 0, 1> is in G_k for all i (recall: step, block, source, dest)

Node n is only a destination, <i+k+1, i, s, n> in G_k, so log n+1 delay, full throughput

If <i, j, s, d> in G_k, for d < n, then for some t and u ≠ d, but ⌊u/2⌋ = ⌊d/2⌋, either
  <i+1, j, t, d> and <i+1, i+1, d, u> are in G_k or
  <i+1, j, d, t> and <i+1, i+1, u, d> are in G_k

For all blocks, the edges form an unrooted binary tree; the k-level descendants of a node have node numbers matching the first k bits of the node

[Figure: Node 0, Node 1, Node 2, Node 3, and destination Node 4]

Recursive construction: doubling up

Given G_k, produce G_{k+1} by doubling the number of nodes to 2n.
  Add edges <0, 0, 2s, 2s+1> for s<n
  Looping over i, for every edge <i, b, s, d> in G_k, add an edge to G_{k+1}
    For {s1, s2}={2s, 2s+1} (and similarly d1, d2, d), G_{k+1} includes <i, b, s2, s1> and <i, b, d1, d2> (unless d=n, when d1=2n, and d2 is irrelevant)
    Add <i+1, b, s1, d1>, and <i+1, i+1, s2, s1>
    Since every node < n is a source in G_k, all pairs will be connected in some direction in step i+1

[Figure: nodes 0, 1, 2, 3 and destination Node 4, drawn once for steps i=0,2,4,… and once for steps i=1,3,5,…; edges carry Block i-2 and Block i-1, with Block i inside every bubble]

Chains / step but in-place trees / block

[Figure: the same schedule on nodes 0, 1, 2, 3 and destination Node 4, drawn two ways. Per step (steps i=0,2,4,… and steps i=1,3,5,…): edges carry Block i-2 and Block i-1, with Block i inside every bubble. Per block (blocks i=0,2,4,… and blocks i=1,3,5,…): edges are used on Step i+1 and Step i+2, with Step i inside every bubble.]

High throughput, low latency

From that recursive construction, we’ve doubled the number of nodes

We sometimes have to add on the left and sometimes on the right, but the inputs accumulated on any input step are always a contiguous subset adjacent to the contiguous subset currently known to the destination, so associativity is sufficient

2i and 2i+1 are always linked for block b at step b; if we condense some of these nodes, we can reduce the number of nodes to get non-powers of 2

Further results

Current patterns of communication repeat every 2log log n blocks

We have alternative constructions with slightly worse latency, but full throughput, that are much simpler (repeating patterns every 2 or 3 steps)
  These constructions require commutativity

Generalizations of rooted tree constructions, improving throughput
