recovering data in presence of malicious errors atri rudra university at buffalo, suny

Recovering Data in Presence of Malicious Errors

Atri Rudra, University at Buffalo, SUNY

The setup

Diagram: x is encoded as C(x); y = C(x) + error is received; the decoder outputs x or gives up

Mapping C: error-correcting code, or just code
Encoding: x -> C(x)
Decoding: y -> x
C(x) is a codeword

Codes are useful!

Cellphones, satellite broadcast, deep-space communication, Internet, CDs/DVDs, RAID, ECC memory, paper bar-codes

Redundancy vs. Error-correction

Repetition code: repeat every bit, say, 100 times
  Good error-correcting properties
  Too much redundancy
Parity code: add a parity bit
  Minimum amount of redundancy
  Bad error-correcting properties
  Two errors go completely undetected
Neither of these codes is satisfactory

Example: 1 1 1 0 0 1 -> 1 0 0 0 0 1 (two bits flipped, parity unchanged)
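The two toy codes on this slide can be sketched in a few lines. This is only an illustration: the function names are made up here, and the repetition factor is 3 rather than 100 to keep the example readable.

```python
def repetition_encode(bits, times=3):
    """Repeat every bit `times` times (the slide suggests 100; 3 keeps the demo short)."""
    return [b for b in bits for _ in range(times)]


def repetition_decode(received, times=3):
    """Majority vote inside each block of `times` copies."""
    blocks = [received[i:i + times] for i in range(0, len(received), times)]
    return [1 if 2 * sum(block) > times else 0 for block in blocks]


def parity_encode(bits):
    """Append one bit so that the total number of ones is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True when the word has an even number of ones."""
    return sum(word) % 2 == 0


sent = parity_encode([1, 1, 1, 0, 0])   # the slide's word 1 1 1 0 0 1
corrupted = [1, 0, 0, 0, 0, 1]          # second and third bits flipped
undetected = parity_ok(corrupted)       # the parity check still passes
```

The repetition code survives one flipped bit per block but triples the length; the parity code adds a single bit and cannot even detect the two flips above.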

Two main challenges in coding theory

Problem with parity example
  Messages mapped to codewords which do not differ in many places
  Need to pick a lot of codewords that differ a lot from each other
Efficient decoding
  Naive algorithm: check received word against all codewords
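The naive algorithm can be written down directly. A minimal sketch, assuming binary messages and a hypothetical `encode` callback; it tries all 2^k messages, which is exactly why it is not efficient.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_decode(received, encode, k):
    """Try every k-bit message and return one whose codeword is closest to `received`."""
    best = min(
        product([0, 1], repeat=k),
        key=lambda msg: hamming_distance(encode(list(msg)), received),
    )
    return list(best)


def repeat3(bits):
    """Illustrative code: repeat each bit three times."""
    return [b for b in bits for _ in range(3)]


decoded = naive_decode([1, 1, 0, 0, 0, 0], repeat3, k=2)
```

The running time grows as 2^k, so this only works for very short messages.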

The fundamental tradeoff

Correct as many errors as possible with as little redundancy as possible
Can one achieve the “optimal” tradeoff with efficient encoding and decoding?
This talk: Answer is yes

Overview of the talk

Specify the setup
  The model
  What is the optimal tradeoff?
Previous work
Construction of a “good” code
High level idea of why it works
Future Directions
  Some recent progress

Error-correcting codes

Diagram: x is encoded as C(x); y is received; the decoder outputs x or gives up

Mapping $C : \Sigma^k \to \Sigma^n$
  Message length k, code length n
  $n \ge k$
Rate $R = k/n \le 1$
Decoding complexity
  Efficient means polynomial in n

Shannon’s world (Claude E. Shannon)

Noise is probabilistic
Binary Symmetric Channel
  Every bit is flipped with probability p
Benign noise model
  For example, does not capture bursty errors

Hamming’s world (Richard W. Hamming)

Errors are worst case
  Error locations
  Arbitrary symbol changes
Limit on total number of errors
Much more powerful than Shannon
  Captures bursty errors
We will consider this channel model

A “low level” view

Think of each symbol in $\Sigma$ as being a packet
The setup
  Sender wants to send k packets
  After encoding, sends n packets
  Some packets get corrupted
  Receiver needs to recover the original k packets
Packet size
  Ideally constant, but can grow with n

Decoding

C(x) sent, y received
  $x \in \Sigma^k$, $y \in \Sigma^n$
How much of y must be correct to recover x?
  At least k packets must be correct
  At most $(n-k)/n = 1-R$ fraction of errors
  $1-R$ is the information-theoretic limit
$\rho$: the fraction of errors the decoder can handle
  Information-theoretic limit implies $\rho \le 1-R$

Diagram: x is encoded as C(x) and y is received; R = k/n

Can we get to the limit $\rho = 1-R$?

Not if we always want to uniquely recover the original message
Limit for unique decoding: $\rho \le (1-R)/2$

Figure: codewords c1 and c2 at distance $1-R$, with received word y at distance $(1-R)/2$ from each
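The picture can be turned into a short calculation. This is a sketch under two assumptions not stated on the slide: the code has minimum distance $d$ meeting the Singleton bound, and $d$ is even. $\Delta$ denotes Hamming distance.

```latex
\begin{align*}
d &\le n - k + 1
  && \text{(Singleton bound)} \\
\Delta(y, c_1) = \Delta(y, c_2) &= \frac{d}{2}
  && \text{($y$ copies $c_1$ on half of the $d$ positions where $c_1 \ne c_2$, and $c_2$ on the rest)} \\
\rho_{\text{unique}} &< \frac{d}{2n} \le \frac{n-k+1}{2n} \approx \frac{1-R}{2}
  && \text{(no decoder can tell $c_1$ from $c_2$ given $y$)}
\end{align*}
```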

List decoding [Elias57, Wozencraft58]

Always insisting on a unique codeword is restrictive
  The “pathological” cases are rare
  “Typical” received word can be decoded beyond $(1-R)/2$
Better error-recovery model
  Output a list of answers
  List decoding
  Example: spell checker

Figure: balls of radius $(1-R)/2$ around codewords; annotation: almost all the space in higher dimension, all but an exponential (in n) fraction

Advantages of list decoding

Typical received words have a unique closest codeword
  List decoding will return a list of size one for such received words
Still deal with worst-case errors
How to deal with list size greater than one?
  Declare an error; or
  Use some side information
    Spell checker

The list decoding problem

Given a code and an error parameter $\rho$
For any received word y
  Output all codewords c such that c and y disagree in at most a $\rho$ fraction of places
Fundamental question
  The best possible tradeoff between R and $\rho$?
    With “small” lists
  Can it approach the information-theoretic limit $1-R$?
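The problem statement translates into a brute-force procedure. A minimal sketch, assuming the code is given as an explicit list of codewords and that `rho` is passed as a `Fraction` to avoid rounding surprises; the four-word code is made up for illustration.

```python
from fractions import Fraction


def list_decode(received, codewords, rho):
    """Return every codeword that disagrees with `received` in at most a rho fraction of positions."""
    n = len(received)
    return [
        c for c in codewords
        if Fraction(sum(a != b for a, b in zip(c, received)), n) <= rho
    ]


# Illustrative code: four codewords of length 6, minimum distance 3.
codewords = [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 1, 1, 1, 1),
]
received = (1, 1, 0, 0, 0, 0)

within_unique_radius = list_decode(received, codewords, Fraction(1, 6))
beyond_unique_radius = list_decode(received, codewords, Fraction(1, 3))
```

With the smaller radius the list has one entry; with the larger radius it has two, which is the situation list decoding is designed to tolerate.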

May 25, 2007, Ph.D. Final Exam

Other applications of list decoding

Cryptography
  Cryptanalysis of certain block-ciphers [Jakobsen 98]
  Efficient traitor tracing scheme [Silverberg, Staddon, Walker 03]
Complexity Theory
  Hardcore predicates from one-way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]
  Worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06]
Other algorithmic applications
  IP Traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]
  Guessing Secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01]

Overview of the talk

The model
The optimal tradeoff between rate and fraction of errors
Previous work
Construction of a “good” code
High level idea of why it works
Future Directions
  Some recent progress

Information theoretic limit

$\rho < 1 - R$: information-theoretic limit
Can handle twice as many errors as unique decoding

Plot: Frac. of Errors ($\rho$) vs. Rate (R); curves: unique decoding, inf. theoretic limit

Achieving information theoretic limit

There exist codes that achieve the information theoretic limit
  $\rho \ge 1 - R - o(1)$
  Random coding argument
Not a useful result
  Codes are not explicit
  No efficient list decoding algorithms
Need explicit construction of such codes
We also need poly time (list) decodability
  Requires list size to be polynomial

The challenge

Explicit construction of code(s)
Efficient list decoding algorithms up to the information theoretic limit
  For rate R, correct $1-R$ fraction of errors
Shannon’s work raised similar challenge
  Explicit codes achieving the information theoretic limit for stochastic models
  The challenge has been met [Forney 66, Luby-Mitzenmacher-Shokrollahi-Spielman 01, Richardson-Urbanke 01]
Now for stronger adversarial model

The best until 1998

$\rho = 1 - R^{1/2}$
  Reed-Solomon codes
  Sudan 95, Guruswami-Sudan 98
Better than unique decoding
  At R = 0.8: Unique: 10%, Inf. Th. limit: 20%, GS: 10.56%
Motivating question: close the gap between blue and green line with explicit efficient codes

Plot: Frac. of Errors ($\rho$) vs. Rate (R); curves: unique decoding, Guruswami-Sudan
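The three percentages quoted at R = 0.8 follow from the formulas on the slides. A short check, with function names chosen here only for illustration:

```python
import math


def unique_radius(rate):
    """Unique decoding limit (1 - R) / 2."""
    return (1 - rate) / 2


def guruswami_sudan_radius(rate):
    """Guruswami-Sudan list decoding radius 1 - sqrt(R)."""
    return 1 - math.sqrt(rate)


def information_theoretic_limit(rate):
    """Information-theoretic limit 1 - R."""
    return 1 - rate


rate = 0.8
for name, radius in [
    ("Unique", unique_radius),
    ("GS", guruswami_sudan_radius),
    ("Inf. Th. limit", information_theoretic_limit),
]:
    print(f"{name}: {100 * radius(rate):.2f}%")
```

At this rate Guruswami-Sudan gains only about half a percentage point over unique decoding, which is the gap the motivating question refers to.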

The best until 2005

$\rho = 1 - (sR)^{s/(s+1)}$
  $s \ge 1$ [Parvaresh, Vardy]
  s = 2 in the plot
Based on Reed-Solomon codes
Improves GS for R < 1/16

Plot: Frac. of Errors ($\rho$) vs. Rate (R); curves: unique decoding, Guruswami-Sudan, Parvaresh-Vardy
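The threshold R < 1/16 for s = 2 can be checked numerically from the two formulas. This only evaluates the stated bounds; it says nothing about the decoding algorithms themselves.

```python
def gs_radius(rate):
    """Guruswami-Sudan radius 1 - R^(1/2)."""
    return 1 - rate ** 0.5


def pv_radius(rate, s=2):
    """Parvaresh-Vardy radius 1 - (sR)^(s/(s+1))."""
    return 1 - (s * rate) ** (s / (s + 1))


below = pv_radius(0.05) - gs_radius(0.05)        # positive: PV is better below 1/16
above = pv_radius(0.07) - gs_radius(0.07)        # negative: GS is better above 1/16
at_threshold = pv_radius(1 / 16) - gs_radius(1 / 16)
```

Setting $(2R)^{2/3} = R^{1/2}$ gives $R^{1/6} = 2^{-2/3}$, that is $R = 1/16$, which matches the sign change above.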

Our Result

$\rho = 1 - R - \varepsilon$, for any $\varepsilon > 0$
Folded RS codes [Guruswami, R. 06]

Plot: Frac. of Errors ($\rho$) vs. Rate (R); curves: unique decoding, Guruswami-Sudan, Parvaresh-Vardy, our work

Overview of the talk

The model
The optimal tradeoff between rate and fraction of errors
Previous work
Our construction
High level idea of why it works
Future Directions
  Recent progress

The main result

Construction of an algebraic family of codes
For every rate $R > 0$ and $\varepsilon > 0$
  List decoding algorithm that can correct a $1 - R - \varepsilon$ fraction of errors
Based on Reed-Solomon codes

Algebra terminology

F will denote a finite field
  Think of it as integers mod some prime
Polynomials
  Coefficients come from F
  Poly of degree 3 over $Z_7$: $f(X) = X^3 + 4X + 5$
  Evaluate polynomials at points in F: $f(2) = (8 + 8 + 5) \bmod 7 = 21 \bmod 7 = 0$
Irreducible polynomials
  No non-trivial polynomial factors
  $X^2 + 1$ is irreducible over $Z_7$, while $X^2 - 1$ is not
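The two examples on this slide can be checked mechanically. A minimal sketch: polynomials are stored as coefficient lists (lowest degree first, a convention chosen here), and irreducibility is tested by root search, which is valid only for degree 2 or 3.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial over Z_p with Horner's rule; coeffs[i] multiplies X^i."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def has_root(coeffs, p):
    """True if the polynomial vanishes at some point of Z_p."""
    return any(poly_eval(coeffs, a, p) == 0 for a in range(p))


p = 7
f = [5, 4, 0, 1]              # X^3 + 4X + 5
x2_plus_1 = [1, 0, 1]         # X^2 + 1
x2_minus_1 = [-1 % p, 0, 1]   # X^2 - 1, with -1 written as 6 mod 7

value_at_2 = poly_eval(f, 2, p)
irreducible_plus = not has_root(x2_plus_1, p)    # a quadratic without roots is irreducible
irreducible_minus = not has_root(x2_minus_1, p)  # X^2 - 1 = (X - 1)(X + 1) has roots
```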

Reed-Solomon codes

Message: $(m_0, m_1, \ldots, m_{k-1}) \in F^k$
View as poly. $f(X) = m_0 + m_1 X + \cdots + m_{k-1} X^{k-1}$
Encoding: $RS(f) = (f(1), f(2), \ldots, f(n))$, $F = \{1, 2, \ldots, n\}$
[Guruswami-Sudan] Can correct up to a $1 - (k/n)^{1/2}$ fraction of errors in polynomial time

Codeword: f(1) f(2) f(3) f(4) ... f(n)
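Encoding is just polynomial evaluation. A small sketch over the prime field $Z_7$ (an assumption made here for concreteness; the slides only require a finite field), reusing the polynomial from the algebra slide as the message:

```python
def rs_encode(message, n, p):
    """Reed-Solomon encoding over Z_p: evaluate f(X) = sum of m_i X^i at the points 1, 2, ..., n."""
    if not len(message) <= n < p:
        raise ValueError("need k <= n < p so that the evaluation points are distinct")
    codeword = []
    for x in range(1, n + 1):
        value = 0
        for c in reversed(message):   # Horner's rule
            value = (value * x + c) % p
        codeword.append(value)
    return codeword


# Message (5, 4, 0, 1) is the polynomial X^3 + 4X + 5, so k = 4, n = 6, rate 2/3.
codeword = rs_encode([5, 4, 0, 1], n=6, p=7)
```

Two distinct polynomials of degree below k agree on at most k - 1 points, so distinct codewords differ in at least n - k + 1 positions.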

Parvaresh Vardy codes (of order 2)

Codeword: f(1) f(2) f(3) f(4) ... f(n) together with g(1) g(2) g(3) g(4) ... g(n)

f(X) and g(X), where $g(X) = f(X)^q \bmod E(X)$
Extra information from g(X) helps in decoding
Rate: $R_{PV} = k/2n$
[PV05] PV codes can correct a $1 - (k/n)^{2/3}$ fraction of errors in polynomial time
  That is, $1 - (2R_{PV})^{2/3}$

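Towards our solution

Suppose $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$
Let us look again at the PV codeword

Diagram: the row $f(1), f(\gamma), f(\gamma^2), \ldots$ above the row $g(1), g(\gamma), \ldots$; since $g(X) = f(\gamma X)$, the second row repeats the first shifted by one position: $g(1) = f(\gamma)$, $g(\gamma) = f(\gamma^2)$, ...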

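Folded Reed Solomon Codes

Suppose $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$
Don’t send the redundant symbols
  Reduces the length to n/2
  $R = (k/2)/(n/2) = k/n$
Using PV result, fraction of errors
  $1 - (k/n)^{2/3} = 1 - R^{2/3}$

Diagram: consecutive evaluations paired into single symbols, $(f(1), f(\gamma)), (f(\gamma^2), f(\gamma^3)), \ldots$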

Getting to $1 - R - \varepsilon$

Started with PV code with s = 2 to get $1 - R^{2/3}$
Start with PV code with general s
  $1 - R^{s/(s+1)}$
Pick s to be “large” enough to approach $1 - R - \varepsilon$
Decoding complexity increases from that of Parvaresh-Vardy but still polynomial
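How large s must be can be read off the simplified bound $1 - R^{s/(s+1)}$. A sketch that only searches this formula (the helper name and the search cap are illustrative; the actual construction has further parameters not shown on the slide):

```python
def smallest_s(rate, eps, s_max=10_000):
    """Smallest s with 1 - rate^(s/(s+1)) >= 1 - rate - eps, by linear search."""
    for s in range(1, s_max + 1):
        if rate ** (s / (s + 1)) <= rate + eps:
            return s
    raise ValueError("no s found below s_max")


s_needed = smallest_s(0.5, 0.05)
```

For R = 1/2 and $\varepsilon = 0.05$ the condition reads $2^{1/(s+1)} \le 1.1$, which first holds at s = 7.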

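What we actually do

We show that for any generator $\gamma$ of $F \setminus \{0\}$
  $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$
Can achieve similar compression by grouping elements in orbits of $\gamma$
  $m' \sim n/m$, $R \sim (k/m)/(n/m) = k/n$

Folded codeword, one column per symbol:

$$
\begin{array}{cccc}
f(1) & f(\gamma^{m}) & \cdots & f(\gamma^{(m'-1)m}) \\
f(\gamma) & f(\gamma^{m+1}) & \cdots & f(\gamma^{(m'-1)m+1}) \\
\vdots & \vdots & & \vdots \\
f(\gamma^{m-1}) & f(\gamma^{2m-1}) & \cdots & f(\gamma^{mm'-1})
\end{array}
$$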
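The folding step can be sketched as follows. Assumptions made for this illustration: the field is $Z_7$, the generator is $\gamma = 3$, and the folding parameter is m = 2; the function names are hypothetical.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial over Z_p; coeffs[i] multiplies X^i."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def folded_rs_encode(message, gamma, m, p):
    """Evaluate f at 1, gamma, ..., gamma^(p-2) and bundle m consecutive values into one symbol."""
    n = p - 1
    if n % m != 0:
        raise ValueError("m must divide the number of evaluation points")
    points = [pow(gamma, i, p) for i in range(n)]
    values = [poly_eval(message, x, p) for x in points]
    return [tuple(values[j:j + m]) for j in range(0, n, m)]


# f(X) = X^3 + 4X + 5 over Z_7; 3 generates the nonzero elements (1, 3, 2, 6, 4, 5).
folded = folded_rs_encode([5, 4, 0, 1], gamma=3, m=2, p=7)
```

The message and the codeword both shrink by the factor m when counted in symbols, so the rate stays k/n while each symbol now carries m field elements.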

Proving $f(X)^q \bmod E(X) = f(\gamma X)$

First use the fact $f(X)^q = f(X^q)$ over F
Need to show $f(X^q) \bmod E(X) = f(\gamma X)$
  Proving $X^q \bmod E(X) = \gamma X$ suffices
  Or, E(X) divides $X^{q-1} - \gamma$
$E(X) = X^{q-1} - \gamma$ is irreducible
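The identity can be verified numerically in a small case. This sketch assumes the prime field $Z_7$ (so q = 7), the generator $\gamma = 3$, and $E(X) = X^6 - 3$; it checks the identity for one polynomial and does not test irreducibility of E(X).

```python
def mul_mod_e(a, b, p, gamma):
    """Multiply two polynomials over Z_p modulo E(X) = X^(p-1) - gamma.

    Polynomials are coefficient lists of length p - 1, lowest degree first.
    """
    d = p - 1
    prod = [0] * (2 * d - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % p
    # Reduce using X^(p-1) = gamma, from the highest degree down.
    for e in range(2 * d - 2, d - 1, -1):
        prod[e - d] = (prod[e - d] + gamma * prod[e]) % p
    return prod[:d]


def pow_mod_e(f, exponent, p, gamma):
    """Compute f(X)^exponent mod E(X) by repeated multiplication."""
    d = p - 1
    base = (list(f) + [0] * d)[:d]
    result = [1] + [0] * (d - 1)
    for _ in range(exponent):
        result = mul_mod_e(result, base, p, gamma)
    return result


def substitute_gamma_x(f, p, gamma):
    """Coefficients of f(gamma * X): the i-th coefficient is multiplied by gamma^i."""
    d = p - 1
    scaled = [(c * pow(gamma, i, p)) % p for i, c in enumerate(f)]
    return (scaled + [0] * d)[:d]


p, gamma = 7, 3
f = [5, 4, 0, 1]                      # X^3 + 4X + 5
lhs = pow_mod_e(f, p, p, gamma)       # f(X)^7 mod (X^6 - 3)
rhs = substitute_gamma_x(f, p, gamma) # f(3X)
```

Both sides come out as $6X^3 + 5X + 5$, matching the hand calculation $X^{21} + 4X^7 + 5 \equiv 27X^3 + 12X + 5 \pmod 7$.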

Our Result

$\rho = 1 - R - \varepsilon$, for any $\varepsilon > 0$
Folded RS codes [Guruswami, R. 06]

Plot: Frac. of Errors ($\rho$) vs. Rate (R); curves: unique decoding, Guruswami-Sudan, Parvaresh-Vardy, our work


“Welcome” to the dark side…

Limitations of our work

To get to $1 - R - \varepsilon$, need $s > 1/\varepsilon$
Alphabet size $= n^s > n^{1/\varepsilon}$
  Fortunately can be reduced to $2^{\mathrm{poly}(1/\varepsilon)}$
    Concatenation + Expanders [Guruswami-Indyk’02]
  Lower bound is $2^{1/\varepsilon}$
List size (running time) $> n^{1/\varepsilon}$
  Open question to bring this down


Time to wake up

Overview of the talk

List Decoding primer
Previous work on list decoding
Codes over large alphabets
  Construction of a “good” code
  High level idea of why it works
Codes over small alphabets
  The current best codes
Future Directions
  Some (very) modest recent progress

Optimal Tradeoff for List Decoding

Best possible is $H_q^{-1}(1-R)$
  $H(\rho) = -\rho \log \rho - (1-\rho)\log(1-\rho)$ (binary entropy, the q = 2 case)
There exists an $(H_q^{-1}(1-R-\varepsilon), O(1/\varepsilon))$-list decodable code
  A random code of rate R has the property whp
$\rho > H_q^{-1}(1-R+\varepsilon)$ implies super-polynomial list size
  For any code
For large q, $H_q^{-1}(1-R) \to 1-R$
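For the binary case the optimal tradeoff can be evaluated numerically. A minimal sketch: the inverse of the binary entropy function on [0, 1/2] is computed by bisection, a choice made here for simplicity.

```python
import math


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def binary_entropy_inverse(y, tol=1e-12):
    """Return rho in [0, 1/2] with H(rho) = y; H is increasing on that interval."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Optimal fraction of errors for binary codes of rate 1/2: H^{-1}(1 - 1/2).
rho_at_half = binary_entropy_inverse(0.5)
```

At rate 1/2 this gives roughly 0.11, far below the large-alphabet value 1 - R = 0.5, which is why binary codes are treated separately in the rest of the talk.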

Our Results (q=2)

Optimal tradeoff: $\rho = H^{-1}(1-R)$
[Guruswami, R. 06]: “Zyablov” bound
[Guruswami, R. 07]: Blokh-Zyablov bound

Plot: # Errors vs. Rate; curves: Zyablov bound, Blokh-Zyablov bound, previous best, optimal tradeoff

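How do we get binary codes?

Concatenation of codes [Forney 66]
  $C_1 : (GF(2^k))^K \to (GF(2^k))^N$ (“Outer” code)
  $C_2 : GF(2)^k \to GF(2)^n$ (“Inner” code)
  $C_1 \circ C_2 : GF(2)^{kK} \to GF(2)^{nN}$
Typically $k = O(\log N)$
  Brute force decoding for inner code

Diagram: $m = (m_1, m_2, \ldots, m_K)$ is encoded as $C_1(m) = (w_1, w_2, \ldots, w_N)$; each $w_i$ is then encoded by $C_2$, giving $C_1 \circ C_2(m) = (C_2(w_1), C_2(w_2), \ldots, C_2(w_N))$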
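The mechanics of concatenation can be shown with two deliberately weak stand-in codes. Both are assumptions made for illustration, not the codes used in the talk: the outer code appends the XOR of all symbols (addition in $GF(2^k)$ is bitwise XOR), and the inner code appends a parity bit.

```python
from functools import reduce


def outer_encode(symbols):
    """Toy outer code over GF(2^k): K symbols -> K + 1 symbols by appending their XOR."""
    return symbols + [reduce(lambda a, b: a ^ b, symbols, 0)]


def inner_encode(symbol, k):
    """Toy inner code GF(2)^k -> GF(2)^(k+1): the k bits of the symbol plus a parity bit."""
    bits = [(symbol >> i) & 1 for i in range(k)]
    return bits + [sum(bits) % 2]


def concatenated_encode(symbols, k):
    """Encode with the outer code, then encode every outer symbol with the inner code."""
    out = []
    for w in outer_encode(symbols):
        out.extend(inner_encode(w, k))
    return out


# k = 2, K = 2: 4 message bits become N * n = 3 * 3 = 9 codeword bits.
bits = concatenated_encode([1, 2], k=2)
```

The rate of the result is the product of the two rates, here (2/3)(2/3) = 4/9.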

List Decoding concatenated code

C1 = folded RS code
C2 = “suitably chosen” binary code
Natural decoding algorithm
  Divide up the received word into blocks of length n
  Find closest C2 codeword for each block
  Run list decoding algorithm for C1
Loses information!

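List Decoding C2

Received blocks $y_1, y_2, \ldots, y_N$, each $y_i \in GF(2)^n$
List decoding each block gives lists $S_1, S_2, \ldots, S_N$ of elements of $GF(2)^k$
How do we “list decode” from lists?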

The list recovery problem

Given a code and an error parameter $\rho$
For any set of lists $S_1, \ldots, S_N$ such that $|S_i| \le s$ for every i
  Output all codewords c such that $c_i \in S_i$ for at least a $1-\rho$ fraction of i’s
List decoding is the special case with s = 1
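A brute-force version makes the definition concrete and shows list decoding as the s = 1 case. The code is given as an explicit list of codewords and `rho` as a `Fraction`; both are choices made for this sketch, and the four-word code is invented for illustration.

```python
from fractions import Fraction


def list_recover(lists, codewords, rho):
    """Return codewords c with c[i] in lists[i] for at least a (1 - rho) fraction of positions."""
    n = len(lists)
    return [
        c for c in codewords
        if Fraction(sum(ci in s for ci, s in zip(c, lists)), n) >= 1 - rho
    ]


def list_decode(received, codewords, rho):
    """List decoding is list recovery where every list holds a single symbol."""
    return list_recover([{y} for y in received], codewords, rho)


codewords = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (0, 1, 2)]
lists = [{0, 1}, {0, 1}, {0, 2}]

no_errors = list_recover(lists, codewords, Fraction(0))
one_third = list_recover(lists, codewords, Fraction(1, 3))
decoded = list_decode((0, 1, 1), codewords, Fraction(1, 3))
```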

List Decoding $C_1 \circ C_2$

Received blocks $y_1, y_2, \ldots, y_N$
  List decode C2 on each block
Lists $S_1, S_2, \ldots, S_N$
  List recovering algorithm for C1

Putting it together [Guruswami, R. 06]

If C1 can be list recovered from a $\rho_1$ fraction of errors and C2 can be list decoded from a $\rho_2$ fraction of errors, then $C_1 \circ C_2$ can be list decoded from a $\rho_1 \rho_2$ fraction of errors
Folded RS of rate R is list recoverable from a $1-R$ fraction of errors
There exist inner codes of rate r list decodable from an $H^{-1}(1-r)$ fraction of errors
  Can find one by “exhaustive” search
$C_1 \circ C_2$ is list decodable from a $(1-R)H^{-1}(1-r)$ fraction of errors
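A worked instance of the product bound, with the illustrative choice $R = r = 1/2$ (not taken from the slides) and the numerical values $H^{-1}(1/2) \approx 0.110$ and $H^{-1}(3/4) \approx 0.21$:

```latex
\begin{align*}
\text{rate of } C_1 \circ C_2 &= rR = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4} \\
(1-R)\,H^{-1}(1-r) &= \tfrac{1}{2} \cdot H^{-1}\!\left(\tfrac{1}{2}\right) \approx 0.5 \times 0.110 = 0.055 \\
\text{optimal tradeoff at rate } \tfrac{1}{4}: \quad H^{-1}\!\left(1 - \tfrac{1}{4}\right) &\approx 0.21
\end{align*}
```

For this particular split of the rate, the concatenated bound is well short of the optimal tradeoff; optimizing over r and R for a fixed product narrows the gap but does not close it.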

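Multilevel Concatenated Codes

$C_1 : (GF(2^k))^K \to (GF(2^k))^N$ (“Outer” code 1)
$C_2 : (GF(2^k))^L \to (GF(2^k))^N$ (“Outer” code 2)
$C_{in} : GF(2)^{2k} \to GF(2)^n$ (“Inner” code)
C1 and C2 are FRS

Diagram: $m = (m_1, m_2, \ldots, m_K)$ is encoded as $C_1(m) = (v_1, v_2, \ldots, v_N)$ and $M = (M_1, M_2, \ldots, M_L)$ as $C_2(M) = (w_1, w_2, \ldots, w_N)$; the final codeword is $(C_{in}(v_1, w_1), C_{in}(v_2, w_2), \ldots, C_{in}(v_N, w_N))$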

Advantage over rate rR Concatenated Codes

C1, C2, Cin have rates $R_1$, $R_2$ and r
  Final rate $r(R_1 + R_2)/2$, choose $R_1 < R$
Step 1: Just recover m
  List decode Cin up to $H^{-1}(1-r)$ errors
  List recover C1 up to $1 - R_1$ errors
Can handle $(1-R_1)H^{-1}(1-r) > (1-R)H^{-1}(1-r)$ errors

Advantage over Concatenated Codes

Step 2: Just recover M, given m
  Subcode of Cin of rate r/2 acts on M
  List decode subcode up to $H^{-1}(1-r/2)$ errors
  List recover C2 up to $1 - R_2$ errors
Can handle $(1-R_2)H^{-1}(1-r/2)$ errors


Wrapping it up

Total errors that can be handled: $\min\{(1-R_1)H^{-1}(1-r),\ (1-R_2)H^{-1}(1-r/2)\}$
Better than $(1-R)H^{-1}(1-r)$
  $(R_1 + R_2)/2 = R$ (recall that $R_1 < R$)
  $H^{-1}(1-r/2) > H^{-1}(1-r)$, so choose $R_2$ a bit larger than R
Optimize over choices of r, $R_1$ and $R_2$
Need nested list decodability of inner code
Blokh-Zyablov follows from multiple outer codes
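The comparison can be checked numerically for one choice of parameters. The values R = r = 0.5, R1 = 0.45 and R2 = 0.55 are hypothetical, picked only so that (R1 + R2)/2 = R; the sketch evaluates the two formulas and takes the required nested list decodability of the inner code for granted.

```python
import math


def binary_entropy(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def binary_entropy_inverse(y, tol=1e-12):
    """Inverse of the binary entropy on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def single_level_radius(R, r):
    """(1 - R) * H^{-1}(1 - r), the bound for ordinary concatenation."""
    return (1 - R) * binary_entropy_inverse(1 - r)


def two_level_radius(R1, R2, r):
    """min{(1 - R1) H^{-1}(1 - r), (1 - R2) H^{-1}(1 - r/2)} from the slide."""
    step1 = (1 - R1) * binary_entropy_inverse(1 - r)
    step2 = (1 - R2) * binary_entropy_inverse(1 - r / 2)
    return min(step1, step2)


single = single_level_radius(0.5, 0.5)
multi = two_level_radius(0.45, 0.55, 0.5)
```

With these numbers the single-level bound is about 0.055 and the two-level bound about 0.0605, at the same overall rate.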

Our Results (q=2)

Optimal tradeoff: $\rho = H^{-1}(1-R)$
[Guruswami, R. 06]: “Zyablov” bound
[Guruswami, R. 07]: Blokh-Zyablov bound

Plot: # Errors vs. Rate; curves: Zyablov bound, Blokh-Zyablov bound, previous best, optimal tradeoff

How far can concatenated codes go?

Outer code: folded RS
Random and independent inner codes
  Different inner codes for each outer symbol
Can get to the information theoretic limit $\rho = H^{-1}(1-R)$ [Guruswami, R. 08]

To summarize

List decoding: a central coding theory notion
  Permits decoding up to the optimal fraction of adversarial errors
  Bridges adversarial and probabilistic approaches to information theory
    Shannon’s information theoretic limit: $p = H^{-1}(1-R)$
    List decoding information theoretic limit: $\rho = H^{-1}(1-R)$
Efficient list decoding possible for algebraic codes

Our Contributions

Folded RS codes are explicit codes that achieve the information theoretic limit for list decoding
Better list decoding for binary codes
Concatenated codes can get us to list decoding capacity

Open Questions

Reduce decoding complexity of our algorithm
List decoding for binary codes
  Explicitly achieve error bound $\rho = H^{-1}(1-R)$
  Erasures: decode when $\rho = 1-R$
Non-algebraic codes? Graph based codes?
Other applications of these new codes
  Extractors [Guruswami, Umans, Vadhan 07]
  Approximating NP-witnesses [Guruswami, R. 08]


Thank You

Questions ?
