4. information redundancy reliable system design 2010 by: amir m. rahmani

4. Information Redundancy

Reliable System Design 2010by: Amir M. Rahmani

matlab1.ir

Information Redundancy

Code: representing information• - Morse code

Code word: collection of symbols or digit , use to representing information according to the rules of a given code

Binary code: a code in which code words contain only symbols that are either 0 or 1.

Error detection, Error correction Coding often applied to

• - Information transfer: often serial communication through a channel

• - Information storage

matlab1.ir

Start with k-bit data word

Add r code bits to k-bit data Total = n-bit code word (n=k+r) Not all 2n combinations are valid code words For certain encoding schemes - some types of

errors can also be corrected To extract original data - n bits must be decoded Overhead = r/n

• – e.g., for (single-bit) parity, the overhead is 1/n• – additional bits required• – time to encode and decode

matlab1.ir

Hamming distance (d) Number of bits in which two words differ from

each other; d (x,y)=Σ(xk XOR yk)• E.g., 0010 and 1110 have a Hamming distance of 2

Rules:• Iff d (x,y)= 0 then x=y• d (x,y)= d (y,x)• d (x,y)= d (y,z)>= d (x,z)

For a group of code words, d is the minimum of all hamming distance between all possible pairs of code words.

• E.g., {000, 011, 101, 110} have a Hamming distance of 2 d determines the code’s ability to detect and/or

correct errors• – d-1 bit for error detection• – [(d-1)/2] bit for error correction

matlab1.ir

Hamming distance (d)

Two words in this figure are connected by an edge if their d is 1d=2 Can detect single bit errors

matlab1.ir

Hamming distance (d)

The code {000,111} can be used to encode a single data bit. 0 can be encoded as 000 and 1 as 111. This code is identical to TMRd=3 Can detect single & double bit errors, can correct single bit errors

matlab1.ir

Separability of a Code

A code is separable if it has separate fields for the data and the code bits.

Decoding consists of disregarding the code bits The code bits can be processed separately to

verify the correctness of the data

A non-separable code has the data and code bits integrated together - extracting the data from the encoded word requires some processing

matlab1.ir

Single-bit Parity

Simplest separable error detection code• – Adds one bit of redundancy to each data word

Encoding and decoding cost is low Even (odd) parity: add bit such that total number of

ones in code word is even (odd)• – E.g., 001010 gets a parity bit of 0 for even parity (1 for

odd) Can detect all single-bit errors (All odd-bit errors)

• – Hamming distance >= 2• – Could be greater than 2 if data words don’t use all bit

combinations Drawbacks:

• – Unable to detect common even errors

matlab1.ir

Single-bit Even Parity

matlab1.ir

Even or Odd Parity?

The decision depends on which type of all-bits error is more probable

For even parity - the parity bit for the all zeroes data word will be 0 and an all-0’s failure will go undetected - it is a valid code word

Selecting the odd parity code will allow the detection of the all-0's failure

If all-1's failure is more likely - the odd parity code must be selected if the total number of bits (n+1) is even, and the even parity if n+1 is odd

matlab1.ir

Byte-Interlaced Parity Code

Example: n=64, data bits - a63,a62,…,a0 Eight parity bits: First - parity bit of a63,a55,a47,a39,a31,a23,a15,a7 -

the most significant bits in the eight bytes Remaining seven parity bits - assigned so that

the corresponding groups of bits are interlaced

Scheme is beneficial when shorting of adjacent bits is a common failure mode (example - a bus)

If parity type (odd or even) is alternated between groups - unidirectional errors (all-0's or all-1's) will also be detected

matlab1.ir

Overlapping Parity Code

Simplest scheme; data is organized in a 2-dimensional array

Bits at the end of row - parity over that row Bits at the bottom of column - parity over column Error correcting code?

• - A single-bit error anywhere will cause a row and a column to be erroneous

This identifies a unique erroneous bit This is an example of overlapping parity - each bit is

covered by more than one parity bit

matlab1.ir

Checksum Separable code Checksum is the sum of the original data All checksum schemes allow error detection but not error

location - entire block of data must be retransmitted if an error is detected

a) Single-precision checksum• – overflow problem, i.e. adding n bits modulo 2n

b) Double-precision checksum• – uses double precision, i.e. compute 2n-bit checksum from n-bit

words using modulo 22n arithmetic. c) Residue checksum

• – like single-precision checksum, but overflow is now fed back as carry

d) Honeywell checksum• – compose word of double length by concatenating 2 consecutive

words (done modulo 22n)• – compute checksum on these double words

matlab1.ir

Comparing the Checksum Types

matlab1.ir

Comparison - Example

In Single-precision checksum - transmitted checksum differs from computed checksum

In Honeywell checksum computed checksum differs from received checksum and error is detected

matlab1.ir

Cyclic Codes

Cyclic codes are often non-separable although separable cyclic codes exist

Encoding consists of dividing the data word by a constant number

The coded word is the product Decoding is dividing by the same constant - if

the remainder is non-zero, an error has occurred Cyclic codes are widely used in data storage and

communication

matlab1.ir

Cyclic Redundancy Checks (CRC)

CRC is based on a mathematical calculation performed on message.

We will use the following terms: M - Message to be sent (k bits) F - Frame Check Sequence (FCS) or CRC to be

appended to message (n bits) T - Transmitted message includes both M and F

=> (k+n bits) G - n+1 bit pattern (called polynomial generator)

used to calculate F and check T

matlab1.ir

Cyclic Redundancy Check (CRC) Key idea

• – given a k-bit frame (message)• – transmitter generates a n-bit sequence called frame check

sequence (FCS)• – so that resulting frame of size k+n is exactly divisible by

some predetermined number Multiply M by 2n to shift, and add F to padded 0s

• T = 2 n M + F Dividing 2nM by G gives quotient and remainder

(remainder is 1 bit less than divisor)• 2 n M/G = Q + R/G

then using R as our FCS we get• T = 2 n M + F

on the receiving end, division by G leads toT/G = (2 n M +R)/G = Q + R/G +R/G =Q

If remainder is non-zero, it’s an error

matlab1.ir

Cyclic Redundancy Check (CRC)

Example, assume G(X) has at least 3 terms• – G(x) has 3 1-bits• » detects all single bit errors• » detects all double bit errors• » detects odd #’s of errors if G(X) contains the

factor (X + 1)• » any burst errors < length of FCS• » most larger burst errors

matlab1.ir

Cyclic Redundancy Check (CRC)

A polynomial view:• variable X with binary coefficients, where the

coefficients correspond to the bits in the number.• M = 110011, M(X) = X5 + X4 + X + 1, and for G

= 11001 we have G(X) = X4 + X3 + 1• Math is still mod 2

• » An error E(X) is received, and undetected iff it is divisible by G(X)

matlab1.ir

CRC ExampleM = 10110100011, G = 1101 ; XOR instead of Minus

10110100011 000 | 1101

1101

1100

1101

1100

1101

1011

1101

1100

1101

100=> CRC = 100

matlab1.ir

Cyclic Redundancy Check (CRC) Pre-defined polynomial examples:

• • CRC-12: X12+X11+X3+X2+X+1• • CRC-16: X16+X15+X2+1• • CRC-CCITT = X16 + X12 + X5 + 1

Why is CRC popular?• • Easy to implement! Just need shifters and XORs

Hardware Implementation:• G(X) = 1 + a1X +a2 X + …+ an-1 Xn-1 + an X n

matlab1.ir

Hamming Code (7,4)

Class of (n,k) Hamming codes, e.g., (7,4) [r= n-k =3] Let i1, i2, i3, i4 be the information bits Let p1, p2, p4 be the check bits p1 = i1 XOR i2 XOR i4

p2 = i1 XOR i3 XOR i4

p4 = i2 XOR i3 XOR i4

Unordered code

To detect all unidirectional errors• M-of-n code• Berger code

matlab1.ir

matlab1.ir

m-of-n codes

All code words are n bits in length and contain exactly m 1’s

Simple implementation Can detect all single errors Can detect all unidirectional multiple

errors

matlab1.ir

Berger Code Separable code

• . counts the number of 1s in the word• . expresses it in binary• . complements it• . appends this quantity to the data

Example - encoding 11101• . Four 1s• . 100 in binary• . 011 after complementing• . the encoded word 11101011

Can detect all single errors Can Detects all unidirectional bit errors - one or more 1s

turn to 0s and no 0s turn to 1s (or vice versa) Overhead = r/(k+r)

• k data bits - at most k 1s , r =[log 2(k+1)] redundant bits

matlab1.ir

Other Coding Schemes

Many Error Detecting/Correcting codes exist• – E.g., Arithmetic codes, Reed-Solomon codes, Residue

codes, Bi-Residue codes, etc.

Many of them require more mathematic than belongs in this course

Reasons for other types of codes• – Burst errors• – Byte errors• – Cost/Performance• – Multiple-bit errors• – Ease of hardware implementation

matlab1.ir

Error Recovery

Probably the most important phase of any fault-tolerance technique

Two approaches: • Forward• Backward

matlab1.ir

Forward Error Recovery Forward Error Recovery continues from an

erroneous state by making selective corrections to the system state

This includes making safe the controlled environment which may be damaged because of the failure

It is system specific and depends on accurate predictions of the location and cause of errors

Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming Codes

matlab1.ir

Backward Error Recovery (BER) If error detected, recover backwards & re-execute

• – Recover to previous state of system that we know is error-free• – Assumes that error will be gone by time of re-execution

Some terminology:• – Recovery point: the point to which we recover in case of error• – Check pointing: periodically saving state of system• – Logging: saving changes made to system state

Many commercial machines use BER• – Sequoia, Synapse N+1, Tandem NonStop

BER also includes all-software schemes• – Nightly backups of file systems

May sacrifice performance to achieve availability• – Where might we lose performance?• – May not be suitable for real-time systems

Disadvantage• – it cannot undo errors in the environment!

matlab1.ir

The Domino Effect With concurrent processes that interact with each

other, BER is more complex Consider:

R22

R21

R13

R12

R11

IPC4

IPC3

IPC2

IPC1

Exe

cuti

on ti

me

P1 P2

If the error is detected in P1 rollback to R13.

If the error is detected in P2?

matlab1.ir

6 BER Issues

1- What state needs to be saved?

2- How do we save this state?

3- Where do we save it?

4- How often do we save it?

5- How do we recover the system to this state?

6- How do we resume execution after recovery?

matlab1.ir

1- What State needs to be saved

Need to save all state that would be necessary if this were to become the recovery point

In general, we only need to save the user-visible state

For example, microprocessors:• – Must save architectural state• – Don’t have to worry about micro-

architectural state

matlab1.ir

2- How to Save State

Two “hints” of BER:• – Check pointing: Periodically stop system and save state• – Logging: Log all changes to state

Check pointing• – Only suffers overhead at periodic checkpoints• – Can only recover at coarse granularity• – Size of checkpoint is often fixed

Logging• – Finer granularity of rollback• – suffers overhead for logging many common operations• – Amount of state logged is variable

matlab1.ir

3- Where to Save State

Have to save state where it is “reliable”• – A fault in the recovery point state could make recovery

impossible In processor (can’t survive loss of processor chip)

• – Processor saves registers to shadow registers In cache (same as processor, if on-chip cache)

• – Processor copies registers into cache In memory (memory can be made very reliable)

• – Processor copies registers into memory• – Write-through cache copies data into memory

In disk (maybe the safest, but slow)• – E.g., databases log updates to disks

In tape (too slow except for rare backups)

matlab1.ir

4- When to Save State

Check pointing• – Can choose checkpoint interval

Logging• – Continuously saving state (every time it

changes) For check pointing, a larger checkpoint

interval means• – Less overhead due to check pointing (since

less frequent)• – Coarser checkpoint granularity (can’t

recover to arbitrary point)

matlab1.ir

5- How to Recover State

Check pointing: Copy pre-fault recovery point checkpoint into architectural state

Logging: Unroll log to undo changes since recovery point

Tradeoff between these two depends on system

matlab1.ir

6- How to Resume Execution

Simply resuming execution after recovery may not be possible

• – E.g., recovery due to hard fault in interconnection switch

May need to reconfigure before resuming, to ensure forward progress

• – E.g., reconfiguring the routing in interconnect to avoid dead switch

matlab1.ir

Implementing EDC/ECC in Hardware

Where does EDC/ECC get used?• – Disk, CD-ROM• – Memory (DRAM, SRAM)• – Buses

Tradeoff between EDC and ECC ECC: Forward error recovery

• – Often on critical path, so can slow down even fault-free system

• - in a ony-way transmition EDC: Backward error recovery

• – Detecting error requires recovery (can be slow)

4. information redundancy reliable system design 2010 by: amir m. rahmani

Documents