
Viterbi implementation, using FPGAs

Version number Date

1.0 2008

Authors: Mário Véstias, Helena Sarmento

Revised by: Helena Sarmento

Technical Report

Project: UWB Receiver: baseband processing using reconfigurable hardware (UWBR)

PTDC/EEA-ELC/67993/2006


Funded by:


Abstract

This report describes the Viterbi decoding algorithm and presents an implementation of the decoder for the UWB MB-OFDM technology. The work is based on a previous implementation, which was analysed in order to improve its performance. Results are presented for implementations with different numbers of soft bits and different traceback lengths.


Table of Contents

Viterbi Algorithm
    Convolutional Code
    Trellis Diagram
    State transitions
    Decoding
    Hard versus soft decision
    Euclidean distance
    Decoding Length
Viterbi Decoder implementation
    BMU
    ACSU
    SMU
    DU
Results
Conclusions
References


Viterbi Algorithm

The Viterbi algorithm is a maximum-likelihood algorithm for decoding convolutional codes. It is a recursive sequential minimization algorithm that can be used to find the least expensive way to route symbols from one edge of a state diagram to another. The Viterbi algorithm uses a cost analysis mechanism to calculate the distance between the received symbol s and the symbol associated with an edge.

The Viterbi algorithm solves the minimization problem by recursively applying equation (1), where the minimum is taken over the states i that have a transition to state j.

PM[j](t) = min_i ( PM[i](t-1) + BM[i, j](s) ) (1)

PM[j](t) is the path metric associated with the minimum-cost path leading to state j at time t. BM[i, j](s) is the branch metric associated with the transition from state i to state j: the distance between the received symbol s and the symbol associated with the transition from state i at time t-1 to state j at time t.
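The recursion in equation (1) can be sketched in a few lines of Python. This is only an illustration: the function name `acs_step`, the dictionary representation of the path metrics and the example numbers are assumptions made here, not part of the implementation described in this report.

```python
def acs_step(prev_pm, branches):
    """Apply equation (1) for one time step.

    prev_pm  : dict mapping state i -> PM[i](t-1)
    branches : iterable of (i, j, bm) with bm = BM[i, j](s)
    Returns (new_pm, survivor): new_pm[j] = PM[j](t) and survivor[j] is the
    predecessor state i that achieved the minimum.
    """
    new_pm = {}
    survivor = {}
    for i, j, bm in branches:
        candidate = prev_pm[i] + bm
        if j not in new_pm or candidate < new_pm[j]:
            new_pm[j] = candidate
            survivor[j] = i
    return new_pm, survivor


# Hypothetical two-state example: every state is reachable from both states.
prev_pm = {0: 0, 1: 3}
branches = [(0, 0, 2), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
new_pm, survivor = acs_step(prev_pm, branches)
print(new_pm, survivor)
```

Keeping the predecessor next to each minimum is what later allows the decoder to trace the surviving path backwards.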

Convolutional Code

A convolutional code is a type of error-correcting code in which each block of m input bits (m-bit string) is transformed into a block of n bits, and the transformation is a function of the last k bits. The quantity m/n is the code rate, a measure of the efficiency of the code. The constraint length k is the number of bits in the encoder memory that affect the generation of the n output bits.

The convolutional code is defined by a set of n generator polynomials for each input bit. The constraint length equals the degree of the highest-degree generator polynomial plus one.

Figure 1 – MB-OFDM convolutional encoder (input in, delay elements D, register bits D6 to D0, outputs outA, outB and outC)

Figure 1 presents the convolutional encoder defined for MB-OFDM, where m = 1, n = 3, code rate m/n = 1/3 and constraint length k = 7. The generator polynomials are given by equations (2), (3) and (4).

G0 = (133)8 = (1011011)2 (2)

G1 = (165)8 = (1110101)2 (3)

G2 = (171)8 = (1111001)2 (4)

G0 generates outA, G1 outB and G2 outC (equations (5), (6) and (7)), where + denotes modulo-2 addition and the most significant bit of each polynomial corresponds to D6.

OutA = D0 + D1 + D3 + D4 + D6 (5)

OutB = D0 + D2 + D4 + D5 + D6 (6)

OutC = D0 + D3 + D4 + D5 + D6 (7)
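As an illustration, the encoder can be modelled as a 7-bit shift register whose taps are selected by the generator polynomials (2) to (4). The sketch below assumes that D6 holds the incoming bit, that D5 to D0 form the 6-bit state and that the most significant bit of each polynomial weights D6; the helper names `parity` and `encode` are hypothetical.

```python
G0, G1, G2 = 0o133, 0o165, 0o171  # generator polynomials, equations (2)-(4)


def parity(x):
    """Modulo-2 sum of the bits of x."""
    return bin(x).count("1") & 1


def encode(bits, state=0):
    """Rate-1/3 encoding. state is the 6-bit register content D5..D0."""
    symbols = []
    for b in bits:
        reg = (b << 6) | state  # D6 = new input bit, D5..D0 = current state
        symbols.append((parity(reg & G0), parity(reg & G1), parity(reg & G2)))
        state = reg >> 1        # shift right: the oldest bit (D0) is dropped
    return symbols, state


symbols, final_state = encode([1, 0])
print(symbols, final_state)
```

Because every polynomial has its D6 and D0 taps set, a single 1 entering an all-zero register and a single 1 leaving it both produce the output symbol 111.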

Trellis Diagram

A trellis diagram is a state diagram unrolled in time. The trellis diagram of a convolutional code with k stages, in which 1 bit at a time is shifted into the shift register, has 2^(k-1) states. The trellis diagram for the convolutional encoder of Figure 2 is presented in Figure 3. Figure 3 presents the four states (00, 01, 10, 11) and the transitions between states for 4 time intervals (t0-t1, t1-t2, t2-t3 and t3-t4). The initial state at time t0 is 00. Transitions for an input bit of 1 are represented by dashed lines and transitions for an input bit of 0 by solid lines. The output values of each transition are shown next to the transition branch.

Figure 2 – Convolutional encoder (m = 1, n = 2, k = 3), with one input bit, a two-bit state and outputs outA and outB

Figure 3 – Trellis diagram (m = 1, n = 2, k = 3), showing states 00, 10, 01 and 11 at times t0 to t4

State transitions

For a convolutional encoder with one input bit (m = 1), only two paths merge at each node, as represented in the example of Figure 4: only two states can change to the same state. These two states differ in the least significant bit (D0), and the input bit must be the same. A butterfly can represent the state transitions to each state (Figure 5).

Figure 4 – State transition (m = 1, k = 7). With the state written as D5 D4 D3 D2 D1 D0, states i = (A B C D E 0) and i+1 = (A B C D E 1) both move to state m = (0 A B C D E) when the input bit is 0 and to state m+32 = (1 A B C D E) when the input bit is 1.

Figure 5 – Butterfly representation of transitions (m = 1, k = 7): states i and i+1 connect to states m (input 0) and m+32 (input 1)
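The butterfly structure can be checked with a short sketch. It assumes that a state is stored as the integer D5 D4 D3 D2 D1 D0 with D5 as the most significant bit; the function names are hypothetical.

```python
K = 7                    # constraint length
N_STATES = 1 << (K - 1)  # 64 states


def next_state(state, bit):
    """The input bit enters at D5 and the oldest bit D0 is shifted out."""
    return (bit << (K - 2)) | (state >> 1)


def predecessors(state):
    """The two states that can change to `state`; they differ only in D0."""
    base = (state << 1) & (N_STATES - 1)
    return base, base + 1


def butterfly(i):
    """For an even state i, return ((i, i+1), (m, m+32)) as in Figure 5."""
    m = i >> 1
    return (i, i + 1), (m, m + N_STATES // 2)


print(predecessors(0), butterfly(10))
```

The input bit that leads into a state is its most significant bit, so each state is reached from both of its predecessors with the same input bit.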

Decoding

The Viterbi algorithm tries to find the path through the trellis diagram whose sequence of output symbols most closely matches the received sequence. To accomplish this task, it calculates for each path the path metric, which measures the distance to the received symbol sequence.

As two paths can merge at each node (Figure 5), two path metrics are computed for each node. Only one path survives: the survivor is the path with the minimum distance to the received sequence. Since the number of nodes is 2^(k-1), the number of computations performed at each time interval increases exponentially with k. This exponential increase makes it impractical to use large constraint lengths in convolutional codes.

Figure 6 – Path metric

In Figure 6 the starting state is assumed to be 00. It has been found that a Viterbi decoder may start decoding at any arbitrary point in a transmission if all state metrics are initially reset to zero. In this example, received bits are represented by orange numbers. An error exists, represented by the italic 0. Two paths reach state 01 at time t3 (the blue and green paths). Using the Hamming distance, the branch metrics associated with each transition of these paths were calculated. The metrics of the blue and the green paths are given by (8) and (9), respectively. The blue path has the smaller distance to the received sequence.

PM[01](t3) = 0 + 1 + 1 = 2 (8)

PM[01](t3) = 2 + 2 + 2 = 6 (9)

If the decoding process ends at time t3, the received sequence will be corrected from 110001 to 110101.
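A minimal hard-decision decoder for a small k = 3, rate 1/2 code is sketched below. The generator polynomials (7, 5 in octal) are an assumption, since the report does not list the polynomials of the encoder in Figure 2, and all names are hypothetical. The decoder starts from state 00, keeps one survivor per state and traces back from the state with the lowest final metric.

```python
G = (0b111, 0b101)       # assumed generators for a k = 3, rate 1/2 code
K = 3
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def branch(state, bit):
    """Return (next_state, output_symbol) for one transition."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, symbols = 0, []
    for b in bits:
        state, out = branch(state, b)
        symbols.append(out)
    return symbols


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def viterbi_hard(received):
    """Return (decoded_bits, best_path_metric) using Hamming branch metrics."""
    inf = float("inf")
    pm = [0] + [inf] * (N_STATES - 1)  # decoding starts in state 00
    history = []
    for sym in received:
        new_pm = [inf] * N_STATES
        prev = [None] * N_STATES
        for s in range(N_STATES):
            if pm[s] == inf:
                continue               # state not reachable yet
            for bit in (0, 1):
                nxt, out = branch(s, bit)
                cand = pm[s] + hamming(sym, out)
                if cand < new_pm[nxt]:
                    new_pm[nxt] = cand
                    prev[nxt] = (s, bit)
        pm = new_pm
        history.append(prev)
    state = min(range(N_STATES), key=lambda s: pm[s])
    best = pm[state]
    bits = []
    for prev in reversed(history):     # traceback along the survivor path
        state, bit = prev[state]
        bits.append(bit)
    return bits[::-1], best


message = [1, 1, 0, 1, 0, 0]
received = encode(message)
received[1] = (received[1][0], 1 - received[1][1])  # inject one bit error
print(viterbi_hard(received))
```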

Hard versus soft decision

In the demodulation process, received analog waveforms are converted to a digital signal: the sampled voltages are quantized. In the simplest quantization method, hard decision, two levels are used. The demodulator codes the two levels using a single bit, that is, it decides whether the received bit is “0” or “1”.

Figure 7 – Hard decision and 8-level soft decision for BPSK (2-level hard decision: 0, 1; 8-level soft decision: 000, 001, 010, 011, 100, 101, 110, 111, ordered from the highest probability of being 0 to the highest probability of being 1)

At the bottom of Figure 7, the sampled voltage (BPSK) is quantized in two levels and demodulated to a 0 or a 1. Adding more levels to the quantization process improves the decoder performance, as the demodulator provides the decoder with a measure of confidence in its decision. For instance, in an 8-level soft decision, the demodulator identifies 8 levels. These levels indicate a 0 or a 1 with high or low confidence [1]. The 3 soft bits can be coded using an offset-binary format or a sign-magnitude format [2] (Table 1). Two's complement format can also be adopted. Figure 7 presents 3 soft bits in offset-binary format.

Table 1 – Soft bits mapping for 8 levels

Code words
Offset-binary   Sign-magnitude   2's complement
111             111              011              strongest 1
110             110              010
101             101              001
100             100              000              weakest 1
011             000              111              weakest 0
010             001              110
001             010              101
000             011              100              strongest 0
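The relation between the three formats of Table 1 can be expressed compactly. The sketch below is an illustration with hypothetical function names: it converts an offset-binary code word into the other two formats and reproduces the rows of the table.

```python
def offset_to_sign_magnitude(code, bits=3):
    """Offset-binary -> sign-magnitude, following Table 1."""
    low_mask = (1 << (bits - 1)) - 1
    if code >> (bits - 1):               # a "1": code word is unchanged
        return code
    return low_mask - (code & low_mask)  # a "0": sign bit 0, magnitude inverted


def offset_to_twos_complement(code, bits=3):
    """Offset-binary -> two's complement: invert the most significant bit."""
    return code ^ (1 << (bits - 1))


for code in range(7, -1, -1):
    print(format(code, "03b"),
          format(offset_to_sign_magnitude(code), "03b"),
          format(offset_to_twos_complement(code), "03b"))
```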

For a Gaussian channel, 8-level quantization improves the required signal-to-noise ratio by approximately 2 dB when compared with 2-level quantization. Analog (infinite-level) quantization gives a 2.2 dB signal-to-noise improvement over 2-level quantization. Therefore, 8-level quantization loses only 0.2 dB when compared with the analog representation [3].

For hard decision decoding, the Hamming distance is used. It is defined as the number of bits that differ between the symbol received at the decoder and the output symbol of the trellis diagram branch. The Euclidean distance is adopted for soft decision decoding.

Euclidean distance

In the Viterbi algorithm, path metrics must be compared to determine the survivor path. Since path metrics are accumulated branch metrics, it suffices to analyze the branch metrics.

For a 1/3 code rate, the Euclidean distance measures the distance between the noisy received symbol (Ai, Bi, Ci) and the ideal output symbol (A, B, C) of the transition between two states of the trellis diagram (Figure 8), using equation (10).

Figure 8 – Received bits (Ai Bi Ci) and output bits (A B C) in a 1/3 decoder

$$bm_i = bm(A_i, B_i, C_i) = (A_i - A)^2 + (B_i - B)^2 + (C_i - C)^2 \qquad (10)$$

Developing the first term of equation (10), we obtain equation (11).

$$(A_i - A)^2 = A_i^2 + A^2 - 2 A A_i \qquad (11)$$

If the ideal (noiseless) output symbols 0 and 1 are represented by the symmetrical values -a and +a, then A^2 = a^2 and equation (12) is obtained.

$$(A_i - A)^2 = A_i^2 + a^2 - 2 A A_i \qquad (12)$$

As only differences between branch metrics are important, we can add a constant to all branch metrics in the same time interval, or multiply them by a positive constant, without changing the comparison between path metrics. Since all branch metrics in the same time interval share the terms A_i^2 and a^2, these terms can be dropped. Dividing by a, the branch metric can be computed from the term in equation (13), and discarding the common factor 2 gives equation (14).

$$-\frac{2A}{a} A_i \qquad (13)$$

$$bm'(A_i, B_i, C_i) = -\frac{A}{a} A_i - \frac{B}{a} B_i - \frac{C}{a} C_i \qquad (14)$$

Table 2 presents the branch metric for each of the 8 possible symbols. Only additions and subtractions need to be implemented to calculate the Euclidean distances. We considered the noiseless symbols 0 and 1 to be represented by the symmetrical values -a and +a. It can be demonstrated [4] that for other representations (b, b+k) the branch metric can also be calculated with additions only.

Table 2 – Branch metrics for the 8 possible received symbols

Received symbol (Ai Bi Ci)   Branch metric
000                          bm(0,0,0) = A + B + C
001                          bm(0,0,1) = A + B - C
010                          bm(0,1,0) = A - B + C
011                          bm(0,1,1) = A - B - C
100                          bm(1,0,0) = -A + B + C
101                          bm(1,0,1) = -A + B - C
110                          bm(1,1,0) = -A - B + C
111                          bm(1,1,1) = -A - B - C
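The sketch below illustrates why the simplified metric of equation (14) can replace the Euclidean distance. It assumes a = 1 and, as in the BMU description, takes A, B and C as the received soft values while the three-bit index selects the signs; the function names are hypothetical. For every symbol the Euclidean distance equals a constant plus twice the simplified metric, so both rank the eight symbols identically.

```python
from itertools import product


def branch_metrics(a_rx, b_rx, c_rx):
    """Simplified metrics of Table 2 (bit 0 maps to -1, bit 1 maps to +1)."""
    metrics = {}
    for bits in product((0, 1), repeat=3):
        signs = [1 if bit else -1 for bit in bits]
        metrics[bits] = -(signs[0] * a_rx + signs[1] * b_rx + signs[2] * c_rx)
    return metrics


def euclidean(a_rx, b_rx, c_rx, bits):
    """Squared Euclidean distance of equation (10) with a = 1."""
    ideal = [1 if bit else -1 for bit in bits]
    return sum((r - s) ** 2 for r, s in zip((a_rx, b_rx, c_rx), ideal))


rx = (-5, -2, 1)  # hypothetical received soft values
bm = branch_metrics(*rx)
best = min(bm, key=bm.get)
print(best, bm[best], euclidean(*rx, best))
```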

Decoding Length

Looking at the example of Figure 6, we can see that only at time t3 can the decoder decide the first decoded bit (0 on the blue path). For long sequences, the Viterbi algorithm requires large decoding delays and, therefore, a large amount of memory, because paths must be stored before being discarded. The storage requirements grow exponentially with the constraint length k: for a code rate of 1/n, a set of 2^(k-1) paths must be stored after each decoding step. It has been demonstrated that a decoding delay of five times the constraint length results in negligible performance degradation.

For punctured codes, the decoding length (decoding delay or traceback length) must be increased to compensate for the addition of dummy 0s [5][6]. High-rate codes, where more bits are punctured, have low minimum distances between coded sequences and, therefore, the survivor paths take longer to converge [7]. Simulations are usually performed to determine the decoding length.

Viterbi Decoder implementation

We implemented the Viterbi decoder based on our previous implementation [8].

The functionality of a Viterbi decoder is usually implemented by three functional units: the branch metric unit (BMU), the add-compare-select unit (ACSU) and the survivor memory unit (SMU). The BMU calculates the distance (metric) between the received noisy symbol and the output symbol of the state transition (branch). The ACSU computes the accumulated metric associated with the sequence of transitions (path) that reaches a state. When more than one path arrives at a state, the ACSU selects the path with the lowest metric value, which is the survivor path. The SMU stores the information that allows tracing back from a state to the previous one.

Figure 9 – Viterbi decoder architecture


Figure 9 presents the classical architecture of a Viterbi decoder [5][10], where the ACSU has a parallel architecture. For high-speed communications, the required throughput can only be achieved by parallel or pipelined architectures [11]. In Figure 9, traceback processing is performed by the decision unit (DU), using data stored in the SMU.

BMU

The BMU computes the Euclidean distance between the received symbol and the output symbol of a transition. Considering the base code rate of 1/3 (MB-OFDM), a symbol of three bits is compared. For each bit, we use four soft bits. The soft bit format used is presented in Figure 10.

Value   Code
 7      0111
 6      0110
 5      0101
 4      0100
 3      0011
 2      0010
 1      0001
-1      1111
-2      1110
-3      1101
-4      1100
-5      1011
-6      1010
-7      1001

Figure 10 – Soft decision format (dynamic range from -7 to 7)

Figure 11 presents the BMU implementation, where A, B, C represent the noisy received

bits. The eight outputs represent the branch metrics of Table 2.

Figure 11 – BMU implementation (inputs A, B and C with n bits each; outputs bm(000), bm(001), bm(010), bm(011), bm(100), bm(101), bm(110) and bm(111) with n+2 bits each)

ACSU

The MB-OFDM encoder, with constraint length seven, has a state diagram with 64 states

[9]. In order to compute in parallel the accumulated distances for each state, at each time

step, 64 ACSUs were implemented.

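Figure 12 – ACSU (inputs: branch metrics bmSn and bmSm, 6 bits each, and previous path metrics MSn(t-1) and MSm(t-1), 12 bits each; a comparator and a MUX produce the new path metric MS(t), 12 bits, and the decision bit dS(t), 1 bit)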

As depicted in Figure 12, and as follows from the trellis diagram of the rate-1/3 MB-OFDM code, two path metrics are computed for each state. Therefore, each ACSU has two inputs from the BMU, one branch metric for each incoming path, and also receives the accumulated metrics of the two predecessor states. Each ACSU has two outputs: the accumulated metric of the survivor path and one decision bit, identifying the previous state to allow traceback. For example, ACSU00, which calculates the accumulated metric for state 000000, has two inputs from the BMU: bm(000) and bm(111). In fact, the output of the state transition from state 000000 to state 000000 is 000, and the output of the transition from state 000001 to state 000000 is 111.
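The wiring of branch metrics to each ACSU follows from the encoder. The sketch below is only an illustration: it assumes the register convention of Figure 4 (input bit enters at D5, D0 is shifted out), breaks ties in favour of the first predecessor and encodes the decision bit as 0 for the first predecessor, none of which is specified in this report. It derives the two branch symbols of any state and models one add-compare-select operation.

```python
G = (0o133, 0o165, 0o171)  # MB-OFDM generator polynomials


def parity(x):
    return bin(x).count("1") & 1


def branch_symbol(prev_state, bit):
    """Output symbol of the transition taken from prev_state with input bit."""
    reg = (bit << 6) | prev_state
    return "".join(str(parity(reg & g)) for g in G)


def acsu_inputs(state):
    """(predecessor, branch symbol) pairs feeding the ACSU of `state`."""
    bit = state >> 5
    base = (state << 1) & 0x3F
    return [(p, branch_symbol(p, bit)) for p in (base, base + 1)]


def acs(pm_n, pm_m, bm_n, bm_m):
    """Add, compare, select: return (survivor metric, decision bit)."""
    cand_n, cand_m = pm_n + bm_n, pm_m + bm_m
    if cand_n <= cand_m:
        return cand_n, 0
    return cand_m, 1


print(acsu_inputs(0), acs(5, 3, 1, 4))
```

For state 000000 the sketch returns the branch symbols 000 and 111, in agreement with the ACSU00 example.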

SMU

The SMU is a memory that stores the decision bit (ds in Figure 13) identifying, for each state and for each time step, the previous state. We used a decoding length of seven times the constraint length. Therefore, the SMU memory has 49×64 bits: the algorithm runs for 49 time steps and 64 states exist.

Figure 13 – SMU (49 words of 64 decision bits: d0_00 ... d0_63, d1_00 ... d1_63, ..., d48_00 ... d48_63)

For a good performance, as 64 ACSUs exist, 64 bits will be available at each time unit.

Therefore, the memory is implemented with 49 elements of 64 bits (Figure 13).
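A software model of this organization, given only as a sketch, can store one 64-bit word per time step. The class name, the packing of state s into bit s of the word and the use of a bounded queue that overwrites the oldest word are assumptions made for illustration.

```python
from collections import deque


class SurvivorMemory:
    """Model of the SMU: `depth` words with one decision bit per state."""

    def __init__(self, depth=49, n_states=64):
        self.n_states = n_states
        self.words = deque(maxlen=depth)  # the oldest word is dropped when full

    def write(self, decisions):
        """Pack the decision bits of one time step into a single word."""
        word = 0
        for state, d in enumerate(decisions):
            word |= (d & 1) << state
        self.words.append(word)

    def read(self, step, state):
        """Decision bit of `state` at stored time step `step` (0 = oldest)."""
        return (self.words[step] >> state) & 1

    def size_bits(self):
        return self.words.maxlen * self.n_states


smu = SurvivorMemory()
smu.write([s & 1 for s in range(64)])
print(smu.size_bits(), smu.read(0, 3))
```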

DU

The decision unit detects the state with the lowest metric and identifies the path that reaches it. To identify the state with the lowest metric, the final path metrics are compared with each other until the state with the lowest metric is found. Instead of using dedicated comparators, our implementation reuses the comparators of the ACS unit (see Figure 14).

Figure 14 – Comparison using the ACS unit (same structure as Figure 12, with the two branch metric inputs set to 0)

The 64 values are compared using the comparators of the first 32 ACS units. For a valid comparison, the bm inputs are set to 0. The process repeats iteratively for 6 cycles (64 = 2^6) until the best metric is found. At each step, according to the comparison result, the state with the lowest metric is partially identified. At the final step, the state is completely identified.
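The iterative search can be sketched as a pairwise tournament. The pairing of adjacent candidates used below is an assumption, since the report does not state which metrics are compared in each cycle, and the function name is hypothetical. The first round uses 32 comparisons and each later round uses half as many, so 64 metrics are reduced to one in 6 rounds.

```python
def find_best_state(metrics):
    """Pairwise tournament; len(metrics) must be a power of two.

    Returns (state, metric, rounds).
    """
    candidates = list(enumerate(metrics))
    rounds = 0
    while len(candidates) > 1:
        candidates = [
            min(candidates[i], candidates[i + 1], key=lambda c: c[1])
            for i in range(0, len(candidates), 2)
        ]
        rounds += 1
    state, metric = candidates[0]
    return state, metric, rounds


# Hypothetical final path metrics for the 64 states.
metrics = [(7 * s + 3) % 64 + 10 for s in range(64)]
print(find_best_state(metrics))
```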

This approach reduces the resources needed to compute the Viterbi algorithm. However, since the resources are shared, the analysis of the next bits is delayed by 6 cycles. The traceback block (see Figure 15) starts from the state with the best metric and serially determines the values of the bits (b0, b1, b2, b3, ..., b(tbl-1), where tbl is the traceback length). The circuit determines the backward path based on the decision bits found during the calculation of the paths. From a state and a decision bit, the DU block finds the previous state. The process repeats iteratively until the first decision bit is found.

Figure 15 – DU traceback (blocks: MUX, DECISOR and registers R0 to R(tbl-1) holding the bits b0 to b(tbl-1))
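The traceback step can be sketched as follows, assuming that the decision bit stored for a state is the D0 bit of its surviving predecessor (the bit that was shifted out) and that states use the D5 ... D0 convention of Figure 4. The function name and the data layout are hypothetical.

```python
def traceback(start_state, decisions, k=7):
    """Walk the survivor path backwards.

    decisions[t][state] is the decision bit stored at time step t.
    Returns the decoded bits, oldest first.
    """
    mask = (1 << (k - 1)) - 1
    state = start_state
    bits = []
    for step in reversed(decisions):
        bits.append(state >> (k - 2))                # input bit that led here
        state = ((state << 1) & mask) | step[state]  # previous state
    return bits[::-1]


# Build the decision bits of a known path and recover the input bits.
bits_in = [1, 0, 1, 1, 0, 0, 1, 0, 1]
state, decisions = 0, []
for b in bits_in:
    nxt = (b << 5) | (state >> 1)
    row = [0] * 64
    row[nxt] = state & 1
    decisions.append(row)
    state = nxt
print(traceback(state, decisions))
```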


Results

The Viterbi decoder was described in VHDL and placed and routed in a Virtex-5 FPGA using ISE 10.1. Different designs with traceback lengths from 35 to 70 and with 3 and 4 soft bits were implemented (see results in Table 3).

Table 3 – Results for the Viterbi decoder

Soft bits   Traceback length   LUT/FF pairs   BRAM   Freq (MHz)   Mbps
3           35                 2628           2      242          207
3           42                 2628           2      242          212
3           49                 2628           2      242          216
3           56                 2903           2      241          219
3           63                 2903           2      241          221
4           35                 2935           2      242          207
4           42                 2935           2      242          212
4           49                 3173           2      240          214
4           56                 3173           2      240          217
4           63                 3173           2      240          219
4           70                 3173           2      240          221

As expected, the implementation with 4 soft bits consumes more resources than that with 3

soft bits. For example, with a traceback length of 49 the implementation with 4 soft bits

consumes around 20% more resources. This percentage reduces to 10% for higher

traceback lengths.

All implementations achieve more than 200 Mbps. The throughput is lower than the

operating frequency since our implementation of the Viterbi decoder uses the ACSUs to

implement the decision unit. Since the DU unit takes 6 cycles to execute, the throughput,

Th, is given by equation (15)

Th = Freq × 1/ ( 1 + 6/traceback_length ) (15)
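Equation (15) can be evaluated directly. The sketch below, with a hypothetical function name, assumes one decoded bit per clock cycle, so a frequency in MHz gives a throughput in Mbps.

```python
def throughput_mbps(freq_mhz, traceback_length, du_cycles=6):
    """Equation (15): the DU adds du_cycles for every traceback_length bits."""
    return freq_mhz / (1 + du_cycles / traceback_length)


for freq, tbl in [(242, 35), (242, 42), (242, 49), (240, 70)]:
    print(tbl, round(throughput_mbps(freq, tbl)))
```

Longer traceback lengths amortize the 6 DU cycles over more decoded bits, which is why the throughput rises with the traceback length even when the clock frequency drops slightly.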

Based on the implementation results and taking into account the BER analysis (Matlab simulations), we conclude that with 3 soft bits the most efficient solution is the one with a traceback length of 49. It achieves almost the same BER and throughput with 10% fewer resources. However, with 4 soft bits, the most efficient solution is the one with a traceback length of 70. Using 4 soft bits instead of 3 achieves a 22% improvement in the BER at the cost of an additional 21% resource utilization.

Conclusions

Many Viterbi implementations have been proposed for reconfigurable computing using FPGAs (see, for example, [12][13]). More recent works are fewer and tend to address the specific requirements of a target application.

For example, [14] presents a configurable 3-bit soft decision Viterbi decoder implementation that meets the requirements for WLAN and broadband applications. The programmable design supports a constraint length K=7 soft decision Viterbi decoder (SDVD) realization with a code rate (R) of 1/2 and traceback lengths (TBL) of 35 and 50 symbols. The architecture works with a throughput of 155 Mbps in a XC2VP100-1704ff–5 FPGA device.

Our proposal achieves higher throughput and uses fewer resources, since we use the comparators of the ACSU blocks to compare the final cost values, with a small penalty in performance.


References

[1] B. Sklar, Digital Communications: Fundamentals and Applications, Second Edition, Prentice Hall, 2001.

[2] Qualcomm Application Note AN1650-2, "Setting Soft-Decision Thresholds for Viterbi Decoder Code Words from PSK Modems".

[3] J. Heller and I. Jacobs, "Viterbi Decoding for Satellite and Space Communication", IEEE Transactions on Communication Technology, vol. 19, no. 5, part 1, pp. 835-848, October 1971.

[4] H. Lou, "Implementing the Viterbi Algorithm: Fundamentals and Real-Time Issues for Processors Designers", IEEE Signal Processing Magazine, vol. 12, no. 5, pp. 42-52, 1995.

[5] S. Singhal and M. Gilani, "Crafting a Custom Viterbi Decoder for WLAN Designs", January 2002, http://www.commsdesign.com/showArticle.jhtml?articleID=16504015

[6] C. L. Taylor, "Punctured Convolutional Coding Scheme for Multi-Carrier Multi-Antenna Wireless Systems", EECS Department, University of California, Berkeley, Technical Report No. UCB/ERL M01/27, 2001.

[7] Robert H. Morelos-Zaragoza, The Art of Error Correcting Coding, Second Edition, John Wiley & Sons, 2006.

[8] Rui Borges, Horácio Neto and Helena Sarmento, "Implementing a Viterbi decoder in a FPGA for a UWB MB-OFDM receiver", XXII Conference on Design of Circuits and Integrated Systems, November 2007.

[9] ECMA, "Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard", December 2007.

[10] Y.-N. Chang, H. Suzuki and K. Parhi, "A 2-Mb/s 256-state 10-mW rate-1/3 Viterbi decoder", IEEE Journal of Solid-State Circuits, vol. 35, no. 6, pp. 826-834, June 2000.

[11] I. Bogdan, M. Munteanu, P. A. Ivey, N. L. Seed and N. Powell, "Power Reduction Techniques for a Viterbi Decoder Implementation", www.mitzanu.ro/resume/pdf/espld00.pdf.

[12] J. Cavallaro and M. Vaya, "Viturbo: a reconfigurable architecture for Viterbi and turbo decoding", ICASSP '03, vol. 2, pp. II-497-500, April 2003.

[13] K. Chadha and J. Cavallaro, "A Reconfigurable Viterbi Decoder Architecture", Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 66-71, 2001.

[14] Abdul-Rafeeq Abdul-Shakoor and Valek Szwarc, "A High Performance Soft Decision Viterbi Decoder for WLAN and Broadband Applications", IEEE CCECE/CCGEI, Ottawa, May 2006.
