high-performance, low-cost joint equalizer and trellis decoder for 1000base-t gigabit ethernet...

THE JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS Volume 14, Issue 2, June 2007

ZHU Yue, RONG Meng-tian

High-performance, low-cost joint equalizer and trellis decoder for 1000BASE-T gigabit Ethernet transceiver

CLC number TN47 Document A

Abstract This article presents an M-algorithm (MA) decoder with 4 survival paths (MA4) for the Institute of Electrical and Electronics Engineers (IEEE) 802.3ab 1000BASE-T gigabit Ethernet (GbE) transceiver. To meet all of the requirements, several methods were introduced to accelerate the MA4 decoder while retaining the desired high performance and low complexity. An optimized look-ahead architecture was employed to solve the critical path problem with minimal gate consumption. Symbol compression methods saved registers in the pipeline stages. A sorting network accelerated the kernel sorting operation at low hardware cost by exploiting the special characteristics of MA4. Simulation and synthesis results show that the proposed decoder achieves a 125 MHz clock frequency and 1 Gb/s throughput in a 1.8 V, 0.18 µm standard cell complementary metal-oxide-semiconductor (CMOS) process. It achieves an additional 0.4 dB coding gain over a 14-tap parallel decision feedback decoder (PDFD) with a 39% area reduction.

Keywords 1000BASE-T, GbE, M-algorithm, PDFD, look-ahead technology

1 Introduction

The GbE on category-5 unshielded twisted pair (UTP), which corresponds to the IEEE 802.3ab 1000BASE-T standard, transmits a 125 MBaud 4-dimensional 5-level pulse amplitude modulation (4D-PAM5) signal over 4 pairs of UTP to achieve a 1 Gb/s data transmission rate, and trellis-coded modulation (TCM) is employed to offer an additional coding gain of 6 dB. A maximum likelihood sequence estimation (MLSE) decoder achieves the theoretical coding gain, but it is too complex to be implemented. To reduce the complexity of MLSE to an acceptable degree without a serious performance loss, decision feedback sequence estimation, or its special case, parallel decision feedback decoding, is employed [1-6]. A PDFD searches for the most likely sequence in a reduced-state

Received date: 2006-09-12 ZHU Yue, RONG Meng-tian Department of Electronic Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China E-mail: zhuyue@gmd com

Article ID 1005-8885 (2007) 02-0106-06

trellis. However, it is not the only choice. An MA [7] decoder retains only the M paths with the best metrics to reduce the complexity of the decoder. It can therefore work on both the nonreduced trellis and the reduced trellis. Each of these M paths is extended, and the extensions are sorted to find the best M paths, which are stored in every symbol interval. This approach creates 4M path extensions in every clock cycle of a GbE application. Similar to parallel decision feedback decoding, MA is a breadth-first search algorithm over the code trellis, and because of that characteristic the latency of an MA decoder is fixed in sequence decoding. An MA4 decoder provides higher performance than a PDFD at lower complexity. However, several more operations are required in the critical path of MA4 than in that of PDFD, which is the main drawback of the MA decoder for high-throughput applications [6].

To realize an MA decoder for a 1000BASE-T GbE transceiver in a 0.18 µm standard cell CMOS process, several acceleration methods, including various types of look-ahead (LA) technologies, were employed to shorten the critical path of the MA4 decoder. The proposed MA4 decoder reaches an operating frequency similar to that of an LA-PDFD in the same IC process without any performance degradation.

Furthermore, similar to PDFD, an MA decoder can work with a decision feedback equalizer based prefilter (DFP) [2-4, 6, 8, 9]. A special idle mode is taken into consideration during the design of the DFP to turn the decoder into low-power mode between GbE frames. Although these aspects were considered during the design of this decoder, this article concentrates on high performance, and those optimizations will be analyzed separately.

The remainder of this article is organized as follows: in Sect. 2, MA is described. Architectural details of the MA decoder for 1000BASE-T GbE are proposed in Sect. 3. Experimental results are presented in Sect. 4, and the conclusion is provided in Sect. 5.

2 Algorithm description

The input of MA is the whitened signal $z_{k,j}$ from the feed-forward equalizer (FFE) at moment $k$:

$$z_{k,j} = u_{k,j} + \sum_{i=1}^{L} h_{i,j}\, u_{k-i,j} + n_{k,j} \qquad (1)$$

where $u_k$ is the 4-D-symbol transmitted at moment $k$, $\{h_{i,j}\}$ are the $L$ post-cursor channel coefficients of the $j$th twisted pair, and $n_{k,j}$ is the noise sample. In this article, the subscript $j$ denotes the $j$th dimension of a 4-D-signal, $j \in \{0, 1, 2, 3\}$.

First, the output symbol $\hat{u}_{k,k-D}(0)$ is generated by tracing back $D$ stages on the survival path with the least path metric (PM) $\Gamma_k(0)$, and the survival sequences of the current state are tested. Not more than $M - 1$ paths may be deleted in this phase, and the remaining $M_c$ paths are retained. $\hat{u}_{k,l}(\rho)$ is the $l$th 4-D-symbol of survival path $P_k(\rho)$, $0 \le \rho < M$. In this article, $\rho$ denotes a signal that is related to the path with the $\rho$th smallest PM.

We tested the ambiguity check operation of the paths used in the MA decoder [9] and found that the requirement of output consistency is not necessary in this design. Moreover, enforcing output consistency causes a small performance loss at each signal-to-noise ratio (SNR) we simulated. Thus, a duplicated path test operation is adopted in our design, although the ambiguity check is also applicable.

Second, path extensions are generated from the current survival paths $P_k(\rho)$, and the related branch metrics (BMs) are calculated. Though MA can work on the nonreduced trellis, the branch metric unit (BMU) is very complex in such a case. Thus, an MA decoder for 1000BASE-T GbE works on the reduced trellis, which is similar to the work pattern of PDFD. After post-cursor ISI cancellation, the signals $z_{k,j}$ are mapped to the nearest 4-D-symbol $\hat{u}_k(\rho_k, e_{k+1})$ that belongs to the subset $s_{k+1}(\rho_k, e_{k+1})$, and the related branch metrics are generated. $s_{k+1}(\rho_k, e_{k+1})$ is the subset of the $e_{k+1}(\rho_k)$th branch extending from path $\rho_k$:

$$\lambda_k(\rho_k, e_{k+1}) = \min_{u \in s_{k+1}(\rho_k, e_{k+1})} \sum_{j=0}^{3} \left( z_{k,j} + i_{k,j}(\rho_k) - u_j \right)^2 \qquad (2)$$

where $i_{k,j}(\rho)$ is the output of the DFU and

$$i_{k,j}(\rho) = -\sum_{i=1}^{L} h_{i,j}\, \hat{u}_{k,k-i,j}(\rho) \qquad (3)$$
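The ISI cancellation and slicing step can be illustrated in one dimension. The following is only a minimal sketch: it assumes a squared Euclidean distance as the metric and the usual 1000BASE-T 1-D subsets A = {-1, +1} and B = {-2, 0, +2}; the function names and the tap values are hypothetical and are not taken from the design.

```python
# Assumed 1-D subsets of PAM5 (standard 1000BASE-T set partitioning).
SUBSET_A = (-1, 1)
SUBSET_B = (-2, 0, 2)


def dfu_output(taps, survivor):
    """Return the ISI correction term for one wire pair.

    taps[i] is the post-cursor coefficient for the symbol i + 1 steps back;
    survivor holds the survival symbols, oldest first.
    """
    return -sum(h * u for h, u in zip(taps, reversed(survivor)))


def slice_1d(z, isi_term, subset):
    """Cancel the ISI, pick the nearest symbol of `subset`, return (symbol, metric)."""
    x = z + isi_term
    symbol = min(subset, key=lambda s: (x - s) ** 2)
    return symbol, (x - symbol) ** 2


# Hypothetical two-tap channel; the survivor says the last symbols were 2 then 1.
taps = [0.5, 0.25]
survivor = [2, 1]
z = -1 + 0.5 * 1 + 0.25 * 2  # transmitted -1 plus post-cursor ISI, no noise

isi_term = dfu_output(taps, survivor)
symbol_a, metric_a = slice_1d(z, isi_term, SUBSET_A)
symbol_b, metric_b = slice_1d(z, isi_term, SUBSET_B)
```

With a correct survivor the ISI is removed exactly, so the subset containing the transmitted symbol yields a zero metric while the other subset does not.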

Third, the branch metrics are added to the corresponding PMs to generate $4M_c$ PMs, $1 \le M_c \le M$:

$$\Gamma'_{k+1}(\rho) = \Gamma_k(\rho_k) + \lambda_k(\rho_k, e_{k+1}), \quad \rho = 4\rho_k + e_{k+1} \qquad (4)$$

A sorting operation determines the extensions of the $M$ survival paths that have the smallest PMs:

$$\Gamma_{k+1}(i) \le \Gamma_{k+1}(j); \quad 0 \le i < j \le 4M_c - 1 \qquad (5)$$

where $\Gamma_{k+1}(\rho) = \Gamma'_{k+1}(4 d_{k+1}(\rho) + e_{k+1}(\rho))$ and $d_{k+1}(\rho) = \rho_k$ is the path exchange information. $\{\Gamma_{k+1}(\rho) \mid 0 \le \rho < M\}$ is stored for the next symbol interval.

Finally, the stored survival sequences are updated according to the path selection information $d_{k+1}(\rho)$ and the branch ID $e_{k+1}(\rho)$:

$$P_{k+1}(\rho_{k+1}) = \left( P_k(d_{k+1}(\rho_{k+1}));\ \hat{u}_k(d_{k+1}(\rho_{k+1}), e_{k+1}(\rho_{k+1})) \right) \qquad (6)$$

A 4-D-symbol is decoded each time the process described above completes one cycle. Thus, all these operations must be finished within 8 ns, the symbol period at 125 MBaud, and this raises a significant challenge for circuit design.
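The add, sort, and update steps of one MA iteration can be sketched in software as follows. This is a simplified illustration and not the decoder itself: the trellis, the subsets, and the trace-back to depth D are left out, and `ma_step`, `branch_metric`, and `toy_branch_metric` are hypothetical names introduced only for this example.

```python
def ma_step(paths, pms, branch_metric, m=4, branches=4):
    """Run one M-algorithm iteration.

    paths: list of survivor symbol lists, pms: their path metrics.
    branch_metric(path, e) returns (symbol, bm) for branch e of that path.
    Returns the m best extended paths and their metrics, best first.
    """
    candidates = []
    for p, (path, pm) in enumerate(zip(paths, pms)):
        for e in range(branches):
            symbol, bm = branch_metric(path, e)
            # The index branches * p + e plays the role of rho = 4 * rho_k + e_{k+1}.
            candidates.append((pm + bm, branches * p + e, symbol))

    # Sort by metric; ties are broken by the extension index.
    candidates.sort(key=lambda c: (c[0], c[1]))

    new_paths, new_pms = [], []
    for metric, index, symbol in candidates[:m]:
        d, _e = divmod(index, branches)  # d selects the parent path (path exchange)
        new_paths.append(paths[d] + [symbol])
        new_pms.append(metric)
    return new_paths, new_pms


def toy_branch_metric(path, e):
    """Hypothetical metric: branch 2 always fits best, others cost their distance."""
    return e, abs(e - 2)


paths1, pms1 = ma_step([[]], [0], toy_branch_metric)
paths2, pms2 = ma_step(paths1, pms1, toy_branch_metric)
```

Starting from a single empty path, the first call creates 4 extensions and keeps all of them; the second call creates 16 extensions and keeps the 4 with the smallest metrics.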

3 System design

Despite the similarity between the MA decoder and the PDFD, it is impossible to reuse the architecture of the PDFD in the MA decoder design because of the complex operation in the add-test-sort unit (ATSoU). The 4-D branch metric units (BMUs), together with the 1-D-BMUs and decision feedback units (DFUs), have to be moved out of the critical path using look-ahead (LA) technology, which is much more complex than in a PDFD. The modified decoder tree is shown in Fig. 1. The 4-D branch metrics (BMs) are computed one symbol interval ahead of the sorting operation. Thus, 16M BMs, instead of 4M, are computed. Only part of the BM precomputation is shown in the figure for clarity. The LA logic removes unnecessary precomputed results before they enter the ATSoU, so the sorting operation is applied only to 4M paths. However, an ATSoU with M > 4 is still too complex to be realized in a 0.18 µm CMOS process.


Fig. 1 The decoder tree of look-ahead MA4 decoder

Generally, there are three approaches to applying look-ahead technology in the decoder of a GbE application: symbol-look-ahead (SLA), path-look-ahead (PLA), and subset-look-ahead (SuLA). Precomputing is based on the 5 possible values of the transmitted 1-D-symbol in an SLA decoder, whereas it is based on the 4 possible path extensions of each path in a PLA design. SuLA is based on the fact that the output symbols of the 4-D-BMUs are generated from the two types of slices output by the 1-D-BMUs.

PLA is applicable to the survival paths, whereas SLA and SuLA are not. Thus, PLA works well in all parts of MA/PDFD except in the ATSoU/add-compare-select unit (ACSU). Both SLA and SuLA are applicable only to a few taps of survival symbols. Generally, each pipeline stage of the decoder must employ PLA logic, though there are exceptions in some special cases when L is very small [3].

We build the system architecture with an optimized LA architecture (Fig. 2). Retiming is strictly applied to ensure that there is no operation after the path selection signals $d_{k+1}$, $e_{k+1}$. The details will be discussed later in this section.

3.2 LA-DFU

The architecture of the LA-DFU is shown in Fig. 4, where PLA and SuLA are applied to accelerate computing. The DFU estimates the tail of the intersymbol interference (ISI) from tap 2 to tap 14. PLA is applied to the entire DFU and SuLA is applied to tap 2 (shown in Fig. 4). SLA is also applicable; however, it is much more complex than the subset-look-ahead applied here. Carry save adders (CSAs) are employed to speed up the add operations in the DFU. The carry-in ports of the CSA tree, which are often omitted, are enabled to simplify the multipliers employed in the DFUs (Fig. 5). Generally, an M-input/2-output CSA tree provides M-2 one-bit carry-in ports without an additional stage or a global carry chain, whether the CSA is formed from 3-2 compressors, 4-2 counters, or both. The remaining 2 carry signals can be added to the final partial sums or simply discarded, which causes no actual loss in performance.

Fig. 2 Architecture of MA4 decoder for 1000BASE-T GbE transceiver (legend: compressed 1-D symbols; 1-D/4-D subset; 1-D/4-D BM; couple of 1-D symbols; path selections)
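The carry-in ports of a CSA tree mentioned in the LA-DFU description can be modelled in a few lines. The sketch below is an illustration under stated assumptions, not the circuit of Fig. 5: it uses 3-2 compressors only, works on non-negative integers, and the names `compress_3_2` and `csa_tree` are hypothetical. One plausible use of such ports, which is an assumption of this sketch, is to absorb the "+1" of a two's-complement negation when a coefficient is multiplied by a negative symbol.

```python
def compress_3_2(a, b, c, carry_in=0):
    """Model one row of 3-2 compressors acting on three bit vectors.

    The carry vector is shifted left by one position, so its least
    significant bit is free and can take a 1-bit carry-in at no cost.
    """
    total = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) | (carry_in & 1)
    return total, carry


def csa_tree(operands, carry_ins):
    """Reduce M operands to 2 vectors using M - 2 compressor rows.

    Each row consumes exactly one entry of carry_ins.
    """
    vectors = list(operands)
    carry_ins = list(carry_ins)
    assert len(carry_ins) == len(vectors) - 2
    while len(vectors) > 2:
        a, b, c = vectors[:3]
        total, carry = compress_3_2(a, b, c, carry_ins.pop())
        vectors = vectors[3:] + [total, carry]
    return vectors


# Five operands give three compressor rows, hence three free carry-in ports.
outputs = csa_tree([13, 7, 22, 5, 9], [1, 0, 1])
```

The two output vectors still have to be added by a conventional adder; their sum equals the sum of the operands plus the injected carry-in bits.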

3.1 Survival memory unit (SMU)

The SMU of MA4 stores only four paths, whereas the SMU of PDFD stores eight. Thus, the SMU of MA4 is much simpler. The SMU of MA arranges the survival paths according to their path metrics, whereas that of PDFD sorts the survival paths by their decoder states. So, unlike in PDFD, the exchange architecture of MA does not follow the decoder trellis. A hybrid SMU is also possible when a prefilter is employed, but a full register exchange architecture (REA) SMU, which is shown in Fig. 3, is the only choice in this contribution because low-latency survival sequences are required. Retiming is employed to decrease the fan-out of the highly timing-restricted $d_k$ lines, which speeds up the design and reduces power consumption.


Fig. 3 Architecture of SMU with retiming operation


Simulation indicates that the MA4 decoder delivers almost the same performance as the trace-back depth D varies from 10 to 14. Nevertheless, D = 14 is used in this contribution to obtain the highest performance.

Fig. 4 One PLA DFU, which precomputes the ISI estimation of taps 2-14

Fig. 5 Carry-save architecture of the multiplier, where the global carry is cancelled. The carry will be added to the output in the CSA tree

3.3 LA-1-D BMU

The LA-1-D-BMU has to estimate tap 1 of the ISI tail before computing the 1-D-BMs and 1-D-symbols. LA technology is employed here to improve the operation speed and, by fully exploiting the duplicated parts of the precomputed paths, to decrease the hardware complexity. Three applicable look-ahead structures are listed in Table 1, and a SuLA structure is chosen in our design (shown in Fig. 6).


Table 1 Comparison of applicable LA in 1-D-BMUs

LA technology             No            Symbol        Path              Subset
1-D-BMUs required/pair    4×4×4 = 64    4×4×5 = 80    4×4×4×4 = 256     4×4×2 = 32

Fig. 6 One path of LA 1-D-BMU, where precomputing of tap 1 is subset-LA

A couple of A/B 1-D-BMUs is treated as one unit, and the two 3 bit output symbols of the 1-D-subsets A and B are compressed to 2 bit in total. That saves 2/3 of the pipeline registers of 1-D-symbols, while only a little additional encode/decode logic is required.
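One way to see why 2 bit can be enough for a pair of A/B decisions is to enumerate the pairs that a slicer can actually produce. The sketch below assumes the usual 1000BASE-T 1-D subsets A = {-1, +1} and B = {-2, 0, +2}; the encoding table it builds is hypothetical, since the article does not give the actual code.

```python
# Assumed 1-D subsets of PAM5 (standard 1000BASE-T set partitioning).
SUBSET_A = (-1, 1)
SUBSET_B = (-2, 0, 2)


def slice_pair(x):
    """Return the nearest subset-A symbol and the nearest subset-B symbol to x."""
    a = min(SUBSET_A, key=lambda s: abs(x - s))
    b = min(SUBSET_B, key=lambda s: abs(x - s))
    return a, b


# Sweep the slicer input over a range wider than the PAM5 alphabet.
pairs = sorted({slice_pair(i / 100) for i in range(-400, 401)})

# Hypothetical 2-bit code: simply number the pairs that can occur.
ENCODE = {pair: code for code, pair in enumerate(pairs)}
DECODE = {code: pair for pair, code in ENCODE.items()}
```

Only four (A, B) pairs occur, because the two decisions for the same input are always adjacent PAM5 levels, so a 2 bit code covers every case and both 3 bit symbols can be regenerated from it by a small decoder.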

Further simulations were run to determine the quantized resolution of the 1-D-BMs. The result is shown in Fig. 7; 4 bit is the optimal quantized resolution of the 1-D-BMs.

Fig. 7 Performance of MA4 vs the quantized resolution of 1-D-BMs (horizontal axis: 96-byte block error rate)

3.4 LA-4-D-BMU

The 4-D-BMUs of PDFD work on fixed states but those of the MA decoder do not, so additional subset selection logic is added to choose the correct subsets of output symbols. This is achieved by simply switching the A/B type inputs of channel 4 according to the least significant bit (LSB) of the current path state. The even subset groups, which are shown in Fig. 8, are selected when the swap unit bypasses its input. On the other hand, the swap unit swaps the A/B type inputs of channel 4 if an odd group is selected, and all of the 4 outputs turn to their corresponding odd groups.

Fig. 8 Structure of 4-D-BMU

We tried moving a part of the sorting operations out of the ATSoU to balance the operations across the pipeline stages, but that proved to be harmful. The 4-D-BMU works in PLA mode and precomputes the 64 possible path extensions, one symbol interval ahead of the ATSoU. The exchange logic next to the 4-D-BMUs selects the 16 proper path extensions out of the 64 by $d_k$ and $e_k$.

To save registers and MUX logic, a symbol compression operation is employed. Unlike those of the PDFD, the 4-D-BMUs of MA4 do not directly output 4-D-symbols. Instead, they output the 4-D-subset selection of the A/B type 4-D-symbols to save pipeline registers. Instead of 48 registers, only 13 are required to store the 4 4-D-symbols output by a 4-D-BMU. That saves 560 registers in the pipeline, 73% of the registers that would otherwise be employed to store 4-D-symbols.
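The register figures quoted for the symbol compression are mutually consistent if the saving per 4-D-BMU is multiplied by the 16 4-D-BMUs listed for MA4 in Table 2. Taking that count as the multiplier is an assumption made here for the check:

```latex
(48 - 13) \times 16 = 35 \times 16 = 560,
\qquad
\frac{48 - 13}{48} = \frac{35}{48} \approx 72.9\% \approx 73\%
```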

3.5 ATSoU

The ATSoU is the kernel module of MA decoders. 16 path extensions have to be tested for duplication and sorted to obtain the 4 paths with the least metrics. The core operation of the ATSoU is the sorting process, which can be implemented with a sorting network [10]. However, highly parallel compare operations must be used in the sorting network because of the critical timing restriction, and only the four smallest results are required. Therefore, the 16-input 4-output sorting network is formed from two 4-output radix-8 sorting units (SoUs) and an 8:4 merging unit (MU), which strikes a balance between critical path and chip area. A 4-output radix-8 sorting operation finds the smallest four PMs out of 8 inputs in the manner of a round-robin tournament, comparing every pair of inputs, so 28 comparators are required for each unit. The 8:4 merging operation merges two sorted input groups, and only 10 comparators are required.
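The comparator counts can be reproduced with a small software model. The article does not give the wiring of the sorting units, so the sketch below is only one arrangement that matches the stated numbers: each sorting unit ranks its inputs from all pairwise comparisons, and the merging unit compares only those pairs that can influence the four smallest outputs. The names `sort_unit`, `merge_unit`, and `sort16_keep4` are hypothetical.

```python
from itertools import combinations


def sort_unit(values, keep=4):
    """Rank inputs by round-robin comparison and keep the smallest `keep`.

    Every pair is compared once; the rank of an input is the number of
    comparisons it loses. Returns (sorted smallest values, comparator count).
    """
    rank = [0] * len(values)
    comparators = 0
    for i, j in combinations(range(len(values)), 2):
        comparators += 1
        if values[i] <= values[j]:  # on a tie the lower index wins
            rank[j] += 1
        else:
            rank[i] += 1
    out = [None] * keep
    for value, r in zip(values, rank):
        if r < keep:
            out[r] = value
    return out, comparators


def merge_unit(a, b):
    """Merge two ascending groups of 4 and keep the 4 smallest.

    Only pairs (a[i], b[j]) with i + j <= 3 need a comparator, because a
    value that loses more often than that cannot reach the output.
    """
    keep = len(a)
    b_wins = {}
    for i in range(keep):
        for j in range(keep - i):
            b_wins[i, j] = b[j] < a[i]
    out = [None] * keep
    for i in range(keep):
        r = i + sum(b_wins[i, j] for j in range(keep - i))
        if r < keep:
            out[r] = a[i]
    for j in range(keep):
        r = j + sum(not b_wins[i, j] for i in range(keep - j))
        if r < keep:
            out[r] = b[j]
    return out, len(b_wins)


def sort16_keep4(pms):
    """Select the 4 smallest of 16 path metrics with two sort units and a merge."""
    low, c1 = sort_unit(pms[:8])
    high, c2 = sort_unit(pms[8:])
    best, c3 = merge_unit(low, high)
    return best, c1 + c2 + c3


best, comparator_count = sort16_keep4([9, 3, 40, 7, 12, 3, 55, 21,
                                       8, 30, 2, 17, 60, 5, 44, 11])
```

In this arrangement all comparisons inside a unit are independent of each other, which is what allows them to run in parallel in hardware; the total of 2 × 28 + 10 comparators agrees with the 66 PM comparators listed for MA4 in Table 2.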

Fig. 9 Architecture of ATSoU

The ATSoU of MA4 stores four path metrics, whereas the ACSU of PDFD stores eight. However, the path metric overflow problem exists in the ATSoU too. Because the difference between PMs is limited, the renormalization approach proposed in Ref. [11] is applicable in MA decoders as well; nevertheless, it is not the best choice here. The classic renormalization method, which subtracts the smallest path metric from the other PMs, behaves better. This approach works well in the MA4 decoder because the PMs are already sorted, so no additional logic is required to find the smallest PM. Furthermore, if the upper limit of the 1-D branch metric is normalized to 1, the difference between the 4 survival PMs of MA4 can never be larger than 4. It must be noted that this characteristic does not exist when M > 4. In this way, the width of the stored PMs is the same as that of the 4-D-BMs, and the renormalized PM of path 0 is always 0 if the renormalization method we recommend is employed. Considering that the PMs of the 4 path extensions of path 0 can never be larger than 4, every path extension with a PM larger than 4 can be discarded directly. Thus, 2 bit of each PM comparator are saved compared with the mod-based renormalization approach.
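The renormalization and the early discarding of large path extensions can be sketched as follows. The sketch assumes integer metrics with a 4 bit 1-D branch metric (the resolution chosen in Sect. 3.3), so that a 4-D branch metric is at most 60; the constant and function names are hypothetical.

```python
BM1_MAX = 15           # assumed: largest value of a 4 bit 1-D branch metric
BM4_MAX = 4 * BM1_MAX  # upper limit of a 4-D branch metric


def renormalize(pms):
    """Subtract the smallest PM; pms is ascending, so pms[0] is the smallest."""
    return [pm - pms[0] for pm in pms]


def extend_and_prune(pms, bms):
    """Extend 4 renormalized paths and keep the 4 best extensions.

    pms: ascending renormalized PMs, with pms[0] == 0.
    bms[p][e]: 4-D branch metric of branch e of path p.
    Extensions above BM4_MAX are dropped before sorting: the 4 extensions
    of path 0 never exceed BM4_MAX, so the 4 best never do either.
    """
    kept = []
    for p, pm in enumerate(pms):
        for e, bm in enumerate(bms[p]):
            if pm + bm <= BM4_MAX:
                kept.append((pm + bm, 4 * p + e))
    return sorted(kept)[:4]


pms = renormalize([7, 20, 41, 66])
bms = [[60, 3, 30, 12],
       [5, 50, 9, 40],
       [1, 30, 27, 60],
       [0, 2, 10, 60]]
survivors = extend_and_prune(pms, bms)
```

Every metric that survives the pruning is at most 60 and therefore fits in 6 bit, the comparator width listed for MA4 in Table 2.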

The four branches from one path are never equal to each other, so duplicated paths can only develop gradually from different paths stored in the SMU. We employ 6 comparators to detect identical symbols that will be stored in the SMU. Additional counters and logic trace the path extensions and record the length over which paths coincide. A path is marked and discarded if its counter is larger than D, which means that the stored part of this path is identical to that of another path with a smaller PM. On the other hand, the ambiguity check requires only 3 comparators, so it is a good choice if low complexity is more important.
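The duplicated path test can be modelled with one run-length counter per pair of paths. The following sketch is an interpretation of the description above and not the actual logic: it assumes that each counter follows its pair of paths through the path exchange, and the class name `DuplicatePathTest` and the `parents` argument are hypothetical.

```python
from itertools import combinations


class DuplicatePathTest:
    """Track for how many consecutive symbols each pair of paths has coincided."""

    def __init__(self, depth, paths=4):
        self.depth = depth
        self.pairs = list(combinations(range(paths), 2))  # 6 pairs for 4 paths
        self.run = {pair: 0 for pair in self.pairs}

    def update(self, parents, symbols):
        """Process one symbol interval.

        parents[p]: index of the previous path that new path p extends.
        symbols[p]: newest symbol of new path p; paths are in ascending PM order.
        Returns the set of paths to discard.
        """
        new_run = {}
        discard = set()
        for i, j in self.pairs:
            if symbols[i] != symbols[j]:
                new_run[i, j] = 0
            else:
                a, b = sorted((parents[i], parents[j]))
                # a == b is not expected here: branches of one path differ.
                new_run[i, j] = self.run.get((a, b), 0) + 1
            if new_run[i, j] > self.depth:
                discard.add(j)  # j > i, so path j has the larger PM
        self.run = new_run
        return discard


tester = DuplicatePathTest(depth=3)
history = [tester.update([0, 1, 2, 3], [5, 1, 5, 2]) for _ in range(4)]
```

In the example, paths 0 and 2 receive the same symbol in four consecutive intervals, so the run length exceeds the depth of 3 at the fourth interval and path 2, which has the larger metric, is marked for discarding.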

4 Experimental results

We coded the 14-tap MA4 decoder in RTL Verilog, employing the duplicate path test, with trace-back depth D = 14 and a 1-D-BM width of 4 bit. A 14-tap path-look-ahead PDFD with optimized CSA DFU and retiming SMU was coded as a reference design, also with trace-back depth D = 14. The gate count of the reference design is 69% of the result reported by Haratsch in Ref. [1], owing to the optimized look-ahead architecture employed in its DFUs, SMU, and 1-D-BMUs. The proposed MA4 decoder has the same decoder latency as the reference PDFD.

Both decoders were simulated with 9.6 × 10^7 symbols for each input SNR under the worst-case channel environment. The results are shown in Fig. 10. The 14-tap MA4 decoder achieves about 0.4 dB additional coding gain over the 14-tap PDFD in the worst channel environment of 1000BASE-T GbE, which is in accordance with the conclusion of Haratsch [6].

Fig. 10 Block error rate vs input SNR of PDFD and MA4 decoder (horizontal axis: 96-byte block error rate)

The main structure data of the two decoders are listed in Table 2.

Table 2 Structural data of MA4 and PDFD

Algorithm                  MA4          PDFD
Count of path              4            8
1-D-BMU/pair               32           128
Count of DFU               4            8
LA technology              SuLA+PLA     PLA
Count of 4-D-BMU           16           8
Count of PM comparator     66 (6 bit)   48 (8 bit)

Both designs were compiled with a UMC 0.18 µm standard cell CMOS process. The results are shown in Table 3.

Table 3 Synthesis results of 14-tap MA4 and 14-tap PLA PDFD decoder

                     MA4        PDFD
Count of gates       120 380    209 464
Critical path/ns     7.78       7.78


The MA4 decoder is 39% smaller than the PDFD reference design, while the two have similar critical path latency.

5 Conclusions

We have successfully designed a high-performance joint equalizer and trellis decoder with LA MA4 in a 0.18 µm standard cell CMOS process. Simulation and synthesis results show that a coding gain of approximately 0.4 dB is achieved while the chip area is reduced by 39%, which is the best performance reported so far for 1000BASE-T transceiver decoders.

Acknowledgements This work is supported by the National Science Foundation for Creative Research Groups (60521002) and Shanghai Natural Science Foundation (037062022).

References

1. Haratsch E F, Azadet K. A pipelined 14-tap parallel decision-feedback decoder for 1000BASE-T gigabit Ethernet. Proceedings of 2001 International Symposium on VLSI Technology, Systems, and Applications, Apr 18-20, 2001, Hsinchu, China. Piscataway, NJ, USA: IEEE, 2001: 117-120

2. Haratsch E, Azadet K. A 1-Gb/s joint equalizer and trellis decoder for 1000BASE-T gigabit Ethernet. IEEE Journal of Solid-State Circuits, 2001, 36(3): 374-384

3. Azadet K, Haratsch E. DSP implementation issues in 1000BASE-T gigabit Ethernet. Proceedings of 2001 International Symposium on VLSI Technology, Systems, and Applications, Apr 18-20, 2001, Hsinchu, China. Piscataway, NJ, USA: IEEE, 2001: 109-112

4. Lin Hsiu-ping, Chen N, Lai Jyh-ting, et al. 1000BASE-T gigabit Ethernet baseband DSP IC design. Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS'04): Vol 4, May 23-26, 2004, Vancouver, Canada. Piscataway, NJ, USA: IEEE, 2004: 401-404

5. He Run-sheng, Nazari N, Sutardja S. A DSP based receiver for 1000BASE-T PHY. Proceedings of IEEE International Solid-State Circuits Conference, Feb 5-7, 2001, San Francisco, CA, USA. Piscataway, NJ, USA: IEEE, 2001: 308-309, 458

6. Haratsch E. High-speed VLSI implementation of reduced complexity sequence estimation algorithms with application to gigabit Ethernet 1000BASE-T. Proceedings of International Symposium on VLSI Technology, Systems and Applications, Jun 8-10, 1999, Taipei, China. Piscataway, NJ, USA: IEEE, 1999: 171-174

7. Anderson J, Mohan S. Sequential coding algorithms: a survey and cost analysis. IEEE Transactions on Communications, 1984, 32(2): 169-176

8. Hatamian M, Agazzi O, Creigh J, et al. Design considerations for gigabit Ethernet 1000BASE-T twisted pair transceivers. Proceedings of IEEE Custom Integrated Circuits Conference, May 11-14, 1998, Santa Clara, CA, USA. Piscataway, NJ, USA: IEEE, 1998: 335-342

9. GU Xin-yu, HE Zhi-qiang, TIAN Bao-yu, et al. The algorithm and its application of a multi-channel adaptive decision-feedback equalizer. The Journal of China Universities of Posts and Telecommunications, 2003, 10(4): 61-64

10. Olariu S, Pinotti M, Zheng Si-qing. How to sort N items using a sorting network of fixed I/O size. IEEE Transactions on Parallel and Distributed Systems, 1999, 10(5): 487-499

11. Hekstra A. An alternative to metric rescaling in Viterbi decoders. IEEE Transactions on Communications, 1989, 37(11): 1220-1222

Biographies: ZHU Yue, Ph. D. Candidate in Electronic Engineering Department of Shanghai Jiao Tong University, interested in VLSI design.

RONG Meng-tian, received the master's degree from the Electronic Engineering Department of Fudan University in 1983; professor and Ph. D. advisor in the Electronic Engineering Department of Shanghai Jiao Tong University; research fields: communication and VLSI design.
