
Linköping Studies in Science and Technology
Dissertation No. 705

Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping 2001

DSP ALGORITHMS AND ARCHITECTURES FOR TELECOMMUNICATION

Mikael Karlsson Rudberg

Copyright © 2001 Mikael Karlsson Rudberg

ISBN 91-7373-069-6
ISSN 0345-7524
Printed in Sweden by UniTryck, Linköping, 2001

Abstract

Techniques for providing users with high quality, high capacity digital transmission links have been in the research focus in recent years. Both academia and industry try to develop methods that can provide consumers with high capacity transmission links at a low price. Utilizing the twisted-pair copper wires that exist in almost every home for wideband data transmission is one of the most promising technologies for providing wideband communication capacity to the consumer.

In this thesis we present algorithms and architectures suitable for the signal processing needed in the Asymmetrical Digital Subscriber Line (ADSL) and the Very High Speed Digital Subscriber Line (VDSL) standards. The FFT is one of the key blocks in both the ADSL and the VDSL standard, and we present an implementation of an FFT processor for these applications. The implementation was made using a new design methodology suitable for programmable signal processors that are optimized towards one or a few algorithms. The design methodology is presented, together with an improved version that uses a tool for converting a combined instruction and algorithm description into a dedicated, programmable DSP processor.

In many applications, for instance video streaming, the required channel capacity far exceeds what is possible today. In those applications the data must be compressed using various techniques that reduce the required channel capacity to a feasible level. In this thesis architectures for image and video decompression are presented.

Keeping the cost of ADSL and VDSL equipment low requires the use of low cost technologies. One way, proposed in this thesis, is to accept errors in the A/D and D/A converters and correct these errors using digital signal processing and the properties of a known application. Methods for cancellation of errors found in time interleaved A/D converters are proposed.

Acknowledgment

I would like to thank my supervisor Prof. Lars Wanhammar for his support and guidance. I also thank Gunnar Björklund at Microelectronics Research Center, Ericsson Microelectronics AB, for the support that made it possible to finish my Ph.D. as part of my work at Ericsson Microelectronics AB.

I also want to thank all the people that I have worked with in the VIBRA research project, the project that has financed much of the research in this thesis. Thanks also to Mikael Hjelm for valuable discussions on DMT algorithms and the synthesis tool.

Finally, I thank everyone at Electronics Systems, Linköping University, and at Microelectronics Research Center for valuable discussions, help and inspiration.


Table of Contents

1 Introduction
  1.1 Digital communication
    1.1.1 Digital communication systems
    1.1.2 Modulation
  1.2 The JPEG and MPEG standards
  1.3 The DMT transmission technique
    1.3.1 DMT modulation
    1.3.2 Frequency allocation
    1.3.3 The DMT symbol
    1.3.4 The splitter
  1.4 Scope of the thesis

2 Digital Signal Processing Architectures
  2.1 DSP algorithms
    2.1.1 Sample period bound
    2.1.2 Mapping of algorithms to hardware
    2.1.3 Power consumption
  2.2 DSP architectures
    2.2.1 Fixed-function architectures
    2.2.2 Programmable architectures
  2.3 DSP architectures with programmability and high efficiency
  2.4 Design methodology for ASDSP
    2.4.1 Modelling of a JPEG DSP
    2.4.2 Design and implementation of an FFT processor
  2.5 ASDSP design methodology
    2.5.1 Architecture synthesis from mC

3 Variable Length Decoding
  3.1 Variable length codes
  3.2 The VLC decoding process
    3.2.1 Tree based decoding
    3.2.2 Symbol parallel decoding
  3.3 VLC decoder with simplified length decoder
  3.4 VLC decoder with pipelined length decoder
  3.5 VLC decoder with symbol decoder partitioning
  3.6 Length decoder implementation
  3.7 Remarks

4 Data Converters in Communication Systems
  4.1 Analog-to-digital conversion
  4.2 ADC errors
  4.3 Time-interleaved ADC
    4.3.1 Offset in TIADCs
    4.3.2 Gain and sample timing mismatch
    4.3.3 Gain and timing mismatch effects on SNDR
    4.3.4 Gain and timing mismatch cancellation
  4.4 Digital-to-analog conversion
    4.4.1 Error sources
    4.4.2 Scrambling

5 Author's Contribution to Published Work

6 Conclusions

Paper 1
New Approaches to High Speed Huffman Decoding
  1 INTRODUCTION
  2 PREVIOUS WORK
  3 TWO NEW FAST HUFFMAN DECODER STRUCTURES
    3.1. The basic Huffman decoder
    3.2. Huffman length decoder with relaxed evaluation time
    3.3. Pipelined Huffman length decoder
    3.4. Symbol decoder
  4 CONCLUSIONS
  5 REFERENCES

Paper 2
Implementation of a Fast MPEG-2 Compliant Huffman Decoder
  1 INTRODUCTION
  2 HUFFMAN DECODER
    2.1. Handling of special markers
  3 IMPLEMENTATION
    3.1. Improvements of the length decoder
    3.2. Symbol decoder
    3.3. Synthesis
    3.4. Symbol tables
  4 CONCLUSIONS
  5 REFERENCES

Paper 3
High Speed Pipelined Parallel Huffman Decoding
  1 INTRODUCTION
  2 HUFFMAN DECODER MODELS
  3 PIPELINED PARALLEL HUFFMAN DECODING
    3.1. Reducing symbol decoder requirements
    3.2. Symbol decoder partitioning
  4 DISCUSSION
  5 CONCLUSIONS
  6 REFERENCES

Paper 4
Design of a JPEG DSP using the Modular Digital Signal Processor Methodology
  1 INTRODUCTION
  2 METHODOLOGY
    2.1. Modelling with the MDSP methodology
  3 HARDWARE PARTITIONING
    3.1. Interface design
  4 HARDWARE/SOFTWARE TRADE-OFFS
    4.1. Huffman processor
    4.2. IDCT processor
  5 CONCLUSIONS AND FURTHER WORK
  6 REFERENCES

Paper 5
Design and Implementation of an FFT Processor for VDSL
  1 INTRODUCTION
  2 ALGORITHM
  3 DESIGN FLOW
  4 DESIGN SPACE EXPLORATION
  5 ARCHITECTURE
    5.1. IO
    5.2. Memory
    5.3. Datapath
  6 IMPLEMENTATION
    6.1. Key data
  7 CONCLUSIONS
  8 REFERENCES

Paper 6
Application Driven DSP Hardware Synthesis
  1 INTRODUCTION
  2 RELATED WORK
  3 SYNTHESIS FRAMEWORK
  4 THE DSP SYNTHESIS TOOL
    4.1. Target architecture
    4.2. Synthesis library
    4.3. Synthesis
      4.3.1. User control
  5 EXAMPLE
  6 FUTURE WORK
  7 CONCLUSIONS
  8 REFERENCES

Paper 7
ADC Offset Identification and Correction in DMT Modems
  1 INTRODUCTION
    1.1. Mismatch between ADC channels
  2 IDENTIFICATION OF OFFSET
    2.1. Communication system
  3 CORRECTION OF OFFSET IN DMT MODEMS
    3.1. DMT based communication system
    3.2. Correction of offset before connection
    3.3. Correction of offset during initialization
      3.3.1. Activation
      3.3.2. Modem training
    3.4. Correction of offset during transmission
  4 SIMULATION RESULTS
  5 HARDWARE ARCHITECTURE
  6 ACKNOWLEDGEMENTS
  7 CONCLUSIONS
  8 REFERENCES

Paper 8
Calibration of Mismatch Errors in Time Interleaved ADCs
  1 INTRODUCTION
    1.1. Error sources in a TIADC
    1.2. Gain Mismatch
    1.3. Timing Mismatch
    1.4. Methods to cancel gain and timing mismatch
  2 THE DMT MODEM
  3 IDENTIFICATION OF ERRORS
    3.1. Error identification
    3.2. Signal reconstruction
    3.3. Implementation aspects
  4 SIMULATIONS
  5 CONCLUSIONS
  6 REFERENCES

Paper 9
Glitch Minimization and Dynamic Element Matching in D/A Converters
  1 INTRODUCTION
    1.1. Reducing glitches
    1.2. Reducing influence from matching errors
    1.3. Scrambler
    1.4. Scrambler with unordered thermometer code
  2 SIMULATIONS
  3 CONCLUSIONS
  4 REFERENCES

Paper 10
Dynamic Element Matching in D/A Converters with Restricted Scrambling
  1 INTRODUCTION
  2 A DEM APPROACH
    2.1. Code selection case A: Bit increase
    2.2. Code selection case B: Bit decrease
    2.3. DEM approach
  3 REALIZATION OF A DEM ENCODER
    3.1. Description of the operations
    3.2. Operations in case B
  4 A 4-BIT CONVERTER EXAMPLE
  5 SIMULATION RESULTS
  6 CONCLUSION
  7 REFERENCES

Abbreviations and Acronyms

ADC         Analog-to-digital converter
ADSL        Asymmetrical digital subscriber line
ASDSP       Application specific digital signal processor/processing
CO          Central office
CPE         Customer premises equipment
DAC         Digital-to-analog converter
DCT         Discrete cosine transform
DEM         Dynamic element matching
DMT         Discrete multi tone
DNL         Differential nonlinearity
DSL         Digital subscriber line
DSP         Digital signal processing
EC          Echo cancelling
EXU         Execution unit
GSM         Global system for mobile communications
IDCT        Inverse discrete cosine transform
IFFT        Inverse fast Fourier transform
INL         Integral nonlinearity
FDM         Frequency division multiplex
FEQ         Frequency domain equalizer
FFT         Fast Fourier transform
FIFO        First in first out
FIR filter  Finite impulse response filter
HDSL        High bit-rate digital subscriber line
JPEG        Joint Photographic Experts Group
MDSP        Modular digital signal processor
MPEG        Moving Picture Experts Group
OFDM        Orthogonal frequency division multiplexing
PAR         Peak to average ratio
POTS        Plain old telephone system
SFDR        Spurious free dynamic range
SFG         Signal flow graph
SHDSL       Symmetric high bit-rate digital subscriber line
SNR         Signal to noise ratio
SNDR        Signal to noise and distortion ratio
TDM         Time division multiplexing
TEQ         Time domain equalizer
TIADC       Time interleaved analog-to-digital converter
VDSL        Very high speed digital subscriber line
VLC         Variable length code
QAM         Quadrature amplitude modulation


1 Introduction

This thesis consists of two parts. Part one provides a background to the applications of interest and the problems relevant for this thesis. Part two consists of a selection of publications. The research has been carried out in the period 1995 to 2001. The publications consider hardware implementation of signal processing in telecommunication systems, ranging from coding of images to transmission via wideband digital subscriber lines.

1.1 Digital communication

Digital communication is today used in a range of products from mobile phones to computer networks. The use of digital communication provides increased performance compared to previously used analog communication methods. One important factor for the success of both the Internet and mobile phones is the advances within process technology. New process generations make it cheaper to implement advanced digital signal processing, which is the enabler for digital communication. Digital signal processing makes it possible to implement communication methods with more complex modulation schemes, adaptive receivers and error correction. It is today possible to achieve transmission capacities close to the limit given by the channel capacity theorem stated by Shannon [1]. The theorem describes the theoretical capacity limit on a communication channel disturbed by additive white Gaussian noise with power spectral density $N_0/2$, a channel bandwidth of $B$, and an average power level of $P$. The capacity is then given by

$$C = B \log_2\left(1 + \frac{P}{N_0 B}\right) \ \text{bits/s} \tag{1.1}$$
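As a quick numerical illustration of Eq. 1.1, the sketch below evaluates the capacity for one set of channel parameters. The function name and the example values of B, N0 and P are assumptions chosen for illustration; they are not taken from the thesis.

```python
import math


def shannon_capacity(bandwidth_hz, signal_power_w, n0_w_per_hz):
    """Capacity in bit/s according to Eq. 1.1: C = B * log2(1 + P / (N0 * B))."""
    if bandwidth_hz <= 0 or n0_w_per_hz <= 0:
        raise ValueError("bandwidth and noise density must be positive")
    snr = signal_power_w / (n0_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


# Illustrative values only: 1 MHz of bandwidth and a signal-to-noise ratio
# P / (N0 * B) of 1023, which is roughly 30 dB.
B = 1.0e6
N0 = 1.0e-9
P = 1023 * N0 * B
capacity = shannon_capacity(B, P, N0)
print(f"{capacity / 1e6:.2f} Mbit/s")
```

With these numbers the argument of the logarithm is 1024, so the capacity comes out at about 10 bits per second per hertz of bandwidth.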


Since the channel capacity is limited there is a need for techniques that can reduce the required channel capacity for a given service. Three important areas where compression techniques are widely used for better utilization of the channel capacity are transmission of speech, image and video. In a mobile phone system voice data is compressed from 64 kbit/s down to 11.4 kbit/s (GSM, half-rate) while keeping an acceptable speech quality [2].

For image and video transmission the JPEG [3] and MPEG [4] standards are widely used. It is interesting to note that even if the available bandwidth keeps increasing, compression of image and video signals will be crucial for many years to come. Transmitting standard resolution video with an acceptable quality requires 1.5-2.5 Mbit/s with compression. Transmitting uncompressed video is not even an option today since this would require data rates above 50 Mbit/s.

1.1.1 Digital communication systems

A digital communication system can be outlined as shown in Fig. 1.1 [5]. The signal is created in a digital source, which for instance can be digital data generated in a computer, digitized speech or digital video. The source encoder provides a one-to-one mapping from the input signal to a new representation suitable for transmission. The objective is to eliminate or reduce redundancy, i.e. to give the signal a more efficient representation. The source decoder re-creates the original signal. The channel encoder and decoder are used for providing a reliable transmission link by introducing a controlled redundancy that is used for detection and correction of transmission errors. In the modulator the information is modulated, which gives a signal suitable for transmission using the desired frequency band. The task of the detector at the receiver end is to detect which signal was sent by the transmitter. Sometimes the detector and the channel decoder are combined into one block, which in this thesis is referred to as the decoder.

Figure 1.1 Digital communication system. [Block diagram. Transmit path: digital source, source encoder, channel encoder, modulator. The channel adds noise. Receive path: detector, channel decoder, source decoder, user.]

1.1.2 Modulation

Modulation is the way information is mapped onto a signal. The transmitted information is divided into symbols, where each symbol has a finite duration. The information content is encoded into the shape of the waveform during the symbol period. Common ways to encode the information are to put it into the amplitude and/or phase of the waveform. In this thesis we will mainly consider the quadrature amplitude modulation (QAM) technique and its relatives.

In QAM the information is mapped onto a carrier, which often is a sinusoid, using different phases and amplitudes. The transmitted signal ($s_i(t)$) is a sinusoid with four possible phases ($\varphi_i$). These phases are created by varying $a$ and $b$ in Eq. 1.2, where the sin and cos terms are the basis functions. $E$ is a constant related to the transmitted energy [6].

$$s_i(t) = E \cdot \left(a \cdot \cos(\omega t) + b \cdot \sin(\omega t)\right) \tag{1.2}$$

This encoding scheme can be illustrated using a constellation diagram where $a$ is on one axis and $b$ is on the other. In Fig. 1.2 an example is shown where 4-QAM is used. In this case only the phase of the signal is of interest. The four possible points allow two binary bits to be transmitted per symbol. The encoding of the two bits is normally made so that the error probability is minimized, which in this case means using a Gray encoding scheme with as small a difference as possible between the bits in adjacent constellation points.

Figure 1.2 QAM constellation diagram. [Four constellation points labelled (00), (01), (11) and (10) in the plane spanned by the axes a and b.]

The detector decides how to interpret the received symbol. When using 4-QAM the decision is based on which of the four possible constellation points is located closest to the received symbol. The decision boundaries are outlined as shaded areas in Fig. 1.2. The distance between the received constellation point and the ideal position is a measure of the noise level in the channel. If the noise level is too high the detector may not be able to correctly detect which symbol was originally sent from the transmitter, and there will therefore be a bit error. In order to reduce the probability of bit errors it is common to introduce coding, where redundancy is added to the signal in a controlled way so that some bit errors can be corrected.
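The 4-QAM mapping and the nearest-point decision described above can be sketched as follows. The particular assignment of bit pairs to constellation points is an assumption made for this illustration (it is Gray coded, but not necessarily the labelling used in Fig. 1.2), and the function names are hypothetical.

```python
import math

# Gray-coded 4-QAM: neighbouring points differ in exactly one bit.
# This bit-to-point assignment is an assumption for illustration.
CONSTELLATION = {
    (0, 0): (+1.0, +1.0),
    (0, 1): (-1.0, +1.0),
    (1, 1): (-1.0, -1.0),
    (1, 0): (+1.0, -1.0),
}


def modulate(bits, t, omega, amplitude=1.0):
    """Sample s_i(t) = E * (a*cos(wt) + b*sin(wt)) for one bit pair (Eq. 1.2)."""
    a, b = CONSTELLATION[tuple(bits)]
    return amplitude * (a * math.cos(omega * t) + b * math.sin(omega * t))


def detect(a_rx, b_rx):
    """Return the bit pair whose constellation point is closest to (a_rx, b_rx)."""
    return min(
        CONSTELLATION,
        key=lambda bits: (CONSTELLATION[bits][0] - a_rx) ** 2
        + (CONSTELLATION[bits][1] - b_rx) ** 2,
    )


# Received points disturbed by moderate noise are still decided correctly.
print(detect(0.7, 1.2))    # closest to (+1, +1)
print(detect(-0.2, -0.9))  # closest to (-1, -1)
```

Because the points are Gray coded, a decision error that lands in a neighbouring region corrupts only one of the two bits.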

In the general case we can allow more than four points in the constellation, which here is referred to as M-ary QAM. A typical case with 16 possible points in the constellation is shown in Fig. 1.3. When more than four points are used in a QAM constellation both the amplitude and the phase are used to carry information.

Figure 1.3 16 QAM. [Sixteen constellation points labelled with the 4-bit codes (0000) to (1111) in the plane spanned by the axes a and b.]

1.2 The JPEG and MPEG standards

Compression of images and video is important for transmitting high quality images and video over a transmission channel with limited channel capacity. The JPEG image compression standard from 1994 is one of the most common standards for transmission of images over the Internet [3]. The JPEG standard is a generic standard suitable for continuous tone digital images. The compression scheme implemented in the JPEG standard is a combination of algorithms that can exactly recreate the original information and algorithms where information is removed from the images and hence cannot be exactly recreated. The compression algorithms that remove information are referred to as lossy algorithms in this thesis.

In Fig. 1.4 the key algorithms in JPEG coding are outlined. The lossy compression is made using the Discrete Cosine Transform (DCT), which converts the image representation to the frequency domain. In the frequency domain the data is quantized so that fewer bits are needed for representing high frequency information. The quantization has been optimized with respect to how sensitive humans are to different frequencies in the images. The human eye is less sensitive to noise at higher frequencies than at lower ones.

Further data compression is achieved using Run-Length-Zero (RLZ) coding and variable length coding. Neither of these methods removes information; they just find a more efficient representation. In RLZ coding, long sequences of zeros are replaced with the number of zeros in a row and the next nonzero value. For example the sequence {0,0,0,0,3} is replaced with {4,3}. In variable length coding the frequency of the RLZ coded data determines the number of bits used for representing a value. For instance we may assign a shorter representation to the RLZ coded value {0,1} than to {4,1}, which is less frequent. More about variable length codes and decoding of variable length coded data is found in chapter 3.
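The RLZ step can be sketched in a few lines. This is a minimal illustration of the {0,0,0,0,3} to {4,3} example above; the function names are hypothetical, and the way trailing zeros are returned separately is an assumption of this sketch rather than something specified in the text.

```python
def rlz_encode(values):
    """Replace each run of zeros plus the following nonzero value by (run, value)."""
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    # Trailing zeros have no following nonzero value. Returning their count
    # separately is a choice made for this sketch only.
    return pairs, run


def rlz_decode(pairs, trailing_zeros=0):
    """Expand (run, value) pairs back into the original sequence."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    out.extend([0] * trailing_zeros)
    return out


pairs, tail = rlz_encode([0, 0, 0, 0, 3])
print(pairs, tail)  # [(4, 3)] 0
```

Since decoding reproduces the input exactly, the step is lossless. The variable length coder would then assign short codewords to the most frequent pairs.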

Figure 1.4 JPEG coding. [Block diagram: image data, DCT, Quantize, RLZ, VLC, JPEG coded images.]

A common format for compressed video is the MPEG-2 Video standard [4]. There are many similarities between the JPEG and MPEG standards. Both standards use DCT, RLZ and VLC for compression of images. The main difference between the standards is that the MPEG-2 Video standard not only compresses the digital images one by one but also considers similarities between adjacent images in the video stream. To accomplish this a motion estimation unit is needed in the video encoder. The motion estimation searches for similarities between images in the video sequence and is the most resource-demanding algorithm in the MPEG encoder. Instead of transmitting the image data, only the difference between images may be transmitted in those cases when this is more efficient. While MPEG encoding does not have to be performed in real time, the decoding does, since it takes place while the video stream is watched. Real-time MPEG decoding is therefore more important than real-time encoding. An outline of an MPEG-2 decoder is shown in Fig. 1.5.

The first step in the decoder is to extract the control information, which contains information about what type of coding has been used, the image size, and so on. The VLC and RLZ decoders reverse the operations of the VLC and RLZ encoders. The Inverse Quantizer (IQ) multiplies the coefficients with the quantization coefficients used in the quantizer, which restores the signal levels at each frequency. The Inverse Discrete Cosine Transform (IDCT) transforms the image back from the frequency domain to the spatial domain. If only the difference between two images has been transmitted, the image data is restored by adding the previously transmitted image to the received difference image. Finally the images may have to be re-ordered, since the encoder performs a picture re-ordering to better exploit similarities between adjacent images in the video stream.
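The inverse quantization and IDCT steps can be illustrated with a one-dimensional sketch. JPEG and MPEG-2 operate on two-dimensional 8x8 blocks with standardized quantization tables; the 1-D orthonormal transform, the sample values and the step sizes below are simplifying assumptions made only to keep the example short.

```python
import math


def dct(x):
    """Orthonormal 1-D DCT-II, used here only to produce test coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)))
    return out


def idct(coeffs):
    """Inverse of dct(): transforms coefficients back to sample values."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n)))
    return out


def quantize(coeffs, steps):
    """Encoder side: divide by the step size and round (this is the lossy step)."""
    return [round(c / q) for c, q in zip(coeffs, steps)]


def inverse_quantize(levels, steps):
    """Decoder side (IQ): multiply the levels by the same step sizes."""
    return [level * q for level, q in zip(levels, steps)]


samples = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical pixel row
steps = [2, 2, 4, 4, 8, 8, 16, 16]           # hypothetical, coarser at high frequencies
levels = quantize(dct(samples), steps)
restored = idct(inverse_quantize(levels, steps))
max_error = max(abs(r - s) for r, s in zip(restored, samples))
print(levels)
print(f"largest reconstruction error: {max_error:.2f}")
```

The coarser steps at high frequencies tend to turn small high-frequency coefficients into zeros, which is what makes the subsequent RLZ coding effective.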

1.3 The DMT transmission technique

Twisted pair copper wires, which today are mainly used for telephony, can also be used for transmission of data at quite high speeds. There are several competing standards and techniques that can be used for increasing the data rates on twisted pair cables. The family of standards for this kind of communication is often referred to as xDSL, where DSL stands for Digital Subscriber Line. Some of the standards belonging to this family are ADSL, ADSL.lite, VDSL, HDSL and SHDSL.

The Asymmetrical DSL (ADSL) standard is the technique that today dominates high speed communication on twisted pair cables [7]. The ADSL technique handles the last few kilometers from the so called central office (CO) to the homes. The equipment in our homes is usually referred to as the customer premises equipment, CPE, see Fig. 1.6. ADSL is suitable when the distance from the CO is less than around 5 km. The reach of VDSL is even shorter since higher, more attenuated, frequencies are used.

Figure 1.5 MPEG-2 Video decoder. [Block diagram: input stream, input buffer, parser, VLC decoder (VLD), RLZ, IQ, IDCT, motion compensation, picture reorder, decoded video.]

The data rates in ADSL are up to 9 Mb/s from the CO to the CPE side, and up to 1 Mb/s in the other direction. The reason for providing higher bit rates in the downstream direction is that the need for high data rates is assumed to be higher in this direction.

The very high speed DSL (VDSL) standard will provide data rates up to 50 Mb/s. The standardization of VDSL has, however, been delayed considerably due to problems with agreeing on which modulation method is most suitable. Much of the work in this thesis has been based on a VDSL technique proposed by Ericsson, which is based on the Discrete Multi Tone (DMT) modulation scheme that is also used in the ADSL standard [8].

1.3.1 DMT modulation

In the DMT technique the information is encoded on a large number of signal carriers. The signal conditions on a twisted pair cable may vary between frequencies, and the independence between the carriers in the DMT technique provides a possibility to optimize the amount of information to send on each carrier.

In DMT the multi-carrier signal is created by using the Inverse Discrete Fourier Transform (IDFT) with the input $a + j \cdot b$, where $a$ and $b$ basically are the same as in Eq. 1.2. The effective encoding technique on each carrier will be M-ary QAM, where M varies depending on the channel capacity for each carrier. The decoding is done by first splitting the multi-carrier signal into its components by using the Discrete Fourier Transform (DFT) on the received signal and then decoding each carrier individually. In practice the IDFT and DFT are calculated by using the numerically equivalent fast transforms IFFT and FFT. The constellation size on each carrier is dynamically adapted to a varying noise level by using a "bit swapping" algorithm [9].

Figure 1.6 The ADSL scenario. [Diagram: the Internet is connected to a central office containing several CO ADSL modems; the lines run via a cross connecting point to the ADSL modem in the home, which serves a PC and a TV.]
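The modulation and demodulation steps above can be sketched with a direct (slow) DFT. The sketch assumes that the complex points are placed on carriers 1 to N-1 of a 2N-point transform, with the mirrored half filled with complex conjugates so that the time signal becomes real; this carrier placement, the unused DC and Nyquist bins and the function names are assumptions for illustration. A real modem would use the FFT and IFFT instead of these O(n²) loops.

```python
import cmath


def idft(spectrum):
    """Direct inverse DFT; an IFFT gives the same result faster."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def dft(samples):
    """Direct DFT; an FFT gives the same result faster."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def dmt_modulate(points):
    """Place complex QAM points a + j*b on carriers 1..N-1, return 2N real samples."""
    n = len(points) + 1
    spectrum = [0j] * (2 * n)
    for k, p in enumerate(points, start=1):
        spectrum[k] = complex(p)
        spectrum[2 * n - k] = complex(p).conjugate()  # conjugate symmetry -> real signal
    return [s.real for s in idft(spectrum)]


def dmt_demodulate(samples):
    """Recover the complex point on each of the carriers 1..N-1."""
    n = len(samples) // 2
    return dft(samples)[1:n]


points = [1 + 1j, -1 + 1j, 3 - 1j]   # one QAM point per carrier (hypothetical)
symbol = dmt_modulate(points)        # 8 real time-domain samples
recovered = dmt_demodulate(symbol)
print([complex(round(p.real, 6), round(p.imag, 6)) for p in recovered])
```

Each carrier is recovered independently of the others, which is what allows a different constellation size to be used on each carrier.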

The main blocks in a DMT modem are outlined in Fig. 1.7. In addition to the outlined blocks we also need blocks for clock recovery and symbol synchronization as well as serial/parallel converters, etc. These blocks have been excluded to simplify the explanation of the basic idea behind DMT communication.

The information is put into frames and symbols in the block called framer. Redundancy information is added in the forward error correction block, FEC, which makes it possible to detect and correct some transmission errors. The Reed-Solomon decoder (RS-decoder) is used for correction of transmission errors. There are two transmission paths, one with an interleaver and one without. The interleaver/deinterleaver pair spreads transmission errors in time, which increases the error correction performance of the RS-decoder. Unfortunately the delay through the system is also increased, which causes problems for instance in two-way communication. Therefore there is also a non-interleaved transmission path that can be used for delay sensitive applications.
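How an interleaver spreads a burst of errors over several codewords can be shown with a simple block interleaver. This is an illustrative sketch only: the block interleaver, its dimensions and the function names are assumptions, and it is not the interleaver specified in the ADSL standard.

```python
def interleave(data, rows, cols):
    """Write row by row into a rows x cols block, read column by column."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(data, rows, cols):
    """Inverse of interleave(): restore the original row-by-row order."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    return [data[c * rows + r] for r in range(rows) for c in range(cols)]


ROWS, COLS = 3, 4                      # three hypothetical 4-symbol codewords
codewords = list(range(ROWS * COLS))
sent = interleave(codewords, ROWS, COLS)

received = list(sent)
for pos in (3, 4, 5):                  # burst of three consecutive symbol errors
    received[pos] = None

restored = deinterleave(received, ROWS, COLS)
errors_per_codeword = [restored[r * COLS:(r + 1) * COLS].count(None)
                       for r in range(ROWS)]
print(errors_per_codeword)  # [1, 1, 1]
```

Without interleaving the same burst would put three errors into a single codeword. After deinterleaving each codeword contains only one error, which is easier for the error correcting decoder to handle, at the price of the extra delay needed to fill the block.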

EC stands for echo cancelling, which is needed if the data transmitted in the upstream and downstream directions share the same frequency space. In this case the received signal will contain some of the transmitted signal, which must be removed in order not to disturb the decoder.

TEQ is the time domain equalizer, and FEQ is the frequency domain equalizer. The task of the equalizers is to work as an "inverse filter" to the channel impulse response so that the original signal is restored, giving a signal as close to the transmitted signal as possible. By using two equalizers the total complexity of implementing the equalization is reduced compared with using only a TEQ.

Figure 1.7 Block diagram of a DMT modem. [Transmit path blocks: framer, FEC, interleaver, encoder, IFFT, DAC. Receive path blocks: ADC, TEQ, EC, FFT, FEQ, decoder, deinterleaver, RS-decoder, deframer. Both paths are connected to the line through the analog frontend.]

The analog frontend contains analog filters and a line driver. Sometimes the digital-to-analog (DAC) and analog-to-digital (ADC) converters, as well as the digital interpolation and decimation filters, are also counted as parts of the analog frontend.

1.3.2 Frequency allocation

ADSL uses 256 carriers in the downstream direction and 32 carriers in the upstream direction, see Fig. 1.8. The communication takes place in both directions at the same time, and when the upstream and downstream bands overlap an echo canceller is required in order to cancel the signal sent in the other direction on the same frequencies. The standard does, however, allow the use of non-overlapping frequency bands (frequency domain multiplexing, FDM), which is cheaper since less complex hardware is needed. Today the normal case is not to use overlapping frequencies, since the consumer market is very cost sensitive.

The 256 downstream carriers and the 32 upstream carriers cover the frequencies from 0 Hz up to 1.104 MHz and 138 kHz, respectively. The number of carriers actually used is lower, since it is not possible to use the frequencies occupied by other systems like the plain old telephone system (POTS).
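The carrier spacing implied by these numbers can be checked with a few lines of arithmetic. The sketch below divides each band edge by its number of carriers and then counts the upstream carriers that fall above 30 kHz, the lower edge of the upstream band marked in Fig. 1.8. Treating carrier k as centred at k times the spacing is an assumption of this sketch, as are the names used.

```python
DOWNSTREAM_CARRIERS = 256
UPSTREAM_CARRIERS = 32
DOWNSTREAM_EDGE_HZ = 1.104e6
UPSTREAM_EDGE_HZ = 138e3

spacing_down = DOWNSTREAM_EDGE_HZ / DOWNSTREAM_CARRIERS
spacing_up = UPSTREAM_EDGE_HZ / UPSTREAM_CARRIERS


def carriers_in_band(low_hz, high_hz, spacing_hz, carrier_count):
    """Indices k whose assumed centre frequency k * spacing lies inside the band."""
    return [k for k in range(carrier_count)
            if low_hz <= k * spacing_hz <= high_hz]


# Upstream carriers that stay clear of the POTS band and the guard band below 30 kHz.
usable_upstream = carriers_in_band(30e3, UPSTREAM_EDGE_HZ, spacing_up, UPSTREAM_CARRIERS)

print(spacing_down, spacing_up)  # 4312.5 4312.5
print(usable_upstream[0], usable_upstream[-1], len(usable_upstream))
```

Both directions end up with the same spacing of 4.3125 kHz, and under the stated assumption the lowest carriers are the ones that cannot be used because of POTS.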

Figure 1.8 ADSL frequency plan. [Two panels showing power level versus frequency: (a) ADSL with overlapping frequencies, (b) ADSL without overlapping frequencies. Marked frequencies: 4 kHz, 30 kHz, 138 kHz and 1.104 MHz. Bands: POTS band, upstream band, downstream band.]

VDSL uses a wider range of frequencies than ADSL. In this work we have aimed at frequencies up to around 11 MHz, which may change slightly when the standard is set. Initially a time-division multiplexing scheme was proposed, where transmission takes place in only one direction at a time. The current proposal does, however, suggest that different frequencies are used instead (frequency domain multiplexing, FDM). The frequency plan has not been completely finalized yet, but it seems clear that there will be several downstream as well as upstream bands in the final standard.

1.3.3 The DMT symbolA DMT symbol consists of 2 samples that are mapped on up to carriersusing the IFFT. Additionally a cyclic prefix (CP) is added before the 2 sam-ples. Hence the total symbol length become 2 + samples. The CP is a copyof the last part of the 2 samples and is used for reducing the transients thatoccur between two symbols. The length of the CP is set in the standard and hasbeen selected by compromising between how long CP can be afforded, and thelength needed for the transients to fade away before the information arrives, seeFig. 1.9.

In addition to the user data, the symbol contains extra fields that the two modems use to exchange system parameters and other control information.

1.3.4 The splitter

The ADSL modem is intended as an add-on to the subscriber line, and it is therefore necessary to keep the POTS well separated from the new ADSL system. POTS and ADSL use different frequencies and can therefore be separated using a low-pass filter in the POTS receive path and a high-pass filter in the ADSL receive path, see Fig. 1.10.

Figure 1.9 Cyclic prefix avoids transients. (Labels: DMT symbol, cyclic prefix (CP), copy, start of transients, no transients left.)


The reason for keeping the POTS installation instead of running all communication over the ADSL modem is that it has been considered very important to have a connection that works even during a power failure. A POTS system gets its power from the twisted-pair cable, but today an ADSL modem cannot be powered from the twisted pair, and the POTS system is therefore kept as a lifeline. More information about the DMT technique can be found in [10,11,12].

1.4 Scope of the thesis

The choice of algorithms has a direct impact on the performance of a given application. The complexity of the algorithms affects both the power consumption and how the algorithms should be implemented.

The implementation of digital signal processing (DSP) algorithms is a well-studied area [13,14], but with changing process technologies the focus is shifting from minimizing the area required for an implementation to reducing the power consumption and providing an efficient design flow. A clear trend is that DSP hardware needs to provide increased programmability, which makes it possible to reuse hardware for several applications and to have a more parallel design flow where the algorithms do not have to be stable before the hardware implementation can start.

The main target application for this thesis is the digital subscriber line (DSL) modem. In addition, work has also been performed in the area of image decoding, where the JPEG and MPEG standards have been the target. Efficient DSP implementations dedicated to FFT processing, variable length decoding and JPEG decoding are presented. Application Specific Digital Signal Processing (ASDSP), with the ability to combine efficient processing with reprogrammability, is discussed.

Figure 1.10 The splitter. (Labels: twisted pair cable, splitter, LP filter towards POTS, HP filter towards ADSL.)


Another area that has been studied in this work is how DSP algorithms can be used to improve the performance of A/D and D/A converters. By identifying errors and then either correcting them or spectrally moving the distortion, the data converter performance can be increased.

In the publications [15,16,17] architectures for fast decoders for variable length codes (VLCs) are proposed. Variable length codes are not used directly in digital communication, but they are often used in the data streams that are transmitted over the communication channel. In both digital audio and video, VLCs are used to reduce the amount of data that must be transmitted. VLCs are therefore important in, for example, the MPEG and JPEG standards, which are used to compress images and video sequences. Much of the work has also been reported in [18], and some additional discussion is given in Chapter 3.

The design process for efficiently designing application specific processors was studied in the papers [19,20,21]. This work is a continuation of the work by K.G. Andersson [22,23], but with improvements that include a better ability to reuse old designs and an efficient way to synthesize the architectures. We present two case studies, where the first [19] is an ASDSP for decoding JPEG images and the second [20] is an ASDSP for the Fast Fourier Transform (FFT). A synthesis tool for making the design path more efficient is reported in [21]. The design process is further discussed in Chapter 2.

The last four papers cover distortion reduction techniques in D/A and A/D converters. Signal processing algorithms have been developed that can be used to increase the performance of the data converters. In [24] we propose a method that can cancel offset errors in a time-interleaved A/D converter utilizing the receiver, which in this case is a digital modem. This method has also been the subject of a patent application [25]. A method to cancel gain and skew mismatch in an A/D converter is proposed in [26].

In [27,28,29] architectures are proposed that make it possible to trade glitches against mismatch in the weights of a current-source D/A converter. Data converters are discussed in Chapter 4.

Related to this thesis and the publications [25-29] is the tutorial “A/D and D/A Converters for Telecom. Applications” that was held at ICECS 2001 [30]. In this tutorial we tried to relate distortion reduction methods both to each other and to applications.


Most of the work has been carried out within an industrial research project called VIBRA at Ericsson Microelectronics AB. The aim of VIBRA was to develop analog and digital building blocks for DSL based systems. VIBRA has had strong connections to other research projects within Ericsson studying algorithms and hardware for DSL systems. For secrecy reasons it is not possible to present the complete picture of how this work relates to work within other parts of Ericsson in this thesis.


2 Digital Signal Processing Architectures

A digital signal processing (DSP) system typically consists of a set of algorithms which are implemented in a combination of hardware and software. Normally there are different types of DSP algorithms that interact with each other in the DSP system, see Fig. 2.1.

There are certain algorithms that must be executed continuously within a given time frame. Data are processed as they arrive and must be processed before the next data arrive, i.e. in real time. Examples of algorithms that fall into this category are filters, Fast Fourier Transforms (FFTs), and speech encoders.

There are also adaptive algorithms, usually used for optimizing filters to the present transmission channel. The adaptive algorithms are normally executed during initialization of the DSP system, when for instance the receiving filters and the echo cancellers are trained for optimal performance. After initialization, the adaptive algorithms are used to monitor changes in the communication channel, component drift due to temperature variations, etc. The adaptive algorithms can often be executed at a slower rate than the data rate, especially when the algorithms are used for monitoring changes in the channel.

Further, there are also some control algorithms that supervise the transmission, detect carriers, and control the initialization, see Fig. 2.1. The control process handles the scheduling of different activities during the initialization stage, and the start and stop of filter adaptation.



A DSP algorithm either operates on a block of samples or computes a new output value for every new input sample, i.e. it is stream based. Speech coders and image and video coding algorithms usually work with blocks of samples, while digital filters typically process the data stream continuously.

2.1 DSP algorithms

A common way of representing a DSP algorithm is by using a Signal Flow Graph (SFG) [14]. An example of an SFG of a simple filter is given in Fig. 2.2. The box marked with $T$ is a delay element that delays data from one sample to the next. The SFG has no connection to how the algorithm is implemented, but may be used to derive a suitable architecture, calculate the amount of resources needed, and schedule the operations onto an architecture [31].

An equivalent way of representing the algorithm is by using difference equations. The algorithm in Fig. 2.2 can for instance be described as

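$$y(n) = x(n) - \frac{1}{2}\,y(n-1). \tag{2.1}$$

Figure 2.1 DSP system tasks. (Blocks: hard real-time processing between input and output, adaptive algorithms, supervision and control.)

Figure 2.2 SFG of a simple filter. (Labels: input $x(n)$, output $y(n)$, delay elements $T$, coefficient 0.5, adder with one subtracting input.)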
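Equation (2.1) can be simulated directly, which makes the role of the delay element concrete. The sketch below is only an illustration: the zero initial state $y(-1) = 0$ and the function name are assumptions, not something stated in the text.

```python
def simple_filter(x):
    """Run y(n) = x(n) - 0.5*y(n-1) over a sequence, assuming y(-1) = 0."""
    y_prev = 0.0          # content of the delay element
    out = []
    for sample in x:
        y = sample - 0.5 * y_prev
        out.append(y)
        y_prev = y        # stored for the next sample
    return out


impulse_response = simple_filter([1.0, 0.0, 0.0, 0.0])
print(impulse_response)   # [1.0, -0.5, 0.25, -0.125]
```

Each output depends on the previous output, so the computation of $y(n)$ cannot start before $y(n-1)$ is available; this recursive loop is what bounds the sample rate in Section 2.1.1.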



2.1.1 Sample period bound

The data rate of an implementation of an algorithm is bounded by the recursive loops in the algorithm. The minimum sample period of a recursive algorithm is given by

$$T_{min} = \max_i \left\{ \frac{T_{OP_i}}{N_i} \right\} \tag{2.2}$$

where $T_{OP_i}$ is the total operation latency in the recursive loop $i$, and $N_i$ is the number of delay elements found in the loop [32]. The critical loop is the loop that limits the sample rate. There are several ways of improving the sample rate by algorithm transformations, for instance moving operations out of critical loops [14].
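The sample period bound is easy to evaluate once the loops of an algorithm have been listed. The sketch below assumes each recursive loop is summarized by its total operation latency and its number of delay elements; the latencies are hypothetical and not taken from the thesis.

```python
def min_sample_period(loops):
    """Return (T_min, index of the critical loop).

    loops is a list of (t_op, n_delays) pairs, one per recursive loop:
    t_op is the total operation latency in the loop and n_delays the
    number of delay elements in it.
    """
    bounds = [t_op / n_delays for t_op, n_delays in loops]
    t_min = max(bounds)
    return t_min, bounds.index(t_min)


# Hypothetical latencies in ns:
# loop 0: one multiplication (4 ns) and one addition (2 ns), one delay element
# loop 1: 10 ns of operations spread over two delay elements
t_min, critical = min_sample_period([(6.0, 1), (10.0, 2)])
print(t_min, critical)  # 6.0 0
```

In this example loop 0 is the critical loop even though loop 1 contains more computation, because loop 1 has two delay elements over which its latency is shared.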

2.1.2 Mapping of algorithms to hardware

When designing an architecture for the implementation of a given algorithm, the following must be considered.

The required data rate and the feasible clock rate set a lower bound on the number of resources needed. The minimum number of execution units (EXUs) of a given type needed to execute an algorithm at a given data rate is given by

$$N_{EXU_k} = \frac{N_k \cdot T_k}{T_{min}} \tag{2.3}$$

where $T_k$ is the time required for an operation of type $k$ and $N_k$ is the number of operations.
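As a small worked example of Eq. (2.3), the sketch below computes the number of execution units of one type. Rounding up to a whole number of units is an assumption added here, since the equation itself gives a real-valued bound; the operation counts and times are hypothetical.

```python
import math


def min_execution_units(num_ops, op_time, t_min):
    """Smallest whole number of EXUs of one type, following Eq. (2.3)."""
    return math.ceil(num_ops * op_time / t_min)


# Hypothetical figures: 8 multiplications of 4 ns each, sample period 10 ns
print(min_execution_units(8, 4.0, 10.0))  # 4, since 32 / 10 = 3.2 is rounded up
```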

It is important to schedule the operations properly in order to reach a high degree of utilization of the EXUs. The scheduling should also consider the dataflow between the blocks in the architecture. Reducing the interconnect also reduces the parasitic load from the wires, and hence the power consumption.

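2.1.3 Power consumption

There are four sources that contribute to the total power consumption in a digital CMOS circuit [33]:

$$\begin{aligned} P_{avg} &= P_{switching} + P_{sc} + P_{leak} + P_{static} \\ &= \alpha C_L f_{clk} V V_{DD} + I_{sc} V_{DD} + I_{leak} V_{DD} + I_{static} V_{DD} \end{aligned} \tag{2.4}$$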


where $P_{switching} = \alpha C_L f_{clk} V V_{DD}$ is the power that is consumed every time a signal node changes state. $\alpha$ is the average switching activity for all nodes in the circuit, and $C_L$ is the switched capacitance. The signal levels are assumed to be 0 and $V$ with a power supply of $V_{DD}$.

$P_{sc}$ is due to the short-circuit current that flows from $V_{DD}$ to ground when NMOS and PMOS transistors are active simultaneously, which may occur during switching.

$P_{leak}$ is due to the leakage current that arises from sub-threshold effects. The relative contribution from $P_{leak}$ is increasing because of the scaling of threshold voltages in new process technologies: a reduction of the threshold voltages of the transistors increases the leakage current, $I_{leak}$ [34]. Future CMOS processes will enable an increased amount of on-chip memory, which will give a significant contribution to the total leakage current.

The static current $I_{static}$ in a purely digital circuit mainly originates from logic gates whose inputs have reduced swing. When using full-swing static logic the static current will be low.
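To give a feel for the switching term in Eq. (2.4), the sketch below evaluates it for one operating point and for the same circuit at half the supply voltage. All numerical values are hypothetical, and full-swing logic is assumed so that the signal swing $V$ equals $V_{DD}$.

```python
def switching_power(alpha, c_load, f_clk, v_swing, v_dd):
    """P_switching = alpha * C_L * f_clk * V * V_DD, in watts for SI inputs."""
    return alpha * c_load * f_clk * v_swing * v_dd


# Hypothetical operating point: activity 0.25, 1 nF switched capacitance,
# 100 MHz clock, full-swing logic at 1.8 V and at 0.9 V
p_nominal = switching_power(0.25, 1e-9, 100e6, 1.8, 1.8)
p_half_supply = switching_power(0.25, 1e-9, 100e6, 0.9, 0.9)
print(p_nominal)                  # about 0.081 W
print(p_half_supply / p_nominal)  # about 0.25
```

With full-swing logic the switching power scales with the square of the supply voltage, which is why architectures that can run at a low clock rate and a low supply voltage can reach a low total power consumption.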

The power consumption in different functional parts of a DSP system can be partitioned into three components,

$$P_{avg} = P_{calc} + P_{store} + P_{ctrl}. \tag{2.5}$$

$P_{calc}$ is the power consumed in the functional units, i.e. where the actual algorithm is executed. $P_{calc}$ grows approximately linearly with the number of operations. To decrease this part, the computational complexity of the implemented function should be decreased. This can be done by choosing another algorithm or by trying to simplify the original one [33].

$P_{store}$ is the power consumed when storing internal signal values during the execution of the algorithm. The amount of storage needed depends mainly on a) how many samples are needed to compute one output value for a given algorithm, and b) the architecture used for executing the algorithm. It is important to reduce data movement between different memory elements in order to decrease $P_{store}$. One way of doing this is, for instance, to implement a first-in-first-out (FIFO) buffer using a memory and a memory pointer instead of a shift register. The positioning of the storage elements is also important; local storage may be less expensive than global memories. Low computational complexity does not necessarily imply few load and store operations, and $P_{calc}$ and $P_{store}$ should therefore be co-optimized [33].
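The FIFO example can be made concrete by counting how many storage words are written per pushed value in the two realizations. The sketch below is a behavioural illustration only; the class names and the write counter used as a rough proxy for data movement are assumptions made here.

```python
class PointerFifo:
    """Fixed-depth delay line kept in a memory with one moving pointer."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.ptr = 0      # position of the oldest entry
        self.writes = 0   # number of storage words written so far

    def push(self, value):
        """Store value and return the entry written depth pushes ago."""
        oldest = self.mem[self.ptr]
        self.mem[self.ptr] = value            # a single word is written
        self.writes += 1
        self.ptr = (self.ptr + 1) % len(self.mem)
        return oldest


class ShiftRegisterFifo:
    """Same behaviour, but every stored value moves on each push."""

    def __init__(self, depth):
        self.reg = [0] * depth
        self.writes = 0

    def push(self, value):
        oldest = self.reg[-1]
        for i in range(len(self.reg) - 1, 0, -1):
            self.reg[i] = self.reg[i - 1]     # shift one position
        self.reg[0] = value
        self.writes += len(self.reg)          # all words are rewritten
        return oldest


pointer_fifo, shift_fifo = PointerFifo(4), ShiftRegisterFifo(4)
pointer_out = [pointer_fifo.push(v) for v in range(1, 9)]
shift_out = [shift_fifo.push(v) for v in range(1, 9)]
print(pointer_out == shift_out)               # True
print(pointer_fifo.writes, shift_fifo.writes) # 8 32
```

Both versions deliver the same output sequence, but the pointer-based version writes one word per push regardless of the depth.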


$P_{ctrl}$ is the power consumed in the control unit, which among other things controls the dataflow between the storage elements and the functional units. The complexity of the controller depends on the datapath architecture, the scheduling of operations, and the algorithm.

2.2 DSP architectures

The implementation strategy for a DSP algorithm depends on the required data rate, acceptable power consumption, maximum chip area, and the complexity of the algorithm, but also on parameters like available building blocks and the required flexibility of the final system. The two main implementation strategies are to use either a fixed-function architecture, whose functionality cannot be changed afterwards, or a programmable architecture with greater flexibility. Sometimes it is an advantage to choose an architecture that offers limited programmability, since this makes it possible to change the algorithms after the chip has been manufactured.

2.2.1 Fixed-function architectures

A fixed-function architecture can be obtained either by using isomorphic mapping, where the SFG of the algorithm is directly mapped to an architecture [14], or by using time-sharing of execution units. When isomorphic mapping is used the number of EXUs equals the number of operations in the SFG. This kind of architecture is best suited for algorithms that contain few operations and require a high data rate, since the area will be large. The benefit of an isomorphically mapped architecture is that the overhead can be kept low since little control of the dataflow is needed. Since the isomorphically mapped architecture requires a minimal amount of control overhead and the clock rate often is low, which allows the use of a low power supply voltage, the total power consumption can be low.

In the time-shared architecture the operations in the SFG are mapped onto one or a few EXUs. In order to control the dataflow between the EXUs and storage elements (STUs) there must be one or several control units (CUs) that control operations, dataflow and memory accesses. The control unit controls the dataflow by applying signals that affect the way data is transported or computed in the architecture, see Fig. 2.3.


It is also possible to mix the two strategies, isomorphic mapping and time-sharing, by implementing efficient EXUs using isomorphic mapping and then time-sharing the EXUs. For example, in the FFT algorithm the inner loop contains a butterfly operation, which is often implemented using isomorphic mapping and then time-shared between the different butterflies in the FFT [35,36].

The time-shared architecture adds complexity to the interconnect, the control units, and possibly to the execution units as well. This extra control overhead will increase the power consumption, and it is therefore essential to keep the overhead as low as possible if the total power consumption is an important design parameter.

2.2.2 Programmable architectures

In a programmable architecture the control unit must be programmable, which in the simplest case is done by placing the sequence of control signals in a memory. The different operations that can be controlled by control signals will be referred to as instructions, and the instruction set is the set of instructions supported by a given architecture.

To benefit from the programmable architecture, the instruction set must be modified and extended compared to the simple time-shared architecture in order to introduce some flexibility. The execution units may need to be more general, e.g. a multiplier that can use more than a few selected coefficients is more useful than one that can multiply with one pre-defined coefficient only. The interconnect may need extensions that remove restrictions on the dataflow. The control unit may also need to support more advanced data flows, for instance nested loops and conditional jumps.

Figure 2.3 Time-shared architecture. (Blocks: EXU 1 to EXU N, STU 1..M, CU 1..K; control signals: operation control, dataflow control, storage control; status flags.)

The addition of more flexibility in the datapath and programmable control units will increase the complexity of the architecture as well as the power consumption. If the programmable architecture is to be used for a wide range of applications the instruction set will become more extensive. Consequently, it is an advantage from an efficiency point of view if the DSP architecture can be targeted towards a small range of algorithms, since this will reduce the instruction set and therefore increase the efficiency.

A programmable DSP architecture has the advantage of being easier to reuse for several applications. One way of providing some flexibility without having to go all the way to a DSP processor is to have a set of user-controlled parameters that affect the algorithm in some predefined way, i.e. parametrization. The length of an FFT, or the number of taps in a Finite Impulse Response (FIR) filter, can for instance be made a parameter of the block. In this way it is also possible to make architectures that can be used in many applications but can still be synthesized efficiently if the parameters are fixed before the synthesis stage. For instance, a programmable filter can be turned into a filter with fixed coefficients, which makes it possible to simplify the multipliers.

To summarize, we have the following types of DSP architectures with varying degrees of efficiency and flexibility:

• Fixed-function architectures that can execute only one pre-determined algorithm, where the operations are either time-shared or isomorphically mapped to the EXUs. In this thesis this class is represented by the work on variable length codes, see Chapter 3.

• Parametrized architectures that can execute only pre-determined algorithms, but with the possibility to control some parameters, for instance the filter length. This class of architectures is not explicitly treated in this thesis, but some parametrization is used in the case studies for programmable DSP architectures.

• Programmable architectures that are controlled by a microprogram and in which the algorithms can be replaced with new ones without changing the hardware. This architecture is used in the case studies presented later in this chapter.

• Reconfigurable architectures that are realized using reconfigurable logic, for instance FPGAs. These architectures are not discussed in this thesis.



2.3 DSP architectures with programmability and high efficiency

In the work presented in [19,20,21] we have aimed at finding a method to combine high performance with little overhead and high efficiency in the architecture. Using a programmable application specific DSP processor makes it possible to closely match an architecture to the target application, while still providing enough programmability to incorporate changes in the algorithms late in the design flow. During the design of an ASIC the DSP algorithms may change many times, both during the design process and after the release of the product. The application specific DSP can hopefully incorporate many of the changes in the algorithms without architecture modifications, while this is difficult to do in a fixed-function architecture.

2.4 Design methodology for ASDSP

To be able to match the architecture with the algorithm it is necessary to have a good understanding of both DSP architectures and the algorithms to be implemented. The process of finding a good ASDSP architecture involves modelling the algorithm on the target architecture. Since the architecture is to be closely matched to the algorithm, this task may include several iterations in which the architecture is refined step by step until a good architecture has been found. To support the step-wise refinement we need a design methodology that can model both ASDSP architectures and algorithms efficiently. In our work we have been using the design methodology reported in [23], which supports a concurrent design flow with simultaneous modelling of algorithm, instruction set and DSP architecture, see Fig. 2.4. This design methodology, which is called the Modular DSP Methodology (MDSP), has been evaluated in two case studies (see 2.4.1 and 2.4.2).

The MDSP methodology uses a “C-like” language called µC to describe both the algorithm and the instruction set needed to implement the algorithm using one unified model. The µC-model relies on an underlying control unit model with an instruction set that supports instructions like “conditional jump” and “sequential execution”. The µC-model is both cycle accurate and bit accurate, which makes it possible to do bit- and cycle-accurate simulations early in the design process. An example of a µC-model that describes an FIR filter is given in Fig. 2.5.

After verification the µC-model is translated into a VHDL architecture that supports the instructions needed for the execution of the given algorithm, and then synthesized using conventional tools. An example of an architecture that is compatible with the µC-code in Fig. 2.5 is shown in Fig. 2.6. Note that the VHDL architecture may support a larger instruction set than what is required by the µC-model. Finally, a microcode is extracted from the µC-model together with the VHDL architecture and a library with building blocks like registers, memories, and ALUs. The instructions used in the µC-code must have a corresponding building block in the building block library. The library is easy to extend with new functions when needed.

Figure 2.4 MDSP design flow. (Blocks: spec., µC model, function library, µ-code gen., RTL model, formal verification, ASIC des. flow; results: HW fixed, SW fixed.)

2.4.1 Modelling of a JPEG DSP

The first case study was the modelling of an ASDSP dedicated to decompressing images that had been coded according to the JPEG image coding standard [3]. This work was reported in [19]. The JPEG standard consists of a mix of different algorithms and it is therefore difficult to find one architecture that is well adapted to all algorithms. Instead the resulting architecture is a compromise where the instruction set has been optimized for the critical algorithms. The resulting architecture consists of a two-core solution where one core handles protocol processing and Huffman decoding. The second core is dedicated to processing of the Inverse Discrete Cosine Transform (IDCT), which represents a high computational work load. Due to the partitioning of the algorithms only image data needs to be passed to the IDCT processor core. The parameters can be kept entirely in the Huffman processor core.

Figure 2.5 µC code of a 32 tap FIR filter.

```
// Declaration part
MDSP fir
{

  INPUT inp(14, PARALLEL);          // input port, 14 bits
  OUTPUT outp(14, PARALLEL);        // output port, 14 bits
  REG acc(30), i(6), ca(5), da(5);  // different registers
  RAM d(32,16);                     // RAM with 32 16 bit words
  ROM c(32,16, "rom.data");         // ROM

  PROCEDURE compfir ();             // procedure declaration
}

// Code part

PROCEDURE main()
{
  for(;;){                          // loop forever
    do {;} while(!inpF) ;           // While no input on the input
                                    // port inp do nothing
    inpF=0, d[da]=inp;              // Reset input by setting inpF=0,
                                    // store inp in RAM. "," means that
                                    // this is made in parallel

    compfir();                      // call procedure compfir
    outp=acc;                       // place the value of acc on outp port
  }
}
PROCEDURE compfir()                 // compute fir
{
  acc=0,ca=0;

  i=30;
  do {
    acc+=d[da++]*c[ca++],
    i--;
  } while (i>0)
  acc+=d[da]*c[ca++];
  return;
}
```
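For readers without access to the µC tools, the behaviour of the FIR model in Fig. 2.5 can be approximated with a small Python reference model. This is an interpretation made here, not part of the MDSP flow: it assumes that the data address wraps around the RAM, that the loop and the final multiply-accumulate together cover all taps, and it does not model the 14-bit ports or the 30-bit accumulator. The number of taps is a parameter taken from the coefficient list.

```python
class CircularFir:
    """Behavioural sketch of an FIR filter using a circular sample RAM."""

    def __init__(self, coeffs):
        self.c = list(coeffs)        # coefficient ROM
        self.d = [0] * len(self.c)   # sample RAM, initially cleared
        self.da = 0                  # data address register

    def step(self, sample):
        """Consume one input sample and return one output sample."""
        taps = len(self.c)
        self.d[self.da] = sample                 # d[da] = inp
        acc = 0
        for ca in range(taps - 1):
            acc += self.d[self.da] * self.c[ca]  # acc += d[da++] * c[ca++]
            self.da = (self.da + 1) % taps       # wrap-around is assumed
        acc += self.d[self.da] * self.c[taps - 1]  # last tap, da unchanged
        return acc


fir = CircularFir([1, 2, 3, 4])                  # hypothetical coefficients
print([fir.step(x) for x in [1, 0, 0, 0, 0]])    # [1, 2, 3, 4, 0]
```

Because the data address is incremented one time fewer than there are taps, it ends one position before the newest sample, which is where the next input overwrites the oldest one. No data is moved between memory words.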

The experience from this case study is twofold. The methodology worked well for defining programmable architectures for a special application. With some modifications it was possible to design a high-efficiency core with high utilization using the design methodology. The modifications we needed to add were hardware support for loops, where the jumps do not cost any extra clock cycles, and the possibility to describe finite state machines to handle the I/O of the IDCT processor. As it turned out, the IDCT core architecture is in most aspects similar to the architecture that would have been obtained if conventional design methods had been used for designing a fixed-function ASIC. Hence, even if the methodology is intended for programmable architectures with one control unit and a data path, it is possible to describe complex architectures, such as the finite state machine that works in parallel with the main control unit.

2.4.2 Design and implementation of an FFT processor

The second case study was the task of implementing a core dedicated to the FFT, meeting the requirements of the VDSL standard proposal. The FFT is an algorithm with a simple structure, which basically consists of repeated calculations of a butterfly operation [37]. Because of the regular, data-independent structure of the FFT algorithm, the best choice is normally to implement an FFT in a dedicated architecture. The special feature of this case study was that both a high degree of programmability and a high throughput were required due to uncertainties in the proposed standard. Therefore our choice was to implement the FFT in a programmable architecture that could easily be adapted to changes in the standard.

Figure 2.6 32-tap FIR filter architecture. (Labels: ROM c, RAM d, acc, i, ca, da, firl, circ_add, imm op, inp, outp; multiplier, adders, and a comparator whose output goes to the control unit.)

The FFT algorithm

There exist many different algorithms for computing a discrete Fourier transform (DFT) more efficiently than by direct calculation of the DFT sum

$$X(k) = \sum_{l=0}^{N-1} x(l) \cdot e^{-j2\pi lk/N}, \quad k = 0, 1, \ldots, N-1 \tag{2.6}$$

where $N$ is the transform length.

In 1965 the first algorithm for calculating the DFT more efficiently was presented by J. Cooley and J. Tukey [38]. The presented algorithm reduced the arithmetic complexity from $O(N^2)$ to $O(N \log(N))$. The FFT algorithm presented by Cooley and Tukey is usually referred to as the Cooley-Tukey FFT or the decimation-in-time (DIT) FFT algorithm.

In this implementation a decimation-in-frequency (DIF) FFT algorithm, first published in [39], is used.

Both the DIT and DIF FFT algorithms are attractive from an implementation point of view because of the regular structure of the SFG, with a number of columns in which the same butterfly operation is repeated, see Fig. 2.7.

Figure 2.7 SFG of an 8-point radix-2 DIF FFT. (Inputs x(0) to x(7) in natural order; outputs in the order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7); three columns of butterflies, column 0 to column 2, with twiddle factor multiplications.)

It is possible to derive FFT algorithms with different radices, which implies different types of butterflies. If the FFT algorithm contains only butterflies with two inputs it is a radix-2 algorithm, with four inputs it is a radix-4 algorithm, and so on. In our implementation of the FFT algorithm we chose to support both radix-4 and radix-2 butterfly operations. Radix-4 butterflies require fewer memory accesses than radix-2 butterflies, but with radix-4 butterflies alone it is only possible to calculate FFT sizes that are a power of four. By supporting both radix-2 and radix-4 butterflies, FFTs whose length is a power of two can be calculated.
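The radix-2 DIF structure can be illustrated with a short recursive sketch, checked against the direct DFT sum of Eq. (2.6). This is a textbook-style formulation written for clarity, not the processor's implementation: it uses recursion instead of in-place columns, it places the outputs in natural order instead of the bit-reversed order of Fig. 2.7, and it does not include radix-4 butterflies. The function names are illustrative.

```python
import cmath


def dft_direct(x):
    """Direct evaluation of the DFT sum, O(N^2) operations."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT for power-of-two lengths."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    half = n // 2
    # One column of butterflies: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    result = [0j] * n
    result[0::2] = fft_dif_radix2(sums)
    result[1::2] = fft_dif_radix2(diffs)
    return result


x = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(a - b) for a, b in zip(fft_dif_radix2(x), dft_direct(x)))
print(max_error < 1e-9)  # True
```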

The FFT algorithm used in the DMT technique is derived from the normal FFT algorithm, with some modifications because the signal sent to the line is real valued. In this case it is possible to calculate a $2N$ FFT using an $N$ point FFT and an additional calculation step. The algorithm is based on the fact that the Fourier transform of a real-valued input sequence is conjugate-symmetric [40], i.e.

$$X(e^{j\omega}) = X^*(e^{-j\omega}). \tag{2.7}$$

This kind of symmetry in the Fourier transform is commonly called Hermitian symmetry. The steps for calculating a $2N$ FFT with real input values using an $N$ FFT [37,41] are given below.

Assume the input sequence $x(n)$ is real, with $n = 0, 1, \ldots, 2N-1$.

Form a new complex sequence

$$y(l) = x(2l) + j\,x(2l+1), \quad l = 0, 1, \ldots, N-1. \tag{2.8}$$

Compute the $N$ point DFT of $y(l)$

$$Y(k) = \sum_{l=0}^{N-1} y(l) \cdot e^{-j2\pi lk/N} \tag{2.9}$$

Create the DFT of $x(n)$ by the following computation:

$$X(k) = \frac{1}{2}\left(Y(k) + Y^*(N-k)\right) + \frac{1}{2j} \cdot e^{-j\frac{2\pi k}{2N}} \cdot \left(Y(k) - Y^*(N-k)\right), \quad k \in [0, N-1] \tag{2.10}$$

and

$$X(k) = \frac{1}{2}\left(Y(k) + Y^*(N-k)\right) - \frac{1}{2j} \cdot e^{-j\frac{2\pi k}{2N}} \cdot \left(Y(k) - Y^*(N-k)\right), \quad k \in [N, 2N-1]. \tag{2.11}$$

Only values in the range $k \in [0, N-1]$ need to be calculated since the output is symmetric according to Eq. 2.7.

In a similar way it is possible to calculate a real-valued output sequence with a length of $2N$ points by only calculating an $N$-point IDFT when the output is known to be real valued [42]. The stages in this calculation are given below.

$$Y(k) = \frac{1}{2}\left(X^*(k) + X(N-k)\right) + \frac{1}{2j} \cdot e^{-j\frac{2\pi k}{2N}} \cdot \left(X^*(k) - X(N-k)\right), \quad k \in [0, N-1] \tag{2.12}$$

Perform an $N$ point IDFT on the sequence $Y(k)$.

$$y(n) = \frac{1}{N} \sum_{k=0}^{N-1} Y(k) \cdot e^{j2\pi nk/N} \tag{2.13}$$

Perform a de-interleaving stage to create the $2N$ output values.

$$x(2n) = \mathrm{Re}(y(n)), \qquad x(2n+1) = \mathrm{Im}(y(n)) \tag{2.14}$$

The calculation stages in the FFT/IFFT operations used in the DMT technique are also illustrated in Fig. 2.8.
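The receive-path steps (packing the real samples into a complex sequence, an $N$ point transform, and the post-processing of Eq. (2.10)) can be checked numerically with a short sketch. It is an illustration under stated assumptions: a direct DFT stands in for the $N$ point FFT to keep the example self-contained, $Y(N)$ is taken equal to $Y(0)$ by the periodicity of the DFT, and only the first $N$ outputs are computed. The function names and the test sequence are made up for the example.

```python
import cmath


def dft(seq):
    """Direct DFT sum, used here in place of an N point FFT."""
    n = len(seq)
    return [sum(seq[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def real_dft_first_half(x):
    """X(k), k = 0..N-1, of a real sequence of length 2N via Eqs. (2.8)-(2.10)."""
    if len(x) == 0 or len(x) % 2:
        raise ValueError("input length must be even and non-zero")
    n = len(x) // 2
    y = [complex(x[2 * l], x[2 * l + 1]) for l in range(n)]  # Eq. (2.8)
    big_y = dft(y)                                           # Eq. (2.9)
    result = []
    for k in range(n):
        y_conj = big_y[(n - k) % n].conjugate()  # Y*(N-k), with Y(N) = Y(0)
        even_part = 0.5 * (big_y[k] + y_conj)    # transform of x(2l)
        odd_part = (big_y[k] - y_conj) / 2j      # transform of x(2l+1)
        twiddle = cmath.exp(-2j * cmath.pi * k / (2 * n))
        result.append(even_part + twiddle * odd_part)        # Eq. (2.10)
    return result


x = [1, 2, 0, -1, 3, 1, -2, 4]          # hypothetical real input, 2N = 8
reference = dft(x)[:4]                  # full-length DFT for comparison
half = real_dft_first_half(x)
print(max(abs(a - b) for a, b in zip(half, reference)) < 1e-9)  # True
```

The remaining $N$ outputs follow from the conjugate symmetry of the transform of a real sequence, so they do not need a separate computation.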

Functional description

The FFT processor can handle between 128 and 1024 carriers at a data rate that corresponds to a 25 MHz sample rate. We chose to use two parallel processing cores, where each core can handle one direction, or alternatively the two cores can be used in the same direction with an increased data rate. An outline of the top level of the FFT architecture is shown in Fig. 2.9.

There are two I/O blocks that handle the two data streams (upstream and downstream). The two I/O blocks communicate with six sets of memories, each one capable of keeping one complete symbol in memory. To keep the memory bandwidth high enough in the FFT, a segmented bus structure is used in which each memory set has access to three buses, i.e. the two I/O units and one of the FFT cores. Each memory set contains two physical memories, and it is possible to do one read and one write to the memory set each clock cycle as long as these do not go to the same physical memory. The on-chip buses have been designed such that it is possible to do both a read and a write over the same bus in the same clock cycle.

Since we had to support different types of both time division multiplex and frequency duplex modulation on the line without major changes in the external control logic, the memory buffering scheme was put inside the FFT processor. The I/O units include the possibility to add and remove the cyclic prefix from the symbols, which saves a buffer stage in the VDSL modem. One of the I/O units has been supplied with a complex multiplier in order to be able to integrate a frequency domain equalizer with the addition of some external control logic.

Figure 2.8 FFT/IFFT calculation exploiting Hermitian symmetry. (Transmit path: IFFT 2N->N pre-processing and N point IFFT; receive path: N point FFT and FFT N->2N post-processing.)

The FFT DSP core is optimized for processing of complex-valued data, and therefore instructions like complex multiplication, complex addition, etc. were chosen. One complex multiplier, two ALUs for addition and subtraction, and one combined scaling and rounding unit are the available resources for the main FFT calculation. There are also three address generation blocks: two for the read and write addresses used to access the data, and one coefficient generation block for the twiddle factors in the FFT algorithm.

The memory buffering scheme, the FFT length, and the length of the cyclic prefix are software controlled.

2.5 ASDSP design methodology

The goal with the MDSP methodology has been to develop an improved design methodology which can be used for efficient design of ASDSPs containing a mix of fixed-function hardware, efficient programmable cores with specialized instruction sets, and software using a hardware-software co-design approach.

Often the instruction set tends to become large, and there is a need to be able to stop designers from adding new instructions at some point. The possibility to easily add new instructions is nice at the initial stage of the architecture design, but when the architecture evolves the introduction of new instructions must be more restricted. Our solution is to have an instruction definition file in parallel with the µC model. The instruction definition file makes it possible to have dedicated designers that are allowed to introduce new instructions while others just implement parts of the software using the existing ones.

Figure 2.9 FFT processor architecture. [Figure: I/O units IO A and IO B with ports INA, OUTA, INB and OUTB; a memory system with the memory sets A0, A1, A2, B0, B1 and B2; and the two DSP cores DSP A and DSP B.]

The resulting architecture may also become too limited unless special attention is paid to what kind of flexibility is needed in the architecture. Addressing modes should be general enough to allow alternative addressing schemes, and the data flow between the EXUs should be allowed to be different from the one used in the implemented algorithm.

A good compromise between having an efficient architecture with few specialized instructions or using a very extensive instruction set is to use parameter controlled EXUs. For instance, the size of a circular buffer can be set by storing the size in a register, the rounding type can be controlled by setting some bits in a register, and so on.
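The idea of a parameter controlled EXU can be illustrated with a small behavioural model. The sketch below is only an illustration under assumptions: the class name, the register name `size_reg` and the base address are hypothetical and do not come from the MDSP tools. It shows one address generator whose circular buffer length is set by writing a register rather than by a dedicated instruction per buffer size.

```python
class CircularAddressGenerator:
    """Behavioural model of an address generation EXU (hypothetical names)."""

    def __init__(self, base=0):
        self.base = base      # first address of the circular buffer
        self.size_reg = 1     # parameter register: buffer length
        self.offset = 0       # current position inside the buffer

    def write_size_register(self, size):
        """Set the buffer length, as software would do with a register write."""
        if size < 1:
            raise ValueError("buffer size must be at least 1")
        self.size_reg = size
        self.offset %= size   # keep the pointer inside the new buffer

    def next_address(self, step=1):
        """Return the current address, then advance modulo the size register."""
        address = self.base + self.offset
        self.offset = (self.offset + step) % self.size_reg
        return address


agen = CircularAddressGenerator(base=0x100)
agen.write_size_register(4)
addresses = [agen.next_address() for _ in range(6)]
print([hex(a) for a in addresses])
```

The same instruction ("fetch next address") serves every buffer length, which is the trade-off the paragraph above describes: one general EXU plus a parameter register instead of many specialized instructions.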

2.5.1 Architecture synthesis from µC

An important experience of the work with the MDSP methodology was that there was a need for a tool that can help the designer with translating the instruction description in µC to an architecture suitable for the algorithm. To achieve a more efficient way to create a good DSP architecture from a µC model, a synthesis tool was created.

We wanted a tool that would be deterministic in the sense that small modifications to the µC model should only lead to small changes in the synthesized architecture. We also wanted a tool that was easy to understand so that the designer could easily learn how to write good µC code. That is, the effort should not be on the optimization of the architecture, but instead on capturing the designer's intentions.

A prototype tool based on the goals mentioned above has been developed and is reported in [43,21]. The tool works according to a few simple rules, but with a possibility to easily override the rules when they yield a poor architecture.

• All operations whose target register is the same are collected into an arithmetic logic unit (ALU).

• Each target register has its own ALU connected to it. This may appear to be a strange rule, but the designer is supposed to override the rule by identifying register files. A register file will have one ALU connected to it. A register file is in this context a number of registers that can all be used in an identical way in all instructions.

• A constant expression used in an operation or assigned to a register or other storage element is implemented as an immediate operand which comes from a field in the instruction word. The exception to the rule is a constant that is zero or one. This rule can be overridden by specifying an option to the tool or by changing the µC code.
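The three rules can be sketched as a few lines of code. This is not the prototype tool from [43,21]; it is a minimal illustration in which the operation format `(target, operation, operands)`, the function name `synthesize` and the register names are all assumptions made for the example.

```python
from collections import defaultdict


def synthesize(operations, register_files=None):
    """Apply the three rules to a list of (target, op, operands) tuples.

    register_files maps a register file name to the registers it contains
    (the designer's override of the one-ALU-per-register rule).
    """
    register_files = register_files or {}
    # Each register belongs either to a register file or to itself.
    owner = {reg: name for name, regs in register_files.items() for reg in regs}
    alus = defaultdict(set)
    immediates = set()
    for target, op, operands in operations:
        # Rules 1 and 2: operations are collected per target (or register file).
        alus[owner.get(target, target)].add(op)
        # Rule 3: constants other than zero and one become immediate operands.
        for operand in operands:
            if isinstance(operand, int) and operand not in (0, 1):
                immediates.add(operand)
    return dict(alus), immediates


ops = [
    ("acc", "add", ("acc", "r0")),
    ("acc", "sub", ("acc", "r1")),
    ("r0", "shift", ("r0", 3)),
    ("r1", "add", ("r1", 1)),
]
alus, immediates = synthesize(ops, register_files={"rf": ["r0", "r1"]})
print(alus, immediates)
```

Without the `register_files` argument the same input yields three ALUs (one each for `acc`, `r0` and `r1`), which shows why the second rule is expected to be overridden by the designer.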

Our experience is that the tool works well, but in some cases the designer is required to change the µC model to work around some problems.

It is difficult to compare our solution with other solutions, but there exist some other systems which use a C-like language for the hardware modelling. There also exist several systems that use a C derivative for general HW modelling [44,45] and there is also an initiative called The Open SystemC initiative where a hardware design language based on C is proposed [46].

Many prototype systems for hardware synthesis have been proposed. In some of the systems an algorithm model is fed into a synthesis program that performs an automatic resource allocation and scheduling of operations [47-51]. The disadvantage of doing behavioral synthesis with the algorithm as starting point is that the synthesized instruction set will be limited to what is necessary in the implemented algorithms. If extra flexibility, i.e. more instructions, is needed in the architecture this is difficult to incorporate, and even if possible, it is difficult to re-program the control unit for a modified algorithm.

An advantage with our tool is the high degree of control of the resulting architecture. No optimization stages are included in the tool. One argument for that is that we want the designer to create an efficient DSP architecture with an instruction set well suited for the application. When the architecture and the most important algorithms are in place the rest of the design can be made in software using a standard C compiler targeted against the chosen instruction set. This will, however, require that the tool can generate an instruction description file that fits the chosen compiler. This function has not been implemented yet.

The need for a tool with a high degree of interaction and a possibility for reprogramming the synthesized architecture has also been identified and incorporated in a design environment called AMICAL [52,53].

3 Variable Length Decoding

Variable length coding is an important method that is implemented in several of the standards used for saving bandwidth when transmitting images or video. As described in chapter 1, section 1.2, variable length coding is included in both the JPEG image coding standard [3] and the MPEG-2 video coding standard [4].

3.1 Variable length codes

Variable length codes (VLC) are used for reducing the redundancy in transmitted information in communication systems. In a variable length code, commonly used symbols are assigned shorter code words in order to minimize the total number of bits used to represent the information. The best-known type of VLC is the Huffman code, which is the code that reaches the smallest code size among the VLCs since it is constructed from the statistics of the symbol use [54]. A VLC can be represented as an unbalanced binary decision tree, whose leaves represent the symbols and where the paths from the root node to the leaves represent the code words, Fig. 3.1. For example, the symbol e will be represented by the code 110 when using the VLC in Fig. 3.1.

Figure 3.1 Example of tree representation of a variable length code. [Figure: binary decision tree from level 0 (root) to level 4 with the leaves a, b, c, d, e, f and g; the two branches of each node are labelled 1 and 0.]

3.2 The VLC decoding process

There are two basic architectures that have been used for VLC decoding. In the tree traversal method, one or a few bits at a time are used for traversing the binary decision tree that represents the VLC code. An overview of architectures implementing the tree traversal method can be found in [55,56]. The second architecture type is symbol parallel in the sense that one symbol is decoded at a time, normally by using a table look-up approach. This type of architecture is more common than the tree traversal method. Some examples are found in [57-64].

3.2.1 Tree based decoding

The tree based decoding method is basically a large state machine that, given the current state and the next $N_{bits}$ incoming bits, determines the next state ($Q^+$). If a leaf in the decision tree is reached the symbol has been found, which is signalled with a symbol_ready signal, see Fig. 3.2. The throughput of the decoder depends on the average code length of the input data, and the critical path of the tree traversal method is the time to find the next state, $T_{Q+}$. The output data rate of this VLC-decoder architecture will be variable. The maximum decoding rate will then be limited by

$$T_{min} = \frac{L_{ave} \times T_{Q+}}{N_{bits}} \qquad (3.1)$$

where $L_{ave}$ is the average code length, and $N_{bits}$ is the number of new bits that are decoded every cycle.

Since EQ 3.1 is a fundamental limit on this architecture the only feasible way to increase the throughput is to increase $N_{bits}$. Unfortunately, increasing $N_{bits}$ will increase the time required for updating the state, $T_{Q+}$, giving an optimum at some point.

Figure 3.2 Tree-based VLC decoding SFG. [Figure: the VLC logic takes $N_{bits}$ input bits and the current state $Q$ and produces the next state $Q^+$, which is fed back through a delay element $T$, together with the outputs symbol and symbol ready.]
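A minimal software model of tree based decoding, consuming one bit per cycle ($N_{bits} = 1$), is sketched below. It is an illustration only: the code words are the ones listed in Table 3.1, while the nested dictionary representation and the function names are assumptions made for the example and not part of any of the cited architectures.

```python
# Code words taken from Table 3.1.
CODE = {"a": "0", "d": "101", "e": "110", "b": "1000",
        "c": "1001", "f": "1110", "g": "1111"}


def build_tree(code):
    """Build the binary decision tree as nested dicts; leaves are symbols."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol
    return root


def tree_decode(bits, root):
    """Traverse the tree one bit per cycle; return symbols and cycle count."""
    symbols, state, cycles = [], root, 0
    for bit in bits:
        cycles += 1
        state = state[bit]             # next state Q+
        if isinstance(state, str):     # leaf reached: symbol_ready
            symbols.append(state)
            state = root
    if state is not root:
        raise ValueError("bit stream ends inside a code word")
    return symbols, cycles


symbols, cycles = tree_decode("0" + "101" + "1000" + "110", build_tree(CODE))
print(symbols, cycles)
```

The number of cycles equals the number of input bits, so the symbol rate varies with the code lengths in the stream, in line with EQ 3.1 for $N_{bits} = 1$.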

In [56] a pipelined tree-based coding architecture is presented. This architecture is pipelined such that the next state, $Q^+$, is fed forward to the following pipeline stage. This will make it possible to reduce $T_{Q+}$. The proposed architecture is, however, only suitable when multiple independent bit streams exist.

3.2.2 Symbol parallel decoding

The most commonly used architecture is the parallel decoding architecture where one or several symbols are decoded every cycle. A bit-vector of $N_{symb} \times L_{max}$ bits is fed to the VLC-decoder, where $L_{max}$ is the length of the longest VLC code and $N_{symb}$ is the number of symbols to decode in parallel. The decoder decodes the symbols, usually by using a table look-up technique [58-60,65]. Since the number of bits consumed for each symbol decoding varies with input data, the length of the decoded symbols must be calculated. The code length is fed back to the input buffer, which will throw away the used bits in the input vector, see Fig. 3.3.

The symbol decoding process can be pipelined, while the length decoding result must be fed back to the buffer before the next $N_{symb}$ symbols can be decoded. Hence, the critical path is found in the length decoder as

$$T_{min} = \frac{T_{Ldec} + T_{buf}}{N_{symb}} \qquad (3.2)$$

where $T_{Ldec}$ is the time it takes to decode the length of $N_{symb}$ symbols, and $T_{buf}$ is the time it takes for the buffer to throw away the used bits. The throughput can be increased by decoding more symbols every cycle, but this will also increase $T_{Ldec}$ and $T_{buf}$.

There are variants on this architecture where $N_{symb}$ varies with symbol length. In [64] an architecture is presented that in some special cases can decode several VLC codes in parallel. When one of the code words is a short code this is handled in parallel with the decoding of the following code.

Figure 3.3 Symbol parallel VLC decoding SFG. [Figure: the input buffer feeds the VLC logic, which outputs the symbol; the code_length output is fed back to the buffer through a delay element $T$.]
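The table look-up approach can be sketched for $N_{symb} = 1$ as follows. The example assumes the code words of Table 3.1, so that $L_{max} = 4$; the helper names `build_lut` and `table_decode` are hypothetical, and padding the last window with zeros is a simplification chosen for the example.

```python
# Code words taken from Table 3.1.
CODE = {"a": "0", "d": "101", "e": "110", "b": "1000",
        "c": "1001", "f": "1110", "g": "1111"}
L_MAX = max(len(word) for word in CODE.values())


def build_lut(code, l_max):
    """Map every l_max-bit window to (symbol, code_length)."""
    lut = {}
    for symbol, word in code.items():
        free = l_max - len(word)           # bits belonging to later symbols
        for tail in range(2 ** free):
            suffix = format(tail, "b").zfill(free) if free else ""
            lut[word + suffix] = (symbol, len(word))
    return lut


def table_decode(bits, lut, l_max):
    """Decode one symbol per look-up and discard the used bits."""
    symbols, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + l_max].ljust(l_max, "0")  # pad the last window
        symbol, code_length = lut[window]
        if pos + code_length > len(bits):
            raise ValueError("bit stream ends inside a code word")
        symbols.append(symbol)
        pos += code_length                 # length is fed back to the buffer
    return symbols


lut = build_lut(CODE, L_MAX)
decoded = table_decode("0" + "101" + "1000" + "110", lut, L_MAX)
print(decoded)
```

The update `pos += code_length` is the feedback loop of Fig. 3.3: the next look-up cannot start until the length of the current symbol is known, which is why the length decoder sets the critical path in EQ 3.2.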

3.3 VLC decoder with simplified length decoder

Both the tree based and symbol parallel decoding methods have a fundamental speed limit given by $T_{min}$ for the respective architectures.

In [15,16] we propose an architecture where features from the tree-based and symbol parallel VLC-decoder are combined, see Fig. 3.4. The proposed architecture uses a bit-serial input fed through a shift register that works as a serial to parallel converter. The parallel output from the shift register is connected to a length decoder. The time for the length decoder to decode the length of a symbol thereby varies with the symbol length. For example, a one bit code is decoded in one clock cycle, a two bit code is decoded in two clock cycles and so on, just as in tree based decoding. Each level in the binary decision tree has its own output from the length decoder, $L_i$, and the outputs are later combined using a multiplexer for choosing the right output. The output to choose is based on the number of clock cycles since the last decoded symbol. The first clock cycle after the last decoded symbol the output $L_1$ corresponding to a code length of one is selected by a multiplexer, the next cycle the output corresponding to a code length of two is chosen, $L_2$, and so on until a new symbol has been found. While the tree-based decoder only uses the previous state and the next incoming bit (or bits) to find the next state, the proposed architecture has a parallel input making it possible to decode the symbol in parallel. Since the input is bit-serial, while the decoding is in parallel, the decoder has $M$ cycles available to decode an $M$ bit code. This is an advantage if we assume that short codes are easier to decode fast than long codes.

The critical part of this architecture is the length decoder. The symbol decoder can be pipelined to reach sufficient speed, and also make use of information from the length decoder.

The critical path in this architecture is

$$T_{min} = \max_{\forall i} \left( \frac{T_{L_i} + T_{mux}}{i} \right) \qquad (3.3)$$

where $T_{L_i}$ is the time to find out if the code has the length of $i$ bits, and $T_{mux}$ is the multiplexer delay.

3.4 VLC decoder with pipelined length decoder

In [15,16] we propose an architecture where the length decoder is pipelined, see Fig. 3.5. Just as the architecture shown in Fig. 3.4 this architecture uses a bit-serial shift register at the input. But instead of using the output of the length decoder to find out when to start to decode the next symbol, the length decoder starts to decode a new symbol every clock cycle. The length decoding is started assuming that the first bit in the bit-serial shift register at the input is the first bit in the next symbol. In most cases this is not the case, and the result of the decoding is automatically discarded because of the way the counter that controls the multiplexer at the output of the length decoder works. This way of running the length decoder without having the input aligned to the symbols is denoted speculative decoding.

Since the length is not needed for restarting the length decoder, and only needed for indicating to the symbol decoder when to start decoding of a new symbol, the critical path in this architecture is from the input to the multiplexer that selects one of the $L_i$'s to the reset of the counter that controls the multiplexer, i.e.

$$T_{min} = T_{cntreset} + T_{mux}. \qquad (3.4)$$

The output of the length decoder can be used for synchronization of the symbol decoder or, if speculative decoding is used in the symbol decoder as well, to indicate when a valid output exists.

Figure 3.4 VLC decoder with varying rate length decoder. [Figure: blocks are a shift register with bit-serial input, a varying rate length decoder with the outputs $L_1$, $L_2$, ..., $L_M$, a multiplexer controlled by a counter with reset, a register loaded with the consumed bits, and a symbol decoder; signals are new_symb, symbol and symbol ready.]

To equalize the latency in the decoder the delay through the length decoder is different for different code lengths. The delay for a code length $i$ is restricted to be equal to $i \cdot T$ where $T$ is the clock period. That is, the pipeline depth is set equal to the code length.

3.5 VLC decoder with symbol decoder partitioning

In the architecture proposed in [17] the length decoder is used for sorting the input data into groups, where each group only consists of code words of certain lengths. In the same way as the time allowed for length decoding can be made proportional to the code length, the symbol decoding time can also be made proportional to the code length. For instance, a symbol with a code length of $M$ bits will not occur more often than at most every $M$th clock cycle, giving a decoder specialized for symbols with a length of $M$ or more bits $M$ clock cycles to complete the task. In the proposed architecture the length decoder executes first, and the symbol decoder starts when a new code word is available. Fig. 3.6 shows a solution with two symbol decoders, one fast for short VLC codes and one slower for longer VLC codes.

Figure 3.5 VLC decoder with pipelined, varying rate length decoder. [Figure: blocks are a pipelined varying rate length decoder with the outputs $L_1$, $L_2$, ..., $L_M$, a multiplexer controlled by a counter, and a symbol decoder; signals are new_symb, symbol and symbol ready.]

3.6 Length decoder implementation

Since there is one output for each level in the binary decision tree, the implementation of the length decoder can be made simple. The output $L_i$ from the length decoder must become one every time there is a code of length $i$ at the input. If the actual length is shorter than $i$ the value of $L_i$ is irrelevant. This makes it possible to use "don't care" in many of the positions in the truth table for the length decoder. In Table 3.1 the truth table of a length decoder for the example given in Fig. 3.1 is shown. The simplified boolean equations are given in EQ 3.5 - 3.8.

$$L_1 = \mathrm{not}(C_1) \qquad (3.5)$$

$$L_2 = 0 \qquad (3.6)$$

$$L_3 = C_2 \oplus C_3 \qquad (3.7)$$

$$L_4 = 1 \qquad (3.8)$$

Figure 3.6 VLC decoder with partitioned symbol decoder. [Figure: a pipelined varying rate length decoder with counter and control logic generates start_short and start_long for two symbol decoders, one for codes of 1 to N-1 bits and one for codes of N to $L_M$ bits, each with the outputs symbol and symbol ready.]

VLC code (C1 C2 C3 C4)    L1   L2   L3   L4   Symbol
0                         1    X    X    X    a
101                       0    0    1    X    d
110                       0    0    1    X    e
1000                      0    0    0    1    b
1001                      0    0    0    1    c
1110                      0    0    0    1    f
1111                      0    0    0    1    g

Table 3.1. Truth table for length decoder.
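EQ 3.5 - 3.8 can be checked with a short behavioural model of the varying rate length decoder. The sketch is an illustration under assumptions: the function names are hypothetical, bits that have not yet arrived are treated as zero, and the counter and multiplexer are modelled by the loop index `i` that selects output $L_i$ in the $i$th cycle after the last symbol.

```python
def length_outputs(c):
    """Outputs L1..L4 for the bits C1..C4 received so far (EQ 3.5 - 3.8)."""
    c1, c2, c3, _c4 = (list(c) + [0, 0, 0, 0])[:4]
    return [1 - c1,      # L1 = not(C1)
            0,           # L2 = 0, there is no code of length two
            c2 ^ c3,     # L3 = C2 xor C3
            1]           # L4 = 1, all remaining codes have length four


def decode_lengths(bits):
    """Return the code length of each symbol in a bit-serial stream."""
    lengths, pos = [], 0
    while pos < len(bits):
        for i in range(1, 5):              # counter: cycles since last symbol
            if pos + i > len(bits):
                raise ValueError("bit stream ends inside a code word")
            if length_outputs(bits[pos:pos + i])[i - 1]:   # multiplexer picks L_i
                lengths.append(i)
                pos += i                   # counter reset, next symbol starts
                break
    return lengths


# Code words a, d, b, e from Table 3.1: 0, 101, 1000, 110
stream = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
lengths = decode_lengths(stream)
print(lengths)
```

Each output $L_i$ is only inspected in cycle $i$, after all shorter lengths have been ruled out, which is why the "don't care" entries never influence the result.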

The complexity of calculating each output $L_i$ is mainly decided by the number of codes with length $i$. This is because of the large number of don't care positions in the truth table. If the truth table is sorted with increasing code length as in Table 3.1 the output $L_i$ will always have all don't care positions on the first rows in the table, followed by the codes where the output has to be one, and then the remaining rows will have the output zero.

3.7 Remarks

The proposed architectures are mainly suitable for fixed-function VLC decoders. A prototype chip implementing the static MPEG-2 Video VLCs has been reported in [16]. The MPEG-2 standard uses fixed VLCs, while this is not the case for the JPEG standard. The need for a fast VLC decoder is usually higher for decoding of video than for still images, which makes it relevant to study VLC decoders with fixed VLCs.

It is difficult, but not impossible, to make a good programmable solution using the proposed architectures. The difficulty is the length decoder, which must be made very parallel and fast. A possible solution that may be worth examining further is to use programmable logic to realize the length decoder.

4 Data Converters in Communication Systems

Analog-to-Digital (ADC) and Digital-to-Analog (DAC) converters are critical components in many communication systems. The current trend is to move more and more of the functionality of a communication system into the digital domain in order to provide an increased flexibility and reduce cost. In order to accomplish this the requirements on the data converters increase both in terms of higher accuracy and larger bandwidth.

In order to continue to push the data converter performance even further there is a need to handle problems caused by the processing of the chips. The variations in transistor parameters, especially for analog circuits, cause a degradation in the performance [66]. To increase the performance of data converters we believe that more attention must be put on optimizing the data converters against the target application. There exist many analog and digital calibration techniques that aim at reducing the matching error problems, but few methods take full advantage of the properties of the target application. We stress that in order to get the most out of digital calibration and error correction all available information about the process, the application, and the data converter architecture should be utilized as far as possible. In our case the target application is DSL based communication systems. In this chapter we propose methods that can be used to correct some of the problems in ADCs and DACs.

4.1 Analog-to-digital conversion

An ideal ADC is normally defined as a block that converts a continuous time signal to a discrete time signal with discrete amplitude, i.e. a digital signal. The analog-to-digital conversion is often split into two steps, where the first step converts the continuous time signal to a discrete time signal, i.e. a sample-and-hold step. The second step quantizes the amplitude continuous signal values. In Fig. 4.1 a model of the analog-to-digital conversion is shown. The quantization is usually modelled with an additive zero-mean Gaussian distributed noise source with the variance $\sigma^2$.

The quantization noise term depends on the number of bits that are used to represent the digital signal. This is usually referred to as the resolution of the ADC. In the ideal ADC the maximum quantization error $q(n) = s_q(n) - s(n)$ is in the range $[\pm\Delta/2]$ where $\Delta$ is defined as

$$\Delta = \frac{FS}{2^N} \qquad (4.1)$$

where $FS$ refers to the full scale input range and $N$ is the resolution.

Assuming a random input signal the noise will be equally distributed in the range $\pm\Delta/2$ and the variance, $\sigma^2$, will be

$$\sigma^2 = E[q^2(n)] = \int_{-\Delta/2}^{\Delta/2} q^2(n) \cdot \frac{1}{\Delta} \, dq = \frac{\Delta^2}{12} \qquad (4.2)$$

There are also many other types of noise sources and imperfections that degrade the performance. Here we differentiate between two types of error sources, 1) the static errors, which are not frequency dependent, and 2) the dynamic errors, which normally increase with frequency. In order to measure the performance of the ADC a number of measures have been defined. Some of the measures are listed below.

Figure 4.1 The analog-to-digital conversion process. [Figure: x(t) is sampled at t = nT to give x(nT), which is quantized (Q) to x(n); in the equivalent model the quantizer is replaced by an additive noise source with the variance $\sigma^2$.]
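Eq. 4.1 and Eq. 4.2 can be checked numerically with a few lines of code. The sketch below is only an illustration: the full scale range, the resolution and the mid-rise quantizer are assumptions chosen for the example, and a dense ramp is used as a stand-in for an input that is equally distributed over the full scale range.

```python
import math


def quantize(s, delta):
    """Ideal mid-rise quantizer with step size delta (assumed quantizer type)."""
    return delta * (math.floor(s / delta) + 0.5)


FS = 2.0                  # full scale input range (assumed value)
N = 8                     # resolution in bits (assumed value)
delta = FS / 2 ** N       # Eq. 4.1

# Dense ramp that covers the full scale range uniformly.
K = 2 ** N * 100
samples = [-FS / 2 + FS * (k + 0.5) / K for k in range(K)]
errors = [quantize(s, delta) - s for s in samples]   # q(n) = s_q(n) - s(n)

measured = sum(e * e for e in errors) / K
ideal = delta ** 2 / 12   # Eq. 4.2
print(measured, ideal)
```

The measured error power approaches $\Delta^2/12$ as the ramp is made denser, and no individual error exceeds $\Delta/2$ in magnitude.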

Differential nonlinearity (DNL)

The differential nonlinearity is defined as the deviation from the ideal step size $\Delta$ between two adjacent codes in the ADC [67], see Fig. 4.2.

$$DNL_i = X_{i+1} - X_i - \Delta, \quad i \in [0, 2^N - 1] \qquad (4.3)$$

Sometimes a normalized definition is used instead

$$DNL_i = \frac{X_{i+1} - X_i - \Delta}{\Delta}, \quad i \in [0, 2^N - 1] \qquad (4.4)$$

Integral nonlinearity (INL)

The integral nonlinearity is defined as the total deviation from the ideal value and can be expressed in terms of DNL by

$$INL_i = \sum_{k=0}^{i} DNL_k \qquad (4.5)$$

Figure 4.2 Non-ideal transfer function for a 2-bit ADC. [Figure: digital output versus analog input, with the ideal step size $\Delta$ and the deviations $DNL_1$ and $DNL_2$ marked.]

Spurious free dynamic range (SFDR)

The spurious free dynamic range is defined as the ratio of the power between the input signal and the largest spurious component within the frequency band. The SFDR expressed in dBc is

$$SFDR_{dBc} = 10 \log\left(\frac{\text{Signal Power}}{\text{Largest Spurious Power}}\right) \qquad (4.6)$$

Signal-to-noise ratio (SNR)

The signal-to-noise ratio (SNR) is the ratio between the signal power and the total noise power within a certain frequency band, excluding the harmonic components.

$$SNR_{dB} = 10 \log\left(\frac{\text{Signal Power}}{\text{Noise Power}}\right) \qquad (4.7)$$

Signal-to-noise and distortion ratio (SNDR)

The signal-to-noise and distortion ratio (SNDR) is the ratio between the signal power and the total noise power within a certain frequency band, including the harmonic components.

$$SNDR_{dB} = 10 \log\left(\frac{\text{Signal Power}}{\text{Noise and Distortion Power}}\right) \qquad (4.8)$$

Peak-to-average ratio (PAR)

The peak-to-average ratio of a signal gives information on how the signal is distributed over the amplitude range. A low PAR indicates a more uniform distribution of the amplitudes in the input signal. A high PAR indicates that high amplitudes may occur at the input, which affects the dynamic range that must be handled by the data converter.

The PAR is defined as

$$PAR = \frac{\text{peak amplitude}}{\text{rms value}} \qquad (4.9)$$

4.2 ADC errors

In most ADC architectures the conversion process consists of a) sampling of the input signal and b) comparing the sampled signal with a set of reference voltages. Depending on how the reference voltages are created and how the comparison is made we obtain different architectures. In a pipelined architecture one or a few bits are converted at each stage in a pipeline. In a time-interleaved ADC (TIADC) several ADCs are used in a time interleaved way. Both the pipelined and the time-interleaved ADC may be based on a simpler flash or a successive approximation conversion scheme. In a flash converter the sampled data is directly compared against all reference voltages. When using successive approximation the conversion is made by a binary search strategy applying one reference voltage at a time. Another type of ADC is the sigma-delta ADC, which works with oversampling. Since the principle relies on oversampling this architecture is less interesting when a high conversion speed is required.

Due to deviations from the ideal values in the components used for creating the reference voltages, the voltages will contain errors which will result in DNL and INL errors. These errors are independent of the frequency and are therefore referred to as static errors.
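How reference voltage errors translate into DNL and INL can be sketched with Eq. 4.3 to Eq. 4.5. The levels below are hypothetical values for a 2-bit converter with $\Delta = 0.25$ where one level is 0.0625 too high; the function name and the argument layout are assumptions made for the example.

```python
def dnl_inl(levels, delta, normalized=True):
    """Compute DNL and INL from the measured levels X_0, X_1, ..."""
    # Eq. 4.3: deviation of each step from the ideal step size.
    dnl = [levels[i + 1] - levels[i] - delta for i in range(len(levels) - 1)]
    if normalized:
        dnl = [d / delta for d in dnl]        # Eq. 4.4
    inl, total = [], 0.0
    for d in dnl:
        total += d                            # Eq. 4.5: running sum of DNL
        inl.append(total)
    return dnl, inl


delta = 0.25
levels = [0.0, 0.25, 0.5625, 0.75, 1.0]       # hypothetical measured levels
dnl, inl = dnl_inl(levels, delta)
print(dnl, inl)
```

A single misplaced level gives one step that is too wide followed by one that is too narrow, so the DNL shows a positive and a negative entry while the INL returns to zero after the affected codes.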

Other important error sources are the offset and gain errors. Analog circuits may have a DC offset which will result in an output signal even for zero input signal. The gain variations are normally caused by amplifiers or capacitors in the ADC, or by the sample-and-hold circuit. If the gain and offset errors are assumed to be frequency independent these errors can be modelled as

$$x_e(n) = g \cdot x(n) + o \qquad (4.10)$$

where $g$ is the gain, and $o$ is the DC offset. Both the gain error and the offset error may reduce the maximum voltage swing in the ADC.

Figure 4.3 Example of flash and successive approximation ADC principle. [Figure: in the flash ADC the sampled signal s(nT) is compared in parallel with reference voltages created from vref by a resistor ladder, and the comparator outputs are decoded; in the SA ADC the sampled signal s(nT) is compared with one value x at a time from a reference generator.]

At high input frequencies the dynamics of the circuitry become important. For example, the sample-and-hold circuit may not be fast enough to track the input signal, and the reference voltage generation may settle too slowly. These frequency dependent errors will at some frequency become dominant and will limit the bandwidth of the ADC.

4.3 Time-interleaved ADC

One way to increase the bandwidth of an ADC is to use several ADCs in parallel and to sample data in a time interleaved fashion, see Fig. 4.4. The conversion rate in each individual ADC is reduced to $f_s/M$ while the overall sample rate is kept at $f_s$, where $M$ is the number of ADCs that are used in parallel.

It is important that the differences between the ADCs in a TIADC are small since these differences will result in distortion.

4.3.1 Offset in TIADCs

A difference in offset between the ADCs in the TIADC will appear as a periodic output signal with a period of $M$ samples, Eq. 4.11.

$$x_{tiadc}(n) = \{x(T) + o_1, \ldots, x(MT) + o_m\} \qquad (4.11)$$

or expressed in the frequency domain

$$X_{tiadc}(e^{j\omega}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} X\left(\omega - k \cdot \frac{2\pi}{MT}\right) + O(e^{j\omega}) \qquad (4.12)$$

Figure 4.4 Time interleaved ADC. [Figure: the input x(t) is sampled by M parallel ADCs (ADC 0, ADC 1, ..., ADC M-1) at the time instants MT, (M+1)T, ..., (2M-1)T; their outputs $x_0(n)$, $x_1(n)$, ..., $x_{M-1}(n)$ are combined into $x_{TIADC}(n)$.]
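The periodic nature of the offset error can be made visible with a small numerical experiment. The sketch assumes $M = 4$ ADCs with hypothetical offsets and a zero input signal, and uses a direct DFT so that no external library is needed; it shows that the offset pattern only contributes energy at multiples of $f_s/M$.

```python
import cmath


def dft(x):
    """Direct DFT, adequate for the short sequence used here."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]


M = 4                                        # number of interleaved ADCs
offsets = [0.010, -0.020, 0.005, 0.015]      # hypothetical offsets o_i
N_POINTS = 32

# Zero input signal: the output is the offset pattern repeated with period M.
x_tiadc = [offsets[n % M] for n in range(N_POINTS)]

spectrum = [abs(value) / N_POINTS for value in dft(x_tiadc)]
tone_bins = [k for k, magnitude in enumerate(spectrum) if magnitude > 1e-9]
print(tone_bins)
```

With 32 DFT points the non-zero bins are spaced 32/4 = 8 bins apart, i.e. at DC and at the multiples of $f_s/M$. In a DMT system these are the carriers that are disturbed by the offset mismatch.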

There are analog offset cancellation techniques that can be used to reduce the offset differences in the analog circuitry [68]. An advantage of removing the offset in the analog domain is that the offset will not reduce the available input range. A disadvantage is that the analog offset cancellation increases the complexity, which may reduce the performance of the analog circuitry.

In [69] a mixed digital and analog technique is proposed where most of the work is made in the digital domain, in addition to some minor analog circuits. The input samples are multiplied with a random sequence $c(n) = \{1, -1\}$ using a modified sample-and-hold circuit. The samples observed at the output of one of the ADCs in the TIADC will be

$$x_c(n) = c(n) \cdot x_i(n) + o_i \qquad (4.13)$$

where $o_i$ is the offset added by the ADC. By choosing $c(n)$ so that its mean value is close to zero the mean value of Eq. 4.13 will approach $o_i$. A calibration unit continuously computes the mean value using a large number of samples, which is used as an estimate, $\hat{o}_i$, of $o_i$. The original signal is then recreated by a digital multiplication using the same sequence $c(n)$ as used in the sampling process, i.e.

$$x_{corr}(n) = c(n) \cdot (c(n) \cdot x_i(n) + o_i - \hat{o}_i) = x_i(n) + c(n) \cdot (o_i - \hat{o}_i) \qquad (4.14)$$

In Eq. 4.14 the fact that $c(n) \cdot c(n) = 1$ has been used.

In [24] we propose a purely digital method that takes advantage of the fact that the properties of the application are known. The application studied is the DMT modem and the offset is identified by using the symbol decoder. The FFT in the DMT receiver makes it possible to do the offset estimation in the frequency domain. There is no need for identifying the individual offset from each ADC; instead it is sufficient if the total offset contribution $O(e^{j\omega})$ can be identified and removed before the decoder. See Fig. 4.5 where the DMT modem outline from chapter 1 is repeated. The offset will move the received constellation points from their ideal position and make it more difficult to detect the transmitted information, see Fig. 4.6.
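Returning to the mixed-signal scheme of [69], Eq. 4.13 and Eq. 4.14 can be illustrated with a short numerical sketch. It is not the implementation from [69]: the test signal, the offset value, the block length and the use of a software random generator for $c(n)$ are all assumptions made for the example.

```python
import math
import random

random.seed(1)
N_SAMPLES = 20000
o_i = 0.1                                    # hypothetical ADC offset

# Input samples of one ADC and the random chopping sequence c(n).
x_i = [math.sin(2 * math.pi * 0.0137 * n) for n in range(N_SAMPLES)]
c = [random.choice((1, -1)) for _ in range(N_SAMPLES)]

# Eq. 4.13: the chopped signal passes through the ADC, which adds the offset.
x_c = [c[n] * x_i[n] + o_i for n in range(N_SAMPLES)]

# Calibration unit: the mean of many samples estimates the offset.
o_hat = sum(x_c) / N_SAMPLES

# Eq. 4.14: subtract the estimate and multiply with the same sequence again.
x_corr = [c[n] * (x_c[n] - o_hat) for n in range(N_SAMPLES)]

residual = max(abs(x_corr[n] - x_i[n]) for n in range(N_SAMPLES))
print(o_hat, residual)
```

The remaining error equals $|o_i - \hat{o}_i|$ for every sample, as Eq. 4.14 predicts, and it shrinks as more samples are averaged because the chopped signal contributes less and less to the mean.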

The offset signal is additive and independent of the input signal as shown in Eq. 4.12. This additive error will cause an offset in the constellation diagram at the frequencies $k \cdot f_s/M$ where $k$ varies between 1 and $M-1$. An example of how the offset error affects the received constellation point at the disturbed carriers is shown in Fig. 4.6. In [24] we show that the offset error can be identified and reduced if the magnitude of the error is reasonably small. The main result in the paper is that the error due to the difference between the decoded information and the received signal can be used for offset estimation. Taking the average value of the error between the detected signal and the received signal will identify the offset $O(e^{j\omega})$ assuming that the mean value of the noise is zero ($E[N(e^{j\omega})] = 0$), see Eq. 4.15.

$$E[S_{rec}(e^{j\omega}) - S_{dec}(e^{j\omega})] = E[N(e^{j\omega}) + O(e^{j\omega})] = E[N(e^{j\omega})] + E[O(e^{j\omega})] = E[O(e^{j\omega})] \qquad (4.15)$$

Figure 4.5 DMT modem. [Figure: block diagram of the transmit path (framer, encoders, interleaver, IFFT, DAC, analog front end) and the receive path (analog front end, ADC, EC, TEQ, FFT, FEQ, decoder, deinterleaver, RS-decoder, deframer) connected to the line.]

Figure 4.6 Effects of offset errors in a DMT modem when using a TIADC. [Figure: constellation diagram with Re and Im axes.]
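The averaging in Eq. 4.15 can be sketched for a single disturbed carrier. The example is a simplification and not the simulation from [24]: it assumes a 4-point constellation, Gaussian noise, a fixed complex offset on the carrier and a nearest-point decoder, and all names and values are hypothetical.

```python
import random

random.seed(7)
QPSK = [complex(re, im) for re in (-1, 1) for im in (-1, 1)]


def decide(point):
    """Decoder decision: the nearest constellation point."""
    return min(QPSK, key=lambda s: abs(point - s))


offset = complex(0.10, -0.05)     # hypothetical offset O on this carrier
n_symbols = 5000
error_sum = 0j
for _ in range(n_symbols):
    sent = random.choice(QPSK)
    noise = complex(random.gauss(0, 0.05), random.gauss(0, 0.05))
    s_rec = sent + offset + noise          # received point after the FFT
    s_dec = decide(s_rec)                  # decoded point
    error_sum += s_rec - s_dec             # equals N + O when the decision is right

offset_estimate = error_sum / n_symbols    # Eq. 4.15: the mean error approaches O
print(offset_estimate)
```

The estimate is only valid while the decisions are correct, which matches the condition above that the magnitude of the offset error must be reasonably small compared with the distance between constellation points.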

There is an error in the simulations shown in [24] which accidentally left about 10% of the offset. Later simulations have shown that the offset estimation can be made much better, with an error well below one per cent.

The offset error will decrease the SNDR at the receiver's end, but since the ADSL standard has been specified to adapt to a large range of different signal qualities it will still be possible to transmit data. As the offset estimate becomes more accurate the increased SNDR can be utilized to increase the bit rate.

4.3.2 Gain and sample timing mismatch

The timing of the sample clock will become more important in a TIADC than in other types of ADCs since an inexact timing will lead to unequal time intervals between the samples obtained by the different ADCs, Fig. 4.7.

When using a single ADC it is important to have a sample clock generator with low jitter, but in a TIADC it is also important to achieve a similar delay from the clock source to all sample-and-hold units in the TIADC to avoid nonuniform sampling with a period of $M$ cycles.

Considering a TIADC with $M$ channels with gain and timing mismatch, see Fig. 4.8, the output from the TIADC will be

$$x_{tiadc}(n) = \{g_1 \cdot x(T(1 + r_1)), \ldots, g_m \cdot x(TM(1 + r_m))\} \qquad (4.16)$$

and in frequency domain

$$X_{tiadc}(e^{j\omega}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} A_k(e^{j\omega}) \cdot X\left(\omega - k \cdot \frac{2\pi}{MT}\right) \qquad (4.17)$$

Figure 4.7 Sample timing mismatch. [Figure: amplitude of x(t) versus time t with the nominal sampling instants T, 2T, 3T and 4T and the skewed instants $T(1+r_0)$, $T(1+r_1)$, $T(1+r_2)$ and $T(1+r_3)$; the resulting amplitude error is marked $\Delta A$.]

where $A_k(e^{j\omega})$ is described by

$$A_k(\omega) = \frac{1}{M} \sum_{m=0}^{M-1} g_m \cdot e^{-j(\omega - k \cdot 2\pi/MT) r_m T} \cdot e^{-jkm \cdot 2\pi/M} \tag{4.18}$$

$g_m$ is the gain error in ADC number $m$. $r_m$ is the relative sampling error for each ADC.

Figure 4.8 Time interleaved ADC error sources. (Each channel $i$ samples $x(t)$ with its own timing error $r_i$ and has a gain $g_i$, an offset $o_i$ and a noise source $s_i^2$; the channel outputs are combined into $x_{TIADC}(n)$.)

4.3.3 Gain and timing mismatch effects on SNDR

The effects of gain and skew errors can be fatal for the performance of a TIADC. In [70] and [71] expressions for gain and skew errors have been derived.

The SNDR for a TIADC with only gain error was estimated in [70] to

$$SNDR = 20 \log_{10}\left(\frac{g}{\sigma_g}\right) - 10 \log_{10}\left(1 - \frac{1}{M}\right) \tag{4.19}$$

where $g$ is the average gain and $\sigma_g$ is its standard deviation. $M$ is the number of ADCs used in the TIADC. For a resolution of 10 bits, Eq. 4.19 shows that $\sigma_g$ should be kept smaller than 0.1%.

The SNDR for a TIADC with only timing skew error was approximated in [71] to

$$SNDR = 20 \log_{10}\left(\frac{1}{\sigma_t 2\pi f_{in}}\right) - 10 \log_{10}\left(1 - \frac{1}{M}\right) \tag{4.20}$$

where $\sigma_t$ is the standard deviation of the timing skew and $f_{in}$ is the input signal frequency. For a 10 bit resolution and a 20 MHz input signal $\sigma_t$ must be smaller than 8 ps.
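As a numerical check, Eq. 4.19 and Eq. 4.20 can be evaluated directly. The sketch below is only illustrative: the function names, the choice of M = 4 channels and the textbook 6.02N + 1.76 dB rule for an ideal N-bit quantizer are assumptions and are not taken from [70] or [71].

```python
import math

def sndr_gain_db(rel_sigma_g, m):
    """Eq. 4.19 with rel_sigma_g = sigma_g / g (relative gain spread)."""
    return 20 * math.log10(1 / rel_sigma_g) - 10 * math.log10(1 - 1 / m)

def sndr_skew_db(sigma_t, f_in, m):
    """Eq. 4.20 with sigma_t in seconds and f_in in Hz."""
    return (20 * math.log10(1 / (sigma_t * 2 * math.pi * f_in))
            - 10 * math.log10(1 - 1 / m))

def ideal_sndr_db(bits):
    """Textbook SNDR of an ideal quantizer driven by a full-scale sine."""
    return 6.02 * bits + 1.76

M = 4  # assumed number of interleaved ADCs
print(ideal_sndr_db(10))                 # about 62 dB for 10 bits
print(sndr_gain_db(1e-3, M))             # 0.1% gain spread: about 61 dB
print(sndr_skew_db(8e-12, 20e6, M))      # 8 ps skew at 20 MHz: about 61 dB
```

Under these assumptions both limits quoted above land within about one dB of the ideal 10 bit figure.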

4.3.4 Gain and timing mismatch cancellation

Gain and timing mismatch cause the same type of distortion, as described by Eq. 4.17 and Eq. 4.18.
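That both error types show up as the same kind of alias terms can be seen by evaluating $A_k$ from Eq. 4.18 numerically. The following is a minimal sketch; the function name, the test frequency and the mismatch values are hypothetical choices made for illustration only.

```python
import cmath
import math

def alias_gain(k, omega, gains, skews, T=1.0):
    """Evaluate A_k(omega) of Eq. 4.18 for channel gains g_m and skews r_m."""
    M = len(gains)
    shifted = omega - k * 2 * math.pi / (M * T)
    total = 0j
    for m in range(M):
        total += (gains[m]
                  * cmath.exp(-1j * shifted * skews[m] * T)
                  * cmath.exp(-1j * k * m * 2 * math.pi / M))
    return total / M

omega = 2 * math.pi * 0.1                        # assumed test frequency (T = 1)
cases = {
    "ideal": ([1.0] * 4, [0.0] * 4),
    "gain":  ([1.01, 0.99, 1.0, 1.0], [0.0] * 4),
    "skew":  ([1.0] * 4, [0.01, -0.01, 0.0, 0.0]),
}
for name, (g, r) in cases.items():
    print(name, [round(abs(alias_gain(k, omega, g, r)), 5) for k in range(4)])
```

Without mismatch only $A_0$ is nonzero; with either gain or timing mismatch the terms for $k \neq 0$ become nonzero, which corresponds to the spectral images in Eq. 4.17.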

The skew problem is, however, more difficult to handle and it is shown in [72] that it is not possible to achieve perfect reconstruction in a conventional TIADC by only using linear filtering.

In [73,74] the skew errors are corrected using polynomial interpolation, which is a well-known method for estimating signal values at intermediate time instants, see Fig. 4.9. To measure the skew the authors propose to use a ramp as training signal, but this can be difficult to do in a communication environment. In [74] it is shown that the degree of the polynomial becomes high if the input frequency is close to the Nyquist frequency. For instance, for an SFDR of 100 dB an interpolation degree of 32 is needed when the input signal is oversampled 1.5 times.
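The principle of polynomial interpolation for skew correction can be sketched as follows. This is not the implementation of [73,74]: it assumes that the skews are already known, uses plain Lagrange interpolation over a sliding window of degree + 1 samples, and models sample n as taken at time (n + r) T, all of which are illustrative assumptions.

```python
import math

def lagrange_value(times, values, t):
    """Value at time t of the polynomial through (times[i], values[i])."""
    result = 0.0
    for i, (ti, vi) in enumerate(zip(times, values)):
        weight = 1.0
        for j, tj in enumerate(times):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        result += vi * weight
    return result

def correct_skew(samples, skews, T=1.0, degree=4):
    """Estimate x(nT) from samples taken at (n + skews[n % M]) * T.

    degree must be even; edge samples without a full window stay unchanged.
    """
    M, half = len(skews), degree // 2
    times = [(n + skews[n % M]) * T for n in range(len(samples))]
    corrected = list(samples)
    for n in range(half, len(samples) - half):
        window = slice(n - half, n + half + 1)
        corrected[n] = lagrange_value(times[window], samples[window], n * T)
    return corrected

def x(t):
    """Assumed test signal: a sine oversampled ten times."""
    return math.sin(2 * math.pi * 0.05 * t)

skews = [0.0, 0.02, -0.03, 0.01]                 # assumed known skews
raw = [x(n + skews[n % 4]) for n in range(64)]
fixed = correct_skew(raw, skews)
err_raw = max(abs(raw[n] - x(n)) for n in range(2, 62))
err_fixed = max(abs(fixed[n] - x(n)) for n in range(2, 62))
print(err_raw, err_fixed)
```

For this heavily oversampled signal a low degree is sufficient; as noted above, the required degree grows quickly when the input frequency approaches the Nyquist frequency.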

In [75] a method for perfect reconstruction in the presence of timing mismatch is presented. The method uses a modified DFT, which modifies the distortion terms in Eq. 4.18 to become frequency independent, and the correct spectrum can then be computed. The method requires a modified and computationally heavy DFT to correct the samples. Also here a training signal is used for identification of the timing mismatch. See also [76], where a timing estimation algorithm based on a training signal and a DFT is given, and [77], where a method for determining the standard deviation of the timing error is discussed.

Figure 4.9 Timing mismatch correction using polynomial interpolation. (The outputs $x_0(n), \ldots, x_{M-1}(n)$ of ADC 0 to ADC M-1 are fed, together with the estimated timing mismatch, to a polynomial interpolator that produces $x_{TIADC}(n)$.)

A statistical method is presented in [78]. The method is based on the fact that, in a statistical sense, the change in amplitude is approximately proportional to the distance between the samples. By calculating the mean square of the amplitude difference between two adjacent ADCs and comparing this with the other ADCs, an estimate of the timing mismatch is found. The main limitation of this method is that, for the algorithm to work well, most of the signal energy should be concentrated below $f_s/6$.
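The statistical idea can be illustrated with a small experiment. The sketch below is not the algorithm of [78]; it only assumes that the RMS difference between adjacent samples is proportional to their time distance and that the M gaps in one frame add up to M sample periods. The test signal and skew values are hypothetical.

```python
import math

def estimate_skews(samples, M):
    """Estimate per-channel skews in units of T, relative to ADC 0."""
    frames = len(samples) // M - 1
    rms = []
    for i in range(M):
        acc = 0.0
        for k in range(frames):
            diff = samples[k * M + i + 1] - samples[k * M + i]
            acc += diff * diff
        rms.append(math.sqrt(acc / frames))
    # The M gaps sum to M*T, so normalise the RMS differences to that sum.
    gaps = [M * d / sum(rms) for d in rms]
    skews = [0.0]
    for i in range(M - 1):
        skews.append(skews[-1] + gaps[i] - 1.0)
    return skews

true_skews = [0.0, 0.02, -0.03, 0.01]            # assumed mismatch
f = 0.0437                                       # cycles per sample, below fs/6
samples = [math.sin(2 * math.pi * f * (n + true_skews[n % 4]))
           for n in range(16000)]
print(estimate_skews(samples, 4))
```

The proportionality only holds for slowly varying signals, which is in line with the limitation that most of the signal energy should lie at low frequencies.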

In [26] we propose a method for estimating and correcting both timing and gain mismatch. The proposed method takes the application into consideration, and uses the decoder in a DMT or OFDM modem for extracting the noise on each frequency. An adaptive algorithm estimates the mismatch distortion, which is then used for increasing the SNDR in the modem.

The distortion described by Eq. 4.17 and Eq. 4.18 can be treated as information leakage from one carrier frequency to another. Since the different carriers in the DMT modem can be considered independent of each other, any correlation between two carriers is caused by gain and/or timing mismatch (see also 1.3 where the DMT technique is described). The correlation between two carriers is identified using the Least Mean Square (LMS) algorithm, and most of the distortion can be cancelled.
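How an LMS loop can identify the leakage between two carriers is sketched below with a toy model. It is not the algorithm of [26]: the single complex leakage coefficient, the QPSK symbols, the noise level and the step size are all assumptions made for illustration.

```python
import random

def lms_leakage(source, observed, mu=0.05):
    """Adapt a complex coefficient c so that c * source tracks observed."""
    c = 0j
    for s, e in zip(source, observed):
        residual = e - c * s                 # error not yet explained by c
        c += mu * residual * s.conjugate()   # complex LMS update
    return c

rng = random.Random(1)
true_c = 0.01 - 0.004j                       # assumed leakage between carriers
qpsk = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(2000)]
noise = [complex(rng.gauss(0, 1e-3), rng.gauss(0, 1e-3)) for _ in qpsk]
observed = [true_c * s + n for s, n in zip(qpsk, noise)]
print(lms_leakage(qpsk, observed))
```

Here `qpsk` plays the role of the decided symbols on the disturbing carrier and `observed` the decision error on the disturbed carrier; once the coefficient has converged, the leaked part can be subtracted.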

4.4 Digital-to-analog conversion

An ideal digital-to-analog converter transforms a digital representation of the signal to an analog representation. The analog representation is normally in terms of a current or a voltage level. A commonly used DAC architecture is the so-called current-steering DAC [67], where an analog current is generated by a sum of current sources that are controlled by the digital input. This operation can in the static case be described as

$$A_{out}(nT) = \sum_{m=1}^{M} b_m(nT) \cdot w_m \tag{4.21}$$

where $A_{out}(nT)$ is the settled output amplitude at the time instants $nT$, $M$ is the number of bits in the input word, which contains the bits $b_m(nT)$, and $w_m$ are the internal DAC weights. $b_M$ is referred to as the most significant bit (MSB) and $b_1$ is the least significant bit (LSB). For a binary offset input word, we have that $M = N$ and $w_m = 2^{m-1}$. For a thermometer code input, we have that $M = 2^N - 1$ and $w_m = 1$. An example of a current-steering binary offset coded DAC and a thermometer coded DAC is shown in Fig. 4.10.
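The static model in Eq. 4.21 can be written out for the two codings. This is a minimal sketch with hypothetical helper names; list index 0 holds $b_1$.

```python
def binary_word(value, n_bits):
    """Bits b_1 (LSB) ... b_N (MSB) with weights w_m = 2**(m-1)."""
    bits = [(value >> m) & 1 for m in range(n_bits)]
    weights = [2 ** m for m in range(n_bits)]
    return bits, weights

def thermometer_word(value, n_bits):
    """2**N - 1 unit-weight bits, the lowest `value` of them set to one."""
    length = 2 ** n_bits - 1
    bits = [1 if m < value else 0 for m in range(length)]
    return bits, [1] * length

def dac_output(bits, weights):
    """Static output A_out = sum over m of b_m * w_m (Eq. 4.21)."""
    return sum(b * w for b, w in zip(bits, weights))

for value in range(8):
    print(value,
          dac_output(*binary_word(value, 3)),
          dac_output(*thermometer_word(value, 3)))
```

With ideal weights both codings give the same output; they differ only in how many elements switch between two samples and in how weight errors map to the output.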

Figure 4.10 Example of a) a binary weighted and b) a thermometer coded current-steering DAC. (In a) the bits $b_1, \ldots, b_N$ switch current sources $2^0 I_0, \ldots, 2^{N-1} I_0$ onto $I_{out}$; in b) the bits $b_1, \ldots, b_{2^N-1}$ each switch a unit current source $I_0$.)

4.4.1 Error sources

There are two error sources in a DAC structure that are considered in this work: glitches and mismatch of current sources. A glitch occurs when the output codeword is temporarily wrong during the transition from one sample to the next. For instance, the binary offset coded word <00111> may become <11111> for a short time when toggling to <11000>. A common solution to this problem is to use thermometer coding at the input of the DAC. A thermometer code is characterized by all current sources having equal weights and by all bits with the value one being concentrated in a contiguous part of the code, see Table 4.1. A transition from one sample to the next cannot cause intermediate values to occur at the output since there is only one type of transition from one sample to the next. That is, either a number of bits toggle from zero to one, or a number of bits toggle from one to zero, but never both in the same transition.

Table 4.1. Example of thermometer code.

decimal representation    thermometer code representation
0                         000
1                         001
2                         011
3                         111
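The glitch example above can be reproduced by counting which bits rise and which fall in one transition. This is a small sketch with hypothetical helper names; the intermediate word `old | new` models the assumed worst case in which all rising bits switch before any falling bit.

```python
def to_bits(value, width):
    """Binary offset word, index 0 is the LSB."""
    return [(value >> i) & 1 for i in range(width)]

def thermometer(value, width):
    """Thermometer word with `value` ones in the lowest positions."""
    return [1 if i < value else 0 for i in range(width)]

def toggles(old_bits, new_bits):
    """Count bits that rise (0 -> 1) and fall (1 -> 0) in one transition."""
    pairs = list(zip(old_bits, new_bits))
    rising = sum(1 for pair in pairs if pair == (0, 1))
    falling = sum(1 for pair in pairs if pair == (1, 0))
    return rising, falling

old, new = 0b00111, 0b11000
print(toggles(to_bits(old, 5), to_bits(new, 5)))            # (2, 3)
print(toggles(thermometer(old, 31), thermometer(new, 31)))  # (17, 0)
print(format(old | new, "05b"))                             # 11111
```

The binary word has bits moving in both directions, so a wrong intermediate code such as <11111> is possible, while the thermometer word only has rising bits in this transition.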

Normally only the most significant bits in the binary offset coded input to the DAC are converted to thermometer code, since this is an expensive operation. A common configuration is to use thermometer code for 5-6 of the most significant bits, and binary offset coding for the remaining bits [67], see Fig. 4.11. This hybrid solution is called an M-bit segmented DAC, where M refers to the number of binary bits that have been translated to thermometer code.
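The segmentation of an input word can be sketched as below. The function names and the 12 bit, M = 5 example are assumptions; the model is ideal, without mismatch or delay matching.

```python
def segment(value, n_bits, m_bits):
    """Split an N-bit word into a thermometer part (M MSBs) and binary LSBs."""
    lsb_bits = n_bits - m_bits
    msb_value = value >> lsb_bits
    lsb_value = value & ((1 << lsb_bits) - 1)
    thermo = [1 if i < msb_value else 0 for i in range(2 ** m_bits - 1)]
    binary = [(lsb_value >> i) & 1 for i in range(lsb_bits)]
    return thermo, binary

def segmented_output(thermo, binary):
    """Ideal output: each thermometer element weighs 2**(N-M) LSBs."""
    unit = 2 ** len(binary)
    return unit * sum(thermo) + sum(b << i for i, b in enumerate(binary))

thermo, binary = segment(2925, n_bits=12, m_bits=5)
print(len(thermo), len(binary), segmented_output(thermo, binary))
```

For M = 5 the thermometer part needs 31 unit elements, compared with 4095 for a fully thermometer coded 12 bit DAC, which illustrates why only the MSBs are normally converted.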

Mismatches in the sizes of the current sources will, as in the case of the reference voltage mismatch in the ADC, cause DNL errors.

Figure 4.11 Multi-segmented DAC structure with the M MSBs thermometer coded. (The M MSBs of the N-bit input X pass through a thermometer encoder to a thermometer coded DAC with $2^M - 1$ inputs; the N-M LSBs pass through a delay to a binary-weighted DAC; the two DAC outputs are summed to $A_{out}$.)

4.4.2 Scrambling

It is difficult to measure the output from a DAC without using an ADC with even better performance, and therefore it is also difficult to use pure digital methods to calibrate a DAC. In order to improve the SFDR it has instead been suggested to use scrambling, so that the direct relation from an input value to the size of the DNL error and the glitch will be removed. A scrambler will select which current sources to use for a given input value in a random way, and therefore the size of the error will be less correlated to a given input value, and is therefore spread in the frequency domain. This method is commonly referred to as Dynamic Element Matching (DEM) and was originally a mixed analog-digital method [79], but today it is most common to use the digital DEM technique [80-84]. A comparison between different DEM methods is made in [85], see Fig. 4.12.

In Fig. 4.13 a simulation of a 12 bit DAC with a mismatch of 1% in the weights is shown. The simulation shows the effect of spreading of the distortion when using a DEM technique. Note that DEM does not reduce the total noise power caused by the matching errors; instead the noise is spectrally moved and can be either filtered away or “hidden” in the quantization noise floor.
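The decorrelating effect of DEM can be illustrated with a much smaller model than the simulated 12 bit DAC. The sketch below assumes a 3 bit thermometer coded DAC with Gaussian weight mismatch and fully random element selection; all names and numbers are illustrative.

```python
import random

def make_sources(n_sources, sigma, rng):
    """Unit current sources with Gaussian relative mismatch (assumed model)."""
    return [1.0 + rng.gauss(0.0, sigma) for _ in range(n_sources)]

def convert_fixed(code, sources):
    """Thermometer DAC without DEM: always the first `code` sources."""
    return sum(sources[:code])

def convert_dem(code, sources, rng):
    """Random DEM: any `code` of the sources, chosen anew for each sample."""
    return sum(rng.sample(sources, code))

rng = random.Random(7)
sources = make_sources(7, 0.01, rng)          # 3 bit DAC, 1% mismatch
code = 3
fixed_err = convert_fixed(code, sources) - code
dem_errs = [convert_dem(code, sources, rng) - code for _ in range(20000)]
print(fixed_err, sum(dem_errs) / len(dem_errs))
```

Without DEM the same code always gives the same error, which produces harmonic distortion. With DEM the error for a given code varies from sample to sample around a value that is just a gain error, so the mismatch appears as noise instead.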

Figure 4.12 DAC with scrambler. (The N-bit input $x(n)$ passes a digital encoder consisting of a thermometer encoder and a scrambler; the encoder outputs $x_1(n), \ldots, x_M(n)$ drive M 1-bit DACs whose outputs $y_1(n), \ldots, y_M(n)$ are summed to $y(n)$.)

An efficient way of realizing a scrambler is by using a set of switches with two inputs and two outputs. In Fig. 4.14 an example with a 3 bit thermometer encoded DAC is shown. An extra zero is put into one of the switches, making the net of switches more symmetric. Each switch is controlled by a signal that controls which of the inputs $(a, b)$ is fed to which of the outputs $(x, y)$. If the control signal $p$ is chosen in a random way the matching error will be decorrelated from the signal value.
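A network of two-input, two-output switches can be modelled as below. The butterfly wiring used here is an assumption and may differ from the net in Fig. 4.14; the sketch only shows that such a net permutes the thermometer bits without changing the number of ones.

```python
import random

def butterfly_scramble(bits, controls):
    """Pass 2**S signals through S stages of 2x2 swap switches."""
    bits = list(bits)
    n = len(bits)
    stages = n.bit_length() - 1
    assert n == 1 << stages, "number of inputs must be a power of two"
    it = iter(controls)
    for s in range(stages):
        span = 1 << s
        for a in range(n):
            b = a ^ span
            if a < b and next(it):           # control p = 1 swaps the pair
                bits[a], bits[b] = bits[b], bits[a]
    return bits

rng = random.Random(3)
thermo = [1, 1, 1, 0, 0, 0, 0]               # code 3 of a 3 bit DAC
padded = thermo + [0]                        # extra zero gives eight inputs
controls = [rng.randint(0, 1) for _ in range(12)]   # 3 stages x 4 switches
print(butterfly_scramble(padded, controls))
```

With eight inputs the net needs three stages of four switches, so twelve control signals $p$ are drawn per sample in this model.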

Another solution is to choose $p$ such that noise shaping is achieved. In [81,83] noise shaping DAC architectures are shown, and in [84] a more general discussion on how to choose $p$ for various shaping techniques is given.

Figure 4.13 Simulation a) without and b) with DEM. (Panels “No DEM” and “With DEM”; PSD [dB/Hz] versus normalized frequency from 0 to 0.5, with the distortion terms marked.)

A problem arises when trying to combine scrambling and glitch reduction. If thermometer coded data is scrambled in a random way the glitch power will increase, and the main advantage of using thermometer code disappears. To combine scrambling with glitch reduction the scrambling must be restricted so that the glitch power does not become too large. One possibility is to reduce the rate at which the scrambling is done, for instance by only selecting a new $p$ every second time, which will decrease the glitch power compared with the case where $p$ toggles every time. However, the randomization effect will also decrease since the matching error becomes less random.

In [27,28] we propose two architectures for scrambling of data where 1) scrambling of the distortion is achieved and 2) the glitches are kept at a minimum. The key idea is to only scramble the difference between two adjacent samples. As many as possible of the current sources are kept in their old state, and the current sources that have to be turned off (or on) are randomly selected among all possible ones. The advantage of the method is that glitches are kept at a low level, but there is also a drawback caused by the fact that the degree of randomization depends on the input signal. The error from a slowly varying signal also changes slowly, while a fast varying signal will randomize the distortion better.
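The key idea can be expressed in a few lines. The sketch below is a behavioural model only and not one of the architectures in [27,28]; the function name and the test sequence are hypothetical.

```python
import random

def restricted_update(state, new_code, rng):
    """Change as few unit sources as possible to reach `new_code` ones.

    Sources that must be turned on (or off) are drawn at random among
    those that are currently off (or on); all others keep their state.
    """
    state = list(state)
    current = sum(state)
    if new_code > current:
        off = [i for i, s in enumerate(state) if s == 0]
        for i in rng.sample(off, new_code - current):
            state[i] = 1
    elif new_code < current:
        on = [i for i, s in enumerate(state) if s == 1]
        for i in rng.sample(on, current - new_code):
            state[i] = 0
    return state

rng = random.Random(5)
state = [0] * 7                               # 3 bit thermometer coded DAC
for code in [3, 4, 2, 2, 6]:
    new_state = restricted_update(state, code, rng)
    changed = sum(a != b for a, b in zip(state, new_state))
    print(code, new_state, changed)
    state = new_state
```

The number of toggling sources always equals the difference between the two codes, as for plain thermometer code, while the set of active sources still drifts randomly. When the input is constant nothing changes, which is the signal dependence mentioned above.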

The method was originally developed aiming at improving the performance of DACs targeted at the VDSL application. However, since the DMT signal consists of a large number of carriers, the total distortion at a given frequency is the sum of a large number of distortion terms. Many of the distortion terms can be considered to be independent of each other, which makes the distortion look much like additive Gaussian noise. This makes the proposed method less suitable for DSL applications since the method converts distortion that already looks like noise to noise again.

Figure 4.14 Scrambler for a 3-bit DAC with thermometer encoded input. (The thermometer bits $t_0, \ldots, t_6$ and an extra 0 pass through a net of switches, each with inputs $a$, $b$, outputs $x$, $y$ and control signal $p$.)

One application where restricted scrambling has turned out to be interesting is radio architectures where the first up-conversion stage is done in the digital domain, Fig. 4.15, [86]. The relatively narrow signal band is located at high frequencies, while there is a large frequency band without signal into which the distortion can be spread.

Figure 4.15 Radio transmitter with digital IF mixer. (Blocks: upsampling by M and filters H(z) on the I and Q branches, digital mixer with cos(w_if t) and sin(w_if t), IF-DAC, H(s), RF front end with sin(w_rf t), PA.)

5 Author's Contribution to Published Work

In this section the Author's contribution to the published work is clarified for each publication.

Pub. 1. New Approaches to High Speed Huffman Decoding [15]

The publication presents an architecture idea that originates from the Author.

Pub. 2. Implementation of a Fast MPEG-2 Compliant Huffman Decoder [16]

This publication presents an implementation of the architecture idea shown in Pub. 1. The Author was responsible for all simulation and implementation work.

Pub. 3. High Speed Pipelined Parallel Huffman Decoding [17]

In this publication the high speed Huffman architectures are further developed by the Author. All main contributions to this work originate from the Author.

Pub. 4. Design of A JPEG DSP Using the Modular Digital Signal Processor Methodology [19]

This work was made within a cooperation project between Linköping University and Ericsson Microelectronics AB. A new design methodology was to be evaluated in a case study. The Author made the initial hardware partitioning together with K-G Andersson. The Author was also responsible for the design of one of the two processor cores that were designed (the IDCT core).


Pub. 5. Design and Implementation of an FFT Processor for VDSL [20]

The FFT processor was designed by the Author, but with help from people at Ericsson Microelectronics AB with the design flow and design environment. Anders Wass et al. supported the design environment, which made it possible to include new features in the MDSP methodology during the project. All authors participated in the implementation process.

Pub. 6. Application driven DSP Hardware Synthesis [21]

The idea of how to make a synthesis tool for the MDSP methodology came from the Author, while the implementation was made by Mikael Hjelm. Mikael Hjelm also contributed valuable ideas on how to solve some of the problems that arose during the work.

Pub. 7. ADC Offset Identification and Correction in DMT Modems [24]

All work, from idea to simulations, was carried out by the Author.

Pub. 8. Correction of Mismatch in Time Interleaved ADCs [26]

All work, from idea to implementation, was carried out by the Author.

Pub. 9. Glitch Minimization and Dynamic Element Matching in D/A Converters [27]

The idea to reduce glitches when doing scrambling came from J. Jacob Wikner et al., while the Author came up with the architecture presented in this publication. The analysis and simulations were made by the Author.

Pub. 10. Dynamic Element Matching in D/A Converters with Restricted Scrambling [28]

This publication presents an alternative architecture for reducing glitches when performing scrambling in DACs. Mark Vesterbacka came up with the architecture and made the simulations verifying the ideas. The Author contributed to this work with the DMT simulations and discussions on how to use the method in a system.

6 Conclusions

Signal processing is used in all electronic communication systems and it is therefore important to have architectures that are efficient for implementing DSP algorithms. It is also important to have an efficient design flow for implementing the algorithms.

In three papers we propose architectures suitable for VLC decoding [15,16,17]. We have shown how the critical loop in the VLC decoder can be broken up, which in theory increases the achievable decoding rate. We have also shown how to parallelize the symbol decoder to reduce the data rate through each decoder, which makes the decoders easier to implement. One implementation has been made to verify the ideas.

A hardware-software co-design methodology aimed at application specific DSPs has been verified and improved. A JPEG decoder DSP has been designed which shows how to combine programmability and performance [19]. An FFT DSP for the VDSL application has been designed, implemented, and verified [20]. A tool for improving the hardware-software co-design methodology has also been implemented [21]. The tool supports the designer with hardware generation by trying to capture the designer's intentions through a simple set of rules that can easily be overridden by the designer. The DSP work presented in this thesis should be seen as an additional piece of the big puzzle of creating a more efficient DSP design methodology.

Digital methods to improve the performance of data converters have also been proposed. Increasing data converter performance using digital methods is attractive since modern CMOS technology allows high processing capability. It has been shown how it is possible to co-optimize an A/D converter with the application, giving an efficient way to cancel errors in a time-interleaved A/D converter. A method to identify and cancel offset differences in the time-interleaved A/D converter is proposed in [24], and a method for cancelling time and gain mismatch is proposed in [26]. Both proposed methods are targeted at systems using DMT modulation, but can also be used in OFDM based systems. Using the proposed methods for correcting mismatch in time-interleaved ADCs it is possible to look for more wideband receiver architectures, since the effective sample rate can be increased without the usual performance degradation of a time-interleaved ADC. New receiver architectures can provide greater flexibility since more functionality can be placed in the digital domain, which in turn will require more efficient programmable DSP architectures.

Another purely digital method that is proposed is the restricted DEM method, where glitch performance and weight mismatch can be balanced against each other [27,28]. It has been shown how current sources can be dynamically matched while preserving a low glitch energy. We believe that the restricted DEM technique is very well suited for high frequency DACs aimed at radio applications.

The presented ADC and DAC work are examples of how it is possible to get a performance increase in data converters using pure digital techniques. When the data converter requirements continue to increase and the process technologies limit the performance, techniques such as the ones presented in this thesis may be the way forward.

References

[1] C. E. Shannon, “Communication in the Presence of Noise,” Proc. IRE, Vol. 37, pp. 10-21, Jan. 1949.

[2] ETSI, Group Speciale Mobile or Global System of Mobile Communication (GSM) Recommendation, 1988, France.

[3] ISO/IEC 10918-1: Digital Compression and Coding of Continuous-Tone Still Images (JPEG), Feb. 1994.

[4] ISO/IEC DIS 13818-2: Generic Coding of Moving Pictures and Associated Audio Information, part 2: Video, (MPEG-2), June 1994.

[5] S. Haykin, Digital Communications, John Wiley and Sons, 1988.

[6] J. Gibson, The Mobile Communications Handbook, CRC Press, 1996.

[7] ANSI T1.413-1998, “Network and Customer Installation Interfaces: Asymmetrical Digital Subscriber Line (ADSL) Metallic Interface,” American National Standards Institute.

[8] “VDSL Coalition Technical Draft Specification (Version 5),” Tech. Rep. 983t8, ETSI TM6, Luleå, Sweden, June 1998.

[9] T. Starr, J. M. Cioffi, and J. Silverman, Understanding Digital Subscriber Line Technology, Prentice-Hall, 1999.

[10] W. Y. Chen, DSL Simulation Techniques and Standards Development for Digital Subscriber Line Systems, Macmillan Technical Publishing, 1998.

[11] D. J. Rauschmayer, ADSL/VDSL Principles, Macmillan Technical Publishing, 1999.

[12] F. Sjöberg, The Zipper Duplex Method in Very High-Speed Digital Subscriber Lines, Luleå University of Technology, 2000.

[13] K. K. Parhi, VLSI Digital Signal Processing Systems - Design and Implementation, Wiley, 1999.

[14] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

[15] M. K. Rudberg and L. Wanhammar, “New Approaches to High Speed Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'96, Vol. 2, pp. 149-52, Atlanta, USA, May 1996.

[16] M. K. Rudberg and L. Wanhammar, “Implementation of a Fast MPEG-2 Compliant Huffman Decoder,” Proc. of European Signal Processing Conf., EUSIPCO'96, Trieste, Italy, Sept. 1996.

[17] M. K. Rudberg and L. Wanhammar, “High Speed Pipelined Parallel Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'97, Vol. 3, pp. 2080-83, Hong Kong, June 1997.

[18] M. K. Rudberg, System Design of Image Decoder Hardware, LiU-Tek-Lic-1997:657, Department of Electrical Engineering, Linköping University, Dec. 1997.

[19] K-G Andersson, M. K. Rudberg, and A. Wass, “Design of A JPEG DSP Using the Modular Digital Signal Processor Methodology,” Proc. of Intern. Conf. on Signal Processing Applications and Technology, ICSPAT'97, Vol. 1, pp. 764-68, San Diego, CA, USA, Sep. 1997.

[20] M. K. Rudberg, M. Sandberg, and K. Ekholm, “Design and Implementation of an FFT Processor for VDSL,” Proc. of IEEE Asia-Pacific Conference on Circuits and Systems, APCCAS'98, pp. 611-14, Chiangmai, Thailand, Nov. 1998.

[21] M. K. Rudberg and M. Hjelm, “Application driven DSP Hardware Synthesis,” Proc. of IEEE Nordic Signal Processing Symp. (NORSIG2000), Kolmården, Sweden, June 2000.

[22] K-G Andersson, A. Wass and K. Parmar, “A Methodology for Implementation of Modular Digital Signal Processors,” Proc. of Intern. Conf. on Signal Proc. Applications & Technology, ICSPAT'96, Boston, MA, Oct. 1996.

[23] K-G Andersson, Implementation and Modeling of Modular Digital Signal Processors, LiU-Tek-Lic-1997:09, Department of Electrical Engineering, Linköping University, March 1997.

[24] M. K. Rudberg, “ADC Offset Identification and Correction in DMT Modems,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'00, Vol. 4, pp. 677-80, Geneva, May 2000.

[25] M. K. Rudberg, “A/D omvandlare,” Swedish patent number 9901888-9, 25 May 1999.

[26] M. K. Rudberg, “Correction of Mismatch in Time Interleaved ADCs,” Proc. of IEEE Intern. Conf. on Electronics, Circuits & Systems, Malta, Sept. 2001.

[27] M. K. Rudberg, M. Vesterbacka, N. Andersson, and J. J. Wikner, “Glitch Minimization and Dynamic Element Matching in D/A Converters,” Proc. of IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.

[28] M. Vesterbacka, M. K. Rudberg, J. J. Wikner, and N. Andersson, “Dynamic Element Matching in D/A Converters with Restricted Scrambling,” Proc. of IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.

[29] M. K. Rudberg, M. Vesterbacka, N. U. Andersson, and J. J. Wikner, “A scrambler and a method to scramble data words,” Swedish patent appl. 0001917-4, 23 May 2000.

[30] M. K. Rudberg, J. J. Wikner, J.-E. Eklund, F. Gustavsson, and J. Elbornsson, “A/D and D/A Converters for Telecom. Applications,” http://www.es.isy.liu.se/staff/mikaelr/downloads/adda_tut_icecs2001.pdf, tutorial held at IEEE Intern. Conf. on Electronics, Circuits & Systems, Sept. 2001.

[31] K. Palmkvist, Studies on the Design and Implementation of Digital Filters, Diss. No. 583, Linköping University, Sweden, 1999.

[32] M. Renfors and Y. Neuvo, “The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints,” IEEE Trans. on Circuits and Systems, Vol. CAS-28, No. 3, pp. 196-202, March 1981.

[33] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design, Kluwer Academic Publishers, 1995.

[34] A. Bellaouar and M. Elmasry, Low-Power VLSI Design - Circuits and Systems, Kluwer Academic Publishers, 1995.

[35] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, 1997.

[36] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, 1997.

[37] E. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall, 1988.

[38] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Math. Comput., Vol. 19, pp. 297-301, April 1965.

[39] W. M. Gentleman and G. Sande, “Fast Fourier Transform for Fun and Profit,” Proc. 1966 Fall Joint Computer Conf., AFIPS'66, Vol. 29, pp. 563-678, Washington DC, USA, Nov. 1966.

[40] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice Hall, 1989.

[41] Proakis and Manolakis, Digital Signal Processing - Principles, Algorithms and Applications, 2nd ed., Macmillan, 1992.

[42] Ericsson Internal document, ETX/XA/NB-97:006.

[43] M. Hjelm, Architectural Synthesis From a Time Discrete Behavioural Language, LiTH-ISY-EX-2000, Linköping, Sweden, Sept. 1998.

[44] P. Schaumont, S. Vernalde, L. Rijnders, M. Engels, and I. Bolsens, “A Programming Environment for the Design of Complex High Speed ASICs,” Proc. of Design Autom. Conf., pp. 915-20, 1998.

[45] K. Wakabayashi, “C-based Synthesis Experiences with a Behavior Synthesizer, “Cyber”,” Design Automation and Test in Europe Conf. and Exhibition, DATE'99, pp. 390-99, 1999.

[46] http://www.SystemC.org

[47] H. D. Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen, “CATHEDRAL-II - A Computer-Aided Synthesis System for Digital Signal Processing VLSI Systems,” Computer-Aided Engineering Journal, pp. 55-66, April 1988.

[48] J. M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, “Fast Prototyping of Datapath-Intensive Architectures,” IEEE Design and Test of Computers, Vol. 8, Iss. 2, pp. 40-51, June 1991.

[49] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe, “GAUT: An Architectural Synthesis Tool for Dedicated Signal Processors,” Proc. of European Design Autom. Conf., pp. 14-19, Feb. 1993.

[50] L. Guerra, M. Potkonjak, and J. Rabaey, “A Methodology for Guided Behavioral-Level Optimization,” Proc. of Design Automation Conf., DAC'98, pp. 309-14, USA, June 1998.

[51] S. Ramanathan, V. Visvanathan, and S. K. Nandy, “Synthesis of Configurable Architectures for DSP Algorithms,” Proc. of 12th Intern. Conf. on VLSI Design, pp. 350-57, Jan. 1999.

[52] A. A. Jerraya, I. Park, and K. O'Brien, “AMICAL: An Interactive High Level Synthesis Environment,” Proc. of European Design Autom. Conf., pp. 58-62, Feb. 1993.

[53] M. Benmohammed and A. Rahmoune, “Automatic generation of reprogrammable microcoded controllers within a high-level synthesis environment,” IEE Proc. Comput. Digit. Tech., Vol. 145, No. 3, pp. 155-60, May 1998.

[54] D. A. Huffman, “A method for the construction of minimum redundancy codes,” Proc. IRE, Vol. 40, No. 10, pp. 1098-1101, Sept. 1952.

[55] S. F. Chang and D. G. Messerschmitt, “Designing High-Throughput VLC Decoder Part I - Concurrent VLSI Architectures,” IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.

[56] H. D. Lin and D. G. Messerschmitt, “Designing High-Throughput VLC Decoder Part II - Parallel Decoding Methods,” IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

[57] S. Ho and P. Law, “Efficient Hardware Decoding Method for Modified Huffman Code,” Electronics Letters, Vol. 27, No. 10, pp. 855-856, May 1991.

[58] S. B. Choi and M. H. Lee, “High Speed Pattern Matching for a Fast Huffman Decoder,” IEEE Trans. on Consumer Electronics, Vol. 41, No. 1, pp. 97-103, Feb. 1995.

[59] R. Hashemian, “High Speed Search and Memory Efficient Huffman Coding,” Proc. IEEE Intern. Symp. on Circuits and Systems, ISCAS'93, Vol. 1, pp. 287-290, 1993.

[60] R. Hashemian, “Design and Hardware Implementation of a Memory Efficient Huffman Decoding,” IEEE Trans. on Consumer Electronics, Vol. 40, No. 3, pp. 345-352, Aug. 1994.

[61] H. Park and V. Prasanna, “Area Efficient VLSI Architectures for Huffman Coding,” IEEE Trans. on Circuits and Systems - II: Analog and Digital Signal Processing, Vol. 40, No. 9, pp. 568-575, Sept. 1993.

[62] K. Parhi, “High-Speed Architectures for Huffman and Viterbi Decoders,” IEEE Trans. on Circuits and Systems - II: Analog and Digital Signal Processing, Vol. 39, No. 6, pp. 385-391, June 1992.

[63] E. Komoto and M. Seguchi, “A 110 MHz MPEG2 Variable Length Decoder LSI,” 1994 Symp. on VLSI Circuits, Digest of Technical Papers, pp. 71-72, 1994.

[64] D.-S. Ma, J.-F. Yang, and J.-Y. Lee, “Programmable and Parallel Variable-Length Decoder for Video Systems,” IEEE Trans. on Consumer Electronics, Vol. 39, No. 3, pp. 448-454, Aug. 1993.

[65] Y.-S. Lee, B.-J. Shieh, and C.-Y. Lee, “A Generalized Prediction Method for Modified Memory-Based High Throughput VLC Decoder Design,” IEEE Trans. on Circuits and Systems - II: Analog and Digital Signal Processing, Vol. 46, No. 6, pp. 742-754, June 1999.

[66] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching Properties of MOS Transistors,” IEEE J. of Solid-State Circuits, Vol. 24, No. 5, pp. 1433-9, Oct. 1989.

[67] M. Gustavsson, J. J. Wikner, and N. N. Tan, CMOS Data Converters for Communications, Kluwer Academic Publishers, 2000.

[68] K.-S. Tan, et al., “Error Correction Techniques for High-Performance Differential A/D Converters,” IEEE J. of Solid-State Circuits, Vol. 25, No. 6, pp. 1318-27, Dec. 1990.

[69] J.-E. Eklund and F. Gustafsson, “Digital Offset Compensation of Time-Interleaved ADC Using Random Chopper Sampling,” Proc. IEEE Intern. Symp. on Circuits and Systems, ISCAS'00, Vol. 3, pp. 447-50, Geneva, May 2000.

[70] M. Gustavsson, CMOS A/D Converters for Telecommunications, Diss. No. 552, Linköping University, Sweden, 1998.

[71] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Fundamentals and High-Speed Waveform Digitizers,” IEEE Trans. on Instrumentation and Measurement, Vol. 37, No. 2, pp. 245-251, June 1988.

[72] H. Johansson and P. Löwenborg, “Reconstruction of Nonuniformly Sampled Bandlimited Signals Using Digital Filter Banks,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'01, Sydney, 2001.

[73] H. Jin and E. Lee, “A Digital Technique for Reducing Clock Jitter Effects in Time-Interleaved A/D Converter,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'99, Vol. 2, pp. 330-33, 1999.

[74] H. Jin and E. Lee, “A Digital-Background Calibration Technique for Minimizing Timing-Error Effects in Time-Interleaved ADC's,” IEEE Trans. on Circuits and Systems - II: Analog and Digital Signal Processing, Vol. 47, No. 7, pp. 603-13, July 2000.

[75] Y.-C. Jenq, “Perfect Reconstruction of Digital Spectrum from Nonuniformly Sampled Signals,” IEEE Trans. on Instrumentation and Measurement, Vol. 46, No. 7, pp. 649-52, Dec. 1997.

[76] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: A Robust Sampling Time Offset Estimation Algorithm for Ultra High-Speed Waveform Digitizers Using Interleaving,” IEEE Trans. on Instrumentation and Measurement, Vol. 39, No. 1, pp. 71-75, Feb. 1990.

[77] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Theories and Applications - Measuring Clock/Aperture Jitter of an A/D System,” IEEE Trans. on Instrumentation and Measurement, Vol. 39, No. 6, pp. 969-71, Dec. 1990.

[78] J. Elbornsson and J.-E. Eklund, “Blind Estimation of Timing Errors in Interleaved AD Converters,” IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, May 2001.

[79] R. J. van de Plassche, “Dynamic Element Matching for high-accuracy monolithic D/A converters,” IEEE J. of Solid-State Circuits, Vol. SC-11, pp. 795-800, Dec. 1976.

[80] P. Carbone and I. Galton, “Conversion error in D/A converters employing dynamic element matching,” Proc. of ISCAS'94, Vol. 2, pp. 13-16, 1994.

[81] L. R. Carley, “A noise-shaping coder topology for 15+ bit converters,” IEEE J. of Solid-State Circuits, Vol. 24, No. 2, pp. 267-273, April 1989.

[82] H. T. Jensen and I. Galton, “An analysis of the partial randomization dynamic element matching technique,” IEEE Trans. on Circuits and Systems II, Vol. 45, No. 12, pp. 1538-1549, Dec. 1998.

[83] I. Galton, “Spectral Shaping of Circuit Errors in Digital-to-Analog Converters,” IEEE Trans. on Circuits and Systems II, Vol. 44, No. 10, pp. 808-817, Oct. 1997.

[84] L. Hernández, “A Model of Mismatch-Shaping D/A Conversion for Linearized DAC Architectures,” IEEE Trans. on Circuits and Systems I, Vol. 45, No. 10, pp. 1068-76, Oct. 1998.

[85] N. U. Andersson and J. J. Wikner, “Comparison of Different Dynamic Element Matching Techniques for Wideband CMOS DACs,” Proc. of NORCHIP, Oslo, Norway, Nov. 1999.

[86] M. Helfenstein and G. S. Moschytz, Circuits and Systems for Wireless Communications, Kluwer Academic Publishers, 2000.

Part 2: Publications

Paper 1 - New Approaches to High Speed Huffman Decoding


PAPER 1

New Approaches to High Speed Huffman Decoding

Mikael Karlsson Rudberg and Lars Wanhammar

Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’96, Atlanta, USA, May 1996.




Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden


ABSTRACT

This paper presents two novel structures for fast Huffman decoding. The solutions are suited for decoding of symbols at rates up to several hundred Mbit/s. The structures are built using the principle of pipelining, which when applied to the length decoder unit makes it possible to remove the only recursive loop in the basic structure. In this way a structure with a high theoretical speed is obtained. Another attractive property of the solutions is the simplicity of the structures and control logic.

1. INTRODUCTION

The Huffman coding technique is a lossless coding method that assigns short codewords to frequently used symbols and longer codewords to less frequently used symbols. If the codebook is good enough this will lead to a near entropy optimal result. Huffman coding is a part of several important image coding standards, for instance the JPEG [1] and MPEG [2] standards.


Since the coded data consists of codewords of different sizes, it is difficult to perform the decoding in parallel. This may not be a problem when dealing with still images, but moving images put entirely different requirements on the decoding process. The MPEG-2 standard requires the data to be decoded at 100 Mbit/s and above.

In this paper we introduce a new principle for fast Huffman decoding. The presented algorithm is a hybrid between a constant input, variable output decoder, and a variable input, constant output decoder. In section 2 an overview of previous work is given. In section 3 we discuss modifications of the algorithm in order to speed up the decoding. Finally two new structures are presented with slightly different properties.

2. PREVIOUS WORK

There are two main approaches for hardwired Huffman decoders with fixed codebooks. If one or several bits at a time are decoded at a constant rate it will result in a sequential solution which traverses the Huffman tree until a leaf is reached and then outputs the symbol (Fig. 1).

This type of decoder has a constant input rate and a variable output rate. If large codebooks are used, the constant input rate solution tends to give very large state machines, which limits the speed. Some ways to get around this problem are given in [3] and [4], but most solutions lead to complicated control logic.

The other approach is to decode one codeword in each cycle, so that one symbol is delivered every cycle (Fig. 2). However, since the codewords have different lengths the input rate will be variable. This solution consists of two main blocks. The first block finds the length of the next codeword. This is necessary since the different codewords must be kept apart to be able to feed the symbol decoder with correct data. The symbol decoder finds the symbol corresponding to the codeword. This pattern matching can be done in several ways. Usually some kind of PLA structure is used to perform both the length decoding and symbol decoding. In some solutions [5] sophisticated memory partition methods are used to get access to the symbol and its length in an effective way.

Figure 1. A constant input rate Huffman decoder. (Blocks: input buffer, K bit/cycle; logic; register holding the next state; symbol and symbol indicator; output buffer.)

3. TWO NEW FAST HUFFMAN DECODER STRUCTURES

In our solution we modify the length decoder and shifting buffer in the constant output rate decoder shown in Fig. 2. We then get a decoder with a structure similar to the constant output decoder but with a variable output rate and a constant input rate.

3.1. The basic Huffman decoder

The algorithm for a constant output rate Huffman decoder is described below.

1. Feed the symbol decoder with a coded vector from the input register. The length of this vector must be equal to the length of the longest possible codeword to ensure that the vector contains at least one codeword. At the same time feed the length decoder with the same vector as the symbol decoder.

2. The length of the decoded word, found by the length decoder, determines how many new bits must be shifted into the input register.

3. Repeat from 1.

The structure of the basic Huffman decoder is shown in Fig. 2. The critical path is from the input shifting buffer through the length decoder.

Figure 2. A constant output rate Huffman decoder. (Blocks: input, up to M bit/cycle; shifting buffer; length decoder returning the length to the buffer along the critical path; symbol decoder; output buffer.)

The decoder cannot run faster than the length decoder can find the length of the codeword. The symbol decoder can be designed in several ways and can always be pipelined to reach sufficient speed. Hence, we will focus on the length decoder and the input register.
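As an illustration only, the three steps of the basic algorithm can be modelled in software. The sketch below is in Python and uses a small hypothetical codebook (`CODEBOOK`) that does not come from the paper; the function names are assumptions as well. The symbol decoder and the length decoder are both represented by a single prefix match on an M-bit window.

```python
# Hypothetical prefix-free codebook (not from the paper).
CODEBOOK = {"0": "a", "10": "b", "110": "c", "111": "d"}
M = max(len(code) for code in CODEBOOK)  # length of the longest codeword


def match_prefix(window):
    """Return (symbol, length) of the one codeword that prefixes the window."""
    for code, symbol in CODEBOOK.items():
        if window.startswith(code):
            return symbol, len(code)
    raise ValueError("no codeword matches the window")


def decode_constant_output(bits):
    """Decode one symbol per cycle; return (symbols, bits consumed per cycle)."""
    symbols, consumed = [], []
    pos = 0
    while pos < len(bits):
        # Step 1: present an M-bit window to both decoders (zero-padded at the end).
        window = bits[pos:pos + M].ljust(M, "0")
        symbol, length = match_prefix(window)
        symbols.append(symbol)
        consumed.append(length)
        # Step 2: shift in as many new bits as the decoded codeword used.
        pos += length
        # Step 3: repeat from step 1.
    return symbols, consumed
```

Each loop iteration corresponds to one clock cycle, so the output rate is constant while the number of bits consumed per cycle varies between 1 and M.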

3.2. Huffman length decoder with relaxed evaluation time

The basic algorithm can easily be modified to not perform the length decoding and symbol decoding at the same time. The length decoder can find the length of codeword i at the same time that the symbol decoder decodes codeword i-1. Since the codewords have different lengths it is also reasonable to assume that it is usually more time consuming to evaluate the length of long codewords than shorter ones. These two observations can be utilized to design a more effective length decoder.

The basic circuit is modified by changing the input shifting buffer to a shift register and adding a register with a load signal between the length decoding logic and the shift register (Fig. 3).

Then let the length decoder indicate the length of the codeword by one-hot coding (i.e. one dedicated signal for every possible length). The algorithm will now look like this:

1. Shift data into the shift register until it is full.

2. Copy all the data, except the rightmost bit, from the shift register to the length decoder register (with the load signal).

3. The length decoder shall now, in one cycle, determine if the symbol is of length one and feed this to the control unit. Symbols of length two must be found in no more than two cycles, and so on with lengths of three and four up to the maximum code length M. When the length signal indicates that the length is found, the shift register has passed the next coded symbol to the symbol decoder and shifted in the next codeword. Thus, it is possible to continue from 2.

We have here utilized the fact that the length of longer codewords can be decoded at a slower rate than that of shorter ones. Notice that the constant output rate decoder has now got a constant input rate. In return, the symbol decoder will not get a new codeword every cycle and hence it will have a variable output rate.
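The change in behaviour can be seen in a simple cycle-level model. The following Python sketch is a behavioural simplification, not the circuit itself: it assumes one input bit per clock cycle and a hypothetical codebook, and it records the cycle in which each codeword length is recognized.

```python
# Hypothetical prefix-free codebook (not from the paper).
CODEBOOK = {"0": "a", "10": "b", "110": "c", "111": "d"}


def decode_serial(bits):
    """Return a list of (cycle, symbol) pairs for a one-bit-per-cycle input."""
    out = []
    candidate = ""
    for cycle, bit in enumerate(bits, start=1):
        candidate += bit  # the shift register takes one new bit every cycle
        # One-hot length test: "is the length equal to len(candidate)?"
        if candidate in CODEBOOK:
            out.append((cycle, CODEBOOK[candidate]))
            candidate = ""  # load signal: continue with the next codeword
    return out
```

A codeword of length L is recognized L cycles after the previous one, so short codewords are delivered quickly and the output rate varies while the input rate stays constant.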



The critical path will be from the length decoder register through the length decoder to the load signal. But the only thing that must be found in one cycle is if the length is equal to one. It is often possible to further reduce the critical delay by placing some of the length decoder logic between the shift register and the register.

Comparing this modified decoder with the basic decoder, there are a few important differences to note. The new structure decodes short codewords very fast but will be slower for longer codewords. Since the basic decoder that we started from decodes symbols at a constant output rate it will probably be more effective for long codewords. Fortunately, the nature of Huffman coding makes it more likely that short codewords will dominate.

3.3. Pipelined Huffman length decoder

In this version of the Huffman decoder all recursive loops are removed and then, in principle, the maximum clock rate is limited only by the delay of a single logic gate. Hence, decoding rates of several hundred MHz are feasible. To obtain this pipelined structure the structure in Fig. 3 is modified as described below.

First we remove the loadable register. As a consequence, it must for a moment be assumed that all outputs from the length decoder are evaluated in one cycle. Since only one bit of the length vector is considered every cycle, D flip-flops must be added before the multiplexer to equalize the delay (Fig. 4). For the 'length = 2?' signal one D flip-flop is needed, for the 'length = 3?' signal two D flip-flops are needed, and so on.

Figure 3. Huffman decoder with relaxed evaluation time for the length decoding unit. (Blocks: input shift register, M bits; loadable register, M-1 bits, with load signal; length decoder logic producing the signals 'length=1?', 'length=2?', ..., 'length=M?'; select; length; symbol decoder; output buffer.)

Further we can add D flip-flops after the length decoder as long as it is done before the symbol decoder as well (Fig. 4). All the flip-flops can be propagated into the multiplexer and the length decoder logic. By this the delay through the decoder logic is reduced to $T_{critical}/N$, where $T_{critical}$ is the critical, not maximum, delay through the length decoder logic and the multiplexer and $N$ is the number of added flip-flops.
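The flip-flop bookkeeping above can be checked with a few lines of Python. This is a minimal sketch under the idealized assumption that the N added flip-flops divide the critical delay evenly; the function names and any numerical values used with them are hypothetical.

```python
def equalizing_flipflops(max_length):
    """Equalizing D flip-flops per 'length = k?' signal: k - 1 for each k."""
    return {k: k - 1 for k in range(1, max_length + 1)}


def stage_delay(t_critical, n_flipflops):
    """Ideal per-stage logic delay T_critical / N after retiming N flip-flops."""
    if n_flipflops < 1:
        raise ValueError("at least one flip-flop is required")
    return t_critical / n_flipflops
```

For a maximum code length of M = 4, the signals need 0, 1, 2 and 3 equalizing flip-flops respectively, and a critical delay of 8 time units split over N = 4 flip-flops leaves 2 time units per stage.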

The resulting structure is shown in Fig. 5. This structure tries to evaluate the length of the codeword at the input vector every cycle instead of only when a codeword is actually present at the input, as in the first solution. Since there are no limitations on how much the structure is pipelined, the length decoder will no longer be the time critical part of the design and the speed can be increased significantly. The theoretical speed limit is now set by the delay from a flip-flop through one logic gate to the following flip-flop.

3.4. Symbol decoder

The actual implementation of the symbol decoder is not discussed in this paper. However, some remarks on the data input interface are in order. In our decoder structures the symbol decoder is always fed with serial data, but we want the symbol decoder to be bit parallel to avoid time critical recursive loops. The serial to parallel conversion can easily be done by having a register at the input of the symbol decoder with a load signal for every individual bit. The first bit can then be stored at position one, the next at position two and so on. When the length is found by the length decoder all necessary bits are stored in the input register of the symbol decoder and the symbol decoding can start.
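The serial to parallel conversion described above might be modelled as follows. The class and method names are hypothetical, and the sketch only mirrors the behaviour of a register with one load signal per bit; it is not a description of the actual circuit.

```python
class SerialToParallelRegister:
    """Behavioural model of an input register with a load signal per bit."""

    def __init__(self, width):
        self.bits = [0] * width
        self.position = 0  # index of the next bit position to load

    def shift_in(self, bit):
        """Store the incoming serial bit at the next position."""
        self.bits[self.position] = bit
        self.position += 1

    def take(self):
        """Called when the length is found: return the codeword bits and reset."""
        word = self.bits[:self.position]
        self.bits = [0] * len(self.bits)
        self.position = 0
        return word
```

When the length decoder signals a match, `take()` hands the collected bits to the symbol decoder in parallel and the register starts filling again from position one.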

Since it is more complicated to calculate the symbol than to find its length, one might want this part to run at a lower speed. In our structures this can easily be realized with a FIFO buffer inserted between the length and symbol decoders.

4. CONCLUSIONS

We have presented two new structures for Huffman decoders. Both structures are based on a simple constant output rate decoder with a length decoder and a symbol decoder. Since the speed limiting unit in this structure is the length decoder we have suggested how it can be modified to reach higher speed.

Our first structure contains a length decoder with relaxed evaluation time that makes it possible to significantly reduce the critical path delay and in this way design faster Huffman decoders. We have simulated a standard cell implementation of the MPEG-2 Huffman tables at 120 MHz using a 0.8 µm CMOS process.



In the pipelined structure we have shown how the time limiting recursive loop in the length decoder can be completely eliminated. This structure should be suitable for Huffman decoders with very high decoding rates, for example in future wideband transmission systems and HDTV.

5. REFERENCES

[1] ISO/IEC 10918-1 Digital compression and coding of continuous-tone still images (JPEG), Feb. 1994.

[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated audio information, part 2: Video, (MPEG-2), June 1994.

[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC Decoder Part I - Concurrent VLSI Architectures, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.

[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC Decoder Part II - Parallel Decoding Methods, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

[5] S. B. Choi and M. H. Lee, High Speed Pattern Matching for a Fast Huffman Decoder, IEEE Transactions on Consumer Electronics, Vol. 41, No. 1, pp. 97-103, Feb. 1995.

Figure 4. Huffman decoder with delay elements in the length decoder unit. (Blocks: input shift register, M bits; length decoder logic with the signals 'length=1?' to 'length=M?'; equalizing D flip-flops, up to (M-1)D on the 'length=M?' signal; N added flip-flops; select; length; symbol decoder; output buffer.)

Figure 5. Huffman decoder with pipelined length decoder unit. (Blocks: input shift register, M bits; pipelined length decoder logic with the signals 'length=1?' to 'length=M?'; N D delay; select; length; symbol decoder; output buffer.)

Paper 2 - Implementation of a Fast MPEG-2 Compliant Huffman Decoder



Proceedings of European Signal Processing Conference, EUSIPCO’96, Trieste, Italy, Sept. 1996.



Mikael Karlsson Rudberg and Lars Wanhammar

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden

Tel: +46 13 284059; fax: +46 13 139282

ABSTRACT

In this paper a 100 Mbit/s Huffman decoder implementation is presented. A novel approach has been used, in which parallel decoding of data is combined with a serial input. The critical path has been reduced and a significant increase in throughput is achieved. The decoder is aimed at the MPEG-2 Video decoding standard and has therefore been designed to meet the required performance.

1. INTRODUCTION

Huffman coding is a lossless compression technique often used in combination with lossy compression methods, for instance in digital video and audio applications. The Huffman coding method uses codes with different lengths, where symbols with high probability are assigned shorter codes than symbols with lower probability. The problem is that since the coded symbols have unequal lengths it is impossible to know the boundaries of the symbols without first decoding them. Therefore it is difficult to parallelize the decoding process. When dealing with compressed video data this becomes a problem since high data rates are necessary.

The architecture of the Huffman decoder presented in this paper is based on a novel hardware structure [1] that allows high speed decoding.

The decoder can handle all Huffman tables required for decoding MPEG-2 Video at the Main Profile, Main Level resolutions [2]. The design is completely MPEG-2 adapted with automatic handling of the MPEG-2 specific escape and end of block codes. In total our decoder supports 11 code tables with more than 600 different code words. Since the code books are static in the MPEG-2 standard the Huffman decoder has been optimized for these specific MPEG-2 codes. A decoding rate of 100 Mbit/s is required and also achieved in our implementation.

2. HUFFMAN DECODER

Huffman decoding can be performed in numerous ways. One common principle is to decode the incoming bit stream in parallel [3, 4]. The simplified decoding process is described below:

1. Feed a symbol decoder and a length decoder with M bits, where M is the length of the longest code word.

2. The symbol decoder maps the input vector to the corresponding symbol. A length decoder will at the same time find the length of the input vector.

3. The information from the length decoder is used in the input buffer to fill up the buffer again (with between one and M bits, Fig. 1).

The problem with this solution is the long critical path through the length decoder to the buffer that shifts in new data (Fig. 1).

In our decoder the shifting buffer is realized with a shift register that continuously shifts new data into the decoder (Fig. 2). The length decoder and symbol decoder are supplied from registers that are loaded every time a new code word is present at the input. The decoding process is described below:

1. Load the input registers of the length and symbol decoder.

2. If the coded data has a length of one go back to point 1.

3. If the coded data has a length of two go back to point 1.

and so on with codes of length three and four up to M.



This structure allows longer evaluation times for longer code words. The delay in the critical path is reduced to the time it takes for evaluating the length of code words with a length of one or two bits. Codes with other lengths are allowed to be evaluated in several cycles, i.e. code words with lengths of three must be evaluated in two cycles and so on.

Comparing this algorithm with the previous one we note the following:

• The input rate of our new structure is constant while the original has a variable input rate.

• The new structure evaluates short code words in a few cycles but requires more cycles for longer words. The original structure has a constant evaluation time for all code words.

• The new structure allows a higher clock rate since the critical path is reduced. But this also means that the symbol decoder must be faster since in the worst case it will receive new data every clock cycle.

• The new structure has a variable output rate while the original one has a constant output rate.

Figure 1. A constant output rate Huffman decoder. (Blocks: input, up to M bit/cycle; shifting buffer; length decoder returning the length to the buffer along the critical path; symbol decoder; output buffer.)

Figure 2. Huffman decoder with relaxed length evaluation time. (Blocks: input shift register, M bits; length decoder with an M-1 bit register, load signal, length decoder logic and the signals 'length=1?' to 'length=M?', select and length; force load; symbol decoder with input registers and symbol decoder logic; output buffer.)

The new structure requires a higher clock rate to perform the same amount of work. But if the average code length is short enough the new structure will have a higher speed due to the significantly higher clock rates that can be achieved. Normally the shorter code words will dominate in Huffman coded data and therefore the new decoder is faster under normal circumstances.
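The trade-off can be made concrete with a small calculation. The Python sketch below assumes that the new structure consumes exactly one input bit per clock cycle and that the original structure delivers one symbol per clock cycle; the function names and the clock frequencies used with them are hypothetical and are not measured values.

```python
def symbol_rate_parallel(f_clk_hz):
    """Original structure: one symbol per clock cycle."""
    return f_clk_hz


def symbol_rate_serial(f_clk_hz, avg_code_length):
    """New structure: one bit per cycle, so one symbol per average code length."""
    return f_clk_hz / avg_code_length


def serial_is_faster(f_serial_hz, f_parallel_hz, avg_code_length):
    """True if the new structure delivers more symbols per second."""
    return (symbol_rate_serial(f_serial_hz, avg_code_length)
            > symbol_rate_parallel(f_parallel_hz))
```

Under these assumptions the new structure wins whenever the ratio of the two clock rates exceeds the average code length. With an average code length of 4 bits, for example, a 100 MHz serial decoder outperforms a 20 MHz parallel decoder but not a 30 MHz one.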

2.1. Handling of special markers

Special markers are placed in the data stream to indicate for example end of block (eob) at the end of a coded block of data. After this marker other types of data, like uncompressed stream information, will follow (Fig. 3). In the Huffman decoder it is essential to detect the presence of this marker to be able to stop the decoding process and let other units process the data that follows. This decoder detects the eob marker in the length decoder and halts the decoding process until a new start signal is applied.

The mb_escape marker is also important. After this symbol the following data is of fixed length. This marker is also detected in the length decoder and causes the following data to be passed through the symbol decoder unchanged (Fig. 3).
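The marker handling can be sketched in software as follows. The codebook, the marker codes and the width of the fixed-length field (`ESCAPE_FIELD_BITS`) are hypothetical placeholders and are not taken from the MPEG-2 tables; only the control flow is meant to mirror the description above.

```python
# Hypothetical codebook with two marker symbols (not the MPEG-2 tables).
CODEBOOK = {"0": "a", "10": "b", "110": "ESC", "111": "EOB"}
ESCAPE_FIELD_BITS = 4  # assumed width of the fixed-length data after an escape


def decode_block(bits):
    """Decode until the end-of-block marker.

    Returns (symbols, position of the first bit after the marker).
    """
    symbols = []
    candidate = ""
    pos = 0
    while pos < len(bits):
        candidate += bits[pos]
        pos += 1
        symbol = CODEBOOK.get(candidate)
        if symbol is None:
            continue  # length not found yet, keep shifting
        candidate = ""
        if symbol == "EOB":
            return symbols, pos  # halt; other units process what follows
        if symbol == "ESC":
            # Fixed-length data is passed through unchanged.
            symbols.append(("raw", bits[pos:pos + ESCAPE_FIELD_BITS]))
            pos += ESCAPE_FIELD_BITS
        else:
            symbols.append(symbol)
    return symbols, pos
```

The returned position tells the surrounding system where the non-Huffman data begins, which corresponds to handing control to other units until a new start signal is applied.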

3. IMPLEMENTATION

The MPEG-2 standard requires that the input data be decoded at a rate of about 100 Mbit/s. During the implementation special care had to be taken in the partitioning of the symbol decoder and a few critical paths had to be optimized manually. A few modifications of the new decoding algorithm had to be made to achieve the targeted performance.

Figure 3. Markers in the MPEG-2 stream requiring special decoding. (Two stream timelines along t: Huffman codes, mb_escape, fixed-length data, Huffman codes; and Huffman codes, eob, header data, Huffman codes.)

3.1. Improvements of the length decoder

The length decoder turned out to be too slow when evaluating codes with lengths of one or two bits ('length = 1?' and 'length = 2?' in Fig. 2). These paths had to be broken up. How this was done is shown in Fig. 4. The evaluation of the 'length = 1?' signal is done by taking data one step earlier (i.e. from position i-1 instead of i) from the shift register and adding a flip-flop after the evaluation. For the 'length = 2?' signal the register was moved to after the evaluation logic.

Note that this way of breaking up the critical loops can be generalized to removing all critical loops in this structure, see [1].

3.2. Symbol decoder

The symbol decoding task is more complicated than the length decoding. The symbol decoder could not be designed to receive data at 100 Mbit/s. Code words with a length of one bit are rare in MPEG coded data. The most frequently used code tables contain only codes with more than two bits. Therefore the symbol decoder is fed with data no more often than every second clock cycle. The input shift register is halted one clock period every time a one bit code is found, and hence the symbol decoder only needs to process 50 Mbit/s without a significant loss in performance. However, this modification causes the input rate to vary.

The symbol decoder was split into five separate units, each of which takes care of its own part of the code tables (Fig. 5). Every unit consists of an input register that holds the data and a combinatorial block that maps the input vector to the symbol. The output of one of the five units is selected and passed to the output.

Data is in two's complement after the mb_escape marker while other data is decoded to signed magnitude format. A post processing stage converts the two's complement data to signed magnitude representation.
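The conversion performed by the post processing stage can be written down compactly. The following Python sketch is a generic two's complement to signed magnitude conversion for an arbitrary word width; the function name and the word widths used with it are assumptions, since the paper does not state them.

```python
def twos_complement_to_sign_magnitude(value, width):
    """Convert a width-bit two's complement pattern to (sign bit, magnitude)."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    sign = (value >> (width - 1)) & 1
    # For negative numbers the magnitude is 2**width minus the bit pattern.
    magnitude = (1 << width) - value if sign else value
    return sign, magnitude
```

For an 8-bit word, the pattern 0b11111101 (decimal -3) becomes sign 1 with magnitude 3. Note that the most negative pattern, 0b10000000, has magnitude 128, which needs the full word width to represent.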

Figure 4. Optimization of critical paths in the length decoder. (Two panels, before and after optimization, each showing shift register positions i-2, i-1 and i, D flip-flops, load signals and the logic for 'length=1?' and 'length=2?'.)

3.3. Interface

The interface of the Huffman decoder consists of an eight bit, parallel input port for coded data. A signal indicates when a new input vector can be applied. The decoded data is delivered with a maximum of 50 Msymbol/s. The 'symbol present' signal (Fig. 5) indicates when data is valid at the output.

The shift register at the input of the decoder (Fig. 2) can be read and controlled externally. This is necessary since the Huffman coded data is interleaved with other information.

3.4. Synthesis

The decoder has been described in VHDL and then transformed to a circuit using synthesis tools mapping to a 0.8 µm CMOS standard cell library. Some post processing had to be done after the synthesis step to achieve the necessary performance. The main problem was to get the symbol decoder to work fast enough. Therefore the symbol decoding has been split into five separate units. The core area is about 8.4 mm² and the total area is 14.5 mm² (3.9 × 3.75 mm²). About two thirds of the area is occupied by the symbol decoder (Fig. 6). The power supply is 5 V and the transistor count is 26900.

3.5. Symbol tables

To get all symbol tables correct, the VHDL code for the symbol decoding as well as for the length decoding has been generated from a thoroughly verified template file. However, this method also yielded a sub-optimal symbol decoder that can be further optimized, but with an increasing probability of introducing design errors in the code tables. This was not done in this implementation because of lack of time and because it was considered more important to get a functionally correct implementation.

Figure 5. Realization of symbol decoder. (Blocks: input load register; decoder units 1 to 5, each with its own registers; current table selection; post processing; D flip-flops; 'symbol present' signal; symbol output.)

4. CONCLUSIONS

In this paper an implementation of a novel Huffman decoder architecture has been presented. We have shown that the new structure can be used for fast Huffman decoding while still keeping a simple architecture. The throughput has been increased by using a serial input combined with a serial/parallel length evaluation. Since the current implementation uses standard cells it is reasonable to believe that a full custom version of the same circuit can reach significantly higher speed.

5. REFERENCES

[1] M. K. Rudberg and L. Wanhammar, New Approaches to High Speed Huffman Decoding, IEEE Proc. ISCAS’96, May 1996.

[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated audio information, part 2: Video, (MPEG-2), June 1994.

[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC Decoder Part I – Concurrent VLSI Architectures, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.

[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC Decoder Part II – Parallel Decoding Methods, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

Figure 6. Layout of the Huffman decoder. (Labeled regions: symbol decoder, length decoder, control unit, shift register, clock buffer.)

Paper 3 - High Speed Pipelined Parallel Huffman Decoding



Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’97, Hong Kong, June 1997.




Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden


ABSTRACT

This paper introduces a new class of Huffman decoders which is a development of the parallel Huffman decoder model. With pipelining and partitioning, a regular architecture with an arbitrary degree of pipelining is developed. The proposed architecture dramatically reduces the symbol decoder requirements compared to previous results, although the actual implementation of the symbol decoder is not treated. The proposed architectures also have the potential to realize high speed, low power Huffman decoders.

1. INTRODUCTION

The Huffman coding method is a method for lossless data compression. The method is used in a variety of fields, for instance in the JPEG image coding standard and the MPEG Video coding standards. With the introduction of High Definition digital television (HDTV) the throughput requirements of the Huffman decoder will increase by several orders of magnitude. Unfortunately the Huffman decoding process is difficult to parallelize since the symbols are of unequal length. It is not possible to know where the symbol boundaries are before actually decoding them in sequence.

The Huffman code uses variable-length code words to compress its input data. Frequently used symbols are represented with a short code while less often used symbols have longer representation. The Huffman codebook forms an unbalanced binary tree with the symbols at the leaves. The Huffman decoding process starts at the root node in the binary tree and stops at a leaf.
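The tree view of the decoding process can be illustrated with a short Python sketch. The codebook is a hypothetical example, the tree is represented with nested dictionaries, and the function names are assumptions; the traversal itself follows the root-to-leaf procedure described above.

```python
def build_tree(codebook):
    """Build a binary code tree as nested dicts; leaves hold the symbols."""
    root = {}
    for code, symbol in codebook.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol
    return root


def decode_by_tree(bits, root):
    """Walk from the root for every bit; emit a symbol at each leaf."""
    symbols = []
    node = root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):  # a leaf has been reached
            symbols.append(node)
            node = root  # restart at the root for the next code word
    return symbols
```

With the codebook {"0": "a", "10": "b", "110": "c", "111": "d"} the tree is unbalanced: the frequent symbol "a" sits one step from the root while "c" and "d" sit three steps away.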

In this paper we will extend previous work reported in [1] and [2] where architectures for high speed Huffman decoders are described. In this paper we generalize the concept of pipelined Huffman decoders and discuss the theoretical potential of this class of decoders. An improvement for dramatically decreasing the symbol decoding speed requirements is also presented.

2. HUFFMAN DECODER MODELS

There are two main classes of Huffman decoders, the parallel decoder and the sequential decoder [3, 4], Fig. 1. The sequential decoder has a constant input data rate with a width of just a few bits (normally one or two bits). The code-tree is represented as a state machine that traverses the code-tree until a symbol is found. A common realization that belongs to this class of decoders is based on lookup tables stored in a memory [5]. This type of solution can be made very memory efficient, but since a state machine representation always has several feedback loops and an input data rate of just a few bits per cycle, the potential for high speed decoding is limited for this type of decoder.

The parallel decoder consists of three different units: a symbol decoder that maps a bit-vector containing a coded symbol to a fixed-length representation, a length decoder that calculates the length of the current code, and a shifting buffer that uses this length to determine how many bits have been consumed and to fill itself again. The parallel decoder has a varying input rate of 1 to $W_{code,max}$ bits/cycle depending on the length of the latest decoded symbol. $W_{code}$ is the length of the present code and $W_{code,max}$ is the length of the longest code in the codebook. The output rate is constant with a fixed delay for all symbols. The critical loop in the parallel decoder is through the length decoder to the shifting buffer. Before a new symbol can be decoded the length of the previous code has to be found and the consumed bits must be thrown away.

Figure 1. Fundamental Huffman decoder models. (Parallel decoder model: coded data, 1 to $W_{code,max}$ bits/cycle; shifting buffer; length decoder on the critical path; symbol decoder; outputs length and symbol. Sequential decoder model: coded data, 1 bit/cycle; state machine with next-state feedback on the critical path; output symbol.)

This paper will show that the parallel decoder has a potential of reaching a high decoding rate. In Fig. 1 the two discussed models are shown. In the remainder of this paper we will focus on the parallel Huffman decoder model.

One drawback with the parallel Huffman decoder in Fig. 1 is that the symbol decoder and the length decoder operate in parallel on the same code. Therefore the length of the code is not available when the symbol decoder starts the decoding, which makes the symbol decoding more difficult than necessary. This problem can however be solved by inserting a buffer in front of the symbol decoder as shown in Fig. 2. Since the length and symbol decoders here operate on different codes the symbol decoder can take advantage of the fact that the length of the code is known.

3. PIPELINED PARALLEL HUFFMAN DECODING

In [1] we have shown that it is possible to completely remove the critical loop in the parallel decoder. This is done by replacing the shifting buffer with a shift register, and by using a pipelined length decoder. The resulting architecture will have a structure similar to the parallel pipelined decoder but a behavior more like the sequential decoder with a constant input data rate and a varying output rate (Fig. 3).

Figure 2. Pipelined parallel decoder model. (Blocks: coded data; shifting buffer; length decoder; buffer in front of the symbol decoder; symbol decoder; outputs length and symbol.)

The decoder in Fig. 3 operates as follows: The shift register continuously shifts the coded data from left to right. The codelength is evaluated in the pipelined length decoder unit and is represented with one separate signal for every length, i.e. $W_{code,max}$ signals. In every cycle one codelength is checked. In the first cycle it is checked if the code is a one bit code, in the second cycle it is checked if it is a two bit code and so on until a matching length is found. At this time the code has been shifted out from the shift register and stored in a register feeding the symbol decoder. The symbol decoder starts and the length decoder starts to examine if the next code is a one bit code and so on. Note that the feedback loop from the length decoder to the shifting buffer is no longer needed, but is replaced by a synchronous reset signal to a counter.

A major disadvantage with this structure is that the symbol decoder must be designed for a worst case sampling rate of $f_{s,max} = f_{clk}$ to be able to handle successive one bit codes. This yields a low degree of utilization of the symbol decoder since the sampling rate is lower when longer codes are decoded (utilization $\eta = 1/W_{code,ave}$, where $W_{code,ave}$ is the average codelength).
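The utilization figure can be evaluated for a given code. The sketch below, in Python, computes the average codelength from a hypothetical set of code lengths and symbol probabilities and then the utilization as its reciprocal; the function names and the example distribution are assumptions made for illustration.

```python
def average_code_length(code_lengths, probabilities):
    """Probability-weighted mean codelength in bits."""
    if len(code_lengths) != len(probabilities):
        raise ValueError("one probability per code length is required")
    return sum(l * p for l, p in zip(code_lengths, probabilities))


def utilization(avg_code_length):
    """Fraction of clock cycles in which the symbol decoder receives a code."""
    return 1.0 / avg_code_length
```

For code lengths 1, 2, 3 and 3 with probabilities 0.5, 0.25, 0.125 and 0.125 the average codelength is 1.75 bits, so a symbol decoder designed for one code per clock cycle would be busy in only about 57 % of the cycles.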

3.1. Reducing symbol decoder requirements

One way of increasing the degree of utilization of the symbol decoder is to insert a buffer between the length decoder and the symbol decoder, and then use a slower symbol decoder. However, one cannot guarantee that a buffer overflow never occurs when long sequences of codes with short codelengths arrive.

Figure 3. Loop free pipelined parallel decoder. (Blocks: input stream into the shift register; length decoder with $W_{code,max}$ length signals, from 'Lcode = 1 bit' to 'Lcode = Wcode,max', and k pipeline stages with pipeline registers; equalizing delay; register feeding the symbol decoder for $W_{code,max}$ bits; counter with start and reset signals; decoded symbols at rate $f_s$; clock $f_{clk}$.)

Another solution is to stop the length decoder and the shift register when fs,max is exceeded [2]. This can for instance be done by halting the length decoder and the shift register for a number of cycles as soon as a code with a length of less than M bits is found, where the maximum sampling rate of the symbol decoder is fs,max = fclk/M. The penalty is that no symbol is decoded in less than M cycles, i.e. the decoder is less efficient on short codes, which are also the most frequent ones. This can in some cases be accepted, since the average throughput is high anyway when the average codelength is low. Unfortunately, halting the shift register means that the constant input data rate property is lost.
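The cost of halting can be estimated with a few lines of arithmetic. The sketch below assumes that every code occupies at least M cycles in the halting scheme and exactly its own length in cycles otherwise; the sequence of codelengths is invented for illustration.

```python
def cycles_without_halting(code_lengths):
    """One input bit per cycle: a code of length L takes L cycles."""
    return sum(code_lengths)


def cycles_with_halting(code_lengths, m):
    """Codes shorter than m bits are stretched to m cycles by halting."""
    return sum(max(length, m) for length in code_lengths)


# Hypothetical sequence of codelengths dominated by short codes.
lengths = [1, 1, 2, 5, 1, 3]
free_running = cycles_without_halting(lengths)
halted = cycles_with_halting(lengths, m=2)
print(free_running, halted)
```

In this example 13 input bits take 16 cycles with M = 2, so the input is no longer consumed at one bit per cycle.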

In the next section we propose another method for reducing the requirements on the symbol decoder without any loss in efficiency. This is accomplished by taking advantage of the fact that the lengths of the codes are available and using them to partition the symbol decoder.

3.2. Symbol decoder partitioning

Only when the code stream contains a one bit code will the symbol decoder in Fig. 3 be fed with a new code in two successive clock cycles (i.e. a code with length k is followed by a code with length 1). If the one bit codes can be sorted out before the symbol decoder, the maximum necessary sampling rate can be halved. This can be done by switching to another decoder during one clock cycle as soon as a new code is fed into the decoder.

If no one bit code follows, the switch is restored, making it possible for the original symbol decoder to receive a new code with a length of two or more bits. The one bit symbol decoder can obviously be made very simple, since there is only one possible one bit code. The one bit decoder must, however, have a maximum sampling frequency of fs,max = fclk. The original symbol decoder only needs to consider codes with a length of two bits or more.

In a similar way it is possible to partition the symbol decoder so that one decoder takes care of all codes in the range of 1 to N-1 bits and the other one takes care of codes in the range of N to Wcode,max bits. The first symbol decoder has a maximum sampling frequency of fs,max = fclk and the second decoder has a maximum sampling frequency of fs,max = fclk/N. Each of the two decoders can be optimized to handle only a subset of the complete codebook. If N is chosen reasonably small it is possible to have one fast but simple symbol decoder and one more complicated but slower symbol decoder. Different decoding methods can be chosen for the decoders: a fast method for the small decoder and an area efficient solution for the large decoder. Fig. 4 shows an architecture with two symbol decoders.
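The reason the second decoder can run at fclk/N is that two codes of at least N bits can never complete fewer than N cycles apart. The following sketch checks this on an invented sequence of codelengths; the routing function and its names are assumptions made for illustration, not part of the proposed hardware.

```python
def route_codes(code_lengths, n):
    """Return the start cycles seen by the fast and the slow symbol decoder.

    Codes of 1..n-1 bits go to the fast decoder, codes of n bits or more
    to the slow decoder. One input bit is shifted in per clock cycle.
    """
    fast, slow = [], []
    cycle = 0
    for length in code_lengths:
        cycle += length  # the code is complete after `length` shift cycles
        (fast if length < n else slow).append(cycle)
    return fast, slow


def min_spacing(starts):
    """Smallest distance in cycles between two successive start signals."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return min(gaps) if gaps else None


lengths = [3, 1, 1, 4, 3, 1, 5]  # hypothetical codelengths
fast, slow = route_codes(lengths, n=3)
print(fast, min_spacing(fast))
print(slow, min_spacing(slow))
```

With N = 3 the slow decoder is started at most every third cycle, while the fast decoder can still be started in successive cycles.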


The partitioning can be repeated, splitting the symbol decoder into K partitions. If K is chosen to be equal to Wcode,max there will be one dedicated symbol decoder for every codelength, and every symbol decoder operates with a maximum sampling frequency of fs,max = fclk/Wcode,j, where Wcode,j is the length of the code that symbol decoder j is optimized for. The resulting architecture can be seen as a sorter that sorts the codes according to their length, followed by a simplified symbol decoding step. Fig. 5 shows an architecture with the maximally partitioned symbol decoder. The architecture consists of a length decoder with a k-stage pipeline, a buffer with a depth of n, a sorter for sorting the codes and a set of symbol decoders. The size of the buffer can be as low as zero. The control is carried out by counters and logic blocks that check the start conditions for the symbol decoders.

4. DISCUSSION

In this section the advantages and drawbacks of the proposed methods are discussed. The biggest advantage of the loop free pipelined parallel decoder with partitioned symbol decoding is the potential for very fast Huffman decoding at relatively low power consumption. It is fast because the critical length decoder can be pipelined to reach almost arbitrary speed. The power consumption is low because of the partitioned symbol decoding: symbol decoders that are not in use can be put in an idle state, which saves a considerable amount of power if the partitioning is well balanced. Note that using many partitions does not lead to much increase in the control structure, which would consume power. The reduced maximum sampling rates in the symbol decoders also save power, since a lower clock frequency can be used and since more power efficient but slower symbol decoders can be used. Unfortunately, a heavily pipelined length decoder will consume some power, but the length decoding unit is significantly smaller than the symbol decoder unit [2] and therefore accounts for a minor part of the total power.

Figure 4. Partitioned symbol decoders with reduced requirements. [Block diagram: as Fig. 3, but the register feeds two symbol decoders. The decoder for 1 to N-1 bits is started when cnt ≤ N-1 and reset = 1 and runs at fs up to fclk; the decoder for N to Wcode,max bits is started when cnt > N-1 and reset = 1 and runs at fs up to fclk/N.]

There are two types of codebooks that are commonly used. In the MPEG standards the codebook is fixed and can therefore be hardwired into the decoder logic. It is more difficult when the codebook changes from time to time, as is the case in the JPEG image coding standard. However, in this paper we have not discussed the actual realization of either the length decoder or the symbol decoders (although the length decoder must conform to the pipelined model). It should be possible to implement both fixed and dynamic codebooks using the proposed architectures.

5. CONCLUSIONS

In this paper we have discussed different Huffman decoder models and their speed potential. The pipelined parallel decoder model is transformed into a fast loop free architecture by using a shift register as a replacement for the normally used shifting buffer. Further, we have developed an architecture that enables a highly partitioned symbol decoder, which can be used for combining high speed decoding with a power efficient solution. The proposed architectures do not imply that there must be a fixed codebook or that the symbol decoders must be realized in a particular way. Different solutions can be chosen depending on the sampling rate and the size of the codebook.

6. REFERENCES

[1] M. K. Rudberg and L. Wanhammar, "New Approaches to High Speed Huffman Decoding", IEEE Proc. ISCAS ´96, Atlanta, USA, May 1996.

[2] M. K. Rudberg and L. Wanhammar, "Implementation of a Fast MPEG-2 Compliant Huffman Decoder", Proc. EUSIPCO ´96, Trieste, Italy, September 1996.

[3] S. F. Chang and D. G. Messerschmitt, "Designing High-Throughput VLC Decoder Part I - Concurrent VLSI Architectures", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.

[4] H. D. Lin and D. G. Messerschmitt, "Designing High-Throughput VLC Decoder Part II - Parallel Decoding Methods", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

[5] S. Ho and P. Law, "Efficient Hardware Decoding Method for Modified Huffman code", Electronics Letters, Vol. 27, No. 10, pp. 855-856, May 1991.

Figure 5. Fully partitioned loop free pipelined parallel Huffman decoder. [Block diagram: the input stream enters a shift register of n+k bits; the length decoder with k pipeline stages produces one signal per codelength (Lcode = 1 bit up to Lcode = Wcode,max) and resets a counter; a buffer (shift register of n bits) and a sorter distribute the codes to one symbol decoder per codelength. The decoder for Wcode = j bits is started when cnt = j and reset = 1 and runs at fs up to fclk/j, from fclk for one bit codes down to fclk/Wcode,max for the longest codes.]


Paper 4

Design of a JPEG DSP using the Modular Digital Signal Processor Methodology

K.-G. Andersson, Mikael Karlsson Rudberg, and Anders Wass

Proceedings of International Conference on Signal Processing Applications & Technology, ICSPAT’97, San Diego, USA, Sept. 1997.



K-G Andersson (1), Mikael Karlsson Rudberg (1,2), Anders Wass (3)

1) Ericsson Components, Microelectronics Research Center, Stockholm, Sweden.

2) Department of Electrical Engineering, Linköping University, Linköping, Sweden.

3) Ericsson Components, Microelectronics Division, ASIC & ASSP Sector, Stockholm, Sweden.

Abstract

In this paper we present the design of a JPEG decoder using the Modular DSP Methodology (MDSP). It is shown that the MDSP methodology is a powerful tool for doing hardware-software co-design. The hardware resources have been chosen to match the frequently used operations in the JPEG standard to increase performance. The JPEG decoder has been realized using a dual core solution where irregular and static algorithms have been separated.


1. INTRODUCTION

The Modular DSP (MDSP) Methodology is a method for modelling Application Specific DSPs (ASDSP). The MDSP methodology aims at tackling some of the most important issues in bridging the gap from algorithms down to silicon, and at moving the two levels closer [1,2,3].

This paper discusses how the MDSP Methodology was used during the design of a JPEG decoder.

Common to all wideband communication and storage systems is the need for compression of speech, image, data, audio, and video. International organizations, such as CCITT and ISO/IEC JPEG (Joint Photographic Experts Group) [4], have standardized compression algorithms and formats for images. Reprogrammability is important for the adaptation to different applications and markets. A JPEG DSP should contain the arithmetic functions needed for the specific algorithm and should be designed with the appropriate wordlength in the different parts of the architecture. The memory requirements (size, wordlength) and the partitioning of the memory structure have to be considered as well. The JPEG decoder has been modelled to fulfill the CCIR601 requirements.

The JPEG algorithm consists of four stages. Data is transformed to the frequency domain using the Discrete Cosine Transform (DCT), quantized to remove frequencies in the picture that are of minor interest, run-zero encoded to replace sequences of zeroes with a shorter representation, and finally Huffman encoded, which results in a variable length code. The decoding is in principle a reversal of the operations in the encoder. The frame to be encoded is split into blocks of 8x8 pixels that are then individually coded.

2. METHODOLOGY

Why do we see the need for a new methodology, and what problems do we solve with the MDSP methodology?

First of all we see a rapidly growing need to do early design trade-offs and performance estimations. To do that we must have a powerful modelling methodology where different algorithmic and architectural solutions can be quickly evaluated. We also want an environment where the designer's experience is captured, i.e. the environment must provide a high degree of interactivity instead of leaving important tasks such as scheduling and resource allocation entirely to the tools.



Future consumer electronics put requirements on the hardware that today can be hard to fulfill: high speed, high complexity, low power and low cost. To meet these architecture goals it is obvious that the level of integration must be increased, the hardware must be matched to the algorithms and the design process must be shortened. We believe that using the MDSP Methodology for Application Specific DSPs matched to the algorithms is the way to handle increased complexity and reduce power consumption.

2.1. Modelling with the MDSP methodology

An MDSP model provides a bit-true and cycle-true model using a hardware description language called µC. The language is mainly a subset of the C language extended with some features. There are four types of storage elements defined: input ports, output ports, memories and registers. Parallelism has been introduced by redefining the ‘,’-operator in C to mean parallel operations in µC. The statement delimiter ‘;’ in C is redefined to delimit clock cycles in µC.
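The meaning of the two redefined operators can be illustrated with a small model written in Python rather than µC. This is only a sketch: it assumes that all operations within one clock cycle read the register values from before that cycle, as in ordinary synchronous hardware. The paper does not spell this out, and the function and register names are hypothetical.

```python
def run_cycles(registers, program):
    """Execute a list of clock cycles on a register file.

    program -- list of cycles; each cycle is a list of (dest, function)
               pairs that conceptually execute in parallel.
    """
    regs = dict(registers)
    for cycle in program:
        snapshot = dict(regs)  # values visible to every operation this cycle
        for dest, func in cycle:
            regs[dest] = func(snapshot)  # all results commit at the cycle end
    return regs


# Loosely corresponds to two µC-style statements:
#   a = b, b = a;      (one cycle, two parallel moves)
#   acc = a + b;       (the following cycle)
program = [
    [("a", lambda r: r["b"]), ("b", lambda r: r["a"])],
    [("acc", lambda r: r["a"] + r["b"])],
]
final = run_cycles({"a": 1, "b": 2, "acc": 0}, program)
print(final)
```

Under this assumption the first cycle swaps the two registers, which a sequential reading of the same statements in C would not do.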

The architecture and algorithm can be concurrently developed since the µC-model contains the algorithm, the hardware resources (memories and registers) and also, implicitly, ALUs and the control unit. When additional resources are needed they are just added to the model. The µC-model defines a virtual DSP that supports the set of operations actually performed in the µC-code. The final hardware implementation can then have a different architecture as long as it supports the operations contained in the code, i.e. the µC-model defines a minimal architecture.

The tool environment consists of a compiler that generates a simulation model from the µC-code and checks the code against the target architecture to ensure that it can execute the µC program. Furthermore there is a simulator with the capability to cosimulate several µC cores.

3. HARDWARE PARTITIONING

There are two types of algorithms, data dependent and static. The data dependent algorithms are characterized by having many data or parameter dependent processing branches. A typical data dependent algorithm is the parsing and control of a JPEG-coded datastream. There are several types of data blocks that require different kinds of decoding. Most parameters are located at the beginning of the datastream, which must be parsed and then used to select the appropriate decoding algorithm. A typical static algorithm, on the other hand, is the Inverse Discrete Cosine Transform (IDCT), which is a part of the JPEG standard.


These two types of algorithms require different types of architectures. A data dependent algorithm requires hardware that is well supplied with control registers and control operators such as register comparisons and conditional branches, and possibly a stack to enable function calls. The performance of hardware executing data dependent algorithms can to some degree be measured by the ability to efficiently implement compare-and-branch operations. The program memory is often large for the data dependent algorithm due to many alternative processing steps.

The static algorithm is less control oriented. Normally, a static algorithm repeatedly executes a relatively small number of lines. The performance is mainly limited by the degree of parallelism in the hardware.

The JPEG decoder is partitioned into two cores. One is optimized for the data dependent parameter parsing and Huffman decoding. This core also performs most of the control tasks. The other core is the IDCT processor, which implements the static IDCT algorithm in a pipelined, hardware intensive datapath. The partitioning follows naturally from the discussion above, where static and data dependent algorithms require different types of architectures.

3.1. Interface design

Two I/O modes are defined in the MDSP Methodology, parallel and serial. The I/O uses a handshake protocol to simplify synchronization.

The coded datastream is fed byte by byte into the Huffman processor. The decoded pixels are found at the output from the IDCT core. The JPEG datastream contains a quite extensive set of parameters that are of interest when displaying the decoded images. In order to avoid doing a separate decoding of these parameters in the display device, the parameters are decoded in the Huffman core and are then accessible through a DMA port when the parameter memory is not used internally. See Fig. 2.

Figure 1. The MDSP Methodology design flow. [Flow chart: µC model; scheduling and assignment; C++ code generation producing a simulator model; VHDL generation producing a VHDL model; H/W code; user and library (LIB) inputs.]

The internal interface between the cores consists of a parallel data port, Outp_RZ, that outputs run-zero coded data. A synchronization signal, DC, that is activated at the beginning of every block is provided in order to synchronize the two cores. A stop signal halts the Huffman processor when the IDCT core cannot receive data at the required rate.

4. HARDWARE/SOFTWARE TRADE-OFFS

It is important to use the right kind of hardware resources in an architecture. The performance can be significantly reduced if the architecture is register limited, so that variables have to be stored in a memory and then read back into the datapath repeatedly. It is also performance limiting to do multiplication using a shift-add approach, if this is done often. A trade-off must be made between adding dedicated resources to the hardware and solving a problem using the already available hardware and software. It might for instance be more efficient to add an adder and a few registers instead of a multiplier-accumulator in the datapath if the multiply-accumulate operation is seldom used.

4.1. Huffman processor

The Huffman processor core consists of one datapath, two memories and one address processor. The data and address paths are built from MDSP templates. The different parts of the processor can be seen in Fig. 4. There are two memories. One stores the Huffman code book, Run/Size and code length in 29-bit words. The other, smaller memory is used for the storage of quantization tables, temporary data and various parameters used by the JPEG algorithm.

Figure 2. The JPEG MDSP dual core processor. [Block diagram: Huffman processor core and IDCT processor core, clocked by MCLK at 66 MHz. The cores are connected through Outp_RZ/INP_RZ, DC and stop; other signals shown include Reset, INP_St, EI_St, Ready_St, EI_RZ, Data_Ready, block_stat, INT_A, Outp_Pix and the display parameter port.]

In the Huffman core we have chosen to use dedicated hardware to detect a special marker byte (FF) in the datastream. This marker byte can occur anywhere in the datastream, so instead of a time consuming software test, hardware generates a trap that forces a jump to a software routine that handles the marker byte. For efficient header decoding we use comparators and special masking hardware. A barrel shifter is also used. The Huffman decoding is programmed in software using a table look-up technique.
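What the marker detection has to distinguish can be sketched in software. In a JPEG entropy-coded segment, FF followed by 00 represents a data byte FF (byte stuffing), while FF followed by another non-zero, non-FF value is a marker. The Python function below is only an illustration of that rule; its name and return format are assumptions, and it does not describe the trap hardware in the Huffman core.

```python
def split_entropy_data(data):
    """Separate payload bytes from markers in an entropy-coded segment.

    Returns (payload, markers), where markers is a list of
    (offset, marker_code) pairs.
    """
    payload = bytearray()
    markers = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != 0xFF:
            payload.append(byte)  # ordinary data byte
            i += 1
        elif i + 1 >= len(data):
            break  # lone FF at the end of the buffer: more data is needed
        elif data[i + 1] == 0x00:
            payload.append(0xFF)  # stuffed zero: FF is ordinary data
            i += 2
        elif data[i + 1] == 0xFF:
            i += 1  # fill byte preceding a marker
        else:
            markers.append((i, data[i + 1]))  # the case that triggers the trap
            i += 2
    return bytes(payload), markers


stream = bytes([0x12, 0xFF, 0x00, 0x34, 0xFF, 0xD0, 0x56, 0xFF, 0xD9])
payload, markers = split_entropy_data(stream)
print(payload.hex(), markers)
```

Doing this test in software for every input byte would cost several instructions per byte, which is the motivation given in the text for the dedicated detection hardware.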

4.2. IDCT processor

The IDCT processor core contains two memories: an input process stores the input in one memory while the other one is used for calculation. The algorithm used in the IDCT requires 11 multiplications and 29 additions per one dimensional IDCT (1D IDCT) [7]. The 2D IDCT is calculated by doing 8+8=16 1D IDCTs.
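The decomposition of the 2D IDCT into sixteen 1D IDCTs can be sketched as follows. Note that the 1D transform below is the direct definition of the 8-point inverse DCT with the normalization used in JPEG, not the fast algorithm of [7]. The rounding and the omission of clamping to the pixel range are simplifying assumptions.

```python
import math


def idct_1d(coeffs):
    """Direct 8-point inverse DCT (not the fast algorithm of [7])."""
    out = []
    for x in range(8):
        total = 0.0
        for u, value in enumerate(coeffs):
            scale = 1 / math.sqrt(2) if u == 0 else 1.0
            total += scale * value * math.cos((2 * x + 1) * u * math.pi / 16)
        out.append(total / 2)
    return out


def idct_2d(block):
    """block[row][col] holds 64 coefficients; returns 8x8 pixel values."""
    # Pass 1: one 1D IDCT on each of the eight columns.
    cols = [idct_1d([block[r][c] for r in range(8)]) for c in range(8)]
    tmp = [[cols[c][r] for c in range(8)] for r in range(8)]
    # Pass 2: one 1D IDCT on each of the eight rows.
    rows = [idct_1d(tmp[r]) for r in range(8)]
    # The offset of 128 is added when the result is read out.
    return [[round(v) + 128 for v in row] for row in rows]


# A block with only a DC coefficient decodes to a constant block.
dc_only = [[0] * 8 for _ in range(8)]
dc_only[0][0] = 80
pixels = idct_2d(dc_only)
print(pixels[0])
```

With the fast 1D algorithm, the sixteen transforms correspond to 16 x 11 = 176 multiplications and 16 x 29 = 464 additions per block.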

The IDCT core, which executes a static algorithm, has been allocated enough resources to continuously start the processing of a new block every 160th cycle. Three multiply-accumulate elements and four adders are used in the main datapath, and all are fully utilized. The register usage has been optimized for a calculation in three stages with separate register files.

The Huffman core delivers run-zero coded data in zig-zag order. The data is expanded and written into a memory. The datapath consists of three multiply/accumulate blocks (macc) and five adders (see Fig. 3). The IDCT on an 8x8 block is performed by doing a 1-dimensional IDCT on each column followed by a 1-dimensional IDCT on each of the eight rows. An offset of 128 is added during the read out stage.
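The expansion of run-zero coded data into an 8x8 block can be sketched as below. The zig-zag scan is the one defined by the JPEG standard, but the exact word format on the Outp_RZ port is not given in the paper, so the (zero run, value) pairs used here are an assumption for illustration.

```python
# Zig-zag scan order of an 8x8 block: anti-diagonals are traversed
# alternately upwards and downwards, starting at the DC coefficient.
ZIGZAG = sorted(
    ((r, c) for r in range(8) for c in range(8)),
    key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
)


def expand_run_zero(pairs):
    """Expand (zero_run, value) pairs into an 8x8 coefficient block.

    Coefficients not covered by any pair remain zero.
    """
    block = [[0] * 8 for _ in range(8)]
    pos = 0
    for run, value in pairs:
        pos += run  # skip `run` zero coefficients
        if pos >= 64:
            raise ValueError("run-zero data exceeds 64 coefficients")
        row, col = ZIGZAG[pos]
        block[row][col] = value
        pos += 1
    return block


block = expand_run_zero([(0, 80), (1, -3), (2, 5)])
print(ZIGZAG[:6])
print(block[0][:3], block[1][:3])
```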

5. CONCLUSIONS AND FURTHER WORK

In this paper we have presented the MDSP modelling methodology for Application Specific DSPs. The methodology has been found to be an efficient way of performing hardware-software co-design, since the hardware and software are developed in the same model and consequently also simulated and verified in the same design environment. A JPEG decoder with two processor cores, modelled using the MDSP Methodology, has been introduced to show the strength of the MDSP concept. The capability to achieve efficient architectures even for complex, irregular algorithms such as the JPEG standard has been demonstrated.

The MDSP Methodology is today used in the regular design flow at Ericsson Components. The design environment is continuously evolving. Features are added to the modelling language and the methodology when needed by new design projects.

6. REFERENCES

[1] K-G Andersson, Anders Wass, Karam Parmar: A Methodology for Implementation of Modular digital Signal Processors, ICSPAT ’96, Boston, MA, Oct. 7-10, 1996.

[2] K-G Andersson: A Design Environment for Modular-dsp Architectures, Electronic Design Autom. Conf., Kista, Stockholm, March 15, 1994.

[3] K-G Andersson: Implementation and Modeling of Modular Digital Signal Processors, LiU-Tek-Lic-1997:09, Department of Electrical Engineering, Linköping University, March 1997.

[4] ISO/IEC 10918-1: Digital compression and coding of continuous-tone still images, 1994-02-15.

[5] C. Liem, T May, P Pauline: Instruction-Set Matching and Selection for DSP and ASIP Code Generation, Proceedings of the European Design and Test Conference, February 1994.

[6] Gert Goossens, Jan Rabaey, Joos Vandewalle, Hugo De Man: An Efficient Microcode Compiler for Applications Specific DSP Processors, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 9, September 1990.

[7] Z. Wang: Fast Algorithms for the Discrete Cosine Transform and for the Discrete Fourier Transform, IEEE Transactions on ASSP, Vol. ASSP-32, No. 4, pp. 803-816, Aug. 1984.

Figure 3. IDCT processor datapath. [Block diagram: run-zero data from INP_RZ is written to the memories; the datapath has three stages (stage 1 to stage 3) with separate register files, three macc units and adders; the offset 128 is added at the output register feeding Outp_Pix.]

Figure 4. Huffman processor datapath. [Block diagram: ALU with accumulator, barrel shifter (BSH), mask and compare logic, FF detection with trap to the control unit (CU), address processor (AP), status flags, input port Inp_St and output port Outp_RZ. Memories: Jpeg_mem, 320 x 8 (4 Q-tab + 64), and Huff_tab, 858 x 29 (4 DC + 4 AC).]


Paper 5

Design and Implementation of an FFT Processor for VDSL

Mikael Karlsson Rudberg, Martin Sandberg, and Kent Ekholm

Proceedings of IEEE Asia-Pacific Conference on Circuits and System, APCCAS’98, Chiangmai, Thailand, Nov. 1998.



Mikael Karlsson Rudberg, Martin Sandberg, Kent Ekholm

Ericsson Components AB, 164 81 Kista, SWEDEN.

Phone +46 8 757 5295, Fax +46 8 757 5032

Abstract

In this paper we present an implementation of an FFT processor for VDSL applications. Since no standard is yet available for VDSL, high requirements on flexibility were put on the design. A concurrent hardware and software design methodology made it possible to trade between hardware and software realizations in order to get an efficient and flexible architecture.

1. INTRODUCTION

The Fast Fourier Transform (FFT) is an efficient way of calculating the discrete Fourier transform and is often used in multicarrier communication systems. The FFT processor presented in this paper is aimed at VDSL (Very high speed Digital Subscriber Line) applications. VDSL is one of the candidates for providing wideband communication capabilities to the consumers, Fig. 1. VDSL systems use the already installed base of twisted pair copper cables for the last few hundred meters to the homes. The VDSL system is a multicarrier system, which is attractive for wideband transmission because of its capability to adapt to different channel characteristics.

Today there is no standard for VDSL and there are several candidates that put different requirements on the FFT processing. Currently the number of carriers is unspecified, and it is still uncertain whether data shall be transmitted time multiplexed or frequency multiplexed over the channel. This uncertainty made it important to have a programmable FFT processor. The computational requirements excluded a solution with standard DSPs. Therefore the FFT has been realized as an application specific signal processor (ASSP) targeting FFT processing.

The worst case processing requirement that can be handled is two streams of continuous 50 MHz real input data with simultaneous processing of both FFTs and IFFTs with lengths up to 1024 points. One of the output ports is equipped with a multiplier that can be used as a frequency equalizer. A cyclic prefix of arbitrary length can be added at the output of the IFFT and automatically discarded at the input of the FFT.
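The cyclic prefix handling mentioned above amounts to copying the tail of each IFFT output block to its front at the transmitter and dropping the same number of samples at the receiver. A minimal sketch, with hypothetical function names and a toy block length:

```python
def add_cyclic_prefix(symbol, prefix_len):
    """Prepend the last prefix_len samples of the IFFT output block."""
    if not 0 <= prefix_len <= len(symbol):
        raise ValueError("prefix length must be between 0 and the block length")
    return symbol[len(symbol) - prefix_len:] + symbol


def remove_cyclic_prefix(frame, prefix_len):
    """Discard the prefix before the block is passed to the FFT."""
    return frame[prefix_len:]


block = [1, 2, 3, 4, 5, 6]  # stands in for one IFFT output block
frame = add_cyclic_prefix(block, 2)
print(frame)
print(remove_cyclic_prefix(frame, 2))
```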

2. ALGORITHM

The implemented algorithm is a well known decimation in frequency radix-4 FFT algorithm [1]. The primitive operation is the radix-4 butterfly shown in Fig. 2.
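For reference, a radix-4 decimation in frequency FFT can be sketched in a few lines of Python. This is a textbook formulation under stated assumptions: it is recursive, works in floating point, returns the output in natural order and only handles lengths that are powers of four, so it says nothing about the scheduling or wordlengths of the processor. The function names are hypothetical.

```python
import cmath


def radix4_butterfly(x0, x1, x2, x3, w1, w2, w3):
    """Radix-4 DIF butterfly; w1, w2, w3 are the twiddles W^p, W^2p, W^3p."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, -1j * (x1 - x3)
    return a + c, (b + d) * w1, (a - c) * w2, (b - d) * w3


def fft_radix4_dif(x):
    """Recursive radix-4 DIF FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    branches = [[], [], [], []]
    for p in range(q):
        w1 = cmath.exp(-2j * cmath.pi * p / n)
        ys = radix4_butterfly(x[p], x[p + q], x[p + 2 * q], x[p + 3 * q],
                              w1, w1 ** 2, w1 ** 3)
        for branch, y in zip(branches, ys):
            branch.append(y)
    out = [0j] * n
    for r, branch in enumerate(branches):
        out[r::4] = fft_radix4_dif(branch)  # branch r yields X[4k + r]
    return out


def dft(x):
    """Direct O(N^2) DFT used as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


samples = [complex(i % 5, (3 * i) % 4) for i in range(16)]
error = max(abs(a - b) for a, b in zip(fft_radix4_dif(samples), dft(samples)))
print(error < 1e-9)
```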

Since the input data to the FFT and the output data from the IFFT are real valued and the FFT is a complex transform, it is possible to calculate a 2048 point FFT by first doing a 1024 point complex FFT and then performing a separation pass. The result is the same as if a full 2048 point FFT had been calculated [2]. This extra separation pass has a structure close, but not identical, to a radix-2 butterfly. To be able to support FFT lengths other than 4^n, radix-2 butterflies also have to be supported.

Figure 1. VDSL transmission system.
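The separation pass can be sketched as follows, using the standard technique of packing even and odd real samples into the real and imaginary parts of a half-length complex sequence. A direct DFT stands in for the N-point complex FFT, and the function names are hypothetical. The sketch shows the arithmetic only, not how the processor schedules it.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT; stands in for the N-point complex FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def real_fft_via_half_length(x):
    """DFT of a real sequence of even length 2N using one N-point transform."""
    n = len(x) // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    z = [complex(x[2 * m], x[2 * m + 1]) for m in range(n)]
    zf = dft(z)
    out = [0j] * (2 * n)
    for k in range(n):
        mirrored = zf[(n - k) % n].conjugate()
        even = (zf[k] + mirrored) / 2    # transform of the even samples
        odd = (zf[k] - mirrored) / 2j    # transform of the odd samples
        w = cmath.exp(-2j * cmath.pi * k / (2 * n))
        # Separation pass: resembles a radix-2 butterfly, but needs Z[k] and Z[N-k].
        out[k] = even + w * odd
        out[k + n] = even - w * odd
    return out


signal = [float((3 * i) % 7 - 3) for i in range(16)]
error = max(abs(a - b) for a, b in zip(real_fft_via_half_length(signal), dft(signal)))
print(error < 1e-9)
```

The dependence on both Z[k] and Z[N-k] is one way to see why the separation pass is close to, but not identical to, a radix-2 butterfly.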

3. DESIGN FLOW

The FFT project is the first project where Ericsson’s Modular DSP methodology (MDSP) has been fully used throughout the entire design. The design methodology is aimed at programmable ASSPs and has previously been reported in [3]. A case study, resulting in a JPEG decoder architecture that was never implemented, was made in [4], and we have also studied other algorithms.

The methodology encourages the designer to do trade-offs between hardware and software realizations by offering a unified design environment and modeling language for both the hardware and the software.

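Figure 2. Radix-4 decimation in frequency butterfly. [Signal flow graph: inputs x(0), x(1), x(2), x(3); adders and subtractors with a multiplication by -j; twiddle factors W^p, W^2p and W^3p; outputs X(0), X(2), X(1), X(3).]

Figure 3. Design flow. [Flow chart: specification; µC model (HW and SW); RTL with formal verification; ASIC implementation; µ code generation; library (LIB).]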

The architecture and the application program evolve concurrently from the requirements that are put on the application. The description language, µC, is derived from the C language with some modifications. An RTL description of the hardware is manually or automatically derived from the application program. The software can be refined after the hardware extraction, and a formal verification tool is available to ensure that the application program can still be executed on the architecture. See Fig. 3 for the design flow. An example of the design language is given in Fig. 4 below.

The RTL description is taken to a traditional ASIC flow, and the microcode that shall be run on the processor is generated by a compiler.

It is important to note that translating the µC model to RTL and microcode is a mapping process. Information about resource allocation and scheduling is found in the model. The advantage of this approach is that the designer has full control of both the architecture and the scheduling, and can therefore get maximal performance from the design.

The key benefits with the design methodology are:

• An effective design language which enables short design time.
• Concurrent modeling of hardware and software.
• Fast simulation compared with Verilog and VHDL.
• Results in a programmable DSP architecture that can be re-programmed after processing.

4. DESIGN SPACE EXPLORATION

The specification for the FFT processor has been changed several times during the project. The reasons are the lack of a standard for VDSL and that the implementation work was done concurrently with the design of the entire modem. It was essential to have a methodology that made it possible to quickly try different design alternatives. This requires an effective modeling language and a fast simulator.

One of the issues evaluated was, for instance, where to handle the cyclic prefix. From being realized in pure hardware it was moved to the software in the DSP core. This reduced the complexity and increased the flexibility, although a reboot is required to change the cyclic prefix. The memory mapping functionality that resided in the DSP hardware and software turned out to be more efficient to map directly to hardware in the memory system.



After an FFT calculation the memory must be rearranged, since data is stored in bit reversed address order (i.e. output data number 01011 in binary is stored at location 11010 in binary). This is normally done either by making a rearrangement pass in the DSP or by performing I/O in bit reversed order. In this case there is a continuous stream of both input and output data. Reading data in bit reversed order meant that the input data was also stored in bit reversed order. To hide this from the address generators in the DSP there are two addressing modes, normal and bit reversed. In bit reversed mode the actual address fed to the memory is the bit reversed address.
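The two addressing modes can be sketched as a small address translation function. The names are hypothetical and the address width is a parameter; the example reproduces the five bit case given in the text.

```python
def bit_reverse(address, width):
    """Reverse the order of the `width` least significant address bits."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (address & 1)
        address >>= 1
    return result


def memory_address(address, width, bit_reversed_mode):
    """Address presented to the memory in normal or bit reversed mode."""
    return bit_reverse(address, width) if bit_reversed_mode else address


print(format(bit_reverse(0b01011, 5), "05b"))
```

Since reversing twice gives back the original address, the same translation serves both for writing and for reading in bit reversed mode.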

5. ARCHITECTURE

The FFT processor is divided into five cores: two FFT datapaths, two IO blocks and one memory system, see Fig. 5. The IO blocks handle the input and output of data. They are not programmable, but are parametrized to handle different cyclic prefix and FFT lengths. The memory system contains six sets of memories, where each memory set contains 1024 complex words. The datapath blocks perform the actual FFT calculations and the control functionality. The two datapaths are identical and operate individually. All communication between the blocks is from register to register.

Figure 4. µC description of simple squaring DSP with on chip memory.

    INPUT in(10);
    OUTPUT out(10);
    REG acc(10);
    REG cnt;
    RAM mem(16,10); // 16 words, 10 bits wide

    void main()
    {
      // init
      acc=0;
      // fill memory (valid addresses are 0 to 15)
      for(cnt=0; cnt<16; cnt++) mem[cnt]=in;
      // calculate square
      for(cnt=0; cnt<16; cnt++)
        out=mem[cnt]*mem[cnt];
    }

The internal clock is generated on chip and operates at two or four times the external clock. The maximum internal clock rate is 100 MHz. The application program is loaded at boot time through a bit serial port using a separate program loading clock.

The wordlength used is 18 bits for data and 16 bits for the coefficients (W^p in Fig. 2).

5.1. IO

The IO core consists of an address generator and a complex multiplier, which was included as part of an equalizer that was needed in the application. The IO core is controlled from the datapath block.

5.2. Memory

The memory block has four read ports and four write ports. There are six memory sets with two physical memories, enabling concurrent read and write accesses using single port memories. The use of six memory sets is forced by the proposed time multiplexed transmission method, which requires buffering.

5.3. Datapath

The datapath block consists of a datapath for the calculations, three address generators, a control block and a control unit; see also Fig. 6, where the calculation unit of the datapath block is outlined.

Figure 5. Block partitioning. [Block diagram: a memory system connected to two FFT datapaths (A and B) and two I/O blocks (I/O A and I/O B), which serve Port A and Port B.]

The datapath is a complex datapath where all instructions operate on complex words. The instructions are of Very Long Instruction Word (VLIW) type, i.e. all control signals are stored in the program memory and no instruction decoding is necessary. The benefit is that the parallelism in the hardware can be fully exploited, which allows a high degree of hardware utilization.

The drawback with a VLIW architecture is the inefficient memory usage sinceeven seldom used control signals have their own position in the instruction wordand valuable memory space is wasted. In this project we introduced a hybridsolution where some instruction decoding where made for seldom used instruc-tions, i.e. control type of instructions. The DSP also have different modes whichenabled us to multiplex some control signals. These tricks with the control sig-nals gave a 30-50% decrease in the instruction word width which finally endedup in 144 bits. The ASSP approach resulted into an architecture with a resourceutilization of 90% in the address generators and the actual datapath.

The critical loop in an FFT calculation is the calculation of a radix-4 butterfly. Inour architecture the can be done in four clock cycles. To achieve this a hardwiredloop controller is included as well as parameter registers that control some opera-tions in the datapath during execution of the critical loops (e.g. offset registers foraddress calculation).
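To make the critical loop concrete, the following is a minimal behavioural sketch of a radix-4 butterfly on complex words. It is not the processor's microcode: the function names, the decimation-in-time form and the test values are assumptions chosen for illustration, and the sketch says nothing about how the operations are scheduled over the four clock cycles.

```python
import cmath

def radix4_butterfly(x0, x1, x2, x3, w1=1, w2=1, w3=1):
    """Radix-4 decimation-in-time butterfly on four complex words.

    The twiddle factors w1..w3 scale the inputs x1..x3 (three complex
    multiplications); the rest is additions and rotations by -j.
    """
    a, b, c, d = x0, x1 * w1, x2 * w2, x3 * w3
    s02, d02 = a + c, a - c          # first adder stage
    s13, d13 = b + d, b - d
    return (s02 + s13,               # y0 = a + b + c + d
            d02 - 1j * d13,          # y1 = a - jb - c + jd
            s02 - s13,               # y2 = a - b + c - d
            d02 + 1j * d13)          # y3 = a + jb - c - jd

def dft4(x):
    """Direct 4-point DFT, used only as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]

# Illustrative input words (made-up values).
x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
y = radix4_butterfly(*x)
ref = dft4(x)
max_err = max(abs(p - q) for p, q in zip(y, ref))
```

With unit twiddle factors the butterfly is a 4-point DFT, which is what the comparison against `dft4` checks. The multiplications by ±j only swap and negate real and imaginary parts, so the only true multiplications are the three twiddle products.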

6. IMPLEMENTATION

The implementation has been made in a 0.35 µm process. The design consists mainly of standard cells, memories, and a PLL. The complex multipliers had to be internally pipelined in two stages to reach sufficient speed.

Necessary for the success of this project was the availability of a good timing-driven place and route tool. In a 0.35 µm process the wire capacitance contributes too much to the delays to give a good correlation between the delays estimated by the synthesis tool and those of the actual layout. A photo of the final chip is given in Fig. 7.

6.1. Key data

Table 1 below summarizes some key data of the FFT processor. The chip has been successfully tested.

Figure 6. Datapath outline.

Parameter                                  Value
Power supply                               3.3 V
Power consumption                          3 W
Chip size: active area                     46 mm²
Chip size: total area                      56 mm²
Process                                    UMC 0.35 µm
On-chip memory                             360 Kbit
Number of gates (memory excluded)          ~150000
Maximum clock rate                         100 MHz
Computation time for a 2048 point FFT
with real input data (using one of the
two datapaths)                             80 µs

Table 1. FFT processor data.

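[Block labels of the datapath outline in Fig. 6: four register files, two add/subtract units (+/-), a multiplier (*), a ROM, three address calculation units (including read and write address generation), and a controller; data enters from memory and is returned to memory.]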

7. CONCLUSIONS

In this paper the design and implementation of a high performance FFT processor has been described. A new design methodology with concurrent hardware and software development has been proven to work. It has been shown possible to design and implement an ASSP, starting without a specification, in a short time period using the MDSP design flow.

8. REFERENCES

[1] Gentleman W. M. and Sande G., “Fast Fourier Transform for Fun and Profit”, Proc. 1966 Fall Joint Computer Conf. (AFIPS), Vol. 29, pp. 563-578, Washington DC, Spartan, Nov. 1966.

[2] Brigham, “The Fast Fourier Transform and its Applications”, Prentice Hall, 1988.

[3] K-G Andersson, “Implementation and Modeling of Modular Digital Signal Processors”, LiU-Tek-Lic-1997:09, Department of Electrical Engineering, Linköping University, March 1997.

[4] K-G Andersson, Mikael Karlsson Rudberg, and Anders Wass, “Design of a JPEG DSP using the Modular Digital Signal Processor Methodology”, ICSPAT ’97, San Diego, CA, USA, Sept. 14-17, 1997.

Figure 7. FFT chip. [Chip photo with the standard cell area, the program memory, the data memory and the PLL marked.]

Paper 6 - Application Driven DSP Hardware Synthesis


Mikael Karlsson Rudberg and Mikael Hjelm

Proceedings of IEEE Nordic Signal Processing Symposium, NORSIG’00, Kolmården, Sweden, June 2000.



Mikael Karlsson Rudberg (1,2) and Mikael Hjelm (1)

1) Ericsson Microelectronics AB, 164 81 Kista, SWEDEN

2) Department of Electrical Engineering, Linköping University, S-581 83 Linköping, SWEDEN


ABSTRACT

In this paper we present a synthesis tool aimed at application specific DSP processors. The purpose of the presented work has been to develop a tool where it is easy for a designer to try different approaches in order to achieve a well balanced architecture. In the paper we discuss the algorithms in the tool and show, by example, the intended way of operation.

1. INTRODUCTION

DSP processing in modern communication systems is today normally carried out either in programmable DSP processors or in dedicated ASICs with little or no programmability.

An ASIC solution offers high performance in terms of processing power and power consumption. The ASIC is targeted at one or a few tasks and can therefore be optimized to meet the desired computational requirements, memory bandwidth etc.

The DSP processor is made to support a wider range of applications and must therefore have an extensive instruction set and more on-board memory. In this paper we focus on applications that require a high degree of flexibility, but within a given application. Examples of such applications include FFT processing, Viterbi decoding and Reed-Solomon decoding. Each of these examples exists in several variants, working with different block sizes etc.

For these applications we want to be able to evaluate various instruction sets as well as different degrees of parallelism in a hardware-software co-design process. Instead of an automatic synthesis tool we need an interactive environment that gives the designer the opportunity to describe different architectures in an efficient way and then quickly get the resulting netlist.

In this paper we show a solution that makes it possible to synthesize a DSP processor from an executable, cycle true model of a processor and the application. This is done using a synthesis tool where most of the design choices are made by the designer.

2. RELATED WORK

Synthesis of DSP processors has been studied by several groups. The main difference between our approach and previously reported synthesis systems, for instance [1,2], is that instead of having advanced algorithms in the tool we leave most of the design choices to the designer. That is, the designer is used as the intelligent component in the system and the synthesis tool just performs the hard work.

3. SYNTHESIS FRAMEWORK

The synthesis tool has been designed to fit into the MDSP design flow, a design methodology that allows the designer to use a C-like description language called µC for defining the DSP [3,4]. The synthesis tool takes the cycle true µC model as input and gives as output an architecture that is able to execute the algorithms described in the model. The program memory image is then created using other tools in the framework. The generated architecture is later passed to a VHDL compiler in order to generate a netlist suitable for the layout tool, Fig. 1.

4. THE DSP SYNTHESIS TOOL

The synthesis tool takes as input the simulation model written in µC, which contains all information about the desired instruction set as well as the desired parallelism. What is not given in the description is how many ALUs are wanted and of which types. The description also lacks explicit information about how the resources should be connected together.

4.1. Target architecture

The DSP synthesis tool has a target architecture consisting of a number of ALUs, register files, memories, I/Os, buses, a control unit and a program memory, Fig. 2. One difference between a general purpose DSP architecture and a dedicated one is how registers are used. A general purpose DSP has large register banks with general purpose register files and ALUs that can be used for everything from addressing to normal data processing. In a dedicated architecture it is possible to have dedicated registers and ALUs for addressing and for different types of data. Therefore we have chosen to target an architecture with dedicated resources for different types of tasks in the DSP.

4.2. Synthesis library

The synthesis tool maps the µC description to structural VHDL containing primitives found in a synthesis library. The library consists of registers, memories, various types of I/O blocks and arithmetic logic units (ALUs).

The I/Os are either plain registers, an asynchronous port that communicates using handshaking, or user defined I/O. The RAM has separate read and write buses, which is common in many on-chip RAMs.

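Figure 1. Synthesis design flow. [Flow chart: the µC model is fed to the synthesis step, which uses the synthesis library (lib) and produces VHDL; the VHDL is passed to a standard ASIC design flow that gives the netlist. A program image generator produces the DSP program.]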

There is also a control unit available in the library that supports a set of instructions such as jump, conditional jump and sequential execution. All control signals needed in the datapath are taken directly from the control unit. The instruction decoding is supposed to be made inside the control unit and is not done by the synthesis tool.

Any kind of functional block can easily be included in the synthesis library by describing the block in VHDL and adding a description where the supported instructions are listed.

4.3. Synthesis

The synthesis process is divided into a number of stages that analyze the resource need and then create an architecture that is matched to the algorithms to implement.

In the first stage the µC model is analyzed to find out which hardware is explicitly declared, i.e. all memories and registers. Secondly, the tool analyzes which operations are made in the program flow. The source and destination registers for each instruction are also stored.

In the third stage the operations are mapped to ALUs. This can basically be done in two ways, realizing either a minimal architecture with as few ALUs as possible, or a maximal architecture where little or no resource sharing is made. A minimal architecture will require ALUs supporting many instructions, while a maximal architecture gives many, but simple, ALUs.

This tool creates an architecture where each destination register gets a dedicated ALU. The ALU is chosen from the synthesis library by finding the ALU that supports all operations that have the given register as target register. Hence, the resulting architecture will have one ALU attached to each register. In order to optimize the architecture the number of ALUs must be reduced. Typically, each ALU should have a number of registers attached to it, i.e. a register file. Therefore it is possible to define register files, telling the synthesis tool to attach the same ALU to each register in the register file, Fig. 3.

Figure 2. Target architecture. [Block diagram: a datapath with register files, operations (op1, op2, ...), an immediate operand input (imm op), RAM and I/O; a control unit that receives status signals from the datapath and issues control signals; and a program memory connected to the control unit through address and instruction signals.]

Normally the selected ALU is the one that most closely matches the needed instruction set; however, it is also possible to control the tool such that the most power efficient, the most area efficient or even the fastest ALU is chosen. This is accomplished by storing relative power, area and speed weights in the synthesis library. Finally, the interconnections are created by analyzing the model in order to see which blocks have to be able to communicate. Only necessary communication paths are created. The communication paths contain only multiplexers and wires, i.e. no tri-state buses are used.
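The ALU selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the tool's implementation: the library contents, the weight values and the names `LIBRARY`, `required_ops` and `select_alu` are hypothetical, and "closest match" is interpreted here as the ALU with the fewest unused operations.

```python
# Hypothetical synthesis library: ALU name -> supported operations and
# relative weights (all names and numbers are illustrative).
LIBRARY = {
    "alu_add":    {"ops": {"+"},           "power": 10, "size": 10, "delay": 10},
    "alu_sub":    {"ops": {"-"},           "power": 10, "size": 10, "delay": 10},
    "alu_addsub": {"ops": {"+", "-"},      "power": 14, "size": 13, "delay": 11},
    "alu_full":   {"ops": {"+", "-", "*"}, "power": 60, "size": 70, "delay": 25},
}

def required_ops(assignments, register_files=()):
    """Collect the operations per destination register (or register file)."""
    group_of = {reg: tuple(rf) for rf in register_files for reg in rf}
    needs = {}
    for dest, op in assignments:
        key = group_of.get(dest, (dest,))
        needs.setdefault(key, set()).add(op)
    return needs

def select_alu(ops, library=LIBRARY, criterion=None):
    """Pick an ALU that supports all of ops.

    Default: the closest match (fewest unused operations). With a
    criterion such as "power", "size" or "delay": the lowest weight.
    """
    candidates = [name for name, alu in library.items() if ops <= alu["ops"]]
    if not candidates:
        raise LookupError(f"no ALU in the library supports {sorted(ops)}")
    if criterion is None:
        return min(candidates, key=lambda n: (len(library[n]["ops"] - ops), n))
    return min(candidates, key=lambda n: (library[n][criterion], n))

# acc1 = reg_a + reg_b; acc2 = reg_a - reg_b;  (the example of Fig. 3)
model = [("acc1", "+"), ("acc2", "-")]
separate = {k: select_alu(v) for k, v in required_ops(model).items()}
shared = {k: select_alu(v)
          for k, v in required_ops(model, [("acc1", "acc2")]).items()}
```

Without a register file declaration each accumulator gets its own single-operation ALU; declaring `acc1` and `acc2` as one register file makes the sketch pick a single add/subtract ALU for both, which mirrors the two cases of Fig. 3.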

4.3.1. User control

The degree of interactivity during the design process is intended to be high. The main goal has been to provide a tool that makes it easy for the designer to get the intended architecture. Therefore the tool contains little inherent intelligence, but is easy to control by flags fed to the synthesis tool, by modifying the synthesis library and/or by rewriting the model.

Figure 3. Synthesis of ALUs for registers and register files. [Diagram: synthesis of acc1=reg_a+reg_b; acc2=reg_a-reg_b; becomes, with normal synthesis, one ALU (+) for acc1 and one ALU (-) for acc2, both fed from reg_a and reg_b; with acc1 and acc2 declared as a register file it becomes a single ALU (+,-) shared by acc1 and acc2.]

The type of ALU chosen for a given register may not be the one the designer wants. The type of ALU can therefore be explicitly assigned using a configuration file as input to the synthesis tool. In this way it is possible to add a more powerful ALU that supports more instructions than required by the present application.

5. EXAMPLE

This section gives an example of how to use our synthesis tool in the design flow.

Fig. 4 gives an example of µC code for a 32 tap FIR filter. Passing this description through the synthesis tool without any ALUs declared gives the architecture shown in Fig. 5 (control unit excluded). The tool creates an architecture that can execute the given task, and nothing more. In order to achieve an implementation that is easier to reuse, for instance if we want to support any filter length up to 32 taps, the instruction set has to be extended. This has to be done such that it becomes possible to realize an addressing scheme other than modulo 32.

To realize this we may for instance include circular buffers for the calculation of data and coefficient addresses. Since a circular buffer may be useful in the future, we decide to add a circular buffer ALU to our synthesis library and then instantiate it in the µC model. Fig. 6 shows how to change the µC model and what to add to the synthesis library. The new, more general datapath is shown in Fig. 7.
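The effect of circular addressing on the FIR computation can be checked with a small behavioural model. The sketch below is written in Python rather than µC and rests on assumptions: `circ_add(step, addr, length)` is taken to mean a modular address increment, the loop counter handling is simplified to a plain loop over `firl - 1` taps, and the coefficient and sample values are made up.

```python
def circ_add(step, addr, length):
    """Assumed semantics of circ_add: modular address increment."""
    return (addr + step) % length

class CircularFir:
    """Behavioural model of compfir() with circular addressing."""

    def __init__(self, coeffs, firl):
        self.c = list(coeffs)      # ROM c
        self.firl = firl           # register firl: filter length in use
        self.d = [0] * firl        # RAM d (only firl words are touched)
        self.da = 0                # data address register

    def step(self, sample):
        self.d[self.da] = sample                  # d[da] = inp
        acc, ca = 0, 0
        for _ in range(self.firl - 1):            # loop body of compfir
            acc += self.d[self.da] * self.c[ca]
            self.da = circ_add(1, self.da, self.firl)
            ca = circ_add(1, ca, self.firl)
        acc += self.d[self.da] * self.c[ca]       # last tap, da not advanced
        return acc

def direct_fir(coeffs, samples):
    """Reference: y[n] = sum_k c[k] * x[n-k] with zero initial state."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]

coeffs = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]   # firl = 12 taps
samples = list(range(1, 31))
fir = CircularFir(coeffs, firl=12)
outputs = [fir.step(x) for x in samples]
```

Because the data address is advanced `firl - 1` times per output and left untouched on the last tap, it ends up one position before the newest sample, so the next input overwrites the oldest one. The comparison with `direct_fir` confirms that the model produces the ordinary FIR output for a length other than 32.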

6. FUTURE WORK

The implemented heuristic, with an ALU selection based on target registers, leads to an architecture that normally works well for dedicated DSPs. In modern general purpose DSPs there is normally a number of parallel ALUs connected to one register file. This is an architecture that cannot be supported in the present version of the synthesis tool. One of the problems with the synthesis of such an architecture is that it is difficult to decide which instruction to put in which ALU. In order to decide how many ALUs to attach to a register file, the parallelism within the register file has to be analyzed.

The parallel ALU problem can today be worked around by explicitly instantiating the ALUs in the µC model. A smoother way, making it easier to experiment with different solutions, would however be preferred.

The instruction coding is today put into the control unit, which is just instantiated by the synthesis tool. A future extension would be to include an instruction coding stage in the tool in order to further reduce the design effort.

Figure 4. µC model of 32 tap FIR filter.

// Declaration part
MDSP fir
{

  INPUT inp(14, PARALLEL);          // input port, 14 bits
  OUTPUT outp(14, PARALLEL);        // output port, 14 bits
  REG acc(30), i(6), ca(5), da(5);  // different registers
  RAM d(32,16);                     // RAM with 32 16 bit words
  ROM c(32,16, "rom.data");         // ROM

  PROCEDURE compfir ();             // procedure declaration
}

// Code part

PROCEDURE main()
{
  for(;;){                          // loop forever
    do {;} while(!inpF) ;           // While no input on the input
                                    // port inp do nothing
    inpF=0, d[da]=inp;              // Reset input by setting inpF=0,
                                    // store inp in RAM. "," means that
                                    // this is made in parallel

    compfir();                      // call procedure compfir
    outp=acc;                       // place the value of acc on outp port
  }
}
PROCEDURE compfir()                 // compute fir
{
  acc=0,ca=0;

  i=30;
  do {
    acc+=d[da++]*c[ca++],
    i--;
  } while (i>0)
  acc+=d[da]*c[ca++];
  return;
}

7. CONCLUSIONS

In this paper we have demonstrated a synthesis tool where a µC model is translated to a DSP processor. The strength of the tool is not its optimization routines, since they do not contain anything advanced. Instead we have shown a design flow, using the synthesis tool, where it becomes easy for a designer to evaluate different architectures. We have, by an example, shown how an FIR filter can be synthesized and redesigned to support a wider application without too much work. There are things that can be improved, such as the user interface. The tool is so far just a prototype showing that it is possible to work as described.

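Figure 5. Datapath for 32 tap FIR filter. [Datapath diagram showing the registers ca, da, i and acc, ROM c, RAM d, the ports inp and outp, a multiplier, ALUs supporting {+, pass}, {+, -} and {+, -, pass}, an immediate operand input (imm op), and a comparison (> 0) whose result goes to the control unit.]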

Figure 6. Changes to support arbitrary circular addressing.

The µC model is changed such that

acc+=d[da++]*c[ca++]

is replaced by:

acc+=d[da]*c[ca], da=circ_add(1,da,firl), ca=circ_add(1,ca,firl);

and

acc+=d[da]*c[ca++];

is replaced by:

acc+=d[da]*c[ca], ca=circ_add(1,ca,firl);

A register called firl that holds the wanted filter length is created, and the line

firl=12;

is added to the model.

The VHDL description of the circ_add block is stored in the synthesis library, and the following is added to the library description file:

ALU_circ_add:
  operations: circ_add
  weight: power=100, size=100, delay=100, default=100

8. REFERENCES

[1] T. Hollstein, J. Becker, A. Kirschbaum, M. Glesner, “HiPART: A New Hierarchical Semi-Interactive HW-/SW Partitioning Approach with Fast Debugging for Real-Time Embedded Applications”, Proc. of Workshop on Hardware/Software Codesign, CODES/CASHE’98, March 1998.

[2] P. Duncan, et al., “HI-PASS: A Computer-aided Synthesis System for Maximally Parallel Digital Signal Processing ASICs”, Proc. of IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, ICASSP’92, March 1992.

[3] K-G Andersson, “Implementation and Modeling of Modular Digital Signal Processors”, LiU-Tek-Lic-1997:09, Department of Electrical Engineering, Linköping University, March 1997.

[4] K-G Andersson, Mikael Karlsson Rudberg, and Anders Wass, “Design of a JPEG DSP using the Modular Digital Signal Processor Methodology”, Proc. of ICSPAT ’97, San Diego, CA, USA, Sept. 14-17, 1997.

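Figure 7. Datapath for a programmable FIR filter. [Datapath diagram: the same blocks as in Fig. 5 (ROM c, RAM d, inp, outp, multiplier, acc, loop counter i with comparison to the control unit), extended with the register firl, loaded from an immediate operand, and two circ_add units that update the address registers da and ca using the length from firl.]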

Paper 7 - ADC Offset Identification and Correction in DMT Modems



Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’00, Geneva, Switzerland, May 2000.




Ericsson Components AB, 164 81 Kista, SWEDEN.

Tel: +46 13 28 1676 Fax: +46 13 13 9282

ABSTRACT

In this paper the possibility to identify and correct DC offset errors in time interleaved ADCs is investigated. It is shown how the offset introduced by the ADC can be identified and corrected by utilizing knowledge about the target application. The ADSL standard has been used as target application. It is shown that an offset error from a time interleaved ADC can be handled efficiently in a wideband communication system such as ADSL.

1. INTRODUCTION

With the increasing demands on bandwidth in communication systems, the demands on AD converters (ADCs) also increase. One way of increasing the sampling rate is to use time interleaved parallel ADCs [1]. A time interleaved ADC consists of N ADCs, where each ADC only samples every N-th value. For instance, with two time interleaved ADCs the first sample will be taken by ADC1, the second by ADC2, the third by ADC1 and so on. In this case the effective sample rate for each ADC is reduced to fs/N, while the total sample rate remains fs. Fig. 1 shows the principle of time interleaved AD conversion.

1.1. Mismatch between ADC channels

Due to process variations the ADCs in a time interleaved ADC have small differences in gain and DC offset. This mismatch causes the overall ADC performance to be worse than the performance of each individual ADC channel. In this paper we only consider the effects of the DC offset mismatch; the gain mismatch is assumed to be zero.

Assume that the sampled signal is a sequence of values {s(1), s(2), s(3), ...}. If the signal is passed through a time interleaved ADC with four channels, the result will be a sequence with a DC offset (o_i) contribution from each ADC channel. Together with quantization noise the resulting sequence becomes

$$s(1)+o_1,\; s(2)+o_2,\; s(3)+o_3,\; s(4)+o_4,\; s(5)+o_1,\; \ldots \tag{1}$$

The offset signal, o(n), depends only on which ADC channel is used and not on the input signal. Hence the offset signal is a periodic signal with period N, where N is the number of ADC channels. In the frequency domain the offset will cause tones located at m * fs/N, where m is an integer in the range [0, N-1].

The SNDR of a signal affected by an offset error can be expressed as the ratio of the energy of the input signal, s(n), to that of the offset signal, o(n) [2]:

$$\mathrm{SNDR} = \frac{E[s^2(n)]}{E[o^2(n)]} \tag{2}$$

Assuming that the offsets can be regarded as normally distributed random variables with zero mean and variance σ², and that the input is a sinusoid with amplitude A, the SNDR can be expressed as

$$\mathrm{SNDR}_{dB} = 10 \log \frac{A^2}{2\sigma^2} \tag{3}$$

Figure 1. Time interleaved parallel ADC. [Block diagram: the input s(t) is sampled by N parallel ADCs (ADC 1, ADC 2, ..., ADC N) at the instants N*T, (N+1)*T, ..., (2N-1)*T; the channel outputs s1(n), s2(n), ..., sN(n) are combined into the output sdist(n).]

Using a 16 channel time interleaved ADC, the offset error has been measured to be in the range of 30 codes with a variance around 50, which corresponds to almost three bits of performance degradation in a 12 bit ADC.
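The two properties used above, the offset power entering Eq. 3 and the location of the offset tones, can be illustrated numerically. The sketch below is an illustration under assumptions: it uses the eight channel offsets from the simulation section, takes the full-scale amplitude of a 12 bit converter as A = 2^11 codes, and uses an arbitrary block length of 64 samples with a plain DFT.

```python
import cmath
import math

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def sndr_db(amplitude, offset_power):
    """Eq. 3: sinusoid power A^2/2 relative to the offset power."""
    return 10 * math.log10(amplitude ** 2 / (2 * offset_power))

def dft_magnitudes(x):
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n)]

# Channel offsets in codes (the eight values used in the simulation section).
offsets = [-2, -11, 3, 5, 3, -8, 1, 14]
n_channels = len(offsets)
power = mean_square(offsets)

# Assumed full-scale sinusoid for a 12 bit converter: A = 2**11 codes.
sndr = sndr_db(2 ** 11, power)

# Offset-only output of the interleaved ADC: o(n) repeats with period N.
block = 64
o = [offsets[t % n_channels] for t in range(block)]
mags = dft_magnitudes(o)
tone_bins = [k for k, m in enumerate(mags) if m > 1e-9]
```

Since o(n) has period N = 8, the only non-zero DFT bins of the 64 sample block are the multiples of 64/8 = 8, i.e. the frequencies m * fs/N. The mean square of the offsets is the quantity that plays the role of σ² in Eq. 3 when the mean is taken as zero.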

2. IDENTIFICATION OF OFFSET

2.1. Communication system

A simple digital communication system can be viewed as in Fig. 2 [3]. A piece of information (a symbol) is passed through an encoder that creates a signal that is sent over a channel. At the receiver the signal is converted to the digital domain and the transmitted information is recreated in the decoder. The task of the decoder is mainly to find which information was most probably transmitted. Somewhat simplified, this can be described as comparing the received signal with all possible symbols. The symbol that minimizes the difference between possible symbols and the received signal is the most probable symbol. In order to further increase the performance, filters and error correction algorithms are used in the decoder.

A commonly used line coding is Quadrature Amplitude Modulation (QAM). A QAM coded signal consists of a sine and a cosine wave, where each one can have a number of different phases and amplitudes. Every code has one combination of amplitude and phase, making it possible to detect the transmitted information at the receiver.

In the complex plane the received data can be shown as in Fig. 3. When the SNR allows, the number of bits transmitted in one QAM constellation is increased, resulting in more points in the constellation diagram shown in Fig. 3.

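Figure 2. Simple communication system. [Block diagram: the data to transmit passes the encoder and the DAC to the analog front end and the line; the received signal passes the analog front end and the ADC to the decoder, which delivers the received data.]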

In the presence of an additive offset the QAM constellations will be moved in some direction, e.g. as in Fig. 4. A large offset error will displace the constellation too much, causing the decoder to make the wrong guess about which symbol was sent. Even with a small offset error the displacement of the constellation causes an increased error probability, making it necessary to identify and correct the offset.
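The effect of the displacement on the decision can be shown with a minimum-distance decoder. This is a minimal sketch with assumed values: a square 16-QAM constellation on the levels ±1, ±3, a noise-free channel, and two made-up offsets; the function names are hypothetical.

```python
def qam_points(levels=(-3, -1, 1, 3)):
    """Square QAM constellation (16-QAM with the default levels)."""
    return [complex(i, q) for i in levels for q in levels]

def decide(received, constellation):
    """Minimum-distance decision: the point closest to the received value."""
    return min(constellation, key=lambda p: abs(received - p))

constellation = qam_points()
sent = complex(1, 1)

small_offset = complex(0.4, -0.3)    # stays inside the decision region
large_offset = complex(1.2, 0.0)     # crosses the boundary at Re = 2

ok = decide(sent + small_offset, constellation)
wrong = decide(sent + large_offset, constellation)
```

With the small offset the decision is still the transmitted point, but the margin to the decision boundary has shrunk, so less noise is needed to cause an error. With the large offset the received value is closer to the neighbouring point 3+1j and the decision is wrong even without noise.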

3. CORRECTION OF OFFSET IN DMT MODEMS

Today there exist several standards for data communication based on discrete multitone modulation (DMT). The most well known of those standards is the ADSL standard [4,5], which may provide the user with downstream data rates up to about 8 Mbit/s over a twisted pair copper cable.

3.1. DMT based communication system

The difference between a single carrier QAM based modem and one based on discrete multitone modulation (DMT) is that many carriers, or tones, are used simultaneously. This is done in an efficient way using the inverse discrete Fourier transform (IDFT), or its fast variant the inverse fast Fourier transform (IFFT), in the transmitter, Fig. 5. Each carrier is optimized to carry as much information as possible, thus the carriers have varying constellation sizes. To optimize the performance an echo canceller (EC), a frequency domain equalizer (FEQ) and a time domain equalizer (TEQ) are used.
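The mapping between tones and time samples can be sketched with a plain DFT pair. The code below is an illustration only and makes several assumptions: eight tones instead of the number used in ADSL, made-up symbol values, an ideal channel, no cyclic prefix, and a direct O(n²) DFT in place of an FFT.

```python
import cmath

def dmt_modulate(symbols):
    """Map one QAM symbol per tone to a real block of 2*len(symbols) samples.

    Tone 0 (DC) and the Nyquist tone are left empty in this sketch, so
    symbols[0] must be zero. Hermitian symmetry makes the IDFT output real.
    """
    n_tones = len(symbols)
    size = 2 * n_tones
    spectrum = [0j] * size
    for k in range(1, n_tones):
        spectrum[k] = symbols[k]
        spectrum[size - k] = symbols[k].conjugate()
    block = [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / size)
                 for k in range(size)) / size
             for t in range(size)]
    return [v.real for v in block]      # imaginary parts are rounding noise

def dmt_demodulate(samples):
    """DFT of the received block; the first half holds the tone symbols."""
    size = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / size)
                for t in range(size))
            for k in range(size // 2)]

# Eight tones with different constellation sizes per tone (values are made up).
tx = [0j, 1 + 1j, -1 + 1j, 3 - 1j, -3 - 3j, 1 - 1j, 0j, -1 - 3j]
time_block = dmt_modulate(tx)
rx = dmt_demodulate(time_block)
max_err = max(abs(a - b) for a, b in zip(tx, rx))
```

Each tone carries its own QAM symbol, so a disturbance that is confined to a few DFT bins, such as the periodic ADC offset, only affects the symbols on those tones.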

Figure 3. QAM constellation. [Constellation diagram in the complex plane (axes Re and Im) with four points labelled 01, 00, 11 and 10.]

The symbol decoding is made by calculating the DFT of a batch of samples and then decoding the carriers individually. Each carrier is coded using QAM encoding as described in section 2.1. Normally the received signal consists of a mix of the signal that was sent from the modem at the other end of the line, an echo signal caused by the data sent from the receiving modem, and noise. The echo signal can be removed using echo cancellation, or separation filters (when frequency multiplexing is used). The received data is then reconstructed by subtracting an estimate of the echo, followed by a filter that compensates for the channel impulse response, Eq. 4.

$$S_{info}(e^{j\omega}) = H\left(S_{rec}(e^{j\omega}) - S_{echo}(e^{j\omega})\right) \tag{4}$$

It does not matter whether the offset signal is removed in the time domain or in the frequency domain. Hence, when an offset is present Eq. 4 becomes

$$S_{info}(e^{j\omega}) = H\left(S_{rec}(e^{j\omega}) - S_{echo}(e^{j\omega})\right) - H\left(O(e^{j\omega})\right) \tag{5}$$

Obviously, H(O(e^{jω})) can be identified instead of the individual contribution to the offset from each ADC, which makes the identification easier. The signal level of the received data may, due to damping in the transmission, be 20 dB below the echo signal. Hence, the wanted signal has a signal level similar to the offset, and will therefore be difficult to detect. It is however possible to detect and correct the offset error using knowledge about the application and adaptive techniques.

Figure 4. QAM constellation with additive offset. [Constellation diagram in the complex plane (axes Re and Im).]

Figure 5. Outline of a DMT modem. [Block diagram with the blocks encoder, IFFT, DAC, analog front end, line, ADC, TEQ, EC, FFT, FEQ and decoder; the transmit path runs from "data to transmit" to the line and the receive path from the line to "received data".]

Since the data from the ADC is used in pairs, the disturbed tones will be 2m/N * Ntones, where m is an integer in the range [0, N/2-1], N is the number of ADC channels, and Ntones is the number of tones used in an ADSL modem.

3.2. Correction of offset before connection

Before two DMT modems are connected, only noise is present at the input to the modem. If the input is assumed to be a normally distributed noise signal, e(n), with zero mean, the offset is found by simply averaging the received signal, Eq. 6.

$$\hat{o}(n) = E[s(n)] = E[e(n) + o(n)] = E[e(n)] + E[o(n)] = E[o(n)] \tag{6}$$

As described in Eq. 5, this operation can just as well be performed in the frequency domain, Eq. 7, where the outer E[·] denotes expectation and E(e^{jω}) is the spectrum of the noise e(n).

$$\hat{O}(e^{j\omega}) = E[E(e^{j\omega}) + O(e^{j\omega})] = E[E(e^{j\omega})] + E[O(e^{j\omega})] = E[O(e^{j\omega})] \tag{7}$$

One problem with this method is that an input signal with a frequency that is a multiple of fs/N will be cancelled. However, the only situation where there is a risk that a wanted signal is removed is when the remote modem tries to get the attention of the receiving modem. This problem is discussed in the following section.

3.3. Correction of offset during initialization

During the start-up phase of a modem connection a number of known initialization sequences are sent. These sequences are used for measuring the channel quality and for training the adaptive algorithms inherent in a DMT modem. Since it is known in this phase what data is actually sent, this can be used for training.

3.3.1. Activation

The first stage of the initialization is to activate the remote modem by sending an activation signal that consists of a single tone. The remote modem answers by replying with another tone. The tones used in this phase are tones 44, 48, 52 and 60 in the downstream direction and tones 8, 10 and 14 in the upstream direction. The tones last for 32 ms and might be mistaken for an offset error if the offset error generates a tone at the same frequency as one of the activation tones. This occurs when the transmitted tone has a frequency such that it is possible to find a positive integer m that fulfills Eq. 8.

$$2m = N \cdot \frac{F_{sig}}{N_{tones}} \tag{8}$$

Fsig is the tone carrying an activation signal, Ntones is the total number of tones, and N is the number of ADC channels. The smallest value of N that may give rise to an offset error in a tone used during this phase is N=32. Fortunately, 32 ADC channels are more than needed for the sampling rates used in ADSL (the bandwidth is 1.1 MHz).
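The condition in Eq. 8 is easy to check exhaustively. The sketch below assumes Ntones = 256, a value that is not stated in this section but is consistent with the disturbed tones {0, 64, 128, 192} reported for eight channels; the function names are hypothetical.

```python
def disturbed_tones(n_channels, n_tones):
    """Tone indices 2m/N * Ntones for m = 0 .. N/2 - 1 (integer results only)."""
    return [2 * m * n_tones // n_channels
            for m in range(n_channels // 2)
            if (2 * m * n_tones) % n_channels == 0]

def collides(tone, n_channels, n_tones):
    """True if a positive integer m fulfills 2m = N * Fsig / Ntones (Eq. 8)."""
    return (n_channels * tone) % (2 * n_tones) == 0

N_TONES = 256                                   # assumed ADSL tone count
ACTIVATION_TONES = [44, 48, 52, 60, 8, 10, 14]  # tones listed in the text

smallest = next(n for n in range(1, 1025)
                if any(collides(t, n, N_TONES) for t in ACTIVATION_TONES))
hit = [t for t in ACTIVATION_TONES if collides(t, smallest, N_TONES)]
```

Under this assumption the search stops at N = 32, where tone 48 coincides with an offset tone, in agreement with the statement above. For eight channels none of the activation tones is among the disturbed tones.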

3.3.2. Modem training

A DMT modem must be trained in order to adjust gain, echo canceller and equalization filters. Since the offset error is an additive signal, which from the training point of view will be regarded as noise, the adaptive training algorithms are still useful, but the adaptation time might become longer.

In several of the training sequences the same symbol is continuously repeated. This causes a correlation between the received repetitive symbol and the offset error. Hence, it is not possible to identify what is the actual signal and what is the offset error. One of the training sequences is used for estimation of the SNR on each tone. This sequence has a length of 16384 symbols and is not repetitive like the other ones. Since this sequence consists of a known pseudo random sequence it can also be used for offset estimation. The offset is found by taking the difference between the received data and the expected data.

3.4. Correction of offset during transmission

When the modems are connected the offset has hopefully been identified and can therefore be subtracted. However, there are always small changes in the offset caused by changes in temperature. In order to keep a good quality there is therefore a need to continuously measure the offset while the modems are connected.

Since only some of the channels are affected by an offset error, and the offset error is an additive signal, a change in the offset error will change the average error between the received data and the actual symbol. According to Eq. 7 it is possible to identify the offset if the signal is removed. Hence, if the information is first removed, for instance by the decoder, the remaining signal consists of noise and offset, and Eq. 7 can be used.
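A decision-directed version of the averaging in Eq. 7 can be sketched for a single disturbed tone as follows. All values are assumptions made for illustration: a 4-point constellation, a fixed offset, Gaussian noise with a small standard deviation and a fixed random seed; the minimum-distance `decide` function merely stands in for the modem's decoder.

```python
import random

def decide(received, constellation):
    """Minimum-distance decision, standing in for the decoder."""
    return min(constellation, key=lambda p: abs(received - p))

def estimate_offset(received, constellation):
    """Average of (received - decided symbol) over a block of symbols."""
    residuals = [r - decide(r, constellation) for r in received]
    return sum(residuals) / len(residuals)

rng = random.Random(1)                       # fixed seed for repeatability
constellation = [complex(i, q) for i in (-1, 1) for q in (-1, 1)]
true_offset = complex(0.20, -0.10)           # offset seen on one disturbed tone
sigma = 0.05                                 # noise standard deviation per axis

received = []
for _ in range(4000):
    symbol = rng.choice(constellation)
    noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
    received.append(symbol + true_offset + noise)

estimate = estimate_offset(received, constellation)
corrected_error = abs(estimate - true_offset)
```

The sketch only works as long as the decisions are correct, i.e. while the offset plus noise stays within the decision region. This matches the situation described above, where the bulk of the offset has already been removed and only slow drift remains to be tracked.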

If the offset identification is not fully adapted before the modem traffic is started, some of the tones will have a degraded performance, using a smaller constellation size than what is actually possible. However, the ADSL standard supports on-line change of the constellation size (i.e. the number of bits transmitted on each tone). Hence, when the offset identification is fully adapted and the SNR for the disturbed tone has increased, the modem can increase the bit rate on this tone in order to fully exploit the channel capacity.

4. SIMULATION RESULTS

In order to verify the ideas of how to identify and correct offset errors caused by an interleaved ADC architecture, the different cases have been simulated using ADSL as the application. A 12 bit time interleaved ADC consisting of eight channels with the DC offsets {-2, -11, 3, 5, 3, -8, 1, 14}, giving a variance of 54, has been used. The offset will in this case affect the tones {0, 64, 128, 192}.

A suitable method to update the offset estimate is to use a running average as in Eq. 9 below. The size of λ controls the adaptation rate and is chosen close to one.

$$\hat{O}_{i+1}(e^{j\omega}) = \lambda \hat{O}_i(e^{j\omega}) + (1-\lambda) E(e^{j\omega}) \tag{9}$$

Fig. 6 shows how the adaptation proceeds during the SNR measurement sequence. Since the input signal is known, only the noise will disturb the adaptation. λ = 0.999 has been used, and the simulation shows how much of the offset error remains at the disturbed tones. After 7000 symbols, which corresponds to about 1.6 seconds of real adaptation time, around 5% of the error remains, i.e. a 26 dB decrease in the offset error at the disturbed tones. A simulation with only noise as input gives the same result, since the known signal is removed before Eq. 9 is applied.

Figure 6. DC offset identification during SNR measurement sequence. [Plot "Offset adaptation": relative offset error (axis from -0.01 to 0.04) versus symbol number (axis from 0 to 18000).]

5. HARDWARE ARCHITECTURE

In Fig. 7 below it is outlined how the offset identification and correction can be put into an ADSL modem. The offset correction unit realizes Eq. 9 with either the error coming from the decoder or the noise coming from the FFT. Equation 9 contains two multiplications and one accumulation. Only the tones that may be disturbed by the offset need to be taken into account, hence one multiplier is enough since the disturbed tones are separated by 2N, where N is the number of ADC channels. The offset for each tone can be kept in a register file since they are quite few. The offset estimate stored in the register file is subtracted from the data coming from the FFT. The complexity of the compensation unit can be kept low since only a few tones are affected.
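The running average that the offset correction unit realizes can be illustrated in a few lines of Python. This is only a sketch under stated assumptions: the function names, the noise model and the example offset value are hypothetical and are not taken from the paper, and a dictionary stands in for the register file that holds one estimate per disturbed tone.

```python
import random

def update_offset(estimate, error, lam=0.999):
    """One running-average step: new = lam * old + (1 - lam) * error."""
    return lam * estimate + (1.0 - lam) * error

def adapt(true_offset, noise_std, n_symbols, lam=0.999, seed=0):
    """Adapt the estimate for one tone (hypothetical test harness).

    The 'error' fed to the update is what remains on the tone after the
    known or decoded symbol has been removed: offset plus noise.
    """
    rng = random.Random(seed)
    estimate = 0j
    for _ in range(n_symbols):
        noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        estimate = update_offset(estimate, true_offset + noise, lam)
    return estimate

# "Register file": one estimate per disturbed tone (tone indices from the text).
register_file = {tone: 0j for tone in (0, 64, 128, 192)}
register_file[64] = adapt(true_offset=3 + 2j, noise_std=0.0, n_symbols=7000)

# Correction: subtract the stored estimate from the FFT output of that tone.
fft_output = (1 - 1j) + (3 + 2j)          # symbol plus offset (illustrative values)
corrected = fft_output - register_file[64]
```

Without noise the deterministic part of the estimate converges as 1 - λ^n, so the residual seen in a real system after a few thousand symbols is set mainly by the noise rather than by the adaptation itself.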

6. ACKNOWLEDGEMENTS

I would like to thank Jan-Erik Eklund at Microelectronics Research Center, Ericsson Components AB, for the help with finding typical values of the DC offset error in a time interleaved ADC.

7. CONCLUSIONS

In this paper it has been shown how an offset error in a wideband data transmission system such as ADSL can be identified and corrected. By treating the ADC as a system component that can be optimized together with the rest of the system, and by utilizing what is known about the target application, we have shown how the offset error can be handled in all the important phases during modem initialization and data transmission in the ADSL modem.

Our methods should be possible to use in other communication systems as well, since it is common to have various types of training sequences that can be utilized for offset identification.

8. REFERENCES

[1] J. Yuan and C. Svensson, “A 10-bit 5 MS/s Successive Approximation Cell used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State Circuits, vol. 29, no. 8, pp. 866-872, Aug. 1994.

[2] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D. thesis, Diss. No. 552, Linköping University, Sweden, Dec. 1998.

[3] S. Haykin, Digital Communications, Wiley, 1988.

[4] ANSI T1.413-1998, “Network and Customer Installation Interfaces: Asymmetric Digital Subscriber Line (ADSL) Metallic Interface”, American National Standards Institute.

[5] T. Starr, J. M. Cioffi, and P. J. Silverman, Understanding Digital Subscriber Line Technology, Prentice-Hall, 1999.

[6] M. Karlsson Rudberg, “A/D omvandlare”, pending Swedish patent no. 9901888-9.

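Figure 7. Offset correction architecture in an ADSL modem. [Block diagram: the error from the decoder or the noise from the FFT is weighted by 1-λ and accumulated with the λ-weighted previous estimate held in a register file; the stored estimate is subtracted from the FFT output for the disturbed tones only, and the result is passed to the decoder.]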

Paper 8 - Calibration of Mismatch Errors in Time Interleaved ADCs



Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Malta, Sept. 2001.




Microelectronics Research Center, Ericsson Microelectronics AB, SE-581 17 Linköping, Sweden

Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden

Tel: +46 13 32 2523 Fax: +46 13 13 9282 E-mail: [email protected]

ABSTRACT

An efficient way of increasing the sample rate in an A/D converter (ADC) is to use a time-interleaved structure. The effective sample rate can be increased without an increase in sample rate for the individual ADCs. There are however problems with this architecture caused by differences in gain in the ADCs as well as timing mismatch in the sample-and-hold circuits. These mismatch errors will degrade the performance of the time interleaved ADC. In this paper we propose algorithms for both identification of the mismatch errors on-line and cancelling of the distortion. The proposed algorithms are suitable for applications that use the Discrete Multi-Tone modulation (DMT) or the Orthogonal Frequency Division Multiplex (OFDM) technique.


1. INTRODUCTION

Fast and accurate analog-to-digital converters (ADCs) are key components in present and future communication systems. An increasing demand for bandwidth and an increased use of digital signal processing both put higher demands on the ADCs.

One way of increasing the sample rate that has been proposed is to use several ADCs in a time interleaved way [1]. A time interleaved ADC (TIADC) consists of $M$ ADCs, where each ADC only samples every $M$:th value. The effective sample rate for each ADC is reduced from $f_s$ to $f_s/M$, while the total sample rate remains unchanged. In Fig. 1 the principle of time interleaved A/D conversion is shown.

Figure 1. Time interleaved ADC. [Block diagram: the input x(t) is fed to ADC 0, ADC 1, ..., ADC M-1, sampled at MT, (M+1)*T, ..., (2M-1)*T; the outputs x0(n), x1(n), ..., xM-1(n) are combined into xTIADC(n).]

1.1. Error sources in a TIADC

Even if there are advantages in using a TIADC when it comes to conversion speed, there are also problems if the accuracy must be kept high at the same time. All differences between the ADCs that form the TIADC will turn up as distortion in the signal spectrum. The two mismatches of interest in this paper are gain differences between the ADCs in the TIADC, and unequal delay of the sample clock to each sample circuit. Another important problem, not within the scope of this paper, is mismatch in offset between the ADCs [2].

1.2. Gain Mismatch

A gain mismatch can occur from differences in the reference voltages in the ADCs or from gain differences in the sample-and-hold circuit, which in the simplest case can be modeled as an RC link and a switch, Fig. 2.

The gain from each of the ADCs will in this paper be denoted as $G_m(\omega)$, where $m$ is the current ADC, which may vary between 0 and $M-1$.

1.3. Timing Mismatch

Keeping the delay from the sample clock to each sample-and-hold circuit equal is difficult due to variations in path length and process parameters. This mismatch in timing will cause the sampling to be non-uniform with a period of $M$ samples, Fig. 3.

The timing error is in this paper modeled relative to the average sampling period $T$ so that the sample time for ADC $m$ is described by

$$t_m = mT - r_m T. \tag{1}$$

Considering the two effects, gain mismatch and non-uniform sampling, the distortion with a band limited input signal can be modeled as [3]

$$X_{tiadc}(e^{j\omega T}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} A_k(e^{j\omega T}) \cdot X\left(\omega - k \cdot \frac{2\pi}{MT}\right) \tag{2}$$

Figure 2. Sample-and-hold circuit. [Circuit: a switch clocked at nT followed by an RC link, with input xin(t) and output xs(nT).]

Figure 3. Non-uniform sampling. [Plot: amplitude of x(t) versus time t, with sampling instants labeled T(1+r0), 2T(1+r1), 3T(1+r2), 4T(1+r3) shown against the nominal instants T, 2T, 3T, 4T.]

where $A_k(e^{j\omega T})$ is

$$A_k(e^{j\omega T}) = \frac{1}{M} \sum_{m=0}^{M-1} G_m\left(\omega - k \cdot \frac{2\pi}{MT}\right) \cdot e^{-j(\omega - k \cdot 2\pi/MT) r_m T} \cdot e^{-jkm \cdot 2\pi/M} \tag{3}$$

In the summation in Eq. 2 only $M$ terms are non-zero if the input is band limited to $f_s/2$, which will be assumed in this paper.

To keep the distortion low, both the gain and the timing errors must be kept at a low level. In [3,4] approximations of the effects on SNDR of gain and timing mismatch have been derived. Assuming a nominal gain of $g$ with a standard deviation of $\sigma_g$, the SNDR in a TIADC with gain mismatch can be approximated by

$$SNDR \approx 20 \log\frac{g}{\sigma_g} - 10 \log\left(1 - \frac{1}{M}\right). \tag{4}$$

A timing mismatch with a standard deviation $\sigma_r$ will give the SNDR

$$SNDR \approx 20 \log\frac{1}{2\pi f_{in} \sigma_r} - 10 \log\left(1 - \frac{1}{M}\right). \tag{5}$$

Assuming that $M = 4$, that 12 bits accuracy is needed, and that the input frequency is 11 MHz (the VDSL standard, which uses the DMT technique, has 11 MHz bandwidth), $\sigma_r$ must be kept smaller than 9 ps, which may be difficult to achieve in a CMOS process. The gain error, $\sigma_g$, must be kept smaller than 0.03% for the same resolution.

1.4. Methods to cancel gain and timing mismatch

If $A_k(e^{j\omega T})$ can be identified, it has been shown in [3] that the $r_m$:s can be calculated. In [5] a method based on polynomial interpolation to correct the spectrum is proposed, and in [6] a method based on a variant of the discrete Fourier transform is used. In both papers a test signal is used for identifying the errors, and none of the methods considers a frequency dependent gain mismatch.
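To make the role of the coefficients $A_k$ in Eq. 3 concrete, they can be evaluated numerically for a given set of mismatches. The sketch below is an illustration only: the function name is hypothetical, the gains are assumed to be frequency independent (the paper allows $G_m$ to depend on frequency), and only $k = 0, \ldots, M-1$ is evaluated.

```python
import cmath
import math

def leakage_coefficients(gains, rel_timing, omega, T):
    """Evaluate A_k of Eq. 3 for k = 0..M-1.

    gains      -- per-ADC gain G_m (assumed frequency independent here)
    rel_timing -- per-ADC relative timing error r_m
    omega      -- angular frequency at which A_k is evaluated
    T          -- average sampling period
    """
    M = len(gains)
    coeffs = []
    for k in range(M):
        shifted = omega - k * 2.0 * math.pi / (M * T)
        acc = 0j
        for m in range(M):
            acc += (gains[m]
                    * cmath.exp(-1j * shifted * rel_timing[m] * T)
                    * cmath.exp(-1j * k * m * 2.0 * math.pi / M))
        coeffs.append(acc / M)
    return coeffs

# Matched ADCs: only A_0 is non-zero, so no signal leaks between frequencies.
ideal = leakage_coefficients([1.0] * 4, [0.0] * 4, omega=0.3 * math.pi, T=1.0)

# Small gain and timing mismatches (illustrative values) give non-zero A_k, k != 0.
mismatched = leakage_coefficients([1.0, 1.02, 0.98, 1.01],
                                  [0.0, 0.01, -0.02, 0.005],
                                  omega=0.3 * math.pi, T=1.0)
```

With all gains equal and no timing error the inner sum reduces to a sum over the $M$:th roots of unity, which is why every $A_k$ with $k \neq 0$ vanishes in the matched case.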



Using a test signal is usually not desired for in-circuit calibration, since a high accuracy test signal must be generated and it is also necessary to interrupt the communication while performing the calibration.

In this paper we will focus on the applications that use the Discrete Multi-Tone modulation technique (DMT) or the Orthogonal Frequency Division Multiplex (OFDM) technique. The DMT technique is used for digital subscriber lines, e.g. ADSL [8]. The OFDM technique is similar to DMT, with the main difference that OFDM is proposed for radio transmission.

2. THE DMT MODEM

In a DMT modem the signal is modulated using the Inverse Fast Fourier Transform (IFFT) to form a signal with $N$ carriers. In the receiver the Fast Fourier Transform (FFT) is used, which separates the information on the different carrier frequencies. An outline of a DMT modem is shown in Fig. 4.

The blocks of importance for this paper are found in the receive path. The EC block is an echo canceller and the TEQ block is the time domain equalizer. Both these blocks together with the frequency domain equalizer (FEQ) will here be referred to as a filter with the frequency response $H_{eq}(e^{j\omega T})$.

The output from the decoder is an estimate of which symbol was received. The equalized input to the decoder is denoted $X_{eq}(e^{j\omega T})$, and the estimated symbol is called $S(e^{j\omega_n T})$, where $n$ is the index to one of the carriers, which may vary between 0 and $N-1$.

The filtered signal received by the decoder will be

$$X_{eq}(e^{j\omega T}) = H_{eq}(e^{j\omega T}) \cdot X_{tiadc}(e^{j\omega T}). \tag{6}$$

Figure 4. DMT modem outline. [Block diagram of two modems connected by a line, each with a transmit path and a receive path. Blocks shown: encoder, IFFT, DAC, analog front end, ADC, EC, TEQ, FFT, FEQ and decoder.]

The information on each carrier is coded using M-ary QAM. That is, the bits are mapped in a two-dimensional plane where the positions represent the transmitted bits, Fig. 5.

In order to take full benefit from in-circuit calibration there must be a possibility to utilize an increased SNR to increase the amount of transmitted data. The DMT technique as realized in the ADSL standard supports this feature.

3. IDENTIFICATION OF ERRORS

In the ideal case when no distortion is present, the information received on each carrier is independent of the other carriers. When distortion is present, there is some interference between the frequencies. Each carrier is interfered by at most $M-1$ other carriers, which are described by $A_k(e^{j\omega T})$ when $k \neq 0$ in Eq. 2.

An error in the transmission on carrier $m$ may occur when the noise plus the total distortion becomes larger than $D/2$. That is

$$N(e^{j\omega_m T}) + \sum_{k \neq 0} S(e^{j\omega_l T}) \cdot A_k(e^{j\omega_m T}) \cdot X_{eq}(e^{j\omega_m T}) > \frac{D}{2} \tag{7}$$

where $D$ is the minimum distance between two points in the constellation diagram in Fig. 5, and $N(e^{j\omega_m T})$ is the noise contribution.

Figure 5. QAM mapped information. [Constellation diagram in the Re/Im plane with the points (00), (10), (01) and (11), a symbol $S(e^{j\omega T})$ and the minimum distance $D$.]

The way the signal on one carrier leaks into another carrier is similar to what happens when a transmitted signal is echoed into the received signal in for instance an ADSL system, and it is therefore possible to use a similar method for removing the distortion as is used for echo cancelling. The most well known method for adapting an echo canceller is the Least Mean Square (LMS) method, which uses the gradient of the error in the received signal to update the coefficients ($C_i$) in an adaptive filter according to [7]

$$C_{i,k+1} = C_{i,k} + \mu \cdot e_k \cdot x_{k-i} \tag{8}$$

where $e_k$ is the error between the wanted signal and the one that actually was received, $x_{k-i}$. $\mu$ is a parameter that controls the adaptation rate.

3.1. Error identification

The distortion that leaks from carrier $\omega_l = \omega_m - k \cdot 2\pi/MT$ into carrier $\omega_m$, for some $k$, is estimated as

$$\hat{U}_k(e^{j\omega_m T}) = S(e^{j\omega_l T}) \cdot \hat{C}_k(e^{j\omega_m T}) \tag{9}$$

where $\hat{C}_k(e^{j\omega_m T})$ is the estimation of

$$C_k(e^{j\omega_m T}) = H_{eq}(e^{j\omega_m T}) \cdot A_k(e^{j\omega_m T}). \tag{10}$$

The distortion estimation is subtracted from the received signal

$$X_{eq2}(e^{j\omega_m T}) = X_{eq}(e^{j\omega_m T}) - \hat{U}_k(e^{j\omega_m T}). \tag{11}$$

As estimation of the remaining distortion from carrier $l$ into carrier $m$ we use

$$\hat{U}_{k,rem}(e^{j\omega_m T}) = X_{eq2}(e^{j\omega_m T}) - S(e^{j\omega_m T}). \tag{12}$$

The estimation of the remaining distortion is used for updating the estimated leakage coefficients, $\hat{C}_k(e^{j\omega_m T})$.

$$\hat{C}_k(e^{j\omega_m T}) = \hat{C}_k(e^{j\omega_m T}) + \mu \cdot \hat{U}_{k,rem}(e^{j\omega_m T}) \cdot S^{*}(e^{j\omega_l T}) \tag{13}$$

3.2. Signal reconstruction

The proposed algorithm cannot in the presented form be used to correct the distorted signal, since the decoded symbol is used in the equations for estimating the distortion terms that are subtracted from the received signal (Eq. 9-12). The signal reconstruction should be done before the decoding stage, since it is the probability of the decoder making the right decision that we want to improve. Therefore Eq. 9 is modified to

$$\hat{U}_k(e^{j\omega_m T}) = X_{eq}(e^{j\omega_m T}) \cdot \hat{C}_k(e^{j\omega_m T}). \tag{14}$$

$X_{eq}(e^{j\omega_m T})$ contains some distortion, but is the best possible estimation of $S(e^{j\omega_m T})$ available without performing the symbol decoding.

The mismatch cancelled signal sent to the decoder will be

$$X_{eq2}(e^{j\omega_m T}) = X_{eq}(e^{j\omega_m T}) - \sum_{k \neq 0} \hat{U}_k(e^{j\omega_m T}). \tag{15}$$

The sum is the contribution from all distortion terms that leak into the current carrier. The proposed method does not require more than that the input signal is band limited to $f_s/2$. Alternative methods for timing error correction in a TIADC usually work less well when the signal bandwidth gets close to $f_s/2$ [5,9]. The method presented in [6] performs perfect reconstruction of the signal spectrum, but the use of a special DFT makes the algorithm computationally heavy.

3.3. Implementation aspects

The complexity of the proposed algorithm depends on the number of signal carriers that are used, $N_{carr}$, and the number of ADCs in the TIADC, $M$.

The number of complex multiplications in the coefficient update loop is

$$N_{cmult} = 2 N_{carr} (M-1) \tag{16}$$

and the total number of additions and subtractions is

$$N_{add/sub} = 3 N_{carr} (M-1). \tag{17}$$

The corresponding values for the distortion correction are

$$N_{cmult} = N_{carr} (M-1) \tag{18}$$

and

$$N_{add/sub} = N_{carr} (M-1). \tag{19}$$
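As a quick worked example of Eqs. 16-19, the operation counts can be tabulated for a concrete configuration. The helper below is a hypothetical illustration; the parameter values (256 carriers, four ADCs) are the ones used in the simulation section of the paper.

```python
def operation_counts(n_carr, m):
    """Operation counts per DMT symbol according to Eqs. 16-19."""
    interferers = m - 1                      # each carrier sees at most M-1 others
    return {
        "update_cmult": 2 * n_carr * interferers,      # Eq. 16
        "update_addsub": 3 * n_carr * interferers,     # Eq. 17
        "correct_cmult": n_carr * interferers,         # Eq. 18
        "correct_addsub": n_carr * interferers,        # Eq. 19
    }

counts = operation_counts(n_carr=256, m=4)
# counts["update_cmult"] is 1536 and counts["correct_cmult"] is 768
```

The coefficient update loop is thus about twice as expensive in multiplications as the correction itself, and both grow linearly with the number of carriers and with the number of interleaved ADCs.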



4. SIMULATIONS

A TIADC with four ADCs has been simulated with a 256-carrier DMT signal as input. No quantization effects have been considered. The timing mismatch has been randomly selected with a standard deviation of 8%. The gain mismatch has a standard deviation of 2%. In Fig. 6 the adaptation process is shown, considering the distortion that leaks into carrier 96. The simulation is made using $10^5$ symbols.
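A heavily reduced version of such an adaptation can be sketched for a single pair of carriers, following the identification loop of Eqs. 9-13. This is not the simulation used in the paper: the function name, the QPSK alphabet, the step size and the leakage value are assumptions chosen for illustration, the noise is omitted, and the decoder is assumed to always make the right decision.

```python
import random

def identify_leakage(true_c, n_symbols=200, mu=0.05, seed=0):
    """Estimate one leakage coefficient C_k with the update of Eq. 13."""
    rng = random.Random(seed)
    qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    c_hat = 0j
    for _ in range(n_symbols):
        s_l = rng.choice(qpsk)            # decoded symbol on interfering carrier l
        s_m = rng.choice(qpsk)            # decoded symbol on carrier m
        x_eq = s_m + true_c * s_l         # equalized input on carrier m (no noise)
        u_hat = s_l * c_hat               # Eq. 9: estimated distortion
        x_eq2 = x_eq - u_hat              # Eq. 11: distortion removed
        u_rem = x_eq2 - s_m               # Eq. 12: remaining distortion
        c_hat = c_hat + mu * u_rem * s_l.conjugate()   # Eq. 13: LMS update
    return c_hat

estimate = identify_leakage(true_c=0.03 - 0.02j)
```

In this noise-free setting the estimation error shrinks by the factor $1 - \mu |S|^2$ per symbol, so the step size has to be chosen with the symbol energy in mind; with noise present a smaller $\mu$ trades adaptation speed for a lower residual error.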

In Fig. 7 it is shown what the received QAM-encoded constellation points look like before and after cancellation of the noise on carrier 96. The improvement in SNDR is about 13 dB.

5. CONCLUSIONS

In this paper we have proposed a method to both identify and correct mismatch errors caused by gain and timing errors between ADCs in a TIADC. The method can be applied to the OFDM and the DMT transmission techniques, which are used in for instance ADSL and VDSL. The method works all the way up to the Nyquist frequency, and can handle gain mismatch that is frequency dependent as long as the gain can be considered linear.

Figure 6. Adaptation of $\hat{C}_k(\omega)$. [Plot "Adaptation of distortion coefficients": curves for carrier 34, carrier 162 and carrier 224, horizontal axis from 0 to 5 x 10^4.]


Figure 7. Received constellation before and after cancellation of gain and timing mismatch. [Plot "Corrected constellation diagram": constellation points for the TIADC with mismatch and for the TIADC with cancelled mismatch, both axes from -1 to 1.]

6. REFERENCES

[1] J. Yuan and C. Svensson, “A 10-bit 5 MS/s Successive Approximation Cell used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State Circuits, vol. 29, no. 8, pp. 866-872, Aug. 1994.

[2] M. K. Rudberg, “ADC Offset Identification and Correction in DMT Modems”, Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'00, Geneva, May 2000.

[3] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Fundamentals and High-Speed Waveform Digitizers”, IEEE Trans. Instrum. Meas., vol. 37, pp. 245-251, June 1988.

[4] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D. thesis, Diss. No. 552, Linköping University, Sweden, Dec. 1998.

[5] H. Jin and E. Lee, “A Digital-Background Calibration Technique for Minimizing Timing-Error Effects in Time-Interleaved ADC's”, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 7, July 2000.

[6] Y.-C. Jenq, “Perfect Reconstruction of Digital Spectrum from Nonuniformly Sampled Signals”, IEEE Trans. Instrum. Meas., vol. 37, pp. 245-251, June 1988.

[7] T. Starr, J. M. Cioffi, and P. J. Silverman, Understanding Digital Subscriber Line Technology, Prentice-Hall, 1999.

[8] ANSI T1.413-1998, “Network and Customer Installation Interfaces: Asymmetric Digital Subscriber Line (ADSL) Metallic Interface”, American National Standards Institute, 1998.

[9] J. Elbornsson and J.-E. Eklund, “Blind estimation of timing errors in interleaved AD converters”, submitted to International Conference on Acoustics, Speech, and Signal Processing 2001.

Paper 9 - Glitch Minimization and Dynamic Element Matching in D/A Converters


Mikael Karlsson Rudberg, Mark Vesterbacka, Niklas Andersson, and J. Jacob Wikner

Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Lebanon, Dec. 2000.



Mikael Karlsson Rudberg1,2, Mark Vesterbacka2, Niklas Andersson1,2, and J. Jacob Wikner1,2

1) Microelectronics Research Center, Ericsson Microelectronics AB, SE-581 17 Linköping, Sweden

2) Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden

{mikaelr, markv, niklasa, jacobw}@isy.liu.se, Phone: +46 708 488 418

ABSTRACT

In this paper we present a novel method for combining thermometer coding and dynamic element matching (DEM) in a digital-to-analog converter (DAC). The proposed method combines DEM with a minimization of glitch power. The glitch power may in a DEM solution give a significant contribution to the total noise power. The switch based solution provides a structural solution where it is possible to implement parts of the method, which reduces the area required for implementation.


1. INTRODUCTION

The requirements in terms of accuracy in digital-to-analog converters (DACs) are increasing with the introduction of wide-band access services such as ADSL. In order to increase the accuracy we want to reduce the influence of both static and dynamic errors. Considering the static case, a DAC will in general perform the following operation

$$A_{out}(nT) = \sum_{m=1}^{M} b_m(nT) \cdot w_m \tag{1}$$

where $A_{out}(nT)$ is the settled output amplitude at the time instants $nT$, $M$ is the number of bits in the input word containing the bits $b_m(nT)$, and $w_m$ are the internal DAC weights. $b_M$ is referred to as the most significant bit (MSB) and $b_1$ is the least significant bit (LSB). For a binary offset input word, we have that $M = N$ and $w_m = 2^{m-1}$. For a thermometer code input, we have that $M = 2^N - 1$ and $w_m = 1$.

In a current-steering DAC the internal DAC weights can be implemented by using a number of weighted current sources. The switches that determine which current sources should be connected to the output are controlled by the input bits, $b_m$. A binary weighted and a thermometer coded current-steering DAC are illustrated in Fig. 1 a) and b) respectively.

1.1. Reducing glitches

One problem with a binary weighted DAC is that the glitch power between two adjacent input samples may be high. This is caused by the fact that the transition between two samples may contain an intermediate output value that differs from the final value. If the output is to be changed from the binary word {011} to the word {100}, the output may become {111} for a short time before settling to the final output value {100}. The intermediate value adds unwanted glitch power to the output.

The problem with intermediate values is avoided by using thermometer coded data. This is because a thermometer coded signal only contains transitions from zero to one or vice versa, and never both types of transitions in the same sample.
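The difference between the two codes for the transition from {011} to {100} can be checked with a short Python sketch. The helper names are hypothetical, and the bit ordering of the thermometer code (ones filled from the right) is an assumption made only for illustration.

```python
def binary_code(value, n_bits):
    """Binary offset word, MSB first."""
    return [(value >> i) & 1 for i in reversed(range(n_bits))]

def thermometer_code(value, n_bits):
    """Ordered thermometer word with 2**n_bits - 1 unit bits, ones to the right."""
    n_sources = 2 ** n_bits - 1
    return [1 if i >= n_sources - value else 0 for i in range(n_sources)]

def toggles(prev, new):
    """Count bits that rise (0 -> 1) and fall (1 -> 0) between two samples."""
    rising = sum(1 for p, q in zip(prev, new) if (p, q) == (0, 1))
    falling = sum(1 for p, q in zip(prev, new) if (p, q) == (1, 0))
    return rising, falling

# Transition 3 -> 4, i.e. {011} -> {100}
binary_toggles = toggles(binary_code(3, 3), binary_code(4, 3))
thermo_toggles = toggles(thermometer_code(3, 3), thermometer_code(4, 3))
```

In the binary word one bit rises while two bits fall, so unequal switching times can produce an intermediate value such as {111}. In the thermometer word a single bit rises and none falls, so every intermediate state lies between the old and the new value.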



1.2. Reducing influence from matching errors

A current-steering DAC will suffer from matching errors caused by the non-ideal manufacturing process of the circuits. The matching error can be represented as a deviation in the weights in the DAC. Considering the matching error, Eq. 1 becomes

$$A_{out}(nT) = \sum_{m=1}^{M} b_m(nT) \cdot (w_m + \varepsilon_m) \tag{2}$$

where $\varepsilon_m$ is the matching error. Since the error caused by the matching error depends on the input signal, it appears as a static signal distortion. It is possible to reduce the influence of the matching error if, for every input word, the weights can be combined in different ways to form the same output value. Each combination will contain an error term of the size

$$\sum_{m=1}^{M} b_m(nT) \cdot \varepsilon_m \tag{3}$$

By choosing at random, from time to time, which combination of $b_m$ is used, the size of the error term becomes uncorrelated with the output value. If for instance thermometer coded input data is used ($w_m = 1$) and the combination of $w_m$ used for a given code $C_i$ is chosen in a random way, the average representation of code $C_i$ will approach the mean value, $C_i \cdot w_m$, and the error term will be uncorrelated with the signal.

Figure 1. Example of a) a binary weighted and b) a thermometer coded current-steering DAC. [Schematic: (a) current sources $2^{N-1} I_0$, ..., $2^1 I_0$, $2^0 I_0$ switched by $b_N$, ..., $b_2$, $b_1$; (b) $2^N - 1$ unit current sources $I_0$ switched by $b_{2^N-1}$, ..., $b_2$, $b_1$; in both cases the switched currents are summed into $I_{out}$.]
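The averaging effect of a random selection of unit sources can be illustrated numerically. The sketch below is not the scrambler proposed in the paper; it simply picks the active sources uniformly at random for every sample, which ignores glitches. The function names, the number of sources (63, as in a 6-bit thermometer coded DAC) and the mismatch level are assumptions for illustration.

```python
import random

def dac_output(selected, weights):
    """Sum the weights of the unit sources that are switched on."""
    return sum(weights[i] for i in selected)

def average_randomized_output(code, weights, n_trials, rng):
    """Average output for one code when the active sources are picked at random."""
    total = 0.0
    for _ in range(n_trials):
        chosen = rng.sample(range(len(weights)), code)
        total += dac_output(chosen, weights)
    return total / n_trials

rng = random.Random(1)
# Unit sources with a random matching error (hypothetical sigma of 0.02).
weights = [1.0 + rng.gauss(0.0, 0.02) for _ in range(63)]
mean_weight = sum(weights) / len(weights)

code = 32
fixed = dac_output(range(code), weights)               # always the same sources
averaged = average_randomized_output(code, weights, 2000, rng)
```

With a fixed selection the error of a given code is the same every time it occurs, which shows up as distortion. With a random selection the output for the code fluctuates around the code times the mean weight, so the error behaves like noise instead.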


In Fig. 2 a block diagram realizing the described randomization is given. The random selection of current sources is done in the scrambler, which can be realized as a net of switches whose setting is determined by the value of a pseudo random signal, p, Fig. 3. Each switch either passes data straight through or exchanges the two signal lines, depending on the setting of p. Using a net of switches arranged in a matrix makes it easy to vary the degree of randomization by modifying the number of columns that are used. There are many possible ways to connect the different columns in the switch matrix; the one used in Fig. 3 uses a radix-2 butterfly interconnect style. This way of randomizing data to improve integral linearity is called dynamic element matching (DEM) [1].

Different aspects of DEM are also discussed in for instance [2,3]. An alternative solution to the glitch minimization problem is shown in [4].

Figure 2. DAC architecture with randomization. [Block diagram: the N-bit input x(n) enters a digital encoder consisting of a thermometer encoder (outputs x1(n), x2(n), ..., xM(n)) and a scrambler (outputs y1(n), y2(n), ..., yM(n)); each scrambler output drives a 1-bit DAC and the DAC outputs are summed to y(n).]

Figure 3. Example of scrambler with seven thermometer encoded bits as input. [Switch matrix with the inputs 0 and t0, ..., t6, where every switch is controlled by a pseudo random signal p.]

Combining randomization and glitch minimization requires that a) bits in the randomized sample toggle from zero to one or vice versa, but not both in the same sample, and b) if there are more ones in the current sample than in the previous one, the new ones shall have a random position. The same also applies for zeros if there are more zeros in the current sample than in the previous one. Hence, the previous state of the randomization must be remembered in order to find out how to randomize the new sample. An example of how to randomize thermometer coded data with glitch minimization is shown in Tab. 1.

Table 1. Example of glitch minimized randomization.

previous randomized sample    input sample (thermometer encoded)    new randomized sample
00000                         00001                                 00001, 00100, ...
00100                         00111                                 11100, 10101, ...
01110                         00011                                 00110, 01100, ...

1.3. Scrambler

To realize a scrambler using a net of switches requires a switch that can remember the previous state in order not to randomize positions that should be preserved. In Tab. 2 a truth table for a switch that can be used in a glitch minimizing scrambler is shown. $a_{i+1}$ and $b_{i+1}$ are the inputs to the switch, $a_i$ and $b_i$ are the inputs from the previous sample. Bits are to be randomized only if a new zero or one occurs at the input. A logic realization of the truth table requires three flip-flops, since both inputs from the previous sample as well as the previous setting of the switch must be saved.

Since flip-flops are expensive logic elements, area can be saved if the number of flip-flops can be reduced. Since thermometer coded data is used, the situation where <ai,bi>=<0,1> (or <1,0>) becomes <ai+1,bi+1>=<1,0> (<0,1>) in the next sample never occurs. Therefore it is possible to set don't care in Tab. 2 at the positions marked (*). Another thing to notice is that since the transition directly from <1,0> (<0,1>) to <0,1> (<1,0>) never occurs at the input of a switch, a value of <1,1> or <0,0> will be present at the input for at least one sample between the two cases <1,0> and <0,1>. It is therefore enough to randomly set the switch when the input data is <1,1> or <0,0> to keep the same degree of randomization.
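The behaviour of the simplified switch can be modelled in Python: the setting is re-randomized only when both inputs are equal and is kept otherwise. The class and function names below are hypothetical, the model covers a single switch rather than a full switch matrix, and the `keep_state` flag is added only to compare against a switch that is re-randomized on every sample.

```python
import random

class GlitchMinimizingSwitch:
    """Two-input switch whose setting is held in a single state bit."""

    def __init__(self, keep_state=True):
        self.cross = False            # False: pass straight through, True: swap
        self.keep_state = keep_state

    def step(self, a, b, p):
        # Swapping two equal bits cannot change the outputs, so the setting
        # may be re-randomized freely when a == b. When the inputs differ,
        # the previous setting is kept (unless keep_state is disabled).
        if a == b or not self.keep_state:
            self.cross = bool(p)
        return (b, a) if self.cross else (a, b)

def has_glitch(prev_out, new_out):
    """True if one output rises while another falls in the same sample."""
    rising = any(p == 0 and q == 1 for p, q in zip(prev_out, new_out))
    falling = any(p == 1 and q == 0 for p, q in zip(prev_out, new_out))
    return rising and falling

def count_glitches(n_samples, keep_state=True, seed=0):
    rng = random.Random(seed)
    ordered = {0: (0, 0), 1: (0, 1), 2: (1, 1)}   # ordered thermometer pairs
    switch = GlitchMinimizingSwitch(keep_state)
    prev = switch.step(0, 0, rng.getrandbits(1))
    glitches = 0
    for _ in range(n_samples):
        a, b = ordered[rng.randrange(3)]
        out = switch.step(a, b, rng.getrandbits(1))
        glitches += has_glitch(prev, out)
        prev = out
    return glitches
```

With ordered thermometer data at the input, the state-keeping switch never lets one output rise while the other falls, whereas a switch that is re-randomized on every sample does so whenever the input stays at <0,1> and the setting happens to change.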


Table 2. Truth table for glitch minimization.

ai+1    bi+1    ai    bi    switch setting
0       0       X     X     don't care
0       1       0     0     random
0       1       0     1     keep previous
0       1       1     0     inverse of previous (*)
0       1       1     1     random
1       0       0     0     random
1       0       0     1     inverse of previous (*)
1       0       1     0     keep previous
1       0       1     1     random
1       1       X     X     don't care

A simplified truth table is shown in Tab. 3. Notice that the setting of the switch no longer depends on the input value from the previous sample period (<ai,bi>). Hence, only one flip-flop that saves the state of the switch is needed. A possible realization of the switch is shown in Fig. 4.

Table 3. Simplified truth table for glitch minimization.

ai+1    bi+1    ai    bi    switch setting
0       0       X     X     random
0       1       X     X     keep previous
1       0       X     X     keep previous
1       1       X     X     random

Figure 4. Switch logic. [Schematic with inputs a and b, outputs x and y, an XOR gate (=1), a D flip-flop and the pseudo random signal p.]

1.4. Scrambler with unordered thermometer code

Instead of using the thermometer code described earlier, it is possible to use an unordered thermometer code where the bits are just copied according to their weight in the binary offset code (e.g. the binary offset code {101} becomes the unordered thermometer code {1111001}). The advantage of this code is that the conversion from binary offset code to unordered thermometer code is trivial. The disadvantage is that it is not possible to make the simplifications in the truth table of the switch described in the previous section. There may be a transition at the input of a switch from <0,1> (<1,0>) to <1,0> (<0,1>) between two samples.

The proposed solution is to convert <1,0> to <0,1> with some extra logic in front of the main switch. In Tab. 4 and Fig. 5, $a'_{i+1}$ and $b'_{i+1}$ are the inputs before conversion, while $a_{i+1}$ and $b_{i+1}$ are the inputs after the conversion from <1,0> to <0,1>. All other codes are passed unchanged from <a'i+1,b'i+1> to <ai+1,bi+1>. To guarantee a minimal number of glitches when using unordered thermometer encoding, one must make sure that each bit at the input of the scrambler has a path that is crossed by the paths of all other bits. It is interesting to note that for each extra bit, $i$, in the binary offset coded input data, a group of $2^i$ bits is added to the unordered thermometer code. All these added bits always have the same value, and no switches are needed when only bits within this group are scrambled (i.e. the shaded switches in Fig. 6 are unnecessary). Using a radix-2 butterfly architecture for the scrambler requires at least $k$ switch layers to guarantee that all paths in the group $2^k$ cross at least one path in each of the groups $\{2^j, j < k\}$. If this condition is fulfilled the output will be glitch minimized. Switch layers placed after layer $k$ may be needed to increase the randomization, but since the output from layer $k$ is glitch minimized, the simpler switch shown in Fig. 4 can be used for these layers.

The added logic for converting <1,0> to <0,1> can be seen as a two-bit unordered to ordered thermometer encoder. If the switches are kept fixed (i.e. p is fixed), the proposed architecture works as a normal thermometer encoder. Hence, the architecture is a thermometer encoder with included glitch minimized scrambling.
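The unordered thermometer code and the presorting step can be written down directly. The sketch below uses hypothetical function names; the presorter is expressed with an AND and an OR operation, which is one way to realize the mapping of <1,0> to <0,1> with all other input pairs passed unchanged.

```python
def unordered_thermometer(value, n_bits):
    """Copy each binary bit 2**i times, MSB group first."""
    bits = []
    for i in reversed(range(n_bits)):
        bit = (value >> i) & 1
        bits.extend([bit] * (2 ** i))
    return bits

def presort(a_in, b_in):
    """Two-bit unordered to ordered thermometer encoder: <1,0> becomes <0,1>."""
    return a_in & b_in, a_in | b_in

word = unordered_thermometer(0b101, 3)     # binary offset code {101}
pairs = {inputs: presort(*inputs) for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

The word for {101} consists of four copies of the MSB, two copies of the middle bit and one copy of the LSB, so the number of ones still equals the value of the binary word even though the ones are not contiguous.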


2. SIMULATIONS

In Tab. 5 the relative glitch power has been estimated for four different DAC architectures. As input signal a multicarrier ADSL signal with 256 carriers has been used. As can be expected, the proposed glitch minimized thermometer coding technique performs just as well as plain thermometer coding. Randomization of thermometer code is about as bad as binary offset coding from the glitch power aspect. In Fig. 7 a) and b) the effect of mismatch on distortion is compared between thermometer coding and thermometer coding with glitch minimization.

The simulation shows an improvement of the SFDR, when only considering the matching error, compared with normal thermometer coding (13 dB). In the simulations a 6-bit DAC with a random matching error of $\sigma = 0.02$ has been used. Note that all harmonics disappear using the proposed method.

It is important to be aware that a fast varying input signal becomes more randomized than a slowly varying signal, because only the difference between two samples is randomized.

Table 4. Truth table for presorting logic.

a'i+1    b'i+1    ai+1    bi+1    switch setting
0        0        0       0       random
0        1        0       1       keep previous
1        0        0       1       keep previous
1        1        1       1       random

3. CONCLUSIONS

In this paper we have presented a novel method where dynamic element matching is combined with glitch minimization. We have presented an architecture, similar to the commonly used scrambler with a number of switch layers, with the difference that we use a modified switch that has the advantage of remembering the old path through the scrambler to minimize glitches. By simulations it has been shown that the proposed method both reduces the number of glitches and decorrelates the mismatch in the current sources from the signal.

Figure 5. Switch logic with presorter. [Schematic: the switch of Fig. 4 (inputs a and b, outputs x and y, XOR gate, D flip-flop, signal p) preceded by an AND gate (&) and an OR gate (>=1) driven by the inputs a' and b'.]

Figure 6. Simplified switch matrix with presorter switch. [Switch matrix with the input groups 2^0, 2^1 and 2^2 and a constant 0, where every switch is controlled by a pseudo random signal p.]

Table 5. Relative glitch power for different DAC structures.

Type of coding                                              Normalized SNDR (dB)
offset binary code                                          0
thermometer code                                            11
thermometer code + randomization                            0
thermometer code + randomization + glitch minimization      11

Figure 7. Simulation of thermometer coded DAC (a) without and (b) with glitch minimization. [Two spectra showing PSD in dB/Hz over the horizontal range 0 to 0.5: (a) "Thermometer coded DAC" and (b) "Thermometer coded DAC with glitch minimization".]

4. REFERENCES

[1] L.R. Carley and J. Kenney, “A 16-bit 4'th order noise-shaping D/A converter”, Proc. of 1988 Custom Integrated Circuits Conf., USA, May 1988.

[2] H.T. Jensen and I. Galton, “An analysis of the partial randomization dynamic element matching technique”, IEEE Trans. on Circuits and Systems II, vol. 45, no. 12, pp. 1538-1549, Dec. 1998.

[3] N.U. Andersson and J.J. Wikner, “Comparison of different dynamic element matching techniques for wideband CMOS DACs”, in Proc. of the NorChip Conf., Oslo, Norway, Nov. 8-9, 1999.

[4] M. Vesterbacka, M. K. Rudberg, J.J. Wikner, and N. Andersson, “Dynamic Element Matching in D/A Converters with Restricted Scrambling”, accepted to ICECS'00, Beirut, Lebanon, Dec. 2000.

Paper 10 - Dynamic Element Matching in D/A Converters with Restricted Scrambling


Mark Vesterbacka, Mikael Karlsson Rudberg, J. Jacob Wikner, and Niklas Andersson

Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Lebanon, Dec. 2000.



Dynamic Element Matching in D/A Converters with Restricted Scrambling

Mark Vesterbacka (1), Mikael Rudberg (1,2), J. Jacob Wikner (1,2) and Niklas U. Andersson (1,2)

(1) Department of Electrical Engineering, Linköping University, 581 83 Linköping, Sweden

(2) Microelectronics Research Center, Ericsson Microelectronics AB, Box 1885, 581 17 Linköping, Sweden

E-mail: {markv, mikaelr, jacobw, niklasa}@isy.liu.se

ABSTRACT

Inaccurate matching of the analog sources in a D/A converter causes a signal-dependent error in the output. This distortion can be transformed into noise by assigning the digital control to the analog sources randomly, a technique referred to as dynamic element matching. In this paper, we present a dynamic element matching technique where the scrambling is restricted such that the glitches in the converter are minimized. In this way, the distortion due to glitches is reduced and the signal-dependent error due to matching is suppressed. A hardware structure that implements the approach is proposed, and the operation of the hardware is described. Simulation results indicate that the method has the potential to reduce glitches as well as the optimal thermometer-coded converter does, with a signal-dependent error level that is almost as low as that achieved with prior dynamic element matching techniques.

1. INTRODUCTION

A major problem in the design of high-resolution communication D/A converters is the inaccuracy in the fabrication process. This imperfection introduces mismatch among the sources to the analog output, resulting in non-linear behavior of the converter [1, 2]. To overcome this problem, a technique referred to as dynamic element matching (DEM) has been suggested, where digital signal processing is used to control the switching of the analog sources so that the distortion is transformed into noise [1, 3, 4, 5]. Hence, signal-dependent errors are suppressed, and if we combine this technique with oversampling, we can reduce the error caused by the noise by low-pass filtering the output [3].

However, converters in many modern communication applications need to operate at high speed. At high speed, glitches caused by delay variations in different paths will have a significant impact on the achievable resolution of a converter. To reduce the glitches, thermometer code can be used, which yields a minimal amount of glitches compared with other codes, but requires complex hardware. In practice, a segmented converter structure is used for high-resolution converters, where the least significant source weights are binary scaled and the most significant weights are thermometer-coded. Hence, the thermometer coding used in the presented DEM encoders applies to segmented converters as well.

The use of a thermometer encoder suits the DEM techniques well. However, a problem is that the current DEM techniques use a type of scrambling that ruins the good glitch property that can be achieved with thermometer code. In this paper we present an approach to scramble thermometer code so that the glitch energy associated with a code transition is minimized, while we maintain the property of having a low sensitivity to matching errors in a converter. In the following, we will also suggest a hardware structure that implements the presented approach, and explain the function of the hardware with a simple example in which a 4-bit converter is used.

2. A DEM APPROACH

The operation of an N-bit thermometer-coded flash converter is characterized by

$$A = \sum_{k=1}^{n} w_k \, \mathrm{ref} \qquad (1)$$

where A is the analog output, ref is a reference quantity of, e.g., current, voltage or charge that should be added to the output, $n = 2^N - 1$ denotes the number of sources of reference units to add, and $w_1 \ldots w_n$ is a bit vector encoded from a digital input D used to control which sources to add [1]. The name thermometer code implies that a continuous range of bits $w_1 \ldots w_i$ should be one, while the remaining bits are zero. However, by relaxing the last constraint and allowing any $w_k$ to be one as long as the output is correct, we achieve a redundant code with many possible representations for most numbers. This redundant property makes it suitable for use in DEM techniques where we randomize what code to use. By restricting the randomization to only include codes that produce small glitches, it is possible to improve the glitch performance compared with using a conventional DEM technique where a code is selected randomly from the full set of codes. In this work we present an approach that aims at solving this problem.
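To make the redundancy concrete, the following minimal Python sketch evaluates Eq. (1) for a bit vector and counts how many bit vectors yield a given output. The function names and the unit reference (ref = 1) are assumptions made for this illustration only.

```python
from math import comb

def dac_output(w, ref=1.0):
    """Eq. (1): every set bit w_k adds one reference unit to the output."""
    return ref * sum(w)

def num_representations(n_bits, value):
    """Number of bit vectors of length 2^N - 1 with exactly `value` ones,
    i.e. the number of redundant codes giving the same analog output."""
    n = 2 ** n_bits - 1              # number of unit sources
    return comb(n, value)

if __name__ == "__main__":
    thermometer = [1, 1, 1, 0, 0, 0, 0]   # conventional code for 3 (N = 3)
    scattered = [0, 1, 0, 0, 1, 0, 1]     # another valid code for 3
    print(dac_output(thermometer), dac_output(scattered))
    print(num_representations(3, 3))
```

Only the extreme values (all bits zero or all bits one) have a single representation, which is why the text says that most numbers, rather than all, have many representations.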

The key idea in our approach is to construct a subset of codes containing only the codes that cause a minimal number of bits to be altered in a code transition. By this we will minimize the glitches, since they to a significant extent depend on this parameter. The codes in a subset are identified from an investigation of the two cases presented in the following.

2.1. Code selection case A: Bit increase

In case A, the output from the converter is increasing. This case implies that the number of bits for which $w_k = 1$ must increase in a thermometer code. For this case, a code transition with a minimum number of altered bits is achieved if we select the new code so that we only set bits.

2.2. Code selection case B: Bit decrease

In case B, the output from the converter is decreasing. This complementary case implies that the number of bits for which $w_k = 0$ must increase. For this case, a code transition with a minimum number of altered bits is achieved if we select the new code so that we only clear bits.

2.3. DEM approach

Since all bits in a thermometer-coded converter control unit sources, every code in the identified subset introduces, in either of the two cases, the same minimal amount of glitch energy. Using any other code outside the subset that still yields the correct output would require more bits to be switched, with the number of additionally cleared bits equal to the number of additionally set bits.
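The two code selection cases can be illustrated with a short Python sketch that compares the number of altered bits for a restricted transition with that of an unrestricted (conventional DEM) transition. The function names and the use of Python's `random` module as the source of randomness are assumptions for the illustration; the sketch is a behavioral model, not the hardware proposed in this paper.

```python
import random

def hamming(a, b):
    """Number of bit positions in which two codes differ (altered bits)."""
    return sum(x != y for x, y in zip(a, b))

def restricted_next(state, target, rng):
    """Pick a code with `target` ones by only setting bits (case A) or
    only clearing bits (case B), chosen at random among the candidates."""
    ones = sum(state)
    nxt = list(state)
    if target >= ones:                       # case A: set some zeros
        candidates = [i for i, b in enumerate(state) if b == 0]
        for i in rng.sample(candidates, target - ones):
            nxt[i] = 1
    else:                                    # case B: clear some ones
        candidates = [i for i, b in enumerate(state) if b == 1]
        for i in rng.sample(candidates, ones - target):
            nxt[i] = 0
    return nxt

def unrestricted_next(state, target, rng):
    """Conventional DEM model: any code with `target` ones, chosen
    without regard to the previous state."""
    n = len(state)
    chosen = set(rng.sample(range(n), target))
    return [1 if i in chosen else 0 for i in range(n)]

if __name__ == "__main__":
    rng = random.Random(1)
    state = [1, 0, 1, 0, 1, 1, 1]            # 5 ones, N = 3
    print(hamming(state, restricted_next(state, 6, rng)))    # always 1
    print(hamming(state, unrestricted_next(state, 6, rng)))  # 1 or more
```

In this model a restricted transition always alters exactly |target − ones| bits, while an unrestricted transition alters that many bits plus an even number of extra ones, since every unnecessary clear must be paired with an extra set.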

One approach to implement this idea is illustrated in Fig. 1, where an N-bit D/A converter is shown. Compared to the conventional approach, we have added a register to the output of the DEM encoder that contains $2^N - 1$ D flip-flops. The use of this register is two-fold. First, the control signals $w_k$ become independent of delay variations in the encoder, improving the glitch situation. Second, the register stores the current state, which can be used in the encoder to construct the proper subsets. The cost of this solution is an increased complexity of the DEM encoder and, of course, the hardware for the additional $(2^N - 1)$-bit wide register.

In a second paper, also presented at ICECS’00, we present another implementation approach that instead uses a tree structure [6].

3. REALIZATION OF A DEM ENCODER

In the following description of the DEM encoder proposed in the previous section, we initially consider code selection case A, since the modifications needed to handle the complementary case B are minor. Consider the N-bit converter in Fig. 1. Compared to a conventional converter, the state of the thermometer code is stored in a register. This state will be updated by the DEM encoder according to the following approach.

In Fig. 2 the suggested realization of the DEM encoder is shown. The input W = w1…wn is the input from the register in Fig. 1, and D is an offset binary input that should be used to encode a new state W'. In the figure, white arrows have been used to indicate binary data, and gray arrows have been used to indicate thermometer-coded data. There is also one control bit c, indicated with a line, that signals whether the current code selection case is A or B. The boxes with round corners are the additional operations needed to handle the somewhat more complex code selection case B. Now we will describe the operations needed to handle case A.

Figure 1. An N-bit D/A converter with a DEM encoder and a register for storing the thermometer-coded state. (Labels in the figure: input D, DEM encoder, D flip-flops, control signals $w_1, w_2, \ldots, w_{2^N-1}$, unit sources ref, summed output A.)

3.1. Description of the operations

At the top of Fig. 2, there is a block labeled ‘2^N–1:N counter’. The purpose of this block is to count the number of ones in the current state W, i.e., to perform a thermometer-to-binary encoding. The count, denoted B1, is subtracted from the data input D in the block ‘Subtractor’, yielding a difference B2 corresponding to the number of zeros that should be changed to ones in the next state. This additional number of ones is literally created in the block ‘Thermometer encoder’, which converts the binary count B2 to the thermometer code T1 with a continuous range of ones.

The block ‘M-bit scrambler’ produces T2 by scrambling the position of M bits, including all ones created in the preceding block. The number M should be equal to the number of zeros in the current state W, which is calculated in the block ‘Invert’ producing B3. This block inverts the bits in the number of ones B1, yielding the wanted count of zeros, since it is given by the relation

$$B_3 = 2^N - 1 - B_1 = \overline{B_1},$$

assuming two’s complement arithmetic.

Finally the block labeled ‘Zero distributor’ distributes the scrambled bits to the bits that are zero in the current state W. The bits that are one are unaffected during the distribution. The result of this block is output as the next state W'.

Figure 2. Implementation of the DEM encoder. (Blocks: ‘2^N–1:N counter’, ‘Subtractor’, ‘Negate’, ‘Thermometer encoder’, ‘M-bit scrambler’, ‘Zero distributor’ and three ‘Invert’ blocks; signals: D, W, W', c, B1 to B4 and T1 to T4.)
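The case A operations described above can be sketched as a small behavioral model in Python. The function names, the list-of-bits representation and the use of `random.Random.shuffle` as the scrambler are assumptions for the illustration; the actual hardware scrambler is not specified at this level of detail in the text.

```python
import random

def count_ones(w):
    """'2^N-1:N counter': number of ones in the current state W (B1)."""
    return sum(w)

def thermometer_encode(b, n):
    """'Thermometer encoder': b ones followed by n - b zeros (T1)."""
    return [1] * b + [0] * (n - b)

def scramble(t, m, rng):
    """'M-bit scrambler': permute the first m bits, leave the rest (T2)."""
    head = t[:m]
    rng.shuffle(head)
    return head + t[m:]

def zero_distribute(w, t):
    """'Zero distributor': replace the zeros of w, in order, by the
    leading bits of t; bits of w that are one are left unaffected."""
    bits = iter(t)
    return [b if b == 1 else next(bits) for b in w]

def encode_case_a(w, d, n_bits, rng):
    """Next state W' for case A (d is at least the number of ones in w)."""
    n = 2 ** n_bits - 1
    b1 = count_ones(w)
    b2 = d - b1                      # 'Subtractor': zeros to be set
    if b2 < 0:
        raise ValueError("case B is not handled by this sketch")
    t1 = thermometer_encode(b2, n)
    b3 = ~b1 & n                     # 'Invert': 2^N - 1 - B1 zeros in W
    t2 = scramble(t1, b3, rng)
    return zero_distribute(w, t2)

if __name__ == "__main__":
    rng = random.Random(0)
    print(encode_case_a([1, 0, 1, 0, 0, 1, 0], 5, 3, rng))
```

Since D never exceeds 2^N − 1, the count B2 never exceeds B3, so all ones created by the thermometer encoder fall inside the scrambled range and reach a zero position of W.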

3.2. Operations in case B

Obviously, the presented scheme is not designed to handle case B, where we need to clear ones instead of setting zeros. However, this can easily be achieved by modifying the described structure slightly. First we detect case B, e.g., as an overflow c in the ‘Subtractor’. When this case is detected, we can use the hardware to clear ones instead of setting zeros by inverting both the input W and the output W'. This is accomplished by the ‘Invert’ blocks at the input W and at the output W' in Fig. 2, which invert a signal depending on the control input.

Some other modifications are also needed in order to handle case B. The block ‘Negate’ is needed to correct the output B2 when we have an overflow from ‘Subtractor’, i.e., when we calculate the number of ones to clear. The effective operation will be B4 = |B2|. Another modification is needed to the block ‘M-bit scrambler’, where the input B3 in case A is the number of zeros, indicating the number of bits to be scrambled. In case B we need to scramble a number of bits corresponding to the number of ones B1 in the current state. Since B3 is calculated as the inverted B1, we simply make the inversion operation conditional on case A, as indicated in Fig. 2.
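A behavioral Python sketch of the complete encoder, with the conditional inversions for case B, could look as follows. The function name `dem_encode`, the variable names and the use of `random.Random.shuffle` for the scrambler are assumptions for the illustration, and the sketch models the data flow only, not the gate-level hardware.

```python
import random

def dem_encode(w, d, rng):
    """One update of the state w (list of 2^N - 1 bits) to a code with
    d ones, altering only |d - sum(w)| bits."""
    n = len(w)
    b1 = sum(w)                       # counter: ones in current state
    b2 = d - b1                       # subtractor
    c = b2 < 0                        # overflow flag signals case B
    b4 = -b2 if c else b2             # 'Negate': B4 = |B2|
    m = b1 if c else n - b1           # conditional 'Invert' of B1
    w_in = [1 - b for b in w] if c else list(w)    # conditional invert of W
    t1 = [1] * b4 + [0] * (n - b4)    # thermometer encoder
    head = t1[:m]
    rng.shuffle(head)                 # M-bit scrambler
    bits = iter(head)
    w_out = [b if b == 1 else next(bits) for b in w_in]   # zero distributor
    return [1 - b for b in w_out] if c else w_out  # conditional invert of W'

if __name__ == "__main__":
    rng = random.Random(3)
    state = [0] * 15
    for d in [13, 4, 15, 0, 7]:
        state = dem_encode(state, d, rng)
        print(d, "".join(map(str, state)))
```

The inversion makes the ones of W look like zeros to the zero distributor, so setting "zeros" in the inverted state and inverting the result again clears ones in the original state.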

4. A 4-BIT CONVERTER EXAMPLE

To illustrate the operation of the presented DEM approach further, we will give a numerical example on the operation of a 4-bit converter. Let us assume an arbitrary initial state of

W = 101011101011111

which corresponds to the decimal value 11. The first operation we perform is to count the number of ones in W with the block ‘2^N–1:N counter’, yielding

$B_1 = 11_{10}$ .

We will also assume an arbitrarily chosen digital input to the converter of

$D = 13_{10}$

which primarily is used to see how many zeros we need to set in the current state W to achieve the next state W'. This is achieved in the block ‘Subtractor’ by the operation

$B_2 = D - B_1 = 13_{10} - 11_{10} = 2_{10}$ .

In this case we have no overflow, yielding c = 0. Hence the conditional operation ‘Negate’ produces the count B4 = B2. To set the two ones in the current state W we first create two ones literally in the block ‘Thermometer encoder’, which converts the number of additional ones B2 into thermometer code, i.e.

T1 = 110000000000000 .

Now the strategy is to select as many bits (including all ones in T1) as we have zero bits in the current state W. To obtain this we need to calculate the number of zeros in W. This is a straightforward operation since we already have counted the number of ones as B1, and know the total number of bits in the state to be $2^N - 1 = 15$. The block ‘Invert’ performs exactly this operation, since inversion of all bits in two’s complement arithmetic corresponds to the operation

B3 = 15 – B1 = 15 – 11 = 4 .

This count is used in the block ‘M-bit scrambler’ to scramble the position of the corresponding number of bits. We illustrate this operation S by assuming that the randomization process happens to yield

T2 = S(0011-----------) = 0110-----------

where we have indicated the bits not included in the scrambling (15 – 4 = 11 bits) with ‘-’. Finally the block ‘Zero distributor’ distributes the scrambled bits T2 to the zeros in the current state W. Any bit marked with a ‘-’ above is left unchanged. The distribution of the bits is illustrated below:

T2:  0110-----------
       (distribution to zeros)
T3:  101011101011111
       (changes)
T4:  101111111011111

The four scrambled bits replace, in order, the four zero bits of the current state (bit positions 2, 4, 8 and 10, counted from the left), and the two ones among them set the zeros in positions 4 and 8. The next state becomes

W ' = 101111111011111

which is output to the register.
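The numbers in this example can be checked with a few lines of Python. The sketch below is only an illustration: the function name is hypothetical, and the scrambler outcome is fixed to the pattern 0110 assumed in the text rather than drawn at random.

```python
def worked_example():
    """Recompute the case A example for N = 4 from the text."""
    n_bits = 4
    n = 2 ** n_bits - 1                  # 15 unit sources
    w = "101011101011111"                # current state W
    d = 13                               # digital input D
    b1 = w.count("1")                    # counter: number of ones in W
    b2 = d - b1                          # subtractor: zeros to be set
    b3 = ~b1 & n                         # invert: number of zeros in W
    t1 = "1" * b2 + "0" * (n - b2)       # thermometer encoder output
    # Assumed scrambler outcome for the b3 scrambled bits (from the text).
    scrambled = "0110"
    assert len(scrambled) == b3 and scrambled.count("1") == b2
    bits = iter(scrambled)
    # Zero distributor: zeros of W take the scrambled bits in order.
    w_next = "".join(b if b == "1" else next(bits) for b in w)
    return b1, b2, b3, t1, w_next

if __name__ == "__main__":
    print(worked_example())
```

Running the function gives B1 = 11, B2 = 2, B3 = 4, T1 = 110000000000000 and the next state 101111111011111, in agreement with the values above.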

5. SIMULATION RESULTS

The function of the proposed hardware was verified by a C program that simulates the hardware for an N-bit converter, where N is defined at compilation time. To estimate the performance of the presented approach, we modeled four 6-bit converters in Matlab, assuming that the glitch power is proportional to the number of switching sources. The modeled D/A converters were three conventional converters (a binary-scaled, a thermometer-coded, and a thermometer-coded converter with conventional DEM) plus a thermometer-coded converter with the presented DEM approach. As a measure of glitch performance we use the ratio between simulated glitch power and signal power. In Table 1, power ratios obtained from simulation with a multi-tone input are listed. The input contained 256 tones with equidistant frequency spacing, distributed over the entire Nyquist frequency range. The power ratios have been normalized with respect to the binary-scaled converter. In the table, we see that there is an improvement from using a thermometer-coded converter over a binary-scaled one. However, this gain in performance is lost when we introduce conventional DEM. The presented DEM approach is able to regain the glitch performance to the level of the thermometer-coded converter.
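The glitch model, in which glitch power is proportional to the number of switching sources, can be illustrated with the following Python sketch. It is not the Matlab model used for Table 1: the function name, the random test codes (instead of the multi-tone input) and the assumption that a binary-weighted source of weight 2^k counts as 2^k unit sources are all choices made for this illustration.

```python
import random

def glitch_counts(codes, n_bits, seed=0):
    """Total number of switched unit sources for four converter models."""
    rng = random.Random(seed)
    n = 2 ** n_bits - 1
    totals = {"binary": 0, "thermometer": 0,
              "conventional DEM": 0, "restricted DEM": 0}
    conv, restr = set(), set()       # active sources of the DEM models
    prev = 0
    for d in codes:
        # Binary-scaled: the toggled bit weights sum to prev XOR d.
        totals["binary"] += prev ^ d
        # Thermometer-coded: only the sources between prev and d switch.
        totals["thermometer"] += abs(d - prev)
        # Conventional DEM: a fresh random selection of d sources.
        new_conv = set(rng.sample(range(n), d))
        totals["conventional DEM"] += len(conv ^ new_conv)
        conv = new_conv
        # Restricted DEM: only set (case A) or only clear (case B).
        if d >= len(restr):
            free = sorted(set(range(n)) - restr)
            new_restr = restr | set(rng.sample(free, d - len(restr)))
        else:
            new_restr = restr - set(rng.sample(sorted(restr), len(restr) - d))
        totals["restricted DEM"] += len(restr ^ new_restr)
        restr = new_restr
        prev = d
    return totals

if __name__ == "__main__":
    rng = random.Random(42)
    codes = [rng.randrange(64) for _ in range(1000)]   # 6-bit test codes
    print(glitch_counts(codes, 6))
```

Under these assumptions the restricted DEM model switches exactly as many sources as the thermometer-coded model, while the conventional DEM model never switches fewer.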

To investigate the performance in terms of matching errors, we apply an independent Gaussian-distributed relative matching error with a standard deviation of 2% to each weighted source in all converter structures. In Table 2, the estimated SFDR from the simulations is given. In the table, we see that both the converter with conventional DEM and the converter with DEM that uses restricted scrambling are able to improve the SFDR by at least 13 dB over the other structures.
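The effect of randomized source selection on matching errors can be illustrated with a simple Python sketch. It does not reproduce the SFDR figures of Table 2; the function names, the seeds and the restriction to unit sources are assumptions for the illustration.

```python
import random

def make_sources(n, sigma=0.02, seed=0):
    """Unit sources with independent Gaussian relative matching errors."""
    rng = random.Random(seed)
    return [1.0 + rng.gauss(0.0, sigma) for _ in range(n)]

def thermometer_output(sources, d):
    """Fixed selection: code d always uses the first d sources, so the
    error is a fixed function of d (signal-dependent distortion)."""
    return sum(sources[:d])

def dem_output(sources, d, rng):
    """Randomized selection: any d sources, so the error varies from
    sample to sample around d times the mean source value (noise)."""
    return sum(rng.sample(sources, d))

if __name__ == "__main__":
    src = make_sources(63)               # 6-bit converter, 63 unit sources
    rng = random.Random(1)
    d = 30
    print(thermometer_output(src, d))
    print([round(dem_output(src, d, rng), 3) for _ in range(5)])
```

With the fixed selection the same code always gives the same error, whereas with the randomized selection the average output is proportional to the code, so the mismatch appears as noise and a gain error rather than as harmonic distortion.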

These results indicate that our DEM technique is able to reclaim the gain in SNDR that is lost with conventional DEM techniques, while the performance in terms of matching is maintained.

Table 1. Relative glitch performance for different 6-bit converter structures.

6-bit converter       Normalized power ratio [dB]
Binary-scaled         0
Thermometer-coded     -11
Conventional DEM      0
Restricted DEM        -11

Table 2. Matching in different 6-bit converter structures.

6-bit converter       SFDR [dB]
Binary-scaled         54
Thermometer-coded     55
Conventional DEM      68
Restricted DEM        68

6. CONCLUSION

A DEM approach was presented that aims at reducing the additional glitch energy introduced by other DEM techniques. This is achieved by restricting the scrambling in DEM to only include codes that do not increase the glitch energy.

Further, a hardware structure was proposed that implements this approach. The hardware is realized from two cases depending on the state of the analog output. In the first case the number of bits that are one increases, and in the second case the number of ones decreases. We started by describing how the first case can be implemented, and then reused the hardware in the second case by introducing some additional hardware that is activated when the second case is detected. This can be achieved since there is a simple relation between the two cases that enables a simple transformation of the input and output state.

The functionality of the hardware was verified with a C program that simulated the hardware for an N-bit converter, where N is a generic parameter. For the purpose of estimating the performance, four 6-bit converters were also modeled in Matlab, using a simple model for the glitches. The simulation results indicated that the proposed implementation has the potential of suppressing the glitches as well as the optimal thermometer-coded converter, while yielding a distortion level that is almost as low as conventional DEM implementations.

7. REFERENCES

[1] R.J. van de Plassche, Integrated Analog-to-Digital and Digital-to-Analog Converters, Kluwer Academic Publishers, Boston, 1994.

[2] M. Gustavsson, J.J. Wikner, and N. Tan, CMOS Data Converters for Communications, Kluwer Academic Publishers, 2000.

[3] P. Carbone and I. Galton, “Conversion error in D/A converters employing dynamic element matching”, Proc. 1994 IEEE Int. Symp. on Circuits and Systems, vol. 2, 1994, pp. 13-16.

[4] H.T. Jensen and I. Galton, “A low-complexity dynamic element matching DAC for direct digital synthesis”, IEEE Trans. of Circuits and Systems II, vol. 45, no. 1, Jan. 1998, pp. 13-27.

[5] L.R. Carley and J. Kenney, “A 16-bit 4’th order noise-shaping D/A converter”, in Proc. of Custom Integrated Circuits Conference, 1988, pp. 21.7/1-21.7/4.

[6] M. Rudberg, M. Vesterbacka, N.U. Andersson, and J.J. Wikner, “Glitch minimization and dynamic element matching in D/A converters”, to appear in IEEE Proc. The 7th Int. Conf. on Electronics, Circuits, and Systems, Beirut, Lebanon, Dec. 17-20, 2000.

Dissertations
Division of Electronics Systems

Department of Electrical Engineering
Linköpings universitet

Sweden

Vesterbacka, M.: On Implementation of Maximally Fast Wave Digital Filters, Linköping Studies in Science and Technology, Diss. No. 487, Linköpings Universitet, Sweden, June 1997.

Johansson, H.: Synthesis and Realization of High-Speed Recursive Digital Filters, Linköping Studies in Science and Technology, Diss. No. 534, Linköpings Universitet, Sweden, May 1998.

Gustavsson, M.: CMOS A/D Converters for Telecommunications, Linköping Studies in Science and Technology, Diss. No. 552, Linköpings Universitet, Sweden, Dec 1998.

Palmkvist, K.: Studies on the Design and Implementation of Digital Filters, Linköping Studies in Science and Technology, Diss. No. 583, Linköpings Universitet, Sweden, June 1999.

Wikner, J. J.: Studies on CMOS Digital-to-Analog Converters, Linköping Studies in Science and Technology, Diss. No. 667, Linköpings Universitet, Sweden, April 2001.
