implementation of channel demodulator for dab system11

Upload: ashu4nov

Post on 25-Feb-2018


  • 7/25/2019 IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM11

    1/4

    IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM

    Chien-Ming Wu, Ming-Der Shieh, Hsin-Fu Lo, and Min-Hsiung Hu

    Graduate School of Engineering Science & Technology, National Yunlin University of Science & Technology, Taiwan
    Department of Electronic Engineering, National Yunlin University of Science & Technology, Taiwan
    Division of Design Service, National Science Council Chip Implementation Center (CIC), Taiwan

    ABSTRACT

    This paper describes the VLSI implementation of the Fast Fourier
    Transform (FFT) for the Eureka-147 Digital Audio Broadcasting
    (DAB) system. We emphasize how to minimize the hardware
    requirement and efficiently manage the memory to meet the DAB
    requirement. Implementation results demonstrate the applicability of
    our work, which features a modular design, consumes less silicon
    area, and facilitates extension to high-transmission-rate
    applications. The core size of the resulting chip implementation is
    2086 × 1806 μm² based on the TSMC 0.35 μm 1P4M CMOS
    process. Performance evaluation reveals that our design for the
    targeted channel demodulator outperforms previous solutions.

    1. INTRODUCTION

    The Digital Audio Broadcasting (DAB) system, described in the
    European Eureka-147 standard [1], offers high-quality audio
    services, supports multimedia data for mobile reception, and might
    replace the traditional radio system. Basically, two strategies are
    employed to implement the DAB receiver: the DSP-based
    architecture [2, 3] and the ASIC-based implementation [4, 5]. The
    former offers maximum flexibility, ease of use, and simple
    programming, but it can only provide limited processing
    capability. In contrast, the ASIC-based implementation has the
    potential to support real-time symbol decoding at low cost.

    Figure 1 shows an overview of the DAB system, in which
    ISO/MPEG coding is adopted for source coding and COFDM
    (Coded Orthogonal Frequency Division Multiplexing) for channel
    coding and modulation [1]. After convolutional coding, the
    generated codewords are interleaved in frequency for the fast
    information channel and in both time and frequency for the main
    service channel, and then the OFDM modulation is performed.

    In this paper, we focus on the design and implementation of the
    channel demodulator, which essentially performs a Fast Fourier
    Transform (FFT). In general, two basic types of FFT architectures
    can be found in the literature: the pipelined architecture, with each
    stage consisting of a butterfly unit [6, 7], and the single butterfly
    architecture [5, 8], which employs just one radix-r butterfly unit. The
    main concern is the trade-off between hardware overhead and speed
    requirement.

    Although the pipelined architecture can provide a higher
    throughput rate than the single butterfly implementation, we are still
    interested in the single butterfly architecture because of the
    specifications of the channel demodulator as well as the hardware
    considerations in the implementation of DAB receivers. For the
    single butterfly implementation, a basic problem that arises is how
    to efficiently manage memory read/write accesses in order to
    increase its throughput rate. The common solutions include:
    (1) Use a high-radix implementation to reduce the total number of
    memory accesses at the expense of increasing the arithmetic
    complexity, i.e., the hardware requirement of a high-radix butterfly
    unit.
    (2) Partition the memory into several banks in order to allow
    concurrent accesses of multiple data with a more complicated
    addressing scheme, which might correspond to a larger routing area.

    In this paper, we describe the design and implementation of the
    FFT for the DAB channel demodulator. We show our experience
    in using the conflict-free memory addressing arrangement in [9]
    to minimize the hardware requirement and to match the DAB
    requirement. Implementation results demonstrate the applicability of
    our work to the targeted channel demodulator and the advantages
    over previous solutions [5, 7] in terms of hardware requirement.
    The rest of this paper is organized as follows: Section 2 reviews the
    background and our previous work [9] related to this paper. Section
    3 describes the resulting architecture and design of the FFT processor.
    Then, the corresponding chip implementation and performance
    evaluation are shown in Section 4. Finally, Section 5 concludes this
    work.

    Figure 1. An overview of the DAB system [5].

    2. PRELIMINARY RESULTS

    The N-point Discrete Fourier Transform (DFT) of a sequence x(k)
    is defined as

    $X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk}$    (1)

    where n = 0, 1, ..., N-1 and $W = e^{-j2\pi/N}$. From Eq. (1), we know
    that $N^2$ multiplications and N(N-1) additions are needed to directly
    perform the required computations. By applying the FFT, the
    computational complexity can be reduced to O(N log N).
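
    As a point of reference for Eq. (1), the direct method can be sketched in a few lines of Python. This is only an illustration of the definition and of the operation counts quoted above; the function names `dft_direct` and `direct_op_counts` are hypothetical and do not come from the paper.

```python
import cmath

def dft_direct(x):
    """Evaluate X(n) = sum_k x(k) * W**(n*k) with W = exp(-j*2*pi/N), as in Eq. (1)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[k] * W ** (n * k) for k in range(N)) for n in range(N)]

def direct_op_counts(N):
    """Complex multiplications and additions of the direct method: N^2 and N(N-1)."""
    return N * N, N * (N - 1)

# A unit impulse transforms to a flat spectrum: every X(n) equals 1.
X = dft_direct([1, 0, 0, 0])

# Operation counts for the 2048-point transform used in DAB mode I.
mults, adds = direct_op_counts(2048)
```

    For N = 2048 the direct method needs more than four million complex multiplications, which is the motivation for the FFT decomposition discussed next.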

    If the number of sampled points is a power of the radix r, then it
    is easy to compute the DFT by using a radix-r FFT algorithm. In
    such a case, the N-point DFT can be decomposed into a set of
    recursively related r-point transforms. Decimation in time (DIT)
    and decimation in frequency (DIF) are the two basic classes of FFT
    algorithms [10]. Specifically, the DIT FFT algorithm is based on
    decomposing the input sequence x(k) into successively smaller
    subsequences. The DIF FFT algorithm decomposes the
    output sequence X(n) into smaller subsequences in the same way.
    Figure 2 shows a DIT 8-point FFT algorithm, in which the data in
    each stage can be processed based on the so-called butterfly units.

    0-7803-7761-3/03/$17.00 ©2003 IEEE


    Figure 2. The data flow graph of DIT 8-point FFT computation.

    In general, an N-point FFT computation requires $(N/r) \times \log_r N$
    radix-r butterfly computations, and either the pipelined architecture
    or the single butterfly architecture can be selected for a dedicated
    application. For the single butterfly implementation, this implies
    $2N \times \log_r N$ memory accesses, which are the main bottleneck for fast
    FFT computation. Therefore, we need an efficient memory
    management strategy to overcome this problem, i.e., to reduce the
    number of memory accesses or to increase the memory bandwidth.
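
    The two counts above are easy to tabulate. The following sketch, with hypothetical helper names, evaluates them for a few transform lengths and radices, assuming N is an exact power of r.

```python
def num_stages(N, r):
    """Number of stages log_r(N); N is assumed to be an exact power of r."""
    m, n = 0, 1
    while n < N:
        n *= r
        m += 1
    if n != N:
        raise ValueError("N must be a power of r")
    return m

def butterfly_count(N, r):
    """(N/r) * log_r(N) radix-r butterfly computations."""
    return (N // r) * num_stages(N, r)

def memory_accesses(N, r):
    """2N * log_r(N) accesses: N reads and N writes in every stage."""
    return 2 * N * num_stages(N, r)

radix2_2048 = (butterfly_count(2048, 2), memory_accesses(2048, 2))
radix2_1024 = (butterfly_count(1024, 2), memory_accesses(1024, 2))
radix4_1024 = (butterfly_count(1024, 4), memory_accesses(1024, 4))
```

    Comparing the two 1024-point cases shows the effect of a higher radix: the number of stages is halved, and so is the memory traffic, at the price of a more complex butterfly.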

    In our previous work [9], we presented a set of simple but
    efficient equations to partition the memory into a number of
    memory banks such that the equivalent memory bandwidth can be
    increased with simple interconnection networks. Let m be the
    number of stages for the FFT computation; then its value can be
    computed by

    $m = \lceil \log_r N \rceil$    (2)

    Following the notation of the conventional number system, it is
    assumed that the original memory address $(d)_r$ is expressed in
    unsigned radix-r representation defined as

    $(d)_r = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1, d_0)_r$    (3)

    where $d_i$ is an integer and $0 \le d_i \le r-1$. In consequence, a feasible
    solution to partition the memory into r banks can be easily obtained
    as shown in Eq. (4), which implies that the original address $(d)_r$ will
    be distributed into the bank number B(d, r).

    $B(d, r) = (d_{m-1} + d_{m-2} + \cdots + d_2 + d_1 + d_0) \bmod r$    (4)

    The correctness of Eq. (4) is assured by observing that for a given
    butterfly index, the equation contains the distinguishable variable at
    each stage.

    Finally, we consider the mapping of $(d)_r$ into one of the address
    locations of the selected bank B(d, r). To simplify the hardware
    implementation, the assigned address BA(d, r) in the bank B(d, r) is
    obtained by discarding the least significant digit of the original
    address:

    $BA(d, r) = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1)_r$    (5)

    Equation (5) causes no conflict because two original addresses that
    differ only in the least significant digit are distributed into
    different banks by Eq. (4), since $0 \le d_0 \le r-1$.
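
    The bank assignment can be checked exhaustively for small sizes. The sketch below reads Eq. (4) as the digit sum of the address modulo r and Eq. (5) as the address with its least significant digit dropped. It additionally assumes, as in a standard in-place radix-r FFT, that the r operands of one butterfly are addresses that differ in exactly one digit. All function names are hypothetical.

```python
def digits(d, r, m):
    """Radix-r digits of address d, least significant digit first."""
    return [(d // r**i) % r for i in range(m)]

def bank(d, r, m):
    """Bank number B(d, r): digit sum of the address modulo r."""
    return sum(digits(d, r, m)) % r

def bank_address(d, r):
    """Address inside the bank, BA(d, r): drop the least significant digit."""
    return d // r

def conflict_free(r, m):
    """Check the mapping for all N = r**m addresses and all m stages."""
    N = r**m
    # Every original address must receive its own (bank, address) slot.
    slots = {(bank(d, r, m), bank_address(d, r)) for d in range(N)}
    if len(slots) != N:
        return False
    # Assumed operand pattern: r addresses differing only in digit s.
    for s in range(m):
        for base in range(N):
            if (base // r**s) % r != 0:
                continue
            operands = [base + k * r**s for k in range(r)]
            if len({bank(d, r, m) for d in operands}) != r:
                return False
    return True

ok_radix2 = conflict_free(2, 5)   # N = 32
ok_radix4 = conflict_free(4, 3)   # N = 64
example = (bank(0b1011, 2, 4), bank_address(0b1011, 2))
```

    Because changing one digit changes the digit sum by a nonzero amount smaller than r, the r operands always fall into r different banks, which is what allows them to be fetched concurrently.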

    3. FFT DESIGN AND IMPLEMENTATION

    Figure 3 depicts the block diagram of the single butterfly
    architecture for our FFT processor. It operates on a 24.576 MHz
    clock and consists of a simple radix-2 DIT butterfly unit, a
    single-port FFT RAM, a coefficient ROM, a control unit, and an
    address-generate unit (AGU). All variables are complex and the
    internal datapath widths are either 8 or 16 bits. The details of the
    VLSI realization are described in the following subsections.

    Figure 3. Block diagram of the FFT processor.

    3.1 Memory Arrangement

    For the memory arrangement, we first have to decide whether the
    ping-pong mode or the in-place mode is to be applied to store the
    intermediate values when implementing the FFT RAM. The main
    disadvantage of the former is that twice as much memory is
    required in comparison with the in-place operation, although the
    control circuit is simple. For in-place scheduling, exactly one memory
    space is needed for storing the intermediate values, and the old
    computed values are immediately overwritten by the newly computed
    values. This is an important feature for the realization of long FFTs
    because the area for storing the large amount of intermediate results
    occupies a significant fraction of the available chip area. For this
    reason, we consider only in-place schemes in this work. Basically,
    the memory addresses of the in-place schedule can be generated
    with little hardware overhead based on the cyclic rotation property
    [11].

    The lower hardware cost of the single butterfly architecture is
    achieved at the price of a lower throughput rate than the pipelined
    version. According to operational mode I defined in the Eureka-147
    standard, a 2048-point FFT operation should be completed within
    1.25 ms. Under such a circumstance, it is not possible to complete
    the desired FFT operation based on the radix-2 solution without
    memory partitioning, given the chosen operational frequency of
    24.576 MHz. In order to make the single butterfly architecture meet
    the DAB requirement, memory partitioning becomes a cost-effective
    solution. In our implementation, the single-port FFT RAM is divided
    into r = 2 banks to meet the timing requirement, and the in-place
    scheduling scheme is applied to save memory space.
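
    The timing argument can be made concrete with a rough cycle budget. The sketch below assumes that memory traffic is the only limit and that a single-port RAM (or each single-port bank) serves one access per clock cycle; pipeline fill and control overhead are ignored. The constant and variable names are illustrative only.

```python
CLOCK_HZ = 24.576e6      # operating frequency stated in the text
DEADLINE_S = 1.25e-3     # mode I time limit for one 2048-point FFT
N = 2048
STAGES = 11              # log2(2048)

# Clock cycles available before the deadline.
budget_cycles = round(CLOCK_HZ * DEADLINE_S)

# 2N * log2(N) memory accesses for the radix-2 single butterfly design.
accesses = 2 * N * STAGES

# Assumption: one access per cycle on one undivided single-port RAM.
cycles_one_ram = accesses

# Assumption: two single-port banks accessed concurrently double the bandwidth.
cycles_two_banks = accesses // 2

fits_one_ram = cycles_one_ram <= budget_cycles
fits_two_banks = cycles_two_banks <= budget_cycles
```

    Under these assumptions the undivided memory needs 45056 cycles against a budget of 30720, whereas two banks bring the figure down to 22528 cycles.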

    The address-generate unit shown in Figure 4 is designed to
    generate addresses for the two memory banks and the coefficient ROM.
    The butterfly counter is used to sequentially generate the required
    butterfly indices at stage one. The two barrel shifters first
    concatenate their indices, respectively, with the current butterfly
    index and then emulate the right rotational property of addresses at
    the present stage specified by the stage counter. Finally, the MUX
    distributes the addresses based on Eqs. (2)-(5) such that the
    output of each barrel shifter is directed into the correct
    memory bank. For the radix-2 implementation, the control signal
    Bank-index is derived by performing a bit-wise XOR operation on
    the original addresses according to Eq. (4).
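
    A behavioural sketch of this address generation for the radix-2 case follows. The exact concatenation order and rotation amount are not spelled out in the text, so the choice made here (append the operand bit as the least significant bit of the butterfly index, then rotate right by the stage number) is an assumption, as are all function names. The bank index is the XOR of all address bits, which for r = 2 equals the digit sum modulo 2.

```python
def rotate_right(value, shift, width):
    """Rotate a width-bit word right by shift positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value >> shift) | (value << (width - shift))) & mask

def parity(value):
    """XOR of all bits of value (the radix-2 bank index)."""
    return bin(value).count("1") & 1

def operand_addresses(bi, stage, m):
    """Assumed scheme: append operand bit 0/1 to the (m-1)-bit butterfly
    index, then rotate the m-bit word right by the stage number."""
    return tuple(rotate_right((bi << 1) | i, stage, m) for i in (0, 1))

def generate(bi, stage, m):
    """Return (bank, address-in-bank) for both operands of one butterfly."""
    return [(parity(a), a >> 1) for a in operand_addresses(bi, stage, m)]

pairs = generate(0b101, 1, 4)   # N = 16, stage 1, butterfly index 5
```

    With this scheme the two operands of a butterfly differ in exactly one bit, so their parities differ and they always reside in different banks.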

    In addition, the contents of the coefficient ROM and the
    corresponding addressing rules can be easily decided by following
    the data flow graph of the DIT FFT computation. Note that we only
    need to store half the twiddle coefficients due to their symmetric
    property. Let the radix-2 twiddle coefficient $W^p = e^{-j2\pi p/N}$ be
    stored in the pth ROM address. Then, the ROM contents can be
    accessed based on the current butterfly index BI and the present
    stage number t according to the following equations. Let the binary
    representation of the current butterfly index be given by

    $BI = (b_{m-2}, b_{m-3}, \ldots, b_2, b_1, b_0)_2$    (6)

    where $m = \log_2 N$ is the number of stages for the radix-2
    implementation. From the data flow graph, the elements $b_i$ of BI
    can be used as variables in conjunction with the value t to generate
    proper ROM addresses. Specifically, we first generate a vector from
    the present value t based on Eq. (7), and then the desired ROM
    address p(BI, t) can be computed by using the vector as a mask to
    filter out unwanted $b_i$'s according to Eq. (8).

    $2^{t-1} - 1 = [q_{m-2}, q_{m-3}, \ldots, q_1, q_0]_2$, for t = 1, 2, ..., m    (7)

    Equation (7) can be easily implemented by resetting a shift register
    and then shifting in a one from the least significant bit each time the
    stage advances. Eq. (8) represents the masked output of
    the bit reversal of the current butterfly index. In both cases, the
    implementation cost is almost negligible.
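
    Eq. (8) itself is not legible in this copy, so the following is only one possible reading, written as a sketch under stated assumptions rather than as the authors' formula. It assumes an in-place radix-2 DIT flow with natural-order input and bit-reversed output, a butterfly index that simply counts the butterflies of a stage from top to bottom, and a ROM holding $W^p$ for p = 0, ..., N/2-1. Under these assumptions the ROM address is the bit-reversed butterfly index masked by $2^{t-1}-1$ and then aligned by a left shift of m-t bits; the shift and all function names are choices made for this illustration.

```python
import cmath

def bit_reverse(value, width):
    """Reverse the lowest width bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def rom_address(bi, t, m):
    """Assumed reading of the masking rule: mask the bit-reversed
    butterfly index with 2**(t-1) - 1 (Eq. (7)), then align it."""
    mask = (1 << (t - 1)) - 1
    return (bit_reverse(bi, m - 1) & mask) << (m - t)

def fft_dit_in_place(x):
    """In-place radix-2 DIT FFT, natural-order input, bit-reversed storage."""
    data = list(x)
    N = len(data)
    m = N.bit_length() - 1
    assert N == 1 << m and m >= 1, "N must be a power of two"
    # Half-size coefficient ROM: W**p for p = 0 .. N/2 - 1.
    rom = [cmath.exp(-2j * cmath.pi * p / N) for p in range(N // 2)]
    for t in range(1, m + 1):
        half = N >> t                      # operand distance at stage t
        for bi in range(N // 2):
            group, pos = divmod(bi, half)
            top = group * 2 * half + pos
            bot = top + half
            w = rom[rom_address(bi, t, m)]
            a, b = data[top], data[bot] * w
            data[top], data[bot] = a + b, a - b
    # Results are stored in bit-reversed order; reorder for the caller.
    return [data[bit_reverse(n, m)] for n in range(N)]

spectrum = fft_dit_in_place([0, 1, 0, 0, 0, 0, 0, 0])  # X(n) = W**n for N = 8
```

    At stage 1 the mask is zero, so every butterfly uses $W^0$; each later stage exposes one more bit of the reversed index, which mirrors the shift-register behaviour described for Eq. (7).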

    Figure 4. The block diagram of the address-generate unit.

    3.2 Butterfly Unit

    The butterfly unit is the core of an FFT processor and determines
    the achievable clock speed and the resulting throughput. In this work,
    the butterfly unit was designed with the simple radix-2 DIT-FFT
    algorithm. As shown in Figure 5, the arithmetic operations consist
    of calculating a pair of complex values, A' = A + BW and
    B' = A - BW, from a pair of complex inputs, A and B, and the
    twiddle coefficient W.
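
    The butterfly arithmetic can be written out directly. The sketch below uses floating-point complex numbers for clarity, whereas the chip uses fixed-point datapaths of 8 or 16 bits; the expansion of the complex product into four real multiplications is the textbook form and is not claimed to be the multiplier structure used in the paper. Function names are hypothetical.

```python
def complex_multiply(br, bi, wr, wi):
    """Textbook complex product (br + j*bi) * (wr + j*wi) in real arithmetic."""
    return br * wr - bi * wi, br * wi + bi * wr

def butterfly(a, b, w):
    """Radix-2 DIT butterfly: A' = A + B*W and B' = A - B*W.
    The product B*W is computed once and shared by both outputs."""
    pr, pi = complex_multiply(b.real, b.imag, w.real, w.imag)
    product = complex(pr, pi)
    return a + product, a - product

a_out, b_out = butterfly(1 + 2j, 3 - 1j, -1j)
```

    Since the product is shared, one complex multiplier, one adder, and one subtractor suffice for a radix-2 butterfly.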

    Figure 5. The arithmetic of the radix-2 DIT-FFT algorithm.

    For a butterfly unit without pipelining, the critical path is the
    sum of the memory read operation, the arithmetic operation
    (multiplication and addition of complex numbers), and the
    memory write operation. To reduce the critical path delay, we divide
    the whole operation of the butterfly unit into (s+2) different steps
    (the first step for the memory read operation, the following s steps
    for the arithmetic operation, and the last step for the memory write
    operation) as indicated in Figure 6. Due to the in-place computation,
    we have to schedule the tasks assigned to the pipelined butterfly unit
    such that no control hazard occurs during memory accesses. A control
    hazard (see Figure 7(a)) results from the conflict when the butterfly
    unit intends to access more than two data in the same memory bank.

    Figure 7(b) shows the schedule that eliminates the control hazard
    provided that only single-port memory is available in the
    implementation. The arrangement of Figure 7(b) results in only 50%
    hardware utilization of the pipelined butterfly unit. In contrast,
    100% hardware utilization can be achieved if dual-port memory is
    employed in the design. Note that the area occupied by the
    memory module is proportional not only to the number of stored
    data but also to the number of ports. Obviously, the chip area of
    a dual-port memory is much larger than that of a single-port memory.

    Since we use a 24.576 MHz clock in our FFT processor, the
    arithmetic operation can be finished within one clock cycle (s = 1).
    Each butterfly operation thus takes only three clock cycles, one each
    for the memory read operation, the arithmetic operation, and the
    memory write operation. In addition, only 50% hardware utilization is
    achieved because single-port memory is employed in our design to
    reduce the hardware cost.
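
    The effect of the utilization figure on the total run time can be estimated as follows. This is a back-of-the-envelope sketch that ignores pipeline fill and drain and treats utilization as the fraction of cycles in which a new butterfly is issued; the function name and its parameters are hypothetical.

```python
CLOCK_HZ = 24.576e6  # operating frequency stated in the text

def fft_cycles(n_points, utilization):
    """Approximate cycle count of a radix-2 single butterfly FFT.
    Assumes one butterfly is issued per cycle at 100% utilization."""
    stages = n_points.bit_length() - 1
    butterflies = (n_points // 2) * stages
    return round(butterflies / utilization)

single_port = fft_cycles(2048, utilization=0.5)   # schedule of Figure 7(b)
dual_port = fft_cycles(2048, utilization=1.0)     # hypothetical dual-port case
single_port_ms = single_port / CLOCK_HZ * 1e3
```

    Even at 50% utilization the 2048-point transform takes roughly 0.92 ms at 24.576 MHz, so the cheaper single-port memory is sufficient for the mode I time limit discussed in Section 3.1.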

    Figure 6. Radix-2 DIT pipelined butterfly unit.

    Figure 7. (a) The control hazard. (b) The reconcile for control
    hazard.


    4. CHIP REALIZATION AND COMPARISON

    All the modules in our design have been successfully
    implemented based on the TSMC (Taiwan Semiconductor
    Manufacturing Company) 0.35 μm 1P4M CMOS process and
    simulated using Synopsys and Cadence tools. Based on the
    specifications of the DAB channel demodulator, the resulting FFT
    processor is capable of completing the four operational modes
    (mode I: 2048 points, mode II: 512 points, mode III: 256 points,
    and mode IV: 1024 points) with a clock frequency of 24.576 MHz.
    The corresponding physical layout is shown in Figure 8; it
    includes 2×1024×16 RAMs (two banks, each containing 1024×16
    bits) and 2×1024×8 ROMs (one for the real part and another for the
    imaginary part). In terms of 2-input NAND gates, the total
    gate count is 4351, excluding the memories. The
    resulting core size of the chip implementation is about 2086 × 1806
    μm² and the overall chip size including I/O pads is 2856 × 2594
    μm².

    $3 \times (\log_r N - 1)$ CM, $4 \times \log_r N$ Adder, $4 \times \log_r N$ Sub

    Figure 8. The layout of the developed FFT processor.

    We compare the performance of our implementation with the
    following FFT implementations: the pipelined architecture [7] and
    the single butterfly architecture [5]. The circuit complexities of
    these designs are compiled in Table 1. The pipelined architecture in
    [7] might be the preferred choice for high-speed applications, but it
    is not suitable for the DAB system. The memory bandwidth problem
    of [5] is solved by introducing a more complicated structure (the
    radix-4 butterfly unit) and utilizing more memory resources. Note
    that the operating frequency of [5] is 12.288 MHz. By taking
    advantage of efficient memory partitioning and employing the
    pipelined butterfly unit, our design reduces the required area
    complexity and still meets the DAB specifications.
    For DAB applications, our design requires less hardware than
    Delaruelle's work.

    5. CONCLUSION

    To date, much effort has been devoted to the development of
    low-cost DAB products. Among the techniques needed to build a DAB
    receiver, the FFT is one of the key components and is very suitable
    for ASIC implementation. This paper explores efficient solutions for
    hardware implementations of the FFT processor such that they
    can meet the specification of the Eureka-147 standard under limited
    hardware resources. All the functional blocks are designed,
    simulated, and verified using the Synopsys and Cadence software,
    and the final layout is ready for VLSI fabrication based on the
    0.35 μm TSMC process and Compass cell library.

    CM, 1 CM; 4 Adder, 1 Adder; Sub, 4 Sub; 4 Register

    Results show that our implementation has the potential to
    consume less silicon area and to facilitate extension to
    high-transmission-rate requirements.

    REFERENCES

    [1] ETS 300 401, "Radio broadcasting system: Digital audio broadcasting (DAB) to mobile, portable and fixed receivers," ETSI, 2nd edition, May 1997.
    [2] J. A. Huisken, F. V. Laar, A. Delaruelle, and N. L. Philips, "Specification, partitioning and design of a DAB channel decoder," in Proc. VLSI Signal Processing Workshop, pp. 21-29, 1993.
    [3] M. Bolle, D. Clawin, K. Gieske, F. Hofmann, T. Mlasko, M. J. Ruf, and G. Spreitz, "The receiver engine chipset for digital audio broadcasting," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, pp. 338-34, 1998.
    [4] A. Delaruelle, J. Huisken, J. V. Loan, and F. Welten, "A chip set for digital audio broadcasting channel decoder," in Proc. IEEE Custom Integrated Circuits Conf., pp. 13.4.1-13.4.4, 1995.
    [5] A. Delaruelle, J. Huisken, J. van Laan, and F. Welten, "A channel demodulator IC for digital audio broadcasting," in Proc. IEEE Custom Integrated Circuits Conf., pp. 47-50, 1994.
    [6] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf., pp. 131-134, 1998.
    [7] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 300-305, March 1995.
    [8] E. Cetin, Richard C. S. Morling, and I. Kale, "An extensible complex fast Fourier transform processor chip for real-time spectrum analysis and measurement," IEEE Trans. Instrumentation and Measurement, vol. 47, no. 1, pp. 95-99, Feb. 1998.
    [9] H. F. Lo, M. D. Shieh, and C. M. Wu, "Design of an efficient FFT processor for DAB system," in Proc. IEEE Int. Symp. Circuits and Systems, pp. 654-657, 2001.
    [10] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice-Hall Inc., 1988.
    [11] M. Biver, H. Kaeslin, and C. Tommasini, "In-place updating of path metrics in Viterbi decoders," IEEE J. Solid-State Circuits, vol. 24, pp. 1158-1159, Aug. 1989.

    Table 1. Comparisons of different implementations

    Item | Bidet [7] | A. Delaruelle [5] | Proposed
    No. of butterfly units | $\log_r N$, radix-r | 1, radix-4 | 1, radix-2
    Arithmetic components | $3 \times (\log_r N - 1)$ CM, $4 \times \log_r N$ Adder, $4 \times \log_r N$ Sub | CM, 4 Adder, 4 Sub, 4 Register | 1 CM, 1 Adder, 1 Sub
    Gate counts of arithmetic components | $8160 \times (\log_r N - 1) + 896 \times \log_r N$ | 9156 | 2954
    Memory size | $4 \times A_1$ … | 2048 (dual-port) | 2×2048
    No. of clock cycles (N = 2048) | 2458 | 11264 | 22528

    Note: (1) CM: 8-bit complex-number multiplier, (2) Add: 16-bit
    adder, (3) Sub: 16-bit subtractor, (4) $A_1$ = … $\log_r N$, and (5) $A_2$ = …
