An introduction to Digital Signal Processors (DSP) Using the C55xx family

Post on 09-Feb-2021

  • There are different kinds of embedded processors

    • There are a fair number of different kinds of microprocessors used in embedded systems

    – Microcontrollers

      • Small, fairly simple devices. Non-volatile storage. Generally a fair bit of basic I/O (GPIO, SPI, etc.)

    – “Processor”

      • More-or-less a desktop processor with favorable power numbers. Atom, ARM A8, etc.

    – System on a Chip

      • Generally more CPU power than a microcontroller, but has lots of “add-ons”, perhaps including analog I/O and specialized devices (Ethernet controller, LCD controller, FPGA, etc.)

  • Specialized CPU

    • I/O capabilities are often the first thing we look for when selecting a processor for our embedded system

    – We also need

      • “Enough” processing power

      • And maybe power characteristics.

    • But there are chips with specialized processing capabilities

    – And they can often get the “enough” processing power at a lower price point and using (a lot) less power.

    • Examples of specialized processors:

    – GPUs

      • Good on graphics, certain “regular” math-heavy calculations

    – Network processors

      • Good at handling massive data in and out while doing minimal processing on that data.

    – Digital Signal Processors

      • Also move lots of data, but usually much cheaper, smaller, and have specialized CPU capabilities.

  • Digital Signal Processor (DSP)

    • DSP chips are optimized for high performance/low power on very specific types of computation.

    – Power:

    • The C5515 hits 51 mW at 100 MHz

    – Tasks:

    • Filtering, FFT are the big ones.

  • Lecture overview

    • Task-specific processors

    – Network processors

    – GPUs

    – DSPs

    • DSPs in detail

    – Floating point vs. fixed point*

    – Specialized problem solving

      • Filters (FIR)

        – MAC

        – Data movement

        – Circular buffers

      • FFT

        – Bit reversed accesses

    • Conclusion

  • Fixed point vs. floating point

    • A reasonable way to break down DSPs:

    – Floating point

    – No floating point (fixed point)

    • Floating point

    – Makes things a lot easier for the programmer.

    • Fixed point

    – A good DSP programmer can often get better power numbers with fixed point.

      • But it can be a ton of work.

    Fixed point

  • Basic fixed point

    • “Qn” is a naming scheme used to describe fixed point numbers.

    – n is the number of bits to the right of the radix point (bit n, counting from 0 at the LSB, is the last bit before it).

      • So a normal integer is Q0.

    • Examples

    – 0110 is 6 in binary

    – 0110 as a Q2 is 1.5

    • Numbers are generally 2’s complement

    – 1100 is -4.

    – 1100 as Q3 is -0.5
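The examples above can be checked with a small sketch. The helper names `sign_extend` and `q_to_double` are made up for illustration and are not from the slides; the sketch assumes fields narrower than 32 bits.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `pattern` (two's complement).
   Illustrative helper; assumes 1 <= bits < 32. */
static int32_t sign_extend(uint32_t pattern, unsigned bits)
{
    uint32_t sign = (uint32_t)1 << (bits - 1);
    pattern &= (sign << 1) - 1u;          /* keep only the low `bits` bits */
    return (int32_t)(pattern ^ sign) - (int32_t)sign;
}

/* Interpret an integer as a Qn value: n bits sit to the right of the
   radix point, so the integer is scaled by 2^-n. Assumes n < 32. */
static double q_to_double(int32_t raw, unsigned n)
{
    return (double)raw / (double)((uint32_t)1 << n);
}
```

With these, `q_to_double(sign_extend(0x6, 4), 2)` gives 1.5 and `q_to_double(sign_extend(0xC, 4), 3)` gives -0.5, matching the slide.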

  • Factoids

    • Signed x-bit Q(x-1) numbers (16-bit Q15, for example) represent values from -1 to (almost) 1.

    – This is the form typically used because the product of two numbers in that range is still in that range (the one exception is -1 × -1 = +1, which just overflows).

    • Multiplying two 16-bit Q15 numbers yields?
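One way to work out the answer is to count fractional bits: 15 + 15 = 30, so the 32-bit product is a Q30 number. The sketch below illustrates that and the rounding step back to Q15. The function names are hypothetical, and it assumes `>>` on a negative value is an arithmetic shift, which is common but implementation-defined in C.

```c
#include <stdint.h>

/* Full-precision product of two Q15 values.
   15 + 15 = 30 fractional bits, so the result is Q30 in 32 bits. */
static int32_t q15_mul_q30(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Round a Q30 value to Q15 and saturate to the 16-bit range.
   Assumes an arithmetic right shift for negative values. */
static int16_t q30_to_q15(int32_t p)
{
    p += 0x4000;                 /* half of one Q15 LSB, so we round */
    p >>= 15;                    /* drop 15 fractional bits: Q30 -> Q15 */
    if (p > 32767)  p = 32767;   /* only reached for -1 * -1 */
    if (p < -32768) p = -32768;
    return (int16_t)p;
}
```

For example, 0.5 × 0.5 is `q15_mul_q30(0x4000, 0x4000)`, which is 0x10000000 (0.25 in Q30), and `q30_to_q15` turns that into 0x2000 (0.25 in Q15).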

  • Fixed point

  • And this is important…

    http://courses.ece.uiuc.edu/ece390/lecture/lockwood/l1.html

  • Lowpass filter template

    Filters

  • FIR filter

    • Basic idea is to take an input, x, and put it into a big (and wide) shift register.

    – Multiply each of the x values (old and new) by some constant.

    • Sum up those product terms.

    • Example:

    – Say b0=.5, b1= .75, and b2=.25

    – x is 1, -1, 0, 1, -1, 0 etc. forever.

    • What is the output?

    $$y[n] = \sum_{k=0}^{M} b_k \, x[n-k]$$
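The example can be worked through with a direct-form FIR loop. This is a floating-point sketch for checking the arithmetic, not DSP-style code; the function name `fir` is an assumption, and the sketch treats samples before the first input as zero.

```c
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of b[k] * x[n-k].
   Samples before x[0] are treated as zero (an assumption of this sketch). */
static void fir(const double *b, size_t ntaps,
                const double *x, double *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        double acc = 0.0;
        for (size_t k = 0; k < ntaps && k <= n; k++)
            acc += b[k] * x[n - k];   /* one multiply-accumulate per tap */
        y[n] = acc;
    }
}
```

With b = {0.5, 0.75, 0.25} and x = {1, -1, 0, 1, -1, 0}, the outputs are 0.5, 0.25, -0.5, 0.25, 0.25, -0.5. Once the shift register has filled, the pattern 0.25, 0.25, -0.5 repeats for as long as the input does.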

  • Automatic tools--Matlab

    • The figure directly above is from Matlab.

    – You specify the various parameters (fpass, fstop, p, Ap, As, etc.) and it will generate the b_k values needed.

    – If the parameters are too demanding, the filter could be huge (500+ z^-1 blocks)

      • In that case, we may want to use an IIR filter. Those are feedback-based and are a bit more touchy (prone to being unstable, etc.).

  • Consider a traditional RISC CPU

    • For a reasonably large filter, the x and b values don’t fit in the register file.

    top: LD x++
         LD b++
         MULT a,x,b
         ADD accum, accum, a
         goto top

    (++ indicates auto increment)

    – That’s a lot of instructions

    • Plus we need to shift the x values around.

    – Also a loop…

    • Depending on how you count it, could be 8-10 instructions per z^-1 block…

  • Some FIR “tricks”

    • Most obvious is to use a circular buffer for the x values.

    • The problem with this is that you need more instructions to see if you’ve fallen off the end of the buffer and need to wrap around…

    – And it’s a branch, which is mildly annoying due to predictors etc.

    [Figure: a circular buffer with slots numbered 0 to 5]

  • A slightly different version

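    Int16 FIR(Uint16 i)
    {
        Int32 sum;
        Uint16 j, index;

        sum = 0;

        // The actual filter work.
        // NOTE: the loop bound was lost in extraction. NTAPS (the number of
        // coefficients in LP) is an assumed name, not from the original slide.
        for (j = 0; j < NTAPS; j++)
        {
            // in[] is a circular buffer of ASIZE samples; i indexes the newest one.
            if (i >= j)
                index = i - j;
            else
                index = ASIZE + i - j;   // wrapped past the start of the buffer

            sum += (Int32)in[index] * (Int32)LP[j];   // Q15 * Q15 gives Q30
        }

        sum = sum + 0x00004000;      // So we round rather than truncate.
        return (Int16)(sum >> 15);   // Conversion from 32-bit Q30 to 16-bit Q15.
    }

    [Figure: arrays X and B, each with slots 0 to 5; annotation: “This part is icky”]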

  • How fast could one do it?

    • Well, I suppose we could try one instruction.

    – MAC y, x++, z++

    • That’s got lots of problems.

    – No register use for the arrays, so very heavy memory use

      • 2 data elements from memory/cache

      • 3 register file changes (pointers, accumulator)

    – Plus we need to do a MAC, and mults are already slow, which hurts clock period.

    – Plus we need to worry about wrapping around in the circular buffer.

    – Oh yeah, we need to know when to stop.

    MAC

  • Data

    • I need a lot of ports to memory

    – Instruction fetch

    – 2 data elements

    • I need a lot of ports to the register file

    – Or at least banked registers

    Data movement

  • C55xx Data buses (cont.)

    • Twelve independent buses:

    – Three data read buses

    – Two data write buses

    – Five data address buses

    – One program read bus

    – One program address bus

    • So yeah, we can move data

    – Registers appear to go on the same buses.

      • Registers are memory mapped…

  • OK, so data seems doable

    • Well sort of, still worried about updating pointers.

    – 2 data reads, 1 data write, need to update 2 pointers, running out of buses.

    Filters/MAC/Data movement

  • MAC?

    • Most CPUs don’t have a multiply-and-accumulate (MAC) instruction

    – Too slow.

    – Hurts clock period

    • So unless we use the MAC a LOT it hurts.

    • But for a DSP this is our bread and butter.

    – So we’ll take the 10% clock period hit or whatever so we don’t have to use two separate instructions.

  • Wrapping around?

    • Seems possible.

    – Imagine a fairly smart memory.

    • You can tell it the starting (current) address, the end-of-buffer address, and the start-of-buffer address.

    • It knows enough to be able to generate the next address, even with wrap around.

    – This also takes care of our pointer problem.

    [Figure: a circular buffer with slots numbered 0 to 5]

    Circular buffers

  • Circular Buffer Start Address Registers (BSA01, BSA23, BSA45, BSA67, BSAC)

    • The CPU includes five 16-bit circular buffer start address registers.

    • Each buffer start address register is associated with a particular pointer.

    • A buffer start address is added to the pointer only when the pointer is configured for circular addressing in status register ST2_55.

  • Circular Buffer Size Registers (BK03, BK47, BKC)

    • Three 16-bit circular buffer size registers specify the number of words (up to 65535) in a circular buffer.

    • Each buffer size register is associated with particular pointers.

    • In the TMS320C54x-compatible mode (C54CM = 1), BK03 is used for all the auxiliary registers, and BK47 is not used.
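To see how a start address and a size register can replace the wrap-around branch, here is a simplified software model. It is a sketch only: the struct and function names are hypothetical, the fields merely echo the BSA and BK register names, and the real hardware's addressing details are not reproduced.

```c
#include <stdint.h>

/* Hypothetical model of one circular pointer. Field names echo the
   register names on the slides, but the behavior is simplified. */
typedef struct {
    uint16_t bsa;    /* buffer start address */
    uint16_t bk;     /* buffer size in words */
    uint16_t index;  /* offset within the buffer, 0 .. bk-1 */
} circ_ptr;

/* Return the current effective address (start + offset), then
   post-increment the offset, wrapping back to 0 at the buffer size. */
static uint16_t circ_next(circ_ptr *p)
{
    uint16_t addr = (uint16_t)(p->bsa + p->index);
    p->index = (uint16_t)(p->index + 1);
    if (p->index >= p->bk)
        p->index = 0;   /* in hardware the address unit does this wrap */
    return addr;
}
```

Starting from a buffer at address 100 with size 6 and offset 4, successive calls yield 104, 105, 100, 101: the wrap costs the program no extra instructions once the registers are set up.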

  • By the way…

    • If we know the start and end of the buffer

    – We know the length of the loop.

    • Pretty much down to one instruction once we get going.

    – The TI optimized FIR filter takes 25 cycles to set things up and then takes 1 cycle per MAC.

  • FFTs

    • Another common thing we want to do is an “FFT”

    – Tells you about the frequency parts of a signal

    • Breaks down the signal into “sin bins”

    • Useful in a lot of applications

    FFT

  • Discrete Fourier Transform (DFT)

    • The DFT is commonly written as:

    $$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1$$

    • One might also use

  • “The” Fast Fourier Transform (FFT) Algorithm

    • There are many fast algorithms (FFTs) that can be used to compute the Discrete Fourier Transform (DFT).

    – Since the DFT is defined as:

    $$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}$$

    – How many MACs do we need?

      • Real or complex?

    • Any algorithm which reduces this can be said to be “fast”
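A rough operation count makes the comparison concrete. In the direct DFT, each of the N outputs is a sum of N products, so it needs N × N complex MACs, and each complex MAC expands to 4 real MACs. The sketch below compares that with a radix-2 FFT, used here as one example of a fast algorithm (the slides do not name a specific one). The function names are made up for illustration.

```c
#include <stdint.h>

/* Direct DFT: N outputs, each a sum of N products -> N*N complex MACs. */
static uint64_t dft_complex_macs(uint64_t n)
{
    return n * n;
}

/* (a+jb)*(c+jd) accumulated into a complex sum takes 4 real MACs. */
static uint64_t dft_real_macs(uint64_t n)
{
    return 4 * n * n;
}

/* Radix-2 FFT, n assumed to be a power of two: log2(n) stages of
   n/2 butterflies, each butterfly using one complex multiply. */
static uint64_t fft_radix2_butterflies(uint64_t n)
{
    uint64_t stages = 0;
    for (uint64_t m = n; m > 1; m >>= 1)
        stages++;
    return (n / 2) * stages;
}
```

For N = 1024 the direct form needs 1,048,576 complex MACs, while the radix-2 version needs 5,120 butterflies.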

  • FFT

  • FFT

  • $W_N = e^{-j 2\pi / N}$

  • FFT support

    • FFTs typically take an array in “normal” order and return the output in “bit reversed” order.

    – Or the other way around (as on prev. page)

    • Hardware is often able to swap the order of the address bits

    – This makes it (much) faster to deal with the bit-reversed data.
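For comparison, this is what the reordering costs in software when there is no hardware help. It is a minimal sketch with hypothetical function names, assuming the array length is a power of two.

```c
#include <stdint.h>

/* Reverse the low `bits` bits of i (bits = log2 of the FFT length). */
static uint32_t bit_reverse(uint32_t i, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i onto r */
        i >>= 1;
    }
    return r;
}

/* In-place bit-reversal permutation of an array of length 2^bits. */
static void bit_reverse_permute(int16_t *data, unsigned bits)
{
    uint32_t n = (uint32_t)1 << bits;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t j = bit_reverse(i, bits);
        if (j > i) {               /* swap each pair only once */
            int16_t t = data[i];
            data[i] = data[j];
            data[j] = t;
        }
    }
}
```

For an 8-point array holding 0 through 7, the permuted order is 0, 4, 2, 6, 1, 5, 3, 7. The inner loop runs once per address bit for every element, which is the work a bit-reversed addressing mode removes.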

  • And a bit more

    • Other support?

    – Viterbi is an algorithm commonly used for error correction in communication.

    • Provide special instructions for it

    – Mainly data movement, pointer, and compare instructions.

    • Overflow is a constant worry in filters

    – TI’s 40-bit accumulators provide 8 guard bits for detection.

    • That’s unheard of in a mainstream processor.

    • Saves instructions for checking for overflow.

    Other

  • Why do I care again?

    • To be clear, the main point of this slide set is

    – For certain special purpose tasks, there are processors dedicated to doing those tasks

      • Those processors can be both more powerful and lower-power than a generic CPU at doing that task.

    – The trick is that they are designed to do tasks that are common in that field.

      • Here we see they have optimized a common DSP task (FIR filter stage) from about 8-10 assembly instructions down to 1!

    • Other such devices?

    – GPUs

      • For graphics

      • Move large amounts of data with simple operations

      • SIMD

    – Network processors [1]

      • Pattern matching - the ability to find specific patterns of bits or bytes within packets in a packet stream.

      • Key lookup - the ability to quickly undertake a database lookup using a key (typically an address in a packet) to find a result, typically routing information.

      • Data bitfield manipulation - the ability to change certain data fields contained in the packet as it is being processed.

      • Etc.

    [1] See https://en.wikipedia.org/wiki/Network_processor for more details. List taken from there.