240662_633888485056270520.ppt

Upload: rafesh

Post on 04-Apr-2018


  • 7/29/2019 240662_633888485056270520.ppt

    1/115

    Syllabus

    Architecture of TMS320C6x

    Functional units, fetch and execute

    Pipelining

    Registers

    addressing modes

    instruction sets

    Timers

    Interrupts

    serial ports

    DMA

    memory

  • 7/29/2019 240662_633888485056270520.ppt

    2/115

    Introduction to DSP

    A digital signal processor (DSP) is a type of microprocessor that is optimized for digital signal processing.

    It integrates system control and math-intensive functions.

    Its advantages are speed, cost and energy efficiency.

    It is a key component in many communication, medical, military and industrial products.

  • 7/29/2019 240662_633888485056270520.ppt

    3/115

    Alternatives

    FPGA

    Field-Programmable Gate Arrays can be reconfigured within a system,

    but they are more expensive and have high power dissipation.

    ASIC

    Application-Specific Integrated Circuits can perform specific functions extremely well, and can be made quite power efficient.

    But since ASICs are not field-programmable, their functionality cannot be iteratively changed or updated during product development.

  • 7/29/2019 240662_633888485056270520.ppt

    4/115

    Why go digital?

    Digital signal processing techniques are now so powerful that it is sometimes extremely difficult, if not impossible, for analogue signal processing to achieve similar performance.

    Examples:

    FIR filter with linear phase.

    Adaptive filters.
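To make the linear-phase FIR example concrete, here is a minimal sketch in Python. The function names and the tap values are illustrative assumptions, not taken from the slides. It relies on the standard result that symmetric (or antisymmetric) coefficients give a linear-phase FIR filter.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def has_linear_phase(taps, tol=1e-12):
    """True if the taps are symmetric or antisymmetric about their centre."""
    n = len(taps)
    symmetric = all(abs(taps[i] - taps[n - 1 - i]) <= tol for i in range(n))
    antisymmetric = all(abs(taps[i] + taps[n - 1 - i]) <= tol for i in range(n))
    return symmetric or antisymmetric


# Hypothetical 3-tap smoothing filter (symmetric, so linear phase).
taps = [0.25, 0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter(impulse, taps)  # the impulse response reproduces the taps
```

Feeding an impulse through the filter returns the taps themselves, which is a quick way to check the implementation.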

  • 7/29/2019 240662_633888485056270520.ppt

    5/115

    With DSP it is easy to:

    Change applications.

    Correct applications.

    Update applications.

    Additionally DSP reduces:

    Noise susceptibility.

    Chip count.

    Development time.

    Cost.

    Power consumption.

  • 7/29/2019 240662_633888485056270520.ppt

    6/115

    Why do we need DSP processors?

    Use a DSP processor when the following are required:

    Cost saving.

    Smaller size.

    Low power consumption.

    Processing of many high frequency signals in real-time.

  • 7/29/2019 240662_633888485056270520.ppt

    7/115

    Applications

  • 7/29/2019 240662_633888485056270520.ppt

    8/115

    General DSP System Block Diagram

    [Block diagram: Peripherals, Central Processing Unit, Internal Memory, Internal Buses, External Memory]

  • 7/29/2019 240662_633888485056270520.ppt

    9/115

    Classification of DSP

    Von Neumann architecture

    Harvard architecture

    Super Harvard architecture

  • 7/29/2019 240662_633888485056270520.ppt

    10/115

    VON NEUMANN'S ARCHITECTURE

  • 7/29/2019 240662_633888485056270520.ppt

    11/115

    One shared memory for instructions (program) and data, with one data bus and one address bus between processor and memory.

    Instructions and data have to be fetched in sequential order (known as the Von Neumann bottleneck), limiting the operation bandwidth.

    Its design is simple.

    It is mostly used to interface to external memory.

  • 7/29/2019 240662_633888485056270520.ppt

    12/115

    HARVARD ARCHITECTURE

  • 7/29/2019 240662_633888485056270520.ppt

    13/115

    Uses physically separate memories for instructions and data, requiring dedicated buses for each of them.

    Instructions and operands can therefore be fetched simultaneously.

    Different program and data bus widths are possible, allowing program and data memory to be better optimized to the architectural requirements.

    E.g.: if the instruction format requires 14 bits, then the program bus and memory can be made 14 bits wide, while the data bus and data memory remain 8 bits wide.

  • 7/29/2019 240662_633888485056270520.ppt

    15/115

    Efficient Memory Access

    [Figure: memory and bus arrangements for general purpose processors, early DSP processors, and more optimized DSP processors]

  • 7/29/2019 240662_633888485056270520.ppt

    16/115

    Classification of DSP

    Fixed point: performs integer operations

    Floating point: performs both integer and floating-point operations

    It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost.

    For educational purposes, use the floating-point device as it can support both fixed and floating point operations.

    Fixed point: TMS320C1x, C2x, C5x, ...

    Floating point: TMS320C3x, C4x, C67x, ...
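As an illustration of what "integer operations" means on a fixed-point device, the sketch below emulates fractional arithmetic in the common Q15 format (16-bit integers representing values in [-1, 1)). The Q15 choice and the helper names are assumptions for illustration. The slides do not prescribe a particular fixed-point format.

```python
Q15_ONE = 1 << 15  # 32768 represents 1.0 in Q15


def float_to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer, saturating at the limits."""
    v = int(round(x * Q15_ONE))
    return max(-32768, min(32767, v))


def q15_to_float(q):
    """Convert a Q15 integer back to a float."""
    return q / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 numbers: integer multiply, then shift right by 15."""
    return (a * b) >> 15


half = float_to_q15(0.5)          # 16384
quarter = float_to_q15(0.25)      # 8192
product = q15_mul(half, quarter)  # 4096, i.e. 0.125 in Q15
```

A floating-point device does this scaling in hardware, which is why it is easier to program even though a fixed-point device can compute the same result.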

  • 7/29/2019 240662_633888485056270520.ppt

    17/115

    C versus Assembly language

    Programs in C are more flexible and quicker to develop.

    Programs in assembly often have better performance; they run faster and use less memory, resulting in lower cost.

  • 7/29/2019 240662_633888485056270520.ppt

    20/115

    C / Assembly?

    How complicated is the program?

    If it is large and intricate, you will probably want to use C. If it is small and simple, assembly may be a good choice.

    Are you pushing the maximum speed of the DSP?

    If so, assembly will give you the last drop of performance from the device.

    For less demanding applications, you should consider using C.

  • 7/29/2019 240662_633888485056270520.ppt

    21/115

    How many programmers will be working together?

    If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time-critical segments.

    Which is more important, product cost or development cost?

    If it is product cost, choose assembly; if it is development cost, choose C.

    What is your background?

    If you are experienced in assembly (on other microprocessors), choose assembly for your DSP.

    If your previous work is in C, choose C for your DSP.

  • 7/29/2019 240662_633888485056270520.ppt

    22/115

    The Digital Signal Processor Market

  • 7/29/2019 240662_633888485056270520.ppt

    23/115

    The digital signal processor market is dominated by 4 companies.

    Analog Devices (www.analog.com/dsp)

    ADSP-21xx: 16 bit, fixed point

    ADSP-21xxx: 32 bit, floating and fixed point

    Lucent Technologies (www.lucent.com)

    DSP16xxx: 16 bit, fixed point

    DSP32xx: 32 bit, floating point

    Motorola (www.mot.com)

    DSP561xx: 16 bit, fixed point

    DSP560xx: 24 bit, fixed point

    DSP96002: 32 bit, floating point

    Texas Instruments (www.ti.com)

    TMS320Cxx: 16 bit, fixed point

    TMS320Cxx: 32 bit, floating point

  • 7/29/2019 240662_633888485056270520.ppt

    25/115

    TMS320 Family

    C2000: Lowest Cost

    Control Systems: Motor Control, Storage, Digital Ctrl Systems

    C5000: Efficiency, Best MIPS

    Wireless phones, Internet audio players, Digital still cameras, Modems, Telephony, VoIP

    C6000: Best Performance & Ease-of-Use

    Multi Channel and Multi Function App's, Comm. Infrastructure, Wireless Base-stations, Audio and Speech Processing, Imaging, Multi-media Servers, Video

  • 7/29/2019 240662_633888485056270520.ppt

    26/115

    C6000 Roadmap

    [Figure: performance versus time]

    1st Generation, fixed-point (C62x): C6201, C6202, C6203, C6204, C6205, C6211

    1st Generation, floating-point (C67x): C6701, C6711, C6712, C6713

    2nd Generation (Fixed Point), C64x DSP, 1.1 GHz: C6411; C6414, C6415, C6416 (General Purpose, Media Gateway, 3G Wireless Infrastructure)

    Multi-core C64x DSP

  • 7/29/2019 240662_633888485056270520.ppt

    27/115

    Features of the TMS320C6x

    The Texas Instruments TMS320C6x family of microprocessors is one of the largest VLIW success stories to date.

    This family of processors is built to deliver speed.

    Family members have different size, cost, memory, peripheral and power consumption specifications.

    E.g.:

    Fixed-point C6201 version

    5-ns instruction cycle time

    200-MHz clock rate

    Performance of up to 1600 MIPS

    Eight 32-bit instructions/cycle

    Floating-point C6701 version

    Can operate at 167 MHz

    6-ns instruction cycle time

    1 giga floating-point operations per second (GFLOPS)
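The figures quoted for the C6201 and C6701 follow from simple arithmetic, sketched below. The helper names are illustrative. The GFLOPS figure is computed under the assumption that six of the eight functional units can each issue one floating-point operation per cycle, as the deck states for the C67x.

```python
def peak_rate_millions(clock_mhz, operations_per_cycle):
    """Peak millions of operations per second = clock (MHz) x operations per cycle."""
    return clock_mhz * operations_per_cycle


def cycle_time_ns(clock_mhz):
    """Instruction cycle time in nanoseconds for a given clock rate in MHz."""
    return 1000.0 / clock_mhz


c6201_mips = peak_rate_millions(200, 8)    # eight instructions per cycle
c6201_cycle = cycle_time_ns(200)           # 5.0 ns
c6701_mflops = peak_rate_millions(167, 6)  # assumed six floating-point units
c6701_cycle = cycle_time_ns(167)           # just under 6 ns
```

These are peak rates: they are reached only when every functional unit is busy in every cycle.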

  • 7/29/2019 240662_633888485056270520.ppt

    28/115

    Very Long Instruction Word (VLIW)

    Refers to a CPU architecture designed to take advantage of instruction-level parallelism.

    Executes operations in parallel based on a fixed schedule determined when programs are compiled.

    The order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, hence the processor does not need the scheduling hardware.

    VLIW CPUs offer significant computational power with less hardware complexity but greater compiler complexity.

  • 7/29/2019 240662_633888485056270520.ppt

    29/115

    VLIW architectures execute multiple instructions/cycle and use simple, regular instruction sets

    More parallelism, higher performance

    Better compiler targets

  • 7/29/2019 240662_633888485056270520.ppt

    32/115

    Disadvantages of VLIW Architectures

    New kinds of programmer/compiler complexity

    Programmer (or code-generation tool) must keep track of instruction scheduling

    Deep pipelines and long latencies can be confusing and may make peak performance elusive

    Increased memory use

    High program memory bandwidth requirements

    High power consumption

    Misleading MIPS ratings

  • 7/29/2019 240662_633888485056270520.ppt

    33/115

    VelociTI

    The VLIW modification done by TI is called VelociTI.

    Reduces code size

    Increases performance when instructions reside off-chip

    The C6x architecture is based on the high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI).

    It is an excellent choice for multichannel and multifunction applications (several instructions captured and processed simultaneously).

  • 7/29/2019 240662_633888485056270520.ppt

    34/115

    TMS320C6x with VelociTI Enables Cost-Effective Solutions for Emerging Applications

    Unlimited Internet bandwidth

    Universal wireless communication

    New telephony features

    Remote medical diagnostics

    Automated cruise control

    Personal home base station

    Personalized home security

  • 7/29/2019 240662_633888485056270520.ppt

    35/115

    TMS320C6000 DSP Device Nomenclature

  • 7/29/2019 240662_633888485056270520.ppt

    36/115

    TMS320C6711

    A floating-point processor with VLIW architecture

    Internal memory includes a two-level cache architecture

    - 4 KB of level 1 program cache (L1P)

    - 4 KB of level 1 data cache (L1D)

    - 64 KB of RAM / level 2 cache for data/program (L2)

    Has a direct interface to both synchronous memories (SDRAM and SBSRAM) and asynchronous memories (SRAM and EPROM)

    With a 32-bit address bus, the total memory space is 2^32 bytes = 4 GB

    It requires 3.3 V for I/O and 1.8 V for the core

    Operates at 150 MHz

    Performs 900 million floating-point operations per second (MFLOPS)

    Translates to 1200 million instructions per second (MIPS)

  • 7/29/2019 240662_633888485056270520.ppt

    37/115

    DSK Contents

  • 7/29/2019 240662_633888485056270520.ppt

    38/115

    [Board photo labels:]

    1.8V Power Supply

    16M SDRAM

    128K FLASH

    Daughter Card I/F (EMIF Connector)

    Parallel Port I/F

    Power Jack

    Power LED

    3.3V Power Supply

    JTAG Header

    Emulation JTAG Header

    Reset

    Line Level Output (speakers)

    Line Level Input (microphone)

    16-bit codec (A/D & D/A)

    Three User LEDs

    User DIP switches

    C6711 DSP

    Daughter Card I/F (Periph Con.)

  • 7/29/2019 240662_633888485056270520.ppt

    39/115

    TMS320C6711 block diagram

  • 7/29/2019 240662_633888485056270520.ppt

    40/115

    CPU

    There are two sets of functional units, A and B.

    Each set contains four units and a register file.

    One set contains functional units .L1, .S1, .M1, and .D1;

    the other set contains units .D2, .M2, .S2, and .L2.

    .M unit: multiplication operations

    .L unit: logical and arithmetic operations

    .S unit: branch, bit manipulation and arithmetic operations

    .D unit: load/store and arithmetic operations

  • 7/29/2019 240662_633888485056270520.ppt

    42/115

    The C67x CPU executes all C62x instructions.

    In addition to C62x fixed-point instructions, six out of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions.

    The remaining two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle.

  • 7/29/2019 240662_633888485056270520.ppt

    43/115

    TMS320C6711 Memory

  • 7/29/2019 240662_633888485056270520.ppt

    44/115

    3 Access Levels of Memory Map

    1. L1 Memory

    - Cache-based architecture

    - Program cache and data cache

    - Size: program cache (4 Kbyte), data cache (4 Kbyte)

    2. L2 Memory

    - Size: 64 Kbyte

    - Program and data

    3. L3 Memory

    - External memory

  • 7/29/2019 240662_633888485056270520.ppt

    46/115

    External Memory

    - Synchronous memory (SDRAM, SBSRAM)

    - Asynchronous memory (SRAM, EPROM)

    Internal Memory

    - Program

    - Data

  • 7/29/2019 240662_633888485056270520.ppt

    47/115

    Registers:

    The two register files each contain 16 32-bit registers, for a total of 32 general-purpose registers (A0-A15, B0-B15).

    Interaction with the CPU must be done through these registers.

    The four functional units on each side of the CPU can freely share the 16 registers belonging to that side.

    Two cross paths, 1x and 2x, connect each side to the register file on the other side, so that units can access data from the register file on the opposite side.

    If register access is by functional units on the same side of the CPU, the register file can service all the units in a single clock cycle.

    Register access across the CPU, using the register file on the opposite side, supports one read and one write per cycle.

  • 7/29/2019 240662_633888485056270520.ppt

    48/115

    Restrictions on Register Accesses

    Registers A1, A2, B0, B1 and B2 are used as conditional registers

    Registers A4-A7 and B4-B7 are used for circular addressing

    Registers A0-A9 and B0-B9 (except B3) are temporary registers

    Any of the registers A10-A15 and B10-B15 that are used are saved and later restored before returning from a subroutine

  • 7/29/2019 240662_633888485056270520.ppt

    49/115

    Each functional unit has read/write ports

    Data path 1 (2) units read/write A (B) registers

    Data path 2 (1) can read one A (B) register per cycle

    40-bit words are stored in an adjacent register pair

    Used in extended-precision accumulation

    The 32 LSBs are stored in the even register (e.g. A2) and the remaining 8 bits are stored in the 8 LSBs of the next upper (odd) register (A3)

    64-bit values are stored in a similar fashion

    Two simultaneous memory accesses cannot use registers of the same register file as address pointers
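The register-pair storage described above can be sketched as follows. This is an illustrative Python emulation with assumed helper names, not TI code: the low 32 bits go to the even register and the remaining 8 bits go to the 8 LSBs of the odd register.

```python
MASK32 = 0xFFFFFFFF
MASK8 = 0xFF


def split_40bit(value):
    """Split a 40-bit value into (even_register, odd_register) contents."""
    value &= (1 << 40) - 1
    even = value & MASK32         # 32 LSBs, e.g. stored in A2
    odd = (value >> 32) & MASK8   # upper 8 bits, in the 8 LSBs of A3
    return even, odd


def join_40bit(even, odd):
    """Rebuild the 40-bit value from the register pair."""
    return ((odd & MASK8) << 32) | (even & MASK32)


a2, a3 = split_40bit(0x123456789A)  # hypothetical 40-bit accumulator value
```

The eight extra bits act as guard bits, so a long run of 32-bit accumulations can overflow into the odd register instead of being lost.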

  • 7/29/2019 240662_633888485056270520.ppt

    51/115

    C6x internal buses

  • 7/29/2019 240662_633888485056270520.ppt

    53/115

    'C6x Peripherals

  • 7/29/2019 240662_633888485056270520.ppt

    54/115

    C6x Peripherals

    [Block diagram: C6x CPU, EMIF, DMA, Boot, External Memory, McBSP, HPI/XB, Timer, PLL]

    EMIF: External Memory Interface.

    A 32-bit bus on which external memories and other devices can be connected.

    It includes features like internal wait state generation and SDRAM control.

    The EMIF can interface to both synchronous and asynchronous memories.

  • 7/29/2019 240662_633888485056270520.ppt

    55/115

    McBSP

    2 McBSPs: multichannel buffered serial ports.

    Each McBSP can be used for high-speed serial data transmission with external devices or reprogrammed as general-purpose I/O.

    McBSP1 is used to transmit and receive audio data from the AIC23 stereo codec.

    McBSP0 is used to control the codec through its serial control port.

  • 7/29/2019 240662_633888485056270520.ppt

    56/115

    On-chip PLL: generates the processor clock rate from a slower external clock reference.

    Timers: generate periodic timer events as a function of the processor clock. Used by DSP/BIOS to create time slices for multitasking.

    Power-down units: save power for durations when the CPU is inactive.

    EDMA controller: the Enhanced DMA controller allows high-speed data transfers without intervention from the DSP.

    Boot:

    - Boot from 4M external block

    - Boot from HPI/XB

    SBSRAM: Synchronous Burst Static Random Access Memory

  • 7/29/2019 240662_633888485056270520.ppt

    57/115

    Host Port Interface (HPI)

    The host port interface (HPI) is a parallel port through which a host processor can directly access the CPU's memory space.

    The host device is the master of the interface, which increases its ease of access.

    The host and the CPU can exchange information via internal or external memory. In addition, the host has direct access to memory-mapped peripherals.

    Connectivity to the CPU's memory space is provided through the DMA controller.

    The expansion bus (XB) is a replacement for the HPI, as well as an expansion of the EMIF.

    The expansion bus provides two distinct areas of functionality (host port and I/O port), which can co-exist in a system.

  • 7/29/2019 240662_633888485056270520.ppt

    58/115

    CPU operations

    Fetch instruction from memory (DSP program memory)

    Decode instruction

    Execute instruction, including reading data values

  • 7/29/2019 240662_633888485056270520.ppt

    59/115

    Program Fetch (F)

    Program fetching consists of 4 phases:

    generate fetch address (PG)

    send address to memory (PS)

    wait for data ready (PW)

    read opcode (PR)

    [Diagram: C6x and memory, phases PG, PS, PW, PR]

  • 7/29/2019 240662_633888485056270520.ppt

    60/115

    Decode Stage (D)

    Decode stage consists of two phases:

    dispatch instruction to functional unit (DP)

    instruction decoded at functional unit (DC)

    [Diagram: C6x and memory, phases PG, PS, PW, PR, DP, DC]

  • 7/29/2019 240662_633888485056270520.ppt

    61/115

    Execute Stage (E)

    An execute packet (EP) consists of a group of instructions that can be executed in parallel within the same cycle.

    The number of EPs within a fetch packet can vary from one (with 8 parallel instructions) to 8 (with no parallel instructions).

    Bit 0 (LSB) of every 32-bit instruction determines whether the next instruction belongs to the same EP or not:

    if 1: the next instruction is in the same EP

    if 0: the next instruction is part of the next EP

  • 7/29/2019 240662_633888485056270520.ppt

    62/115

    FETCH and EXECUTE PACKETS

    (A fetch packet consists of eight 32-bit instructions.)

    Consider an FP with three EPs:

       Instruction A

    || Instruction B

       Instruction C

    || Instruction D

    || Instruction E

       Instruction F

    || Instruction G

    || Instruction H

    [Diagram: the eight 32-bit instructions A to H of the fetch packet, each shown with bits 31..0]

    In the fetch packet, EP1 contains 2 parallel instructions, EP2 contains 3 and EP3 contains 3 parallel instructions.
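The grouping rule for the fetch packet above can be sketched in a few lines of Python. The data layout (a list of name and p-bit pairs) and the function name are assumptions for illustration. The p-bit values are the ones implied by the A to H example: a 1 chains the next instruction into the same execute packet, a 0 ends the packet.

```python
def split_execute_packets(fetch_packet):
    """Group (name, p_bit) pairs into execute packets using the p-bit rule."""
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:        # next instruction starts a new execute packet
            packets.append(current)
            current = []
    if current:               # defensive: close a packet left open at the end
        packets.append(current)
    return packets


# p-bits implied by the example: A || B, C || D || E, F || G || H
fp = [("A", 1), ("B", 0), ("C", 1), ("D", 1), ("E", 0),
      ("F", 1), ("G", 1), ("H", 0)]
eps = split_execute_packets(fp)
```

With all p-bits set to 0 the same function yields eight single-instruction packets, and with the first seven set to 1 it yields one packet of eight, matching the two extremes mentioned in the slides.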

  • 7/29/2019 240662_633888485056270520.ppt

    63/115

    Pipelining

    Overlap operations to increase performance

    Pipeline CPU operations to increase clock speed over

    a sequential implementation

    Separate parallel functional units

    Peripheral interfaces for I/O do not burden CPU

    It is a key feature in DSP to get parallel instructions working properly

    Requires careful timing

  • 7/29/2019 240662_633888485056270520.ppt

    64/115

    non-pipelined scalar architecture

    - A processor that executes every instruction one after the other

    - may use processor resources inefficiently, potentially leading to poor performance.

    pipelining

    - executing different sub-steps of sequential instructions simultaneously

    superscalar architectures

    - executing multiple instructions entirely simultaneously

  • 7/29/2019 240662_633888485056270520.ppt

    67/115

    There are 3 stages of pipelining:

  • 7/29/2019 240662_633888485056270520.ppt

    68/115

    Pipeline phases

    Program fetch: composed of 4 phases

    PG: program address generate (to generate the fetch address)

    PS: program address send (to send the address)

    PW: program address ready wait (to wait for data)

    PR: program fetch packet receive (to read the opcode from memory)

    Decode stage: composed of 2 phases

    DP: dispatch all the instructions within an FP to the appropriate functional units

    DC: instruction decode

    Execute stage: composed of 6 (fixed point) to 10 (floating point) phases

    a) multiplication instruction consists of 2 phases due to 1 delay

    b) load instruction consists of 5 phases due to 4 delays

    c) branch instruction consists of 6 phases due to 5 delays

  • 7/29/2019 240662_633888485056270520.ppt

    69/115

    Program fetch: PG PS PW PR; decode: DP DC; execute: E1-E6 (E1-E10 for double precision)

    Pipelining effects (each row represents an FP)

    Clock cycle   1   2   3   4   5   6   7   8   9   10
    FP 1          PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
    FP 2              PG  PS  PW  PR  DP  DC  E1  E2  E3
    FP 3                  PG  PS  PW  PR  DP  DC  E1  E2
    FP 4                      PG  PS  PW  PR  DP  DC  E1
    FP 5                          PG  PS  PW  PR  DP  DC
    FP 6                              PG  PS  PW  PR  DP
    FP 7                                  PG  PS  PW  PR

  • 7/29/2019 240662_633888485056270520.ppt

    70/115

    PG of the first FP starts in cycle 1, PG of the second FP starts in cycle 2, and so on.

    Each FP has 4 phases for fetch and 2 phases for decode; execution can take from 1 to 10 phases.

    At cycle 7,

    instructions in the first FP are in the first execution phase E1,

    instructions in the second FP are in the decoding phase,

    instructions in the third FP are in the dispatching phase,

    and so on.

    All the instructions are proceeding through various phases.

    Therefore the pipeline is FULL.
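The pipeline occupancy described above can be reproduced with a small Python sketch. It assumes, as in the table, that fetch packet k enters PG in cycle k, that no stalls occur, and that only phases up to E4 are shown. The function names are illustrative.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1", "E2", "E3", "E4"]


def phase_at(fp_index, cycle):
    """Phase of fetch packet fp_index (1-based) in the given cycle (1-based).

    Returns None if the packet has not entered, or has left, the modelled phases.
    """
    step = cycle - fp_index
    if 0 <= step < len(PHASES):
        return PHASES[step]
    return None


def pipeline_row(fp_index, cycles=10):
    """One row of the pipeline table, with '--' where the packet is absent."""
    return [phase_at(fp_index, c) or "--" for c in range(1, cycles + 1)]


cycle7 = [phase_at(fp, 7) for fp in range(1, 8)]
```

At cycle 7 every one of the seven phases from PG to E1 is occupied by a different fetch packet, which is what "the pipeline is full" means.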

  • 7/29/2019 240662_633888485056270520.ppt

    71/115

    Most instructions have 1 execute phase

    Multiply (MPY) has 2

    Load (LDH/LDW) has 5

    Branch (B) has 6 phases

    Additional execute phases are associated with floating-point and double-precision instructions (up to 10 phases)

    e.g. MPYDP has 9 delay slots and a total of 10 phases

    Functional unit latency:

    The number of cycles that an instruction ties up a functional unit; no other instruction can use the functional unit during that time.

    It is 1 for all instructions except double-precision instructions.

    It is different from the number of delay slots.

    e.g. MPYDP has a functional unit latency of 4 but 9 delay slots

    Delay slot: some instructions that are physically after the instruction are executed as if they were located before it.

    Classic examples are branch and call instructions, which often execute the following instruction before the branch or call is performed.
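The relation between delay slots and execute phases quoted above (execute phases = delay slots + 1) can be written out as a short Python sketch. The table holds the values given in the slides; the ADD entry with zero delay slots stands in for "most instructions" and, like the function names, is an illustrative assumption.

```python
# Delay slots per instruction, as quoted in the slides (ADD = typical single-cycle case).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5, "MPYDP": 9}


def execute_phases(mnemonic):
    """Number of execute phases = delay slots + 1."""
    return DELAY_SLOTS[mnemonic] + 1


def result_ready_cycle(mnemonic, e1_cycle):
    """First cycle in which a later instruction can use the result,

    assuming the instruction is in phase E1 during e1_cycle and nothing stalls.
    """
    return e1_cycle + DELAY_SLOTS[mnemonic] + 1
```

For example, a load that is in E1 during cycle 1 has four delay slots, so the loaded value can first be used by an instruction executing in cycle 6. In hand-written assembly those slots are filled with independent instructions or NOPs.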

  • 7/29/2019 240662_633888485056270520.ppt

    72/115

    Instruction Set

    Assembly code format:

    Label || [ ] Instruction Unit Operands ; comments

    A label represents a specific address/memory location that contains an instruction or data (a label must start in the first column).

    Parallel bars (||) are used if the instruction is executed in parallel with the previous instruction.

    The [ ] field is optional and makes the associated instruction conditional:

    - 5 registers are used as conditional registers

    - [A2] specifies that the associated instruction executes if A2 is not zero

    - [!A2] specifies that the associated instruction executes if A2 is zero

    The instruction field can be an assembler directive or a mnemonic.

  • 7/29/2019 240662_633888485056270520.ppt

    73/115

    - An assembler directive is a command for the assembler

    .short: initialize a 16-bit integer

    .int: initialize a 32-bit integer

    .float: initialize a 32-bit IEEE single-precision constant

    - A mnemonic is an actual instruction that executes at run time

    The unit field can be any one of the 8 functional units (optional)

    Comments starting in column 1 begin with an asterisk or a semicolon, whereas comments starting in any other column must begin with a semicolon

    E.g.:

       ADD  .L1 A3,A7,A7 ; add A3 + A7 -> A7

       MPY  .M2 A7,B7,B6 ; multiply 16 LSBs of A7, B7 -> B6

    || MPYH .M1 A7,B7,A6 ; multiply 16 MSBs of A7, B7 -> A6
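To show what the MPY and MPYH lines in the example compute, here is a small Python emulation. It is a sketch under the assumption that both instructions treat their 16-bit halves as signed values; the register contents are made up for illustration.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mpy(src1, src2):
    """Emulates MPY: product of the signed 16 LSBs of both operands."""
    return to_signed16(src1) * to_signed16(src2)


def mpyh(src1, src2):
    """Emulates MPYH: product of the signed 16 MSBs of both operands."""
    return to_signed16(src1 >> 16) * to_signed16(src2 >> 16)


a7 = 0x00030002    # hypothetical contents: high half 3, low half 2
b7 = 0x0005FFFF    # hypothetical contents: high half 5, low half -1
b6 = mpy(a7, b7)   # 2 * (-1)
a6 = mpyh(a7, b7)  # 3 * 5
```

Because the two multiplies run on different units (.M1 and .M2), the parallel bars let both 16 x 16 products be formed in the same cycle.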

  • 7/29/2019 240662_633888485056270520.ppt

    74/115

    Instruction set

    Instructions are designed to make maximum use of the processor's resources and at the same time minimize the memory space required to store the instructions.

    Minimizing the storage space ensures the cost effectiveness of the overall system.

    To ensure the maximum use of the DSP hardware, the instructions are designed to perform several parallel operations in a single instruction, typically including fetching of data in parallel with the main arithmetic operation.

  • 7/29/2019 240662_633888485056270520.ppt

    75/115

    Instructions are kept short by restricting which registers can be used with which operations and which operations can be combined in an instruction.

    Some of the latest processors use VLIW architectures, wherein multiple instructions are issued and executed per cycle.

    In such architectures the instructions are short and designed to perform much less work each, thus requiring less memory; speed increases because of the VLIW architecture.

  • 7/29/2019 240662_633888485056270520.ppt

    77/115

    C67x Addl Instructions (by unit)

  • 7/29/2019 240662_633888485056270520.ppt

    78/115

    .S Unit: ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP

    .M Unit: MPYI, MPYID, MPYSP, MPYDP

    .L Unit: ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTSPU, INTDP, INTDPU, SPINT, SPTRUNC, DPINT, DPSP

    .D Unit: ADDAD, LDDW

  • 7/29/2019 240662_633888485056270520.ppt

    79/115

    Control Register File

  • 7/29/2019 240662_633888485056270520.ppt

    81/115

    The interrupt flag register (IFR)

    - contains the status of INT4-INT15 and the NMI interrupt.

    - Each corresponding bit in the IFR is set to 1 when that interrupt occurs; otherwise, the bits are cleared to 0.

    - If you want to check the status of interrupts, use the MVC instruction to read the IFR.

    The interrupt return pointer register (IRP)

    - contains the return pointer that directs the CPU to the proper location to continue program execution after processing a maskable interrupt.

    - A branch using the address in IRP (B IRP) in your interrupt service routine returns to the program flow when interrupt servicing is complete.

  • 7/29/2019 240662_633888485056270520.ppt

    82/115

    Addressing modes

  • 7/29/2019 240662_633888485056270520.ppt

    83/115

    Determines how one accesses memory

    Addressing refers to the means of specifying the location of operands for instructions

    - types of addressing are called addressing modes

    - operands may be input operands for the operation as well as results of the operation

    Addressing modes supported by the TMS320C67x include

    register-indirect,

    indexed register-indirect,

    and modulo addressing (circular addressing).

    Immediate data is also supported.

    The TMS320C67x does not support modulo addressing for 64-bit data.

  • 7/29/2019 240662_633888485056270520.ppt

    87/115

    Circular Buffer

  • 7/29/2019 240662_633888485056270520.ppt

    88/115

    At the beginning of each sample period, a new sample will be read into the circular buffer, overwriting the oldest sample.

    The newest sample x(n) will be stored at the memory location pointed at by auxiliary register AR(i).

  • 7/29/2019 240662_633888485056270520.ppt

    89/115

    The need to process digital signals in real time gives rise to the concept of circular buffering.

    Circular buffers are used to store the most recent values of a continually updated signal.

    Circular buffering allows processors to access a block of data sequentially and then automatically wrap around to the beginning address, exactly the pattern used to access coefficients in an FIR filter.

    Circular buffering is also very helpful in implementing first-in, first-out buffers, commonly used for I/O and for FIR delay lines.
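A minimal software sketch of a circular buffer used as an FIR delay line is given below. The class and function names and the tap values are illustrative assumptions. On the C6x the wrap-around is done by the addressing hardware rather than by an explicit modulo operation.

```python
class CircularBuffer:
    """Fixed-size buffer in which the newest sample overwrites the oldest."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.index = 0  # position where the next sample will be written

    def push(self, sample):
        self.data[self.index] = sample
        self.index = (self.index + 1) % len(self.data)  # wrap around

    def newest_first(self):
        """Samples ordered x(n), x(n-1), ..., oldest last."""
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(n)]


def fir_step(buf, taps, sample):
    """Store one new sample and return one FIR output value."""
    buf.push(sample)
    return sum(h * x for h, x in zip(taps, buf.newest_first()))


buf = CircularBuffer(3)
outputs = [fir_step(buf, [0.5, 0.25, 0.25], x) for x in [4.0, 8.0, 0.0, 0.0]]
```

No samples are ever shifted in memory: only the write index moves, which is what makes the scheme cheap enough for real-time use.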

  • 7/29/2019 240662_633888485056270520.ppt

    91/115

    AMR mode and description

    Mode  Description

    00    linear addressing

    01    circular addressing using BK0

    10    circular addressing using BK1

    11    reserved

  • 7/29/2019 240662_633888485056270520.ppt

    95/115

    Block size = 2^(N+1) bytes

  • 7/29/2019 240662_633888485056270520.ppt

    96/115

    Eg:

  • 7/29/2019 240662_633888485056270520.ppt

    97/115

    MVK .S2 0x0004,B2 ; lower 16 bits to B2

    MVKLH .S2 0x0005,B2 ; upper 16 bits to B2

    The value 0x0004 = (0100) binary, written into the 16 LSBs of the AMR, sets bit 2 (the third bit) to 1 and all other bits to zero.

    This sets the mode to 01 and selects register A5 as the pointer to the buffer, using BK0.

    The value 0x0005 = (0101) binary, written into the 16 MSBs of the AMR, sets bits 16 and 18 to 1.

    This corresponds to the value N = 5, which selects a buffer size of 2^(N+1) = 64 bytes using BK0.
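The AMR example above can be checked with a short Python decoder. The bit layout used here is an assumption consistent with the example: two mode bits per pointer register in the order A4, A5, A6, A7, B4, B5, B6, B7 in bits 0-15, BK0 in bits 16-20 and BK1 in bits 21-25. Verify it against the TI CPU reference guide before relying on it.

```python
POINTER_REGS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
MODES = {0: "linear", 1: "circular BK0", 2: "circular BK1", 3: "reserved"}


def decode_amr(amr):
    """Return (mode per pointer register, BK0 field, BK1 field) for an AMR value."""
    modes = {reg: MODES[(amr >> (2 * i)) & 0x3]
             for i, reg in enumerate(POINTER_REGS)}
    bk0 = (amr >> 16) & 0x1F  # assumed position of the BK0 block-size field
    bk1 = (amr >> 21) & 0x1F  # assumed position of the BK1 block-size field
    return modes, bk0, bk1


def block_size_bytes(n):
    """Block size = 2^(N+1) bytes."""
    return 2 ** (n + 1)


amr = (0x0005 << 16) | 0x0004  # value built by the MVK / MVKLH pair
modes, bk0, bk1 = decode_amr(amr)
```

Decoding the example value gives circular addressing with BK0 for A5, linear addressing for every other pointer register, and a 64-byte block.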

  • 7/29/2019 240662_633888485056270520.ppt

    99/115

    Reset (RESET)

    Reset is the highest-priority interrupt and is used to halt the CPU and return it to a known state.

    The reset interrupt is unique in a number of ways:

    - RESET is an active-low signal. All other interrupts are active-high signals.

    - RESET must be held low for 10 clock cycles before it goes high again to reinitialize the CPU properly.

    - The instruction execution in progress is aborted and all registers are returned to their default states.

    - RESET is not affected by branches.

    Nonmaskable Interrupt (NMI)

    - NMI is the second-highest-priority interrupt.

    - It is generally used to alert the CPU of a serious hardware problem such as imminent power failure.

    - For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in the interrupt enable register (IER) must be set to 1.
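
    The NMIE condition can be pictured with a small C sketch that treats the interrupt enable register as a plain 32-bit value. The position of NMIE at bit 1 is an assumption taken from the usual C6x register layout, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define IER_NMIE (UINT32_C(1) << 1)  /* assumed bit position of NMIE in IER */

/* True if an NMI would be processed for this IER value. */
bool nmi_enabled(uint32_t ier)
{
    return (ier & IER_NMIE) != 0;
}

/* Return the IER value with NMIE set, leaving the other bits unchanged. */
uint32_t ier_enable_nmi(uint32_t ier)
{
    return ier | IER_NMIE;
}
```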

    Multichannel Buffered Serial Port (McBSP)

    The standard serial port interface provides:

    - Full-duplex communication

    - Double-buffered data registers, which allow a continuous data stream

    - Independent framing and clocking for reception and transmission

    - Direct interface to industry-standard codecs, analog interface chips (AICs), and other serially connected A/D and D/A devices

    - Multichannel transmission and reception of up to 128 channels

    - Element sizes of 8, 12, 16, 20, 24, or 32 bits

    - 8-bit data transfers with LSB or MSB first

    The DMA controller uses the bus request pin to notify the DSP core that it is ready to make a transfer to or from external memory.

    The DSP core completes its current instruction, releases control of external memory, and signals the DMA controller via the bus grant pin that the DMA transfer can proceed.

    The DMA controller then transfers the specified number of data words and optionally signals completion through an interrupt.

    Some processors also have multiple DMA channels that manage DMA transfers in parallel.
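
    The request/grant handshake described above can be sketched as a small software model in C. This is only an illustration of the sequence of steps, not a driver for any real controller; the struct dma_model and both functions are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dma_model {
    bool bus_request;  /* set by the DMA controller when it wants the bus   */
    bool bus_grant;    /* set by the core once it has released the bus      */
    bool irq_on_done;  /* whether completion should raise an interrupt      */
    bool irq_pending;  /* completion interrupt seen by the core             */
};

/* Core side: at an instruction boundary, grant the bus if it was requested. */
void core_instruction_boundary(struct dma_model *m)
{
    m->bus_grant = m->bus_request;
}

/* DMA side: copy nwords words if the bus is granted; returns words copied. */
size_t dma_run(struct dma_model *m, uint32_t *dst, const uint32_t *src,
               size_t nwords)
{
    if (!m->bus_grant)
        return 0;                      /* still waiting for the grant */
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
    m->bus_request = false;            /* hand the bus back to the core */
    m->bus_grant = false;
    if (m->irq_on_done)
        m->irq_pending = true;         /* optional completion interrupt */
    return nwords;
}
```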

    Timer

    The C67x has two 32-bit general-purpose timers that can be used to:

    Time events

    Count events

    Generate pulses

    Interrupt the CPU

    Send synchronization events to the DMA controller

    The timer has two signaling modes and can be clocked by either an internal or an external source.

    The timer has an input pin (TINP) and an output pin (TOUT). The TINP pin can be used as a general-purpose input, and the TOUT pin can be used as a general-purpose output.

    When an internal clock is provided, the timer generates timing sequences to trigger peripheral or external devices such as a DMA controller or an A/D converter, respectively.

    When an external clock is provided, the timer can count external events and interrupt the CPU after a specified number of events.
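
    The event-counting behaviour can be modelled in a few lines of C. The sketch below is a simplified software model, not the actual timer register interface: the struct timer_model and timer_event names are hypothetical, and the counter is assumed to restart from zero each time the programmed number of events is reached.

```c
#include <stdint.h>

struct timer_model {
    uint32_t count;       /* events seen since the last interrupt      */
    uint32_t period;      /* number of events between interrupts       */
    uint32_t interrupts;  /* how many times the CPU was interrupted    */
};

/* Called once per external event (one edge on the clock input). */
void timer_event(struct timer_model *t)
{
    t->count++;
    if (t->count >= t->period) {
        t->count = 0;       /* restart counting */
        t->interrupts++;    /* signal the CPU   */
    }
}
```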

    Load/Store Options

    In the 'C6x, the instruction set supports several types of load/store instructions:

    Four load instructions:

    LDDW   Load 64-bit double word (C67x only)

    LDW    Load 32-bit word

    LDH    Load 16-bit half-word (short)

    LDB    Load 8-bit byte

    Three store instructions:

    STW    Store 32-bit word

    STH    Store 16-bit half-word

    STB    Store 8-bit byte
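
    To make the difference between the load widths concrete, the C sketch below mimics what LDB, LDH and LDW place in a 32-bit register when reading the same memory. It assumes little-endian byte order and that LDB and LDH sign-extend their result; the function names ld_b, ld_h and ld_w are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of LDB: load one byte and sign-extend it to 32 bits. */
int32_t ld_b(const uint8_t *mem, size_t addr)
{
    return (int8_t)mem[addr];
}

/* Model of LDH: load a 16-bit half-word (little-endian) and sign-extend it. */
int32_t ld_h(const uint8_t *mem, size_t addr)
{
    uint16_t v = (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
    return (int16_t)v;
}

/* Model of LDW: load a full 32-bit word (little-endian). */
int32_t ld_w(const uint8_t *mem, size_t addr)
{
    uint32_t v = (uint32_t)mem[addr]
               | ((uint32_t)mem[addr + 1] << 8)
               | ((uint32_t)mem[addr + 2] << 16)
               | ((uint32_t)mem[addr + 3] << 24);
    return (int32_t)v;
}
```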

    Load and Store Paths

    The C67x DSP has two 32-bit paths for loading data from memory to the register file: LD1 for register file A, and LD2 for register file B.

    The C67x DSP also has a second 32-bit load path for both register files A and B. This allows the LDDW instruction to simultaneously load two 32-bit values into register file A and two 32-bit values into register file B.

    For side A, LD1a is the load path for the 32 LSBs and LD1b is the load path for the 32 MSBs.

    For side B, LD2a is the load path for the 32 LSBs and LD2b is the load path for the 32 MSBs.

    There are also two 32-bit paths, ST1 and ST2, for storing register values to memory from each register file.
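
    The split of a 64-bit LDDW result across the two load paths of one side can be pictured as follows. This C sketch only shows the arithmetic of dividing a double word into its 32 LSBs and 32 MSBs; the struct reg_pair and lddw_split names are hypothetical.

```c
#include <stdint.h>

/* The two 32-bit halves delivered by the "a" and "b" load paths of one side. */
struct reg_pair {
    uint32_t lsb;  /* carried on LD1a (or LD2a) */
    uint32_t msb;  /* carried on LD1b (or LD2b) */
};

/* Divide a 64-bit double word into the halves that each path would carry. */
struct reg_pair lddw_split(uint64_t dword)
{
    struct reg_pair p;
    p.lsb = (uint32_t)(dword & UINT64_C(0xFFFFFFFF));
    p.msb = (uint32_t)(dword >> 32);
    return p;
}
```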