ECE 5655/4655 Real-Time DSP 3–1
TMS320C6x Programming

Introduction
In this chapter programming the TMS320C6x in assembly, linear assembly, and C will be introduced. Preference will be given to explaining code development for the DSK memory map. The basis for the material presented in this chapter is the course notes from TI's C6000 4-day design workshop [1].
Programming Alternatives
Language      Tool                  Efficiency*    Effort
C             Compiler optimizer    70 – 80%       Low
Linear ASM    Assembly optimizer    95 – 100%      Medium
ASM           Hand optimize         100%           High

* Typical efficiency versus hand-optimized assembly; see TI benchmarks for more information.
Additional figure label: Intrinsics.

[1] TMS320C6000 DSP Design Workshop, Revision 4.0, June 2000.
Chapter 3 • TMS320C6x Programming
Introduction to Assembly Language Programming
A Dot Product Example
• Recall the C6000 block diagram

[Figure: C6000 block diagram. The CPU contains functional units .D1, .M1, .L1, .S1 with register file A (A0–A15), functional units .D2, .M2, .L2, .S2 with register file B (B0–B15), and the control registers. Internal buses (D (32) and Addr) connect the CPU to program RAM, data RAM, the EMIF (external memory, sync and async), DMA, serial port, host port, boot load, timers, and power down.]

• To motivate this introduction to assembly programming, consider a basic sum of products or dot product example

$$y = \sum_{n=1}^{40} a_n x_n \qquad (3.1)$$

• Assembly instructions will initially be shown only with limited detail
• In a later section the details of putting together an actual assembly file will be given
• The core of this algorithm is multiplication and addition
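As a reference point for the assembly that follows, Eq. (3.1) can be sketched in C. This is only an illustrative sketch, not code from the workshop notes: the function name `dot_product`, the constant `N_TAPS`, and the choice of 16-bit inputs with a 32-bit accumulator are assumptions. Note that C arrays are indexed from 0 to 39 while Eq. (3.1) runs n from 1 to 40.

```c
#include <stdint.h>

#define N_TAPS 40  /* number of multiply-accumulates in Eq. (3.1) */

/* Sum of products y = sum of a[n] * x[n], with 16-bit inputs and a
   32-bit accumulator (a 16 x 16 multiply yields a 32-bit product). */
int32_t dot_product(const int16_t *a, const int16_t *x, int count)
{
    int32_t y = 0;
    for (int n = 0; n < count; n++) {
        y += (int32_t)a[n] * (int32_t)x[n];  /* multiply, then accumulate */
    }
    return y;
}
```

Calling `dot_product(a, x, N_TAPS)` on two 40-element arrays gives the quantity the assembly loop is being built to compute.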
• To multiply we use the .M (multiply) unit
– As shown here, MPY calls a 16-bit multiply which gives a 32-bit result
• To add or accumulate we use the .L (logical) unit

        MPY  .M  a, x, prod
        ADD  .L  Y, prod, Y

• Where are the variables stored?
• Note that we need to store the working variables in a register file; the C6000 has two, but for now we will just use the A side
• We now rewrite the code to include the actual register names

        MPY  .M  A0, A1, A3
        ADD  .L  A4, A3, A4

Register file A (A0–A15, 32 bits each): A0 = a, A1 = x, A3 = prod, A4 = Y

• The original equation (3.1) specifies 40 multiply accumulates
• To create a loop we need:
– A branch instruction and a label
– A loop counter variable
– An instruction to decrement the loop counter
– A properly set branch condition
• The unit responsible for branching is the .S (branch) unit

        MVK  .S  40, A2
loop:   MPY  .M  A0, A1, A3
        ADD  .L  A4, A3, A4
        SUB  .L  A2, 1, A2
 [A2]   B    .S  loop

Register file A (A0–A15, 32 bits each): A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y

– MVK moves a 16-bit constant into the lower 16 bits of register A2
– We decrement the loop counter register by one using SUB, which uses the .L unit
– Conditional instructions execute based on the value held in a condition register, here A2; the general assembly form is [condition] B loop
– The [A2] means execute if A2 ≠ 0
– If we use [!A2] then execute only if A2 = 0
– On the C62x/C67x conditional registers are limited to A1, A2, B0, B1, B2
– Note: On the C64x the conditional registers are A0, A1, A2, B0, B1, B2
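The four loop ingredients listed above map onto a C do/while loop fairly directly. The following is a hedged sketch rather than generated code: the function name `mac_loop` and the `iterations` output are assumptions added for illustration, and `count` is assumed to be at least 1 because the branch test comes after the loop body. At this stage of the notes a and x are fixed register values, so the same product is accumulated on every pass.

```c
#include <stdint.h>

/* C model of the loop skeleton: a counter is initialized, decremented
   once per pass, and the branch back is taken while it is nonzero. */
int32_t mac_loop(int16_t a, int16_t x, int count, int *iterations)
{
    int32_t y = 0;       /* A4: accumulator, assumed cleared       */
    int32_t n = count;   /* A2: loop counter, as in MVK 40, A2     */
    *iterations = 0;
    do {
        int32_t prod = (int32_t)a * x;  /* MPY A0, A1, A3          */
        y += prod;                      /* ADD A4, A3, A4          */
        n -= 1;                         /* SUB A2, 1, A2           */
        (*iterations)++;
    } while (n != 0);                   /* [A2] B loop             */
    return y;
}
```

The `while (n != 0)` test plays the role of the `[A2]` condition; testing `n == 0` instead would correspond to `[!A2]`.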
• The next step is to get variables loaded into the register file
– We assume that the variables are located in memory (internal or external)
– We then create a pointer to the address of the variable and store it in a register
– Finally, we load the variable itself into another register

How do a and x get loaded?
– a[40], x[40], and Y are located in memory
– Create pointers to the values: A5 = &a, A6 = &x, A7 = &Y
– Use the pointers with load/store:

        LD  *A5, A0
        LD  *A6, A1
        ST  A4, *A7

Register file A now holds: A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y, A5 = &a[n], A6 = &x[n], A7 = &Y

• The C notation of &a is used here to obtain the address of a, but there is more to this as we will see shortly
• The C62x has three load instructions and the C67x and C64x add a fourth
– The architecture allows byte-level addressing (8 bits), half-words (16 bits), and words (32 bits)
– Added on the C67x/C64x are double-words (64 bits)
• Load and store option summary:

Load instructions:
LDB     Load 8-bit byte (char)
LDH     Load 16-bit half-word (short)
LDW     Load 32-bit word (int)
LDDW    Load 64-bit double-word (double) (C67x, C64x)

Store instructions:
STB, STH, STW, STDW (C64x)

• To carry out the load and store operations we use the .D (data) unit
• Note that as in C, *A5 takes the value pointed to by A5 and places the value into a register, here it is A0

        MVK  .S  40, A2
loop:   LDH  .D  *A5, A0
        LDH  .D  *A6, A1
        MPY  .M  A0, A1, A3
        ADD  .L  A4, A3, A4
        SUB  .L  A2, 1, A2
 [A2]   B    .S  loop
        STH  .D  A4, *A7

Register file A: A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y, A5 = &a[n], A6 = &x[n], A7 = &Y; the .D unit connects the register file to data memory
• A remaining detail is the actual creation of the pointers, e.g., to x, a, and y
• Earlier we used MVK to move a 16-bit constant into the lower 16 bits of a register
• Now we want to move a 32-bit address corresponding to some label a
– MVKL .S a,A5 ;will move the lower 16 bits with sign extension
– MVKH .S a,A5 ;will move the upper or high 16 bits without altering the lower 16 bits
– Use MVKL and MVKH in ordered combination (MVKL first, then MVKH) to load constants greater than 16 bits, and MVK for constants of 16 bits or less
• What should appear above the code MVK .S 40,A2 is:

        MVKL .S a,A5 ;lower half of the address of a
        MVKH .S a,A5 ;upper half of the address of a
        MVKL .S x,A6 ;lower half of the address of x
        MVKH .S x,A6 ;upper half of the address of x
        MVKL .S y,A7 ;lower half of the address of y
        MVKH .S y,A7 ;upper half of the address of y
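Why the MVKL/MVKH order matters can be seen from a small C model of the two moves as they are described above. This is a sketch under stated assumptions: the helper names `mvkl`, `mvkh`, and `load_address` are hypothetical, registers are modeled as `uint32_t` values, and the address 0x80009ABC used in the usage note is made up for illustration.

```c
#include <stdint.h>

/* MVKL model: take the low 16 bits of the constant and sign-extend
   them to 32 bits; the result replaces the whole register. */
uint32_t mvkl(uint32_t cst)
{
    uint32_t low = cst & 0x0000FFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* MVKH model: replace the high 16 bits of the register with the high
   16 bits of the constant, leaving the low 16 bits untouched. */
uint32_t mvkh(uint32_t cst, uint32_t reg)
{
    return (cst & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* Ordered combination: MVKL first, then MVKH, builds the full value. */
uint32_t load_address(uint32_t addr)
{
    uint32_t reg = mvkl(addr);
    return mvkh(addr, reg);
}
```

For the hypothetical address 0x80009ABC, `mvkl` alone leaves 0xFFFF9ABC in the register because bit 15 is set, and the following `mvkh` repairs the upper half. Issuing the moves in the opposite order would let the sign extension of MVKL overwrite the upper half again.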
• To properly loop over the data, the pointers need to be incremented
• The C notation "++" can be used to pre- or post-increment registers being used as pointers, e.g., *A5++ advances the address held in A5 by one element after it is used
• Pointer incrementing is summarized in the following figure:

[Figure: A5 holds &a and points into a0, a1, a2, ...; A6 holds &x and points into x0, x1, x2, ...]

After the first loop, A4 contains a0 * x0. How do you access a1 and x1 on the second loop?

        LDH  .D  *A5++, A0
        LDH  .D  *A6++, A1

        MVK  .S  40, A2
loop:   LDH  .D  *A5++, A0
        LDH  .D  *A6++, A1
        MPY  .M  A0, A1, A3
        ADD  .L  A4, A3, A4
        SUB  .L  A2, 1, A2
 [A2]   B    .S  loop
        STH  .D  A4, *A7

• Since there is another set of function units we should have specified the side, e.g., .S1 for side A, etc.

[Figure: units .S1, .M1, .L1, .D1 with register file A (A0, A1, A2, A3, A4, ...) and units .S2, .M2, .L2, .D2 with register file B (B0–B15), each register 32 bits, both sides connected to data memory]
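The effect of the post-increment addressing discussed above can be sketched in C, where `*p++` likewise reads through the pointer and then advances it by one element. Both function names below are assumptions made for illustration; the second function deliberately omits the increment to show what the earlier, pre-increment version of the loop computes.

```c
#include <stdint.h>

/* With post-increment: each pass reads the next a[n] and x[n]. */
int32_t dot_product_ptr(const int16_t *pa, const int16_t *px, int count)
{
    int32_t y = 0;                  /* A4, assumed cleared       */
    while (count != 0) {
        int16_t a_n = *pa++;        /* LDH *A5++, A0             */
        int16_t x_n = *px++;        /* LDH *A6++, A1             */
        y += (int32_t)a_n * x_n;    /* MPY then ADD              */
        count--;                    /* SUB A2, 1, A2             */
    }
    return y;
}

/* Without the increment: every pass reloads a[0] and x[0]. */
int32_t dot_product_no_increment(const int16_t *pa, const int16_t *px,
                                 int count)
{
    int32_t y = 0;
    while (count != 0) {
        y += (int32_t)(*pa) * (*px);  /* LDH *A5, A0 / LDH *A6, A1 */
        count--;
    }
    return y;
}
```

For a = {1, 2, 3} and x = {4, 5, 6} the first version accumulates 1·4 + 2·5 + 3·6, while the second simply adds 1·4 three times.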
• The final version of the A-side code is

        MVK  .S1  40, A2       ; A2 = 40, loop count
loop:   LDH  .D1  *A5++, A0    ; A0 = a(n)
        LDH  .D1  *A6++, A1    ; A1 = x(n)
        MPY  .M1  A0, A1, A3   ; A3 = a(n) * x(n)
        ADD  .L1  A3, A4, A4   ; Y = Y + A3
        SUB  .L1  A2, 1, A2    ; decrement loop count
 [A2]   B    .S1  loop         ; if A2 ≠ 0, branch
        STH  .D1  A4, *A7      ; *A7 = Y

– In the above we assume A4 is initially cleared

Instruction Set Summary by Category
Arithmetic:    ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Program Ctrl:  B, IDLE, NOP
Logical:       AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Data Mgmt:     LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Bit Mgmt:      CLR, EXT, LMBD, NORM, SET
C62xx and C67xx Instruction Set Summary by Unit

C62x instructions:
.S Unit:       ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit:       ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit:       MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit:       ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used:  IDLE, NOP

Instructions added on the C67x:
.S Unit:       ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit:       ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit:       MPYSP, MPYDP, MPYI, MPYID
.D Unit:       ADDAD, LDDW

• The C67x adds 31 more instructions
• In total, the processor has only about 48 instructions, and hence is considered to be a RISC device
• Before going any further in assembly programming we need to spend some time studying the pipeline

Introduction to the Pipeline
• DSP microprocessors rely heavily on the performance advantages of pipelining; the C6x is no exception
• It would be nice to never have to worry about pipeline issues, but some exposure will be helpful in future programming
• Getting code to work only requires a few basic guidelines, while full optimization of the eight function units is beyond the scope of this section of the notes
• The basic operations of the CPU are:
– (F) Fetch or Program Fetch (PF): get an instruction from memory
– (D) Decode: figure out what type of instruction it is (ADD, MPY)
– (E) Execute: actually perform the operation
Pipelined and Non-Pipelined
• Once the pipeline is full the multiple buses of the C6x cancarry out the F, D, and E operations in parallel, all within thesame clock cycle
• On the downside, when discontinuities such as programbranching occur, the pipeline must be flushed which results inadded processor overhead
Program Fetch Stage
• The program fetch stage is actually broken into four phases
– PG: Generate fetch address
– PS: Send address to memory
– PW: Wait for data ready
– PR: Read opcode
Pipelined versus non-pipelined execution of three instructions:

CPU Type         Clock Cycles
                 1    2    3    4    5    6    7    8    9
Non-Pipelined    F1   D1   E1   F2   D2   E2   F3   D3   E3
Pipelined        F1   D1   E1
                      F2   D2   E2
                           F3   D3   E3

The pipeline is full at cycle 3, when F3, D2, and E1 occur together.
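The cycle counts in the pipelined versus non-pipelined comparison follow a simple pattern, sketched here in C. This is an idealized model with assumed function names: every stage takes one cycle, and there are no stalls, branches, or memory wait states.

```c
/* Non-pipelined: each instruction occupies all stages before the
   next one starts, so n instructions need n * stages cycles. */
unsigned cycles_non_pipelined(unsigned n, unsigned stages)
{
    return n * stages;
}

/* Pipelined: the first instruction needs `stages` cycles to fill the
   pipeline; each later instruction completes one cycle after the
   previous one. */
unsigned cycles_pipelined(unsigned n, unsigned stages)
{
    return (n == 0) ? 0 : stages + (n - 1);
}
```

With three instructions and the three F, D, E stages, the model gives 9 cycles without pipelining and 5 with it, matching the table; for long instruction streams the pipelined count approaches one instruction per cycle.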
Decode Stage
• The decode stage consists of two phases
– DP: Route the instruction to a functional unit (dispatch)
– DC: Actually decode the instruction at the functional unit (decode)

Execute Stage
• For code writing purposes the execute stage is the most interesting
• On the C62x all instructions execute in a single cycle, but results are delayed by varying amounts
• The pipeline latency is the delay plus one additional cycle, i.e., the number of cycles before the result is available
• Common examples of delay and latency

Description     Instructions       Delay    Latency
Single Cycle    All, except ...    0        0 + 1 = 1
Multiply        MPY / SMPY         1        2
Load            LDB/H/W            4        5
Branch          B                  5        6

• As a result of the maximum delay of 5 cycles, there are six execute phases E1–E6
Summary of Pipeline Phases

Stage            Phases               Cycle
Program Fetch    PG PS PW PR          (1) (2) (3) (4)
Decode           DP DC                (5) (6)
Execute          E1 E2 E3 E4 E5 E6    (7) (8) (9) (10) (11) (12)

E2–E6 are place holders for delayed results.

[Figure: successive instructions, each offset by one cycle, passing through PG PS PW PR DP DC E1; the pipeline is full once every phase is occupied.]
Sending Code Through the Pipeline
• Since there are eight function units, eight 32-bit instructions are fetched every clock cycle
• The 256-bit total is called a fetch packet
• Recall that there is a 256-bit wide program data bus for this purpose

[Figure: fetch packet (8 x 32-bit), 256 bits total, holding instructions I1 through I8; the 'C6x fetches eight 32-bit instructions every cycle. In mycode.asm each of I1 through I8 is written with its .unit.]

Pipeline Code Example
• Consider the sum of products example used earlier

        MVK  .S1  40, A2
loop:   LDH  .D1  *A5++, A0
        LDH  .D1  *A6++, A1
        MPY  .M1  A0, A1, A3
        ADD  .L1  A3, A4, A4
        SUB  .L1  A2, 1, A2
 [A2]   B    .S1  loop
        STH  .D1  A4, *A7

– We assume A4 is already cleared
• We have eight instructions, so on the first cycle they are in the PG phase of program fetch
• On the fifth cycle, assuming zero wait state memory, the eight instructions are now at the DP phase

[Figure: pipeline diagram with program fetch (PG PS PW PR), decode (DP DC), and execute (E1–E6) columns; the eight instructions MVK, LDH, LDH, MPY, ADD, SUB, B, STH are at DP.]

• On the next cycle the first instruction moves to the DC (decode) phase, and the other seven wait in line
• On cycle eight MVK has completed execution and the first LDH begins execution, but requires five total cycles (+ signs)

[Figure: pipeline diagram with MVK done and the LDH instructions entering the execute phases, each followed by + marks for its remaining cycles; MPY, ADD, SUB, B, STH are still in decode.]
• On the 10th cycle the second LDH enters E2 and the first LDH is moved over to E3, with MPY at E1
– Note that the MPY requires only one delay, but needs values from memory that the LDHs bring in
– The LDHs have not finished yet! What to do?
• A similar problem exists when the ADD instruction reaches E1
– The one cycle delay of MPY means that the addition has started too early as well

[Figure: pipeline diagram with MVK done, both LDH instructions and MPY in the execute phases with + marks for their remaining cycles, ADD arriving at E1, and SUB, B, STH still in decode.]
• For the existing code, we see that at 12 cycles MPY and ADD have both finished, but both LDHs still have not completed

[Figure: pipeline diagram with MVK, MPY, ADD, SUB, and B past E1 while the LDH instructions still show + marks for remaining cycles; STH is in decode.]

• To fix the code we need to add instruction delays or NOPs
– To start with we need to add one NOP between MPY and ADD
– We need to add four NOPs between the second LDH and MPY
• Simple NOP insertion rules:

Description     Delay Slots    # of NOPs
Single Cycle    0              0
Multiply        1              1
Load            4              4
Branch          5              5
• Rather than typing four lines of NOP, we can type a single line

        NOP
        NOP
        NOP
        NOP

becomes

        NOP  4

• The final NOP "fixed code", including benchmark information, is the following:

        MVK  .S1  40,A2        ; (1)
loop:   LDH  .D1  *A5++, A0    ; (1)
        LDH  .D1  *A6++, A1    ; (1)
        NOP  4                 ; (4)
        MPY  .M1  A0,A1,A3     ; (1)
        NOP                    ; (1)
        ADD  .L1  A3,A4,A4     ; (1)
        SUB  .L1  A2,1,A2      ; (1)
 [A2]   B    .S1  loop         ; (1)
        NOP  5                 ; (5)
        STH  .D1  A4,*A7       ; (1)

Loop = 16 x 40 = 640, + 2 = 642 cycles
Benchmark = 642 cycles
Best case = 28 cycles

– The NOPs greatly increase the cycle count, but we have not tried any optimization yet
– With full optimization just 28 cycles can be achieved, less than the loop count!
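The 642-cycle benchmark is just the per-instruction cycle annotations added up, which the following C sketch reproduces. The table `loop_body` and both function names are assumptions introduced for illustration; the cycle values are the parenthesized counts from the listing, and the overhead of 2 covers the MVK before the loop and the STH after it.

```c
#include <stddef.h>

typedef struct {
    const char *mnemonic;  /* instruction as written in the listing */
    unsigned cycles;       /* cycles it occupies, NOP n counts as n */
} Slot;

/* One pass through the loop body of the NOP-fixed code. */
static const Slot loop_body[] = {
    {"LDH", 1}, {"LDH", 1}, {"NOP 4", 4}, {"MPY", 1}, {"NOP", 1},
    {"ADD", 1}, {"SUB", 1}, {"B", 1},     {"NOP 5", 5}
};

unsigned loop_body_cycles(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < sizeof loop_body / sizeof loop_body[0]; i++) {
        total += loop_body[i].cycles;
    }
    return total;
}

/* Total = cycles per pass * number of passes + cycles outside the loop. */
unsigned benchmark_cycles(unsigned iterations, unsigned overhead)
{
    return loop_body_cycles() * iterations + overhead;
}
```

Here `loop_body_cycles()` gives 16 and `benchmark_cycles(40, 2)` gives 642; eleven of the sixteen cycles per pass are spent in NOPs or the branch, which is the slack that optimization aims to fill.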
Use of Parallel Instructions
• In the pipeline example above all of the instructions flowed serially
• Parallel instructions are given with the double pipe symbol ||
• Up to eight instructions can be put in parallel since there are eight functional units
• A partially parallel solution is given below:

Serial:
        B    .S1
        MVK  .S1
        ADD  .L1
        ADD  .L1
        MPY  .M1
        MPY  .M1
        LDW  .D1
        LDB  .D1

Partially parallel:
        B    .S1
   ||   MVK  .S2
        ADD  .L1
   ||   ADD  .L2
   ||   MPY  .M1
        MPY  .M1
   ||   LDW  .D1
   ||   LDB  .D2

• When instructions process in parallel they are called execute packets, and are so denoted in the pipeline diagrams
• Each fetch packet can contain multiple execute packets
• At the beginning of the decode phase (dispatch), the above example code has three execute packets entering DC

[Figure: pipeline diagram with three execute packets in decode: (B, MVK), (ADD, ADD, MPY), and (MPY, LDW, LDB).]

• Each execute packet enters E1 and the individual instructions execute simultaneously until completed, with their respective delays
• At cycle eight we have packet two at E1 and part of packet one is complete

[Figure: pipeline diagram with packet one (B, MVK) in the execute phases, MVK done and B showing + marks for its remaining cycles; packet two (ADD, ADD, MPY) at E1; packet three (MPY, LDW, LDB) in decode.]

• Parallel instructions give a great performance increase
• For the code example we have been considering it is possible to go fully parallel since there are only eight instructions
• To do so will require full utilization of both sides of the CPU
• The fully parallel code

Fully parallel:
        B    .S1
   ||   MVK  .S2
   ||   ADD  .L1
   ||   ADD  .L2
   ||   MPY  .M1
   ||   MPY  .M2
   ||   LDW  .D1
   ||   LDB  .D2

• At the start of execution (seventh cycle) we have

[Figure: pipeline diagram with the single execute packet (B, MVK, ADD, ADD, MPY, MPY, LDW, LDB) entering E1, each instruction followed by + marks for its delay slots.]
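How the || symbol groups instructions into execute packets can be sketched in C. This is a simplified model, not the assembler's actual packing logic: the `Instr` type, the `parallel` flag (1 when the source line begins with ||), and the function name are assumptions, and constraints such as two instructions needing the same functional unit are not checked. The instruction strings repeat the unit assignments from the listings, with operands omitted as in the notes.

```c
#include <stddef.h>

typedef struct {
    const char *text;  /* mnemonic and unit, operands omitted          */
    int parallel;      /* 1 if the line starts with ||, 0 otherwise    */
} Instr;

static const Instr serial_code[] = {
    {"B .S1", 0},   {"MVK .S1", 0}, {"ADD .L1", 0}, {"ADD .L1", 0},
    {"MPY .M1", 0}, {"MPY .M1", 0}, {"LDW .D1", 0}, {"LDB .D1", 0}
};

static const Instr partially_parallel[] = {
    {"B .S1", 0},   {"MVK .S2", 1},
    {"ADD .L1", 0}, {"ADD .L2", 1}, {"MPY .M1", 1},
    {"MPY .M1", 0}, {"LDW .D1", 1}, {"LDB .D2", 1}
};

static const Instr fully_parallel[] = {
    {"B .S1", 0},   {"MVK .S2", 1}, {"ADD .L1", 1}, {"ADD .L2", 1},
    {"MPY .M1", 1}, {"MPY .M2", 1}, {"LDW .D1", 1}, {"LDB .D2", 1}
};

/* A new execute packet starts at every instruction that is not
   marked as parallel with the one before it. */
size_t count_execute_packets(const Instr *code, size_t n)
{
    size_t packets = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !code[i].parallel) {
            packets++;
        }
    }
    return packets;
}
```

Applied to the three listings, the count is 8 execute packets for the serial code, 3 for the partially parallel code, and 1 for the fully parallel code, all within a single eight-instruction fetch packet.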
Chapter 3 • TMS320C6x Programming
3–26 ECE 5655/4655 Real-Time DSP
• This sort of efficiency requires smart coding
• Two not so obvious requirements are:
– Properly filling delay slots
– Proper use of parallel instructions
• The assembly optimizer (part of linear assembly) and theoptimizing C compiler significantly simplify this process
C67x Exceptions
• With the floating point capability comes additional delay slotrequirements and latency
• There is also functional unit latency beyond one cycle, whichoccurs in some double precision (DP) instructions
    .S Unit   ABSSP (1.1)     ABSDP (1.2)     CMPEQSP (1.1)   CMPGTSP (1.1)
              CMPLTSP (1.2)   CMPEQDP (1.3)   CMPGTDP (1.3)   CMPLTDP (1.2)
              RCPSP (1.1)     RCPDP (1.2)     RSQRSP (1.1)    RSQRDP (1.2)
              SPDP (1.2)
    .M Unit   MPYSP (1.4)     MPYDP (4.10)    MPYI (4.9)      MPYID (4.10)
    .L Unit   ADDSP (1.3)     ADDDP (2.7)     DPINT (1.4)     DPSP (1.4)
              INTDP (1.5)     INTDPU (1.5)    INTSP (1.4)     INTSPU (1.4)
              SPINT (1.4)     SPTRUNC (1.4)   SUBSP (1.4)     SUBDP (2.7)
    .D Unit   ADDAD (1.1)     LDDW (1.5)
C67x Latencies: (unit.instruction)
e.g., MPYSP (1.4) means a single precision float multiply requires a single functional unit latency and three delay slots.
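As a rough illustration of how to read the latency pairs, the following C sketch treats each entry as (functional unit latency . instruction latency) and takes the delay slots to be the instruction latency minus one, which agrees with the MPYSP example above. The struct and function names are hypothetical and exist only for this illustration.

```c
/* One entry of the latency table, written (unit.instruction) in the notes.
 * The type and field names are assumptions made for this sketch. */
typedef struct {
    const char *mnemonic;
    int unit_latency;   /* cycles before the functional unit can start another instruction */
    int instr_latency;  /* cycles until the result can be used */
} c67x_latency;

/* Delay slots are the cycles beyond the first one: instruction latency - 1. */
static int delay_slots(c67x_latency e)
{
    return e.instr_latency - 1;
}

/* Earliest cycle at which an instruction that needs the result can issue. */
static int result_ready_cycle(c67x_latency e, int issue_cycle)
{
    return issue_cycle + e.instr_latency;
}

/* Earliest cycle at which the same functional unit can start another instruction. */
static int unit_free_cycle(c67x_latency e, int issue_cycle)
{
    return issue_cycle + e.unit_latency;
}
```

With the table values, MPYSP (1.4) gives three delay slots, while MPYDP (4.10) gives nine delay slots and keeps the .M unit occupied for four cycles.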
C Programming
This section focuses on some of the uses of the C6x development tools and some of the compiler, assembler, and linker settings.
• As stated at the beginning of this chapter, C code can typically achieve roughly 70–80% of the efficiency of hand-optimized assembly
– Further optimization, which is discussed in this section, will likely be required, but it is safe to say that C code is a good starting point for algorithm development
• Recall the basic code building tool layout:
[Figure: code generation flow. Editor → .c/.cpp → Compiler → .asm → Assembler → .obj → Linker (with Link.cmd) → .out; linear assembly (.sa) files pass through the Assembly Optimizer to produce .asm]
• When the compiler tools are coupled with Code Composer Studio (CCS) we have a complete development environment:
[Figure: CCS overview. Edit → Compile / Asm Opto → Asm → Link → Debug, targeting the simulator (SIM), DSK, EVM, third-party DSP boards, or an XDS emulator. Debug features: Probe In, Probe Out, graphs, profiling. Plug-ins: C++, VB, Java. Studio includes: code generation tools; BIOS (real-time kernel, real-time analysis (RTA), BIOS library); simulator; plug-ins; RTDX]
• The output code can be controlled with a very large number of options that span the compiler, assembler, and linker
[Figure: build options dialog (old CCS interface shown) for the file.c → Compile → Asm → Link → file.out flow. The options indicate how the output file should be constructed: which optimizations, where to find files/libs, 'C62x or 'C67x, how to link files, etc.]
Debug Options
    Options     Description                                     CC Tab
    -mv6700     Generate 'C6700 code ('C6200 is default)        Compiler
    -fr <dir>   Directory containing source files               Compiler
    -g          Enables src-level symbolic debugging            Comp/Asm
    -s          Interlist C statements into assembly listing    Compiler
    -k          Keep assembly file                              Compiler
    -mg         Enables minimum debug to allow profiling        Compiler
    -mt         No aliasing used                                Compiler
    -o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm         Combine all C source files before compile       Compiler
    Debug setting: -gs
• In all, there are about five pages of options in the compiler user manual
Optimize Options
• When first debugging code we typically use -gs (above); later, optimization can be turned on, e.g., -o3
    Options     Description                                     CC Tab
    -mv6700     Generate 'C6700 code ('C6200 is default)        Compiler
    -fr <dir>   Directory containing source files               Compiler
    -g          Enables src-level symbolic debugging            Comp/Asm
    -s          Interlist C statements into assembly listing    Compiler
    -k          Keep assembly file                              Compiler
    -mg         Enables minimum debug to allow profiling        Compiler
    -mt         No aliasing used                                Compiler
    -o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm         Combine all C source files before compile       Compiler
    -ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
    -oi0        Disables automatic function inlining            Compiler
    Speed optimization setting: -k -mgt -o3 -pm
Code Size
    Options     Description                                     CC Tab
    -mv6700     Generate 'C6700 code ('C6200 is default)        Compiler
    -fr <dir>   Directory containing source files               Compiler
    -g          Enables src-level symbolic debugging            Comp/Asm
    -s          Interlist C statements into assembly listing    Compiler
    -k          Keep assembly file                              Compiler
    -mg         Enables minimum debug to allow profiling        Compiler
    -mt         No aliasing used                                Compiler
    -o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm         Combine all C source files before compile       Compiler
    -ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
    -oi0        Disables automatic function inlining            Compiler
    Size optimization setting: -k -mgt -ms0 -o3 -oi0 -pm
Assembler Options
    Options     Description                                     CC Tab
    -g          Enables src-level symbolic debugging            Comp/Asm
    -l          Create assembler listing file (small -L)        Assembler
    -s          Retain asm symbols for debugging                Assembler
    Assembler setting: -gls
Linker Options
    Options     Description                                     CC Tab
    -o <file>   Output file name                                Linker
    -m <file>   Map file name                                   Linker
    -c          Auto-initialize global/static C variables       Linker
Summary of Popular Options
    Options     Description                                     Options Tab
    -mv6700     Generate 'C6700 code ('C6200 is default)        Compiler
    -fr <dir>   Directory containing source files               Compiler
    -g          Enables src-level symbolic debugging            Comp/Asm
    -s          Interlist C statements into assembly listing    Compiler
    -k          Keep assembly file                              Compiler
    -mg         Enables minimum debug to allow profiling        Compiler
    -mt         No aliasing used                                Compiler
    -o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm         Combine all C source files before compile       Compiler
    -ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
    -oi0        Disables automatic function inlining            Compiler
    -l          Create assembler listing file (small -L)        Assembler
    -s          Retain asm symbols for debugging                Assembler
    -o <file>   Output file name                                Linker
    -m <file>   Map file name                                   Linker
    -c          Auto-init C variables (-cr turns off autoinit)  Linker
    Typical settings: debug -gs; speed opto -k -mgt -o3 -pm; size opto -k -mgt -ms0 -o3 -oi0 -pm
• A block diagram depicting what happens when a project build takes place is shown below:
[Figure: project build flow. file.c → Compiler (-o invokes the C Optimizer) → file.asm (-s) → Assembler (-al produces file.lst) → file.obj → Linker (-z), which also takes the run-time library (boot.c) and produces file.map (-m) and file.out (-o)]
Embedded Systems with C
Consider software systems development in terms of the C6x.
• An embedded system, for the purposes of C6x development, consists of:
– Program (algorithm and data structures)
– Initialization
– Memory management
• The program part seems pretty clear
• The initialization and memory management parts are beyond what you find in a typical host programming environment, such as Visual C++ on a PC
• From a C programming perspective on a host, once the system resets and initializes, we only deal with the program
– In the embedded world we also have to deal with initialization
– We have more flexibility this way, and we only need to include the hardware and software really needed to get the job done
– Using only the hardware and software that is needed also provides a cost savings
[Figure: reset pin → reset vector → Initialize System → Program. The example program below is annotated with the basic sections of a C file: Code, Global Variables, Initial Values, Local Variables, and Dynamic Variables]
    short m = 10;
    short b = 2;
    short y = 0;

    main()
    {
        short x = 0;
        scanf(x);
        malloc(y);
        y = m * x;
        y = y + b;
    }
The listing is schematic: scanf(x) and malloc(y) stand in for input and dynamic allocation and are not complete C calls.
• The reset operation
– stops the processor,
– brings some registers back to a preset state,
– sets the program counter (PC) to zero, and
– begins running code (address 0)
Initialization Under C
• The C compiler run-time support library contains the routine boot.c
– Note: global variables are optionally initialized, as selected by a build switch (the -c and -cr linker options)
[Figure: reset pin → reset vector → Initialize System (boot.c) → _main (the example program). boot.c performs three steps:]
    1. Initialize pointers (discussed in mod 11): stack, heap, global/static
    2. Initialize global and static variables
    3. Call _main
• Following the actual hardware reset, the software reset begins in vectors.asm with a branch to c_int00
– Note that c_int00 is defined in the C library
– Note also that when using CCS and debugging the target, e.g., the DSK, some of this functionality is automatically taken care of
[Figure: reset pin → address 0 (vectors.asm) → _c_int00 (boot.c: 1. init stack, heap, and global ptrs; 2. init variables; 3. call _main) → _main. The reset vector in vectors.asm occupies one fetch packet:]
            .global _c_int00
            .sect   "vectors"
            b       _c_int00
            nop     5
            nop
            nop
            nop
            nop
            nop
            nop
• NOPs are added to fill the fetch packet
• Each interrupt vector is aligned on a fetch packet boundary
• Other interrupts, which are typically also part of this file, will be discussed later
Compiler Sections
• The system software is broken into modules of code and data known as sections
• The sections as found in a typical C program are shown below:
[Figure: hardware and software views. Hardware: the 'C6x with memory (ROM, RAM) and peripherals. Software: the program, divided into code and data]
    Code:   C Code (main.c), System Init (boot.c), Vectors (reset)
    Data:   Init Values (global), Variables (global), Stack (local), Heap (dynamic)
• The above names seem reasonable, but the compiler uses names associated with the Common Object File Format (COFF), developed many years ago by AT&T for use with C and Unix
• The real names used by the C6x compiler tools are the following:
    Program part                          Section name
    C Code (main.c), System Init (boot.c) .text
    Vectors (reset)                       your choice
    Init Values (global)                  .cinit
    Variables (global)                    .bss
    Stack (local)                         .stack
    Heap (dynamic)                        .sysmem
– The reset section can be any name, but vectors is reasonable
• The complete list of C compiler sections is:
    Section Name   Description
    .text          Code
    .bss           Global and static variables
    .cinit         Initial values for global/static vars
    .stack         Stack (local variables)
    .sysmem        Memory for malloc fcns (heap)
    .switch        Tables for switch instructions
    .const         Global and static string literals
    .far           Globals and statics declared far
    .cio           Buffers for stdio functions
• A possible section placement solution for the C6201:
[Figure: 'C6201 section placement. The sections .text, .switch, .const, .cinit, .bss, .far, .stack, .sysmem, and .cio are distributed over the memories below. Many other solutions are possible; the C67xx?]
    Memory                          Sections
    Program RAM at 140_0000
    Data RAM at 8000_0000           .bss, .stack
    EPROM (CE0)                     .cinit, .const, .text, .switch
    SDRAM (CE2)                     .sysmem, .far, .cio
• A more generalized way of describing the memory sections is to use the terms initialized and uninitialized as opposed to ROM and RAM, i.e.,
    Section Name   Description                              Memory Type
    .text          Code                                     initialized
    .switch        Tables for switch instructions           initialized
    .const         Global and static string literals        initialized
    .cinit         Initial values for global/static vars    initialized
    .bss           Global and static variables              uninitialized
    .far           Globals and statics declared far         uninitialized
    .stack         Stack (local variables)                  uninitialized
    .sysmem        Memory for malloc fcns (heap)            uninitialized
    .cio           Buffers for stdio functions              uninitialized
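To connect the section table to source code, here is a compilable variant of the schematic example program, with comments indicating where the C6x tools would be expected to place each item according to the table. It is only a sketch: the function name line_eval and the label string are hypothetical additions, and a host compiler uses its own section names.

```c
#include <stdlib.h>

/* Global variables live in .bss; their initial values are recorded in .cinit. */
short m = 10;
short b = 2;
short y = 0;

/* The string literal goes to .const; the pointer itself is a global (.bss/.cinit). */
const char *label = "y = m*x + b";

/* The function body is code, so it goes to .text. */
short line_eval(short x)
{
    /* scratch is a local variable (.stack); the block it points to is
     * allocated by malloc from the heap, which is the .sysmem section. */
    short *scratch = malloc(sizeof *scratch);
    if (scratch == NULL)
        return y;               /* allocation failed: leave y unchanged */

    *scratch = (short)(m * x);
    y = (short)(*scratch + b);
    free(scratch);
    return y;
}
```

Calling line_eval(3) evaluates 10 * 3 + 2 and stores the result in the global y.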
Memory Management
• We control the physical mapping of program and data sections to memory via a linker command file
[Figure: the linker takes the .obj files and the .cmd file and produces the .out file (-o) and the .map file (-m); the .cmd file describes how the sections are placed in the 'C6x memory (ROM, RAM, peripherals)]
• The linker command file .cmd has two parts
    MEMORY
    {
        Memory Description
    }
    SECTIONS
    {
        Binding Code/Data Sections to Memory
    }
• In the memory description portion we create a description of both processor and system resources
• Each line is of the form
    name: origin = address, length = size-in-bytes
– Note that we can shorten origin to simply o or org, and length to simply len or l. For example, consider the memory portion of the C6711 command file we have used thus far
    MEMORY
    {
        vecs:   org = 00000000h , len = 220h
        IRAM:   org = 00000220h , len = 0000fdc0h
        CE0:    org = 80000000h , len = 01000000h
        FLASH:  org = 90000000h , len = 00020000h
    }
– Quantities may be specified in hex or decimal, but hex is preferred, e.g., 100h or 0x100
• Note: The vectors section must come first, so that following reset, initialization can occur
• The vecs space must be at least 200 hex long, since on the C6x there are a total of 16 interrupts, each requiring one fetch packet of eight 32-bit instructions, i.e., 32 bytes ($16 \times 32 = 512 = 200\mathrm{h}$ bytes)
– Here the 220h leaves room for 32 bytes (one fetch packet) more
– There will be more discussion of interrupts later
• To understand the rest of the memory space assignments, recall the C6x11 memory map
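The sizing of the vecs region can be checked with a few lines of C. This is only an arithmetic sketch; the constant and function names are hypothetical and do not come from any TI header.

```c
#include <stdint.h>

/* Figures taken from the text: 16 interrupt vectors, each one fetch packet
 * of eight 32-bit (4-byte) instructions. The names are assumptions. */
enum {
    NUM_VECTORS            = 16,
    INSTR_PER_FETCH_PACKET = 8,
    BYTES_PER_INSTR        = 4
};

/* Size of one fetch packet in bytes (8 * 4 = 32 = 20h). */
static uint32_t fetch_packet_bytes(void)
{
    return INSTR_PER_FETCH_PACKET * BYTES_PER_INSTR;
}

/* Minimum size of the vector table in bytes (16 * 32 = 512 = 200h). */
static uint32_t vector_table_bytes(void)
{
    return NUM_VECTORS * fetch_packet_bytes();
}

/* Byte offset of vector n from the start of the table. */
static uint32_t vector_offset(unsigned n)
{
    return n * fetch_packet_bytes();
}
```

With len = 220h for vecs, the margin beyond the table is 220h - 200h = 20h, i.e., 32 bytes.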
• On the C6x13 DSK we frequently place most of the sections, program and data, in the internal RAM (IRAM)
    SECTIONS
    {
        vectors :> vecs
        .text   :> IRAM
        .bss    :> IRAM
        .cinit  :> IRAM
        .stack  :> IRAM
        .sysmem :> SDRAM
        .const  :> IRAM
        .switch :> IRAM
        .far    :> SDRAM
        .cio    :> SDRAM
    }
• Note that some sections are placed in the SDRAM of CE0
[Figure: C67xx memory map]
    0000_0000   64K x 8 Internal (L2)
    0180_0000   On-chip Peripherals
    8000_0000   256M x 8 External 0
    9000_0000   256M x 8 External 1
    A000_0000   256M x 8 External 2
    B000_0000   256M x 8 External 3
    FFFF_FFFF   End of the address space
The CPU is shown with a 4K program cache and a 4K data cache in front of the 64K unified RAM. The 6713 DSK has 264 kB starting at 0000_0000 and 16M at 8000_0000.
Linker Options
• In the third tab of the project options dialog box, we set linker options
• The -o option specifies the executable file, e.g., norm_sq_c.out
• The -m option creates a map file, which shows in detail how the linker has located everything in memory
• The -c option, run-time autoinitialization, invokes boot.c so that variables are autoinitialized, that is, initial values in the .cinit section are copied into the .bss section
– We can turn off run-time autoinitialization by using -cr, in which case variables are initialized at load time instead
• -stack sets the size of the stack, i.e., the .stack section; the default is 0x400
• -heap sets the size of the heap, which is actually the .sysmem section; the default is also 0x400
• -q suppresses the banner display and -x has the linker exhaustively read all libraries (-w instead warns about output sections not listed in the SECTIONS directive)
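The copy step performed by run-time autoinitialization can be pictured with a small host-side sketch. It assumes a deliberately simplified record layout (destination, source, size) that is NOT the actual TI .cinit format; init_record and autoinit are hypothetical names used only to illustrate the idea of copying initial values into the variables before main runs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified init record. The real .cinit layout differs. */
typedef struct {
    void       *dest;   /* address of the variable (conceptually in .bss) */
    const void *src;    /* its initial value (conceptually in .cinit) */
    size_t      size;   /* number of bytes to copy */
} init_record;

/* Copy every initial value into its variable, as the boot routine
 * conceptually does when linking with -c. */
static void autoinit(const init_record *table, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        memcpy(table[i].dest, table[i].src, table[i].size);
}
```

For the example program, a table with one record each for m, b, and y would leave m = 10, b = 2, and y = 0 in place by the time main is entered.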
Calling Assembly with C
Being able to call assembly routines from C is a powerful capability of the compiler tools. In this section we explore the main points.
• For more detail refer to spru187t or newer, TMS320C6000 Optimizing Compiler v 7.3: User's Guide
– Sections 7.4 & 7.5
• To begin with, all C labels are accessed in the assembly file with a leading underscore (_) character, e.g., sum --> _sum
• To call an assembly routine requires that we follow a few simple rules
• Things we would like to do are:
– Pass arguments in
– Return results
– Access C's global variables in assembly
• More advanced issues, not dealt with here, are use of and access to the stack and optimal access to global variables
[Figure: main( ) { } in C calling the assembly routine _asmFunction:, which returns with a branch (b)]
• To find a function we have a global (inter-file) reference
    Parent.C
        int child(int, int);
        int x = 7, y, w = 3;

        void main (void)
        {
            y = child(x, 5);
        }

    Child.C
        int child(int a, int b)
        {
            return(a + b);
        }

    Child.ASM
                .global _child
        _child:
                ...assembly code...
                ; end of subroutine
– Use _underscore
– Make label global
• To pass variables in, take a return value, and return to the parent code flow, we use a set of argument/register passing rules
    Register   A side         B side
    3                         ret addr
    4          arg1/r_val     arg2
    6          arg3           arg4
    8          arg5           arg6
    10         arg7           arg8
    12         arg9           arg10
– Arguments are passed in registers as shown
– Return value in A4 and return to address in B3
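The register assignment in the figure alternates between the A and B sides. As a hedged illustration, the lookup below encodes that order for the first ten arguments; it assumes 32-bit arguments only, and arg_regs and arg_register are hypothetical names, not part of any TI tool.

```c
#include <stddef.h>

/* Registers for arguments 1 through 10, in the order shown in the figure. */
static const char *const arg_regs[10] = {
    "A4", "B4", "A6", "B6", "A8", "B8", "A10", "B10", "A12", "B12"
};

/* Name of the register holding 1-based argument n. Arguments beyond the
 * tenth are passed on the stack; n < 1 is invalid and yields NULL. */
static const char *arg_register(int n)
{
    if (n < 1)
        return NULL;
    if (n > 10)
        return "stack";
    return arg_regs[n - 1];
}
```

For y = child(x, 5) this places x in A4 and the constant 5 in B4, with the result coming back in A4 and the return address held in B3.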
• A simple example
    Parent.C
        int child(int, int);
        int x = 7, y, w = 3;

        void main (void)
        {
            y = child(x, 5);
        }

    Child.C
        int child(int a, int b)
        {
            return(a + b);
        }

    Child.ASM
                .global _child
        _child:
                add     a4,b4,a4    ; arguments in a4 and b4, result in a4
                b       b3          ; return to the address in b3
                nop     5
                ; end of subroutine
• Accessing C global variables in assembly:
– Declare global labels
– Use _underscore when accessing C variables (labels)
– Advantages of declaring variables in C?
    Declaring in C is easier
    Compiler does variable init ( int w = 3 )
    Parent.C
        int child2(int, int);
        int x = 7, y, w = 3;

        void main (void)
        {
            y = child2(x, 5);
        }

    Child2.ASM
                .global _child2
                .global _w
        _child2:
                mvkl    _w , A1     ; lower 16 bits of the address of w
                mvkh    _w , A1     ; upper 16 bits of the address of w
                ldw     *A1, A0     ; load the value of w into A0
• Registers A10–A15 and B10–B15 must be saved/preserved
• There is actually a bit more to this (see below), but more later
[Figure: A and B register files, registers 0–15. Registers 10–15 on each side are marked "These must be saved and restored if you use them in Assembly".]

[Figure: register usage on entry to a C-callable routine. A side: arg1/r_val, arg3, arg5, arg7, arg9. B side: ret addr, arg2, arg4, arg6, arg8, arg10, DP, SP. Extra arguments are passed on the stack, next to the prior stack contents.]
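As a compact way to remember the argument layout in the figure, the sketch below maps an argument position to the register that carries it. It assumes the standard C6000 convention (arg1 in A4, arg2 in B4, then A6, B6, and so on up to B12) and arguments of 32 bits or less; wider types that occupy register pairs are not modeled. The struct and function names are hypothetical.

```c
/* Hypothetical helper: register that carries the n-th argument (n = 1..10)
   when every argument is 32 bits or smaller. */
struct c6x_reg {
    char side;    /* 'A' or 'B' register file */
    int  number;  /* register index within that file */
};

struct c6x_reg arg_register(int n)
{
    struct c6x_reg r;
    /* Odd-numbered arguments use the A side, even-numbered the B side. */
    r.side = (n % 2 == 1) ? 'A' : 'B';
    /* Argument pairs (1,2), (3,4), ... use registers 4, 6, 8, 10, 12. */
    r.number = 4 + 2 * ((n - 1) / 2);
    return r;
}
```

For the earlier call child(x, 5), this gives A4 for x and B4 for the constant 5, which matches the add a4,b4,a4 line in Child.ASM.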
Linear Assembly and Assembly Optimization

Being able to call highly efficient linear assembly routines from C is another powerful capability of the compiler tools. In this section we explore the main points.

• Linear assembly has (almost) the ease of C programming and an efficiency approaching that of assembly, but without too many headaches, as the tools do a lot of the work
• The development flow for linear assembly modules
• Features of linear assembly for subroutines include:
– Pass parameters
– Return results
– Use symbolic variable names
– Ignore pipeline issues (delay slots)
– Automatically return to the calling function
– Call other functions written in C or linear assembly
[Figure: development flow. Text Editor → .c / .cpp → Compiler → .asm; Text Editor → .sa → Asm Optimizer → .asm; .asm → Assembler → .obj → Linker (with Link.cmd) → .out]
• Consider a simple dot product example in C

int DotP(short *m, short *n, int count)
{
    int i;
    int product;
    int sum = 0;
    for (i=0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

• Rewriting in linear assembly (typically a .sa file) we have

_dotp:  zero    sum
loop:   ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop

– Assembly directives are required :(
– Functional unit management is not needed :)
– Register management not needed :)
• A special directive .cproc is used to declare the passed variables, e.g.,
        .cproc arg1, arg2, arg3
• The directive .endproc declares the end of the routine
• Symbolic names can be used throughout, which is very nice
• The completed dot product example

_dotp:  .cproc  pm, pn, count
        .reg    m, n, prod, sum
        zero    sum
loop:
        ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop
        .return sum
        .endproc

• The above performs the function
short dotp(short *a, short *x, int count)
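It can help to read the linear assembly loop as C written in the same shape: post-incremented pointers and a count-down loop whose test comes after the body. The sketch below is only an illustration of that correspondence, not code from the workshop; the name dotp_countdown is hypothetical. Because the conditional branch sits at the bottom of the loop, the body always runs at least once, so the sketch assumes count >= 1.

```c
/* C mirror of the linear assembly dot product loop. Assumes count >= 1,
   since the loop test follows the body just as [count] b loop does. */
int dotp_countdown(const short *pm, const short *pn, int count)
{
    int sum = 0;                  /* zero sum */
    do {
        short m = *pm++;          /* ldh *pm++, m */
        short n = *pn++;          /* ldh *pn++, n */
        int prod = m * n;         /* mpy m, n, prod */
        sum += prod;              /* add prod, sum, sum */
        count -= 1;               /* sub count, 1, count */
    } while (count != 0);         /* [count] b loop */
    return sum;                   /* .return sum */
}
```

Unlike the for loop in DotP, this form would misbehave for count = 0, which is worth keeping in mind when calling the linear assembly version from C.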
Calling from Linear Assembly

• Linear assembly can also call another subroutine

_dotp:      .cproc
            .reg    val
            mvk     5, val
            .call   val = _testcall(val)
            .return val
            .endproc

_testcall:  .cproc  input
            add     input, 5, input
            .return input
            .endproc

Linear Assembly Compiler Settings

• Specific assembly optimizer options are:
– Use -g -s for algorithm verification
– Use -k -mgt -o3 -pm for software pipelining

Example: Vector Norm Squared

In this example we will be computing the squared length of a vector using 16-bit (short) signed numbers. In mathematical terms we are finding
$$\|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2 \qquad (3.1)$$
where
$$\mathbf{A} = \begin{bmatrix} A_1 & \ldots & A_N \end{bmatrix} \qquad (3.2)$$
is an $N$-dimensional vector (column or row vector).

• The solution will be obtained in three different ways:
– Conventional C programming
– C6x assembly
– C6x linear assembly
• Optimization is not a concern at this point
• The focus here is to see, by way of a simple example, how to call a C routine from C (obvious), how to call an assembly routine from C, and how to write a simple linear assembly routine and call it from C

C Version

• We implement this simple routine in C using a declared vector length N and vector contents in the array A
• The C source, which includes the called function norm, is given below
/******************************************************
Vector norm-squared routine in C
******************************************************/
#include <stdio.h>
short norm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}

short norm(short* V, int n)
{
    int i;
    short out = 0;
    for(i=0; i<n; i++)
    {
        out += V[i]*V[i];
    }
    return out;
}
• The expected answer is 1 + 4 + 9 + 36 + 49 = 99
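The routine above accumulates into a short, which is fine for this small test vector but cannot hold a sum larger than 32767. As a hedged variation (not part of the original example), the sketch below keeps the same loop but uses an int accumulator and return type, assuming int is 32 bits as it is on the C6x. The name norm_sq_wide is hypothetical.

```c
/* Same loop as norm(), but with an int accumulator and return type.
   Assumes int is at least 32 bits wide. */
int norm_sq_wide(const short *V, int n)
{
    int i;
    int out = 0;
    for (i = 0; i < n; i++) {
        /* Each short operand is promoted to int before the multiply. */
        out += V[i] * V[i];
    }
    return out;
}
```

With the test vector {1, 2, 3, 6, 7} this still gives 99, while a vector such as {200, 200} gives 80000, a value that would not fit in the short result of the original routine.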
Running in CCS 5.1: The C code is put into a project for running on the OMAP-L138 or the simulator as Norm_Squared, and debugged and profiled
• From the watch window we obtain the following when we step the program to the last line
• Enable the clock under the run menu to profile
• The cycle count at the function call level for the call that computes norm_sq is 152 in the simulator; this was not tried on hardware
[Figure: CCS 5.1 screenshot, annotated with the starting address of the array in memory, other active windows in CCS 5.1, and the time from the 1st to the 2nd breakpoint]
Assembly Version
• The parent C routine is the following:

/******************************************************
Vector norm-squared routine in assembly
******************************************************/
#include <stdio.h>
short norm_asm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_asm(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}
• From just the C source it is not obvious that norm_asm, which C sees only through its function prototype, is actually an assembly routine
• The assembly routine is the following:

; Vector norm in assembly
        .global _norm_asm        ;reference name from C
_norm_asm:
        mv   .l2 B4, B1          ;put loop ctr. in a proper reg.
        zero .l1 A2              ;initialize accumulator
loop:
        ldh  .d1 *A4++, A1       ;ld vals pointed to by A4 in A1
        nop  4                   ;required ldh delay
        mpy  .m1 A1, A1, A3      ;square each value
        nop                      ;required mpy delay
        add  .l1 A3, A2, A2      ;accumulate the squared values
        sub  .l2 B1, 1, B1       ;decrement the loop counter
  [B1]  b    .s2 loop            ;branch until B1 == 0
        nop  5                   ;required branch delay
        mv   .d1 A2, A4          ;move result to return reg. A4
        b    .s2 B3              ;branch back to address at B3
        nop  5                   ;required branch delay
• Note that each line of assembly code takes the following form:
label: || [cond] instruction .unit operand ;comment
– Labels must start in the first column, up to 200 characters,and must begin with a letter, the colon is optional
• When accessing from C the register calling convention is observed, that is, when we enter the function norm_asm(arg1, arg2),
– arg1 is a pointer to (the address of) the first value of the array A, and is stored in register A4
– arg2 is an int value, i.e., a full 32-bit signed integer, and is stored in register B4
• Since arg2 is the array dimension, we will use it as the loop counter starting value
• B4 is not a suitable register for loop control (on the C62x/C67x only A1, A2, B0, B1, and B2 can serve as condition registers), so we move (mv) the value stored in B4, in this case to B1
• We initialize the accumulator register, A2, using the zero instruction; alternatively, mvk .s1 0,A2 works as well
• Starting at the top of the loop section, we begin by loading (ldh, since the data are only 16 bits wide) the values pointed to by A4 into working register A1
– The pointer A4 is post-incremented by just 2 bytes (16 bits) following the load operation
– The default increment size is controlled by the data type, here halfwords (16 bits)
– Various pre- and post-increment options are available, including the offset amount and whether or not the original pointer is modified (see the table below)
• To satisfy the pipeline delays, we follow the ldh with four NOPs (nop 4)
• Next, we perform a 16-bit multiply (MPY), actually a squaring; the result is stored in A3
• To satisfy the pipeline we follow the MPY with one NOP
• We accumulate the result into register A2 using ADD
• Next, we decrement the loop counter B1 using SUB and branch to loop subject to the state of B1 (the branch is taken while B1 is nonzero)
• The branch is followed by five NOPs (nop 5) to satisfy the pipeline delay
Table 3.1: Pointer incrementing methods; A1 shown (a)

Syntax          Pointer changed   Description
*A1             no                Basic pointer
*+A1[disp]      no                +Pre-offset
*-A1[disp]      no                -Pre-offset
*++A1[disp]     yes               Pre-increment
*--A1[disp]     yes               Pre-decrement
*A1++[disp]     yes               Post-increment
*A1--[disp]     yes               Post-decrement

(a) If [disp] is omitted the displacement is one unit of the data type; otherwise the displacement is by integer multiples of Word, Halfword, or Byte. If (disp) is used instead of [disp] the displacement is (disp) bytes.
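The addressing modes in Table 3.1 have close analogues in C pointer arithmetic, where the step size is likewise set by the data type. The sketch below walks a halfword (short) array with C expressions that roughly correspond to several table rows. The struct and function names are hypothetical, and the array passed in is assumed to hold at least three elements.

```c
/* Values read while exercising C analogues of Table 3.1 on a short array. */
struct addr_demo {
    short pre_offset;      /* like *+A1[2]: read two ahead, pointer unchanged */
    short after_offset;    /* like *A1: still the first element */
    short post_increment;  /* like *A1++: read, then advance one element */
    short pre_increment;   /* like *++A1: advance one element, then read */
    short post_decrement;  /* like *A1--: read, then step back one element */
    short final_value;     /* like *A1: where the pointer ended up */
};

struct addr_demo addressing_demo(const short *base)
{
    struct addr_demo d;
    const short *a1 = base;

    d.pre_offset     = *(a1 + 2);  /* base[2]; a1 still points at base[0] */
    d.after_offset   = *a1;        /* base[0] */
    d.post_increment = *a1++;      /* base[0]; a1 now points at base[1] */
    d.pre_increment  = *++a1;      /* a1 now points at base[2]; reads base[2] */
    d.post_decrement = *a1--;      /* base[2]; a1 now points at base[1] */
    d.final_value    = *a1;        /* base[1] */
    return d;
}
```

As in the table, only the increment and decrement forms leave the pointer changed; the pre-offset read does not.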
• Finally, the squared and accumulated value held in A2 is moved to the return register A4
• To return to the C module, we must branch to the address saved in B3
• If we had needed to use registers A10–A15 or B10–B15, we would have had to save and restore them accordingly
• The final numerical result is again 99
Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_asm.pjt, and debugged and profiled
• The profiling results of the new norm_sq function are:
• With the assembly routine the cycle count is reduced to 91, which as a ratio makes the C routine 152/91 = 1.67 times slower, assuming no optimization
• With optimization the tables are turned and the C is faster by the factor ?
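As a rough cross-check on the measured count, the cycles of the assembly listing can be tallied by hand. The sketch below assumes each instruction issues alone in one cycle and that nop k occupies k cycles; memory stalls and the work done by the C caller around the call are ignored, so it is an estimate rather than a prediction of the profiler figure. The function name is hypothetical.

```c
/* Rough cycle estimate for norm_asm as listed, for a vector of length n.
   Assumes one instruction per cycle and "nop k" taking k cycles. */
int norm_asm_cycle_estimate(int n)
{
    const int prologue = 1 + 1;       /* mv, zero */
    const int per_iteration =
        1 + 4 +                       /* ldh, nop 4 */
        1 + 1 +                       /* mpy, nop */
        1 + 1 +                       /* add, sub */
        1 + 5;                        /* b loop, nop 5 */
    const int epilogue = 1 + 1 + 5;   /* mv, b B3, nop 5 */
    return prologue + n * per_iteration + epilogue;
}
```

For n = 5 the estimate is 84 cycles, somewhat below the measured 91; one plausible reading is that the measurement, taken at the function call level, also covers the caller's argument setup and branch. Ten of the fifteen cycles per iteration are NOPs, which is the slack that software pipelining aims to fill.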
The Linear Assembly Version
• The parent C calling routine is again of the form:

/******************************************************
Vector norm-squared routine in linear assembly
******************************************************/
#include <stdio.h>
short norm_sa(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_sa(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}
• The linear assembly routine is the following:

; Vector norm in linear assembly
        .global _norm_sa         ;reference name from C
_norm_sa: .cproc A, N            ;input variables
        .reg    m, sum           ;working variables
        zero    sum              ;zero the accumulator
loop:
        ldh     *A++, m          ;load values pointed to by A
        mpy     m, m, m          ;square each value
        add     m, sum, sum      ;accumulate the squared values
        sub     N, 1, N          ;decrement the loop counter
  [N]   b       loop             ;branch until N == 0
        .return sum              ;return value
        .endproc                 ;end linear assembly routine
• The function/subroutine is declared .global just as in the assembly case
• Following the assembly label _norm_sa, we begin the linear assembly routine with .cproc followed by the input variables (may be dummy names)
• Working variables are declared using .reg
• The accumulator is cleared using the assembler instruction zero
• A loop is then set up in a similar fashion to the pure assembly version, except now the precise management of the registers is left to the assembly optimizer
• There is also no need to include NOPs
• As before the final answer is 99
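The line mpy m, m, m relies on MPY being a signed 16-bit by 16-bit multiply with a 32-bit result, operating on the 16 least significant bits of each source register. The sketch below models that behavior in C for illustration only; the name mpy_model is hypothetical, and int32_t from stdint.h stands in for a 32-bit register.

```c
#include <stdint.h>

/* Model of MPY: signed multiply of the low halfword of each source,
   producing a 32-bit result. Upper halfwords of the sources are ignored. */
int32_t mpy_model(int32_t src1, int32_t src2)
{
    /* Extract the low 16 bits of each operand. */
    int32_t lo1 = (int32_t)((uint32_t)src1 & 0xFFFFu);
    int32_t lo2 = (int32_t)((uint32_t)src2 & 0xFFFFu);
    /* Sign-extend each halfword. */
    if (lo1 & 0x8000) lo1 -= 0x10000;
    if (lo2 & 0x8000) lo2 -= 0x10000;
    return lo1 * lo2;
}
```

Since ldh sign-extends each halfword into the register, squaring a loaded value this way gives the same result as the V[i]*V[i] expression in the C version.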
Running in CCS 2: The C code is put into a project for running on the 6711 DSK as norm_sq_sa.pjt, and debugged and profiled
• The profiling results of the new norm_sq function are:
• This result is very similar to the assembly result (on the 6713: 90 cycles for the .sa version and 91 for the .asm version)
• With, say, -o3 optimization the linear assembly is faster by the ratio ?
• When debugging a linear assembly routine it is best to use the mixed mode to display assembly interlisted with C and/or linear assembly
• The registers window can then be used to watch what is happening when the code is stepped