Post on 21-Dec-2015
Processor Architectures and Program Mapping
Programmable Digital Signal Processors
5kk10, TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman
04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Topic 2: Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power
  $$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \cdot \frac{\text{cycles}}{\text{instr}} \cdot \frac{\text{sec}}{\text{cycle}}$$
• CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high-level data types, e.g. arrays and delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  • e.g. DIT, DIF, radix 2/4 FFT: loop unrolling, scaling, shuffling, initialisation ... can be included or forgotten
  • benchmarking: Berkeley Design Technology Inc. (BDTi) (compare to the SpecInt benchmarks for CPUs)

Outline
• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  • examples: C6 and TM

Sum of products Σ c(i) * x(i) = basic operation for correlation, filtering, spectral analysis ... (linear expressions)
Goal = 1 cycle per iteration
[Figure: multiplier-accumulator datapath. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace, ...), followed by the product register PR, the ADDER and the accumulator register ACR; clock, P_reg and control signals]
Modifications
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision
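
As a rough sketch of what the MAC datapath computes, the sum of products can be written as the C loop below. The function name and the 64-bit accumulator are assumptions of this example (the wide accumulator stands in for an ACR with extra overflow bits); a real DSP would issue one multiply-accumulate per clock cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products: acc = sum over i of c[i] * x[i].
 * Each loop iteration is one multiply-accumulate (MAC) step.
 * The 64-bit accumulator is an assumption of this sketch. */
int64_t sum_of_products(const int16_t *c, const int16_t *x, size_t n)
{
    int64_t acc = 0;                              /* ACR: accumulator register */
    for (size_t i = 0; i < n; i++) {
        int32_t p = (int32_t)c[i] * (int32_t)x[i]; /* PR: product register     */
        acc += p;                                  /* ADDER feeding the ACR    */
    }
    return acc;
}
```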

DSP data types
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage of FP: most specs are in FP (conversion to integer is time consuming since the behaviour may change)
• disadvantage of FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
  • no problem for FP, but a problem for integer
  • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers? 0.9 x 0.9 = 0.81

DSP data types
• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART Designer & SystemC), with p bits in total, of which q are fractional bits
Example: fix <8,3>, scale factor 1/8
  bits:     1     1    1    0    1  .  1     0     1     = -19/8 = -2.375
  weights: -2^4  2^3  2^2  2^1  2^0   2^-1  2^-2  2^-3
• the msb has a negative weight (2's complement); the lsb weight sets the quantization error
• if q = 0 then integer, e.g. int <8,0>
• if q = p-1 then fractional, e.g. int <8,7>
• the same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
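
The interpretation of a fix <p,q> word can be sketched in C as follows. The function name `fix_value` is hypothetical and not part of ART Designer or SystemC; the code only illustrates the negative msb weight and the scale factor 2^-q.

```c
#include <stdint.h>

/* Value of a p-bit two's-complement word with q fractional bits.
 * `raw` holds the bit pattern in its low p bits (1 <= p <= 32, 0 <= q < p). */
double fix_value(uint32_t raw, int p, int q)
{
    uint32_t mask = (p == 32) ? 0xFFFFFFFFu : ((1u << p) - 1u);
    int64_t v = (int64_t)(raw & mask);
    if (v & ((int64_t)1 << (p - 1)))      /* msb set: it carries weight -2^(p-1) */
        v -= (int64_t)1 << p;
    return (double)v / (double)((int64_t)1 << q);   /* apply scale factor 2^-q */
}
```

With the bit pattern 11101101 from the slide, `fix_value(0xED, 8, 3)` gives -2.375 and `fix_value(0xED, 8, 0)` gives -19.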

DSP data types
      1 1 1 0 1 . 1 0 1                      -19/8        Int <8,3>
  x   0 1 1 0 . 0 0 0 1                       97/16       Int <8,4>
  =   1 1 1 1 1 0 0 0 1 . 1 0 0 1 1 0 1      -1843/128

        s . x x x
  x     s . y y y
  -----------------
    s s . z z z z z z
      s . z z z z z z 0      => if FRCT = 1

Some processors (C54) have special instructions for fractional numbers (and a symmetric number domain $-2^{n-1} \ldots 2^{n-1}$).
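
The effect of a fractional mode can be sketched in C for 16-bit fractional operands (1 sign bit, 15 fractional bits). This is an illustration under assumptions, not the C54 instruction itself: the name `frac_mul` is hypothetical, the result is truncated to 16 bits, and the single overflow case (-1 times -1) is saturated.

```c
#include <stdint.h>

/* Multiply two 16-bit fractional numbers.
 * The 32-bit product has two sign bits (ss.zzz...); dropping the redundant
 * one is equivalent to the one-bit left shift of a fractional mode.
 * Assumes >> on a negative value is an arithmetic shift, as on common compilers. */
int16_t frac_mul(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;                     /* -1 * -1 = +1 is not representable */
    int32_t p = (int32_t)a * (int32_t)b;      /* 30 fractional bits */
    return (int16_t)(p >> 15);                /* keep 15 fractional bits (truncate) */
}
```

For 0.9 stored as 29491, `frac_mul(29491, 29491)` returns 26541, which is about 0.81 in the same format.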

DSP data types
• continue (after multiplication) with msb only
  • represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
  • more efficient solution
• continue with msb + lsb
  • sum-of-product operations generate accumulated noise at the 32nd vs. the 16th bit
• still overflow for addition => overflow bits
• double-precision accumulator + extra overflow bits + shift, round, truncate unit

[Figure: MAC datapath with inputs c(i) and x(i), multiplier MPY (Booth, Wallace, ...), product register PR, ADDER and accumulator register ACR, extended with a SHIFT / ROUND / TRUNCATE unit; clock, P_reg and control signals]

[Figure: quantizer characteristics xQ versus x for rounding, value truncation and magnitude truncation]

                        x = 1 1 1 . 1 1 (-0.25)         x = 1 1 1 . 0 1 (-0.75)
rounding                + 0 0 0 . 1  =>  0 0 0 (0)      + 0 0 0 . 1  =>  1 1 1 (-1)
value truncation                     =>  1 1 1 (-1)                  =>  1 1 1 (-1)
magnitude truncation    + 0 0 1 .    =>  0 0 0 (0)      + 0 0 1 .    =>  0 0 0 (0)
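
The three quantization modes can be sketched in C on a value stored as an integer scaled by 2^q. The function names are hypothetical; the arithmetic follows the slide examples (add half an lsb and floor, floor, and truncate toward zero).

```c
#include <stdint.h>

/* floor(x / 2^q), written without shifting negative values */
static int32_t floor_div_pow2(int32_t x, int q)
{
    int32_t d = (int32_t)1 << q;
    int32_t r = x / d;              /* C division truncates toward zero */
    if (x % d < 0)
        r -= 1;                     /* correct negative, non-exact results */
    return r;
}

/* rounding: add half an lsb of the result, then drop the q bits (q >= 1) */
int32_t quant_round(int32_t x, int q)
{
    return floor_div_pow2(x + ((int32_t)1 << (q - 1)), q);
}

/* value truncation: simply drop the q bits (rounds toward minus infinity) */
int32_t quant_value_trunc(int32_t x, int q)
{
    return floor_div_pow2(x, q);
}

/* magnitude truncation: reduce the magnitude (rounds toward zero) */
int32_t quant_magnitude_trunc(int32_t x, int q)
{
    return x / ((int32_t)1 << q);
}
```

With q = 2, the inputs -1 and -3 represent -0.25 and -0.75, the two cases on the slide.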

[Figure: overflow characteristics: saturation, zeroing, sawtooth]
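
A minimal C sketch of the three overflow characteristics, reducing a 32-bit result to 16 bits. The function names are hypothetical, and "sawtooth" is taken here to mean two's-complement wrap-around.

```c
#include <stdint.h>

/* saturation: clip to the largest or smallest representable value */
int16_t ovf_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* zeroing: replace any overflowing result by zero */
int16_t ovf_zero(int32_t v)
{
    return (v > INT16_MAX || v < INT16_MIN) ? 0 : (int16_t)v;
}

/* sawtooth: keep the low 16 bits (two's-complement wrap-around) */
int16_t ovf_sawtooth(int32_t v)
{
    uint16_t u = (uint16_t)(uint32_t)v;
    return (u >= 0x8000u) ? (int16_t)((int32_t)u - 0x10000) : (int16_t)u;
}
```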

Σ c(i) * x(i): Goal = 1 cycle per iteration
[Figure: three memory organisations around the execution unit (EXU)]
• Von Neumann (sequential): one combined program/data memory
• Harvard: separate program memory and data memory
• Modified Harvard: program memory, data memory 1 and data memory 2

[Figure: DSP architecture. Controller: PC with +1 increment, interrupt address, stack, reset, program memory, instruction register IR, control bus. Datapath: RAM_A and RAM_B, each with an address computation unit (ACU_A, ACU_B), an address register (AR_A, AR_B) and a data register (DR_A, DR_B); MAC; register file (Rfile)]

[Figure: 5-tap FIR filter. Delay line x1..x5 built from z^-1 elements, coefficients c1..c5, multipliers and an adder producing y = Σ ci * xi; a time loop encloses the filter loop over i]
How to update the delay line?
1 cycle/tap?

Solution 1: block move in memory

Memory location   output sample 1   output sample 2   output sample 3
1                 x1                x2                x3
2                 x2                x3                x4
3                 x3                x4                x5
4                 x4                x5                x6
5                 x5                x6                x7

2 possibilities
• complete move after every output sample is calculated
  • read and write the data twice
• move after the read of every datum separately
  • write the data twice
  • need for a special instruction (TMS320)

Solution 2: indirect addressing

Memory location   output sample 1   output sample 2   output sample 3   output sample 4   output sample 5
1                 x1                                                                      x9
2                 x2                x2
3                 x3                x3                x3
4                 x4                x4                x4                x4
5                 x5                x5                x5                x5                x5
6                                   x6                x6                x6                x6
7                                                     x7                x7                x7
8                                                                       x8                x8

• use of a pointer to mark the beginning of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
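
The pointer-plus-modulo idea can be sketched in C as a circular delay line. The type and function names and the buffer length `TAPS` are assumptions of this example; a DSP would perform the modulo update in its address computation unit rather than with the `%` operator.

```c
#include <stdint.h>

#define TAPS 5   /* delay line length (hypothetical size for this sketch) */

typedef struct {
    int16_t line[TAPS];   /* delay line storage */
    int     head;         /* index of the newest sample */
} delay_line;

/* Store a new sample by moving the pointer, not the data. */
void delay_push(delay_line *d, int16_t sample)
{
    d->head = (d->head + 1) % TAPS;      /* modulo addressing keeps the pointer in range */
    d->line[d->head] = sample;
}

/* y = sum over i of c[i] * x(n - i), reading backwards from the newest sample. */
int32_t fir_output(const delay_line *d, const int16_t c[TAPS])
{
    int32_t acc = 0;
    int idx = d->head;
    for (int i = 0; i < TAPS; i++) {
        acc += (int32_t)c[i] * d->line[idx];
        idx = (idx + TAPS - 1) % TAPS;   /* step back one sample, wrapping around */
    }
    return acc;
}
```

Each new sample overwrites the oldest one, so no data is moved between output samples.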

IIR filter
[Figure: IIR filter with input x, output y, coefficients c1..c4 and a delay line y1..y5 built from z^-1 elements; the memory map holds y1..y5 with a pointer marking the start of the delay line]

2 filters
time loop
  for j = 1..jtaps: Σ d(j) * y(j)
  for i = 1..itaps: Σ c(i) * x(i)
[Figure: memory maps. Two memory segments: x1..x5 with pntr 1 and modulo range 1, and y1..y5 with pntr 2 and modulo range 2. One segment: y1..y5 followed by x1..x5, with a single pointer pntr 1 and a single modulo range]
2 memory segments => 1 segment

Mapping strategy
[Figure: filter with delay lines x1..x3 and y1..y3 (z^-1 elements) and coefficients c1..c5; memory map y1, y2, x1/y3, x2, x3 with one pointer (pntr 1) and one modulo range]
• define positions in RAM
  • constraint: variables that form a delay line are placed in consecutive locations
• find a schedule
  • example: c1 => c2 => c3 => c4 => c5
• define ACU instructions

[Figure: filter with delay line x1..x8 (z^-1 elements) and coefficients c1..c8, with multipliers feeding two adders that produce the outputs ye and yo]

ACU architecture and instruction set
[Figure: ACU with registers A and S and a modulo unit; the output goes to the RAM]

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k, e.g. for the range 16..23:
  16 = 10 000
  23 = 10 111
  mask = hold (the upper bits are held; only the lower k bits change)
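
The mask form of modulo addressing can be sketched in C as follows. The function name is hypothetical, and the sketch assumes the buffer is aligned to its size 2^k (with k below 32), as in the 16..23 example.

```c
#include <stdint.h>

/* Add `step` to address `a` inside an aligned circular buffer of size 2^k.
 * The low k bits wrap around; the upper bits are held unchanged. */
uint32_t modulo_step(uint32_t a, int32_t step, unsigned k)
{
    uint32_t mask = (1u << k) - 1u;                       /* selects the low k bits */
    return (a & ~mask) | ((a + (uint32_t)step) & mask);   /* hold high bits, wrap low bits */
}
```

For the range 16..23 (k = 3), stepping from 23 by +1 gives 16, and stepping from 21 by -2 gives 19.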

Mapping example
[Figure: filter with delay lines x1..x3 and y1..y3 and coefficients c1..c5; memory map y1, y2, x1/y3, x2, x3 with one pointer, stored in the modulo range at addresses 16..23]
Assume initialisation: A = pointer = 17, S = -2

ACU instruction   address
read_A            17
incA              18
incA              19
incA              20
incA              21
step              19
dec               18     prepare new pointer for next iteration

Addressing modes

mode          example             meaning
• register    ADD R4, R3          R[R4] = R[R4] + R[R3]
• immediate   ADD R4, #3          R[R4] = R[R4] + #3
• direct      ADD R4, (100)       R[R4] = R[R4] + Mem[100]
• indirect    ADD R4, (R3)        R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec  ADD R4, (R3)±       R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
• indexed     ADD R4, (R3±R2)     R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)
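
The last two remarks can be illustrated in C, where an index variable plays the role of the address register R3. Both function names are hypothetical; the first loop corresponds to post-increment addressing, the second to indexed addressing with the step held in a second register.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum x(n), n = 0..count-1: the address advances by 1 after each access. */
int32_t sum_every(const int16_t *x, size_t count)
{
    int32_t acc = 0;
    size_t addr = 0;                 /* plays the role of R3 */
    while (count--) {
        acc += x[addr];
        addr += 1;                   /* post-increment */
    }
    return acc;
}

/* Sum x(step * n): the address advances by `step` (the role of R2). */
int32_t sum_strided(const int16_t *x, size_t count, size_t step)
{
    int32_t acc = 0;
    size_t addr = 0;
    while (count--) {
        acc += x[addr];
        addr += step;                /* post-modify by the index register */
    }
    return acc;
}
```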

Addressing modes: extra for DSP
• 8 ARs (address or auxiliary registers) available
• extra indirect modes
  • circular       *ARn ± %         post inc/dec by 1, circular
  • circular       *ARn ± AR0 %     post inc/dec by AR0, circular
  • bit reverse    *ARn ± AR0 B     post inc/dec by AR0, bit reversed
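
Bit-reversed addressing is typically used to reorder the data of a radix-2 FFT. The hardware computes the next address directly in the ACU; the C sketch below only shows the resulting index mapping, and the function name is hypothetical.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`
 * (e.g. bits = 3 for an 8-point FFT). */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1u);   /* move the current lsb to the other end */
        index >>= 1;
    }
    return r;
}
```

For an 8-point FFT the indices 0..7 map to 0, 4, 2, 6, 1, 5, 3, 7.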

Incorporation of an ALU
• regular data-flow algorithms ==> MAC
  • filtering, correlation, windowing, etc.
• decision making ==> ALU
  • sorting filters (e.g. median filters)
  • interpolation (e.g. sqrt)
  • absolute value calculation
  • logarithmic conversion
  • finite field arithmetic (e.g. Galois field)
  • Viterbi
  • VLC, VLD
  • division

[Figure: DSP architecture extended with an ALU. Controller: PC with +1 increment, interrupt address, stack, reset, program memory, instruction register IR, control bus. Datapath: RAM_A and RAM_B with ACU_A/ACU_B, AR_A/AR_B and DR_A/DR_B; MAC and ALU; register file (Rfile)]

Bus-oriented instruction encoding

opcode   fields
00       ALU | SX | SY | DX | DY | RF | ACU (A, B)
01       MULT | SX | SY | DX | DY | RF | ACU (A, B)
10       Imm. data | DX | DY | RF | ACU (A, B)
11       Next address | BR | Cond | ACU (A, B)

first solution: Σ c(i) * x(i)
(columns = resources, rows = time in clock cycles)

LABEL   ALU                              MPY-ACC                         RAM         ACU
                                         acc = 0                                     init (i=0)
        init counter
loop                                                                                 incr (=i+1)
                                                                         read x(i)
                                         acc(i)=acc(i-1)+x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

Not shown: coefficient RAM + ACU
6 clock cycles/sample
limit: pipelines in the controller

Loopfolding (software pipelining)
[Figure: iterations 0, 1, 2, ... each chain a_i -> f -> b_i -> g -> c_i -> h -> d_i; after folding, f, g and h of different iterations execute side by side, operating on a_i, b_(i-1) and c_(i-2)]

Original loop:
  for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))

Folded loop:
  for i = 2 to n
    b(i)   = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))
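
A small C sketch of the same transformation, including the preamble and postamble that the folded loop needs. The stage functions `f`, `g`, `h` are arbitrary placeholders, the arrays hold n elements (indices 0..n-1, unlike the inclusive bound on the slide), and the folded version assumes n >= 2.

```c
#include <stddef.h>

static int f(int a) { return a + 1; }   /* hypothetical stage functions */
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* Original loop: the three stages of one iteration depend on each other. */
void plain_loop(const int *a, int *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int b = f(a[i]);
        int c = g(b);
        d[i] = h(c);
    }
}

/* Folded loop (n >= 2): the three statements in the body belong to
 * different iterations, so they are independent of each other. */
void folded_loop(const int *a, int *d, size_t n)
{
    /* preamble: fill the pipeline */
    int c_prev = g(f(a[0]));       /* c(0) */
    int b_prev = f(a[1]);          /* b(1) */
    for (size_t i = 2; i < n; i++) {
        int b = f(a[i]);           /* b(i)   */
        int c = g(b_prev);         /* c(i-1) */
        d[i - 2] = h(c_prev);      /* d(i-2) */
        b_prev = b;
        c_prev = c;
    }
    /* postamble: drain the pipeline */
    d[n - 2] = h(c_prev);
    d[n - 1] = h(g(b_prev));
}
```

Both functions produce the same output array; only the order of the operations differs.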

Loopfolding (software pipelining): Σ c(i) * x(i)

LABEL   ALU                              MPY-ACC                                RAM           ACU
                                         acc(i-1) = 0                                         init (i=1)
        init counter                                                            read x(i)     inc (=i+1)
loop                                     acc(i) = acc(i-1)+x(i)*c(i)            read x(i+1)   incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                                         acc(n-1) = acc(n-2)+x(n-1)*c(n-1)      read x(n)
                                         acc(n) = acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clock cycles/sample

hardware support for loop control: Σ c(i) * x(i)

LABEL        ALU            MPY-ACC                                RAM           ACU
                            acc(i-1) = 0                                         init (i=1)
             init counter                                          read x(i)     inc (=i+1)
repeat n-2                  acc(i) = acc(i-1)+x(i)*c(i)            read x(i+1)   incr (=i+2)
                            acc(n-1) = acc(n-2)+x(n-1)*c(n-1)      read x(n)
                            acc(n) = acc(n-1)+x(n)*c(n)

1 clock cycle/sample
repeat instruction and repeat block

Outline
• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  • examples: C6 and TM

TMS320C5000
[Figure: TMS320C5000 datapath. T register, 17 x 17 multiplier with sign control on its inputs, input multiplexers, fractional MUX, 40-bit adder with ZERO / SAT / ROUND, 40-bit ALU, 40-bit accumulators A and B, barrel shifter, MSW/LSW select, and a compare unit (COMP, TRN, TC); operands arrive over the C, D and P buses and results leave over the E bus]

Motorola 56K family
[Figure: block diagram. Address ALU generating the X, Y and P addresses (16-bit address bus, external address switch); X memory and Y memory, each with 256-by-24-bit RAM and 256-by-24-bit ROM; 2,048-by-24-bit program memory ROM; 24-bit X, Y, P and global data buses with internal and external data-bus switches; data ALU with a 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller with clock and interrupt inputs; on-chip peripherals (host interface, synchronous serial interface, serial communications interface, programmed I/O, bus control) with 24-bit I/O ports]

R.E.A.L.
[Figure: block diagram. X data memory and Y data memory on 16-bit buses, addressed by two address computation units; program memory (Z data) on a 16-bit bus; program control unit and instruction decoder (96-bit instructions); two 16-by-16-bit multipliers with inputs X, Y0, Y1 and product registers P0, P1, followed by scaling; two 40-bit arithmetic-logic units with saturation; four 40-bit accumulators; saturation/scale and shift units]

RD16021 DSP

memories            not included
process             0.35 µm, 5M
voltage             2.7-3.6 V
frequency           39 MHz (Tj = 85 °C, 2.7 V, wcp)
area                3.9 mm^2
power dissipation   2.1 mW/MHz

Instruction cycle counts for BDTi benchmarks
Processors (columns, left to right): DSP Group OAK, Motorola DSP561xx, ADI ADSP-218x, Lucent DSP16xx, TI TMS320C54x, TI 320C62xx, Lucent DSP16210, Philips RD16020

Function              OAK     DSP561xx   ADSP-218x   DSP16xx   C54x    C62xx   DSP16210   RD16020
Real block FIR        835     925        841         1240      684     334     780        448
Single sample FIR     21      23         22          26        18      17      16         20
Complex block FIR     3018    3043       3122        3123      2922    1294    1681       1470
LMS adaptive *        90 64 59 101 58 33 55
IIR (8 sections)      51      45         43          65        44      30      38         37
Vector dot product    43      43         43          47        41      29      23         43
Vector add            122     85         83          123       61      36      43         63
Vector maximum *      41 86 128 120 111 39 40
Convolution encoder   506     772        818         888       528     188     464        176
FSM                   284     375        198         415       455     147     301        167
256 pnt FFT           16514   12148      10633       21035     13234   4225    9016       5797

* only seven values are given for this row; their column assignment cannot be recovered
Annotations: 16 taps, 40 samples; 8 biquads

Outline
• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)

[Figure: compiler flow. source => front end (lexical analysis, syntax analysis, semantic analysis) => intermediate machine-independent representation => code generation (code selection, register allocation, scheduling) => code]
Scheduling: 1 instr = // ops; order of instr

Intermediate machine-independent representation
  t1 := a * b
  t2 := c + d
  t3 := t1 + c
  out := t2 * t3
[Figure: data-flow graph of these four statements (inputs a, b, c, d; intermediate values t1, t2, t3) and a control-flow graph of basic blocks BBi, BBj, BBk]

Code selection
Intermediate representation => (match & cover) => RTP
A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]
Notation: ar := ar | ax + ay | af means
  ar := ar + ay or ar := ar + af or ar := ax + ay or ar := ax + af

Code selection example
[Figure: ADSP datapath [Analog Devices]. ALU (+, -) with input registers ax, ay and result registers ar, af; MAC (+, -, *) with input registers mx, my and result registers mr, mf; x and y operands come from the d memory and the p memory]

Examples of RTPs on the ADSP-210 datapath
  mr | mf := mr + (ar | mr | mx) * (my | mf)
  mr | mf := mr - (ar | mr | mx) * (my | mf)
  mr | mf := (ar | mr | mx) * (my | mf)
  ar | af := (mr | ar | ax) + (ay | af)
  ar | af := (mr | ar | ax) - (ay | af)

Example of code selection = covering of the intermediate representation with RTPs
[Figure: data-flow graph with inputs a, b, c, d and intermediate values t1, t2, t3, covered by three RTPs numbered 1 to 3]
  mx := dmem    my := pmem    mr := dmem
  mr := mr + (mx * my)
  ax := dmem    ay := pmem
  ar := ax + ay
  my := ar
  mr := mr * my

Problems
• local decisions which have a global impact
• phase coupling: example
  • asap schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions

phase coupling: example 1
[Figure: (a) datapath with registers R1, R2, R3 and units alu1, alu2; (b) data-flow graph with operations 1 to 4; (c) the same graph with an additional Move operation]

phase coupling: example 2
Example of coupling between scheduling and register allocation [Mesman]
[Figure: values u and v with their producers Pu, Pv and consumers Cu, Cv, drawn once in general and once for the case that u and v share the same register]

phase coupling: discussion
[Figure [Mesman]: traditional (heuristic) code generation maps the application into the design space seen by the code generator and then checks the constraints (OK? yes/no); the feasible space is only part of that design space]
Phase coupling is difficult because of the many constraints originating from irregular interconnect, special-purpose registers and non-orthogonal microcode.

phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word

Outline
• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  • principles
  • central register file + example TM
  • clustered VLIW + example C6
  • subword parallelism or SIMD

VLIW principles
• multiple parallel FUs, possibly different and pipelined
• pipelining is exposed to the compiler = no interlock mechanism
• load-store architecture: all operands are fetched from/stored in register files, possibly multi-ported
• each FU can receive an instruction every clock cycle
• one instruction = many RISC instructions
• each RISC instruction = one issue slot
• no dependencies between different RISC instructions = orthogonal microcode = compiler friendly

VLIW architecture
[Figure: one register file shared by exec units 1..25, each with its own issue slot 1..25; the instruction supplies the read and write addresses]
• long instruction words, e.g. (3*7+4)*25 = 625
• many ports on the register file, e.g. 75

VLIW architecture: central Register File
[Figure: one central register file shared by exec units 1..9, which are grouped under issue slots 1, 2 and 3]

TM1000 DSPCPU
[Figure: five exec units fed by the instruction register (5 issue slots) and a register file (128 regs, 32 bit, 15 ports); data cache (16 kB); PC and instruction cache (32 kB)]
Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

TriMedia TM32A processor
[Figure: floorplan with D-cache, I-cache, TAG memories, SEQUENCER / DECODE, I/O INTERFACE and the functional units IFMUL1 and IFMUL2 (FLOAT), DSPMUL1, DSPMUL2, FTOUGH1, SHIFTER0, SHIFTER1, ALU0..ALU4, FCOMP2, DSPALU0, DSPALU2, FALU0, FALU3]
0.18 micron, area: 16.9 mm2
200 MHz (typ), 1.4 W
7 mW/MHz (MIPS = 0.9 mW/MHz)

Synthesised RF area (CMOS18, 64 bit)
[Figure: area in mm^2 (0 to 9) versus number of ports (0 to 20) for register files with 32, 64 and 128 registers, after P&R, each with a polynomial fit]
Area, speed and power dissipation grow more than linearly with the number of ports.

VLIW architecture: clustered Register Files
[Figure: three clusters. Register file 1 with exec units 1, 2 and a copy unit; register file 2 with exec units 3, 4 and a copy unit; register file 3 with exec units 5, 6 and a copy unit]

VLIW architecture: clustered Register Files
[Figure: register file 1 with FMUL and FADD units; register file 2 with IMUL and IADD units; register file 3 with IMUL and IADD units]
  FMUL r1,r2,r3    IADD r1,r2,r3    IMUL r1,r2,r3

VLIW architecture: clustered Register Files
[Figure: register file I0 with functional units FU00, FU01, FU02 offering the operations IADD_00, IADD_01, IMOV_01, LAND_00, IMUL_00, SHFT_00, ...; register file I1 with its own functional units offering IADD_10, IADD_11, IMOV_10, LAND_10, IMUL_10, SHFT_10, ...]

VLIW architecture: clustered Register Files
Discussion
• performance loss (more instructions) compared to a central register file (due to the extra cycle for a copy)
  • 15-20 % for 2 clusters
  • 20-30 % for 4 clusters
• limited scalability
  • not too many clusters
  • not too many registers within each cluster (too many RF ports)
• addition of copy operations in the compiler = graph changes during scheduling

TMS320C62x VelociTI (fixed point)
[Figure: two datapaths, each with a register file 0-15 (32 bits) and four functional units (L1, S1, M1, D1 and L2, S2, M2, D2) with src1, src2 and dst ports (plus src_up and dst_up on some units); the D units supply the store/load addresses; store/load data paths]
Functional units
• L: int add, logical, bit count
• S: int add, logical, bit manip, shift, constant, branch
• M: int mult (16 => 32)
• D: int add, load/store

VelociTI principles
• parallelism (fetch-decode-execute) (max 8 issue slots)
• pipeline critical sections (alu 1 cc, mult 2 cc, 200 MHz)
• RISC (simple, atomic, independent instructions): performance comes from the compiler (pipelining, unrolling)
• load-store
• orthogonal (2 identical DPs, add on 6 units)
• deterministic (no interlock)
• conditional instructions (= guarding)
• instruction packing

VelociTI encoding

Fully serial
  Classical encoding (fetching many nops):
    n n A n n n n n
    n B n n n n n n
    n n n n n C n n
    n n n n n D n n
    n n n E n n n n
    F n n n n n n n
    n n n n n n G n
    n n n n n n n H
  VelociTI encoding:
    A B C D E F G H
    0 0 0 0 0 0 0 0

Mixed serial/parallel
  Classical encoding:
    n B A n n C n n
    n n n E n D n n
    F n n n n n n n
    n n n n n n G H
  VelociTI encoding:
    A B C D E F G H
    1 1 0 1 0 0 1 0

Fully parallel
  Classical encoding:
    A B C D E F G H
  VelociTI encoding:
    A B C D E F G H
    1 1 1 1 1 1 1 0
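
Reading the bit rows above as one flag per instruction, where 1 means "the next instruction executes in the same cycle" and 0 closes the group, the number of cycles needed for a fetched group can be counted as in this C sketch. The function name is hypothetical and the reading of the flags is inferred from the three examples on the slide.

```c
#include <stddef.h>

/* Count the groups of instructions that execute together.
 * flags[i] == 1: instruction i+1 runs in parallel with instruction i.
 * flags[i] == 0: instruction i closes the current group. */
size_t count_execute_groups(const int *flags, size_t n)
{
    size_t groups = 0;
    for (size_t i = 0; i < n; i++) {
        if (flags[i] == 0 || i + 1 == n)   /* the last instruction always closes a group */
            groups++;
    }
    return groups;
}
```

The three examples give 8 groups (fully serial), 4 groups (mixed) and 1 group (fully parallel), matching the number of rows in the classical encoding.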

Subword parallelism (custom operators in TM)
[Figure: the 1st input operand (byte3, byte2, byte1, byte0) and the 2nd input operand (byte3, byte2, byte1, byte0) are combined byte by byte through four op units into the output operand (byte3, byte2, byte1, byte0)]
32 bits = 4 bytes are processed independently
Ex. +, -, min, max ... => quadumin, quadumax, ...

Subword parallelism (custom operators in TM)

Scalar version:
  int size = 1000;
  byte out[size], in1[size], in2[size];
  for (i = 0; i < size; i++)
      out[i] = in1[i] + in2[i];

Subword-parallel version:
  int size = 1000;
  byte out[size], in1[size], in2[size];
  for (i = 0; i < size; i += 4) {
      packet4 t1 = packet4_load(&in1[i]);
      packet4 t2 = packet4_load(&in2[i]);
      packet4 t3 = packet4_add(t1, t2);
      packet4_store(&out[i], t3);
  }

+ faster execution
- rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 * 32 bit floating point)
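
What a four-byte packed addition does can be emulated in plain C on a 32-bit word. This is only a sketch of the lane-wise behaviour, not the TriMedia operator: the function name is hypothetical, and each byte lane wraps modulo 256 without carries crossing into the neighbouring lane.

```c
#include <stdint.h>

/* Add four unsigned bytes packed into a 32-bit word, lane by lane. */
uint32_t packed_add4(uint32_t a, uint32_t b)
{
    /* add the low 7 bits of every lane; a carry can reach bit 7 but not the next lane */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* add the top bit of every lane separately, discarding its carry out */
    uint32_t msb = (a ^ b) & 0x80808080u;
    return low7 ^ msb;
}
```

For example, `packed_add4(0x01020304, 0x10203040)` returns `0x11223344`, and a lane holding 0xFF plus 0x01 wraps to 0x00 without disturbing its neighbour.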

Subword parallelism
MPEG example
  for (i = 0; i < 64; i++) {
      temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
      if (temp > 255)
          temp = 255;
      else if (temp < 0)
          temp = 0;
      destination[i] = temp;
  }
Remark: simple example without interloop dependencies

  for (i = 0; i < 64; i += 4) {
      temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
      if (temp > 255) temp = 255;
      else if (temp < 0) temp = 0;
      destination[i+0] = temp;

      temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
      if (temp > 255) temp = 255;
      else if (temp < 0) temp = 0;
      destination[i+1] = temp;

      temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
      if (temp > 255) temp = 255;
      else if (temp < 0) temp = 0;
      destination[i+2] = temp;

      temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
      if (temp > 255) temp = 255;
      else if (temp < 0) temp = 0;
      destination[i+3] = temp;
  }

  /* = quadavg */
  temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
  temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
  temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
  temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

  /* = dspuquadaddui */
  temp0 += idct(i+0);
  if (temp0 > 255) temp0 = 255;
  else if (temp0 < 0) temp0 = 0;
  temp1 += idct(i+1);
  if (temp1 > 255) temp1 = 255;
  else if (temp1 < 0) temp1 = 0;
  temp2 += idct(i+2);
  if (temp2 > 255) temp2 = 255;
  else if (temp2 < 0) temp2 = 0;
  temp3 += idct(i+3);
  if (temp3 > 255) temp3 = 255;
  else if (temp3 < 0) temp3 = 0;

  destination[i+0] = temp0;
  destination[i+1] = temp1;
  destination[i+2] = temp2;
  destination[i+3] = temp3;

Will embedded CPUs and DSPs converge?
• Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
• Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions VLIW
• good balance between hw and sw
• between efficiency (ILP) and cost
• fundamental problems: code size, interruptability