Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman

Post on 21-Dec-2015



Processor Architectures and Program Mapping

Programmable Digital Signal Processors

5kk10 TU/e

Henk Corporaal

Jef van Meerbergen

Bart Mesman


04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman


Topic 2: Programmable Digital Signal Processors

• real-time worst-case processing = need for more compute power
  $\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \cdot \frac{\text{cycles}}{\text{instr}} \cdot \frac{\text{sec}}{\text{cycle}}$
  CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high-level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
• benchmarking (Berkeley Design Technology Inc. (BDTi)) (compare to SPECint benchmarks for CPUs)

Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM

Sum of products Σ c(i) * x(i) = basic operation for correlation, filtering, spectral analysis ... (linear expressions)

Goal = 1 cycle per iteration

[Figure: multiplier-accumulator datapath. Labels: inputs c(i) and x(i), MPY (Booth, Wallace, ...), P_reg (PR), ADDER, ACR, clock, control]

Modifications
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision

DSP data types

• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage of FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage of FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
  • no problem for FP, but a problem for integer
  • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers? 0.9 x 0.9 = 0.81

DSP data types

• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART Designer & SystemC)
• example fix <8,3>: 1 1 1 0 1 1 0 1 = -19/8 = -2.375 (scale factor 1/8)
  bit weights: -2^4 2^3 2^2 2^1 2^0 2^-1 2^-2 2^-3
  (negative weight on the leftmost bit: 2's complement; p spans the whole word, q the bits to the right of the binary point)
• quantization error
• if q=0 then integer, e.g. int <8,0>
• if q=p-1 then fractional, e.g. int <8,7>
• same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
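The fix <8,3> example can be checked with a few lines of Python. This is only a sketch: the helper name `fix_to_value` is made up for illustration, and it assumes that p is the word length and q the number of bits to the right of the binary point, as the example suggests.

```python
def fix_to_value(bits: str, q: int) -> float:
    """Interpret a two's-complement bit string as fix<p,q>, with p = len(bits)."""
    p = len(bits)
    raw = int(bits, 2)
    if bits[0] == "1":
        # The leftmost bit carries the negative weight -2**(p-1).
        raw -= 1 << p
    # The scale factor is 2**-q (1/8 for q = 3).
    return raw / (1 << q)


if __name__ == "__main__":
    word = "11101101"
    for q in (0, 3, 7):
        # Same bit pattern, different position of the binary point.
        print(f"fix<8,{q}>: {fix_to_value(word, q)}")
```

The same integer hardware produces all three readings; only the interpretation of the binary point changes, which is why one ALU can serve every fix <8,q> format.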

DSP data types

Example: Int <8,3> multiplied by Int <8,4>
    1 1 1 0 1 1 0 1                    = -19/8       Int <8,3>
  x 0 1 1 0 0 0 0 1                    = 97/16       Int <8,4>
  = 1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1    = -1843/128

General form
    s x x x  *  s y y y  =  s s z z z z z z
    => s z z z z z z 0 if FRCT = 1

Some processors (C54) have special instructions for fractional numbers (and symmetric number domain -2^(n-1) … 2^(n-1)).
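The multiplication example and the effect of the FRCT bit can be reproduced in software. The sketch below is an illustration only: `to_signed` and `multiply` are hypothetical helper names, and the FRCT behaviour is modelled simply as a left shift of the product by one bit, which is what the pattern "s s z z z z z z => s z z z z z z 0" shows.

```python
def to_signed(raw: int, bits: int) -> int:
    """Read the low `bits` bits of raw as a two's-complement number."""
    raw &= (1 << bits) - 1
    return raw - (1 << bits) if raw >> (bits - 1) else raw


def multiply(a_raw: int, b_raw: int, n: int = 8, frct: bool = False) -> int:
    """n-bit by n-bit signed multiply returning the 2n-bit product pattern.

    With frct=True the product is shifted left by one bit, which drops the
    duplicated sign bit and appends a zero at the right.
    """
    product = to_signed(a_raw, n) * to_signed(b_raw, n)
    if frct:
        product <<= 1
    return product & ((1 << (2 * n)) - 1)


if __name__ == "__main__":
    p = multiply(0b11101101, 0b01100001)          # -19/8 times 97/16
    print(f"{p:016b}", to_signed(p, 16) / 2**7)   # value with 3 + 4 = 7 fraction bits
    half = 0x40                                    # 0.5 in the fractional format <8,7>
    f = multiply(half, half, frct=True)
    print(f"{f:016b}", to_signed(f, 16) / 2**15)  # 0.25 with 15 fraction bits
```

With the shift, the upper half of the product is again a fractional number in the same format as the inputs, so the result can be stored back without rescaling.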

DSP data types

• continue (after multiplication) with msb only
  • represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
• more efficient solution
  • continue with msb + lsb
  • sum-of-product operations generate accumulative noise at 32nd vs. 16th bit
• still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit

[Figure: multiplier-accumulator datapath extended with a SHIFT / ROUND / TRUNCATE unit. Labels: inputs c(i) and x(i), MPY (Booth, Wallace, ...), P_reg (PR), ADDER, ACR, clock, control]

Quantization of x to xQ: rounding, value truncation, magnitude truncation

mode                 | x = -0.25 (1 1 1 . 1 1)     | x = -0.75 (1 1 1 . 0 1)
rounding             | + 0 0 0 . 1 = 0 0 0  (0)    | + 0 0 0 . 1 = 1 1 1  (-1)
value truncation     | = 1 1 1  (-1)               | = 1 1 1  (-1)
magnitude truncation | + 0 0 1 . = 0 0 0  (0)      | + 0 0 1 . = 0 0 0  (0)

[Figure: quantization characteristics xQ versus x for the three modes]
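The three quantization modes can be written as integer operations on the raw fixed-point word. This is a sketch under assumptions: the function names are made up, the input is a Python integer holding the raw two's-complement value, and magnitude truncation adds one LSB less than 1.0 for negative inputs (rather than the full "0 0 1 ." shown in the example) so that exact negative integers are left unchanged.

```python
def rounding(raw: int, frac_bits: int) -> int:
    """Add half an LSB of the result, then drop the fraction bits."""
    return (raw + (1 << (frac_bits - 1))) >> frac_bits


def value_truncation(raw: int, frac_bits: int) -> int:
    """Drop the fraction bits; an arithmetic shift rounds towards minus infinity."""
    return raw >> frac_bits


def magnitude_truncation(raw: int, frac_bits: int) -> int:
    """Drop the fraction bits so that the result moves towards zero."""
    if raw < 0:
        raw += (1 << frac_bits) - 1
    return raw >> frac_bits


if __name__ == "__main__":
    # Two fraction bits: -0.25 is raw -1, -0.75 is raw -3.
    for raw in (-1, -3):
        print(raw / 4, rounding(raw, 2), value_truncation(raw, 2),
              magnitude_truncation(raw, 2))
```

For the two sample values the results agree with the table: rounding gives 0 and -1, value truncation gives -1 twice, and magnitude truncation gives 0 twice.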

[Figure: three overflow characteristics: saturation, zeroing, sawtooth]
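The three characteristics named here can be sketched as functions that map an out-of-range result back into an n-bit two's-complement range. The definitions below follow the usual meaning of the terms and are an assumption, since the text gives only the names: saturation clamps to the nearest limit, zeroing replaces an overflowed result by 0, and sawtooth wraps around as plain two's-complement arithmetic does. All function names are hypothetical.

```python
def limits(bits: int) -> tuple[int, int]:
    """Smallest and largest value of a bits-wide two's-complement word."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def saturation(value: int, bits: int) -> int:
    lo, hi = limits(bits)
    return max(lo, min(hi, value))


def zeroing(value: int, bits: int) -> int:
    lo, hi = limits(bits)
    return value if lo <= value <= hi else 0


def sawtooth(value: int, bits: int) -> int:
    lo, _ = limits(bits)
    return (value - lo) % (1 << bits) + lo


if __name__ == "__main__":
    for v in (100, 130, -130):
        print(v, saturation(v, 8), zeroing(v, 8), sawtooth(v, 8))
```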

Σ c(i) * x(i): goal = 1 cycle per iteration

[Figure: three memory architectures]
• Von Neumann (sequential): one prog/data memory + EXU
• Harvard: prog mem. + data mem. + EXU
• Modified Harvard: prog mem. + data mem. 1 + data mem. 2 + EXU

[Figure: DSP architecture. Controller: PC (+1, interrupt address, stack, reset), program memory, IR, control bus. Datapath: RAM_A with ACU_A, AR_A and DR_A; RAM_B with ACU_B, AR_B and DR_B; MAC; Rfile]

[Figure: FIR filter with delay line x1 ... x5 (z^-1 elements), coefficients c1 ... c5, multipliers and an adder producing y = Σ ci * xi; outer time loop, inner filter loop over i]

• How to update the delay line?
• 1 cycle/tap?

Solution 1: block move in memory

Memory location | output sample 1 | output sample 2 | output sample 3
1               | x1              | x2              | x3
2               | x2              | x3              | x4
3               | x3              | x4              | x5
4               | x4              | x5              | x6
5               | x5              | x6              | x7

2 possibilities
• complete move after every output sample is calculated
  • read and write the data twice
• move after read of every datum separately
  • write the data twice
  • need for a special instruction (TMS320)
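A minimal software sketch of the block-move solution, assuming the first possibility (the complete delay line is moved once per output sample). The function name `fir_block_move` and the choice to put the newest sample at index 0 are illustrative assumptions.

```python
def fir_block_move(samples, coeffs):
    """FIR filter whose delay line is physically shifted for every output sample."""
    taps = len(coeffs)
    delay = [0] * taps
    outputs = []
    for x in samples:
        # Block move: every stored sample is read and written once more.
        for k in range(taps - 1, 0, -1):
            delay[k] = delay[k - 1]
        delay[0] = x
        # Sum of products over the whole delay line.
        outputs.append(sum(c * d for c, d in zip(coeffs, delay)))
    return outputs


if __name__ == "__main__":
    # An impulse at the input reproduces the coefficients at the output.
    print(fir_block_move([1, 0, 0, 0], [1, 2, 3]))
```

The inner shifting loop costs one extra read and write per tap, which is the overhead the bullets refer to.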

Solution 2: indirect addressing

Memory location | output sample 1 | output sample 2 | output sample 3 | output sample 4 | output sample 5
1               | x1              |                 |                 |                 | x9
2               | x2              | x2              |                 |                 |
3               | x3              | x3              | x3              |                 |
4               | x4              | x4              | x4              | x4              |
5               | x5              | x5              | x5              | x5              | x5
6               |                 | x6              | x6              | x6              | x6
7               |                 |                 | x7              | x7              | x7
8               |                 |                 |                 | x8              | x8

• use of a pointer to mark the beginning of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
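The pointer-based solution can be sketched as a circular buffer. This is an illustration under assumptions: the name `fir_circular` is made up, the buffer is exactly as long as the filter, and the pointer is moved backwards by one position per sample so that it always marks the newest sample.

```python
def fir_circular(samples, coeffs):
    """FIR filter using a pointer with modulo addressing instead of moving data."""
    taps = len(coeffs)
    buf = [0] * taps
    ptr = 0                       # marks the beginning of the delay line
    outputs = []
    for x in samples:
        # Update the pointer, not the data; the modulo keeps it inside the buffer.
        ptr = (ptr - 1) % taps
        buf[ptr] = x              # the newest sample overwrites the oldest one
        acc = 0
        for i, c in enumerate(coeffs):
            acc += c * buf[(ptr + i) % taps]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    print(fir_circular([1, 0, 0, 0], [1, 2, 3]))
```

Only one memory write per input sample remains. Without the modulo the pointer would walk through the whole memory, which is the problem the bullets mention.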

IIR filter

[Figure: IIR filter with input x, output y, delay line y1 ... y5 (z^-1 elements) and coefficients c1 ... c4; memory map holding y1 ... y5 with a pointer]

2 filters

time loop
  for i = 1..itaps: Σ c(i) * x(i)
  for j = 1..jtaps: Σ d(j) * y(j)

[Figure: memory maps. Two segments: x1 ... x5 (pntr 1, modulo range 1) and y1 ... y5 (pntr 2, modulo range 2). One segment: y1 ... y5 followed by x1 ... x5 (pntr 1, one modulo range)]

2 memory segments => 1 segment

Mapping strategy

• define positions in RAM
  constraint: vars that form a delay line in consecutive places
• find a schedule
  example: c1 => c2 => c3 => c4 => c5
• define ACU instructions

[Figure: filter with z^-1 delays, signals x1, x2, x3 and y1, y2, y3, coefficients c1 ... c5; memory map y1, y2, x1/y3, x2, x3 with pntr 1 and modulo range]

[Figure: filter with delay line x1 ... x8 (z^-1 elements), coefficients c1 ... c8, multipliers and two adders producing the outputs ye and yo]

ACU architecture and instruction set

[Figure: ACU with registers A and S, a modulo unit and an output to RAM]

Instruction | Output | reg A | reg S
Read_A      | A      | A     | S
Read_S      | S      | A     | S
incA        | A+1    | A+1   | S
decA        | A-1    | A-1   | S
Step        | A+S    | A+S   | S
Inc_step    | S+1    | A     | S+1

Modulo can be implemented as a mask operation if the size is 2^k
example: 16 = 10 000, 23 = 10 111; mask = hold the upper bits (10)

Mapping example

[Figure: filter and memory map of the mapping strategy (y1, y2, x1/y3, x2, x3), memory addresses 16 ... 23 as modulo range, pntr]

Assume initialisation: A = pointer = 17, S = -2

ACU instruction | address
read_A          | 17
incA            | 18
incA            | 19
incA            | 20
incA            | 21
step            | 19
dec             | 18

step and dec prepare the new pointer for the next iteration
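The ACU instruction set and the address sequence of the mapping example can be replayed with a small software model. This is a hypothetical sketch, not the hardware: the class name `ACU`, its `execute` method and the constructor arguments are made up, and the modulo is implemented as the mask operation described for ranges of size 2^k (here addresses 16 to 23, so the upper bits are held and the lower three bits wrap).

```python
class ACU:
    """Address computation unit with registers A (address) and S (step)."""

    def __init__(self, a: int, s: int, base: int, size: int):
        # The mask trick needs a power-of-two range aligned to its size.
        assert size & (size - 1) == 0 and base % size == 0
        self.a, self.s = a, s
        self.base, self.mask = base, size - 1

    def _wrap(self, addr: int) -> int:
        # Hold the upper bits (base) and let the lower bits wrap around.
        return self.base | (addr & self.mask)

    def execute(self, op: str) -> int:
        """Run one instruction and return the value on the output."""
        if op == "read_A":
            return self.a
        if op == "read_S":
            return self.s
        if op == "incA":
            self.a = self._wrap(self.a + 1)
            return self.a
        if op == "decA":
            self.a = self._wrap(self.a - 1)
            return self.a
        if op == "step":
            self.a = self._wrap(self.a + self.s)
            return self.a
        if op == "inc_step":
            self.s += 1
            return self.s
        raise ValueError(f"unknown ACU instruction: {op}")


if __name__ == "__main__":
    acu = ACU(a=17, s=-2, base=16, size=8)
    program = ["read_A", "incA", "incA", "incA", "incA", "step", "decA"]
    print([acu.execute(op) for op in program])
```

After the program the A register holds 18, the start of the delay line for the next iteration.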

Addressing modes

mode         | instruction       | effect
• register   | ADD R4, R3        | R[R4] = R[R4] + R[R3]
• immediate  | ADD R4, #3        | R[R4] = R[R4] + #3
• direct     | ADD R4, (100)     | R[R4] = R[R4] + Mem[100]
• indirect   | ADD R4, (R3)      | R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec | ADD R4, (R3)±     | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
• indexed    | ADD R4, (R3±R2)   | R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)

Addressing modes: extra for DSP

• 8 ARs (address or auxiliary registers) available
• extra indirect modes
  • circular
    *ARn ± %        post inc/dec by 1, circular
    *ARn ± AR0 %    post inc/dec by AR0, circular
  • bit reverse
    *ARn ± AR0 B    post inc/dec by AR0, bit reversed
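Bit-reversed post-increment is typically used to fetch FFT data in shuffled order. The sketch below models it as an addition whose carry propagates from left to right (implemented by reversing the bits, adding, and reversing back). It is an illustration under assumptions: the function names are made up, and the step register is assumed to hold half the FFT size, which is the common setting for this mode.

```python
def reverse_bits(value: int, width: int) -> int:
    """Mirror the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(addr: int, step: int, width: int) -> int:
    """Add step to addr with the carry running from MSB to LSB."""
    mask = (1 << width) - 1
    total = (reverse_bits(addr, width) + reverse_bits(step, width)) & mask
    return reverse_bits(total, width)


def bit_reversed_sequence(n: int) -> list[int]:
    """Addresses visited by n post-increments with step n/2 (n a power of two)."""
    width = n.bit_length() - 1
    addr, sequence = 0, []
    for _ in range(n):
        sequence.append(addr)
        addr = reverse_carry_add(addr, n // 2, width)
    return sequence


if __name__ == "__main__":
    print(bit_reversed_sequence(8))
```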

Incorporation of an ALU

• regular data-flow algorithms ==> MAC
  filtering, correlation, windowing etc.
• decision making ==> ALU
  • sorting filters (e.g. median filters)
  • interpolation (e.g. sqrt)
  • absolute value calculation
  • logarithmic conversion
  • finite field arithmetic (e.g. Galois field)
  • Viterbi
  • VLC, VLD
  • division

[Figure: DSP architecture extended with an ALU. Controller: PC (+1, interrupt address, stack, reset), program memory, IR, control bus. Datapath: RAM_A with ACU_A, AR_A and DR_A; RAM_B with ACU_B, AR_B and DR_B; MAC and ALU; Rfile]

Bus-oriented instruction encoding

code | fields
00   | ALU           SX SY   DX DY   RF   ACU A B
01   | MULT          SX SY   DX DY   RF   ACU A B
10   | Imm. data             DX DY   RF   ACU A B
11   | Next address  BR Cond              ACU A B

Σ c(i) * x(i): first solution

Schedule (resources: ALU, MPY-ACC, RAM, ACU; one row per clock cycle, time runs downwards)

LABEL | operations
      | MPY-ACC: acc = 0; ACU: init (i=0)
      | ALU: init counter
loop  | ACU: incr (=i+1)
      | RAM: read x(i)
      | MPY-ACC: acc(i) = acc(i-1) + x(i)*c(i)
      | ALU: dec counter
      | branch to loop if counter > 0
      | nop

• 6 clock cycles/sample
• limit pipelines in the controller
• not shown: coefficient RAM + ACU

Loopfolding (software pipelining)

Original loop
  for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))

Folded loop
  for i = 2 to n
    b(i) = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))

[Figure: chain a(i) -> f -> b(i) -> g -> c(i) -> h -> d(i), drawn for iterations 0, 1 and 2, and the folded loop body in which f, g and h work in parallel on iterations i, i-1 and i-2]
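The equivalence of the original and the folded loop can be checked in software. The sketch below is illustrative only: f, g and h are arbitrary stand-in functions, the names `original` and `folded` are made up, and the preamble and postamble needed to fill and drain the pipeline are written out explicitly (at least two iterations are assumed).

```python
def f(a): return a + 1
def g(b): return b * 2
def h(c): return c - 3


def original(a):
    """Each iteration runs f, g and h one after the other."""
    d = []
    for x in a:
        d.append(h(g(f(x))))
    return d


def folded(a):
    """Software-pipelined version; the three kernel statements are independent."""
    n = len(a)
    assert n >= 2
    b, c, d = [None] * n, [None] * n, [None] * n
    # Preamble: fill the pipeline.
    b[0] = f(a[0])
    b[1] = f(a[1]); c[0] = g(b[0])
    # Kernel: f, g and h work on iterations i, i-1 and i-2.
    for i in range(2, n):
        b[i] = f(a[i])
        c[i - 1] = g(b[i - 1])
        d[i - 2] = h(c[i - 2])
    # Postamble: drain the pipeline.
    c[n - 1] = g(b[n - 1]); d[n - 2] = h(c[n - 2])
    d[n - 1] = h(c[n - 1])
    return d


if __name__ == "__main__":
    data = list(range(6))
    print(original(data) == folded(data))
```

Inside the kernel no statement uses a value produced in the same iteration, so on hardware with parallel resources the three operations can share one clock cycle.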

Loopfolding (software pipelining): Σ c(i) * x(i)

LABEL | ALU                           | MPY-ACC                               | RAM         | ACU
      |                               | acc(i-1) = 0                          |             | init (i=1)
      | init counter                  |                                       | read x(i)   | inc (=i+1)
loop  |                               | acc(i) = acc(i-1) + x(i)*c(i)         | read x(i+1) | incr (=i+2)
      | dec counter                   |                                       |             |
      | branch to loop if counter > 0 |                                       |             |
      | nop                           |                                       |             |
      |                               | acc(n-1) = acc(n-2) + x(n-1)*c(n-1)   | read x(n)   |
      |                               | acc(n) = acc(n-1) + x(n)*c(n)         |             |

• pre- and postamble
• 4 clock cycles/sample

Hardware support for loop control: Σ c(i) * x(i)

LABEL      | ALU          | MPY-ACC                               | RAM         | ACU
           |              | acc(i-1) = 0                          |             | init (i=1)
           | init counter |                                       | read x(i)   | inc (=i+1)
repeat n-2 |              | acc(i) = acc(i-1) + x(i)*c(i)         | read x(i+1) | incr (=i+2)
           |              | acc(n-1) = acc(n-2) + x(n-1)*c(n-1)   | read x(n)   |
           |              | acc(n) = acc(n-1) + x(n)*c(n)         |             |

• 1 clock cycle/sample
• repeat instruction and repeat block

Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM

TMS320C5000

[Figure: TMS320C5000 datapath. Labels: T register, sign control blocks, multiplier (17*17), fractional MUX, adder (40), ZERO / SAT / ROUND, accumulators A (40) and B (40), ALU (40), input MUXes, barrel shifter, MSW/LSW select, COMP, TRN, TC; buses C, D, E and P]

Motorola 56K family

[Figure: Motorola 56K block diagram. X memory and Y memory (each 256-by-24-bit RAM and 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; address ALU with X, Y and P address buses and external address switch (16-bit address bus); internal and external data-bus switches (24-bit data bus) connecting X data, Y data, P data and global data; data ALU with 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller; clock and interrupt inputs; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control) with I/O ports. Further width labels in the figure: 2 bits, 3 bits, 7 bits, 24 bits]

R.E.A.L.

[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory (16-bit buses); program memory (Z data, 16-bit bus); buses for X, Y and Z data; two address computation units; program control unit; instruction decoder (96-bit instructions); two 16-by-16-bit multipliers with inputs X, Y0, Y1 and products P0, P1; scale units; two 40-bit arithmetic-logic units with saturation; four 40-bit accumulators; saturation/scale; shift]

RD16021 DSP

memories          | not included
process           | 0.35 µm, 5M
voltage           | 2.7-3.6 V
frequency         | 39 MHz (Tj = 85 °C, 2.7 V, wcp)
area              | 3.9 mm²
power dissipation | 2.1 mW/MHz

Instruction cycle counts for BDTi benchmarks

Function            | DSPgroup OAK | Motorola DSP561xx | ADI ADSP-218x | Lucent DSP16xx | TI TMS320C54x | TI 320C62xx | Lucent DSP16210 | Philips RD16020
Real block FIR      | 835          | 925               | 841           | 1240           | 684           | 334         | 780             | 448
Single sample FIR   | 21           | 23                | 22            | 26             | 18            | 17          | 16              | 20
Complex block FIR   | 3018         | 3043              | 3122          | 3123           | 2922          | 1294        | 1681            | 1470
IIR (8 sections)    | 51           | 45                | 43            | 65             | 44            | 30          | 38              | 37
Vector dot product  | 43           | 43                | 43            | 47             | 41            | 29          | 23              | 43
Vector add          | 122          | 85                | 83            | 123            | 61            | 36          | 43              | 63
Convolution encoder | 506          | 772               | 818           | 888            | 528           | 188         | 464             | 176
FSM                 | 284          | 375               | 198           | 415            | 455           | 147         | 301             | 167
256 pnt FFT         | 16514        | 12148             | 10633         | 21035          | 13234         | 4225        | 9016            | 5797

Rows with only seven of the eight entries (column positions not recoverable):
LMS adaptive        | 90 64 59 101 58 33 55
Vector maximum      | 41 86 128 120 111 39 40

Annotations: 16 taps, 40 samples, 8 biquads

Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)

[Figure: compiler flow]
source
  => Front end: lexical analysis, syntax analysis, semantic analysis
  => intermediate machine-independent representation
  => Code generation: code selection, register allocation, scheduling (1 instr = parallel (//) ops, order of instr)
  => code

Intermediate machine-independent representation

  t1 := a * b
  t2 := c + d
  t3 := t1 + c
  out := t2 * t3

[Figure: data-flow graph of these four statements (inputs a, b, c, d; intermediate values t1, t2, t3) and a control-flow graph of basic blocks BBi, BBj, BBk]

Code selection

Intermediate representation => match & cover => RTP

A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]

Notation
  ar := ar | ax + ay | af
means
  ar := ar + ay or ar := ar + af or ar := ax + ay or ar := ax + af
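The compact RTP notation is a cross product of the alternatives for each operand. A small sketch that expands it, with a hypothetical helper name `expand_rtp` and the alternatives passed as plain lists:

```python
from itertools import product


def expand_rtp(dests, lefts, op, rights):
    """List every register transfer described by one compact RTP."""
    return [f"{d} := {l} {op} {r}" for d, l, r in product(dests, lefts, rights)]


if __name__ == "__main__":
    # ar := ar | ax + ay | af
    for transfer in expand_rtp(["ar"], ["ar", "ax"], "+", ["ay", "af"]):
        print(transfer)
```

A pattern with several destinations and operand choices expands the same way, for example two destinations, three left operands and two right operands give twelve transfers.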

Code selection example

[Figure: ADSP datapath [Analog Devices]. ALU (+, -) with input registers ax and ay (ports x and y), result register ar and feedback register af; MAC (+, -, *) with input registers mx and my (ports x and y), result register mr and feedback register mf; d memory and p memory]

Examples of RTPs on the ADSP-210 datapath

MAC
  mr | mf := mr + (ar | mr | mx) * (my | mf)
  mr | mf := mr - (ar | mr | mx) * (my | mf)
  mr | mf := (ar | mr | mx) * (my | mf)

ALU
  ar | af := (mr | ar | ax) + (ay | af)
  ar | af := (mr | ar | ax) - (ay | af)

Example of code selection = covering of intermediate representation with RTPs

[Figure: data-flow graph of the example (a * b = t1, c + d = t2, t1 + c = t3, t2 * t3) covered by three patterns labelled 1, 2 and 3]

Register transfers in the covering
  mx := dmem
  my := pmem
  ax := dmem
  ay := pmem
  mr := dmem
  ar := ax + ay
  my := ar
  mr := mr * my
  mr := mr + (mx * my)

Problems

• local decisions which have a global impact
• phase coupling: example
  • asap schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions

phase coupling: example 1

[Figure: (a) datapath with registers R1, R2, R3 and units alu1 and alu2; (b) data-flow graph with operations 1 to 4; (c) the same graph with an extra Move operation]

phase coupling: example 2

Example of coupling between scheduling and register allocation [Mesman]

[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, shown twice; the second graph applies if u and v share the same register]

phase coupling: discussion

[Figure: traditional code generation (heuristic): the application goes through code generation and the result is checked against the constraints (OK? yes / no); labels: feasible space, design space seen by code generator] [Mesman]

Phase coupling is difficult because of many constraints originating from irregular interconnect, special purpose registers and non-orthogonal microcode.

phase coupling: discussion

It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler.

Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word

Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  • principles
  • central register file + example TM
  • clustered VLIW + example C6
  • subword parallelism or SIMD

VLIW principles

• multiple parallel FUs, possibly different and pipelined
• pipelining is exposed to the compiler = no interlock mechanism
• load-store architecture: all operands fetched from/stored in register files, possibly multi-ported
• each FU can receive an instruction every clock cycle
• one instruction = many RISC instructions
• each RISC instruction = one issue slot
• no dependencies between different RISC instructions = orthogonal microcode = compiler friendly

VLIW architecture

[Figure: one register file feeding 25 execution units (Exec unit 1 ... Exec unit 25), each controlled by its own issue slot (Issue slot 1 ... Issue slot 25) of the instruction; the instruction also supplies the R&W addresses]

• long instruction words, e.g. (3*7+4)*25 = 625
• many ports on the register file, e.g. 75
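The two example numbers follow from simple arithmetic. The sketch below recomputes them under an assumed reading that the text does not spell out: each of the 25 issue slots holds 3 register addresses of 7 bits plus 4 opcode bits, and each slot needs 3 register file ports (two reads and one write). The function names and parameter names are hypothetical.

```python
def instruction_width(issue_slots: int, reg_fields: int,
                      reg_addr_bits: int, opcode_bits: int) -> int:
    """Total VLIW instruction width in bits."""
    bits_per_slot = reg_fields * reg_addr_bits + opcode_bits
    return issue_slots * bits_per_slot


def register_file_ports(issue_slots: int, ports_per_slot: int = 3) -> int:
    """Ports on a central register file if every slot has its own ports."""
    return issue_slots * ports_per_slot


if __name__ == "__main__":
    print(instruction_width(25, 3, 7, 4))   # (3*7+4)*25
    print(register_file_ports(25))          # 3*25
```

Both figures grow linearly with the number of issue slots, which is what motivates the later move to fewer slots and clustered register files.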

VLIW architecture: central Register File

[Figure: one central register file shared by nine execution units (Exec unit 1 ... Exec unit 9) that are grouped under three issue slots (Issue slot 1, Issue slot 2, Issue slot 3)]

TM1000 DSPCPU

[Figure: five execution units on a register file (128 regs, 32 bit, 15 ports); instruction register (5 issue slots); PC; instruction cache (32 kB); data cache (16 kB)]

Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

TriMedia TM32A processor

[Figure: chip floorplan with D-cache, I-cache, TAG memories, SEQUENCER / DECODE, I/O INTERFACE and the functional units IFMUL1, IFMUL2 (FLOAT), DSPMUL1, DSPMUL2, FTOUGH1, SHIFTER0, SHIFTER1, ALU0 ... ALU4, FCOMP2, FALU0, FALU3, DSPALU0, DSPALU2]

• 0.18 micron
• area: 16.9 mm²
• 200 MHz (typ)
• 1.4 W
• 7 mW/MHz (MIPS = 0.9 mW/MHz)

[Figure: Synthesised RF area (CMOS18, 64 bit). x-axis: number of ports (0 to 20); y-axis: area in mm² (0 to 9); curves for 32, 64 and 128 registers after P&R, each with a polynomial fit]

Area, speed and power dissipation scale more than linearly with the number of ports

VLIW architecture: clustered Register Files

[Figure: three clusters. Register file 1 with Exec unit 1, Exec unit 2 and a copy unit; Register file 2 with Exec unit 3, Exec unit 4 and a copy unit; Register file 3 with Exec unit 5, Exec unit 6 and a copy unit]

Page 59: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


• REGISTER FILE 1: FMUL, FADD
• REGISTER FILE 2: IMUL, IADD
• REGISTER FILE 3: IMUL, IADD

Example instruction word:

FMUL r1,r2,r3    IADD r1,r2,r3    IMUL r1,r2,r3

VLIW architecture: clustered Register Files

Page 60: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


REGISTER FILE I0
• FU00: IADD_01, IMOV_01, ...
• FU01: IADD_00, LAND_00, ...
• FU02: IMUL_00, SHFT_00, ...

REGISTER FILE I1
• FU10: IADD_10, IMOV_10, ...
• FU01: IADD_11, LAND_10, ...
• FU02: IMUL_10, SHFT_10, ...

VLIW architecture: clustered Register Files

Page 61: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


• performance loss (more instructions) compared to a central Register File, due to the extra cycle for each copy
  • 15-20 % for 2 clusters
  • 20-30 % for 4 clusters
• limited scalability
  • not too many clusters
  • not too many registers within each cluster (too many RF ports)
• the compiler adds copy operations, so the dependence graph changes during scheduling

VLIW architecture: clustered Register Files

Discussion
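The copy overhead can be illustrated with a small sketch that counts how many inter-cluster copies a given cluster assignment requires. The `struct op` layout and the function name are hypothetical, and the count is deliberately simple: one copy per operand that lives in another cluster, with no sharing of a copy between several consumers.

```c
#include <stddef.h>

/* One operation in a dependence graph (hypothetical layout for illustration). */
struct op {
    int cluster;   /* cluster whose register file receives the result */
    int src[2];    /* indices of the producing ops, or -1 if unused */
};

/* Count inter-cluster copies: one for every operand that is produced
   in a different cluster than the operation consuming it. */
size_t count_copies(const struct op *ops, size_t n)
{
    size_t copies = 0;
    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < 2; k++) {
            int producer = ops[i].src[k];
            if (producer >= 0 && ops[producer].cluster != ops[i].cluster)
                copies++;
        }
    }
    return copies;
}
```

Every copy counted here is an extra node and an extra cycle of latency on the affected dependence edge, which is why the scheduler sees a different graph once clusters have been assigned.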

Page 62: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


TMS320C62x VelociTI (fixed point)

[Figure: two datapaths, each with a register file of 16 registers (0-15, 32 bits). Datapath 1 holds units L1, S1, M1 and D1; datapath 2 holds D2, M2, S2 and L2. Every unit has dst, src1 and src2 ports, and some units also show src_up and dst_up ports. D1 and D2 supply the store/load addresses; store/load data paths connect the register files to memory.]

Unit functions:
• L: int add, logical, bit count
• S: int add, logical, bit manipulation, shift, constant, branch
• M: int mult (16 => 32)
• D: int add, load/store

Page 63: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


• parallelism (fetch-decode-execute) (max 8 issue slots)
• pipeline critical sections (ALU 1 cc, mult 2 cc, 200 MHz)
• RISC (simple, atomic, independent instructions); performance comes from the compiler (pipelining, unrolling)
• load-store
• orthogonal (2 identical DPs, add on 6 units)
• deterministic (no interlock)
• conditional instructions (= guarding)
• instruction packing

VelociTI principles
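Guarding can be illustrated in plain C. The sketch below shows a clipping operation twice: once with branches and once in if-converted form, where every assignment is controlled by a predicate, which is how a compiler can map it onto conditional instructions. The function names are hypothetical, and the C conditional expressions only model guarded instructions; they are not VelociTI code.

```c
/* Branching version: the control flow depends on the data. */
int clip_branch(int x)
{
    if (x > 255)
        x = 255;
    else if (x < 0)
        x = 0;
    return x;
}

/* If-converted version: both predicates are computed first,
   then each assignment executes under its guard. */
int clip_guarded(int x)
{
    int p_hi = x > 255;      /* guard for the upper clip */
    int p_lo = x < 0;        /* guard for the lower clip */
    x = p_hi ? 255 : x;      /* models: [p_hi] x = 255 */
    x = p_lo ? 0 : x;        /* models: [p_lo] x = 0   */
    return x;
}
```

The guarded form contains no jumps, so its operations can be scheduled freely into parallel issue slots.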

Page 64: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


Classical encoding: fetching many nops (n = nop)

Fully serial:
n n A n n n n n
n B n n n n n n
n n n n n C n n
n n n n n D n n
n n n E n n n n
F n n n n n n n
n n n n n n G n
n n n n n n n H

Mixed serial/parallel:
n B A n n C n n
n n n E n D n n
F n n n n n n n
n n n n n n G H

Fully parallel:
A B C D E F G H

VelociTI encoding (one bit per instruction)

Fully serial:
A B C D E F G H
0 0 0 0 0 0 0 0

Mixed serial/parallel:
A B C D E F G H
1 1 0 1 0 0 1 0

Fully parallel:
A B C D E F G H
1 1 1 1 1 1 1 0
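The bit patterns on this slide can be decoded mechanically. The sketch below splits a fetch packet into execute packets, assuming that a 1 means "the next instruction runs in parallel with this one" and a 0 closes the current execute packet. The function name and the string representation of the bits are hypothetical choices for illustration.

```c
#include <stddef.h>

/* Split a fetch packet of n instructions into execute packets.
   pbits[i] is '0' or '1' for instruction i.
   packet_len[] receives the size of each execute packet.
   Returns the number of execute packets. */
size_t split_execute_packets(const char *pbits, size_t n, size_t *packet_len)
{
    size_t packets = 0;
    size_t current = 0;
    for (size_t i = 0; i < n; i++) {
        current++;
        /* A 0 bit, or the end of the fetch packet, closes the packet. */
        if (pbits[i] == '0' || i + 1 == n) {
            packet_len[packets++] = current;
            current = 0;
        }
    }
    return packets;
}
```

For the mixed pattern 1 1 0 1 0 0 1 0 this yields four execute packets of sizes 3, 2, 1 and 2, matching the four rows of the classical mixed encoding while storing only the eight real instructions and no nops.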

Page 65: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


Columns: (1) DSP Group OAK, (2) Motorola DSP561xx, (3) ADI ADSP-218x, (4) Lucent DSP16xx, (5) TI TMS320C54x, (6) TI 320C62xx, (7) Lucent DSP16210, (8) Philips RD16020

Function                (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)
Real block FIR          835    925    841   1240    684    334    780    448
Single sample FIR        21     23     22     26     18     17     16     20
Complex block FIR      3018   3043   3122   3123   2922   1294   1681   1470
IIR (8 sections)         51     45     43     65     44     30     38     37
Vector dot product       43     43     43     47     41     29     23     43
Vector add              122     85     83    123     61     36     43     63
Convolution encoder     506    772    818    888    528    188    464    176
FSM                     284    375    198    415    455    147    301    167
256 pnt FFT           16514  12148  10633  21035  13234   4225   9016   5797

Rows with only seven values in the source (the column of the missing entry is not recoverable):
LMS adaptive: 90 64 59 101 58 33 55
Vector maximum: 41 86 128 120 111 39 40

Instruction cycle counts for BDTi benchmarks

Page 66: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


Subword parallelism (custom operators in TM)

[Figure: the 1st input operand and the 2nd input operand are each split into byte3, byte2, byte1 and byte0; the same op is applied to each pair of bytes and produces the matching byte of the output operand]

Ex. +, -, min, max … (min => quadumin, max => quadumax, ...)

32 bits = 4 bytes are processed independently
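As a portable illustration of "4 bytes are processed independently", the sketch below emulates a 4-way unsigned byte minimum on a 32-bit word in plain C. It assumes that this is the behaviour of an operator such as quadumin; the function name is hypothetical and the loop stands in for what the hardware does in a single operation.

```c
#include <stdint.h>

/* Emulate a 4-way unsigned byte minimum: each byte lane of the result
   is the smaller of the corresponding bytes of a and b. */
uint32_t quad_umin_emulated(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;   /* byte from 1st operand */
        uint32_t y = (b >> (8 * lane)) & 0xFFu;   /* byte from 2nd operand */
        uint32_t m = x < y ? x : y;
        result |= m << (8 * lane);                /* write lane of output */
    }
    return result;
}
```

No lane ever reads or writes a neighbouring lane, which is what allows the four byte operations to share one 32-bit datapath.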

Page 67: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


Subword parallelism (custom operators in TM)

Scalar version, one byte per iteration:

    int size = 1000;
    byte out[size], in1[size], in2[size];
    for (i = 0; i < size; i++)
        out[i] = in1[i] + in2[i];

Packed version, four bytes per iteration:

    int size = 1000;
    byte out[size], in1[size], in2[size];
    for (i = 0; i < size; i += 4) {
        packet4 t1 = packet4_load(&in1[i]);   /* load 4 bytes of in1 */
        packet4 t2 = packet4_load(&in2[i]);   /* load 4 bytes of in2 */
        packet4 t3 = packet4_add(t1, t2);     /* 4 byte additions at once */
        packet4_store(&out[i], t3);           /* store 4 result bytes */
    }

+ faster execution
- rewrite effort (e.g. different types for inputs and outputs)

Typical example: graphics (4 * 32 bit floating point)
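On a processor without packed operators, the same idea can be sketched in portable C by treating a 32-bit word as four byte lanes. The version below assumes wrap-around (modulo 256) addition per byte and an array length that is a multiple of 4. The names `add4_bytes` and `add_bytes_packed` are hypothetical stand-ins for the `packet4_*` operations on the slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add four unsigned bytes at once, wrapping around within each byte.
   The low 7 bits of every byte are added normally; the top bit of each
   byte is then patched in with XOR so no carry crosses a byte boundary. */
static uint32_t add4_bytes(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

/* out[i] = in1[i] + in2[i] for n bytes, four bytes per iteration.
   Assumes n is a multiple of 4. */
void add_bytes_packed(uint8_t *out, const uint8_t *in1,
                      const uint8_t *in2, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t t1, t2, t3;
        memcpy(&t1, in1 + i, 4);    /* packed load */
        memcpy(&t2, in2 + i, 4);    /* packed load */
        t3 = add4_bytes(t1, t2);    /* packed add */
        memcpy(out + i, &t3, 4);    /* packed store */
    }
}
```

The loop runs a quarter as many iterations as the scalar version. The rewrite effort mentioned on the slide shows up here as the extra masking logic and the constraint on the array length.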

Page 68: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


for (i = 0; i < 64; i++) {
    temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
    if (temp > 255)
        temp = 255;
    else if (temp < 0)
        temp = 0;
    destination[i] = temp;
}

Subword parallelism

MPEG example

Remark: simple example without interloop dependencies

Page 69: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


for (i = 0; i < 64; i += 4) {
    temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+0] = temp;

    temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+1] = temp;

    temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+2] = temp;

    temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+3] = temp;
}

Page 70: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


/* four rounded averages = one quadavg */
temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

/* four additions with clipping to 0..255 = one dspuquadaddui */
temp0 += idct(i+0);
if (temp0 > 255) temp0 = 255;
else if (temp0 < 0) temp0 = 0;
temp1 += idct(i+1);
if (temp1 > 255) temp1 = 255;
else if (temp1 < 0) temp1 = 0;
temp2 += idct(i+2);
if (temp2 > 255) temp2 = 255;
else if (temp2 < 0) temp2 = 0;
temp3 += idct(i+3);
if (temp3 > 255) temp3 = 255;
else if (temp3 < 0) temp3 = 0;

/* four byte stores = one 32-bit store */
destination[i+0] = temp0;
destination[i+1] = temp1;
destination[i+2] = temp2;
destination[i+3] = temp3;
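The two groups of statements that the slide maps onto quadavg and dspuquadaddui can be emulated in portable C to check the mapping. The sketch assumes that quadavg computes a rounded unsigned average per byte, that dspuquadaddui adds a signed byte to an unsigned byte and clips the sum to 0..255, and that the idct values fit in a signed byte. The `emu_` function names are hypothetical.

```c
#include <stdint.h>

/* Per-byte rounded average: (a + b + 1) >> 1 in each of the four lanes. */
uint32_t emu_quadavg(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << (8 * lane);
    }
    return r;
}

/* Per-byte add of an unsigned byte (from a) and a signed byte (from b),
   with the sum clipped to the range 0..255. */
uint32_t emu_dspuquadaddui(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int x = (int)((a >> (8 * lane)) & 0xFFu);
        int y = (int)((b >> (8 * lane)) & 0xFFu);
        if (y >= 128)
            y -= 256;               /* reinterpret the byte as signed */
        int s = x + y;
        if (s > 255) s = 255;       /* clip high */
        else if (s < 0) s = 0;      /* clip low */
        r |= (uint32_t)s << (8 * lane);
    }
    return r;
}
```

With back, forward and idct each packed four values per word, one loop iteration reduces to `emu_dspuquadaddui(emu_quadavg(back4, forward4), idct4)` followed by a single 32-bit store, where `back4`, `forward4` and `idct4` are hypothetical names for the packed words.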

Page 71: Processor Architectures and Program Mapping Programmable Digital Signal Processors 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman


Will embedded CPUs and DSPs converge?

• Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
• Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions VLIW

• good balance between hw and sw
• and between efficiency (ILP) and cost
• fundamental problems: code size, interruptibility