
Processor Architectures and Program Mapping

Programmable Digital Signal Processors

5kk10TU/e

Henk Corporaal

Jef van Meerbergen

Bart Mesman

04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

2

Topic 2: Programmable Digital Signal Processors

• real-time, worst-case processing = need for more compute power

  $\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \cdot \frac{\text{cycles}}{\text{instr}} \cdot \frac{\text{sec}}{\text{cycle}}$

  CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  • e.g. for an FFT: DIT, DIF, radix 2/4, loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
  • benchmarking (Berkeley Design Technology Inc. (BDTi)) (compare to SPECint benchmarks for CPUs)


3

Outline

• architectures for programmable DSPs
  • multiplier-accumulator
  • modified Harvard architecture
  • extension with an ALU (decision making)
  • controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
  examples: C6 and TM


4

Sum of products (sum over i of c(i) * x(i)) = basic operation for correlation, filtering, spectral analysis ... linear expr.

Goal = 1 cycle per iteration

[Figure: multiplier-accumulator datapath. Inputs c(i) and x(i) feed MPY (Booth, Wallace, ...), followed by PR, ADDER and ACR; clock, P_reg and control signals.]

Modifications
• extra inputs/outputs
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision


5

DSP data types

• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantages FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
  • no problem for FP, but it is for integer
  • integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers?
  0.9 x 0.9 = 0.81


6

DSP data types

• integer and fractional numbers are a special case of fixed point: fix <p,q> (ART designer & SystemC)

  Example, fix <8,3>:

    bits      1     1    1    1    0    1     1     0     1  (msb first: 1 1 1 0 1 1 0 1)
    value     -19/8 = -2.375

    weights   -2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
    (negative weight on the msb: 2's complement)

  [Figure labels: p, q, scale factor 1/8, quantization error]

• if q = 0 then integer, e.g. int <8,0>
• if q = p-1 then fractional, e.g. int <8,7>
• Same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
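The weighting in the fix <8,3> example can be checked with a few lines of C. This is a minimal sketch, not part of the course material: the function name `fix_to_double` is made up for illustration, and it assumes p is the word length and q the number of fractional bits, as the integer (q = 0) and fractional (q = p-1) cases suggest.

```c
#include <stdint.h>

/* Interpret the low p bits of raw as a two's-complement fix<p,q> number.
   Assumption: p = word length (1..32), q = number of fractional bits (q < p). */
double fix_to_double(uint32_t raw, unsigned p, unsigned q)
{
    uint32_t mask = (p == 32) ? 0xFFFFFFFFu : ((1u << p) - 1u);
    int64_t value = (int64_t)(raw & mask);

    /* The msb carries the negative weight -2^(p-1). */
    if (value & ((int64_t)1 << (p - 1)))
        value -= (int64_t)1 << p;

    /* Scale factor 2^-q, e.g. 1/8 for q = 3. */
    return (double)value / (double)((int64_t)1 << q);
}
```

Called as `fix_to_double(0xED, 8, 3)`, the bit pattern 1 1 1 0 1 1 0 1 evaluates to -2.375; the same pattern read with q = 0 is the integer -19. Only the scale factor changes, which is why one ALU can serve all fix <8,q> types.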


7

DSP data types

        1 1 1 0 1 1 0 1                     -19/8        int <8,3>
    x   0 1 1 0 0 0 0 1                      97/16       int <8,4>
    -----------------------------------
    =   1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1     -1843/128

        s x x x
    x   s y y y
    ---------------
        s s z z z z z z
        s z z z z z z 0     => if FRCT = 1

Some processors (C54) have special instructions for fractional numbers (and a symmetric number domain $-2^{n-1} \ldots 2^{n-1}$).
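The effect of the FRCT = 1 line can be sketched in C. This is an illustrative model only, not the C54 instruction itself; the name `mpy16` and the flag argument are assumptions. Multiplying two signed 16-bit operands gives a 32-bit product with two sign bits; in fractional mode the product is shifted left by one so that a single sign bit remains and a zero enters at the lsb.

```c
#include <stdint.h>

/* 16 x 16 -> 32 bit signed multiply.
   frct != 0 models fractional mode: the product is shifted left once,
   turning "s s z z ... z" into "s z z ... z 0". */
int32_t mpy16(int16_t x, int16_t y, int frct)
{
    int32_t p = (int32_t)x * (int32_t)y;
    if (frct) {
        /* Shift through an unsigned type: left-shifting a negative signed
           value is undefined in C. The cast back assumes the usual
           two's-complement conversion. */
        p = (int32_t)((uint32_t)p << 1);
    }
    return p;
}
```

With 0.5 represented as 16384 (15 fractional bits), `mpy16(16384, 16384, 1)` gives 536870912, which is 0.25 with 31 fractional bits. The one input pair that does not fit after the shift is -1 times -1, which is one motivation for a symmetric number domain.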


8

DSP data types

• continue (after multiplication) with msb only
  • represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
• more efficient solution: continue with msb + lsb
  • sum-of-product operations generate accumulative noise at 32nd vs. 16th bit
• still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit
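A sum of products with a wide accumulator can be sketched as follows. This is a minimal model under stated assumptions: operands have 15 fractional bits, `int64_t` stands in for a double-precision accumulator with extra overflow bits, the name `mac_q15` is made up, and right-shifting a negative value is assumed to be an arithmetic shift (true on common compilers, implementation-defined in the C standard).

```c
#include <stddef.h>
#include <stdint.h>

/* Sum over i of c[i] * x[i] for operands with 15 fractional bits.
   Products are accumulated at full precision; rounding, shifting and
   saturation happen once, at the end. */
int16_t mac_q15(const int16_t *c, const int16_t *x, size_t n)
{
    int64_t acc = 0;                       /* accumulator with overflow bits */

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];   /* 30 fractional bits */

    acc += (int64_t)1 << 14;               /* round: add half an output lsb */
    acc >>= 15;                            /* shift back to 15 fractional bits */

    if (acc > INT16_MAX) acc = INT16_MAX;  /* saturate to the 16-bit range */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

Because quantization happens only once, the noise enters at the lsb of the wide result rather than once per tap at the 16th bit.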


9

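[Figure: multiplier-accumulator datapath extended with a SHIFT / ROUND / TRUNCATE unit. Blocks: inputs c(i) and x(i), MPY (Booth, Wallace, ...), PR, ADDER, ACR, SHIFT / ROUND / TRUNCATE; clock, P_reg and control signals.]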


10

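rounding, value truncation, magnitude truncation

[Figure: quantizer characteristic xQ versus x for each of the three modes]

rounding (add half an lsb, then drop the fraction):
    1 1 1 . 1 1   (-0.25)   + 0 0 0 . 1   =  0 0 0   (0)
    1 1 1 . 0 1   (-0.75)   + 0 0 0 . 1   =  1 1 1   (-1)

value truncation (drop the fraction):
    1 1 1 . 1 1   (-0.25)                 =  1 1 1   (-1)
    1 1 1 . 0 1   (-0.75)                 =  1 1 1   (-1)

magnitude truncation (for these negative values: add 1, then drop the fraction):
    1 1 1 . 1 1   (-0.25)   + 0 0 1 .     =  0 0 0   (0)
    1 1 1 . 0 1   (-0.75)   + 0 0 1 .     =  0 0 0   (0)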
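The three quantization modes can be written out in C to reproduce the worked bit patterns. This is a sketch with made-up function names; it assumes the value is held as an integer with q fractional bits (q >= 1) and that `>>` on a negative value is an arithmetic shift. One deliberate difference: for magnitude truncation the code adds one lsb less than a full integer unit to negative values, so that exact negative integers are left unchanged; for the two examples shown the outcome is the same.

```c
#include <stdint.h>

/* v holds a fixed-point value with q fractional bits (q >= 1).
   Each function returns the integer part after quantization. */

int32_t value_truncate(int32_t v, unsigned q)
{
    return v >> q;                        /* rounds toward minus infinity */
}

int32_t round_nearest(int32_t v, unsigned q)
{
    return (v + (1 << (q - 1))) >> q;     /* add half an lsb, then truncate */
}

int32_t magnitude_truncate(int32_t v, unsigned q)
{
    if (v < 0)
        v += (1 << q) - 1;                /* push negative values toward zero */
    return v >> q;
}
```

With q = 2, the pattern 1 1 1 . 1 1 is the integer -1 and 1 1 1 . 0 1 is -3, so `round_nearest(-1, 2)` is 0 and `round_nearest(-3, 2)` is -1, matching the rounding column.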


11

[Figure: overflow characteristics: saturation, zeroing, sawtooth]
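The three overflow characteristics can be illustrated for a 16-bit result taken from a wider intermediate value. This is a hedged sketch: the function names are invented, and "sawtooth" is read here as plain two's-complement wrap-around, which is the usual meaning of that characteristic.

```c
#include <stdint.h>

/* Saturation: clip to the largest or smallest representable value. */
int16_t ovf_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Zeroing: replace any out-of-range value by zero. */
int16_t ovf_zero(int32_t v)
{
    return (v > INT16_MAX || v < INT16_MIN) ? 0 : (int16_t)v;
}

/* Sawtooth: keep only the low 16 bits (two's-complement wrap-around). */
int16_t ovf_wrap(int32_t v)
{
    int32_t low = (int32_t)((uint32_t)v & 0xFFFFu);
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}
```

For an out-of-range input of 40000 the three functions return 32767, 0 and -25536 respectively; in-range values pass through unchanged in all three.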


12

Goal = 1 cycle per iteration of the sum over i of c(i) * x(i)

• Von Neumann (sequential): one prog/data memory + EXU
• Harvard: prog mem. + data mem. + EXU
• Modified Harvard: prog mem. + data mem. 1 + data mem. 2 + EXU


13

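[Figure: DSP architecture. Data side: RAM_A and RAM_B, each with its own ACU (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B), feeding the MAC; Rfile. Controller: PC with +1, interrupt address, stack, reset, program memory, IR; control bus.]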


14

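[Figure: 5-tap FIR filter. A delay line of Z^-1 elements holds x1 ... x5; each xi is multiplied by its coefficient ci and the products are added into y = sum over i of ci * xi. The filter loop over i runs inside the time loop.]

How to update the delay line?

1 cycle/tap?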


15

Solution 1: block move in memory

Memory location   output sample 1   output sample 2   output sample 3
1                 x1                x2                x3
2                 x2                x3                x4
3                 x3                x4                x5
4                 x4                x5                x6
5                 x5                x6                x7

2 possibilities
• complete move after every output sample is calculated
  • read and write the data twice
• move after read of every datum separately
  • write the data twice
  • need for a special instruction (TMS320)


16

Solution 2: indirect addressing

Memory location   output sample 1   output sample 2   output sample 3   output sample 4   output sample 5
1                 x1                                                                      x9
2                 x2                x2
3                 x3                x3                x3
4                 x4                x4                x4                x4
5                 x5                x5                x5                x5                x5
6                                   x6                x6                x6                x6
7                                                     x7                x7                x7
8                                                                       x8                x8

• use of a pointer to mark the begin of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
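The pointer-plus-modulo idea corresponds to a circular buffer in C. The sketch below is illustrative only: the type `delay_line`, the function `fir_step` and the length `TAPS` are made-up names, and the `%` operator stands in for what a DSP does in its address unit at no extra cycle cost.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 5   /* hypothetical filter length, as in a 5-tap delay line */

typedef struct {
    int16_t line[TAPS];   /* delay line storage; samples are never moved */
    size_t  head;         /* pointer to the newest sample */
} delay_line;

/* Store one new sample and return the sum over k of c[k] * x[n-k]. */
int32_t fir_step(delay_line *d, const int16_t c[TAPS], int16_t sample)
{
    /* Update the pointer instead of moving the data; the modulo keeps it
       inside the buffer so the delay line cannot walk through memory. */
    d->head = (d->head + 1) % TAPS;
    d->line[d->head] = sample;           /* overwrites the oldest sample */

    int32_t acc = 0;
    size_t idx = d->head;
    for (size_t k = 0; k < TAPS; k++) {
        acc += (int32_t)c[k] * d->line[idx];
        idx = (idx + TAPS - 1) % TAPS;   /* step to the next older sample */
    }
    return acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients one by one, which is a quick way to check that the oldest sample is the one being overwritten.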


17

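IIR filter

[Figure: IIR filter with input x and output y. A delay line of Z^-1 elements holds y1 ... y5; taps are weighted by coefficients c1 ... c4 and added. Memory map: y1 ... y5 in consecutive locations, addressed through a pointer.]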


18

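2 filters

time loop
    for j = 1..jtaps: sum of d(j) * y(j)
    for i = 1..itaps: sum of c(i) * x(i)

[Figure: memory maps. With 2 memory segments, x1 ... x5 are addressed with pntr 1 and modulo range 1, and y1 ... y5 with pntr 2 and modulo range 2. Merged into 1 segment, y1 ... y5 and x1 ... x5 share one pointer (pntr 1) and one modulo range.]

2 memory segments => 1 segment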


19

Mapping strategy

[Figure: filter with delay-line variables x1, x2, x3 and y1, y2, y3 (z^-1 elements) and coefficients c1 ... c5, next to its memory map: y1, y2, x1/y3, x2, x3 in consecutive locations, addressed with one pointer (pntr) and one modulo range.]

• define positions in RAM
  constraint: vars that form a delay line in consecutive places
• find a schedule
  example: c1 => c2 => c3 => c4 => c5
• define ACU instructions


20

[Figure: filter with a delay line of Z^-1 elements holding x1 ... x8, coefficients c1 ... c8, multipliers and two adders producing the outputs ye and yo.]


21

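ACU architecture and instruction set

[Figure: ACU with registers A and S, a modulo unit, and an output to RAM]

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k
    16 = 10 000
    23 = 10 111
    mask = hold: the upper bits (10) are held, only the lower 3 bits change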
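The ACU instruction table and the mask-based modulo can be modelled in a few lines of C. This is a sketch under assumptions: the struct layout and function names are invented, the modulo range is 2^k locations aligned on a multiple of 2^k, and each instruction returns the value it puts on the output to RAM.

```c
#include <stdint.h>

typedef struct {
    uint16_t a;   /* reg A: current address */
    int16_t  s;   /* reg S: step size */
    unsigned k;   /* the modulo range holds 2^k locations */
} acu;

/* Modulo as a mask: the upper bits of A are held, the low k bits wrap. */
static uint16_t acu_wrap(const acu *u, int32_t next)
{
    uint16_t mask = (uint16_t)((1u << u->k) - 1u);
    return (uint16_t)((u->a & (uint16_t)~mask) | ((uint16_t)next & mask));
}

uint16_t acu_read_a(const acu *u) { return u->a; }

uint16_t acu_inc(acu *u)  { u->a = acu_wrap(u, (int32_t)u->a + 1);    return u->a; }
uint16_t acu_dec(acu *u)  { u->a = acu_wrap(u, (int32_t)u->a - 1);    return u->a; }
uint16_t acu_step(acu *u) { u->a = acu_wrap(u, (int32_t)u->a + u->s); return u->a; }
```

Starting from A = 17 and S = -2 with k = 3 (addresses 16 to 23), the sequence read_A, incA, incA, incA, incA, step, dec yields 17, 18, 19, 20, 21, 19, 18. An incA at address 23 wraps to 16 because only the low three bits change.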


22

Mapping example

[Figure: the filter and memory map of the "Mapping strategy" slide (y1, y2, x1/y3, x2, x3; pointer pntr, modulo range), with memory addresses 16 ... 23.]

Assume initialisation: A = pointer = 17, S = -2

ACU instruction   address
read_A            17
incA              18
incA              19
incA              20
incA              21
step              19
dec               18     prepare new pointer for next iteration


23

Addressing modes

mode          example              meaning
register      ADD R4, R3           R[R4] = R[R4] + R[R3]
immediate     ADD R4, #3           R[R4] = R[R4] + #3
direct        ADD R4, (100)        R[R4] = R[R4] + Mem[100]
indirect      ADD R4, (R3)         R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec    ADD R4, (R3)±        R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± 1
indexed       ADD R4, (R3±R2)      R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x_n
• index = for stepping through arrays, e.g. x_2n


24

Addressing modes: extra for DSP

• 8 ARs (address or auxiliary register) available
• extra indirect modes
  • circular
      *ARn±%       post inc/dec by 1, circular
      *ARn±AR0%    post inc/dec by AR0, circular
  • bit reverse
      *ARn±AR0B    post inc/dec by AR0, bit rev.
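Bit-reversed post-increment can be emulated in software to see which address sequence it produces. The sketch below is an assumption-laden model, not the hardware mechanism: it treats the increment as an addition with reversed carry propagation, implemented by reversing both operands, adding, and reversing the sum. The function names are made up.

```c
#include <stddef.h>

/* Reverse the low `bits` bits of i. */
size_t bit_reverse(size_t i, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

/* Add step to addr with the carry travelling from msb to lsb,
   inside a buffer of 2^bits elements. */
size_t reverse_carry_add(size_t addr, size_t step, unsigned bits)
{
    size_t mask = ((size_t)1 << bits) - 1;
    size_t sum = (bit_reverse(addr, bits) + bit_reverse(step, bits)) & mask;
    return bit_reverse(sum, bits);
}
```

For an 8-point FFT (bits = 3) with a step of 4, starting at 0, repeated calls visit 0, 4, 2, 6, 1, 5, 3, 7: the bit-reversed order needed for FFT input or output shuffling.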


25

Incorporation of an ALU

• regular data-flow algorithms ==> MAC
  filtering, correlation, windowing etc …
• decision making ==> ALU
  • sorting filters (e.g. median filters)
  • interpolation (e.g. sqrt)
  • absolute value calculation
  • logarithmic conversion
  • finite field arithmetic (e.g. Galois field)
  • Viterbi
  • VLC, VLD
  • division


26

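[Figure: DSP architecture with an ALU added next to the MAC. Data side: RAM_A and RAM_B, each with ACU (ACU_A, ACU_B), address register (AR_A, AR_B) and data register (DR_A, DR_B); MAC, ALU, Rfile. Controller: PC with +1, interrupt address, stack, reset, program memory, IR; control bus.]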


27

Bus-oriented instruction encoding

opcode   fields
00       ALU            SX SY DX DY RF    ACU A, B
01       MULT           SX SY DX DY RF    ACU A, B
10       Imm. data            DX DY RF    ACU A, B
11       Next address   BR Cond           ACU A, B


28

First solution: sum over i of c(i) * x(i)

(columns = resources, rows = time in clock cycles)

LABEL   ALU              MPY-ACC                          RAM          ACU
                         acc = 0                                       init (i=0)
        init counter
loop                                                                   incr (=i+1)
                                                          read x(i)
                         acc(i) = acc(i-1) + x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

• 6 clock cycles/sample
• limit: pipelines in the controller
• not shown: coefficient RAM + ACU


29

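Loop folding (software pipelining)

Original loop:
    for i = 0 to n
        b_i = f(a_i)
        c_i = g(b_i)
        d_i = h(c_i)

Folded loop:
    for i = 2 to n
        b_i     = f(a_i)
        c_(i-1) = g(b_(i-1))
        d_(i-2) = h(c_(i-2))

[Figure: the chain a_i -> f -> b_i -> g -> c_i -> h -> d_i drawn for iterations 0, 1 and 2, and the folded version in which f, g and h work in the same cycle on iterations i, i-1 and i-2.]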
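The folding transformation can be checked in plain C by comparing the original loop with a folded one. This is a sketch: the stage functions `f`, `g`, `h` are arbitrary placeholders, the arrays hold n elements (indices 0 to n-1), and the folded version needs n >= 2 because of its preamble and postamble.

```c
/* Placeholder stage functions; any side-effect-free functions would do. */
static int f(int a) { return a + 1; }
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* Original loop: the three operations of one iteration depend on each other. */
void pipeline_plain(const int *a, int *d, int n)
{
    for (int i = 0; i < n; i++) {
        int b = f(a[i]);
        int c = g(b);
        d[i] = h(c);
    }
}

/* Folded loop (requires n >= 2): the three operations in the kernel belong
   to iterations i, i-1 and i-2 and are independent of each other. */
void pipeline_folded(const int *a, int *d, int n)
{
    int b_prev = f(a[0]);              /* preamble */
    int c_prev = g(b_prev);
    b_prev = f(a[1]);

    for (int i = 2; i < n; i++) {      /* kernel */
        int b = f(a[i]);               /* b_i     */
        int c = g(b_prev);             /* c_(i-1) */
        d[i - 2] = h(c_prev);          /* d_(i-2) */
        b_prev = b;
        c_prev = c;
    }

    d[n - 2] = h(c_prev);              /* postamble */
    d[n - 1] = h(g(b_prev));
}
```

On a sequential CPU both versions do the same work; the gain appears on a machine that can issue the three kernel operations in the same cycle.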


30

Loop folding (software pipelining): sum over i of c(i) * x(i)

LABEL   ALU              MPY-ACC                                  RAM            ACU
                         acc(i-1) = 0                                            init (i=1)
        init counter                                              read x(i)      inc (=i+1)
loop                     acc(i) = acc(i-1) + x(i)*c(i)            read x(i+1)    incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                         acc(n-1) = acc(n-2) + x(n-1)*c(n-1)      read x(n)
                         acc(n) = acc(n-1) + x(n)*c(n)

• pre- and postamble
• 4 clock cycles/sample


31

Hardware support for loop control: sum over i of c(i) * x(i)

LABEL        ALU              MPY-ACC                                  RAM            ACU
                              acc(i-1) = 0                                            init (i=1)
             init counter                                              read x(i)      inc (=i+1)
repeat n-2                    acc(i) = acc(i-1) + x(i)*c(i)            read x(i+1)    incr (=i+2)
                              acc(n-1) = acc(n-2) + x(n-1)*c(n-1)      read x(n)
                              acc(n) = acc(n-1) + x(n)*c(n)

• 1 clock cycle/sample
• repeat instruction and repeat block


32



examples: C6 and TM

Outline


33

TMS320C5000

[Figure: TMS320C5000 datapath. Blocks: T register; sign control (Sign ctr) on the operand inputs; multiplier (17*17) with fractional MUX; adder (40) with ZERO, SAT and ROUND; accumulators A (40) and B (40); ALU (40); barrel shifter; MSW/LSW select; COMP, TRN, TC; input multiplexers (MUX) selecting among T, A, B, C, D and 0; buses C, D, P and E.]


34

Motorola 56K family

[Figure: Motorola 56K block diagram. Address ALU generating the X, Y and P addresses; 16-bit address bus with external address switch. X memory and Y memory, each with 256-by-24-bit RAM and 256-by-24-bit ROM; 2,048-by-24-bit program memory ROM. 24-bit data buses (X data, Y data, P data, global data) with internal and external data-bus switches. Data ALU with a 24-by-24 bit multiplier-accumulator producing a 56 bit result. Program controller with clock and interrupt inputs. On-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control, I/O ports. Signal groups are labelled 2 bits, 3 bits, 7 bits and 24 bits.]


35

R.E.A.L.

[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory, each on a 16-bit bus, with two address computation units (X, Y); program memory (Z data) on a 16-bit bus; program control unit; instruction decoder with 96-bit instructions; buses for X data, Y data and Z data. Datapath: two 16-by-16 bit multipliers with inputs X, Y0, Y1 and products P0, P1 (each with scale); two 40 bit arithmetic-logic units with saturation; four 40 bit accumulators; saturation/scale; shift.]


36

RD16021 DSP

Process              0.35, 5M
Voltage              2.7-3.6 V
Frequency            39 MHz (Tj = 85 °C, 2.7 V, wcp)
Area                 3.9 mm2
Power dissipation    2.1 mW/MHz

Memories not included.


37

Instruction cycle counts for BDTi benchmarks

Columns, in order: DSPgroup OAK, Motorola DSP561xx, ADI ADSP-218x, Lucent DSP16xx, TI TMS320C54x, TI 320C62xx, Lucent DSP16210, Philips RD16020

Function              OAK     DSP561xx  ADSP-218x  DSP16xx  C54x    C62xx   DSP16210  RD16020
Real block FIR        835     925       841        1240     684     334     780       448
Single sample FIR     21      23        22         26       18      17      16        20
Complex block FIR     3018    3043      3122       3123     2922    1294    1681      1470
LMS adaptive          90 64 59 101 58 33 55   (7 of 8 values present; column assignment not recoverable)
IIR (8 sections)      51      45        43         65       44      30      38        37
Vector dot product    43      43        43         47       41      29      23        43
Vector add            122     85        83         123      61      36      43        63
Vector maximum        41 86 128 120 111 39 40   (7 of 8 values present; column assignment not recoverable)
Convolution encoder   506     772       818        888      528     188     464       176
FSM                   284     375       198        415      455     147     301       167
256 pnt FFT           16514   12148     10633      21035    13234   4225    9016      5797

Annotations: 16 taps, 40 samples, 8 biquads


38


39



Outline


40

source
  -> Front end: lexical analysis, syntax analysis, semantic analysis
  -> Intermediate machine independent representation
  -> Code generation: code selection, register allocation, scheduling
  -> code

Annotations: 1 instr = // ops; order of instr


41

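Intermediate machine independent representation

[Figure: data flow graph for the code below (inputs a, b, c, d; operations *, +, +, *; intermediate values t1, t2, t3) and a control flow graph with basic blocks BBi, BBj, BBk]

    t1 := a * b
    t2 := c + d
    t3 := t1 + c
    out := t2 * t3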


42

Code selection

A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]

Notation
    ar := ar | ax + ay | af
means
    ar := ar + ay   or   ar := ar + af   or   ar := ax + ay   or   ar := ax + af

Intermediate representation + RTP => match & cover


43

Code selection example

[Figure: ADSP datapath [Analog Devices]. ALU (+, -) with registers ax, ay, ar, af; MAC (+, -, *) with registers mx, my, mr, mf; x and y operand inputs; d memory and p memory.]


45

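Example of code selection = covering of intermediate representation with RTPs

[Figure: the data flow graph (inputs a, b, c, d; intermediate values t1, t2, t3) covered by RTPs numbered 1, 2 and 3]

Register transfers shown in the figure:
    mx := dmem    my := pmem    ax := dmem    ay := pmem
    mr := dmem
    3: ar := ax + ay
    my := ar
    mr := mr * my
    mr := mr + (mx * my)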


46

Problems
• local decisions which have a global impact
• phase coupling: example
  • asap schedule
  • maximal freedom for scheduling
  • code selection during scheduling
  • register allocation comes afterwards
  • can lead to infeasible solutions


47

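phase coupling: example 1

[Figure: (a) datapath with registers R1, R2, R3 and units alu1, alu2; (b) graph with operations 1, 2, 3, 4; (c) graph with operations 1, 2, 3, 4 and an added Move operation]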


48

phase coupling: example 2

Example of coupling between scheduling and register allocation [Mesman]

[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, drawn twice; the second drawing applies if u and v share the same register]


49

phase coupling: discussion

[Figure [Mesman]: traditional (heuristic) code generation takes the application and checks "constraints OK?" with a "no" branch and a "yes" branch. Labels: feasible space; design space seen by code generator.]

Phase coupling is difficult because of many constraints originating from irregular interconnect, special purpose registers and non-orthogonal microcode.


50

phase coupling: discussion

It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler

Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture:
   develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word


51



Outline

• principles
• central register file + example TM
• clustered VLIW + example C6
• subword parallelism or SIMD


52

VLIW principles

• multiple parallel FUs, possibly different and pipelined
• pipelining is exposed to the compiler = no interlock mechanism
• load-store architecture
  all operands fetched from/stored in register files, possibly multi-ported
• each FU can receive an instruction every clock cycle
• one instruction = many RISC instructions
• each RISC instruction = one issue slot
• no dependencies between different RISC instructions = orthogonal microcode = compiler friendly


53

VLIW architecture

[Figure: one register file shared by 25 exec units (Exec unit 1 ... Exec unit 25), each with its own issue slot (Issue slot 1 ... Issue slot 25) in the instruction; R&W addr. go from the instruction to the register file]

• long instruction words, e.g. (3*7+4)*25 = 625
• many ports on the register file, e.g. 75


54

VLIW architecture: central Register File

[Figure: one register file shared by Exec units 1 ... 9; the instruction has 3 issue slots (Issue slot 1, 2, 3)]


55

TM1000 DSPCPU

[Figure: PC, instruction cache (32 kB), instruction register (5 issue slots), 5 exec units, register file (128 regs, 32 bit, 15 ports), data cache (16 kB)]

Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt


56

TriMedia TM32A processor

[Figure: chip layout with D-cache, I-cache, TAG arrays, SEQUENCER / DECODE, I/O INTERFACE and the functional units IFMUL1, IFMUL2 (FLOAT), DSPMUL1, DSPMUL2, FTOUGH1, SHIFTER0, SHIFTER1, ALU0, ALU1, ALU2, ALU3, ALU4, FALU0, FALU3, FCOMP2, DSPALU0, DSPALU2]

• 0.18 micron
• area: 16.9 mm2
• 200 MHz (typ)
• 1.4 W
• 7 mW/MHz (MIPS = 0.9 mW/MHz)


57

Synthesised RF area (CMOS18, 64 bit)

[Figure: area in mm-sq (0 to 9) versus nr of ports (0 to 20) for 32 regs, 64 regs and 128 regs, all after P&R, with a polynomial fit for 128 regs]

Area, speed and power dissipation grow more than linearly with the number of ports


58

VLIW architecture: clustered Register Files

[Figure: three clusters. Register file 1 with Exec units 1, 2 and a copy unit; register file 2 with Exec units 3, 4 and a copy unit; register file 3 with Exec units 5, 6 and a copy unit]


59

[Figure: three clusters. Register file 1 with FMUL, FADD; register file 2 with IMUL, IADD; register file 3 with IMUL, IADD]

Example instruction: FMUL r1,r2,r3   IADD r1,r2,r3   IMUL r1,r2,r3



60

[Figure: two clusters, labels as shown. Register file I0: IADD_01, IMOV_01, ... (FU00); IADD_00, LAND_00, ... (FU01); IMUL_00, SHFT_00, ... (FU02). Register file I1: IADD_10, IMOV_10, ... (FU10); IADD_11, LAND_10, ... (FU01); IMUL_10, SHFT_10, ... (FU02)]



61

Discussion

• performance loss (more instructions) compared to a central Register File (due to extra cycle for copy)
  • 15-20 % for 2 clusters
  • 20-30 % for 4 clusters
• limited scalability
  • not too many clusters
  • not too many registers within each cluster (too many RF ports)
• addition of copy ops in the compiler = graph changes during scheduling


62

TMS320C62x VelociTI (fixed point)

[Figure: two datapaths, each with a register file 0-15 (32 bits) and four units: L1, S1, M1, D1 and L2, S2, M2, D2. Each unit has dst, src1 and src2 ports; some also have src_up and dst_up ports. The D units supply the store/load address; store/load data paths connect to the register files.]

Unit functions as labelled:
• Int add, logical, bit count
• Int add, logical, bit manip, shift, constant, branch
• Int mult (16 => 32)
• Int add, load/store


63

VelociTI principles

• parallelism (fetch-decode-execute) (max 8 issue slots)
• pipeline critical sections (alu 1 cc, mult 2 cc, 200 MHz)
• RISC (simple, atomic, independent instructions)
  performance comes from compiler (pipelining, unroll)
• load-store
• orthogonal (2 identical DP, add on 6 units)
• deterministic (no interlock)
• conditional instructions (= guarding)
• instruction packing


64

VelociTI encoding

Classical encoding: fetching many nops (n = nop)

Fully serial:
    n n A n n n n n
    n B n n n n n n
    n n n n n C n n
    n n n n n D n n
    n n n E n n n n
    F n n n n n n n
    n n n n n n G n
    n n n n n n n H

Mixed serial/parallel:
    n B A n n C n n
    n n n E n D n n
    F n n n n n n n
    n n n n n n G H

Fully parallel:
    A B C D E F G H

VelociTI encoding: one packet A B C D E F G H plus one bit per instruction

                             A B C D E F G H
    Fully serial             0 0 0 0 0 0 0 0
    Mixed serial/parallel    1 1 0 1 0 0 1 0
    Fully parallel           1 1 1 1 1 1 1 0
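How the per-instruction bits group a packet into cycles can be sketched in C. This is an illustrative decoder, not TI's hardware: the function name and array layout are assumptions, and a bit value of 1 is read as "the next instruction issues in the same cycle as this one", which matches the three bit patterns shown for instructions A to H.

```c
#include <stddef.h>

/* Split n instructions into groups that issue together.
   pbit[i] != 0: instruction i+1 issues in the same cycle as instruction i.
   start[] receives the index of the first instruction of each group;
   the return value is the number of groups, i.e. the number of cycles. */
size_t execute_packets(const int *pbit, size_t n, size_t *start)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !pbit[i - 1])
            start[count++] = i;      /* a 0 bit on the previous instruction closes its group */
    }
    return count;
}
```

For the mixed pattern 1 1 0 1 0 0 1 0 this gives four groups starting at A, D, F and G, the same four rows as the classical mixed serial/parallel encoding but without storing any nops.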


65

Instruction cycle counts for BDTi benchmarks (same figures as in the earlier benchmark table)


66

Subword parallelism (custom operators in TM)

[Figure: the 1st and 2nd input operands each hold byte3, byte2, byte1, byte0; the same op is applied to each pair of bytes and the four results form byte3 ... byte0 of the output operand]

32 bits = 4 bytes are processed independently

Ex. +, -, min, max … => quadumin, quadumax ...


67

Subword parallelism (custom operators in TM)

Original loop:

    int size = 1000;
    byte out[size], in1[size], in2[size];
    for (i = 0; i < size; i++)
        out[i] = in1[i] + in2[i];

With subword parallelism (4 bytes per operation):

    int size = 1000;
    byte out[size], in1[size], in2[size];
    for (i = 0; i < size; i += 4) {
        packet4 t1 = packet4_load(&in1[i]);
        packet4 t2 = packet4_load(&in2[i]);
        packet4 t3 = packet4_add(t1, t2);
        packet4_store(&out[i], t3);
    }

+ faster execution
- rewrite effort (e.g. different types for in- and outputs)

Typical example: graphics (4 * 32 bit floating point)
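What a four-byte packed add does can be emulated on any 32-bit word with ordinary integer operations. The sketch below is a portable stand-in, not a TriMedia custom operator: the name `quad_add_u8` is invented, and each byte lane wraps modulo 256 (no saturation).

```c
#include <stdint.h>

/* Add four unsigned bytes packed in a and b, lane by lane.
   No carry crosses from one byte into the next. */
uint32_t quad_add_u8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; a carry can reach bit 7 of its own
       lane but never leaves the lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Bit 7 of each lane is the XOR of both top bits and the carry into it. */
    uint32_t top = (a ^ b) & 0x80808080u;

    return low7 ^ top;
}
```

For example, `quad_add_u8(0x01020304, 0x10203040)` is 0x11223344, and a lane holding 0xFF plus 0x01 wraps to 0x00 without disturbing its neighbour. A hardware operator does this in one issue slot, which is where the speed-up over the byte loop comes from.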


68

Subword parallelism: MPEG example

    for (i = 0; i < 64; i++) {
        temp = ((back(i) + forward(i) + 1) >> 1) + idct(i);
        if (temp > 255)
            temp = 255;
        else if (temp < 0)
            temp = 0;
        destination[i] = temp;
    }

Remark: simple example without interloop dependencies


69

    for (i = 0; i < 64; i += 4) {
        temp = ((back(i+0) + forward(i+0) + 1) >> 1) + idct(i+0);
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+0] = temp;

        temp = ((back(i+1) + forward(i+1) + 1) >> 1) + idct(i+1);
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+1] = temp;

        temp = ((back(i+2) + forward(i+2) + 1) >> 1) + idct(i+2);
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+2] = temp;

        temp = ((back(i+3) + forward(i+3) + 1) >> 1) + idct(i+3);
        if (temp > 255) temp = 255;
        else if (temp < 0) temp = 0;
        destination[i+3] = temp;
    }


70

    /* these four statements = quadavg */
    temp0 = ((back(i+0) + forward(i+0) + 1) >> 1);
    temp1 = ((back(i+1) + forward(i+1) + 1) >> 1);
    temp2 = ((back(i+2) + forward(i+2) + 1) >> 1);
    temp3 = ((back(i+3) + forward(i+3) + 1) >> 1);

    /* these four add-and-clip groups = dspuquadaddui */
    temp0 += idct(i+0);
    if (temp0 > 255) temp0 = 255;
    else if (temp0 < 0) temp0 = 0;
    temp1 += idct(i+1);
    if (temp1 > 255) temp1 = 255;
    else if (temp1 < 0) temp1 = 0;
    temp2 += idct(i+2);
    if (temp2 > 255) temp2 = 255;
    else if (temp2 < 0) temp2 = 0;
    temp3 += idct(i+3);
    if (temp3 > 255) temp3 = 255;
    else if (temp3 < 0) temp3 = 0;

    destination[i+0] = temp0;
    destination[i+1] = temp1;
    destination[i+2] = temp2;
    destination[i+3] = temp3;
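The two groups of statements that are mapped onto quadavg and dspuquadaddui can be modelled lane by lane in C. This is a behavioural sketch only: the function names are invented, the semantics are taken from the scalar code (rounded average of unsigned bytes; add with clipping to 0..255), and it assumes the idct term fits in a signed byte.

```c
#include <stdint.h>

/* Per byte: (x + y + 1) >> 1 on unsigned bytes. */
uint32_t quadavg_model(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << (8 * lane);
    }
    return r;
}

/* Per byte: unsigned byte of u plus signed byte of s, clipped to 0..255. */
uint32_t quad_clipped_add_model(uint32_t u, uint32_t s)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t x = (int32_t)((u >> (8 * lane)) & 0xFFu);
        int32_t y = (int32_t)((s >> (8 * lane)) & 0xFFu);
        if (y >= 128)
            y -= 256;                    /* reinterpret the byte as signed */
        int32_t t = x + y;
        if (t > 255) t = 255;
        else if (t < 0) t = 0;
        r |= (uint32_t)t << (8 * lane);
    }
    return r;
}
```

With back, forward and idct packed four bytes to a word, one loop iteration becomes `quad_clipped_add_model(quadavg_model(back4, forward4), idct4)`, where `back4`, `forward4` and `idct4` are hypothetical packed words: two operations instead of four averages, four adds and eight comparisons.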


71

Will embedded CPUs and DSPs converge?
• Converging forces
  • both include a hardware multiplier
  • trend in DSPs towards caches and RTK
  • trend in DSPs towards C/C++
  • common trend towards VLIW
• Diverging forces
  • deeply embedded code (DSP) vs. end-user SW (CPU)
  • different RTKs: SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)

Conclusions VLIW
• good balance between hw and sw
• between efficiency (ILP) and cost
• fundamental problems: code size, interruptability
