
1

CAIRN Project-Team

Energy-Efficient Reconfigurable System-on-Chip DART Coarse-Grain Reconfigurable Architecture

Olivier Sentieys [email protected]

with contribution from Raphaël David (CEA List), Sébastien Pillement (IRISA)

2

Agenda
Motivations and Challenges
Dynamically Reconfigurable Architectures
  Anatomy of an RSoC
  From Applications to Architecture
Coarse-grain Reconfigurable Architecture
  DART architecture (Mozaic platform)
  Morea architecture
Conclusion

A cairn in Bréhat

2

3

[L. Ducousso, STMicroelectronics]

State-of-the-art SoC at HDTV Set Top Box

Domain-specific SoC Functionality is inside

Various applications and standards inside

MPEG2, H264 Satellite, Wifi/LAN Hard disk, …

65nm, 150 MTr 1 B$ mask set 60 weeks design 6-18 months lifetime

Heterogeneity 16 processors, 38 IPs 5-6000 MIPS 140 memory blocks 5 Gbytes/s on-chip interconnection network

HW: 5M RTL code lines SW: 60M code lines

• OS, Middleware, HAL, Firmware


4

Challenges and limitations
High-performance applications
  e.g. H264 codec, 802.11n MIMO, …
Energy and power constraints
  Battery life, manufacturing cost
Rapidly changing application standards
  SW updates vs. HW redesign
Compilation and synthesis tools targeting heterogeneous SoC
Technological impacts
  Manufacturing problems, transient errors, silicon bugs

3

5

A road for reconfigurable chips

Dynamically adapt the hardware to the application energy-performance-cost trade-off

Self-adapting devices

continuously adapt to changing environments

Other advantages regularity of the layout high-performance, parallel error and fault tolerance

Fresh SoC from CEA with DART IP from IRISA

“Flexible Software on Flexible Hardware”

6





4

7

HW Processor Memory Hierarchy

Fine grain Reconfigurable

Coarse grain Reconfigurable HW

Reconfigurable system-on-chip Programmable processors, specialized HW blocks Reconfigurable hardware

fine-grain, coarse-grain "on-the-fly ASIC"

Reconfigurable interconnect and memory structures

8

HW Processor Memory Hierarchy

Fine grain Reconfigurable

Coarse grain Reconfigurable HW

Reconfigurable system-on-chip Multithreaded applications

Thread compilation to reconfigurable hardware Fixed-point specification

Reconfiguration management Hardware abstraction layer Static or dynamic (at run-time) reconfiguration

5

9

Design Space

RECONFIGURABLE ARCHITECTURES (R-SoC)

FINE GRAIN (FPGA)

MULTI GRANULARITY (Heterogeneous)

COARSE GRAIN

Processor + Coprocessor

Tile-Based Architecture

Coarse Grain Coprocessor

Fine Grain Coprocessor

Island Topology

Hierarchical Topology

Linear Topology

Hierarchical Topology

Mesh Topology

• Chameleon • REMARC • Morphosys • PACT XPP

• Pleiades • Garp • FIPSOC • Triscend E5 • Triscend A7 • Xilinx Virtex-II Pro • Altera Excalibur • Atmel FPSIC

• Xilinx Virtex • Xilinx Spartan • Atmel AT40K • Lattice ispXPGA

• Altera Stratix • Altera Apex • Altera Cyclone

• Systolic Ring • RaPiD • PipeRench

• DART • FPFA

• RAW • CHESS • MATRIX • KressArray • Systolix PulseDSP

• aSoC • E-FPFA

[Bossuet03]

10

High Performance (12 GOPS) Low Power (500 mW)

24MOPS/mW@12GOPS
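The efficiency target above is simply throughput divided by power. A minimal sketch of that arithmetic in C follows; the function name `mops_per_mw` is hypothetical and only illustrates how the 24 MOPS/mW figure follows from 12 GOPS at 500 mW.

```c
#include <assert.h>

/* Energy efficiency in MOPS/mW from a throughput in GOPS and a power in mW.
   Hypothetical helper for illustration; 1 GOPS = 1000 MOPS. */
double mops_per_mw(double gops, double power_mw)
{
    assert(power_mw > 0.0);              /* power must be positive */
    return (gops * 1000.0) / power_mw;   /* convert GOPS to MOPS, then divide */
}
```

Calling `mops_per_mw(12.0, 500.0)` reproduces the slide's target of 24 MOPS/mW.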

Source

Data

Audio

Video

Source Coding

V34, V8, H225, H245, ...

EFR, AMR, CELP, RPE-LTP, ...

MPEGx, H26x, ... Channel Coding

Viterbi, Turbo coder, Reed Solomon, ...

Access Technique

TDMA, FDMA, W-CDMA, ...

Modulation

PSK, MSK, ASK, QAM, ...

Viterbi, turbo dec., Reed Solomon, ...

Channel Decoding Access Technique

TDMA, WCDMA, ...

Demodulation

PSK, MSK, ASK, QAM, ...

3G Wireless Terminal Flexibility

Applications Services

Multiple granularity Arithmetic Logic

6

11

Reconfigurable Architectures

Image

Music Demult. Multiple

Access Channel Decoder

Demodul. Equalizer

Source Decoder

Voice

Processor Processor

Reconfigurable Coprocessor

time

Wireless Multimedia Receiver

12

Processing Model

T3

T1

T2b T2a T2c

RA4

t

T1

T2a

T3

T2b [adapted from Leray08]

RA: Reconfigurable Area CM: Configuration Management

RA5

RA2 RA3 RA1

T2c

7

13





14

DART Architecture

Architecture Principles of DART

Compilation Workflow

Validation and Silicon Prototype

8

15

Overall Objectives
Coarse-grained reconfigurable architecture
Energy efficiency
Dynamic reconfiguration (4 to 20 cycles)
Compilation from a C code specification (no place and route)

16

Energy Efficiency

Technological parameter CS

Applicative parameters Nop.Fclk , α

Potential optimisations Actrl , Amem , Aop , α , VDD

9

17

Cost of control Minimize the configuration data volume ( Actrl)

Limited number of operation types and data format Various processing patterns Reconfiguration at the data-path level (rather than at the gate

level as in the case of FPGA) Reduce the frequency of reconfigurations ( α)

Loop body • Limited number of operations • Regular patterns

Each loop can be implemented as a unique configuration which is maintained during the processing time

18

Example: Motion Estimation (ME) Video coding MPEGx, H26x

Motion Vector (u,v)

Reference Block NxN

Matched Block NxN

N+2p

Search Window

p

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad = sad + ABS(BR(i,j) - FR(i+u,j+v));
        /* if (sad >= sadmin) break; */
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}
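The pseudocode above can be turned into a compilable routine. The following is a minimal sketch, assuming 8-bit pixels stored row by row; the type `motion_vector`, the function `full_search` and the stride parameters are hypothetical names introduced for illustration, and the caller is assumed to guarantee a margin of p pixels around the reference position. The early exit mirrors the commented-out break of the slide.

```c
#include <limits.h>
#include <stdlib.h>

/* Result of the search: displacement (mvx, mvy) and its SAD. */
typedef struct { int mvx, mvy, sad; } motion_vector;

/* Full-search block matching (illustrative sketch).
   cur: top-left pixel of the N x N reference block BR.
   ref: pixel of the search frame FR co-located with cur; at least p pixels
        must be readable on every side of the N x N area (caller's duty). */
motion_vector full_search(const unsigned char *cur, int cur_stride,
                          const unsigned char *ref, int ref_stride,
                          int n, int p)
{
    motion_vector best = { 0, 0, INT_MAX };
    for (int u = -p; u <= p; u++) {
        for (int v = -p; v <= p; v++) {
            int sad = 0;
            /* Stop accumulating once this candidate cannot win. */
            for (int i = 0; i < n && sad < best.sad; i++) {
                for (int j = 0; j < n; j++) {
                    int a = cur[i * cur_stride + j];
                    int b = ref[(i + u) * ref_stride + (j + v)];
                    sad += abs(a - b);
                }
            }
            if (sad < best.sad) {
                best.sad = sad;
                best.mvx = u;
                best.mvy = v;
            }
        }
    }
    return best;
}
```

The inner two loops are the regular part that the slides map onto a hardware configuration; the comparison and update of the minimum are the irregular part.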

10

19

Data access cost

Minimize memory access cost ( Amem) Storage capacity High bandwidth Memory hierarchy

Minimize number of memory accesses ( α) Minimize the number of temporary data storage Avoid redundant access to data Local registers

[Figure: energy per memory access (pJ per access, axis 0 to 160) versus memory size in number of words (64, 256, 1024, 16536)]

20

Operator

Reconfigurable Operators

Operation 2 Operation 0

( α)

Operation 1

Input 1 Input 2

Output

Control

( Aop)

11

21

System Architecture of DART

Data Memory

Instruction Memory

I/O Ctrl

Cluster 3 Cluster 4

Cluster 1 Cluster 2

Task Controller

Configuration Memory

22

Cluster Architecture

Config. Memory FPGA

DMA Ctrl

Configuration Controller

RDP1

RDP2

RDP3

RDP4

RDP5

RDP6

Data Memory

Segmented Network

12

23

reg1 reg2 FU1 FU2 FU3 FU4

Multi-Bus Crossbar Network

Data Mem1

Data Mem2

Data Mem3

Data Mem4

AG1 AG2 AG3 AG4

HW Loop Management

Global Bus

Reconfigurable Data Path Architecture • FU1 • FU2 • Crossbar • Bus • AG1 • Loop

24

reg1 reg2 FU1 FU2 FU3 FU4

Multi-Bus Crossbar Network

Data Mem1

Data Mem2

Data Mem3

Data Mem4

AG1 AG2 AG3 AG4

HW Loop Management

Global Bus

Reconfigurable Data Path Architecture

92 bits

34 bits

826 bits to reconfigure the arithmetic resources of a cluster

13

25

Irregular and Regular Software

for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<N;i++){
    tmp+=x[i]*h[N-i];
  }
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}

Irregular Processing: little parallelism, little regularity, less complex

Regular Processing: massively parallel, very regular, complex

26

rec

4 cycles

Mem3

- X

Configuration 2

y(i)=(x(i)-x(i-1))²

Mem1

Configuration 1

tmp+=x(i)*h(N-i);

X +

Mem1 Mem2

HW Reconfiguration DART potential is fully exploited

Optimal flexibility of operators and network Use of registers Multiple DPR chaining via segmented network

14

27

Configuration 1

C=A+B

+

Mem1 Mem2

rec

1 cycle

Configuration 2

E=C*D

X

Mem4 Mem1

SW Reconfiguration Reduced flexibility of the DPR

Operator configuration Operator source configuration No operator or DPR chaining

28

SCMD Single Configuration Multiple Data

Irregular processing has little parallelism
  Implementation on one DPR

Massively parallel processing is very regular
  Redundancy in DPR configurations

The configuration data stream can be reduced if the regularity is exploited
  Simultaneous broadcast of common configuration data toward several DPRs

15

29

SCMD at work

RDP1

RDP2

RDP6 configuration

data

configuration

data

configuration

data

Configuration bits

RDP1 Validation

RDP2 Validation

RDP6 Validation

LATCH

LATCH

LATCH
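The latch-and-validation scheme of the figure can be sketched in software. This is a minimal model, assuming one validation bit per RDP and a 32-bit configuration word; the word width, the type `dpr_latch` and the function `scmd_broadcast` are hypothetical and not taken from the DART design.

```c
#include <stdint.h>

#define NUM_DPR 6  /* six reconfigurable data paths per cluster, as on the slides */

/* One configuration latch per data path (word width is an assumption). */
typedef struct { uint32_t config; } dpr_latch;

/* Broadcast one configuration word on the shared configuration bus.
   Only the data paths whose validation bit is set latch the word.
   Returns the number of data paths that were updated. */
int scmd_broadcast(dpr_latch dpr[NUM_DPR], uint32_t word, uint8_t validation_mask)
{
    int updated = 0;
    for (int k = 0; k < NUM_DPR; k++) {
        if (validation_mask & (1u << k)) {
            dpr[k].config = word;   /* validated latch captures the word */
            updated++;
        }
    }
    return updated;
}
```

With a mask selecting several data paths, a single word sent once configures all of them, which is the reduction of the configuration stream that SCMD aims for.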

30

DART Architecture




16

31


SystemC Simulation (SCDART) • BA-CA Simulation • Performance Estimation

Synthesis (gDART) • DFG scheduling • Operator binding • HW configuration generation

Compilation (cDART, ACG) • SW configuration generation • Code compilation for address generators

Front-End (SUIF) • Code Optimisation • Code Extraction

C Code

32

Compiler front-end Currently based on SUIF High-level source optimisations Parallelism extraction

Partial loop unrolling Semi-automatic partitioning

Regular processing (loops) • HW configurations

Irregular processing and data management • SW configurations and AG instructions

17

Compilation Front End

SUIF

C Code

void main(){...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i+=2){
    Mem1=x[i];Mem2=h[N-i];}
  for (j=0;j<32;j+=2){
    Mem3=x[j];Mem4=h[j];
    z[j]=Mem5;}
  y[n]=Mem6<<6;
  X[0]=x[n]+128;
}
…}

void main(int X0, int H0,
…, int *Y){
  int tmp;
  tmp=tmp+X0*H0;
  tmp=tmp+X1*H1;
  *Y=tmp;
}

void main(int X0, int H0,
…, int *Z0, int *Z1){
  *Z0=X0-H0;
  *Z1=X1-H1;
}

Loop body 1 Loop body 2 Irregular processing

Regular code extraction

Partial loop unrolling

void main(){
...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i++) //unroll 2
    tmp+=x[i]*h[N-i];
  for (j=0;j<32;j++) //unroll 2
    z[j]=x[j]-h[j];
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}
...

SUIF Front-end

void main(){
...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i+=2){
    tmp+=x[i]*h[N-i];
    tmp+=x[i+1]*h[N-(i+1)];
  }
  for (j=0;j<32;j+=2){
    z[j]=x[j]-h[j];
    z[j+1]=x[j+1]-h[j+1];
  }
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}
...
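Partial loop unrolling must leave the result unchanged while exposing two multiply-accumulate operations per iteration. The sketch below isolates the first inner loop of the slide in two hypothetical functions, `mac_rolled` and `mac_unrolled2`, so the equivalence can be checked; as in the slide, `h` is indexed from `h[n]` down to `h[1]`, so it is assumed to hold n+1 elements.

```c
#include <assert.h>

/* Rolled loop: one multiply-accumulate per iteration. */
int mac_rolled(const int *x, const int *h, int n)
{
    int tmp = 0;
    for (int i = 0; i < n; i++)
        tmp += x[i] * h[n - i];
    return tmp;
}

/* Same loop unrolled by 2: two multiply-accumulates per iteration,
   which can be bound to two operators of a data path. */
int mac_unrolled2(const int *x, const int *h, int n)
{
    assert(n % 2 == 0);   /* the unrolled form needs an even trip count */
    int tmp = 0;
    for (int i = 0; i < n; i += 2) {
        tmp += x[i] * h[n - i];
        tmp += x[i + 1] * h[n - (i + 1)];
    }
    return tmp;
}
```

Both functions return the same value for any even n; the unrolled body is what the front-end extracts as a regular loop body.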

34

Framework

ARMOR model of

DART Compilation Gateways

Specialized Information

Source Code

Optimized Binary Code

Compilation Library

Code selection

Register allocation

Scheduling

Retargeting compilation framework CALIFE

18

35

[Figure: data-flow graphs built from multiply (*) and add (+) operators]

HW configuration generation
gDART transforms the nested loops (regular processing) into HW configurations
Based on classical techniques used in high-level synthesis
  Loop reduction, merging, …
  Graph depth reduction
  Resource binding, memory allocation
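Graph depth reduction can be illustrated on a sum of terms: a chain of additions has depth n-1, whereas a balanced adder tree has depth ceil(log2 n) and gives the same result for integer addition (overflow aside). The following is a minimal sketch with hypothetical function names; it does not claim to reproduce the gDART implementation.

```c
/* Chain of additions: the dependence depth is n - 1. */
int sum_chain(const int *p, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* Balanced tree of additions, computed in place (p is overwritten, n >= 1).
   Each pass of the outer loop is one tree level whose additions are
   independent and could run on parallel operators. */
int sum_tree(int *p, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            p[i] += p[i + stride];
    return p[0];
}

/* Number of tree levels: ceil(log2 n) for n >= 1. */
int tree_depth(int n)
{
    int depth = 0;
    for (int s = 1; s < n; s *= 2)
        depth++;
    return depth;
}
```

For four products, the chain needs three dependent additions while the tree needs two levels, which shortens the critical path of the configured data path.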

36

Simulator
SCDART is a bit-accurate and cycle-accurate simulator developed in SystemC at the Register Transfer Level

Verification Power and performance estimation

19

37

DART Architecture




38

DART Architecture

5-10 GOPS/cluster @ 130nm 300 mW @ 200MHz 16 MOPS/mW @ 5 GOPS Simulator, Compiler Tools Delivered as an RTL model Circuit (ST 130nm) in June 2005

Config Mem. FPGA

Ctrl DMA

Ctrl

RDP1

RDP2

RDP3

RDP4

RDP5

RDP6

Data. Mem.

Segmented Network

reg reg FU1 FU2 FU3 FU4

Fully Connected Network

Data mem1

Data mem2

Data mem3

Data mem4

AG1 AG2 AG3 AG4

Loop Management

3G/UMTS Mobile Terminal 802.11a (Channel Est.)

STMicroelectronics CEA LIST/LETI

20

39

Fresh Circuit (CEA) 4G mobile terminals Technology: ST 0.13µ CPU core: ARM946 4.8 Mgates Chip area = 80 mm2 Package: TBGA 420 Core power supply: 1.2 V

Silicon prototype of DART Complex SoC including DART accelerator Collaboration between IRISA/Cairn on DART

(architectural design, synthesis, validation), CEA List (validation, integration), CEA Leti (integration, backend)

40

WCDMA Receiver

D.C. h(n)

Nyquist Filter

A D

AGC

s(n)

RRC

Rake Receiver - Synchronisation - Channel estimation - Decoding

WCDMA/UMTS Receiver

3900 MOPS 500 MOPS

21

41

Complete receiver on a DART cluster

Filtering (54613 cy.)

9 cy.

Synchronisation Fchip (4608 cy.)

9 cy.

Synchro. Fsymb (36 cy.)

9 cy.

Channel estimation (8 cy.)

3 cy.

Decoding (2560 cy.)

3 cy.

114.8mW ⇔ 38.8 MOPS/mW @ 6.2 GOPS 1% 9%

6%

5%

79%

Instruction reading and decoding

Data access in the DPRs

Data access in the cluster

Address generator

Operators

42

[Figure: log2(Texec) (axis 10 to 40) versus number of symbols (1 to 1,000,000, log scale) for C64, DART and Xc200E, with the real-time limit]

Positioning DART

DSP is not real time Reconfiguration (2.7ms) overhead for the FPGA

Processing of several symbols (> 150 symbols)

Temporary results (> 1.2Mbits) in memory

C64: 1.5 MOPS/mW, Xc200E: 3 MOPS/mW DART: 39MOPS/mW

22

43

DCT implementation

[Figure: number of cycles (axis 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, PentiumI, PentiumII, NEC V830, DART and DART SWP; the processors are grouped as VLIW, Superscalar and Reconfigurable]

44

Motion estimation Video coding MPEGx, H26x

Motion Vector (u,v)

Reference Block NxN

Matched Block NxN

N+2p

Search Window

p

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad = sad + ABS(BR(i,j) - FR(i+u,j+v));
        /* if (sad >= sadmin) break; */
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}

23

45

Motion estimation SAD calculation

HW Configuration

- ABS

BR FR

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad += ABS(BR(i,j) - FR(i+u,j+v));
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}

+

46

Gantt diagram for ME (tentative)

SAD on search window

SAD

Update Reference MB Current MB

HW SW

min

HW

256 cy 256 cy

12 cy

47 cy 1 cy

3 cy

16 cy

N=8

24

47

Power distribution inside a DART cluster during ME

1%

25%

9%

16%

49%

Instruction reading and decoding

Data access inside the DPRs

Data access inside the cluster

Address generation

Operators

48

Conclusions 1/2
Definition of an architecture that is dynamically reconfigurable at the functional level
  High performance
  Control of power consumption
  Flexibility
  Minimisation of the impact of reconfiguration
  • Performance • Power consumption
  Hierarchical organisation
  • Exploitation of parallelism

25

49

Conclusions 2/2
Definition of a development flow
  C front-end
  Semi-automatic partitioning
  Mixed compilation / architectural synthesis method
  RTL simulation and power estimation
Validation
  Comparison of the different reconfiguration paradigms
  • Performance • Power consumption • Reconfiguration cost

50

Work in progress
Mozaic platform (PhD thesis of Julien Lallet)
  DART remains specific to one application domain
  Make generic
  • the structure of the architecture • the reconfiguration mechanisms
  Specification in an architecture description language (ADL)
  • RTL code generation
MOREA architecture (PhD thesis of Erwan Grâce)
  Optimisation of the memory hierarchy
  "Reconfigurable" memory and address generators
  Multi-cluster system study
  Reconfiguration management
  FPGA prototype (in progress)

26

51

Perspectives for transmedi@
Specialisation to the application domain
  Video, audio
  Transcoding, multi-standard
Management of multiple video streams
"Co-processing" mode
Interface management


27

53

Appendix Architecture Tools Validation

54

SIMD Multiplier Architecture

16

16-bit Booth-Wallace

mul/add 8-bit carry-save mul/

add

Input A

Input B

Output

SIMD

16

32

L L L L L : Latch

Shifter

8-bit carry-save mul/

add

Mux

Demux

OP

Shift

28

55

ALU architecture

Arithmetic Unit ADD, SUB, ABS

Input A

Input B

Output

SIMD

Shifter

Logic Unit AND, OR, …

Mux

Demux

Command

Output Shift

Shifter

L L L L

Input Shift
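The SIMD mode of the ALU can be pictured as one wide adder whose carry chain is cut between sub-words. The sketch below assumes a 16-bit word split into two 8-bit lanes, by analogy with the 16-bit and 8-bit modes of the multiplier slide; the lane widths for the ALU and the function name `simd_add8x2` are assumptions made for illustration.

```c
#include <stdint.h>

/* Add two 16-bit words as two independent 8-bit lanes.
   Each lane wraps around on overflow and no carry crosses
   from the low lane into the high lane. */
uint16_t simd_add8x2(uint16_t a, uint16_t b)
{
    /* Low lane: keep only the low 8 bits of the sum. */
    uint16_t lo = (uint16_t)((a + b) & 0x00FFu);
    /* High lane: add the high bytes alone, then drop the carry out. */
    uint16_t hi = (uint16_t)(((a & 0xFF00u) + (b & 0xFF00u)) & 0xFF00u);
    return (uint16_t)(hi | lo);
}
```

For example, `simd_add8x2(0x01FF, 0x0101)` gives `0x0200`, whereas a plain 16-bit addition would give `0x0300` because the carry of the low byte would spill into the high byte.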

56

Fully connected network

Global Bus

Mem

FU1

FUs + registers

2:1

4:1 14:1

2:1

4:1 14:1

FU4

2:1

4:1 14:1

2:1

4:1 14:1

29

57

Connections to global bus

11:1 11:1

decod

11:1 11:1

Configuration Even Bus

Odd Bus Mem. FUs Reg.

2 4 4

‘ Z ’

58

Segmented network

RDP i

Configurable Interconnection

RDP i+1

30

59

Mem 1 64x16

Data Mem1

decod

@

Instr

datapath @ 1

Seq1

Data Mem4

decod

@

Instr

datapath @ 4

Seq4

Zero-overhead loop support

Mem ‘ 64x16

Address generation unit
Generate the address sequences for data processed inside the DPRs
Addressing modes:
  Pre- or Post-Increment, Modulo, Bit reverse, …
Hardware loop management
  Up to 4 nested loops
  Up to 8 instructions per loop body
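Two of the addressing modes listed above, post-increment with modulo and bit reverse, can be sketched in a few lines of C. The function names and parameters are hypothetical; the sketch only shows the address arithmetic, not the instruction encoding of the DART address generators.

```c
#include <assert.h>

/* Post-increment with modulo addressing: return the current address,
   then advance it by `step` inside the circular buffer [base, base + len). */
unsigned ag_post_inc_modulo(unsigned *addr, unsigned step,
                            unsigned base, unsigned len)
{
    assert(len > 0 && *addr >= base && *addr < base + len);
    unsigned current = *addr;
    *addr = base + ((current - base + step) % len);
    return current;
}

/* Bit-reversed addressing: reverse the low `bits` bits of `index`,
   as used to reorder FFT data. */
unsigned ag_bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned k = 0; k < bits; k++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

With a 64-word data memory, as drawn on the slide, a modulo buffer of length 64 wraps from address 63 back to 0, and 6-bit reversal maps index 3 to address 48.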

60

Address generation unit

Mem @

64x16b

Seq

RI decod

@

data

R2 R3

+/++/-/--/NoP modulo

NoP/ Bit_reverse

MUX1 $1

Push N, M

R4 R5

R0 R1

R6 R7

MUXA MUXB

latch @ data

MUX2

MUXC

31

61

Sequencer

PC

++ -M/NOP

M1+M2+M3+M4 M2+M3+M4

M3+M4

M4

push

LIFO_minus_M

load

threshold

pop

M

Cd_minus_M

clk

Pointer

reset push

62

M1 N1 M2 N2 M3 N3 M4 N4

Cpt 1 Cpt 2 Cpt 3

CPT

++

empty

reset

=N ?

pop

Data_out

Data_in

push

LIFO

load

=M ? load

pop

+ +

M N

Cd_minus_M

+

Hardware loop management
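The principle of zero-overhead looping is that the program counter either increments or jumps back by the body length, with no branch instruction in the loop body. Below is a minimal single-level model, assuming a body of m instructions repeated n times; the function `hw_loop_trace` is hypothetical and ignores the nesting (LIFO) logic of the figure.

```c
/* Record the sequence of program-counter values for one hardware loop.
   The body occupies addresses [start, start + m) and runs n times.
   At most `max` values are written to `trace`; returns how many were written. */
int hw_loop_trace(int start, int m, int n, int *trace, int max)
{
    int pc = start;      /* program counter */
    int body = 0;        /* instructions executed in the current iteration */
    int iter = 0;        /* completed iterations */
    int count = 0;
    while (iter < n && count < max) {
        trace[count++] = pc;
        if (++body == m) {       /* end of body reached */
            body = 0;
            iter++;
            pc = pc + 1 - m;     /* jump back to the first instruction */
        } else {
            pc++;                /* normal sequential fetch */
        }
    }
    return count;
}
```

For a 3-instruction body at address 10 repeated twice, the trace is 10, 11, 12, 10, 11, 12: six fetches and no cycle spent on loop control.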

32

63

Appendix Architecture

Tools Validation

64

The cDART and ACG compilers

ACG cDART

Compilation Compilation

Data access extraction

Irregular code extraction

Irregular processing + data manipulation

SW instructions Address generation instructions

Parser assembler -> AG codes

Parser assembler -> SW config

Compilation

33

65

FU2 FU3 FU4 FU1

network

Mem1 Mem2 Mem3 Mem4

Armor model of DART

AG1 AG2 AG3 AG4

Mem5 Mem6 Mem7 Mem8

AG5 AG6 AG7 AG8

Mem21 Mem22 Mem23 Mem24

AG21 AG22 AG23 AG24

Cluster Memory

Memory Controller

66

Appendix Architecture Tools Validation

34

67

Task vs. Operation parallelism

FIR, 6 DPRs

47 %

Rake Receiver 6

DPRs, 9 %

4 cy Other

threads, 6 DPRs,

44 %

11 cy

4 DPRs, 59 %

2 DPRs, 27 %

6 DPRs, 41%

11 cy

4 cy 2 DPR, 32 %

1 DPR, 53 %

1 DPR, 59 %

11 cy

4 cy

4 DPRs, 59 %

6 DPRs, 41%

68

Reconfiguration cost

Configuration data: 1 x 1.4 Mbits for the FPGA

Control data: Ncycle x 256 bits for the DSP

[Figure: data volume in bits (log scale, 1 to 100,000,000) for Configuration (filter), Control (filter), Configuration (Rake) and Control (Rake) on C64, Xc200E and DART. Bar labels as extracted: 14423016, 520, 13107200, 53248, 14423016, 1716, 2010624, 208]

35

69

Implementation of the WCDMA/FIR on a DART cluster with the SIMD mode

Nb of DPR   nb cy/sample   resource usage rate (%)   DPR usage rate (%)   Nb configuration instructions
3           7              90                        82.7                 8
4           5              100                       59.1                 9
5           5              84                        59.1                 12
6           4              92                        47.3                 11

[Figure: the same quantities plotted against the number of allocated DPRs (3 to 6), vertical axis 0 to 120: number of cycles needed to process a sample, resource usage rate (%), DPR usage rate (%), number of configuration instructions]

70

Implementation of the WCDMA/Rake on a DART cluster with the SIMD mode

Nb of DPR   nb cy/symbol (x100)   resource usage rate (%)   DPR usage rate (%)   Nb configuration instructions
1           15.51                 100                       17.8                 4
2           7.78                  100                       8.9                  4
3           5.21                  100                       5.9                  4
4           5.21                  75                        5.9                  4
5           5.21                  60                        5.9                  4
6           2.63                  100                       3                    4

[Figure: the same quantities plotted against the number of allocated DPRs (1 to 6), vertical axis 0 to 120: number of cycles (/50) needed to process a symbol, resource usage rate (%), DPR usage rate (%), number of configuration instructions]
