CAIRN Project-Team
Energy-Efficient Reconfigurable System-on-Chip DART Coarse-Grain Reconfigurable Architecture
Olivier Sentieys [email protected]
with contribution from Raphaël David (CEA List), Sébastien Pillement (IRISA)
Agenda
• Motivations and Challenges
• Dynamically Reconfigurable Architectures
  • Anatomy of an RSoC
  • From Applications to Architecture
• Coarse-grain Reconfigurable Architecture
  • DART architecture (Mozaic platform)
  • Morea architecture
• Conclusion
[Photo: a cairn in Bréhat]
State-of-the-art SoC: HDTV Set-Top Box [L. Ducousso, STMicroelectronics]
• Domain-specific SoC: functionality is inside
  • Various applications and standards inside: MPEG2, H264, satellite, WiFi/LAN, hard disk, …
• 65 nm, 150 MTr, 1 B$ mask set, 60 weeks design, 6-18 months lifetime
• Heterogeneity
  • 16 processors, 38 IPs, 5-6000 MIPS
  • 140 memory blocks
  • 5 Gbytes/s on-chip interconnection network
• HW: 5M RTL code lines
• SW: 60M code lines (OS, middleware, HAL, firmware)
Challenges and limitations
• High-performance applications: e.g. H264 codec, 802.11n MIMO, …
• Energy and power constraints: battery life, manufacturing cost
• Rapidly changing application standards: SW updates vs. HW redesign
• Compilation and synthesis tools targeting heterogeneous SoC
• Technological impacts: manufacturing problems, transient errors, silicon bugs
A road for reconfigurable chips
• Dynamically adapt the hardware to the application: energy-performance-cost trade-off
• Self-adapting devices: continuously adapt to changing environments
• Other advantages
  • Regularity of the layout
  • High-performance, parallel
  • Error and fault tolerance
[Photo: Fresh SoC from CEA with DART IP from IRISA]
“Flexible Software on Flexible Hardware”
Reconfigurable system-on-chip
[Figure: RSoC block diagram with a processor, a memory hierarchy, HW blocks, a fine-grain reconfigurable area and a coarse-grain reconfigurable area]
• Programmable processors, specialized HW blocks
• Reconfigurable hardware: fine-grain, coarse-grain, "on-the-fly ASIC"
• Reconfigurable interconnect and memory structures
Reconfigurable system-on-chip (continued)
• Multithreaded applications
• Thread compilation to reconfigurable hardware
• Fixed-point specification
• Reconfiguration management
  • Hardware abstraction layer
  • Static or dynamic (at run-time) reconfiguration
Design Space
RECONFIGURABLE ARCHITECTURES (R-SoC)
FINE GRAIN (FPGA)
MULTI GRANULARITY (Heterogeneous)
COARSE GRAIN
Processor + Coprocessor
Tile-Based Architecture
Coarse Grain Coprocessor
Fine Grain Coprocessor
Island Topology
Hierarchical Topology
Linear Topology
Hierarchical Topology
Mesh Topology
• Chameleon • REMARC • Morphosys • PACT XPP
• Pleiades • Garp • FIPSOC • Triscend E5 • Triscend A7 • Xilinx Virtex-II Pro • Altera Excalibur • Atmel FPSLIC
• Xilinx Virtex • Xilinx Spartan • Atmel AT40K • Lattice ispXPGA
• Altera Stratix • Altera Apex • Altera Cyclone
• Systolic Ring • RaPiD • PipeRench
• DART • FPFA
• RAW • CHESS • MATRIX • KressArray • Systolix Pulsedsp
• aSoC • E-FPFA
[Bossuet03]
3G Wireless Terminal
• High performance (12 GOPS), low power (500 mW): 24 MOPS/mW @ 12 GOPS
• Flexibility: applications, services
• Multiple granularity: arithmetic, logic
[Figure: transmit and receive chains of a 3G terminal]
• Source: data, audio, video
• Source coding: V34, V8, H225, H245, ... (data); EFR, AMR, CELP, RPE-LTP, ... (audio); MPEGx, H26x, ... (video)
• Channel coding: Viterbi, turbo coder, Reed Solomon, ...
• Access technique: TDMA, FDMA, W-CDMA, ...
• Modulation: PSK, MSK, ASK, QAM, ...
• Demodulation: PSK, MSK, ASK, QAM, ...
• Access technique: TDMA, WCDMA, ...
• Channel decoding: Viterbi, turbo decoder, Reed Solomon, ...
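The energy-efficiency figure quoted for the 3G terminal (24 MOPS/mW at 12 GOPS and 500 mW) is simply throughput divided by power. A minimal sketch of that arithmetic follows; the function name `mops_per_mw` is hypothetical and not part of the DART tools.

```c
#include <assert.h>

/* Energy efficiency in MOPS/mW: throughput divided by power.
   Hypothetical helper, not part of the DART tool chain. */
double mops_per_mw(double gops, double power_mw)
{
    assert(power_mw > 0.0);
    return (gops * 1000.0) / power_mw;   /* 1 GOPS = 1000 MOPS */
}
```

Calling `mops_per_mw(12.0, 500.0)` reproduces the 24 MOPS/mW target stated for the terminal.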
Reconfigurable Architectures
Wireless Multimedia Receiver
[Figure: receiver functions (demodulator, equalizer, multiple access, channel decoder, demultiplexer, source decoder, with image, music and voice outputs) mapped over time onto two processors and a reconfigurable coprocessor]
Processing Model [adapted from Leray08]
[Figure: tasks T1, T2a, T2b, T2c and T3 scheduled over time t onto reconfigurable areas RA1 to RA5]
RA: Reconfigurable Area, CM: Configuration Management
DART Architecture
Architecture Principles of DART
Compilation Workflow
Validation and Silicon Prototype
Overall Objectives
• Coarse-grained reconfigurable architecture
• Energy efficiency
• Dynamic reconfiguration (4 to 20 cycles)
• Compilation from a C code specification (no place and route)
Energy Efficiency
• Technological parameter: C_S
• Applicative parameters: N_op · F_clk, α
• Potential optimisations: A_ctrl, A_mem, A_op, α, V_DD
Cost of control
• Minimize the configuration data volume (A_ctrl)
  • Limited number of operation types and data formats
  • Various processing patterns
  • Reconfiguration at the data-path level (rather than at the gate level, as in an FPGA)
• Reduce the frequency of reconfigurations (α)
  • Loop body: limited number of operations, regular patterns
  • Each loop can be implemented as a unique configuration, which is maintained during the processing time
Example: Motion Estimation (ME)
Video coding: MPEGx, H26x
[Figure: an NxN reference block is compared with candidate NxN matched blocks inside a search window of side N+2p; the displacement of the best match is the motion vector (u,v)]

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad = sad + ABS(BR(i,j) - FR(i+u,j+v));
        /* if (sad >= sadmin) break; */
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}
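The full-search loop nest above can be written as a compilable C function. The following is a minimal sketch under stated assumptions: the current block `cur` plays the role of BR, the search window `win` plays the role of FR, both are stored row-major as 8-bit pixels, and the window is exactly (N+2p) x (N+2p) so that candidate (u,v) starts at row p+u and column p+v. The type `MotionVector` and the function `full_search` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Best displacement found and its sum of absolute differences. */
typedef struct { int u, v, sad; } MotionVector;

/* cur: n x n block, row-major.
   win: (n+2p) x (n+2p) search window, row-major (assumed layout). */
MotionVector full_search(const unsigned char *cur,
                         const unsigned char *win, int n, int p)
{
    assert(n > 0 && p >= 0);
    const int stride = n + 2 * p;
    MotionVector best = { 0, 0, INT_MAX };

    for (int u = -p; u <= p; u++) {
        for (int v = -p; v <= p; v++) {
            int sad = 0;
            /* Stop adding rows once this candidate cannot beat the best. */
            for (int i = 0; i < n && sad < best.sad; i++) {
                for (int j = 0; j < n; j++) {
                    int ref = win[(p + u + i) * stride + (p + v + j)];
                    sad += abs(cur[i * n + j] - ref);
                }
            }
            if (sad < best.sad) {
                best.sad = sad;
                best.u = u;
                best.v = v;
            }
        }
    }
    return best;
}
```

The row-level early exit corresponds to the commented-out `break` on the slide; removing the `sad < best.sad` test gives the fully regular loop nest, which is the form suited to a single hardware configuration.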
Data access cost
• Minimize memory access cost (A_mem)
  • Storage capacity
  • High bandwidth
  • Memory hierarchy
• Minimize the number of memory accesses (α)
  • Minimize temporary data storage
  • Avoid redundant access to data
  • Local registers
[Figure: energy per access in pJ (axis 0 to 160) versus memory size in number of words (64, 256, 1024, 16536)]
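Reconfigurable Operators
[Figure: an operator with two inputs (Input 1, Input 2), one output and a control input that selects among Operation 0, Operation 1 and Operation 2; the selection is annotated with α and the operator with A_op]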
System Architecture of DART
[Figure: four clusters (Cluster 1 to Cluster 4) with a task controller, a configuration memory, an instruction memory, a data memory and an I/O controller]
Cluster Architecture
[Figure: six reconfigurable data paths (RDP1 to RDP6) linked by a segmented network, with a data memory, a configuration memory, a configuration controller, a DMA controller and an FPGA]
Reconfigurable Data Path Architecture
[Figure: four functional units (FU1 to FU4) and two registers (reg1, reg2) connected through a multi-bus crossbar network to four data memories (Data Mem1 to Data Mem4), each with its own address generator (AG1 to AG4); HW loop management; global bus. Configuration fields of 92 bits and 34 bits are annotated.]
826 bits to reconfigure the arithmetic resources of a cluster
Irregular and Regular Software

for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<N;i++){
    tmp+=x[i]*h[N-i];
  }
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}

• Irregular processing: little parallelism, little regularity, less complex
• Regular processing: massively parallel, very regular, complex
HW Reconfiguration
[Figure: Configuration 1 computes tmp+=x(i)*h(N-i) with a multiplier and an adder fed from Mem1 and Mem2; after a 4-cycle reconfiguration, Configuration 2 computes y(i)=(x(i)-x(i-1))² with a subtractor and a multiplier, using Mem1 and Mem3]
• DART potential is fully exploited
• Optimal flexibility of operators and network
• Use of registers
• Multiple DPR chaining via segmented network
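SW Reconfiguration
[Figure: Configuration 1 computes C=A+B with an adder fed from Mem1 and Mem2; after a 1-cycle reconfiguration, Configuration 2 computes E=C*D with a multiplier, using Mem4 and Mem1]
• Reduced flexibility of the DPR
• Operator configuration
• Operator source configuration
• No operator or DPR chaining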
SCMD: Single Configuration Multiple Data
• Irregular processing has little parallelism: implementation on one DPR
• Massively parallel processing is very regular: redundancy in DPR configurations
• The configuration data stream can be reduced if the regularity is exploited: simultaneous broadcast of common configuration data toward several DPRs
SCMD at work
[Figure: the configuration bits are broadcast to RDP1, RDP2, …, RDP6; each RDP stores the configuration data in a latch enabled by its own validation signal (RDP1 Validation, RDP2 Validation, …, RDP6 Validation)]
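The SCMD broadcast can be modelled in a few lines of C. This is a behavioural sketch only, assuming one validation bit per DPR and a 32-bit configuration word; the word width, the `Cluster` type and the function `scmd_broadcast` are hypothetical and do not describe the actual DART configuration format.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_DPR 6   /* a cluster holds six DPRs (RDP1 to RDP6) */

/* Latched configuration of each DPR (word width is an assumption). */
typedef struct { uint32_t config[NUM_DPR]; } Cluster;

/* Broadcast one configuration word to the whole cluster. Only the DPRs
   whose validation bit is set in valid_mask latch it.
   Returns the number of DPRs updated by this single broadcast. */
int scmd_broadcast(Cluster *c, uint32_t word, unsigned valid_mask)
{
    assert(c != 0);
    int updated = 0;
    for (int k = 0; k < NUM_DPR; k++) {
        if (valid_mask & (1u << k)) {
            c->config[k] = word;
            updated++;
        }
    }
    return updated;
}
```

With a mask selecting all six DPRs, one word on the configuration bus replaces six separate configuration transfers, which is the reduction of the configuration data stream that SCMD aims for.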
Compilation Workflow
C Code
Front-End (SUIF)
• Code optimisation
• Code extraction
Compilation (cDART, ACG)
• SW configuration generation
• Code compilation for address generators
Synthesis (gDART)
• DFG scheduling
• Operator binding
• HW configuration generation
SystemC Simulation (SCDART)
• Bit-accurate, cycle-accurate (BA-CA) simulation
• Performance estimation
Compiler front-end
• Currently based on SUIF
• High-level source optimisations
• Parallelism extraction: partial loop unrolling
• Semi-automatic partitioning
  • Regular processing (loops): HW configurations
  • Irregular processing and data management: SW configurations and AG instructions
Compilation Front End (SUIF)

C code:

void main(){
  ...
  for (n=0;n<1024;n++){
    tmp=0;
    for (i=0;i<64;i++) //unroll 2
      tmp+=x[i]*h[N-i];
    for (j=0;j<32;j++) //unroll 2
      z[j]=x[j]-h[j];
    y[n]=tmp<<6;
    X[0]=x[n]+128;
  }
  ...

After partial loop unrolling (SUIF front-end):

void main(){
  ...
  for (n=0;n<1024;n++){
    tmp=0;
    for (i=0;i<64;i+=2){
      tmp+=x[i]*h[N-i];
      tmp+=x[i+1]*h[N-(i+1)];
    }
    for (j=0;j<32;j+=2){
      z[j]=x[j]-h[j];
      z[j+1]=x[j+1]-h[j+1];
    }
    y[n]=tmp<<6;
    X[0]=x[n]+128;
  }
  ...

After regular code extraction:

Irregular processing:

void main(){...
  for (n=0;n<1024;n++){
    tmp=0;
    for (i=0;i<64;i+=2){
      Mem1=x[i];Mem2=h[N-i];}
    for (j=0;j<32;j+=2){
      Mem3=x[j];Mem4=h[j];
      z[j]=Mem5;}
    y[n]=Mem6<<6;
    X[0]=x[n]+128;
  }
…}

Loop body 1:

void main(int X0, int H0, …, int *Y){
  int tmp;
  tmp=tmp+X0*H0;
  tmp=tmp+X1*H1;
  *Y=tmp;
}

Loop body 2:

void main(int X0, int H0, …, int *Z0, int *Z1){
  *Z0=X0-H0;
  *Z1=X1-H1;
}
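The partial loop unrolling step can be checked on a small compilable example. The sketch below compares a rolled multiply-accumulate loop with its unroll-by-2 version. It is an illustration only: the function names are hypothetical, and the coefficient index is written `taps - 1 - i` (rather than the slide's `N-i`) so that the arrays need exactly `taps` elements, which is an assumption about the intended indexing.

```c
#include <assert.h>

/* Rolled reference: one multiply-accumulate per iteration. */
long fir_rolled(const int *x, const int *h, int taps)
{
    long acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (long)x[i] * h[taps - 1 - i];
    return acc;
}

/* Unrolled by 2: two multiply-accumulates per iteration, so the loop
   body exposes two operations that can be mapped side by side. */
long fir_unrolled2(const int *x, const int *h, int taps)
{
    assert(taps % 2 == 0);   /* unroll factor must divide the trip count */
    long acc = 0;
    for (int i = 0; i < taps; i += 2) {
        acc += (long)x[i]     * h[taps - 1 - i];
        acc += (long)x[i + 1] * h[taps - 2 - i];
    }
    return acc;
}
```

Both functions return the same sum for any even `taps`; the unrolled form halves the number of loop iterations at the cost of a larger loop body.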
Retargeting compilation framework CALIFE
[Figure: compilation gateways take the source code, the ARMOR model of DART and specialized information, and draw on a compilation library (code selection, register allocation, scheduling) to produce optimized binary code]
HW configuration generation
[Figure: data-flow graphs made of multiplications (*) and additions (+), shown in several arrangements of the same operations]
• gDART transforms the nested loops (regular processing) into HW configurations
• Based on classical techniques used in high-level synthesis
  • Loop reduction, merging, …
  • Graph depth reduction
  • Resource binding, memory allocation
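Graph depth reduction can be illustrated with the sum of several products: a chain of n-1 dependent additions is rebalanced into a tree of about log2(n) adder levels. The sketch below is a generic illustration of that classical transformation, not gDART's implementation; `tree_reduce` is a hypothetical name.

```c
#include <assert.h>

/* Sums n values pairwise, level by level, in place.
   The total ends up in v[0]; the return value is the number of adder
   levels (graph depth), versus n - 1 levels for a sequential chain. */
int tree_reduce(int *v, int n)
{
    assert(n > 0);
    int depth = 0;
    while (n > 1) {
        int half = n / 2;
        for (int k = 0; k < half; k++)
            v[k] = v[2 * k] + v[2 * k + 1];   /* independent additions */
        if (n % 2) {                          /* odd element passes through */
            v[half] = v[n - 1];
            half++;
        }
        n = half;
        depth++;
    }
    return depth;
}
```

For eight operands the tree needs three levels instead of seven, and the additions of one level are independent, so they can be bound to different functional units.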
Simulator
• SCDART is a bit-accurate and cycle-accurate simulator developed in SystemC at the Register Transfer Level
• Verification
• Power and performance estimation
DART Architecture
• 5-10 GOPS/cluster @ 130nm
• 300 mW @ 200MHz
• 16 MOPS/mW @ 5 GOPS
• Simulator, compiler tools
• Delivered as an RTL model
• Circuit (ST 130nm) in June 2005
• 3G/UMTS mobile terminal, 802.11a (channel estimation)
• STMicroelectronics, CEA LIST/LETI
[Figure: cluster with six RDPs (RDP1 to RDP6), segmented network, data memory, configuration memory, DMA controller, controller and FPGA; each RDP has FU1 to FU4, two registers, a fully connected network, four data memories with address generators AG1 to AG4, and loop management]
Fresh Circuit (CEA)
• 4G mobile terminals
• Technology: ST 0.13 µm
• CPU core: ARM946
• 4.8 Mgates
• Chip area = 80 mm²
• Package: TBGA 420
• Core power supply: 1.2 V
Silicon prototype of DART
• Complex SoC including the DART accelerator
• Collaboration on DART between IRISA/Cairn (architectural design, synthesis, validation), CEA List (validation, integration) and CEA Leti (integration, backend)
WCDMA/UMTS Receiver
[Figure: receiver chain with A/D conversion, AGC and D.C. blocks, a Nyquist (RRC) filter h(n) producing s(n), and a rake receiver (synchronisation, channel estimation, decoding); annotated complexities of 3900 MOPS and 500 MOPS]
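Complete receiver on a DART cluster
[Figure: the receiver stages run in sequence on one cluster, separated by reconfigurations]

Stage                      Processing (cycles)
Filtering                  54613
Synchronisation (Fchip)    4608
Synchronisation (Fsymb)    36
Channel estimation         8
Decoding                   2560

Reconfigurations between stages: 9, 9, 9, 3 and 3 cycles
114.8 mW ⇔ 38.8 MOPS/mW @ 6.2 GOPS
[Figure: power distribution pie chart; shares 1%, 9%, 6%, 5%, 79%; legend: instruction reading and decoding, data access in the DPRs, data access in the cluster, address generator, operators]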
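The weight of reconfiguration in the receiver schedule can be estimated by summing the cycle counts given for the stages. The sketch below does that arithmetic; pairing each reconfiguration time with a particular stage is an assumption (the slide only gives the five values), and the `Stage` type and `reconf_overhead` function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One receiver stage: processing cycles and the reconfiguration cycles
   charged to it (the pairing with stages is an assumption). */
typedef struct {
    const char *name;
    long work_cycles;
    long reconf_cycles;
} Stage;

static const Stage wcdma_stages[] = {
    { "filtering",             54613, 9 },
    { "synchronisation Fchip",  4608, 9 },
    { "synchronisation Fsymb",    36, 9 },
    { "channel estimation",        8, 3 },
    { "decoding",               2560, 3 },
};

/* Returns the fraction of all cycles spent reconfiguring and reports
   the two totals through the optional output pointers. */
double reconf_overhead(const Stage *s, size_t n, long *work, long *reconf)
{
    assert(s != NULL && n > 0);
    long w = 0, r = 0;
    for (size_t k = 0; k < n; k++) {
        w += s[k].work_cycles;
        r += s[k].reconf_cycles;
    }
    if (work)   *work = w;
    if (reconf) *reconf = r;
    return (double)r / (double)(w + r);
}
```

With these figures, 33 cycles out of 61858 are spent reconfiguring, which is about 0.05% of the schedule.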
Positioning DART
[Figure: log2(Texec) (axis 10 to 40) versus number of symbols (1 to 1000000) for C64, DART and Xc200E, with the real-time limit]
• DSP is not real time
• Reconfiguration (2.7 ms) overhead for the FPGA
  • Processing of several symbols (> 150 symbols)
  • Temporary results (> 1.2 Mbits) in memory
• C64: 1.5 MOPS/mW, Xc200E: 3 MOPS/mW, DART: 39 MOPS/mW
DCT implementation
[Figure: bar chart of the number of cycles (axis 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II, NEC V830, DART and DART SWP, grouped as VLIW, superscalar and reconfigurable]
Motion estimation: SAD calculation
[Figure: HW configuration chaining a subtractor (-), an absolute value (ABS) and an accumulating adder (+), fed by BR and FR]

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad += ABS(BR(i,j) - FR(i+u,j+v));
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}
Gantt diagram for ME (tentative), N=8
[Figure: alternating HW and SW phases: SAD on search window, SAD, min, update of the reference MB and current MB; annotated durations: 256 cy, 256 cy, 12 cy, 47 cy, 1 cy, 3 cy, 16 cy]
Power distribution inside a DART cluster during ME
[Figure: pie chart; shares 1%, 25%, 9%, 16%, 49%; legend: instruction reading and decoding, data access inside the DPRs, data access inside the cluster, address generation, operators]
Conclusions 1/2
• Definition of an architecture that is dynamically reconfigurable at the functional level
  • High performance
  • Controlled power consumption
  • Flexibility
  • Minimized impact of reconfiguration on performance and power consumption
  • Hierarchical organisation: exploitation of parallelism
Conclusions 2/2
• Definition of a development flow
  • C front-end
  • Semi-automatic partitioning
  • Mixed compilation / architectural synthesis method
  • RTL simulation and power estimation
• Validation
  • Comparison of the different reconfiguration paradigms: performance, power consumption, cost of reconfiguration
Ongoing work
• Mozaic platform (PhD thesis of Julien Lallet)
  • DART remains specific to one application domain
  • Make generic: the structure of the architecture, the reconfiguration mechanisms
  • Specification in an ADL: generation of the RTL code
• MOREA architecture (PhD thesis of Erwan Grâce)
  • Optimisation of the memory hierarchy
  • "Reconfigurable" memory and address generators
  • Multi-cluster system study
  • Reconfiguration management
  • FPGA prototype (in progress)
Perspectives for transmedi@
• Specialisation to the application domain
  • Video, audio
  • Transcoding, multi-standard
• Management of multiple video streams
• "Co-processing" mode
• Interface management
Appendix
• Architecture
• Tools
• Validation
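SIMD Multiplier Architecture
[Figure: two 16-bit inputs (Input A, Input B) feed a 16-bit Booth-Wallace multiplier/adder and two 8-bit carry-save multiplier/adders; latches (L), a mux and a demux driven by the SIMD and OP controls, and a shifter driven by the Shift control produce the 32-bit output]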
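ALU architecture
[Figure: Input A and Input B pass through an input shifter (Input Shift) into an arithmetic unit (ADD, SUB, ABS) and a logic unit (AND, OR, …); a mux and a demux driven by the Command and SIMD controls select the result, followed by an output shifter (Output Shift) and latches (L)]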
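The SIMD control of the functional units can be pictured as splitting a 16-bit operator into independent 8-bit lanes. The sketch below shows a packed addition of two 8-bit lanes held in one 16-bit word, with no carry crossing the lane boundary. It assumes the ALU's SIMD mode uses the same 2 x 8-bit split as the multiplier; the slides do not state this, and `simd_add_2x8` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two 16-bit words as two independent 8-bit lanes.
   Each lane wraps modulo 256; the carry out of the low lane is
   discarded instead of propagating into the high lane. */
uint16_t simd_add_2x8(uint16_t a, uint16_t b)
{
    uint16_t lo = (uint16_t)((a + b) & 0x00FFu);
    uint16_t hi = (uint16_t)(((a & 0xFF00u) + (b & 0xFF00u)) & 0xFF00u);
    return (uint16_t)(hi | lo);
}
```

For example, `simd_add_2x8(0x01FF, 0x0101)` yields `0x0200`, whereas a plain 16-bit addition of the same words would give `0x0300`.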
Fully connected network
[Figure: each functional unit (FU1 to FU4) selects its operands through 2:1, 4:1 and 14:1 multiplexers from the global bus, the memories, and the FUs and registers]
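Connections to global bus
[Figure: 11:1 multiplexers, driven by a decoder of the configuration, connect the memories, FUs and registers (annotated counts: 2, 4, 4) or a high-impedance state ‘Z’ to the even bus and the odd bus]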
Segmented network
[Figure: configurable interconnection between RDP i and RDP i+1]
Address generation unit
[Figure: four address generators, each with a 64x16 instruction memory, a sequencer (Seq1 to Seq4) with zero-overhead loop support, an instruction decoder and an address datapath (@1 to @4) driving Data Mem1 to Data Mem4]
• Generates the address sequences for data processed inside the DPRs
• Addressing modes: pre- or post-increment, modulo, bit reverse, …
• Hardware loop management
  • Up to 4 nested loops
  • Up to 8 instructions per loop body
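Two of the addressing modes listed above, post-increment with modulo and bit reverse, can be sketched in C as follows. This is a behavioural illustration only, not the DART address generator's instruction set; the function names and the parameter choices are hypothetical.

```c
#include <assert.h>

/* Post-increment with modulo (circular buffer) addressing:
   returns the current address, then advances it by step, wrapping
   at modulo. */
unsigned ag_post_inc_mod(unsigned *addr, unsigned step, unsigned modulo)
{
    assert(addr != 0 && modulo > 0);
    unsigned cur = *addr;
    *addr = (*addr + step) % modulo;
    return cur;
}

/* Bit-reversed addressing over the low 'bits' address bits, as used to
   reorder the data of an FFT. */
unsigned ag_bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (index & 1u);
        index >>= 1;
    }
    return r;
}
```

With a 64-word memory (6 address bits), `ag_bit_reverse(1, 6)` gives 32, and a post-increment pointer at 63 wraps back to 0 when the modulo is 64.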
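Address generation unit (datapath)
[Figure: a 64x16b instruction memory with sequencer (Seq), instruction register (RI) and decoder; address registers R0 to R7 selected through MUXA, MUXB and MUXC; an update unit (+, ++, -, --, NoP) with modulo; an optional bit reverse stage (NoP/Bit_reverse); MUX1 and MUX2; a latch on the address sent to the data memory; Push N, M toward the sequencer]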
Sequencer
[Figure: program counter (PC) updated by ++, -M or NOP; a LIFO (LIFO_minus_M) holding M1+M2+M3+M4, M2+M3+M4, M3+M4 and M4, with push, pop, load, threshold and pointer; signals M, Cd_minus_M, clk and reset]
Hardware loop management
[Figure: a LIFO of loop descriptors (M1 N1, M2 N2, M3 N3, M4 N4) with push, pop, load, Data_in, Data_out, empty and reset; counters Cpt 1 to Cpt 3 and CPT (++) compared with N and M (=N ?, =M ?); generation of Cd_minus_M]
The cDART and ACG compilers
[Figure: two parallel flows. cDART: extraction of the irregular code (irregular processing and data manipulations), compilation, SW instructions, assembler parser -> SW configurations. ACG: extraction of the data accesses, compilation, address generation instructions, assembler parser -> AG codes.]
Armor model of DART
[Figure: FU1 to FU4 connected through a network to memories Mem1 to Mem24, each with its address generator (AG1 to AG24), plus the cluster memory and a memory controller]
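Task vs. Operation parallelism
[Figure: allocation of the DPRs of a cluster to the FIR filter, the rake receiver and other threads for several mappings, with reconfiguration times of 4 and 11 cycles. Labels: FIR, 6 DPRs, 47 %; Rake Receiver, 6 DPRs, 9 %; Other threads, 6 DPRs, 44 %; 4 DPRs, 59 %; 2 DPRs, 27 %; 6 DPRs, 41 %; 2 DPRs, 32 %; 1 DPR, 53 %; 1 DPR, 59 %]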
Reconfiguration cost
• Configuration data: 1 x 1.4 Mbits for the FPGA
• Control data: Ncycle x 256 bits for the DSP
[Figure: data volume in bits (log scale, 1 to 100000000) for C64, Xc200E and DART in four categories: configuration (filter), control (filter), configuration (Rake), control (Rake). Plotted values: 14423016, 520, 13107200, 53248, 14423016, 1716, 2010624, 208]
Implementation of the WCDMA/FIR on a DART cluster with the SIMD mode

Nb of DPRs   Cycles per sample   Resource usage rate (%)   DPR usage rate (%)   Nb of configuration instructions
3            7                   90                        82.7                 8
4            5                   100                       59.1                 9
5            5                   84                        59.1                 12
6            4                   92                        47.3                 11

[Figure: number of cycles needed to process a sample, resource usage rate (%), DPR usage rate (%) and number of configuration instructions versus the number of allocated DPRs (3 to 6)]
Implementation of the WCDMA/Rake on a DART cluster with the SIMD mode

Nb of DPRs   Cycles per symbol (x100)   Resource usage rate (%)   DPR usage rate (%)   Nb of configuration instructions
1            15.51                      100                       17.8                 4
2            7.78                       100                       8.9                  4
3            5.21                       100                       5.9                  4
4            5.21                       75                        5.9                  4
5            5.21                       60                        5.9                  4
6            2.63                       100                       3                    4

[Figure: number of cycles (/50) needed to process a symbol, resource usage rate (%), DPR usage rate (%) and number of configuration instructions versus the number of allocated DPRs (1 to 6)]