Compilation for Embedded
Reconfigurable Computing
Architectures: Part A
João M. P. Cardoso and Pedro C. Diniz
3rd Summer School on
Generative and Transformational Techniques in Software Engineering
6–11 July, 2009, Braga, Portugal
Outline
• First part of the Tutorial: Architectures
– Motivation
– Technology Trends
– Reconfigurable Computing
– Embedded and Reconfigurable Architectures
– Reconfiguration
– Improving Performance
• Second part of the Tutorial: Compiling
– Main Compiler and Execution Concepts
– Compiling to Fine-Grained Reconfigurable Architectures
– Optimizations
• Conclusions
2
Embedded Computing
• High performance with low energy and at low cost
• Short time-to-market
• Upgrades during the product’s lifetime (implies reprogrammability)
3
Motivation
• Speedups achieved by Reconfigurable Computing
• Speedup of Cray XD1 (FPGAs @ 200 MHz) over an
Opteron CPU (@ 2.4 GHz)
– DNA sequencing
• 695× using 1 FPGA
• 2,794× using 6 FPGAs
– Data Encryption Standard (DES) cipher
• 12,162×
• Power savings of 148× and 608×, respectively
El-Ghazawi et al., IEEE Computer, 2008
4
The Reconfigurable Computing
Paradigm
• Traditional computing
– start: algorithm (variable)
– architecture: fixed structure
• Reconfigurable Computing
– start: algorithm (variable)
– architecture: variable structure
5
See Nick Tredennick’s Paradigm Classification Scheme
TECHNOLOGY TRENDS
6
Why higher levels of abstraction?
• Hardware and software design gaps
Source: The International Technology Roadmap for Semiconductors: 2007
7
System-level design requirements
• near-term years
Source: The International Technology Roadmap for Semiconductors: 2007
8
System-level design requirements
• long-term years
Source: The International Technology Roadmap for Semiconductors: 2007
9
Reconfigurability is seen as a “must” for future embedded computing systems
10
RECONFIGURABLE COMPUTING
11
Reconfigurable (Custom) Computing
• Hardware resources can be “configured” for a specific architecture
– Specialized Functional Elements and Processing Elements
– Interconnect between “Nodes” custom to the data flow in the application
– Configurable on-chip memories (size, data-width, indexing, etc.)
– Execution Models (Pipelined, Multithreading, VLIW)
All possible in the same reconfigurable fabric
12
Reconfigurable (customizable) Fabrics
[Figure: customized functions F(a,b,c,d), customized memories, and customized interconnects]
13
Reconfigurable (customizable) Fabrics
• Data travel on paths statically or dynamically defined
• Many on-chip memories
– Parallel accesses
• Native support for data streaming applications
• Custom pipelining
– On-chip configurable memories can be adapted to communication needs
14
Reconfigurable (customizable) Fabrics
[Figure: computing engines, memories, and FIFOs]
15
Reconfigurable (Custom) Computing
• Orders of magnitude speed-ups over traditional computing systems
• Why? Customization is the key:
– High operation- and task-level parallelism
• Increased by storage organization (data replication/distribution over multiple on-chip memories)
– Non-Standard Numeric Formats (fixed-point, etc.)
– Custom Routing
16
Reconfigurable (Custom) Computing
• Benefits:
– Reconfiguration is ideal for fast prototyping and early evaluation of realistic performance
– Performance
– Tolerate defects
• Costs:
– Added complexity of execution models makes programming very hard (we have not yet solved the parallel programming problem, sort of…)
17
Reconfigurable Computing
Many companies: Cray,
SGI, SRC, ARC, PACT,
PicoChip, Tilera, etc.
Based on source: Bezdek, J.C, Fuzzy
models - what are they, and why,
IEEE Trans. on Fuzzy Systems, 1993.
Reconfigurable Computing has
already achieved this point!
18
Reconfigurable Computing
• The Sony PSP Example
– Reconfigurable Architecture: Virtual Mobile Engine (VME): audio
• 24-bit data width
• 166 MHz
• Single-cycle context switch
Source: http://www.hotchips.org/archives/hc16/3_Tue/8_HC16_Sess8_Pres1_bw.pdf
19
Are Architectures Merging?
Multi-(Many)-core vs. Reconfigurable
• The regularity of reconfigurable fabrics (e.g., FPGAs) allows them to ride Moore's Law
– Unbelievably large number of devices
– Hard-macro cores can be plugged in
[Figure: multicore, manycore, and reconfigurable fabrics]
21
RECONFIGURABLE COMPUTING
ARCHITECTURES
22
Basic Concepts
• Fine-Grained Reconfigurable Arrays:
– Main Cell: Logic block
– Main example: FPGAs (Field-Programmable Gate Arrays)
23
FPGA Example
• Virtex-5 (5VLX330):
– 207,360 6-LUTs/FFs
– 10,368 Kbits BRAM
– 3,420 Kbits Distributed RAM
– 192 DSP48E Slices (includes a 25 x 18 multiplier)
• multiplier (32x32 → 32):
– 3 DSP48E (∼9.7 ns) ⇒ 64 multipliers!, or
– 754 LUTs (∼11 ns) ⇒ 275 multipliers!
• adder (32+32 → 32):
– 32 LUTs (∼5 ns) ⇒ 6,480 adders!
• RISC processor (MicroBlaze), @200 MHz and about 1,400 6-LUTs in Virtex-5 (1,650 w/ FPU)
– Virtex-5 (5VLX330) has LUTs for 148 microprocessors!
Source: http://www.xilinx.com
24
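The instance counts on this slide are integer divisions of the device totals by the per-unit cost. A minimal sketch that reproduces them, assuming only the figures quoted above and ignoring routing, placement, and timing limits (the names `LUTS`, `DSP48E`, and `max_instances` are illustrative, not from the slides):

```python
# Resource totals quoted on the slide for the Virtex-5 5VLX330.
LUTS = 207_360   # 6-input LUTs
DSP48E = 192     # DSP48E slices


def max_instances(available: int, cost_per_instance: int) -> int:
    """Upper bound on copies of a unit; routing and placement are ignored."""
    return available // cost_per_instance


estimates = {
    "32x32 multipliers (3 DSP48E each)": max_instances(DSP48E, 3),
    "32x32 multipliers (754 LUTs each)": max_instances(LUTS, 754),
    "32-bit adders (32 LUTs each)": max_instances(LUTS, 32),
    "MicroBlaze cores (1,400 LUTs each)": max_instances(LUTS, 1_400),
}

for unit, count in estimates.items():
    print(f"{unit}: {count}")
```

These are best-case bounds: a real design also spends LUTs on interconnect, control, and memory interfaces, so the achievable counts would be lower.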
Basic Concepts
• Coarse-Grained Reconfigurable Arrays:
– Main cell: Functional Unit
25
[Figure: coarse-grained cell with sources, two multiplexers (MUX) feeding inputs A and B of a functional unit (FU), a result register, a register file (RF), a configuration controller, and destinations]
Coarse-Grained Reconfigurable
Array Example
• The PACT XPP-3C: high
performance fixed point DSP
Source: http://www.pactxpp.com
26
VLIW vs. Coarse-Grained
Reconfigurable Arrays
• Example: r = a×c + b×d
– r6 = r1*r2;
– r7 = r3*r4;
– r5 = r6+r7;
[Figure: dataflow graph in which two multipliers (a×c and b×d) feed an adder that produces r]
• VLIW: PE 1, PE 2, …, PE N on a multi-port register file
– The two multiplications issue together on PE1 and PE2; the addition follows in the next instruction, with a nop in the other slot
• CGRA: functional units (FUs) with a register file, or a 3×3 array of processing elements (PE 1 to PE 9)
– The two multiplications and the addition are mapped onto PE1, PE2, and PE4
30
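The VLIW side of the comparison can be sketched as cycle-by-cycle scheduling of the example's dataflow graph onto a fixed number of issue slots. This is a toy model under stated assumptions: every operation takes one cycle, the graph is acyclic, and the function name `schedule` and the dictionary layout are illustrative rather than taken from the slides.

```python
# Dataflow graph for r = a*c + b*d, using the register names from the slide.
# Each entry maps a destination to (operator, source1, source2).
OPS = {
    "r6": ("*", "r1", "r2"),
    "r7": ("*", "r3", "r4"),
    "r5": ("+", "r6", "r7"),
}


def schedule(ops, num_slots):
    """Greedy scheduling: an operation issues once every source produced
    inside the graph is available and a slot is free in the current cycle."""
    done, cycles = set(), []
    pending = dict(ops)
    while pending:
        ready = [dest for dest, (_, s1, s2) in pending.items()
                 if all(s not in ops or s in done for s in (s1, s2))]
        issued = ready[:num_slots]
        # Unused slots are padded with nops, as in a VLIW instruction word.
        cycles.append(issued + ["nop"] * (num_slots - len(issued)))
        done.update(issued)
        for dest in issued:
            del pending[dest]
    return cycles


for n in (1, 2):
    print(n, "slot(s):", schedule(OPS, n))
```

With two slots the multiplications share the first instruction word and the addition sits next to a nop in the second. On a CGRA the same three operations would instead be placed on distinct PEs and connected directly, so the schedule becomes a spatial mapping rather than a sequence of instruction words.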
Past and Notable Architectures
• Xilinx XC6200 Field Programmable Gate Arrays, 1995-2001
• Chameleon (1997) RCP architecture, 2000-2003
• Triscend (1997), 2001-2004
31
Granularity in RC Architectures
32
Granularity in RC Architectures
• Fine-Grained Fabrics are able to host a wide
range of architectures
33
Granularity in RC Architectures
• Fine-Grained Fabrics are not tied to a
computational model
– E.g., load/store (a) vs. data streaming (b)
34
RECONFIGURATION
35
Reconfiguration
• Possibility to modify the computing structures
in the field (i.e., after fabrication)
• Static
– During setup or before the beginning of the computations
• Dynamic (at run time)
– During the computations
36
Reconfiguration
• The PACT XPP example (a coarse-grained reconfigurable architecture)
[Figure: Configuration Cache (CC), Configuration Manager (CM) with ports CMPort0 and CMPort1, and PEs; arrows labeled fetch and configure]
37
Reconfiguration
• Coarse-Grained Reconfigurable Architectures
– The PACT XPP reconfiguration flow
• Steps: Fetch (f), Configure (c), Compute (comp)
• Configuration script: c0; If(CMPort0) then c1; If(CMPort1) then c2;
[Figure: Configuration Cache (CC), Configuration Manager (CM), configurations c0, c1, and c2, a “<N” comparison, and ports CMPort0 and CMPort1]
38
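The script on this slide, `c0; If(CMPort0) then c1; If(CMPort1) then c2;`, can be read as a small control program run by the configuration manager. The sketch below replays it under simplifying assumptions: the steps are serialized (no overlap of fetch or configure with compute is modeled), and `run_flow` and the `port_events` dictionary are hypothetical names, not part of any PACT tool.

```python
def run_flow(port_events):
    """Replay `c0; If(CMPort0) then c1; If(CMPort1) then c2;`.

    `port_events` is a hypothetical stand-in for the signals raised on the
    manager's ports, e.g. {"CMPort0": True, "CMPort1": False}.
    Returns the (step, configuration) trace in execution order.
    """
    trace = []

    def load(config):
        # Each configuration passes through the three steps named on the slide.
        for step in ("fetch", "configure", "compute"):
            trace.append((step, config))

    load("c0")                      # c0 is always loaded first
    if port_events.get("CMPort0"):  # an event on CMPort0 selects c1
        load("c1")
    if port_events.get("CMPort1"):  # an event on CMPort1 selects c2
        load("c2")
    return trace


print(run_flow({"CMPort0": True, "CMPort1": False}))
```

Here c0 runs first and only c1 follows, because only CMPort0 is raised.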
Reconfiguration
• The PACT XPP reconfiguration flow
[Figure: flow from begin to end through Conf. 1 and Conf. 2]
39
Reconfiguration
• The PACT XPP reconfiguration flow
[Figure: flow from begin to end through Conf. 1 to Conf. 5]
40
Remarks about Reconfigurable
Computing Architectures
• Reconfigurable fine-grained architectures
– have the potential to virtually implement any
architecture
• Reconfigurable coarse-grained architectures
– are more computing oriented with granularity
close to the data widths used in data processing
• Both permit customization and allow energy savings and high performance
41
IMPROVING PERFORMANCE
42
Main Target Architecture
• Microprocessor extended with Hardware
Accelerators (e.g., coarse-grained
reconfigurable arrays)
43
Improving Performance
• For a given input application and a target embedded computing system:
• If requirements are not satisfied on:
– execution time, power dissipation, energy consumption, memory bandwidth
• Need to:
– perform code optimizations
– migrate sections of the code to a hardware accelerator
– redesign the entire system (e.g., including different processors)
44
Improve Execution Time
• Delegating to compiler optimizations such as -O3 might not be enough!
• Find the most critical sections of the code and then try to optimize those sections
– Use profiling tools (e.g., gprof) to identify them
– 90-10 rule of thumb: “90% of global execution time is spent in 10% of the code”
45
Amdahl's Law
• The global speedup we can achieve is limited
– by the fraction (f) of the application's execution time that is accelerated, and
– by the speedup (S) we can achieve for that fraction
$$\mathrm{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}$$
46
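A minimal sketch of Amdahl's law as a function, with f the accelerated fraction of execution time and S the speedup achieved on that fraction. The function name `amdahl_speedup` and the sample numbers are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Global speedup when a fraction f of the execution time is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Hypothetical case: 90% of the execution time is accelerated 20x.
print(round(amdahl_speedup(0.90, 20.0), 2))  # prints 6.9
```

Even with a 20× accelerator, the untouched 10% keeps the global speedup below 7×.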
Lessons from Amdahl's Law
• Two corollaries of this law:
– Improve fractions of the application that reflect a significant part of the global exec. time
• Small f: optimizations will have a minor impact
– But, the segments that we ignore also limit the speedup
• As S increases, the speedup will tend to
$$\mathrm{Speedup} = \frac{1}{1 - f}$$
[Figure: Speedup (0 to 100) plotted against f (0 to 0.96)]
47
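The bound for large S follows directly from Amdahl's law by letting the f/S term vanish. A short worked version, with sample values of f chosen here only for illustration:

```latex
\[
\lim_{S \to \infty} \frac{1}{(1 - f) + \frac{f}{S}} = \frac{1}{1 - f}
\]
\[
f = 0.5 \Rightarrow \mathrm{Speedup} \le 2, \qquad
f = 0.9 \Rightarrow \mathrm{Speedup} \le 10, \qquad
f = 0.99 \Rightarrow \mathrm{Speedup} \le 100
\]
```

The bound grows slowly until f is close to 1, which is why covering a large share of the execution time matters more than a very large S on a small share.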
Example
• Libmad for MP3 decoding
• Analysing theoretical limits for possible accelerations:
– considering dct32, a speedup upper bound of 1.11 is expected!
– to achieve an upper bound of 10 we need to take into account the first 9 functions!
– Note that the analysis considers zero-execution time for each accelerated function and zero communication costs (it is always useful as a first analysis…)
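The two bounds quoted above can be checked with the limit 1 / (1 − f), using the per-function percentages shown in the slide's bar chart. This is a sketch under the slide's own assumptions (zero execution time for accelerated functions, zero communication cost). The 10.02% share attributed to dct32 is inferred from the quoted bound of 1.11, and `SHARES` and `upper_bound` are illustrative names.

```python
# Shares of Libmad execution time (%), read from the slide's bar chart in
# decreasing order. Only the nine largest functions are needed here.
SHARES = [37.51, 16.48, 10.02, 5.37, 5.01, 4.8, 4.68, 3.59, 3.3]


def upper_bound(percent_accelerated: float) -> float:
    """Amdahl limit 1 / (1 - f): accelerated code takes zero time and
    communication is free, as the slide assumes."""
    return 1.0 / (1.0 - percent_accelerated / 100.0)


# dct32 alone; its 10.02% share is inferred from the quoted bound of 1.11.
print(f"dct32 only: {upper_bound(10.02):.2f}")

# Accelerating the k most expensive functions together.
covered = 0.0
for k, share in enumerate(SHARES, start=1):
    covered += share
    print(f"top {k}: {covered:.2f}% covered, bound {upper_bound(covered):.2f}")
```

The bound stays below 10 with the eight most expensive functions and crosses it only when the ninth is included, matching the slide's claim.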
[Figure: bar chart of % Libmad execution time per function, in decreasing order: 37.51, 16.48, 10.02, 5.37, 5.01, 4.8, 4.68, 3.59, 3.3, 2.24, 2.18, 1.65, 0.66, 0.56, 0.49, 0.4, 0.23, 0.23, 0.2, 0.1]
Libmad: http://www.underbit.com/products/mad/
48
Example
• Tools, such as compilers and Design Space Exploration (DSE) environments, should give programmers an easy way to evaluate different solutions
49
CONCLUSIONS – PART A
50
Conclusions
• Reconfigurable computing architectures offer
flexible, low-cost, powerful, notable hardware
accelerators
• Unfortunately,
– Efficient compilation is hard
– Current topic of research, requiring a
multidisciplinary approach,…
51