
Compilation for Embedded Reconfigurable Computing Architectures: Part A

João M. P. Cardoso and Pedro C. Diniz

3rd Summer School on

Generative and Transformational Techniques in Software Engineering

6–11 July, 2009, Braga, Portugal

Outline

• First part of the Tutorial: Architectures
– Motivation
– Technology Trends
– Reconfigurable Computing
– Embedded and Reconfigurable Architectures
– Reconfiguration
– Improving Performance

• Second part of the Tutorial: Compiling
– Main Compiler and Execution Concepts
– Compiling to Fine-Grained Reconfigurable Architectures
– Optimizations

• Conclusions

Embedded Computing

• High performance with low energy and at low cost

• Short time-to-market

• Upgrades during the product’s lifetime (imply reprogrammability)

Motivation

• Speedups achieved by Reconfigurable Computing

• Speedup of Cray XD1 (FPGAs @ 200 MHz) over an Opteron CPU (@ 2.4 GHz)
– DNA sequencing
  • 695× using 1 FPGA
  • 2,794× using 6 FPGAs
– Data Encryption Standard (DES) cipher
  • 12,162×

• Power savings of 148× and 608×, respectively

Source: El-Ghazawi et al., IEEE Computer, 2008

The Reconfigurable Computing Paradigm

• Traditional computing
– start: algorithm (variable)
– architecture: fixed structure

• Reconfigurable Computing
– start: algorithm (variable)
– architecture: variable structure

See Nick Tredennick’s Paradigm Classification Scheme

TECHNOLOGY TRENDS


Why higher levels of abstraction?

• Hardware and software design gaps

Source: The International Technology Roadmap for Semiconductors, 2007

System-level design requirements

• near-term years


System-level design requirements

• long-term years


Reconfigurability is seen as a “must” for future embedded computing systems

RECONFIGURABLE COMPUTING


Reconfigurable (Custom) Computing

• Hardware resources can be “configured” for a specific architecture
– Specialized Functional Elements and Processing Elements
– Interconnect between “Nodes” custom to the data flow in the application
– Configurable on-chip memories (size, data-width, indexing, etc.)
– Execution Models (Pipelined, Multithreading, VLIW)

All possible in the same reconfigurable fabric

Reconfigurable (customizable) Fabrics

[Figure: reconfigurable fabric with customized functions F(a,b,c,d), customized memories, and customized interconnects]


• Data travel on paths statically or dynamically defined

• Many on-chip memories
– Parallel accesses

• Native support for data streaming applications

• Custom pipelining
– On-chip configurable memories can be adapted to communication needs


[Figure: Computing Engines, Memories, and FIFOs]


• Orders of magnitude speed-ups over traditional computing systems

• Why? Customization is the key:

– High operation- and task-level parallelism

• Increased by storage organization (data replication/distribution over multiple on-chip memories)

– Non-Standard Numeric Formats (fixed-point, etc.)

– Custom Routing


• Benefits:
– Reconfiguration is ideal for fast prototyping and early evaluation of realistic performance
– Performance
– Tolerate defects

• Costs:
– Added complexity of execution models makes programming very hard (we have not yet solved the parallel programming problem, sort of…)


Many companies: Cray, SGI, SRC, ARC, PACT, PicoChip, Tilera, etc.

Based on source: Bezdek, J.C., Fuzzy models - what are they, and why, IEEE Trans. on Fuzzy Systems, 1993.

Reconfigurable Computing has already achieved this point!


• The Sony PSP Example

– Reconfigurable Architecture: Virtual Mobile Engine (VME): audio

• 24-bit data width

• 166 MHz

• Single-cycle context switch

Source: http://www.hotchips.org/archives/hc16/3_Tue/8_HC16_Sess8_Pres1_bw.pdf

Are Architectures Merging?

Multi-(Many)-core vs. Reconfigurable

• The regularity of reconfigurable fabrics (e.g., FPGAs) allows them to ride Moore’s Law
– Unbelievably large number of devices
– Hard-macro cores can be plugged in

[Figure: Multicore, Manycore, Reconfigurable Fabrics]

RECONFIGURABLE COMPUTING

ARCHITECTURES


Basic Concepts

• Fine-Grained Reconfigurable Arrays:

– Main Cell: Logic block

– Main example: FPGAs (Field-Programmable Gate Arrays)

FPGA Example

• Virtex-5 (5VLX330):
– 207,360 6-LUTs/FFs
– 10,368 Kbits BRAM
– 3,420 Kbits Distributed RAM
– 192 DSP48E Slices (includes a 25 × 18 multiplier)

• multiplier (32×32 → 32):
– 3 DSP48E (∼9.7 ns) ⇒ 64 multipliers!, or
– 754 LUTs (∼11 ns) ⇒ 275 multipliers!

• adder (32+32 → 32):
– 32 LUTs (∼5 ns) ⇒ 6,480 adders!

• RISC processor (MicroBlaze) @ 200 MHz and about 1,400 6-LUTs in Virtex-5 (1,650 w/ FPU)
– Virtex-5 (5VLX330) has LUTs for 148 microprocessors!

Source: http://www.xilinx.com
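
The unit counts on this slide follow from integer division of the available resources by the cost of one unit. A minimal Python sketch of that arithmetic, assuming the figures quoted above; the names `max_instances` and `counts` are illustrative, and the bound ignores routing, placement, and timing, which in practice reduce how much of the device is usable:

```python
# Device resources for the Virtex-5 5VLX330, as quoted on the slide.
LUTS = 207_360
DSP48E_SLICES = 192

def max_instances(available: int, cost_per_unit: int) -> int:
    """Upper bound on how many units fit, ignoring routing and timing."""
    return available // cost_per_unit

counts = {
    "32x32 multipliers (3 DSP48E each)": max_instances(DSP48E_SLICES, 3),
    "32x32 multipliers (754 LUTs each)": max_instances(LUTS, 754),
    "32-bit adders (32 LUTs each)": max_instances(LUTS, 32),
    "MicroBlaze cores (about 1,400 LUTs each)": max_instances(LUTS, 1_400),
}

for name, n in counts.items():
    print(f"{name}: {n}")
```

The results reproduce the 64, 275, 6,480, and 148 quoted on the slide.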

Basic Concepts

• Coarse-Grained Reconfigurable Arrays:

– Main cell: Functional Unit

[Figure: coarse-grained cell with a Functional Unit (FU), two input MUXes (A, B), a result register, a Register File (RF), a Configuration Controller, sources, and destinations]

Coarse-Grained Reconfigurable Array Example

• The PACT XPP-3C: high performance fixed point DSP

Source: http://www.pactxpp.com

VLIW vs. Coarse-Grained Reconfigurable Arrays

• Example: r=a×c+b×d;
– r6=r1*r2;
– r7=r3*r4;
– r5=r6+r7;

[Figure: dataflow graph of the example: two multiplications (a×c and b×d) feeding one addition that produces r]

• VLIW: PE 1, PE 2, ..., PE N (ALUs) share a Multi-Port Register File
– r6=r1*r2; and r7=r3*r4; issue in the same instruction word on PE1 and PE2
– r5=r6+r7; issues in the next instruction word; the other slot holds a nop;

• CGRA: Functional Units (FU) connected to a Register File
– the FUs are configured as ×, ×, and +, taking a, b, c, d as inputs and producing r

• CGRA: array of PE 1 to PE 9
– r6=r1*r2;, r7=r3*r4;, and r5=r6+r7; are each mapped to their own PE (PE1, PE2, and PE4)
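
One way to see the difference is to model both executions of r = a×c + b×d in software. The sketch below is a simplified Python model, not a description of any real VLIW or CGRA: the VLIW version steps through instruction words that read and write a shared register file, while the CGRA version wires three configured functional units directly together. The register numbering follows the slide; the assignment of a, c, b, d to r1..r4 and all function names are assumptions.

```python
import operator

def run_vliw(a, c, b, d):
    """Time-multiplexed execution: one instruction word per cycle,
    all operands travel through a shared register file."""
    regs = {"r1": a, "r2": c, "r3": b, "r4": d}
    # Each word holds one slot per PE; None models a nop.
    program = [
        [("r6", operator.mul, "r1", "r2"), ("r7", operator.mul, "r3", "r4")],
        [("r5", operator.add, "r6", "r7"), None],
    ]
    for word in program:
        # Read all operands before writing, since the slots issue in parallel.
        results = [(s[0], s[1](regs[s[2]], regs[s[3]])) for s in word if s]
        regs.update(results)
    return regs["r5"], len(program)

def run_cgra(a, c, b, d):
    """Spatial execution: each operation owns a functional unit and
    results flow directly from producer to consumer."""
    fu_mul_1 = operator.mul   # configured once as a multiplier
    fu_mul_2 = operator.mul   # configured once as a multiplier
    fu_add = operator.add     # configured once as an adder
    return fu_add(fu_mul_1(a, c), fu_mul_2(b, d))

result, cycles = run_vliw(2, 3, 4, 5)
print(result, cycles)          # 26 after 2 instruction words
print(run_cgra(2, 3, 4, 5))    # 26
```

Both models compute the same value. The difference is where the intermediate results r6 and r7 live: in the register file for the VLIW model, on the wires between functional units for the CGRA model.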

Past and Notable Architectures

• Xilinx XC6200 Field Programmable Gate Arrays, 1995-2001

• Chameleon (1997) RCP architecture, 2000-2003

• Triscend (1997), 2001-2004

Granularity in RC Architectures

• Fine-Grained Fabrics are able to host a wide range of architectures

• Fine-Grained Fabrics are not tied to a computational model
– E.g., load/store (a) vs. data streaming (b)

RECONFIGURATION


Reconfiguration

• Possibility to modify the computing structures in the field (i.e., after fabrication)

• Static
– During setup or before the beginning of the computations

• Dynamic (at run time)
– During the computations

Reconfiguration

• The PACT XPP example (a Coarse-Grained Reconfigurable Architecture)

[Figure: Configuration Manager (CM) with Configuration Cache (CC), fetch and configure steps, an array of PEs with blocks labeled M, and the ports CMPort0 and CMPort1]

Reconfiguration

• Coarse-Grained Reconfigurable Architectures
– The PACT XPP reconfiguration flow: Fetch (f), Configure (c), Compute (comp)
– Example configuration sequence: c0; if (CMPort0) then c1; if (CMPort1) then c2;

[Figure: Configuration Manager (CM) and Configuration Cache (CC) fetching and configuring c0, c1, and c2, with CMport0 and CMport1]
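
The configuration sequence on this slide can be read as a small program executed by the Configuration Manager. The following Python sketch is a toy model under several assumptions: the function names, the dictionary standing in for the Configuration Cache, and the reading of CMPort0 and CMPort1 as boolean events are all illustrative, and the actual XPP protocol is not modeled.

```python
def configuration_manager(cache, cm_port0, cm_port1, configure):
    """Toy model of: c0; if (CMPort0) then c1; if (CMPort1) then c2;"""
    loaded = []

    def fetch_and_configure(name):
        config = cache[name]      # fetch from the Configuration Cache (CC)
        configure(config)         # configure the array
        loaded.append(name)

    fetch_and_configure("c0")
    if cm_port0():                # event on CMPort0 (assumed boolean)
        fetch_and_configure("c1")
    if cm_port1():                # event on CMPort1 (assumed boolean)
        fetch_and_configure("c2")
    return loaded

cache = {"c0": "bitstream-0", "c1": "bitstream-1", "c2": "bitstream-2"}
applied = []
order = configuration_manager(cache, lambda: True, lambda: False, applied.append)
print(order)  # ['c0', 'c1']
```

In this run only CMPort0 fires, so c0 and c1 are configured and c2 is skipped.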

Reconfiguration

• The PACT XPP reconfiguration flow

[Figure: execution from begin to end using two configurations (Conf. 1, Conf. 2)]

[Figure: execution from begin to end using five configurations (Conf. 1 to Conf. 5)]

Remarks about Reconfigurable Computing Architectures

• Reconfigurable fine-grained architectures
– have the potential to virtually implement any architecture

• Reconfigurable coarse-grained architectures
– are more computing oriented with granularity close to the data widths used in data processing

• Both permit customization and allow energy savings and high performance

IMPROVING PERFORMANCE


Main Target Architecture

• Microprocessor extended with Hardware Accelerators (e.g., coarse-grained reconfigurable arrays)

Improving Performance

• For a given input application and a target embedded computing system:

• If requirements are not satisfied on:
– execution time, power dissipation, energy consumption, memory bandwidth

• Need to:
– perform code optimizations
– migrate sections of the code to a hardware accelerator
– redesign the entire system (e.g., including different processors)

Improve Execution Time

• Delegating to compiler optimizations such as -O3 might not be enough!

• Find the most critical sections of the code and then try to optimize those sections
– Use profiling tools (e.g., gprof) to identify them
– 90-10 rule of thumb: “90% of global execution time is spent in 10% of the code”

Amdahl's Law

• The global speedup we can achieve is limited
– by the fraction (f) of the execution time of the application that is accelerated, and
– by the speedup (S) we can achieve for that fraction

$$\mathrm{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}$$
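
As a quick numerical check of the formula, the following Python sketch evaluates the global speedup for a fixed fraction f and increasing S. The function name `amdahl_speedup` and the sample values are illustrative choices, not taken from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Global speedup when a fraction f of the execution time is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

for s in (2, 10, 100, 1_000_000):
    print(f"f=0.9, S={s}: speedup = {amdahl_speedup(0.9, s):.2f}")
```

Even with a very large S, the result for f = 0.9 stays below 10, because the untouched 10% of the execution time remains.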

Lessons from Amdahl's Law

• Two corollaries of this law:
– Improve fractions of the application that reflect a significant part of the global exec. time
  • Small f: optimizations will have a minor impact
– But, the segments that we ignore also limit the speedup
  • As S increases, the speedup will tend to

$$\mathrm{Speedup} = \frac{1}{1 - f}$$

[Figure: Speedup (y-axis, 0 to 100) as a function of f (x-axis, 0 to 0.96)]

Example

• Libmad for MP3 decoding

• Analysing theoretical limits for possible accelerations:
– considering dct32, a speedup upper bound of 1.11 is expected!
– to achieve an upper bound of 10 we need to take into account the first 9 functions!
– Note that the analysis considers zero execution time for each accelerated function and zero communication costs (it is always useful as a first analysis…)

[Figure: bar chart of % Libmad execution time per function (y-axis 0 to 40), in decreasing order: 37.51, 16.48, 10.02, 5.37, 5.01, 4.8, 4.68, 3.59, 3.3, 2.24, 2.18, 1.65, 0.66, 0.56, 0.49, 0.4, 0.23, 0.23, 0.2, 0.1]

Source: Libmad: http://www.underbit.com/products/mad/.
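
The two bounds quoted for Libmad can be reproduced from the profile values in the chart. The Python sketch below applies the limit 1/(1 − f) to cumulative fractions of the execution time. It assumes that dct32 is the function with 10.02% of the execution time (inferred from the quoted bound of 1.11, not stated on the slide); the names `PROFILE`, `upper_bound`, and `functions_needed` are illustrative.

```python
# Percentage of Libmad execution time per function, in decreasing order,
# as read from the chart on the slide.
PROFILE = [37.51, 16.48, 10.02, 5.37, 5.01, 4.8, 4.68, 3.59, 3.3, 2.24,
           2.18, 1.65, 0.66, 0.56, 0.49, 0.4, 0.23, 0.23, 0.2, 0.1]

def upper_bound(percent_accelerated: float) -> float:
    """Amdahl limit 1 / (1 - f): the accelerated code takes zero time."""
    return 1.0 / (1.0 - percent_accelerated / 100.0)

def functions_needed(profile, target):
    """Smallest number of top functions whose acceleration reaches the target bound."""
    covered = 0.0
    for count, percent in enumerate(profile, start=1):
        covered += percent
        if upper_bound(covered) >= target:
            return count
    return None  # target not reachable with the profiled functions

# Assumption: dct32 is the function with 10.02% of the execution time.
print(round(upper_bound(10.02), 2))      # 1.11
print(functions_needed(PROFILE, 10.0))   # 9
```

The first nine functions cover about 90.76% of the execution time, which is the first point where the bound exceeds 10.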

Example

• Tools, such as compilers and Design Space Exploration (DSE) environments, should give programmers an easy way to evaluate different solutions

CONCLUSIONS – PART A


Conclusions

• Reconfigurable computing architectures offer flexible, low-cost, powerful, notable hardware accelerators

• Unfortunately,
– Efficient compilation is hard
– Current topic of research, requiring a multidisciplinary approach,…

THANK YOU!

João M. P. Cardoso

[email protected]

http://www.fe.up.pt/~jmpc
