REGIMap: Register-Aware Application Mapping on Coarse-Grained Reconfigurable Architectures Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013 This work was supported in part by CSR-EHS 0509540, CCF-0916652, CCF 1055094, NSF IUCRC for Embedded Systems (IIP-0856090), Center for Embedded Systems grant DWS-0086; Science Foundation Arizona grant SRG 0211-07, Raytheon and by the Stardust Foundation.



Page 2

Accelerators for Energy Efficiency

[Figure: performance (Giga Ops per Sec, ticks 1 to 100) versus power (W, ticks 50 to 250) for the ADRES [1] CGRA, Intel Core i7, and NVIDIA Tesla™ c2050; annotated efficiencies: 60 GOpS/W, 1.4 GOpS/W, 4.3 GOpS/W]

• Demand for performance
• Power consumption
• Technology scaling

[Diagram: a core and an accelerator with private caches and a shared cache]

[1] BOUWENS, F., BEREKOVIC, M., SUTTER, B. D., AND GAYDADJIEV, G. Architecture enhancements for the ADRES coarse-grained reconfigurable array. In Proc. HiPEAC (2008), pp. 66–81.

Page 3

Coarse-grained Reconfigurable Architectures

• 2D array of Processing Elements (PEs)
• ALU + Local register file → PE
• Mesh interconnection
• Shared data bus
  – Data memory
• PE inputs:
  – 4 Neighboring PEs
  – Local register file
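The array structure on this slide can be sketched in a few lines of Python. This is only an illustrative model: the names `PE`, `build_cgra`, `mesh_neighbors`, and `operand_sources` are hypothetical, and the sketch assumes a mesh without wrap-around, so PEs on the border have fewer than four neighbors.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    """One processing element: an ALU plus a small local register file."""
    row: int
    col: int
    registers: list = field(default_factory=list)


def build_cgra(rows, cols, num_registers):
    """Create a rows x cols array of PEs keyed by (row, col)."""
    return {(r, c): PE(r, c, [None] * num_registers)
            for r in range(rows) for c in range(cols)}


def mesh_neighbors(row, col, rows, cols):
    """Coordinates of the up-to-four mesh neighbors (north, south, west, east).

    Assumption: no wrap-around links, so border PEs have fewer neighbors.
    """
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def operand_sources(cgra, row, col, rows, cols):
    """A PE reads operands from neighboring PEs or from its own register file."""
    neighbors = [cgra[pos] for pos in mesh_neighbors(row, col, rows, cols)]
    return {"neighbors": neighbors, "registers": cgra[(row, col)].registers}


cgra = build_cgra(4, 4, num_registers=2)
sources = operand_sources(cgra, 1, 1, 4, 4)
print(len(sources["neighbors"]), len(sources["registers"]))
```

The shared data bus and data memory are left out of this model; only the operand paths named on the slide are represented.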

Page 4

Map Loops on CGRA and Minimize Initiation Interval

[Figure: a loop with operations a, b, c, d mapped over time steps 1 to 4 onto PEs P and Q, each drawn with register slots 1 and 2]

• II is the performance metric
• Register utilization decreases II
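To make the role of the initiation interval (II) concrete, here is a minimal sketch of a modulo-schedule check. It assumes the usual software-pipelining view in which an operation scheduled at time t on a PE occupies that PE in slot t mod II of every iteration. The function names and the example placement are hypothetical and are not taken from the slide's figure.

```python
from math import ceil


def modulo_slots(schedule, ii):
    """Group operations by the (PE, time mod II) slot they occupy.

    schedule maps an operation name to a (pe, time) pair.
    """
    slots = {}
    for op, (pe, time) in schedule.items():
        slots.setdefault((pe, time % ii), []).append(op)
    return slots


def is_conflict_free(schedule, ii):
    """A mapping is usable at this II only if no slot holds two operations."""
    return all(len(ops) == 1 for ops in modulo_slots(schedule, ii).values())


def resource_min_ii(num_ops, num_pes):
    """Lower bound on II from resources alone: each PE runs one op per cycle."""
    return ceil(num_ops / num_pes)


# Hypothetical placement of four operations on two PEs.
schedule = {"a": ("P", 0), "b": ("Q", 1), "c": ("P", 1), "d": ("Q", 2)}

print(resource_min_ii(len(schedule), num_pes=2))
print(is_conflict_free(schedule, ii=2))
print(is_conflict_free(schedule, ii=1))
```

With II = 1, operations a and c both land in slot (P, 0), so this placement only works from II = 2 upward, which matches the resource bound for four operations on two PEs.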

Page 5

Register Files for Inter-Iteration Dependencies

[Figure: a loop with operations a, b, c, e, f mapped over time steps 1 to 8 onto PEs P and Q, each drawn with register slots 1 and 2; values b and f are shown held in registers across several time steps]

Register Utilization is essential for Inter-iteration Data Dependencies
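A small calculation shows why inter-iteration dependencies consume registers. The sketch below is an assumption-laden illustration, not the deck's model: it assumes single-cycle operation latency, that a value consumed in the cycle right after it is produced can be forwarded without a register, and that a new instance of the value is produced every II cycles. The function names are hypothetical.

```python
from math import ceil


def hold_cycles(producer_time, consumer_time, distance, ii):
    """Cycles a value waits between becoming available and being read.

    The consumer belongs to the iteration `distance` iterations later,
    so its absolute time is consumer_time + distance * ii.
    """
    available_at = producer_time + 1  # assumed single-cycle latency
    read_at = consumer_time + distance * ii
    return max(read_at - available_at, 0)


def registers_needed(producer_time, consumer_time, distance, ii):
    """A new instance arrives every II cycles, so overlapping instances
    of a value held for h cycles need ceil(h / II) registers."""
    return ceil(hold_cycles(producer_time, consumer_time, distance, ii) / ii)


# Value produced at time 3, consumed at time 1 of the next iteration, II = 4.
print(registers_needed(producer_time=3, consumer_time=1, distance=1, ii=4))
# Same dependency with a distance of two iterations.
print(registers_needed(producer_time=3, consumer_time=1, distance=2, ii=4))
```

Under these assumptions a longer dependency distance raises the hold time, and once the hold time exceeds II more than one register is occupied by the same value.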

Page 6

Insight into the Problem

• Size of resource graph ≈ O(n)
• Partition the resources into n+1 partitions
• Huge number of possible partitions (exponential)
• Assign operations to sets such that
  – All operations are mapped
  – Data dependencies between operations are obeyed
• Intractable
• Existing techniques are
• Exploratory
  – Huge search space
  – If they fail, start from scratch
• Ad hoc register allocation

Page 7

Contributions

• General problem formulation
• Reduce search space
  – Partition the problem into scheduling and integrated placement and register allocation
  – No register in resource graph
• Constructive search
• Integrated placement and register allocation
• REGIMap
  – Schedule DFG
  – Construct resource graph
  – Construct a compatibility graph between DFG and resource graph
  – Model register requirement of operation in the weight of arcs in compatibility graph
  – Find a restricted maximal clique
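The compatibility-graph idea can be illustrated with a deliberately simplified sketch. It is not the REGIMap heuristic: register requirements are not modeled as arc weights, the clique is found by plain backtracking rather than a restricted maximal clique search, and a dependent pair is simply required to sit on the same or an adjacent PE. All names (`build_compatibility_graph`, `find_mapping`, the example DFG, schedule, and PEs) are hypothetical.

```python
from itertools import product


def build_compatibility_graph(ops, deps, sched, pes, adjacent, ii):
    """Nodes are (op, pe) placements; an edge means two placements can coexist."""
    nodes = list(product(ops, pes))

    def compatible(n1, n2):
        (o1, p1), (o2, p2) = n1, n2
        if o1 == o2:
            return False  # an operation is placed exactly once
        if p1 == p2 and sched[o1] % ii == sched[o2] % ii:
            return False  # same PE in the same modulo time slot
        if (o1, o2) in deps or (o2, o1) in deps:
            # Simplification: producer and consumer must share a PE or be
            # neighbors; register pressure is ignored in this sketch.
            return p1 == p2 or p2 in adjacent[p1]
        return True

    edges = {n: {m for m in nodes if m != n and compatible(n, m)} for n in nodes}
    return nodes, edges


def find_mapping(ops, nodes, edges):
    """Backtracking search for a clique with one node per operation."""
    def extend(chosen, remaining):
        if not remaining:
            return chosen
        for node in nodes:
            if node[0] == remaining[0] and all(node in edges[c] for c in chosen):
                result = extend(chosen + [node], remaining[1:])
                if result is not None:
                    return result
        return None

    return extend([], list(ops))


ops = ["a", "b", "c", "d"]
deps = {("a", "b"), ("b", "c"), ("c", "d")}  # hypothetical DFG edges
sched = {"a": 0, "b": 1, "c": 2, "d": 3}     # hypothetical schedule times
pes = ["P", "Q"]
adjacent = {"P": {"Q"}, "Q": {"P"}}

nodes, edges = build_compatibility_graph(ops, deps, sched, pes, adjacent, ii=2)
print(find_mapping(ops, nodes, edges))
```

A clique that covers every operation corresponds to a complete placement; if none exists at the current II, a mapper built on this idea would have to reschedule or raise II.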

Page 8

[Figure: a DFG with operations a, b, c, d, a resource graph with PEs P and Q (each drawn with register slots 1 and 2), and compatibility-graph nodes pairing a resource with each of a, b, c, d]

Page 9

Experimental Setup

• Loops from SPEC2006 and multimedia benchmarks
• 4 × 4 CGRA with enough instruction and data memory
• Shared data bus for each row
• Latency is 1 cycle
• Compared with register-aware DRESC [2]

[2] DE SUTTER, B., COENE, P., VANDER AA, T., AND MEI, B. Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proc. LCTES (2008), pp. 151–160.

Page 10

Mapping Results

[Figure: two bar charts of Performance Ratio (MII/II), axis 0 to 1, comparing REGI and DRESC, one for Size of Register File = 2 and one for Size of Register File = 4. Res Bounded loops: Swim_Calc, YUV2RGB, Sobel, Lowpass, SOR, Laplace, GSR, Wavelet, Forward, Compress, Mpeg2, Average Res. Rec Bounded loops: h264ref, gobmk, hmmer, dealII, bzip2, astar, omnetpp, perl, povray, sphinx, gcc, soplex, libquantum, Average Rec.]

REGIMap improves performance on average by 1.8X over DRESC*
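The performance ratio on the chart axes, MII/II, can be computed as sketched below. The sketch uses the standard modulo-scheduling lower bounds (a resource bound and a recurrence bound); reading "Res Bounded" and "Rec Bounded" as "MII set by the resource bound" and "MII set by the recurrence bound" is an assumption, and the function names and example numbers are hypothetical.

```python
from math import ceil


def res_mii(num_ops, num_pes):
    """Resource bound: each PE executes at most one operation per cycle."""
    return ceil(num_ops / num_pes)


def rec_mii(cycles):
    """Recurrence bound over dependence cycles.

    cycles is a list of (total_latency, total_iteration_distance) pairs,
    one per recurrence cycle in the DFG; with no cycles the bound is 1.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def performance_ratio(num_ops, num_pes, cycles, achieved_ii):
    """MII / II: 1.0 means the mapper reached the theoretical minimum II."""
    mii = max(res_mii(num_ops, num_pes), rec_mii(cycles))
    return mii / achieved_ii


# Hypothetical loop: 40 operations on a 4 x 4 array, one recurrence of
# latency 4 spanning a single iteration, mapped with an achieved II of 5.
print(performance_ratio(num_ops=40, num_pes=16, cycles=[(4, 1)], achieved_ii=5))
```

Here the resource bound is 3 and the recurrence bound is 4, so MII is 4 and the ratio is 0.8.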

Page 11

Reasonable Running Time

[Figure: two charts of Compilation Time (S) comparing REGI and DRESC over the Res Bounded and Rec Bounded loops, one for Size of Register File = 2 (axis ticks 0.0001 to 1000000) and one for Size of Register File = 4 (axis ticks 0.001 to 10000000)]

REGIMap maps loops on average 56X faster than DRESC*

Page 12

Summary

• Accelerators for energy efficiency
• Coarse-grained reconfigurable architecture, a programmable accelerator
• Contributions
  – Problem formulation
  – Search space reduction
  – Constructive search
  – Integrated register allocation
  – REGIMap
• Better mappings: 1.8X performance improvement
• On average 56 times better compilation time
• Please join my poster presentation for more details