university of michigan electrical engineering and computer science flextream: adaptive compilation...

University of Michigan, Electrical Engineering and Computer Science

Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

Amir Hormati1, Yoonseo Choi1, Manjunath Kudlur3, Rodric Rabbah2, Trevor Mudge1, and Scott Mahlke1

1 University of Michigan

2 IBM T.J. Watson Research Lab.

3 NVIDIA Corp.


Cores are the New Gates

[Figure: number of cores per chip (1 to 512) plotted against year (1975 to 2010) for unicore, homogeneous multicore, and heterogeneous multicore processors, ranging from the 4004, 8008, 8080, and 8086 through Pentium, Power4, Opteron, Core2Quad, Niagara, and Cell to Larrabee, AMD Fusion, and NVIDIA G80. Source: Shekhar Borkar, Intel. Courtesy: Gordon’06.]

[Figure: programming models and languages: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and stream programming.]


Streaming Computing Is Everywhere!

• Prevalent computing domain with applications in embedded systems, desktops and high-end servers


StreamIt

• Main Constructs:

– Filter: encapsulates computation. A filter can be stateful or stateless.

– Pipeline: expresses pipeline parallelism.

– Splitjoin: expresses task/data-level parallelism.

• Exposes different types of parallelism

[Figure: example stream graphs illustrating a filter, a pipeline, and a splitjoin.]
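The three constructs can be illustrated with a small Python analogy. This is not StreamIt code: the helper names `pipeline`, `splitjoin`, `add`, and `scale` are hypothetical, filters are modeled as stateless functions over lists, and the splitter and joiner are assumed to be round-robin.

```python
from typing import Callable, List

Stream = List[int]
Filter = Callable[[Stream], Stream]


def pipeline(*stages: Filter) -> Filter:
    """Chain filters so each one consumes the previous one's output."""
    def run(items: Stream) -> Stream:
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches: Filter) -> Filter:
    """Round-robin split across branches, then round-robin join."""
    n = len(branches)

    def run(items: Stream) -> Stream:
        # Branch i receives items i, i+n, i+2n, ...
        lanes = [branch(items[i::n]) for i, branch in enumerate(branches)]
        out: Stream = []
        for group in zip(*lanes):  # assumes every lane yields the same count
            out.extend(group)
        return out
    return run


def add(k: int) -> Filter:
    """Stateless filter that adds k to every item."""
    return lambda items: [x + k for x in items]


def scale(k: int) -> Filter:
    """Stateless filter that multiplies every item by k."""
    return lambda items: [x * k for x in items]


# A pipeline whose middle stage is a two-way splitjoin of identical filters.
graph = pipeline(add(1), splitjoin(scale(2), scale(2)), add(-1))
result = graph([1, 2, 3, 4])
print(result)  # [3, 5, 7, 9]
```

Because the two branches of the splitjoin are independent, they could run on different cores, which is the data-level parallelism the construct is meant to expose.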


StreamIt Graph Tuning

• Parallelism can be tuned in streaming programs:

– Horizontal Replication

– Horizontal Fusion

– Vertical Fusion


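StreamIt Example

[Figure: example stream graph with actors A (work 10), B (86), C (246), D (326), E (566), and F (10). Horizontal replication splits B into two copies, B1 and B2 (work 43 each), or four copies, B1 to B4 (work 21.5 each), placed between a splitter and a joiner (work 6 each). Vertical fusion merges C, D, and E into a single actor CDE (work 1138).]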
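The work estimates in the example follow from simple arithmetic, sketched below. The function names `fuse` and `replicate` are hypothetical, and the sketch ignores the splitter and joiner work (6 each) that replication adds.

```python
def fuse(*work: float) -> float:
    """Vertical fusion: the fused actor does the work of all its parts."""
    return sum(work)


def replicate(work: float, copies: int) -> float:
    """Horizontal replication: each copy handles an equal share of the work."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return work / copies


# Work estimates taken from the example graph.
work = {"A": 10, "B": 86, "C": 246, "D": 326, "E": 566, "F": 10}

cde = fuse(work["C"], work["D"], work["E"])
b_two = replicate(work["B"], 2)
b_four = replicate(work["B"], 4)
print(cde, b_two, b_four)  # 1138 43.0 21.5
```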


What Are We Solving?

[Figure: a stream graph (A, B1 to B4 between a splitter and a joiner, C, D1 and D2 between a splitter and a joiner, E, F) to be mapped onto four cores, each with its own memory.]

• Performing graph modulo scheduling on a stream graph statically.

• What happens in case of dynamic resource changes?


Target Architecture

[Figure: a master processor and several slave processors, each slave with its own local store and DMA engine, connected to memory through an interconnect.]

• Master processor acts as a controller.

• Each slave processor has its own local store and DMA engine.

• An interconnect network connects all the components together.


Overview of Flextream

Goal: to perform adaptive stream graph modulo scheduling.

Input: streaming application. Output: MSL commands.

• Static phase: finds an optimal modulo schedule for a virtualized member of a family of processors.

– Prepass Replication: adjusts the amount of parallelism for the target system by replicating actors.

– Work Partitioning: finds an optimal schedule for a virtualized member of a family of processors.

• Dynamic phase: performs lightweight adaptation of the schedule for the current configuration of the target hardware.

– Partition Refinement: tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance).

– Stage Assignment: specifies how actors execute in time in the new actor-processor mapping.

– Buffer Allocation: tries to efficiently allocate the storage requirements of the new schedule into the available memory units.


MSL: Multi-Core Streaming Layer

• Instruction set for heterogeneous multi-core systems

• A set of high-level commands for:

– Actor Commands (loading/unloading)

– Buffer Commands (allocating local/global buffers)

– Data Transfer Commands (managing DMAs)

• Flextream’s online layer uses these commands to adapt the static schedule


Overall Execution Flow

• Every application may see multiple iterations of:

[Figure: execution timeline marking a resource change request followed by the resource change being granted.]


Prepass Replication [static 1]

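[Figure: prepass replication on the example graph, targeting eight processors P0 to P7.

Before replication: A (work 10), B (86), C (246), D (326), E (566), F (10), giving P0 : 10, P1 : 86, P2 : 246, P3 : 326, P4 : 566, P5 : 10, P6 : 0, P7 : 0.

E is split into E0 and E1 (P4 : 283, P6 : 283), then D into D0 and D1 (P3 : 163, P7 : 163).

Final graph: C0 to C3 (61.5 each, between splitter S0 and joiner J0), D0 and D1 (163 each, between S1 and J1), E0 to E3 (141.5 each, between S2 and J2); each splitter and joiner has work 6. Final loads: P0 : 151.5, P1 : 147.5, P2 : 184.5, P3 : 163, P4 : 141.5, P5 : 151.5, P6 : 141.5, P7 : 163.]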
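The effect of replication on load balance can be checked with a short sketch. It assumes a greedy largest-first packing (heaviest actor onto the least loaded processor), which is a stand-in chosen for illustration and not necessarily the packing Flextream uses; `pack_largest_first` is a hypothetical name, the replication factors are read off the example, and splitter and joiner work is ignored.

```python
import heapq


def pack_largest_first(work, num_procs):
    """Place actors, heaviest first, on the currently least loaded processor."""
    heap = [(0.0, p) for p in range(num_procs)]  # (load, processor id)
    heapq.heapify(heap)
    mapping = {}
    for actor, w in sorted(work.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        mapping[actor] = p
        heapq.heappush(heap, (load + w, p))
    loads = [0.0] * num_procs
    for load, p in heap:
        loads[p] = load
    return mapping, loads


# Work estimates and replication factors from the example.
original = {"A": 10, "B": 86, "C": 246, "D": 326, "E": 566, "F": 10}
factors = {"C": 4, "D": 2, "E": 4}

replicated = {}
for actor, w in original.items():
    k = factors.get(actor, 1)
    if k == 1:
        replicated[actor] = w
    else:
        for i in range(k):
            replicated[f"{actor}{i}"] = w / k  # equal share per replica

_, before = pack_largest_first(original, 8)
_, after = pack_largest_first(replicated, 8)
print(max(before), max(after))  # 566.0 184.5
print(sorted(after))
```

Under these assumptions the heaviest processor drops from 566 to 184.5, and the sorted loads (141.5, 141.5, 147.5, 151.5, 151.5, 163, 163, 184.5) agree with the final per-processor figures in the example, up to processor numbering.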


Work Partitioning [static 2]

• Finds the optimal actor-to-processor mapping considering:

– Actors’ work estimates

– Communication cost

– DMA cost

– Memory requirements

• At the end, each actor is assigned to exactly one processor.


Partition Refinement [dynamic 1]

• Available resources at runtime can be more limited than the resources in the static target architecture.

• Partition refinement tunes the actor-to-processor mapping for the active configuration.

• A greedy iterative algorithm is used to achieve this goal.


Partition Refinement Example

• Pick the processors with the most actors.

• Sort the actors.

• Find the processor with the max work.

• Assign min actors until the threshold is reached.

[Figure: step-by-step partition refinement on the example. Initial loads across eight processors: P0 : 184.5, P1 : 141.5, P2 : 171.5, P3 : 141.5, P4 : 151.5, P5 : 173, P6 : 140, P7 : 159.5. As actors are reassigned, the loads shown include P5 : 183, P5 : 193, P5 : 270.5, P4 : 274.5, P1 : 283, and P3 : 289.]
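A simplified greedy rebalancing gives a feel for what refinement does when processors are taken away. This is a sketch of one plausible heuristic, not the exact procedure from the slides: actors on removed processors are sorted by work and moved, heaviest first, to the least loaded remaining processor. The function name `refine` and all numbers in the usage example are hypothetical.

```python
def refine(mapping, work, processors, removed):
    """Reassign actors from removed processors to the remaining ones.

    mapping:    actor -> processor id (the static assignment)
    work:       actor -> work estimate
    processors: all processor ids of the static target
    removed:    processor ids that are no longer available
    """
    remaining = [p for p in processors if p not in removed]
    if not remaining:
        raise ValueError("at least one processor must remain")

    loads = {p: 0.0 for p in remaining}
    new_mapping = {}
    orphans = []
    for actor, p in mapping.items():
        if p in removed:
            orphans.append(actor)
        else:
            new_mapping[actor] = p
            loads[p] += work[actor]

    # Heaviest orphan first, each onto the currently least loaded processor.
    for actor in sorted(orphans, key=lambda a: -work[a]):
        target = min(remaining, key=lambda p: loads[p])
        new_mapping[actor] = target
        loads[target] += work[actor]
    return new_mapping, loads


# Hypothetical four-processor mapping that loses processors 2 and 3.
work = {"A": 10, "B": 86, "C": 40, "D": 30}
static_mapping = {"A": 0, "B": 1, "C": 2, "D": 3}
new_mapping, loads = refine(static_mapping, work, [0, 1, 2, 3], {2, 3})
print(new_mapping)  # {'A': 0, 'B': 1, 'C': 0, 'D': 0}
print(loads)        # {0: 80.0, 1: 86.0}
```

The point of a greedy pass like this is cost: it touches only the displaced actors, so it is far cheaper than rerunning the static partitioner at runtime.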


Stage Assignment [dynamic 2]

• Processor assignment only specifies how actors are overlapped across processors.

• Stage assignment finds how actors are overlapped in time.

• Relative start time of the actors is based on stage numbers.

• DMA operations will have a separate stage.
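One way to derive stage numbers from an actor-to-processor mapping is sketched below. The rules are assumptions made for illustration: a consumer on the same processor as its producer may share the producer's stage, while a consumer on a different processor starts two stages later, leaving the stage in between for the DMA transfer. The name `assign_stages` and the example graph are hypothetical.

```python
from graphlib import TopologicalSorter


def assign_stages(preds, proc):
    """Compute actor stages and DMA stages.

    preds: actor -> list of producer actors feeding it
    proc:  actor -> processor id
    """
    stage = {}
    dma_stage = {}
    # static_order() yields every actor after all of its producers.
    for actor in TopologicalSorter(preds).static_order():
        s = 0
        for producer in preds.get(actor, ()):
            if proc[producer] == proc[actor]:
                s = max(s, stage[producer])
            else:
                # The transfer gets its own stage between producer and consumer.
                dma_stage[(producer, actor)] = stage[producer] + 1
                s = max(s, stage[producer] + 2)
        stage[actor] = s
    return stage, dma_stage


# Hypothetical chain A -> B -> C -> D spread over two processors.
preds = {"B": ["A"], "C": ["B"], "D": ["C"]}
proc = {"A": 0, "B": 0, "C": 1, "D": 0}
stage, dma_stage = assign_stages(preds, proc)
print(stage)      # {'A': 0, 'B': 0, 'C': 2, 'D': 4}
print(dma_stage)  # {('B', 'C'): 1, ('C', 'D'): 3}
```

With these rules, actors that communicate across processors land on stages two apart, with the DMA transfer on the stage between them.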


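Stage Assignment Example

[Figure: the replicated stream graph (A, B, C0 to C3, D0 and D1, E0 to E3, F, with splitters S0 to S2 and joiners J0 to J2) and its refined actor-to-processor mapping, annotated with stage numbers 0, 2, 4, 6, 8, 10, 12, 14, 16, and 18.]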


Buffer Allocation [dynamic 3]

• Slave processors have limited local store.

• Local store is faster than main memory.

• Utilize local stores first and then spill to main memory

• In case of spilling, DMAs have to be adjusted
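The local-first, spill-later policy can be sketched as a greedy allocator for a single slave processor. Placing the smallest buffers first is an assumption made here for simplicity, and `allocate_buffers` and the buffer sizes are hypothetical; the 128 KB capacity matches the local store size used in the evaluation.

```python
def allocate_buffers(buffers, local_capacity):
    """Split buffers into those kept in the local store and those spilled.

    buffers:        list of (name, size_in_bytes) for one slave processor
    local_capacity: size of that processor's local store in bytes
    """
    free = local_capacity
    local, spilled = [], []
    for name, size in sorted(buffers, key=lambda b: b[1]):  # smallest first
        if size <= free:
            local.append(name)
            free -= size
        else:
            spilled.append(name)  # lives in main memory, reached through DMA
    return local, spilled


KB = 1024
buffers = [("b0", 64 * KB), ("b1", 48 * KB), ("b2", 32 * KB)]
local, spilled = allocate_buffers(buffers, 128 * KB)
print(local, spilled)  # ['b2', 'b1'] ['b0']
```

Every buffer that ends up in `spilled` would need its DMA transfers redirected to main memory, which is the adjustment the last bullet refers to.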


Methodology

• StreamIt Compiler

• Metis for graph partitioning

• 32-core heterogeneous distributed-memory multi-core system

• Each slave core has a DMA engine and 128K local store

• System simulator to simulate the interconnect traffic.


Performance Comparison (DES)

[Figure: relative speedup (0 to 35) versus number of cores for DES. Legend: Full Static, Graph Partitioning, Flextream.]


Performance Comparison

[Figure: slowdown (%, 0 to 25) of the Graph Partitioning Approach and the Flextream Approach for bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, and the average.]


Dynamic Approach Time Comparison


[Figure: time (ms, 0 to 12) taken by the Flextream Refinement Approach and the Graph Partitioner Approach; the benchmarks shown include fm, matrix mult., mpeg2, serpent, tde, and the average.]


Overhead Comparison


[Figure: fraction of time allocated (0.90 to 1.00) per benchmark; the benchmarks shown include fm, matrix mult., mpeg2, serpent, tde, and the average. Legend: Prepass Replication, Work Refinement Time, Stage Assignment Time, Buffer Allocation Time.]


Conclusion

• Static scheduling approaches are promising but not enough.

• Dynamic adaptation is necessary for future systems.

• Flextream provides a hybrid static/dynamic approach to improve efficiency.




Effect of Buffer Allocation on Performance

[Figure: relative performance (0 to 1) for six memory budgets, from Min Mem to Max Mem in steps of (Max Mem - Min Mem)/5; the benchmarks shown include fm, matrix, mpeg2, serpent, tde, and the average.]




Outline

• Streaming Background

• Flextream’s Approach

– Static phase

– Dynamic phase

• Evaluation

• Conclusion


Introduction

• Single-core performance has stopped scaling.

• Multi-core and many-core systems are everywhere.

• These systems have different configurations.

• Resource management is a challenging problem.

[Figures: Cell Processor, Intel Larrabee]
