University of Michigan Electrical Engineering and Computer Science Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures Amir Hormati 1 , Yoonseo Choi 1 , Manjunath Kudlur 3 , Rodric Rabbah 2 , Trevor Mudge 1 , and Scott Mahlke 1 1 University of Michigan 2 IBM T.J. Watson Research Lab. 3 NVIDIA Corp.

Upload: rebecca-brady

Post on 28-Mar-2015


TRANSCRIPT

Page 1: University of Michigan Electrical Engineering and Computer Science Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

University of Michigan, Electrical Engineering and Computer Science

Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

Amir Hormati¹, Yoonseo Choi¹, Manjunath Kudlur³, Rodric Rabbah², Trevor Mudge¹, and Scott Mahlke¹

¹ University of Michigan

² IBM T.J. Watson Research Lab.

³ NVIDIA Corp.

Page 2

Cores are the New Gates (Shekhar Borkar, Intel)

[Figure: number of cores per chip (axis ticks 1 to 512, in powers of two) versus year (1975 to 2010), with processors grouped as Unicore, Homogeneous Multicore, and Heterogeneous Multicore. Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80, Core, Core2Duo, Core2Quad. Courtesy: Gordon’06.]

[Figure: programming models named on the slide: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming.]

Page 3

Streaming Computing Is Everywhere!

• Prevalent computing domain with applications in embedded systems, desktops and high-end servers

Page 4

StreamIt

• Main constructs:
  – Filter: encapsulates computation.
    • Stateful
    • Stateless
  – Pipeline: expresses pipeline parallelism.
  – Splitjoin: expresses task/data-level parallelism.

• Exposes different types of parallelism

[Figure: example stream graph with a pipeline, a filter, and a splitjoin labeled.]
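The three constructs can be mimicked in a few lines of Python to make their roles concrete. This is only an illustrative model, not StreamIt syntax or the StreamIt compiler's behavior: the class names, the `run` method, and the round-robin splitter and joiner are assumptions made for the sketch, and every filter is stateless and consumes and produces one item at a time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    """Encapsulates a computation applied to each item (stateless here)."""
    fn: Callable[[int], int]

    def run(self, items: List[int]) -> List[int]:
        return [self.fn(x) for x in items]


@dataclass
class Pipeline:
    """Runs its children one after another (pipeline parallelism)."""
    stages: list

    def run(self, items: List[int]) -> List[int]:
        for stage in self.stages:
            items = stage.run(items)
        return items


@dataclass
class SplitJoin:
    """Round-robin splitter, independent branches, round-robin joiner."""
    branches: list

    def run(self, items: List[int]) -> List[int]:
        n = len(self.branches)
        # Branch i receives items i, i+n, i+2n, ... (round-robin split).
        outs = [b.run(items[i::n]) for i, b in enumerate(self.branches)]
        joined = []
        # Interleave branch outputs; assumes len(items) is a multiple of n.
        for group in zip(*outs):
            joined.extend(group)
        return joined


graph = Pipeline([
    Filter(lambda x: x + 1),
    SplitJoin([Filter(lambda x: x * 10), Filter(lambda x: -x)]),
])
result = graph.run([1, 2, 3, 4])
print(result)
```

The branches of the splitjoin never exchange data, which is what makes them candidates for running on different cores.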

Page 5

StreamIt Graph Tuning

• Parallelism can be tuned in streaming programs:
  – Horizontal Replication
  – Horizontal Fusion
  – Vertical Fusion

Page 6

StreamIt Example

[Figure: three versions of the same stream graph, annotated with work estimates.
Original: A (10), Splitter (6), B1 and B2 (43 each), Joiner (6), C (246), D (326), E (566), F (10).
After fusion: A (10), B (86), CDE (1138), F (10).
After replication: as the original, but with B split four ways into B1, B2, B3, B4 (21.5 each).]
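The work estimates in this example follow from simple arithmetic: fusing actors adds their work, and replicating a stateless actor divides its work evenly among the copies. The helper names `fuse` and `replicate` below are hypothetical, and the sketch ignores the splitter and joiner overhead (6 each in the figure).

```python
def fuse(*work):
    """Fusing actors into one: the fused actor does the sum of their work."""
    return sum(work)


def replicate(work, ways):
    """Replicating a stateless actor: each copy does an equal share."""
    return [work / ways] * ways


b_fused = fuse(43, 43)          # B1 + B2 fused into B
cde = fuse(246, 326, 566)       # vertical fusion of C, D, E
b_four = replicate(b_fused, 4)  # B replicated into B1..B4

print(b_fused, cde, b_four)
```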

Page 7

What Are We Solving?

[Figure: a replicated stream graph (A; B1 to B4 between a Splitter and a Joiner; C; D1 and D2 between a Splitter and a Joiner; E; F) next to four cores, each with its own memory. The mapping from graph to cores is marked with a "?".]

• Performing graph modulo scheduling on a stream graph statically.

• What happens in case of dynamic resource changes?

Page 8

Target Architecture

[Figure: a master processor and several slave processors, each slave with a local store and a DMA engine, connected to memory through an interconnect.]

• Master processor acts as a controller.

• Each slave processor has its own local store and DMA engine.

• An interconnect network connects all the components together.

Page 9

Overview of Flextream

[Figure: compilation flow from a Streaming Application to MSL Commands. Static phase: Prepass Replication, Work Partitioning. Dynamic phase: Partition Refinement, Stage Assignment, Buffer Allocation.]

Callouts on the figure:

• Adjust the amount of parallelism for the target system by replicating actors.

• Performs light-weight adaptation of the schedule for the current configuration of the target hardware.

• Tunes actor-processor mapping to the real configuration of the underlying hardware. (Load balance)

• Find an optimal schedule for a virtualized member of a family of processors.

• Specifies how actors execute in time in the new actor-processor mapping.

• Find optimal modulo schedule for a virtualized member of a family of processors.

• Tries to efficiently allocate the storage requirements of the new schedule into available memory units.

Goal: To perform Adaptive Stream Graph Modulo Scheduling.

Page 10

MSL: Multi-Core Streaming Layer

• Instruction set for heterogeneous multi-core systems

• A set of high-level commands for:
  – Actor Commands (Loading/Unloading)
  – Buffer Commands (Allocating local/global buffers)
  – Data Transfer Commands (Managing DMAs)

• Flextream’s online layer uses these commands to adapt the static schedule

Page 11

Overall Execution Flow

• Every application may see multiple iterations of:

[Figure: execution timeline marked with "Resource change request" and "Resource change granted" events.]

Page 12

Prepass Replication [static 1]

[Figure: prepass replication of the example graph for eight processors, P0 to P7.
Work estimates of the original graph: A 10, B 86, C 246, D 326, E 566, F 10.
Initial loads, one actor per processor: P0 10, P1 86, P2 246, P3 326, P4 566, P5 10, P6 0, P7 0.
After splitting E into E0 and E1: P4 283, P6 283. After splitting D into D0 and D1: P3 163, P7 163.
Final loads: P0 151.5, P1 147.5, P2 184.5, P3 163, P4 141.5, P5 151.5, P6 141.5, P7 163.
Replicated graph: C0 to C3 (61.5 each) between S0 and J0; D0 and D1 (163 each) between S1 and J1; E0 to E3 (141.5 each) between S2 and J2; each splitter and joiner is annotated 6.]
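The loads in this example can be reproduced with a simple greedy packing. The sketch below takes the replication factors shown on the slide (C four ways, D two ways, E four ways) as given and places actors heaviest-first on the least-loaded processor. That packing rule is an assumption chosen for illustration, not a confirmed description of Flextream's algorithm, and splitter and joiner work is ignored; under those assumptions the resulting set of loads matches the eight final loads listed on the slide. The function name `lpt_pack` is hypothetical.

```python
import heapq


def lpt_pack(work, num_procs):
    """Greedy packing: heaviest actor first, onto the least-loaded processor."""
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    loads = [0.0] * num_procs
    for name in sorted(work, key=work.get, reverse=True):
        load, p = heapq.heappop(heap)       # least-loaded processor so far
        loads[p] = load + work[name]
        heapq.heappush(heap, (loads[p], p))
    return loads


base = {"A": 10, "B": 86, "C": 246, "D": 326, "E": 566, "F": 10}
factor = {"C": 4, "D": 2, "E": 4}  # replication factors read off the slide

# Expand replicated actors into equal-sized copies (C0..C3, D0, D1, E0..E3).
work = {}
for name, w in base.items():
    k = factor.get(name, 1)
    if k == 1:
        work[name] = float(w)
    else:
        for i in range(k):
            work[f"{name}{i}"] = w / k

loads = lpt_pack(work, 8)
print(sorted(loads))
```

Without replication the heaviest actor (E, 566) alone bounds the schedule; after replication the largest processor load drops to 184.5.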

Page 13

Work Partitioning [static 2]

• Finds optimal actor to processor mapping considering:
  – Actors’ work estimates
  – Communication cost
  – DMA cost
  – Memory requirements

• At the end, each actor is assigned to exactly one processor.

Page 14

Partition Refinement [dynamic 1]

• Available resources at runtime can be more limited than resources in static target architecture.

• Partition refinement tunes actor to processor mapping for the active configuration.

• A greedy iterative algorithm is used to achieve this goal.

Page 15

Partition Refinement Example

• Pick processors with most number of actors.

• Sort the actors

• Find processor with max work

• Assign min actors until threshold

[Figure: refinement example over the actors A, B, C0 to C3, D0, D1, E0 to E3, F, splitters S0 to S2, and joiners J0 to J2. Initial loads on eight processors: P0 184.5, P1 141.5, P2 171.5, P3 141.5, P4 151.5, P5 173, P6 140, P7 159.5. Loads shown as actors are reassigned: P5 183, P5 193, P5 270.5, P4 274.5, P1 283, P3 289.]
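The general idea of refining a mapping when fewer processors are available can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm on the slide: it moves the actors of unavailable processors, heaviest first, onto whichever available processor currently has the least work. The function name `refine`, the input shapes, and the four-actor example are assumptions; the work values are the unreplicated estimates of A to D from the running example.

```python
def refine(mapping, work, available):
    """Move actors off unavailable processors, heaviest first,
    each onto the currently least-loaded available processor."""
    loads = {p: 0.0 for p in available}
    new_mapping = {}
    orphans = []
    for actor, proc in mapping.items():
        if proc in loads:
            new_mapping[actor] = proc      # processor still available: keep
            loads[proc] += work[actor]
        else:
            orphans.append(actor)          # processor gone: must be moved
    for actor in sorted(orphans, key=work.get, reverse=True):
        target = min(loads, key=loads.get)
        new_mapping[actor] = target
        loads[target] += work[actor]
    return new_mapping, loads


work = {"A": 10, "B": 86, "C": 246, "D": 326}
mapping = {"A": 0, "B": 1, "C": 2, "D": 3}  # static mapping on four processors

# At run time only processors 0 and 1 are available.
new_mapping, loads = refine(mapping, work, available=[0, 1])
print(new_mapping, loads)
```

Because the sketch touches only the displaced actors, its cost grows with the number of actors to move rather than with the size of the whole graph, which is the motivation for a light-weight refinement step over re-running a full partitioner.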

Page 16

Stage Assignment [dynamic 2]

• Processor assignment only specifies how actors are overlapped across processors.

• Stage assignment finds how actors are overlapped in time.

• Relative start time of the actors is based on stage numbers.

• DMA operations will have a separate stage.
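A minimal sketch of how stage numbers could be derived from a processor assignment is shown below. The rules are assumptions made for illustration: a consumer on the same processor as its producer may share the producer's stage, while a consumer on a different processor is placed two stages later so that the DMA transfer gets the stage in between. The function name `assign_stages` and the input format are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def assign_stages(edges, proc):
    """edges: list of (producer, consumer); proc: actor -> processor id.
    Returns (actor -> stage, (producer, consumer) -> DMA stage)."""
    preds = {a: set() for a in proc}
    for src, dst in edges:
        preds[dst].add(src)
    stage = {}
    dma_stage = {}
    # Visit producers before consumers.
    for actor in TopologicalSorter(preds).static_order():
        s = 0
        for p in preds[actor]:
            if proc[p] == proc[actor]:
                s = max(s, stage[p])              # same processor: no DMA
            else:
                dma_stage[(p, actor)] = stage[p] + 1  # DMA gets its own stage
                s = max(s, stage[p] + 2)
        stage[actor] = s
    return stage, dma_stage


edges = [("A", "B"), ("B", "C"), ("C", "D")]
proc = {"A": 0, "B": 0, "C": 1, "D": 1}
stage, dma_stage = assign_stages(edges, proc)
print(stage, dma_stage)
```

In steady state, actors in different stages work on different iterations of the stream at the same time, which is how the schedule overlaps computation with data transfers.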

Page 17

Stage Assignment Example

[Figure: the replicated stream graph (A, B, C0 to C3, D0, D1, E0 to E3, F, splitters S0 to S2, joiners J0 to J2) and its actors grouped by processor, annotated with stage numbers 0, 2, 4, 6, 8, 10, 12, 14, 16, 18.]

Page 18

Buffer Allocation [dynamic 3]

• Slave processors have limited local store.

• Local store is faster than main memory.

• Utilize local stores first and then spill to main memory

• In case of spilling, DMAs have to be adjusted
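The "local store first, then spill" policy can be sketched as a single greedy pass. The order in which buffers are considered, the function name `allocate_buffers`, the buffer names, and their sizes are all assumptions for illustration; the 128K local store figure from the methodology slide is read here as 128 × 1024 bytes.

```python
def allocate_buffers(buffers, local_capacity):
    """buffers: list of (name, processor, size_bytes) in priority order.
    Returns (name -> 'local' or 'main', remaining local bytes per processor)."""
    free = dict(local_capacity)
    placement = {}
    for name, proc, size in buffers:
        if size <= free[proc]:
            free[proc] -= size
            placement[name] = "local"
        else:
            # Spilled buffer: DMAs touching it must now target main memory.
            placement[name] = "main"
    return placement, free


KB = 1024
buffers = [
    ("A->B", 0, 64 * KB),
    ("B->C", 0, 48 * KB),
    ("C->D", 0, 32 * KB),
]
placement, free = allocate_buffers(buffers, {0: 128 * KB})
print(placement, free)
```

Here the third buffer no longer fits in the 16 KB that remain, so it is placed in main memory.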

Page 19

Methodology

• StreamIt Compiler

• Metis for graph partitioning

• 32 core heterogeneous distributed memory multi-core system

• Each slave core has a DMA engine and 128K local store

• System simulator to simulate the interconnect traffic.

Page 20

Performance Comparison (DES)

[Figure: relative speedup (axis from 0 to 35) versus number of cores for DES. Legend: Full Static, Graph Partitioning, Flextream.]

Page 21

Performance Comparison

[Figure: slowdown in percent (axis from 0 to 25) for each benchmark: bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, average. Legend: Graph Partitioning Approach, Flextream Approach.]

Page 22

Dynamic Approach Time Comparison

[Figure: time in ms (axis from 0 to 12) for each benchmark: bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, average. Legend: Flextream Refinement Approach, Graph Partitioner Approach.]

Page 23

Overhead Comparison

[Figure: fraction of time allocated (axis from 0.90 to 1.00) for each benchmark: bitonic, dct, des, fft, filter bank, fm, matrix mult., mpeg2, serpent, tde, average. Legend: Prepass Replication, Work Refinement Time, Stage Assignment Time, Buffer Allocation Time. Data labels, one per benchmark in the order above: 3735, 301, 1117, 125, 705, 887, 274, 4588, 695, 403, 1283.]

Page 24

Conclusion

• Static scheduling approaches are promising but not enough.

• Dynamic adaptation is necessary for future systems.

• Flextream provides a hybrid static/dynamic approach to improve efficiency.

Page 25

Overhead Comparison

[Figure: the overhead chart from Page 23, repeated.]

Effect of Buffer Allocation on Performance

[Figure: relative performance (axis from 0 to 1) for each benchmark: bitonic, dct, des, fft, filter bank, fm, matrix, mpeg2, serpent, tde, average. Legend, six memory budgets: Min Mem; Min Mem + (Max Mem - Min Mem)/5; Min Mem + 2(Max Mem - Min Mem)/5; Min Mem + 3(Max Mem - Min Mem)/5; Min Mem + 4(Max Mem - Min Mem)/5; Max Mem.]

Page 26

Prepass Replication

[Figure: the example graph before and after replication. Before: A (10), B (86), C (246), D (326), E (566), F (10). After: C0 to C3 (61.5 each) between S0 and J0; D0 and D1 (163 each) between S1 and J1; E0 to E3 (141.5 each) between S2 and J2; each splitter and joiner is annotated 6.]

Page 27

[Figure: processor loads during prepass replication.
Initial: P0 10, P1 86, P2 246, P3 326, P4 566, P5 10, P6 0, P7 0.
After splitting E into E0 and E1: P4 283, P6 283. After splitting D into D0 and D1: P3 163, P7 163.
Final, with C0 to C3, D0, D1, and E0 to E3: P0 151.5, P1 147.5, P2 184.5, P3 163, P4 141.5, P5 151.5, P6 141.5, P7 163.]

Page 28

Outline

• Streaming Background

• Flextream’s Approach
  – Static phase
  – Dynamic phase

• Evaluation

• Conclusion

Page 29

Introduction

• Single-core performance has stopped scaling.

• Multi-core and many-core systems are everywhere.

• These systems have different configurations.

• Resource management is a challenging problem.

Cell Processor

Intel Larrabee