streamit – a programming language for the era of...

81
StreamIt – A Programming Language for the Era of Multicores Saman Amarasinghe http://cag.csail.mit.edu/streamit

Upload: others

Post on 08-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

StreamIt – A Programming Language for the Era of Multicores

Saman Amarasinghe

http://cag.csail.mit.edu/streamit

Page 2: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

2

1

10

100

1000

10000

100000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016

Perfo

rman

ce (v

s. V

AX-

11/7

80)

25%/year

52%/year

??%/year

8086

286

386

486

PentiumP2

P3P4

ItaniumItanium 2

Moore’s Law

From David Patterson

1,000,000,000

100,000

10,000

1,000,000

10,000,000

100,000,000

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Num

ber of Transistors

Page 3: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

3

8086

286

386

486

PentiumP2

P3P4

ItaniumItanium 2

1

10

100

1000

10000

100000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016

Per

form

ance

(vs.

VA

X-11

/780

)

25%/year

52%/year

Uniprocessor Performance (SPECint)

From David Patterson

1,000,000,000

100,000

10,000

1,000,000

10,000,000

100,000,000

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Num

ber of Transistors

Page 4: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

4

8086

286

386

486

PentiumP2

P3P4

ItaniumItanium 2

1

10

100

1000

10000

100000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016

Perfo

rman

ce (v

s. V

AX-

11/7

80)

25%/year

52%/year

??%/year

Uniprocessor Performance (SPECint)

• General-purpose unicores have stopped historic performance scaling– Power consumption– Wire delays– DRAM access latency– Diminishing returns of more instruction-level parallelism

From David Patterson

1,000,000,000

100,000

10,000

1,000,000

10,000,000

100,000,000

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Num

ber of Transistors

All major manufacturers moving to multicore

architectures

Page 5: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

5

• Two choices:• Bend over backwards to support

old languages like C/C++• Develop high-performance

architectures that are hard to program

Programming Languages for Modern Architectures

Modernarchitecture

C von-Neumannmachine

Page 6: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

6

Parallel Programmer’s Dilemma

Rapid prototyping- MATLAB- Ptolemy

highParallel Performance

MalleabilityPortability

Productivity

high

low

Natural parallelization- StreamIt

Automatic parallelization- FORTRAN compilers- C/C++ compilers

Manual parallelization- C/C++ with MPI

Optimal parallelization- assembly code

Page 7: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

7

• Graphics• Cryptography• Databases• Object recognition• Network processing

and security • Scientific codes• …

Stream Application Domain

Page 8: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

8

StreamIt Project• Language Semantics / Programmability

– StreamIt Language (CC 02) – Programming Environment in Eclipse (P-PHEC 05)

• Optimizations / Code Generation– Phased Scheduling (LCTES 03)– Cache Aware Optimization (LCTES 05)

• Domain Specific Optimizations– Linear Analysis and Optimization (PLDI 03)– Optimizations for bit streaming (PLDI 05)– Linear State Space Analysis (CASES 05)

• Parallelism– Teleport Messaging (PPOPP 05)– Compiling for Communication-Exposed

Architectures (ASPLOS 02)– Load-Balanced Rendering

(Graphics Hardware 05)• Applications

– SAR, DSP benchmarks, JPEG, – MPEG [IPDPS 06], DES and

Serpent [PLDI 05], …

StreamIt Program

Front-end

Stream-AwareOptimizations

Uniprocessorbackend

Clusterbackend

Rawbackend

IBM X10backend

C C per tile +msg code

StreamingX10 runtime

Annotated Java

MPI-likeC

Page 9: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

9

programmability

domain specificoptimizations

enable parallelexecution

target multicores, clusters, tiled architectures, DSPs, graphics processors, …

simple and effective optimizations for domain

specific abstractions

boost productivity, enable faster development and

rapid prototyping

Compiler-Aware Language Design

Page 10: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

10

Picture Reorder

joiner

joiner

IDCT

IQuantization

splitter

splitter

VLDmacroblocks, motion vectors

frequency encodedmacroblocks differentially coded

motion vectors

motion vectorsspatially encoded macroblocks

recovered picture

ZigZag

Saturation

Channel Upsample Channel Upsample

Motion Vector Decode

Y Cb Cr

quantization coefficients

picture type

<QC>

<QC>

reference picture

Motion Compensation

<PT1> referencepicture

Motion Compensation

<PT1>reference picture

Motion Compensation

<PT1>

<PT2>

Repeat

Color Space Conversion

<PT1, PT2>

add VLD(QC, PT1, PT2);

add splitjoin {

split roundrobin(N∗B, V);

add pipeline {add ZigZag(B);add IQuantization(B) to QC;add IDCT(B);add Saturation(B);

}add pipeline {

add MotionVectorDecode();add Repeat(V, N);

}

join roundrobin(B, V);}

add splitjoin {split roundrobin(4∗(B+V), B+V, B+V);

add MotionCompensation(4∗(B+V)) to PT1;for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation(B+V) to PT1;add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

add PictureReorder(3∗W∗H) to PT2;

add ColorSpaceConversion(3∗W∗H);

MPEG bit stream

Streaming Application Design

• Structured block level diagram describes computation and flow of data

• Conceptually easy to understand– Clean abstraction of

functionality

MPEG-2 Decoder

Page 11: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

11

Picture Reorder

joiner

joiner

IDCT

IQuantization

splitter

splitter

VLDmacroblocks, motion vectors

frequency encodedmacroblocks differentially coded

motion vectors

motion vectorsspatially encoded macroblocks

recovered picture

ZigZag

Saturation

Channel Upsample Channel Upsample

Motion Vector Decode

Y Cb Cr

quantization coefficients

picture type

<QC>

<QC>

reference picture

Motion Compensation

<PT1> referencepicture

Motion Compensation

<PT1>reference picture

Motion Compensation

<PT1>

<PT2>

Repeat

Color Space Conversion

<PT1, PT2>

add VLD(QC, PT1, PT2);

add splitjoin {

split roundrobin(N∗B, V);

add pipeline {add ZigZag(B);add IQuantization(B) to QC;add IDCT(B);add Saturation(B);

}add pipeline {

add MotionVectorDecode();add Repeat(V, N);

}

join roundrobin(B, V);}

add splitjoin {split roundrobin(4∗(B+V), B+V, B+V);

add MotionCompensation(4∗(B+V)) to PT1;for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation(B+V) to PT1;add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

add PictureReorder(3∗W∗H) to PT2;

add ColorSpaceConversion(3∗W∗H);

MPEG bit stream

StreamIt Philosophy

• Preserve program structure– Natural for application

developers to express

• Leverage program structure to discover parallelism and deliver high performance

• Programs remain clean– Portable and malleable

Page 12: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

12

StreamIt Philosophy

output to player

Picture Reorder

joiner

joiner

IDCT

IQuantization

splitter

splitter

VLDmacroblocks, motion vectors

frequency encodedmacroblocks differentially coded

motion vectors

motion vectorsspatially encoded macroblocks

recovered picture

ZigZag

Saturation

Channel Upsample Channel Upsample

Motion Vector Decode

Y Cb Cr

quantization coefficients

picture type

<QC>

<QC>

reference picture

Motion Compensation

<PT1> referencepicture

Motion Compensation

<PT1>reference picture

Motion Compensation

<PT1>

<PT2>

Repeat

Color Space Conversion

<PT1, PT2>

add VLD(QC, PT1, PT2);

add splitjoin {split roundrobin(N∗B, V);

add pipeline {add ZigZag(B);add IQuantization(B) to QC;add IDCT(B);add Saturation(B);

}add pipeline {

add MotionVectorDecode();add Repeat(V, N);

}

join roundrobin(B, V);}

add splitjoin {split roundrobin(4∗(B+V), B+V, B+V);

add MotionCompensation(4∗(B+V)) to PT1;for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation(B+V) to PT1;add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

add PictureReorder(3∗W∗H) to PT2;

add ColorSpaceConversion(3∗W∗H);

MPEG bit stream

Page 13: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

13

Picture Reorder

joiner

joiner

IDCT

IQuantization

splitter

splitter

VLD

ZigZag

Saturation

Channel Upsample Channel Upsample

Motion Vector Decode

<QC>

<QC>

reference picture

Motion Compensation

<PT1> referencepicture

Motion Compensation

<PT1>reference picture

Motion Compensation

<PT1>

<PT2>

Repeat

Color Space Conversion

<PT1, PT2>

add VLD(QC, PT1, PT2);

add splitjoin {

split roundrobin(N∗B, V);

add pipeline {add ZigZag(B);add IQuantization(B) to QC;add IDCT(B);add Saturation(B);

}add pipeline {

add MotionVectorDecode();add Repeat(V, N);

}

join roundrobin(B, V);}

add splitjoin {split roundrobin(4∗(B+V), B+V, B+V);

add MotionCompensation(4∗(B+V)) to PT1;for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation(B+V) to PT1;add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

add PictureReorder(3∗W∗H) to PT2;

add ColorSpaceConversion(3∗W∗H);

MPEG bit stream

splitjoins

filters

pipelines

Stream Abstractions in StreamIt

Page 14: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

14

StreamIt Language Highlights

• Filters

• Pipelines

• Splitjoins

• Teleport messaging

Page 15: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

15

Example StreamIt Filter

float→float filter FIR (int N) {

work push 1 pop 1 peek N {float result = 0;for (int i = 0; i < N; i++) {

result += weights[i] ∗ peek(i);}push(result);pop();

}}

0 1 2 3 4 5 6 7 8 9 10 11 input

output

FIR0 1

Page 16: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

16

FIR Filter in C

void FIR(int* src, int* dest, int* srcIndex, int* destIndex, int srcBufferSize, int destBufferSize,int N) {

float result = 0.0;for (int i = 0; i < N; i++) {result += weights[i] * src[(*srcIndex + i) % srcBufferSize];

}dest[*destIndex] = result;*srcIndex = (*srcIndex + 1) % srcBufferSize;*destIndex = (*destIndex + 1) % destBufferSize;

}

• FIR functionality obscured by buffer management details

• Programmer must commit to a particular buffer implementation strategy

Page 17: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

17

StreamIt Language Highlights

• Filters

• Pipelines

• Splitjoins

• Teleport messaging

Page 18: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

18

Example StreamIt Pipeline• Pipeline

– Connect components in sequence

– Expose pipeline parallelism Column_iDCTs

Row_iDCTs

float→float pipeline 2D_iDCT (int N) {

add Column_iDCTs(N);add Row_iDCTs(N);

}

Page 19: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

19

From Figures 7-1 and 7-4 of the MPEG-2 Specification (ISO 13818-2, P. 61, 66)

Preserving Program Structureint->int pipeline BlockDecode(

portal<InverseQuantisation> quantiserData, portal<MacroblockType> macroblockType) {

add ZigZagUnordering();add InverseQuantization() to quantiserData, macroblockType;add Saturation(-2048, 2047);add MismatchControl();add 2D_iDCT(8); add Saturation(-256, 255);

}

Can be reused for JPEG decoding

quantiserData,

quantiserData,

64 64 1 1 64

Page 20: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

20

Decode_Picture {for (;;) {

parser()for (;;) {

decode_macroblock();motion_compensation();if (condition)

then break;}

}frame_reorder();

}

decode_macroblock() {parser();motion_vectors();for (comp=0;comp<block_count;comp++) {

parser();Decode_MPEG2_Block();

}}

motion_vectors() {parser();decode_motion_vectorparser();

}

Decode_MPEG2_Block() {for (int i = 0;; i++) {

parsing();ZigZagUnordering();inverseQuantization();if (condition) then

break;}

}

motion_compensation() {for (channel=0;channel<3;channel++)

form_component_prediction();for (comp=0;comp<block_count;comp++) {

Saturate();IDCT();Add_Block();

}}

In Contrast: C Code Excerpt

• Explicit for-loops iterate through picture frames• Frames passed through global arrays, handled with pointers• Mixing of parser, motion compensation, and spatial decoding

EXTERN unsigned char *backward_reference_frame[3];EXTERN unsigned char *forward_reference_frame[3];EXTERN unsigned char *current_frame[3];...etc...

Page 21: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

21

StreamIt Language Highlights

• Filters

• Pipelines

• Splitjoins

• Teleport messaging

Page 22: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

22

Example StreamIt Splitjoin• Splitjoin

– Connect components in parallel

– Expose task parallelism and data distribution

joiner

splitterfloat→float splitjoin Row_iDCT (int N){

split roundrobin(N);for (int i = 0; i < N; i++) {

add 1D_iDCT(N);}join roundrobin(N);

}

Page 23: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

23

joiner

iDCT

splitter

iDCTiDCT iDCT

joiner

iDCT

splitter

iDCTiDCT iDCT

Example StreamIt Splitjoin

joiner

splitter

joiner

splitterfloat→float splitjoin Row_iDCT (int N){

split roundrobin(N);for (int i = 0; i < N; i++) {

add 1D_iDCT(N);}join roundrobin(N);

}

float→float pipeline 2D_iDCT (int N) {

add Column_iDCTs(N);add Row_iDCTs(N);

}

float→float splitjoin Column_iDCT (int N){

split roundrobin(1);for (int i = 0; i < N; i++) {

add 1D_iDCT(N);}join roundrobin(1);

}

Page 24: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

24

add splitjoin {split roundrobin(4*(B+V), B+V, B+V);

add MotionCompensation();for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation();add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

joiner 1:1:1

splitter 4:1:1

recovered picture

Channel Upsample

Channel Upsample

Cr

MotionCompensation

MotionCompensation

Motion Compensation

Y Cb

Naturally Expose Data Distribution

Y Cb Cr

1 2

3 4 5 6

scatter macroblocks according to chroma format

gather one pixel at a time

Page 25: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

25

Stream Graph Malleability

joiner

splitter 4,1,1

Upsample Upsample

MCMCMC

Y Cb Cr

1 2

3 4 5 6

Y Cb Cr

1 2

3 4

5

87

6

4:2:2 chroma format

4:2:0 chroma format

joiner

splitter 1,1

Upsample Upsample

MCMCMC

splitter 4,(2+2)

joiner

Page 26: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

26

StreamIt Code Sample

// C = blocks per chroma channel per macroblock// C = 1 for 4:2:0, C = 2 for 4:2:2add splitjoin {

split roundrobin(4*(B+V), 2*C*(B+V));

add MotionCompensation();add splitjoin {

split roundrobin(B+V, B+V);

for (int i = 0; i < 2; i++) {add pipeline {

add MotionCompensation()add ChannelUpsample(C,B);

}}

join roundrobin(1, 1);}

join roundrobin(1, 1, 1);}

red = code added or modified to support 4:2:2 format

joiner

splitter 4,1,1

Upsample Upsample

MCMCMC

joiner

splitter 1,1

Upsample Upsample

MCMCMC

splitter 4,(2+2)

joiner

Page 27: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

27

In Contrast: C Code Excerpt

/* Y */form_component_prediction(src[0]+(sfield?lx2>>1:0),dst[0]+(dfield?lx2>>1:0),

lx,lx2,w,h,x,y,dx,dy,average_flag);

if (chroma_format!=CHROMA444) {lx>>=1; lx2>>=1; w>>=1; x>>=1; dx/=2;

}if (chroma_format==CHROMA420) {

h>>=1; y>>=1; dy/=2;}

/* Cb */form_component_prediction(src[1]+(sfield?lx2>>1:0),dst[1]+(dfield?lx2>>1:0),

lx,lx2,w,h,x,y,dx,dy,average_flag);

/* Cr */form_component_prediction(src[2]+(sfield?lx2>>1:0),dst[2]+(dfield?lx2>>1:0),

lx,lx2,w,h,x,y,dx,dy,average_flag);

Adjust values used for address calculations depending on the chroma format used.

red = pointers used for address calculations

Page 28: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

28

StreamIt Language Highlights

• Filters

• Pipelines

• Splitjoins

• Teleport messaging

Page 29: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

29

Teleport Messaging

• Avoids muddling data streams with control relevant information

• Localized interactions in large applications– A scalable alternative to

global variables or excessive parameter passing

Order

IQ

VLD

MCMCMC

Page 30: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

30

Motion Prediction and Messaging

portal<MotionCompensation> PT;

add splitjoin {split roundrobin(4*(B+V), B+V, B+V);

add MotionCompensation() to PT;for (int i = 0; i < 2; i++) {

add pipeline {add MotionCompensation() to PT;add ChannelUpsample(B);

}}

join roundrobin(1, 1, 1);}

joiner

splitter

recovered picture

Channel Upsample

Channel Upsample

Cr

MotionCompensation

MotionCompensation

Motion Compensation

Y Cb

picture type

Page 31: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

31

• Looks like method call, but timed relative to data in the stream

• Simple and precise for user– Exposes dependences to compiler– Adjustable latency– Can send upstream or downstream

void setPicturetype(int p) {reconfigure(p);

}

TargetFilter x;if newPictureType(p) {

x.setPictureType(p) @ 0;}

Teleport Messaging Overview

Page 32: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

32

Frame Reordering

DecodePicture

Decode Macroblock

Decode MotionVectors

Motion Compensation

Decode Block

Motion Compensation For Single Channel

Saturate

IDCT

ZigZagUnordering InverseQuantization

File Parsing

Global Variable

Space

The MPEG Bitstream

Output Video

Messaging Equivalent in C

Page 33: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

33

programmability

domain specificoptimizations

enable parallelexecution

target multicores, clusters, tiled architectures, DSPs, graphics processors, …

simple and effective optimizations for domain

specific abstractions

boost productivity, enable faster development and

rapid prototyping

Compiler-Aware Language Design

Page 34: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

34

Multicores Are Here!

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480

20??

# ofcores

1

2

4

8

16

32

64

128256

512

Opteron 4PXeon MP

Athlon

AmbricAM2045

Tilera

Page 35: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

35

Von Neumann Languages• Why C (FORTRAN, C++ etc.) became very successful?

– Abstracted out the differences of von Neumann machines– Directly expose the common properties– Can have a very efficient mapping to a von Neumann machine– “C is the portable machine language for von Numann machines”

• von Neumann languages are a curse for Multicores– We have squeezed out all the performance out of C– But, cannot easily map C into multicores

Page 36: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

36

Common Machine Languages

Single memory imageSingle flow of controlCommon Properties

Unicores:

ISA

Functional Units

Register FileDifferences:

Register AllocationInstruction Selection

Instruction Scheduling

Multiple local memoriesMultiple flows of controlCommon PropertiesMulticores:

Communication Model

Synchronization Model

Number and capabilities of coresDifferences:

von-Neumann languages represent the common properties and abstract away the differences

Page 37: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

37

Bridging the Abstraction layers

• StreamIt exposes the data movement– Graph structure is architecture independent

• StreamIt exposes the parallelism– Explicit task parallelism– Implicit but inherent data and pipeline parallelism

• Each multicore is different in granularity and topology– Communication is exposed to the compiler

• The compiler needs to efficiently bridge the abstraction– Map the computation and communication pattern of the program

to the cores, memory and the communication substrate

Page 38: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

38

Types of Parallelism

Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Peel iterations of filter, place within

scatter/gather pair (fission)– parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Task

Page 39: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

39

Types of Parallelism

Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Between iterations of a stateless filter – Place within scatter/gather pair (fission)– Can’t parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

Data Parallel

Page 40: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

40

Types of Parallelism

Traditionally:

Task Parallelism– Thread (fork/join) parallelism

Data Parallelism– Data parallel loop (forall)

Pipeline Parallelism– Usually exploited in hardware

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

Page 41: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

41

Problem Statement

Given: – Stream graph with compute and communication

estimate for each filter– Computation and communication resources of

the target machine

Find:– Schedule of execution for the filters that best

utilizes the available parallelism to fit the machine resources

Page 42: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

42

Our 3-Phase Solution

1. Coarsen: Fuse stateless sections of the graph2. Data Parallelize: parallelize stateless filters3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture– 11.2x mean throughput speedup over single core

Coarsen Granularity

Data Parallelize

Software Pipeline

Page 43: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

43

Baseline 1: Task Parallelism

Adder

Splitter

Joiner

Compress

BandPass

Expand

Process

BandStop

Compress

BandPass

Expand

Process

BandStop

• Inherent task parallelism between two processing pipelines

• Task Parallel Model:– Only parallelize explicit task

parallelism – Fork/join parallelism

• Execute this on a 2 core machine ~2x speedup over single core

• What about 4, 16, 1024, … cores?

Page 44: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

44

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

Bitonic

Sort

Chann

elVoc

oder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpe

nt

TDEMPEG2D

ecod

er

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Evaluation: Task ParallelismRaw Microprocessor

16 inorder, single-issue cores with D$ and I$16 memory banks, each bank with DMA

Cycle accurate simulator

Parallelism: Not matched to target!Synchronization: Not matched to target!

Page 45: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

45

Baseline 2: Fine-Grained Data Parallelism

Adder

Splitter

Joiner

• Each of the filters in the example are stateless

• Fine-grained Data Parallel Model:– Fiss each stateless filter N

ways (N is number of cores)– Remove scatter/gather if

possible

• We can introduce data parallelism– Example: 4 cores

• Each fission group occupies entire machineBandStopBandStopBandStopAdder

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

Page 46: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

46

Evaluation:Fine-Grained Data Parallelism

0

1

2

34

5

6

7

8

910

11

12

13

14

1516

17

18

19

Bitonic

SortCha

nnelVoc

oder

DCT

DES

FFT

Filterban

k

FMRadio

Serpen

t

TDEMPEG2Deco

der

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

TaskFine-Grained Data

Good Parallelism!Too Much Synchronization!

Page 47: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

47

Phase 1: Coarsen the Stream GraphSplitter

Joiner

Expand

BandStop

Process

BandPass

Compress

Expand

BandStop

Process

BandPass

Compress

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstreamPeek Peek

PeekPeek

Adder

Page 48: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

48

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

BandStop BandStop

Adder

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstream

• Benefits:– Reduces global communication

and synchronization– Exposes inter-node

optimization opportunities

Phase 1: Coarsen the Stream Graph

Page 49: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

49

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

BandPassCompressProcessExpand

BandStop BandStop

Splitter

Joiner

Fiss 4 ways, to occupy entire chip

Data Parallelize for 4 cores

Page 50: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

50

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandStop BandStop

Splitter

Joiner

Task parallelism!Each fused filter does equal workFiss each filter 2 times to occupy entire chip

Data Parallelize for 4 cores

Page 51: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

51

BandStop BandStop

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

Splitter

Joiner

BandStop

Splitter

Joiner

BandStop

Splitter

Joiner

Task parallelism, each filter does equal workFiss each filter 2 times to occupy entire chip

• Task-conscious data parallelization– Preserve task parallelism

• Benefits:– Reduces global communication

and synchronization

Data Parallelize for 4 cores

Page 52: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

52

Evaluation: Coarse-Grained Data Parallelism

012

34567

89

1011

1213141516

171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

Mean

Thro

ughp

ut N

orm

aliz

ed to

Sin

gle

Cor

e St

ream

It

TaskFine-Grained DataCoarse-Grained Task + Data

Good Parallelism!Low Synchronization!

Page 53: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

53

Simplified Vocoder

RectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

Joiner

PolarRect

66

20

2

1

1

1

2

1

1

1

20 Data Parallel

Data Parallel

Target a 4 core machine

Data Parallel, but too little work!

Page 54: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

54

Data Parallelize

RectPolarRectPolarRectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

RectPolar

Splitter

Joiner

RectPolarRectPolarRectPolarPolarRect

Splitter

Joiner

Joiner

66

20

2

1

1

1

2

1

1

1

20

5

5

Target a 4 core machine

Page 55: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

55

Data + Task Parallel Execution

Time

Cores

21

Target 4 core machine

Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

Page 56: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

56

We Can Do Better!

Time

Cores

Target 4 core machine

Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

16

Page 57: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

57

Phase 3: Coarse-Grained Software Pipelining

RectPolar

RectPolar

RectPolar

RectPolar

Prologue

New Steady

State

• New steady-state is free of dependencies

• Schedule new steady-state using a greedy partitioning

Page 58: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

58

Greedy Partitioning

Target 4 core machine

Time 16

CoresTo Schedule:

Page 59: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

59

0123456789

10111213141516171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

MeanTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Task Fine-Grained DataCoarse-Grained Task + Data Coarse-Grained Task + Data + Software Pipeline

Evaluation: Coarse-Grained Task + Data + Software Pipelining

Best Parallelism!Lowest Synchronization!

Page 60: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

60

programmability

domain specificoptimizations

enable parallelexecution

target multicores, clusters, tiled architectures, DSPs, graphics processors, …

simple and effective optimizations for domain

specific abstractions

boost productivity, enable faster development and

rapid prototyping

Compiler-Aware Language Design

Page 61: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

61

Conventional DSP Design Flow

DSP Optimizations

Rewrite the program

Design the Datapaths(no control flow)

Architecture-specificOptimizations(performance,

power, code size)

Spec. (data-flow diagram)

Coefficient Tables

C/Assembly Code

Signal Processing Expert in Matlab

Software Engineerin C and Assembly

Page 62: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

62

Design Flow with StreamIt

DSP Optimizations

StreamIt Program(dataflow + control)

Architecture-SpecificOptimizations

C/Assembly Code

Application-Level Design

StreamIt compiler

Application Programmer

Page 63: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

63

Design Flow with StreamIt

DSP Optimizations

StreamIt Program(dataflow + control)

Architecture-SpecificOptimizations

C/Assembly Code

Application-Level Design • Benefits of programming in a single, high-level abstraction– Modular – Composable– Portable– Malleable

• The Challenge: Maintaining Performance

Page 64: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

64

Focus: Linear State Space Filters

• Properties:1. Outputs are linear function of inputs and states 2. New states are linear function of inputs and states

• Most common target of DSP optimizations– FIR / IIR filters– Linear difference equations– Upsamplers / downsamplers– DCTs

Page 65: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

65

Representing State Space Filters

u

• A state space filter is a tuple ⟨A, B, C, D⟩

x’ = Ax + Bu

y = Cx + Du

⟨A, B, C, D⟩

inputs

states

outputs

Page 66: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

66

Representing State Space Filters

u

• A state space filter is a tuple ⟨A, B, C, D⟩

x’ = Ax + Bu

y = Cx + Du

⟨A, B, C, D⟩

float->float filter IIR {float x1, x2; work push 1 pop 1 { float u = pop();push(2*(x1+x2+u));x1 = 0.9*x1 + 0.3*u;x2 = 0.9*x2 + 0.2*u;

} }

inputs

states

outputs

Page 67: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

67

Representing State Space Filters

float->float filter IIR {float x1, x2; work push 1 pop 1 { float u = pop();push(2*(x1+x2+u));x1 = 0.9*x1 + 0.3*u;x2 = 0.9*x2 + 0.2*u;

} }

• A state space filter is a tuple A, B, C, D⟩

0.9 00 0.9 B = A =

0.30.2

u

2 2C = 2

x’ = Ax + Bu

D =

y = Cx + Du

Linear dataflow analysis

inputs

states

outputs

Page 68: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

68

Linear Optimizations

1. Combining adjacent filters2. Transformation to frequency domain3. Change of basis transformations4. Transformation Selection

Page 69: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

69

1) Combining Adjacent Filters

y = D1u

z = D2y

Filter 1

Filter 2

y

u

z

E

z = D2D1u CombinedFilter

u

z

z = Eu

Page 70: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

70

IIR Filter

Combination Example

x’ = 0.9x + uy = x + 2u

Decimator

y = [1 0] u1u2

IIR / Decimatorx’ = 0.81x + [0.9 1]

y = x + [2 0]

u1u2

8 FLOPsoutput

6 FLOPsoutput

u1u2

Page 71: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

71

IIR Filter

Combination Example

x’ = 0.9x + uy = x + 2u

Decimator

y = [1 0] u1u2

8 FLOPsoutput

6 FLOPsoutput

IIR / Decimatorx’ = 0.81x + [0.9 1]

y = x + [2 0]

u1u2

u1u2

As decimation factor goes to ∞,eliminate up to 75% of FLOPs.

Page 72: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

72

Floating-Point Operations Reduction

-40%

-20%

0%

20%

40%

60%

80%

100%

FIRRate

Conve

rtTarg

etDete

ctFMRad

io

Radar

FilterB

ank

Vocod

erOve

rsample

DToA

Benchmark

Flop

s R

emov

ed (%

)

linear

0.3%

Page 73: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

73

• Painful to do by hand– Blocking

– Coefficient calculations

– Startup etc.

Page 74: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

74

FLOPs Reduction

-40%

-20%

0%

20%

40%

60%

80%

100%

FIRRate

Conve

rtTarg

etDete

ctFMRad

io

Radar

FilterB

ank

Vocod

erOve

rsample

DToA

Benchmark

Flop

s R

emov

ed (%

)

linear

freq

-140%

0.3%

Page 75: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

75

x’ = Ax + Buy = Cx + Du

3) Change-of-Basis Transformation

T = invertible matrix, z = Tx

z’ = A’z + B’uy = C’z + D’u

A’ = TAT-1 B’ =TBC’ = CT-1 D’ = D

Can map original states x to transformed states z = Tx without changing I/O behavior

Page 76: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

76

Change-of-Basis Optimizations

1. State removal- Minimize number of states in system

2. Parameter reduction- Increase number of 0’s and 1’s in

multiplication

Formulated as general matrix operations

Page 77: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

77

4) Transformation Selection

• When to apply what transformations?– Linear filter combination can increase the

computation cost

– Shifting to the Frequency domain is expensive for filters with pop > 1

– Compute all outputs, then decimate by pop rate

– Some expensive transformations may later enable other transformations, reducing the overall cost

Page 78: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

78

FLOPs Reduction with Optimization Selection

-40%

-20%

0%

20%

40%

60%

80%

100%

FIRRateCon

vert

TargetD

etect

FMRadio

Radar

FilterB

ank

Vocode

rOve

rsample

DToA

Benchmark

Flop

s Re

mov

ed (%

)

linearfreqautosel

-140%

0.3%

Page 79: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

79

-200%-100%

0%100%200%300%400%500%600%700%800%900%

FIRRateCon

vert

TargetD

etect

FMRadio

Radar

FilterB

ank

Vocode

rOve

rsample

DToA

Benchmark

Spe

edup

(%)

linearfreqautosel

Execution Speedup

5%

On a Pentium IV

Page 80: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

80

Conclusion• Streaming programming model

– Can break the von Neumann bottleneck – A natural fit for a large class of applications– An ideal machine language for multicores.

• Natural programming language for many streaming applications– Better modularity, composability, malleability and portability than C

• Compiler can easily extract explicit and inherent parallelism– Parallelism is abstracted away from architectural details of multicores– Sustainable Speedups (5x to 19x on the 16 core Raw)

• Can we replace the DSP engineer from the design flow?– On the average 90% of the FLOPs eliminated, average speedup of 450% attained

• Increased abstraction does not have to sacrifice performance

• The compiler and benchmarks are available on the webhttp://cag.csail.mit.edu/commit/

Page 81: StreamIt – A Programming Language for the Era of Multicoresgroups.csail.mit.edu/commit/papers/06/Stramit-Alberta.pdf · 2 1 10 100 1000 10000 100000 1978 1980 1982 1984 1986 1988

Thanks for Listening!

Any questions?

http://cag.csail.mit.edu/streamit