A Parallel 'for' Loop Memory Template for a High Level Synthesis Compiler

DESCRIPTION

We propose a parametrized memory template for applications with parallel 'for' loops. The template's parameters reflect important trade-offs made during system design. The template is incorporated in our high level synthesis (HLS) compiler, where the template's parameters are adjusted to the application. The template fits parallel 'for' loops with no loop dependencies and sequential bodies. We found two alternative template implementations using our compiler. In the future, we will develop templates for other types of 'for' loops. These will be added to the compiler and it will identify the template that works best for the application it is compiling. Once a template is selected, the compiler will use design space exploration to select the best combination of template parameters for the targeted hardware and application.

TRANSCRIPT

A parallel for loop memory template for a high level synthesis compiler

Euromicro Conference on Digital System Design

Lille, France, 02/09/2010

Craig Moore
Wim Meeus, Harald Devos, and Dirk Stroobandt

30/06/2010 Craig Moore, DSD 02/09/2010 2

Outline

● High Level Synthesis
● Hardware Development
● External Memory
● Burst memory transfers
● Parallel For Loops
● Memory Template Overview
● Small Example
● Future Work
● Conclusions

High Level Synthesis (HLS): Missing Pieces

Memory Templates as Tools

● HDL programmers have:
    ● Toolkit of memory designs
    ● Use the right tool for the job
    ● Manually adapt their designs

● HLS compilers should:
    ● Have a toolkit of templates
    ● Adapt the template to the app
    ● Evaluate each template
    ● Suggest the best template

Basic Steps for any Algorithm

1) Read values from memory
2) Process each value
3) Store output in memory

for (int i = start; i < end; i++)
{
    b[i] = func(a[i]);
}

Implement on Hardware

External Memory for FPGAs

● A bottleneck
● Sequential in nature
● Number of values returned each cycle depends on bus width
● Each memory request requires a handshake

Adapting to the Bottleneck

● Stream values from memory
● Pre-fetch values
● Read/write more than one value each clock cycle
● Store values locally to mask latency
● Reduce the number of requests
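To make the last bullet concrete, here is a minimal sketch that counts memory requests (and therefore handshakes) when consecutive values are grouped into one request. The function name `requests_needed` and the idea of a fixed group size are assumptions for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Number of memory requests needed to move n_values words when each
 * request transfers up to burst_len consecutive words.
 * burst_len == 1 models one request (one handshake) per value. */
size_t requests_needed(size_t n_values, size_t burst_len)
{
    assert(burst_len > 0); /* a request must transfer at least one word */
    /* Ceiling division, written so it cannot overflow. */
    return n_values / burst_len + (n_values % burst_len != 0 ? 1 : 0);
}
```

For example, `requests_needed(1024, 1)` is 1024, while `requests_needed(1024, 16)` is 64, so grouping 16 values per request removes most of the handshakes.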

Burst Transfers

● Burst of consecutive memory operations

Burst Transfers: Read Example

● Read transfer start address: 3
● Transfer: 4

[Figure: memory locations 0 to 6]
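The read example can be sketched in software as a single call that returns several consecutive words. This is only a behavioural model under the assumption that "Transfer: 4" means four consecutive words; the name `burst_read` and the 32-bit word type are illustrative choices, not part of the template.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `length` consecutive words from `mem`, starting at word address
 * `start`, into `dst`. Models one read burst: a single request that
 * returns many words. */
void burst_read(const uint32_t *mem, size_t mem_words,
                size_t start, size_t length, uint32_t *dst)
{
    /* The burst must stay inside the memory. */
    assert(start <= mem_words && length <= mem_words - start);
    for (size_t i = 0; i < length; i++) {
        dst[i] = mem[start + i];
    }
}
```

With a seven-word memory, `burst_read(mem, 7, 3, 4, dst)` fills `dst` with the contents of locations 3, 4, 5 and 6.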

Burst Transfers: Write Example

● Write transfer start address: 2
● Transfer: 5

[Figure: memory locations 0 to 6]
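A write burst needs enough values to be available before it starts. The sketch below models that with a small output buffer that collects results and only writes once a full burst is buffered. The buffer depth `OUT_CAPACITY`, the type `out_buffer`, and both function names are hypothetical; the slides do not specify how the buffer is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OUT_CAPACITY 16 /* hypothetical output buffer depth */

typedef struct {
    uint32_t data[OUT_CAPACITY];
    size_t count; /* number of values currently buffered */
} out_buffer;

/* Queue one result. Returns 0 if the buffer is full (producer must wait). */
int out_push(out_buffer *b, uint32_t value)
{
    if (b->count == OUT_CAPACITY) {
        return 0;
    }
    b->data[b->count++] = value;
    return 1;
}

/* If at least burst_len values are buffered, write the oldest burst_len
 * values to consecutive addresses starting at `start` and return 1.
 * Otherwise leave memory untouched and return 0. */
int out_try_burst_write(out_buffer *b, uint32_t *mem, size_t mem_words,
                        size_t start, size_t burst_len)
{
    assert(start <= mem_words && burst_len <= mem_words - start);
    if (b->count < burst_len) {
        return 0; /* not enough values yet */
    }
    for (size_t i = 0; i < burst_len; i++) {
        mem[start + i] = b->data[i];
    }
    /* Shift the remaining values to the front of the buffer. */
    for (size_t i = burst_len; i < b->count; i++) {
        b->data[i - burst_len] = b->data[i];
    }
    b->count -= burst_len;
    return 1;
}
```

With start address 2 and a burst of 5, a successful call writes locations 2 to 6 and leaves locations 0 and 1 unchanged.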

Parallel for Loop

● Each iteration is run in parallel
● No loop dependencies
    ● Loop transformations to remove them

Example with Dependencies

for i = 1 to 4
{
    a(i) = a(i) + 1
    b(i) = a(i - 1) + a(i + 1)
}
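The dependencies in this example can be seen by writing it out in C: iteration i reads a(i - 1), which the previous iteration has already incremented, and a(i + 1), which the next iteration has not yet incremented. The second function is one possible dependency-free rewrite that reads from an unmodified copy of the input. It is a sketch of the general idea, not the specific loop transformation used by the authors' compiler, and the names `LOOP_N`, `with_dependencies` and `without_dependencies` are illustrative.

```c
#include <stddef.h>

#define LOOP_N 4 /* loop runs i = 1..LOOP_N, so arrays hold LOOP_N + 2 values */

/* Sequential loop from the slide. The result depends on iteration order. */
void with_dependencies(int a[LOOP_N + 2], int b[LOOP_N + 2])
{
    for (size_t i = 1; i <= LOOP_N; i++) {
        a[i] = a[i] + 1;
        b[i] = a[i - 1] + a[i + 1];
    }
}

/* Dependency-free rewrite: each iteration reads only the unmodified
 * input a_in and writes only its own a_out[i] and b[i], so iterations
 * may run in any order or in parallel. a_out[0] and a_out[LOOP_N + 1]
 * are not written here; the loop never changes those elements. */
void without_dependencies(const int a_in[LOOP_N + 2],
                          int a_out[LOOP_N + 2], int b[LOOP_N + 2])
{
    for (size_t i = 1; i <= LOOP_N; i++) {
        /* In the sequential loop a[i-1] was already incremented,
         * except a[0], which the loop never touches. */
        int left = a_in[i - 1] + (i > 1 ? 1 : 0);
        a_out[i] = a_in[i] + 1;
        b[i] = left + a_in[i + 1]; /* a[i+1] was not yet incremented */
    }
}
```

Both functions produce the same `b` values and the same updated `a` values for indices 1 to 4, but only the second one fits a template that requires no loop dependencies.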

Template Overview / Manual Design

[Figure: block diagram of the memory template; the callouts below describe its components]

● Requests read bursts and controls execution of data paths; waits for the output buffer if it is full.
● Non-pipelined loop bodies executing in parallel.
● With enough values, performs write bursts.
● Starts and stops execution.
● Controls access to memory; grants permission based on request (output buffer priority).
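The memory access controller in the last callout can be sketched as a fixed-priority arbiter. The slides only state that the output buffer has priority; the enum, the function name `arbitrate`, and the one-grant-per-call model are assumptions made for this illustration.

```c
typedef enum { GRANT_NONE, GRANT_READ, GRANT_WRITE } grant;

/* Fixed-priority arbiter for a single memory port.
 * read_request:  nonzero when the input side wants to start a read burst.
 * write_request: nonzero when the output buffer wants to start a write burst.
 * The output buffer wins when both request at the same time. */
grant arbitrate(int read_request, int write_request)
{
    if (write_request) {
        return GRANT_WRITE;
    }
    if (read_request) {
        return GRANT_READ;
    }
    return GRANT_NONE;
}
```

Giving the write side priority means the output buffer drains first, which frees space for the data paths that would otherwise have to wait on a full buffer.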

Byte-Enable Signal

● Multiple values for each memory transaction
● Tells which bytes to replace and which to preserve

[Figure legend: Ignore, Enable]
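The effect of a byte-enable signal on a write can be sketched as a masked merge of the new word into the old one. The 64-bit bus with eight byte lanes and the name `apply_byte_enable` are assumptions for illustration; the actual bus width is a template parameter.

```c
#include <stdint.h>

/* Merge new_word into old_word one byte at a time.
 * Bit k of byte_enable controls byte k (bits 8k to 8k+7):
 * 1 = replace with the byte from new_word, 0 = preserve old_word's byte. */
uint64_t apply_byte_enable(uint64_t old_word, uint64_t new_word,
                           uint8_t byte_enable)
{
    uint64_t mask = 0;
    for (unsigned k = 0; k < 8; k++) {
        if (byte_enable & (1u << k)) {
            mask |= (uint64_t)0xFF << (8 * k);
        }
    }
    return (old_word & ~mask) | (new_word & mask);
}
```

For example, a byte-enable of `0x0F` replaces the four low bytes and preserves the four high bytes, which is how a burst that starts or ends in the middle of a bus word avoids overwriting neighbouring values.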

Parametrized Template

Parameters

● Memory Bus Width = M
● Word Width = W
● Max Words = A = M / W
● Input FIFOs = X = C_x * A
● Iterations = Output FIFOs = N = C_N * X
● Burst Length
● Input FIFO Length
● Iteration Length
● Output FIFO Length
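The relations between the parameters can be worked through in a few lines. This sketch assumes that C_x and C_N are positive integer multipliers and that M is a whole multiple of W; the slides give the formulas but do not state these constraints. The struct and function names are illustrative.

```c
#include <assert.h>

typedef struct {
    unsigned bus_width;  /* M, memory bus width in bits */
    unsigned word_width; /* W, word width in bits */
    unsigned c_x;        /* C_x, assumed to be a positive integer */
    unsigned c_n;        /* C_N, assumed to be a positive integer */
} template_params;

typedef struct {
    unsigned max_words;   /* A = M / W, words per bus transfer */
    unsigned input_fifos; /* X = C_x * A */
    unsigned iterations;  /* N = C_N * X, also the number of output FIFOs */
} derived_params;

/* Compute the derived template parameters from the chosen ones. */
derived_params derive(template_params p)
{
    /* Assumption: the bus carries a whole number of words. */
    assert(p.word_width > 0 && p.bus_width % p.word_width == 0);
    derived_params d;
    d.max_words = p.bus_width / p.word_width;
    d.input_fifos = p.c_x * d.max_words;
    d.iterations = p.c_n * d.input_fifos;
    return d;
}
```

With M = 64, W = 16, C_x = 2 and C_N = 3, this gives A = 4, X = 8 and N = 24.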

Example – Reading Values

[Figure legend: Values in Memory, Values to be read, Byte enabled, Byte disabled, Values processed]

Example – Processing Values

[Figure: same legend]

Example – Writing Values

[Figure: same legend]

Future Work

● More templates for other parallel for loops
    ● Pipelined loop body
    ● Data reuse
● Compiler identifies parallel for loop
    ● No keywords
    ● Check for loop dependencies, and do loop transformations if required
● Compiler suggests best memory template
    ● Chosen based on performance estimate
    ● Design space exploration using templates
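Choosing a configuration from a performance estimate can be sketched as an exhaustive search over candidate parameter sets. Everything here is hypothetical: the `candidate` fields, the function names, and especially `toy_estimate`, which is a made-up placeholder and not a real performance model from the authors' compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate configuration for one template. */
typedef struct {
    unsigned c_x;       /* input FIFO multiplier */
    unsigned burst_len; /* words per burst */
} candidate;

/* An estimator returns a cost for a candidate; lower is better. */
typedef double (*estimate_fn)(const candidate *c);

/* Made-up placeholder cost: a term that falls as parallelism and burst
 * length grow, plus a term that rises with hardware use. */
double toy_estimate(const candidate *c)
{
    return 1000.0 / (double)(c->c_x * c->burst_len) + 2.0 * (double)c->c_x;
}

/* Return the index of the candidate with the lowest estimated cost.
 * Ties keep the earliest candidate. */
size_t best_candidate(const candidate *cands, size_t n, estimate_fn estimate)
{
    assert(n > 0);
    size_t best = 0;
    double best_cost = estimate(&cands[0]);
    for (size_t i = 1; i < n; i++) {
        double cost = estimate(&cands[i]);
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;
}
```

Because the estimator is passed in as a function pointer, the same search could compare parameter values within one template or candidates from different templates, given a suitable cost function.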

Conclusions

● HLS tools don't create memory designs
● Manual memory designs can take days/weeks/months to complete
● Parametrized memory template designs are generated in seconds
● Easy to perform design space exploration using different parameter values and/or templates

Thank You!

Questions?

craig.moore@elis.ugent.be
http://www.elis.ugent.be/~cmoore

Wim Meeus*, Harald Devos‡, and Dirk Stroobandt*
*{wim.meeus, dirk.stroobandt}@elis.ugent.be, ‡devos.harald@gmail.com