Altera Power Efficient Solutions
TRANSCRIPT
Power Efficient Solutions w/ FPGAs
Bill Jenkins
Altera Sr. Product Specialist for Programming Language Solutions
System Challenges
Market Reaction: Growth of customized hardware and architectures…
[Diagram: a CPU connected to memory and I/O, with a bottleneck marked on each link]
– Bottlenecks are starving the CPU for data
– Result: slow performance (high latency) and excessive power consumption
– The CPU architecture is inefficient for most parallel computing applications (big data, search)
Role of FPGA
– Resource Sharing: virtualization of computation, storage, networking
– Accelerators: network acceleration, hypervisor offload, data access acceleration, algorithm acceleration
– Cluster Computing: CPU and FPGA, cluster fabric, cluster interconnect
[Diagram: host CPU with DRAM, attached to an FPGA over the cluster interconnect]
FPGAs Increase Efficiency in the Data Center
FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks.
– Massively parallel architecture: 10 to 100 times the number of computational units
– Enables pipelined designs that perform multiple, different instructions in a single clock cycle
– Better localized memory avoids bottlenecks
– Programmability enables application-specific accelerators
Result: 10X+ increase in performance per watt.
Device capabilities: >5M logic elements, 1.5 TFLOPs of floating-point DSP, programmable I/O, 3200 Mbps DDR4 SDRAM, 2.5 Tbps HMC.
Mapping a simple program to an FPGA
High-level code:
    Mem[100] += 42 * Mem[101]
CPU instructions:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
First let’s take a look at execution on a simple CPU.
[Diagram: a simple CPU datapath with instruction fetch (PC), a register file with A/B/C address ports, an ALU, and load/store units]
Fixed and general architecture:
– General “cover-all-cases” data-paths
– Fixed data-widths
– Fixed operations
Load constant value into register
[Diagram: the same CPU datapath, with only the small portion needed to load a constant active]
Very inefficient use of hardware!
CPU activity, step by step
One instruction executes per step, serially in time:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
On the FPGA we unroll the CPU hardware…
[Diagram: one copy of the CPU datapath per instruction, laid out in space rather than sequenced in time]
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
… and specialize by position
1. Instructions are fixed. Remove “Fetch”.
… and specialize
2. Remove unused ALU ops.
… and specialize
3. Remove unused Load / Store.
… and specialize
4. Wire up registers properly, and propagate state.
… and specialize
5. Remove dead data.
… and specialize
6. Reschedule!
Custom Data-Path on the FPGA Matches Your Algorithm!
Build exactly what you need:
– Operations
– Data widths
– Memory size & configuration
Efficiency: throughput / latency / power
High-level code:
    Mem[100] += 42 * Mem[101]
Custom data-path: [Diagram: two loads and the constant 42 feed a multiplier and an adder, followed by a store]
Architectural Example: Image Processing
Convolutions: dataflow can proceed in pipelined fashion
– No need to wait until the entire execution is complete
– Start a new set of data calculations as soon as the first stage completes its execution

    I_new(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_old(x + x', y + y') \times F(x', y')
Processor (CPU/GPU) Implementation
A cache can hide poor memory access patterns.
[Diagram: CPU reading the image through a cache backed by main memory]

    for (int y = 1; y < height - 1; ++y)
      for (int x = 1; x < width - 1; ++x)
        for (int y2 = -1; y2 <= 1; ++y2)
          for (int x2 = -1; x2 <= 1; ++x2)
            i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
FPGA Implementation
Example performance point: 1 pixel per cycle
Cache requirements: 9 reads + 1 write per cycle
Expensive hardware!
– Power overhead
– Cost overhead: more built-in addressing flexibility than we need
Why not customize the cache for the application?
[Diagram: custom data-path fed by a cache with 9 read ports, backed by memory]
Optimizing the “Cache”
Start out with the initial picture that is W pixels wide.
Optimizing the “Cache”
Let’s remove all the lines that aren’t in the neighborhood of the window.
Optimizing the “Cache”
Take all of the lines and arrange them as a 1D array of pixels.
Optimizing the “Cache”
Remove the pixels at the edges that we don’t need for the computation.
Optimizing the “Cache”
What happens when we move the window one pixel to the right?
We have created a shift register implementation.
Shift Registers in Software

    pixel_t sr[2*W + 3];
    while (keep_going) {
        // Shift data in
        #pragma unroll
        for (int i = 2*W + 2; i > 0; --i)
            sr[i] = sr[i - 1];
        sr[0] = data_in;

        // Tap output data (9 window pixels)
        data_out = {sr[0],   sr[1],       sr[2],
                    sr[W],   sr[W + 1],   sr[W + 2],
                    sr[2*W], sr[2*W + 1], sr[2*W + 2]};
        // ...
    }

Managing data movement to match the FPGA’s architectural strengths is key to obtaining high performance.
Traditional OpenCL Implementation of a Pipeline (CPU/GPU)
– High latency: requires access to global memory
– High memory bandwidth
– Requires host coordination to pass buffers from one kernel to another
With a particular design example we achieved 183 images/s on a Stratix V PCIe card.
[Diagram: Kernel 1, Kernel 2, and Kernel 3 exchanging buffers through global memory (DDR)]
Leveraging Kernel-to-Kernel Channels
– Low-latency communication between kernels
– Significantly lower memory bandwidth requirements
– Host is not involved in coordinating communication between kernels
This implementation on the same Stratix V PCIe card resulted in 400 images/s.
[Diagram: Kernel 1, Kernel 2, and Kernel 3 connected directly by channels, with global memory (DDR) buffers only at the input and output]

• Channel declaration (create a queue):
    channel int my_channel;
• Channel write (push data into the queue):
    write_channel_altera(my_channel, x);
• Channel read (pop the first element from the queue):
    int y = read_channel_altera(my_channel);
FPGA Code
Kernels are written as standard building blocks that are connected together through channels.
The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs:
– Offered as a vendor extension
– Portable in OpenCL 2.0 through the concept of “OpenCL Pipes”

    #pragma OPENCL EXTENSION cl_altera_channels : enable

    // Declaration of Channel API data types
    channel float prod_k1_channel;
    channel float k1_k2_channel;
    channel float k2_k3_channel;
    channel float k3_res_channel;

    __kernel void convolution_prod(int batch_id_begin,
                                   int batch_id_end,
                                   __global const volatile float * restrict input_global)
    {
        for (...) {
            write_channel_altera(prod_k1_channel, input_global[...]);
            write_channel_altera(k1_k2_channel, input_global[...]);
            ...
        }
    }
Migration Between FPGAs
– In OpenCL, a float uses soft logic on older FPGAs
– Gen10 FPGAs have hardened floating-point logic built into the DSP blocks
– On Arria 10, the same code results in processing 6800 images/s
Stratix 10 expectations:
– Large increase in floating-point resources
– Higher internal frequencies achievable
– 1.6x-2x performance increase
– 12x-16x performance/watt efficiency versus Stratix V
Additional Improvements: IO Channels
– Kernel channels are between OpenCL kernels
– IO channels take data directly from and to IO interfaces on the FPGA
  – A camera or video feed could be processed directly in the FPGA without going through the host
  – The result could be passed out to the graphics card to be displayed, or back to host memory for the host to use
– Private, local, and global memory can now be used to buffer as needed
[Diagram: Kernel 1, Kernel 2, and Kernel 3 linked by kernel channels/pipes inside the FPGA, with IO channels at both ends]
Lessons Learned
Exploiting pipelining on the FPGA requires some attention to coding style, to overcome the inherent assumptions of writing “software”:
– FPGAs do not have caches; data reuse must be exploited in a more explicit way
The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory:
– Bandwidth limitations begin to dominate compute
– Use direct kernel-to-kernel communication called channels
Native support for floating point on the FPGA allows an order-of-magnitude performance increase.
Code can be ported to newer FPGAs without modification to get a performance increase.
IO channels can lower latency and improve performance further by taking the host out of the processing chain entirely.