Altera Power Efficient Solutions
TRANSCRIPT
Power Efficient Solutions w/ FPGAs
Bill Jenkins
Altera Sr. Product Specialist for Programming Language Solutions
System Challenges
Market Reaction: Growth of customized hardware and architectures…
[Diagram: a CPU connected to memory and I/O, with a bottleneck marked on each link]
– Bottlenecks are starving the CPU for data
– Result: slow performance (high latency) and excessive power consumption
– The CPU architecture is inefficient for most parallel computing applications (big data, search)
Role of FPGA
– Resource Sharing: virtualization of computation, storage, networking
– Accelerators: network acceleration, hypervisor offload, data access acceleration, algorithm acceleration
– Cluster Computing: CPU and FPGA, cluster fabric, cluster interconnect
[Diagram: host CPU with DRAM, attached to an FPGA over the cluster interconnect]
FPGAs Increase Efficiency in the Data Center
FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks.
– Massively parallel architecture: 10 to 100 times the number of computational units
– Enables pipelined designs that perform multiple, different instructions in a single clock cycle
– Better localized memory avoids bottlenecks
– Programmability enables application-specific accelerators
Result: 10X+ increase in performance per watt.
Device capabilities: >5M logic elements, 1.5 TFLOPs of floating-point DSP, programmable I/O, 3200 Mbps DDR4 SDRAM, 2.5 Tbps HMC.
Mapping a simple program to an FPGA
High-level code:
    Mem[100] += 42 * Mem[101]
CPU instructions:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
First let’s take a look at execution on a simple CPU.
[Diagram: a simple CPU datapath with instruction fetch (PC), a register file with A/B/C address ports, an ALU, and load/store units]
Fixed and general architecture:
– General “cover-all-cases” data-paths
– Fixed data-widths
– Fixed operations
Load constant value into register
[Diagram: the same CPU datapath, with only the small portion needed to load a constant active]
Very inefficient use of hardware!
CPU activity, step by step
One instruction executes per step, serially in time:
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
On the FPGA we unroll the CPU hardware…
[Diagram: one copy of the CPU datapath per instruction, laid out in space rather than sequenced in time]
    R0 ← Load Mem[100]
    R1 ← Load Mem[101]
    R2 ← Load #42
    R2 ← Mul R1, R2
    R0 ← Add R2, R0
    Store R0 → Mem[100]
… and specialize by position
1. Instructions are fixed. Remove “Fetch”.
… and specialize
2. Remove unused ALU ops.
… and specialize
3. Remove unused Load / Store.
… and specialize
4. Wire up registers properly, and propagate state.
… and specialize
5. Remove dead data.
… and specialize
6. Reschedule!
Custom Data-Path on the FPGA Matches Your Algorithm!
Build exactly what you need:
– Operations
– Data widths
– Memory size & configuration
Efficiency: throughput / latency / power
High-level code:
    Mem[100] += 42 * Mem[101]
Custom data-path: [Diagram: two loads and the constant 42 feed a multiplier and an adder, followed by a store]
Architectural Example: Image Processing
Convolutions: dataflow can proceed in pipelined fashion
– No need to wait until the entire execution is complete
– Start a new set of data calculations as soon as the first stage completes its execution

    I_new(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_old(x + x', y + y') \times F(x', y')
Processor (CPU/GPU) Implementation
A cache can hide poor memory access patterns.
[Diagram: CPU reading the image through a cache backed by main memory]

    for (int y = 1; y < height - 1; ++y)
      for (int x = 1; x < width - 1; ++x)
        for (int y2 = -1; y2 <= 1; ++y2)
          for (int x2 = -1; x2 <= 1; ++x2)
            i2[y][x] += i[y + y2][x + x2] * filter[y2 + 1][x2 + 1];
FPGA Implementation
Example performance point: 1 pixel per cycle
Cache requirements: 9 reads + 1 write per cycle
Expensive hardware!
– Power overhead
– Cost overhead: more built-in addressing flexibility than we need
Why not customize the cache for the application?
[Diagram: custom data-path fed by a cache with 9 read ports, backed by memory]
Optimizing the “Cache”
Start out with the initial picture that is W pixels wide.
Optimizing the “Cache”
Let’s remove all the lines that aren’t in the neighborhood of the window.
Optimizing the “Cache”
Take all of the lines and arrange them as a 1D array of pixels.
Optimizing the “Cache”
Remove the pixels at the edges that we don’t need for the computation.
Optimizing the “Cache”
What happens when we move the window one pixel to the right?
We have created a shift register implementation.
Shift Registers in Software

    pixel_t sr[2*W + 3];
    while (keep_going) {
        // Shift data in
        #pragma unroll
        for (int i = 2*W + 2; i > 0; --i)
            sr[i] = sr[i - 1];
        sr[0] = data_in;

        // Tap output data (9 window pixels)
        data_out = {sr[0],   sr[1],       sr[2],
                    sr[W],   sr[W + 1],   sr[W + 2],
                    sr[2*W], sr[2*W + 1], sr[2*W + 2]};
        // ...
    }

Managing data movement to match the FPGA’s architectural strengths is key to obtaining high performance.
Traditional OpenCL Implementation of a Pipeline (CPU/GPU)
– High latency: requires access to global memory
– High memory bandwidth
– Requires host coordination to pass buffers from one kernel to another
With a particular design example we achieved 183 images/s on a Stratix V PCIe card.
[Diagram: Kernel 1, Kernel 2, and Kernel 3 exchanging buffers through global memory (DDR)]
Leveraging Kernel-to-Kernel Channels
– Low-latency communication between kernels
– Significantly lower memory bandwidth requirements
– Host is not involved in coordinating communication between kernels
This implementation on the same Stratix V PCIe card resulted in 400 images/s.
[Diagram: Kernel 1, Kernel 2, and Kernel 3 connected directly by channels, with global memory (DDR) buffers only at the input and output]

• Channel declaration (create a queue):
    channel int my_channel;
• Channel write (push data into the queue):
    write_channel_altera(my_channel, x);
• Channel read (pop the first element from the queue):
    int y = read_channel_altera(my_channel);
FPGA Code
Kernels are written as standard building blocks that are connected together through channels.
The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs:
– Offered as a vendor extension
– Portable in OpenCL 2.0 through the concept of “OpenCL Pipes”

    #pragma OPENCL EXTENSION cl_altera_channels : enable

    // Declaration of Channel API data types
    channel float prod_k1_channel;
    channel float k1_k2_channel;
    channel float k2_k3_channel;
    channel float k3_res_channel;

    __kernel void convolution_prod(int batch_id_begin,
                                   int batch_id_end,
                                   __global const volatile float * restrict input_global)
    {
        for (...) {
            write_channel_altera(prod_k1_channel, input_global[...]);
            write_channel_altera(k1_k2_channel, input_global[...]);
            ...
        }
    }
Migration Between FPGAs
– In OpenCL, a float uses soft logic on older FPGAs
– Gen10 FPGAs have hardened floating-point logic built into the DSP blocks
– On Arria 10, the same code results in processing 6800 images/s
Stratix 10 expectations:
– Large increase in floating-point resources
– Higher internal frequencies achievable
– 1.6x-2x performance increase
– 12x-16x performance/watt efficiency versus Stratix V
Additional Improvements: IO Channels
– Kernel channels are between OpenCL kernels
– IO channels take data directly from and to IO interfaces on the FPGA
  – A camera or video feed could be processed directly in the FPGA without going through the host
  – The result could be passed out to the graphics card to be displayed, or back to host memory for the host to use
– Private, local, and global memory can now be used to buffer as needed
[Diagram: Kernel 1, Kernel 2, and Kernel 3 linked by kernel channels/pipes inside the FPGA, with IO channels at both ends]
Lessons Learned
Exploiting pipelining on the FPGA requires some attention to coding style, to overcome the inherent assumptions of writing “software”:
– FPGAs do not have caches; data reuse must be exploited in a more explicit way
The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory:
– Bandwidth limitations begin to dominate compute
– Use direct kernel-to-kernel communication called channels
Native support for floating point on the FPGA allows an order-of-magnitude performance increase.
Code can be ported to newer FPGAs without modification to get a performance increase.
IO channels can lower latency and improve performance further by taking the host out of the processing chain entirely.