fpga-based accelerator for long short-term memory ... · 1 fpga-based accelerator for long...
TRANSCRIPT
1
FPGA-based Accelerator for Long Short-Term
Memory Recurrent Neural Networks
Yijin Guan1, Zhihang Yuan1, Guangyu Sun1,3, Jason Cong2,3,1
1Center for Energy-Efficient Computing and Applications, Peking University, China2Computer Science Department, University of California, Los Angeles, USA
3PKU/UCLA Joint Research Institute in Science and Engineering
2
Deep Learning
Scenarios
Applications
3
Recurrent Neural Network
Input Layer
Hidden Layer
Output Layer
Feed-forward NN RNN
Input Layer (t)
Hidden Layer (t)
Output Layer (t)
Hidden Layer (t-1)
Recurrent
Connection
4
Recurrent Neural Network
Input Layer (t-1)
Hidden Layer (t-1)
Output Layer (t-1)
Input Layer (t)
Hidden Layer (t)
Output Layer (t)
Input Layer (t+1)
Hidden Layer (t+1)
Output Layer (t+1)
…
RNN unfolds into a DNN over time
…
5
Long Short-Term Memory
Input Gate
Wxi & Whi
it
xt ht-1
𝒊𝒕 = σ (𝑾𝒙𝒊𝒙𝒕 + 𝐖𝒉𝒊𝒉𝒕−𝟏 + 𝒃𝒊)
6
Long Short-Term Memory
Input Gate
Wxi & Whi Wxf & Whf
Forget Gate
itft
xt ht-1
𝒇𝒕 = σ (𝑾𝒙𝒇𝒙𝒕 + 𝐖𝒉𝒇𝒉𝒕−𝟏 + 𝒃𝒇)
7
Long Short-Term Memory
Input Gate
Wxi & Whi
Cell Gate
Wxc & Whc
Wxf & Whf
Forget Gate
itft
xt ht-1
𝒄t
𝒄𝒕 = tanh (𝑾𝒙𝒄𝒙𝒕 + 𝐖𝒉𝒄𝒉𝒕−𝟏 + 𝒃𝒄)
8
Long Short-Term Memory
Input Gate
Wxi & Whi
Cell Gate
Wxc & Whc
Wxf & Whf
Forget Gate
itft
ct
ct-1
xt ht-1
𝒄t
𝒄𝒕 = 𝒇𝒕⊙𝒄𝒕−𝟏 + 𝒊𝒕⊙𝒄𝒕
9
Long Short-Term Memory
Input Gate
Wxi & Whi
Cell Gate
Wxc & Whc
Wxf & Whf
Forget Gate
Wxo & Who
Output Gate
itft
ct
ct-1
xt ht-1
𝒄t ot
𝒐𝒕 = σ (𝑾𝒙𝒐𝒙𝒕 + 𝐖𝒉𝒐𝒉𝒕−𝟏 + 𝒃𝒐)
10
Long Short-Term Memory
Input Gate
Wxi & Whi
Cell Gate
Wxc & Whc
Wxf & Whf
Forget Gate
Wxo & Who
Output Gate
itft
ct
ct-1
xt ht-1
ht
𝒄t ot
𝒉𝒕 = 𝒐𝒕⊙ tanh(𝒄𝒕)
11
Why FPGA
Adaptability
Performanc
Energy EfficiencyProgrammability
Scalability
GPU
Adaptability
Performance
Energy EfficiencyProgrammability
Scalability
FPGA
Adaptability
Performance
Energy EfficiencyProgrammability
Scalability
ASIC
Adaptability
Performance
Energy EfficiencyProgrammability
Scalability
CPU
12
Design Challenges and Optimizations
FPGA Chip
Computation
Engine
Data
Buffers
Off-chip
Memory
13
Design Challenges and Optimizations
FPGA Chip
Computation
Engine
Data
Buffers
Off-chip
Memory
Computation Resources & Performance
Loop Unroll
Deep Pipeline
14
Design Challenges and Optimizations
FPGA Chip
Computation
Engine
Data
Buffers
Off-chip
Memory
On-chip Memory Resources
Loop Tiling
Eclectic Data Partition
15
Design Challenges and Optimizations
FPGA Chip
Computation
Engine
Data
Buffers
Off-chip
Memory
Bandwidth
Ping-pong Buffers
Reshaping Data Layout
16
FPGA Chip
System Design
MicroBlaze
LSTM
Accelerator
Timer
Data
Dispatcher
AX
I4 B
us
DD
R3 D
RA
M UART
AX
I4L
ite
Bu
s
Vivado High-level Synthesis (v2015.4)
Vivado Design Suite (v2015.4)
17
Accelerator Design
f c o
LSTM Functional Logic
Input Group 0 Input Group 1
Output Group 0 Output Group 1
Cell Buffer
i
18
Experimental Results
0
5
10
15
20
25
1 thread -O3 16 thread -O3 Our Imp.
Speedup
Device Model Freq. Development Env.
CPU Xeon E5-2430 2.20 GHz gcc -O3 & OpenMP
FPGA Xilinx Virtex7-485t 150 MHz Vivado Design Suite
5.4x
20.2x
19
Experimental Results
0
5
10
15
20
25
Previous Imp. A Previous Imp. B Our Imp. A Our Imp. B
Speedup
Imp. Model Freq. Data Precision
Previous Imp. Xilinx Zynq7020 142MHz Fixed-16
Our Imp. Xilinx Virtex7-485t 150 MHz Float-32
~47%
15.5x
2x
20
Future Work
Data quantization
Low-precision fixed-point numbers
Model compression
Connection Pruning
Matrix compression (e.g. SVD)
General architecture
Support for all LSTM variants
21
Conclusions
An accelerator for LSTM-RNN
Optimizations for computation & communication
at architecture level
On-board implementation with high-performance
computation engines & data dispatcher
Outperforms CPU- & other FPGA- implementations
22
Thank You
Q & A