DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators
for FPGAs
Xiaofan Zhang1, Junsong Wang2, Chao Zhu2, Yonghua Lin2,
Jinjun Xiong3, Wen-mei Hwu1, Deming Chen1
1UIUC, 2IBM Research-China, 3IBM T. J. Watson Research Center
IBM Research AI Systems Day
ICCAD’18 Best Paper Award
Outline:
1. Background
2. Motivations
3. Automation Flow
4. Accelerator Architecture
5. Design Space Exploration
6. Experimental Results
7. Conclusions
Background
Deploying deep learning workloads in the cloud
Major requirements:
• Throughput performance
• Tail latency
• Power efficiency
[Figure labels: Recommendations, Auto-gen Sport Highlights, DNN Design]
Background
Deploying deep learning workloads at the edge
Major requirements:
• Real-time ability
• Energy-efficient design
• Area constraint
Motivation
FPGAs deliver improved latency & energy efficiency (vs. CPUs, GPUs)
Try FPGAs for both cloud and edge computing
But...
• FPGAs have limited computation & memory resources (DSPs, BRAMs)
• Large design & test efforts (RTL programming, HW verification, ...)
• Challenges in resource allocation for unbalanced DNN layers
We need:
An end-to-end automated tool for mapping DNNs to FPGAs
Automation Design Flow
To bridge the gap between fast DNN construction in software and slow hardware implementation
A three-step solution: Design, Generation, and Execution
➢ Low latency
➢ High throughput
➢ Efficient use of FPGA on-chip memory
➢ Automatic on-chip resource allocation
Architecture
Overview of the proposed accelerator design
➢ A fine-grained layer-based pipeline structure
1) Higher throughput
2) Better support of streaming inputs
3) Higher efficiency with dedicated design for each DNN layer
➢ A column-based cache scheme
1) Lower latency, lower on-chip memory demands
2) Support for HD input
3) Real-time capability
Architecture
A fine-grained layer-based pipelined architecture
[Figure: proposed design vs. general design]
Higher throughput vs. recurrent structure
Lower latency vs. conventional pipeline structure
7.7x lower latency when running YOLO
Architecture
Pipelined stages instantiated on FPGA
➢ 2-dim parallelism
KPF - kernel parallel factor
CPF - channel parallel factor
➢ Arbitrary quantization
DW - bit-width for feature map
WW - bit-width for weight/bias
[Figure labels: External memory (DRAM), On-chip memory (BRAM), Computation resources]
Architecture
RTL IPs for different DNN layers
• Adjustable parallel factor = CPF x KPF (more/less DSP utilization)
• On-chip buffers for sufficient data supply
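One way to read the parallel factor: a stage with channel parallel factor CPF and kernel parallel factor KPF issues CPF x KPF multiply-accumulates per cycle. The sketch below is an illustrative assumption, not the tool's actual cost model. The function names, the example layer shape, and the idealized cycle estimate (full utilization, no stalls) are all hypothetical.

```python
import math


def parallel_factor(cpf: int, kpf: int) -> int:
    """MACs issued per cycle by one stage (CPF x KPF, as on the slide)."""
    return cpf * kpf


def conv_cycles(c_in, c_out, k, h_out, w_out, cpf, kpf):
    """Idealized cycle count for one conv layer (assumption: no stalls).

    Input channels are consumed CPF at a time and output kernels KPF at a
    time, so each output position needs
    ceil(c_in / cpf) * ceil(c_out / kpf) * k * k cycles.
    """
    per_position = math.ceil(c_in / cpf) * math.ceil(c_out / kpf) * k * k
    return per_position * h_out * w_out


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 56x56 output.
slow = conv_cycles(64, 128, 3, 56, 56, cpf=8, kpf=8)
fast = conv_cycles(64, 128, 3, 56, 56, cpf=16, kpf=16)
print(parallel_factor(8, 8), slow)
print(parallel_factor(16, 16), fast)
```

Under these assumptions, quadrupling the parallel factor (64 to 256) cuts the cycle count by 4x, at the cost of proportionally more DSPs.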
Architecture
A column-based cache scheme
➢ Save on-chip memory
➢ Adjust data reuse factor
For example:
Kernel size = 3
Stride = 1
Cache 4 slices on-chip instead of keeping the whole feature maps
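The example above can be sketched numerically. The rule `kernel + stride` is an assumption that generalizes the slide's single data point (kernel 3, stride 1, 4 slices): `kernel` columns are in use while `stride` columns are being loaded. The function names are hypothetical, and the 1280-column width is borrowed from the case-study input size.

```python
def cached_columns(kernel: int, stride: int) -> int:
    """Columns kept on-chip: `kernel` in use plus `stride` being loaded.

    Assumption: generalizes the slide's example (kernel 3, stride 1 -> 4).
    """
    return kernel + stride


def buffer_reduction(width: int, kernel: int, stride: int) -> float:
    """Ratio of whole-feature-map storage to column-cache storage.

    Both buffers share the same height, channel count, and bit-width,
    so only the number of columns differs.
    """
    return width / cached_columns(kernel, stride)


print(cached_columns(3, 1))          # 4
print(buffer_reduction(1280, 3, 1))  # 320.0
```

The saving shrinks as feature maps get narrower in deeper layers, since the cached column count stays fixed while the full width drops.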
Architecture
Column-based cache scheme: 43x lower BRAM usage when running YOLO
BRAM usage reduction for keeping feature maps: from 7x to 320x, 43x on average
Design Space Exploration
An automatic resource allocator
Step 1: Computation allocation
Total capability: 100 GOPS; total BW: 10 GB/s
[Figure: CTC of Conv1-Conv5 and FC, split into computation-bound and memory-bound regions]
Conv1 -> 15 GOPS
Conv2 -> 15 GOPS
Conv3 -> 21 GOPS
Conv4 -> 8 GOPS
Conv5 -> 5 GOPS
FC layer maximum usage: 5 GOPS
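The computation-bound / memory-bound split can be read through a roofline-style bound, where CTC is the computation-to-communication ratio in operations per byte. This is a minimal sketch under that assumption, not the allocator itself. The CTC values are invented for illustration, and it simplifies by letting a single layer use the full 10 GB/s.

```python
def attainable_gops(allocated_gops: float, ctc: float, bw_gbps: float) -> float:
    """Roofline bound: throughput cannot exceed CTC (ops/byte) x bandwidth (GB/s)."""
    return min(allocated_gops, ctc * bw_gbps)


def is_memory_bound(allocated_gops: float, ctc: float, bw_gbps: float) -> bool:
    """True when bandwidth, not allocated compute, limits the layer."""
    return ctc * bw_gbps < allocated_gops


TOTAL_BW = 10.0  # GB/s, from the slide

# Hypothetical FC layer: a CTC of 0.5 ops/byte caps it at 5 GOPS,
# however much compute is allocated to it.
print(attainable_gops(20.0, 0.5, TOTAL_BW))   # 5.0
print(is_memory_bound(20.0, 0.5, TOTAL_BW))   # True

# Hypothetical conv layer with high data reuse: limited by compute.
print(attainable_gops(15.0, 40.0, TOTAL_BW))  # 15.0
print(is_memory_bound(15.0, 40.0, TOTAL_BW))  # False
```

Allocating more compute than the roofline bound to a memory-bound layer would waste DSPs, which is one reading of why the FC layer is capped.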
Design Space Exploration
An automatic resource allocator
Step 2: Memory bandwidth adjustment
Total capability: 100 GOPS; total BW: 10 GB/s
To meet the BW constraint:
• Cache one more column (figure labels: Col. 1, Col. 2)
• CTC increases
• Required memory BW drops
[Figure: CTC of Conv3 relative to the computation-bound and memory-bound regions]
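The bandwidth adjustment can be sketched as a greedy loop: while the summed bandwidth demand exceeds the budget, cache one more column for the most bandwidth-hungry layer. This is a hypothetical illustration, not the published algorithm. In particular, the linear model in which each extra column adds a fixed amount to a layer's CTC, the starting CTC values, and all names are assumptions.

```python
def required_bw(gops: float, ctc: float) -> float:
    """Bandwidth (GB/s) needed to sustain `gops` at a CTC of `ctc` ops/byte."""
    return gops / ctc


def fit_bandwidth(layers, total_bw, ctc_gain_per_col=1.0, max_extra_cols=64):
    """Return extra cached columns per layer so total demand fits `total_bw`.

    layers: dict of name -> {"gops": float, "ctc": float}
    Assumption: each extra cached column raises CTC by `ctc_gain_per_col`.
    """
    ctc = {name: spec["ctc"] for name, spec in layers.items()}
    extra_cols = {name: 0 for name in layers}

    def demand(name):
        return required_bw(layers[name]["gops"], ctc[name])

    while sum(demand(name) for name in layers) > total_bw:
        worst = max(layers, key=demand)  # layer with the highest demand
        if extra_cols[worst] >= max_extra_cols:
            raise ValueError("bandwidth constraint cannot be met")
        ctc[worst] += ctc_gain_per_col   # cache one more column
        extra_cols[worst] += 1
    return extra_cols


# Hypothetical CTC values; GOPS figures for Conv3 and Conv4 are from Step 1.
layers = {
    "conv3": {"gops": 21.0, "ctc": 2.0},
    "conv4": {"gops": 8.0, "ctc": 4.0},
}
result = fit_bandwidth(layers, total_bw=10.0)
print(result)
```

In this toy setup the initial demand is 12.5 GB/s; caching one more column for conv3 lowers it to 9 GB/s, which fits the 10 GB/s budget.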
Experimental Results
Case study: real-time pedestrian/cyclist/car detection
YOLO9000 with HD input (1280x384, 20 FPS) is mapped to a Xilinx Zynq ZC706 FPGA @ 200 MHz
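The figures on this slide imply a per-frame budget. The arithmetic below is a back-of-the-envelope sketch. It assumes the design must sustain 20 FPS in steady state and says nothing about end-to-end latency, which in a pipelined design can differ from the frame period.

```python
CLOCK_HZ = 200_000_000    # 200 MHz, from the slide
FPS = 20                  # from the slide
WIDTH, HEIGHT = 1280, 384 # input resolution, from the slide

frame_period_ms = 1000 / FPS        # time available per frame
cycles_per_frame = CLOCK_HZ // FPS  # clock cycles available per frame
pixels_per_frame = WIDTH * HEIGHT   # input pixels per frame

print(frame_period_ms)    # 50.0
print(cycles_per_frame)   # 10000000
print(pixels_per_frame)   # 491520
```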
Experimental Results
Accuracy results after 16-bit & 8-bit quantization
*f.-t. in Design indicates that the accuracy results were collected after retraining and fine-tuning
Experimental Results
Comparison: Embedded FPGAs for edge devices
Zynq XC7Z045
• LUT: 218,600
• FF: 437,200
• BRAM: 545
• DSP: 900
Peaking at 524 GOPS
Comparison: High-performance FPGAs for cloud computing
KU115
• LUT: 663,360
• FF: 1,326,720
• BRAM: 2160
• DSP: 5520
Peaking at 4022 GOPS
Experimental Results
Comparison: AlexNet inference performance, GPU vs. FPGA
Conclusions
➢ We presented DNNBuilder for building DNN accelerators on FPGAs
1) an automation tool (Design, Generation, and Execution)
2) a fine-grained layer-based pipeline architecture
3) a column-based cache scheme
4) an automatic resource allocation algorithm
➢ We delivered state-of-the-art performance and power efficiency
1) the best throughput: 4022 (KU115) and 524 GOPS (ZC706)
2) the best efficiency: 180.4 (KU115) and 72.8 GOPS/W (ZC706)
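Dividing throughput by efficiency gives a rough implied power draw. This sketch assumes each board's throughput and efficiency figures come from the same configuration, which the slides do not state. The function name is hypothetical.

```python
def implied_power_watts(gops: float, gops_per_watt: float) -> float:
    """Power implied by a throughput figure and an efficiency figure."""
    return gops / gops_per_watt


ku115_w = implied_power_watts(4022, 180.4)
zc706_w = implied_power_watts(524, 72.8)
print(round(ku115_w, 1))  # 22.3
print(round(zc706_w, 1))  # 7.2
```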
Q & A
#AI Research Week, hosted by MIT-IBM Watson AI Lab