Tool-flows for mapping CNNs into FPGAs: Trends and Challenges
Christos [email protected]
Intelligent Digital Systems Lab Electrical and Electronic Engineering DepartmentImperial College London
Artificial Intelligence – Machine Learning – Deep Neural Networks
A hierarchy of fields: Artificial Intelligence → Machine Learning → Deep Neural Networks → CNNs
[Figure: YOLO v2 object detection example]
Convolutional Neural Network
• Feature extractor followed by fully-connected layers
Convolutional Neural Network - Trends
A Deep Learning Software Ecosystem
First wave: Caffe (UC Berkeley), Torch (NYU / Facebook), Theano (Univ. of Montreal)
Second wave: TensorFlow (Google), CoreML (Apple), TensorRT (NVIDIA), plus frameworks from Facebook, Microsoft and Amazon, and MATLAB's Neural Network Toolbox
Training and Inference
CNN Deployment Flow
Deep Learning Framework
User input: CNN structure + trained weights
Optimised mapping onto target platforms: CPU, GPU (TK1, TX1 & TX2), Qualcomm Snapdragon, Apple A11
CNN Deployment Flow (cont.)
The same flow, with the FPGA added to the set of target platforms; the optimised mapping must now account for the FPGA platform's resources.
CNNs: An opportunity for tool-flows for Reconfigurable Hardware
In the last few years, significant progress in generic FPGA HLS tools
• Vivado HLS, Intel OpenCL, MaxCompiler, LegUp, etc.
• Generate designs based on the mapping and scheduling of low-level primitive operations → large design space.
• Low-level entry point
CNN workloads are highly structured
• Layers with pre-defined types and parametrisation
Opportunity for domain-specific frameworks for CNNs
• Generate optimised architectures
Existing CNN-to-FPGA tool-flows
Toolflow        Interface       Year            Venue
fpgaConvNet     Caffe & Torch   May 2016        FCCM 2016
DeepBurning     Caffe           June 2016       DAC 2016
Angel-Eye       Proprietary     July 2016       FPGA 2016
ALAMO           Proprietary     August 2016     FPL 2016
DnnWeaver       Caffe           October 2016    MICRO 2016
Caffeine        Caffe           November 2016   ICCAD 2016
FINN            Theano          February 2017   FPGA 2017
FP-DNN          TensorFlow      May 2017        FCCM 2017
SysArrayAccel   C               June 2017       DAC 2017
Key aspects for consideration – How do they compare?
[Chart: evaluating tool-flows across the key aspects (ToolFlow 1 vs ToolFlow 2)]
• Key aspects for consideration:
  • Neural network support
  • Front-end support
  • Design portability
  • Hardware architecture
  • Design space exploration
  • Precision support
  • Objective function
Supported Neural Network models
[Chart: percentage of toolflows supporting each type of neural network (CNNs, RNNs, BNNs)]
• Mainstream models include:
  • CNNs
  • Recurrent Neural Networks (RNNs)
  • Binarised Neural Networks (BNNs)
  • Ternary Neural Networks
• DeepBurning and FP-DNN additionally support:
  • RNNs
  • LSTMs
  • Residual connections in CNNs (FP-DNN)
Supported Front-Ends
[Chart: percentage of toolflows supporting each front-end (Caffe, TensorFlow, Torch, Theano, proprietary)]
• Critical for reaching a wide user base
• Two options:
  • Integration with existing frameworks: fpgaConvNet, DeepBurning, DnnWeaver, Caffeine
  • Proprietary front-ends: SysArrayAccel, Angel-Eye, CNN RTL Compiler
Design Portability
Def.: the degree to which a tool-flow can target:
1. devices from multiple vendors and families
2. different setups (SoCs, host-FPGA servers, stand-alone FPGAs)
[Charts: percentage of toolflows per supported setup (SoC, standalone, host + FPGA) and per supported FPGA vendor (Xilinx, Intel)]
Highest portability: DnnWeaver
Hardware Architecture
Two types of architectures
1. Streaming architectures
2. Single computation engine
Hardware Architecture – Streaming Architectures
Characteristics:
• Coarse pipeline of hardware stages
• One hardware stage per layer
• fpgaConvNet, DeepBurning, FINN

Advantages:
+ Customisation
+ Concurrent execution of layers
+ Flexible allocation of resources per layer, tailored to the target network

Disadvantages:
− Limited flexibility
− New bitstream for each CNN → long compilation times
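The pipeline-of-stages model can be captured by a simple throughput calculation: the pipeline runs at the rate of its slowest stage, which is why these toolflows balance resources across stages. The layer workloads, parallelism factors and clock frequency below are hypothetical, not taken from any of the toolflows above.

```python
# Throughput model of a streaming CNN accelerator (illustrative sketch).
# Each layer gets its own hardware stage; steady-state throughput is
# limited by the slowest stage, so resources should be balanced.

def stage_rate(ops_per_input, parallel_macs, clock_mhz):
    """Inputs/second a stage sustains with the given MAC parallelism."""
    cycles_per_input = ops_per_input / parallel_macs
    return clock_mhz * 1e6 / cycles_per_input

def pipeline_throughput(layers):
    """Steady-state throughput = rate of the slowest stage."""
    return min(stage_rate(*layer) for layer in layers)

# (ops per input, parallel MACs, clock in MHz) for three layers;
# parallelism is deliberately proportional to each layer's workload
layers = [(1.0e8, 128, 200), (2.0e8, 256, 200), (0.5e8, 64, 200)]
print(f"{pipeline_throughput(layers):.1f} inputs/s")  # all stages balanced at 256 inputs/s
```

Note that in this example every stage sustains exactly the same rate: doubling the MACs of any single stage would buy nothing, which is the resource-balancing argument in miniature.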
Hardware Architecture – Single Computation Engine
Characteristics:
• Processing-element-based design
• Fixed architecture, time-shared between layers
• Control via microinstructions
• Angel-Eye, ALAMO, DnnWeaver, Caffeine, FP-DNN, SysArrayAccel

Advantages:
+ Flexibility
+ One bitstream can target many CNNs

Disadvantages:
− Limited customisation
− High performance variability across CNNs
− Inefficiencies due to processor-like control mechanisms
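The time-shared execution model can be sketched in the same spirit: layers run sequentially on one fixed engine, so total latency is a sum, and layers that under-fill the engine waste cycles (the source of the performance variability above). All numbers are hypothetical.

```python
# Latency model of a single computation engine (illustrative sketch).
# One fixed MAC array is time-shared between layers; layers whose shape
# does not match the array achieve low utilisation and waste cycles.

def layer_cycles(ops, engine_macs, utilisation):
    """Cycles a layer occupies the engine at the achieved utilisation."""
    return ops / (engine_macs * utilisation)

def network_latency_ms(layers, engine_macs, clock_mhz):
    cycles = sum(layer_cycles(ops, engine_macs, u) for ops, u in layers)
    return cycles / (clock_mhz * 1e6) * 1e3

# (ops, utilisation) per layer: the small final layer only fills half the array
layers = [(2.0e8, 1.0), (2.0e8, 1.0), (1.0e7, 0.5)]
print(f"{network_latency_ms(layers, engine_macs=512, clock_mhz=200):.2f} ms")
```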
Design Space Exploration
Each toolflow defines an architectural design space:
• Parameter space:
  » Which trade-offs and alternative designs can be explored?
• Exploration method:
  » How is the design space traversed?
  » Which objectives are optimised?

Observations:
• Toolflows with streaming architectures define a finer-grained space:
  » Structure of the pipeline
  » Allocation of resources among hardware stages
• Single computation engine toolflows focus on:
  » Scaling of the computation engine with the available hardware resources
  » CNN-to-microinstruction mapping
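A global optimiser over such a space can be sketched as a simulated-annealing loop, in the spirit of fpgaConvNet's formulation (mathematical optimisation subject to a resource budget). The design space, cost model and resource model below are toy placeholders, not the toolflow's actual ones.

```python
# Simulated-annealing DSE sketch: maximise a throughput proxy over
# per-layer unroll factors, subject to a (hypothetical) DSP budget.
import math
import random

random.seed(0)

BUDGET_DSP = 1024               # hypothetical resource budget

def throughput(point):          # toy cost model: more unrolling -> faster
    return sum(point)

def dsps(point):                # toy resource model: quadratic in unrolling
    return sum(p * p for p in point)

def neighbour(point):           # perturb one layer's unroll factor
    i = random.randrange(len(point))
    q = list(point)
    q[i] = max(1, q[i] + random.choice([-1, 1]))
    return tuple(q)

def anneal(start, steps=5000, t0=10.0):
    best = cur = start
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9          # cooling schedule
        cand = neighbour(cur)
        if dsps(cand) > BUDGET_DSP:
            continue                             # reject infeasible designs
        delta = throughput(cand) - throughput(cur)
        if delta >= 0 or random.random() < math.exp(delta / t):
            cur = cand                           # accept (maybe downhill)
            if throughput(cur) > throughput(best):
                best = cur
    return best

best = anneal((1, 1, 1))
print(best, throughput(best), dsps(best))
```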
Design Space Exploration
Toolflow        Formulation                                     Solver                                   Objectives
fpgaConvNet     Mathematical optimisation s.t. the rsc budget   Global optimiser (simulated annealing)   Throughput, latency or multiobjective criteria
DnnWeaver       Mathematical optimisation s.t. the rsc budget   Heuristic                                Throughput
Caffeine        Roofline model                                  Enumeration                              Throughput
SysArrayAccel   Mathematical optimisation s.t. the rsc budget   Pruning + enumeration                    Throughput
FINN            Rule-based: rate-balancing                      Heuristic                                Throughput, latency
FP-DNN          Rule-based: bandwidth-matching                  Heuristic                                Throughput
ALAMO           N/A                                             Heuristic                                Throughput, latency
DeepBurning     Rule-based: rate-balancing                      Heuristic                                Throughput, latency
Angel-Eye       Rule-based                                      Heuristic                                Throughput, latency
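The roofline formulation that Caffeine enumerates over can be illustrated with a short sketch: attainable performance is the minimum of the compute ceiling and the bandwidth ceiling at a design's operational intensity. The peak-performance and bandwidth figures below are hypothetical platform numbers, not a specific FPGA.

```python
# Roofline model sketch: performance is capped either by the platform's
# peak compute or by off-chip bandwidth times operational intensity.

def attainable_gops(peak_gops, bw_gbs, intensity_ops_per_byte):
    """Attainable performance under the roofline model."""
    return min(peak_gops, bw_gbs * intensity_ops_per_byte)

PEAK, BW = 500.0, 10.0          # GOp/s and GB/s (hypothetical platform)

for intensity in (5, 50, 100):  # ops per byte moved off-chip
    bound = "memory" if BW * intensity < PEAK else "compute"
    perf = attainable_gops(PEAK, BW, intensity)
    print(f"intensity {intensity:3d} ops/byte -> {perf:6.1f} GOp/s ({bound}-bound)")
```

A DSE built on this model enumerates candidate designs, computes each one's intensity (e.g. via tiling choices that change off-chip traffic) and keeps the design with the highest attainable performance.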
Arithmetic Precision
Full precision (FP32): the deep learning community's standard.

Precision support across the toolflows:
• Uniform precision: fpgaConvNet, FP-DNN, SysArrayAccel
• Variable precision across layers: ALAMO, DnnWeaver, DeepBurning
• Dynamic fixed-point (automated): Angel-Eye
• Binary/Ternary: FINN
• Quantisation is either user-specified or automated, depending on the toolflow.
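Dynamic fixed-point, the scheme behind Angel-Eye's automated quantisation, can be illustrated with a minimal sketch: each layer gets its own radix point, chosen here by exhaustive search over a handful of hypothetical weights. This is an illustrative re-implementation of the idea, not Angel-Eye's actual algorithm.

```python
# Dynamic fixed-point sketch: pick a per-layer radix point so that an
# 8-bit word best covers that layer's value range.

def quantise(x, total_bits, frac_bits):
    """Round x to a signed fixed-point value with the given format."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

def best_frac_bits(values, total_bits=8):
    """Radix point minimising total squared quantisation error."""
    def err(f):
        return sum((v - quantise(v, total_bits, f)) ** 2 for v in values)
    return min(range(total_bits), key=err)

layer_weights = [0.8, -0.31, 0.05, 1.4]   # hypothetical layer values
f = best_frac_bits(layer_weights)
print(f"8-bit dynamic fixed-point: {8 - 1 - f} integer bit(s), {f} fractional bits")
```

Layers with small-magnitude values get more fractional bits and layers with large-magnitude values get more integer bits, which is exactly the per-layer flexibility that a single uniform format cannot offer.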
Performance (Quality of Result)
Achieved QoR is the most critical factor.
QoR can be evaluated with respect to two factors:
1) Comparison of QoR between toolflows for the same CNN-FPGA pair
2) Comparison with a hand-tuned accelerator for the same CNN-FPGA pair

Fair comparisons require each toolflow to target the same CNN-FPGA pair. However, the majority of existing toolflows are not open-sourced, or provide limited support:
• DnnWeaver targeting Zynq XC7Z020 (open-sourced, limited support)
• FINN targeting the Xilinx PYNQ-Z1 board (specific BNNs)
• Angel-Eye used internally by DeePhi
• fpgaConvNet (webpage with up-to-date benchmarking results)
High-level Performance Observations
1) RTL-based designs tend to outperform their HLS counterparts.
2) Finer-grained DSE tends to offer an advantage in terms of obtained performance.
3) Single computation engines tend to reach higher performance for CNNs with a uniform structure.
4) Toolflows can achieve comparable or even better performance than hand-tuned designs.
Summary of Toolflow Characteristics
[Fig. 7: Overview of toolflow characteristics: supported NN models, FPGA portability, optimisation objective, precision, performance density, architecture (streaming vs single-engine) and device family (Intel, Xilinx) for fpgaConvNet, SysArrayAccel, Angel-Eye, CNN RTL Compiler, DnnWeaver, Caffeine, FINN, FP-DNN and DeepBurning]
the different target devices of the two toolflows do not allow for a meaningful performance comparison.

The CNN RTL Compiler is designed to combine the high throughput and low latency of RTL designs with high precision flexibility across layers. Nevertheless, precision quantisation is performed manually and is not part of the automated flow. The toolflow uses Intel's off-the-shelf IPs, including the NIOS soft processor and the scatter-gather DMA block, and hence the generated designs are restricted and tailored to Intel FPGAs, which affects its portability.

Angel-Eye bases its competitive advantage on its automatic dynamic quantisation scheme. The selected parameter space allows for the unrolling of both the input and output feature maps and hence latency and throughput are co-optimised. However, the use of Vivado HLS restricts its portability to Xilinx devices.

fpgaConvNet prioritises the support of various optimisation objectives based on the application-level performance needs. In this context, distinct methodologies are used for high-throughput and low-latency applications. Similarly to Caffeine and Angel-Eye, the use of Vivado HLS currently restricts fpgaConvNet to Xilinx devices.

Finally, DeepBurning's design principle entails modularity and support of a wide range of NN models. In this respect, the toolflow's RTL building blocks can target various types of NNs and offer flexibility with respect to precision quantisation across the layers. By design, the generated accelerators are optimised to operate with a batch size of 1 and hence their optimisation objectives are simultaneously high throughput and low latency.

3 THE FUTURE OF CNN-TO-FPGA TOOLFLOWS
3.1 Towards a Uniform Evaluation Methodology
The existing FPGA toolflows have employed ad-hoc evaluation methodologies, targeting different CNN models and reporting the achieved performance in a non-uniform manner. A uniform evaluation methodology is proposed here in order to enable the thorough and comparative evaluation of CNN-to-FPGA toolflows. The proposed methodology comprises a benchmark suite and guidelines for evaluation metrics.

Benchmark Suite. A comprehensive benchmark suite should include CNNs that are widely used and whose accuracy has been extensively studied by the deep learning community. Each CNN

13 We propose a benchmark suite of representative CNNs that can be found in: http://...
ACM Computing Surveys, Vol. 1, No. 1, Article 1. Publication date: July 2017.
THE FUTURE OF CNN-TO-FPGA TOOLFLOWS
Objectives of a CNN-to-FPGA Toolflow
Objective 1. Targeting next-generation CNN models.
Trends:
1) Increased depth
   » 8-layer AlexNet (2012) → 152-layer ResNet-152 (2016), 161-layer DenseNet-161 (2017)
2) Increased workload
   » 20x in GOps/input from AlexNet (2012) to VGG-16 (2014)
3) Novel compound components
   » Inception module (GoogLeNet), residual block (ResNet), dense block (DenseNet), residual Inception block (Inception-v4)
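The workload trend can be made concrete with the standard operation count for a convolutional layer: two operations (multiply and add) per MAC, and K·K·C_in MACs per output element. The layer shape below is one of VGG-16's 3×3, 256-channel convolutions at 56×56 resolution.

```python
# GOps of a convolutional layer: 2 * K * K * C_in MACs-as-ops per output
# element, times C_out * H_out * W_out output elements.

def conv_gops(k, c_in, c_out, h_out, w_out):
    return 2 * k * k * c_in * c_out * h_out * w_out / 1e9

# VGG-16 conv3-256 stage: 3x3 kernel, 256 -> 256 channels, 56x56 output
print(f"{conv_gops(3, 256, 256, 56, 56):.2f} GOps")  # 3.70 GOps for one layer
```

Summing this count over all layers is how per-input workloads such as the AlexNet-to-VGG-16 comparison above are obtained.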
Challenge
• Irregular layer connectivity complicates the automation of mapping to hardware.
Objectives of a CNN-to-FPGA Toolflow
Objective 1. Targeting next-generation CNN models (cont.)
Support for Recurrent Neural Networks (RNNs):
• TPU paper by Google: CNNs account for only around 5% of Google's datacentre NN workload; MLPs and LSTMs dominate the remaining 95%.
Challenge
• RNNs are memory-bound and require a different design approach from CNNs.
Objectives of a CNN-to-FPGA Toolflow
Objective 2. Support for compressed and sparse CNNs
Numerous schemes exploit redundancy in the model to reduce inference time:
• Low-rank approximations, pruning, sparsification, quantisation
• ASIC accelerators have introduced designs for such networks (e.g. zero-skipping compute units, bit-serial arithmetic)
Challenge
• Methods such as pruning can break the uniformity of computation and memory accesses.
Objectives of a CNN-to-FPGA Toolflow
Objective 3. FPGA-based CNN training
GPUs are the current standard for CNN training.
• Next-generation FPGAs demonstrate promising performance and power efficiency (Stratix 10, UltraScale).
• Recent advances in low-precision NN training offer room for the customisation and variable-precision arithmetic that suit FPGAs.
• So far only lightly explored, e.g. by the F-CNN framework.
Challenge
• Demonstrate gains of FPGAs over GPUs for CNN training.
Objectives of a CNN-to-FPGA Toolflow
Objective 4. Hardware-Network co-design
End-to-end toolchain, from dataset and application to network and hardware design.
• Expose hardware performance and power consumption to the training phase; co-optimise the CNN model and the hardware under a holistic view.
Challenge
• A long-term objective for the community towards the efficient hardware execution of high-performing neural networks.
Thank you - Questions
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey
STYLIANOS I. VENIERIS, ALEXANDROS KOURIS AND CHRISTOS-SAVVAS BOUGANIS, Imperial College London
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing CNN ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, an evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Neural networks; • Hardware → Reconfigurable logic and FPGAs; Electronic design automation;

Additional Key Words and Phrases: Convolutional Neural Networks, FPGA Toolflows, Deep Learning

ACM Reference format:
Stylianos I. Venieris, Alexandros Kouris and Christos-Savvas Bouganis. 2017. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey. ACM Comput. Surv. 1, 1, Article 1 (July 2017), 26 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Convolutional Neural Networks (CNNs) [22] have demonstrated remarkable performance in computer vision tasks. Being able to achieve high accuracy and frequently outperform traditional machine vision approaches, CNNs have been employed in a vast range of applications over the last decade, from object detection and classification to mobile robot navigation. While becoming the state-of-the-art algorithm in machine vision, CNNs are challenged to deal with tasks of continuously increasing complexity. This leads to the design of deeper, more expressive networks at the expense of an increase in the computational and memory requirements.
The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged. Authors' addresses: Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, London, UK; emails: {stylianos.venieris10, a.kouris16, christos-savvas.bouganis}@imperial.ac.uk.
© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
…soon published in ACM Computing Surveys
References
• Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs. In FCCM. DOI:http://dx.doi.org/10.1109/fccm.2016.22
• Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only). In FPGA. DOI:http://dx.doi.org/10.1145/3020078.3021791
• Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. Latency-Driven Design for FPGA-based Convolutional Neural Networks. In FPL.
• Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. In DAC. Article 110. DOI:http://dx.doi.org/10.1145/2897937.2898003
• Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2016. Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware. In ISVLSI. DOI:http://dx.doi.org/10.1109/isvlsi.2016.129
• Jiantao Qiu and others. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In FPGA. ACM, 26–35. DOI:http://dx.doi.org/10.1145/2847263.2847265
• Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, and Sarma Vrudhula. 2016. Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA. In FPL. DOI:http://dx.doi.org/10.1109/FPL.2016.7577356
• Hardik Sharma, Jongse Park, Emmanuel Amaro, Bradley Thwaites, Praneetha Kotha, Anmol Gupta, Joon Kyung Kim, Asit Mishra, and Hadi Esmaeilzadeh. 2016. DnnWeaver: From High-Level Deep Network Models to FPGA Acceleration.
• Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From High-Level Deep Neural Models to FPGAs. In MICRO. DOI:http://dx.doi.org/10.1109/micro.2016.7783720
• Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. In ICCAD. Article 12, 8 pages. DOI:http://dx.doi.org/10.1145/2966986.2967011
• Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In FPGA. DOI:http://dx.doi.org/10.1145/3020078.3021744
• Nicholas J. Fraser, Yaman Umuroglu, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. Scaling Binarized Neural Networks on Reconfigurable Logic. In PARMA-DITAM. DOI:http://dx.doi.org/10.1145/3029580.3029586
• Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. In FCCM. DOI:http://dx.doi.org/10.1109/fccm.2017.25
• Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In DAC. DOI:http://dx.doi.org/10.1145/3061639.3062207