
Page 1: Tool-flows for mapping CNNs into FPGAs: Trends and Challenges

Tool-flows for mapping CNNs into FPGAs: Trends and Challenges

Christos Bouganis, [email protected]

Intelligent Digital Systems Lab, Electrical and Electronic Engineering Department, Imperial College London

Page 2:

Artificial Intelligence – Machine Learning – Deep Neural Networks

[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Neural Networks ⊃ CNNs ⊃ YOLO v2]

Page 3:

Convolutional Neural Network

[Diagram: feature-extractor layers followed by fully-connected layers]

Page 4:

Convolutional Neural Network - Trends


Page 5:

A Deep Learning Software Ecosystem

First Wave: Caffe (UC Berkeley), Torch (NYU / Facebook), Theano (Univ. of Montreal)

Second Wave: (Facebook), (Facebook), TensorFlow (Google), (Microsoft), (Amazon)

Deployment: CoreML (Apple), TensorRT (NVIDIA), Neural Network Toolbox

Training and Inference

Page 6:

CNN Deployment Flow

[Flow diagram] User Input (CNN Structure) → Deep Learning Framework → Trained Weights → Optimised mapping → target platforms: CPU, GPU (TK1, TX1 & TX2), Qualcomm Snapdragon, Apple A11

Page 7:

CNN Deployment Flow

[Flow diagram] User Input (CNN Structure) → Deep Learning Framework → Trained Weights → Optimised mapping → target platforms: CPU, GPU (TK1, TX1 & TX2), Qualcomm Snapdragon, Apple A11, and now FPGA (given the FPGA platform resources)

Page 8:

CNNs: An opportunity for tool-flows for Reconfigurable Hardware

In the last few years, significant progress in generic FPGA HLS tools

• Vivado HLS, Intel OpenCL, MaxCompiler, LegUp, etc.

• Generate designs based on the mapping and scheduling of low-level primitive operations → large design space.

• Low-level entry point

CNN workloads are highly structured

• Layers with pre-defined types and parametrisation

Opportunity for domain-specific frameworks for CNNs

• Generate optimised architectures
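The contrast between the two entry points can be made concrete: a domain-specific front end can work from a compact, parametrised layer description instead of low-level primitives. A minimal sketch of such a description (class and field names are hypothetical, not taken from any of the cited toolflows), used here to count a layer's multiply-accumulate operations:

```python
from dataclasses import dataclass

@dataclass
class ConvLayer:
    """Parametrised description of a convolutional layer."""
    in_channels: int
    out_channels: int
    kernel: int       # square kernel size
    out_h: int        # output feature-map height
    out_w: int        # output feature-map width

    def macs(self) -> int:
        """Multiply-accumulates needed for one inference pass."""
        return (self.in_channels * self.out_channels *
                self.kernel * self.kernel * self.out_h * self.out_w)

# AlexNet's first layer: 96 filters of 11x11 over 3 channels, 55x55 output
conv1 = ConvLayer(in_channels=3, out_channels=96, kernel=11, out_h=55, out_w=55)
print(conv1.macs())  # 105415200 MACs
```

A toolflow can analyse such a description directly (layer types and parameters are pre-defined), rather than rediscovering the structure from a sea of scheduled primitive operations.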


Page 9:

Existing CNN-to-FPGA tool-flows

Name          | Input description | Year          | Publication
fpgaConvNet   | Caffe & Torch     | May 2016      | FCCM 2016
DeepBurning   | Caffe             | June 2016     | DAC 2016
Angel-Eye     | Proprietary       | July 2016     | FPGA 2016
ALAMO         | Proprietary       | August 2016   | FPL 2016
DnnWeaver     | Caffe             | October 2016  | MICRO 2016
Caffeine      | Caffe             | November 2016 | ICCAD 2016
FINN          | Theano            | February 2017 | FPGA 2017
FP-DNN        | TensorFlow        | May 2017      | FCCM 2017
SysArrayAccel | C                 | June 2017     | DAC 2017

Page 10:

Key aspects for consideration – How do they compare?

[Radar chart: "Evaluating Tool-Flows", comparing ToolFlow 1 and ToolFlow 2 across seven axes: NN support, Front-Ends, Portability, Hardware Architecture, Design Space Exploration, Precision, Objective function]


• Key aspects for consideration:
  • Neural Network Support
  • Front-End Support
  • Design Portability
  • Hardware Architectures
  • Design Space Exploration
  • Precision support
  • Objective function


Page 11:

Supported Neural Network models

[Bar chart: percentage of toolflows (%) supporting each type of neural network: CNNs, RNNs, BNNs]


• Mainstream models include:
  • CNNs
  • Recurrent Neural Networks (RNNs)
  • Binarised Neural Networks (BNNs)
  • Ternary Neural Networks

• DeepBurning and FP-DNN additionally support:
  • RNNs
  • LSTMs
  • Residual connections in CNNs (FP-DNN)


Page 12:

Supported Front-Ends

[Bar chart: percentage of toolflows (%) supporting each front-end: Caffe, TensorFlow, Torch, Theano, proprietary]


• Critical for reaching a wide user base
• Two options:
  • Integration with existing frameworks: fpgaConvNet, DeepBurning, DnnWeaver, Caffeine
  • Proprietary front-ends: SysArrayAccel, Angel-Eye, CNN RTL Compiler


Page 13:

Design Portability

Def.: The degree to which a tool-flow can target:
1. devices by multiple vendors and families
2. different setups (SoCs, host-FPGA servers, stand-alone FPGAs).


[Bar charts: percentage of toolflows (%) by supported setup (SoC, standalone, host + FPGA) and by supported FPGA vendor (Xilinx, Intel)]

Highest portability: DnnWeaver


Page 14:

Hardware Architecture

Two types of architectures

1. Streaming architectures

2. Single computation engine


Page 15:

Hardware Architecture – Streaming Architectures


Characteristics:
• Coarse pipeline of hardware stages
• One hardware stage per layer
• fpgaConvNet, DeepBurning, FINN

Advantages:
+ Customisation
+ Concurrent execution of layers
+ Flexible allocation of resources per layer, tailored to the target network

Disadvantages:
− Limited flexibility
− A new bitstream is required for each CNN → long compilation times
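The per-layer resource-allocation advantage can be sketched with a toy throughput model (functions and numbers are illustrative, not any toolflow's actual model): a coarse pipeline runs at the rate of its slowest stage, so for the same total budget, skewing resources towards the heaviest layers raises throughput:

```python
def stage_throughput(workload_macs, parallel_units, clock_hz=200e6):
    """Inputs/sec one stage sustains: (units * clock) / MACs per input."""
    return (parallel_units * clock_hz) / workload_macs

def pipeline_throughput(workloads, allocation, clock_hz=200e6):
    """A coarse pipeline is bounded by its slowest stage."""
    return min(stage_throughput(w, p, clock_hz)
               for w, p in zip(workloads, allocation))

workloads = [100e6, 400e6, 50e6]   # MACs per input for three layers

# Uniform allocation (300 units total) starves the heavy middle layer...
uniform = pipeline_throughput(workloads, [100, 100, 100])
# ...while a workload-proportional split of the same 300 units balances stages.
balanced = pipeline_throughput(workloads, [55, 218, 27])
print(uniform, balanced)   # 50.0 inputs/s vs 108.0 inputs/s
```

This is the trade a streaming architecture exploits: because every layer has its own stage, the design space includes exactly this kind of per-layer tailoring.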


Page 16:

Hardware Architecture – Single Computation Engine


Characteristics:
• Processing-element-based design
• Fixed architecture, time-shared between layers
• Control via microinstructions
• Angel-Eye, ALAMO, DnnWeaver, Caffeine, FP-DNN, SysArrayAccel

Advantages:
+ Flexibility
+ One bitstream can target many CNNs

Disadvantages:
− Limited customisation
− High performance variability across CNNs
− Inefficiencies due to processor-like control mechanisms
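The performance-variability disadvantage can be sketched with a toy cycle model (engine dimensions and layer shapes are hypothetical): a fixed engine with, say, 32x32 channel parallelism is fully utilised only when a layer's channel counts fill its lanes, so irregular networks pay for idle hardware:

```python
import math

def layer_cycles(in_ch, out_ch, spatial, engine_in=32, engine_out=32):
    """Cycles to run one layer on a fixed engine: the channel loops are
    tiled to the engine's dimensions, rounding up (partially idle lanes)."""
    tiles = math.ceil(in_ch / engine_in) * math.ceil(out_ch / engine_out)
    return tiles * spatial

# A uniform CNN whose channel counts match the engine fills it exactly...
uniform_net = [(32, 32, 1024)] * 4
# ...while irregular channel counts leave lanes idle in every tile.
irregular_net = [(3, 96, 1024), (48, 33, 1024), (33, 48, 1024), (96, 10, 1024)]

print(sum(layer_cycles(*l) for l in uniform_net))     # 4096 cycles
print(sum(layer_cycles(*l) for l in irregular_net))   # 14336 cycles
```

The same bitstream serves both networks (the flexibility advantage), but achieved utilisation swings widely with how well each CNN matches the engine's shape.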


Page 17:

Design Space Exploration

Each toolflow defines an architectural design space:
• Parameter Space:
  » Which trade-offs and alternative designs can be explored?
• Exploration Method:
  » How is the design space traversed?
  » Which objectives are optimised?

Observations:
• Toolflows with streaming architectures define a finer-grained space:
  » Structure of the pipeline
  » Allocation of resources among hardware stages
• Single-computation-engine toolflows focus on:
  » Scaling of the computation engine with the available hardware resources
  » CNN-to-microinstructions mapping
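One common exploration style couples a mathematical formulation with a global optimiser such as simulated annealing. A toy sketch of that traversal (the one-parameter design space, performance model and cooling schedule are all made up for illustration):

```python
import math
import random

random.seed(0)  # deterministic walk for illustration

def perf_model(p):
    """Toy performance model with diminishing returns in parallelism p."""
    return p ** 0.9

def explore(budget=64, steps=2000):
    """Simulated-annealing-style walk over parallelism factors 1..budget,
    i.e. a design space constrained by a resource budget."""
    p, temp = 1, 10.0
    best_perf, best_p = perf_model(p), p
    for _ in range(steps):
        cand = min(budget, max(1, p + random.choice([-4, -1, 1, 4])))
        delta = perf_model(cand) - perf_model(p)
        # always accept uphill moves; accept downhill ones with a
        # probability that shrinks as the temperature cools
        if delta > 0 or random.random() < math.exp(delta / temp):
            p = cand
        if perf_model(p) > best_perf:
            best_perf, best_p = perf_model(p), p
        temp *= 0.995
    return best_p, best_perf

point, perf = explore()
print(point, perf)
```

A heuristic or enumerative solver would traverse the same space differently; the formulation (objective plus resource constraint) is what the table on the next slide summarises per toolflow.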


Page 18:

Design Space Exploration

Toolflow      | Formulation                                        | Solver                                 | Objectives
fpgaConvNet   | Mathematical optimisation s.t. the resource budget | Global optimiser (simulated annealing) | Throughput, latency or multi-objective criteria
DnnWeaver     | Mathematical optimisation s.t. the resource budget | Heuristic                              | Throughput
Caffeine      | Roofline model                                     | Enumeration                            | Throughput
SysArrayAccel | Mathematical optimisation s.t. the resource budget | Pruning + enumeration                  | Throughput
FINN          | Rule-based: rate-balancing                         | Heuristic                              | Throughput, latency
FP-DNN        | Rule-based: bandwidth-matching                     | Heuristic                              | Throughput
ALAMO         |                                                    | Heuristic                              | Throughput, latency
DeepBurning   | Rule-based: rate-balancing                         | Heuristic                              | Throughput, latency
Angel-Eye     | Rule-based                                         | Heuristic                              | Throughput, latency
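Caffeine's roofline formulation, for example, can be sketched in a few lines (the peak and bandwidth numbers are made up for illustration): attainable performance at each design point is the minimum of the compute roof and the memory roof scaled by that point's arithmetic intensity, and enumeration simply sweeps the candidates:

```python
def attainable_gops(intensity, peak_gops=500.0, bandwidth_gbs=10.0):
    """Roofline: min(compute roof, memory roof) for one design point,
    where intensity is operations per byte moved from off-chip memory."""
    return min(peak_gops, bandwidth_gbs * intensity)

# Sweep candidate designs; intensity grows with on-chip data reuse (tiling)
designs = {f"tile_{t}": 5.0 * t for t in (1, 2, 4, 8, 16, 32)}
perf = {name: attainable_gops(i) for name, i in designs.items()}
best = max(perf, key=perf.get)
print(best, perf[best])
```

Points left of the ridge are bandwidth-bound; once a tiling reaches the compute roof (here at tile_16), further reuse buys nothing, which is exactly the insight the roofline formulation gives the solver for free.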


Page 19:

Arithmetic Precision


[Diagram: arithmetic precision support across toolflows]
• Full precision (FP32): the deep learning community's default
• Uniform precision (user-specified): fpgaConvNet, FP-DNN, SysArrayAccel, ALAMO, DnnWeaver, DeepBurning
• Variable precision (dynamic fixed-point, automated): Angel-Eye
• Binary/Ternary (binary): FINN
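The dynamic fixed-point idea behind automated variable precision can be sketched as picking, per layer, the largest fractional length whose scaled weights still fit the word length (the function below is an illustrative reconstruction, not Angel-Eye's published algorithm, and the weight values are made up):

```python
def quantise_layer(weights, bits=8):
    """Dynamic fixed-point: choose a per-layer fractional length so the
    largest-magnitude weight still fits in `bits`-bit two's complement."""
    max_abs = max(abs(w) for w in weights)
    frac = 0
    # grow the fractional length while the scaled maximum still fits
    while max_abs * (1 << (frac + 1)) <= (1 << (bits - 1)) - 1:
        frac += 1
    scale = 1 << frac
    q = [max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1,
             round(w * scale))) for w in weights]
    return q, frac

q, frac = quantise_layer([0.02, -0.6, 0.31], bits=8)
print(frac, q)   # fractional length 7, weights [3, -77, 40]
```

Because each layer gets its own radix point, layers with small weight magnitudes keep more fractional resolution than a single network-wide format would allow.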


Page 20:

Performance (Quality of Result)

Achieved QoR is the most critical factor.

QoR can be evaluated with respect to two factors:
1) Comparison of QoR between toolflows for the same CNN-FPGA pair,
2) Comparison with a hand-tuned accelerator for the same CNN-FPGA pair.

Fair comparisons require each toolflow to target the same CNN-FPGA pair. However, the majority of existing toolflows are not open-sourced, or provide limited support:
• DnnWeaver targeting Zynq XC7Z020 (open-sourced, limited support)
• FINN targeting the Xilinx PYNQ-Z1 board (specific BNNs)
• Angel-Eye used internally by DeePhi
• fpgaConvNet (webpage with up-to-date benchmarking results)


Page 21:

High-level Performance Observations

1) RTL-based designs tend to outperform their HLS counterparts.

2) Finer-grained DSE tends to offer an advantage in terms of obtained performance.

3) Single computation engines tend to reach higher performance for CNNs with a uniform structure.

4) Comparable or even better performance than hand-tuned designs.


Fig. 7. Overview of toolflow characteristics (supported NN models, FPGA portability, optimisation objective, precision, performance density; architecture: streaming vs single-engine; device family: Intel vs Xilinx) for fpgaConvNet, SysArrayAccel, Angel-Eye, CNN RTL Compiler, DNNWEAVER, Caffeine, FINN, FP-DNN and DeepBurning.

the different target devices of the two toolflows does not allow for a meaningful performance comparison.

The CNN RTL Compiler is designed to combine the high throughput and low latency of RTL designs with high precision flexibility across layers. Nevertheless, precision quantisation is performed manually and is not part of the automated flow. The toolflow uses Intel's off-the-shelf IPs, including the NIOS soft processor and the scatter-gather DMA block, and hence the generated designs are restricted and tailored for Intel FPGAs, which affects its portability.

Angel-Eye bases its competitive advantage on its automatic dynamic quantisation scheme. The selected parameter space allows for the unrolling of both the input and output feature maps and hence latency and throughput are co-optimised. However, the use of Vivado HLS restricts its portability to Xilinx devices.

fpgaConvNet prioritises the support of various optimisation objectives based on the application-level performance needs. In this context, distinct methodologies are used for high-throughput and low-latency applications. Similarly to Caffeine and Angel-Eye, the use of Vivado HLS currently restricts fpgaConvNet to Xilinx devices.

Finally, DeepBurning's design principle entails modularity and support of a wide range of NN models. In this respect, the toolflow's RTL building blocks can target various types of NNs and offer flexibility with respect to precision quantisation across the layers. By design, the generated accelerators are optimised to operate with a batch size of 1 and hence their optimisation objectives are simultaneously high throughput and low latency.

3 THE FUTURE OF CNN-TO-FPGA TOOLFLOWS

3.1 Towards a Uniform Evaluation Methodology

The existing FPGA toolflows have employed ad-hoc evaluation methodologies, by targeting different CNN models and reporting the achieved performance in a non-uniform manner. A uniform evaluation methodology is proposed here in order to enable the thorough and comparative evaluation of CNN-to-FPGA toolflows. The proposed methodology comprises a benchmark suite and guidelines for evaluation metrics.

Benchmark Suite.13 A comprehensive benchmark suite should include CNNs that are widely used and whose accuracy has been extensively studied by the deep learning community. Each CNN

13 We propose a benchmark suite of representative CNNs that can be found in: http://...

ACM Computing Surveys, Vol. 1, No. 1, Article 1. Publication date: July 2017.


Page 22:

THE FUTURE OF CNN-TO-FPGA TOOLFLOWS


Page 23:

Objectives of a CNN-to-FPGA Toolflow

Objective 1. Targeting next-generation CNN models.

Trends:
1) Increased depth.
  » 8-layer AlexNet (2012) → 152-layer ResNet-152 (2016), 161-layer DenseNet-161 (2017)
2) Increased workload.
  » 20x in GOps/input from AlexNet (2012) to VGG-16 (2014)
3) Novel compound components.
  » Inception module (GoogLeNet), residual block (ResNet), dense block (DenseNet), residual Inception block (Inception-v4).

Challenge:
• Irregular layer connectivity challenges the automation of mapping to hardware.
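The connectivity challenge can be made concrete: once a network is a DAG rather than a chain, a toolflow must derive an execution order itself. A minimal sketch (layer names hypothetical) using Kahn's topological sort on a ResNet-style residual block:

```python
from collections import deque

def schedule(dag):
    """Topological order of a layer DAG (Kahn's algorithm): branches and
    skip connections no longer fit a simple linear pipeline."""
    indeg = {n: 0 for n in dag}
    for outs in dag.values():
        for o in outs:
            indeg[o] += 1
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for o in dag[n]:
            indeg[o] -= 1
            if indeg[o] == 0:
                queue.append(o)
    return order

# A ResNet-style residual block: a conv path plus an identity skip
residual = {
    "in":    ["conv1", "add"],   # the skip connection feeds the adder
    "conv1": ["conv2"],
    "conv2": ["add"],
    "add":   ["relu"],
    "relu":  [],
}
print(schedule(residual))
```

Scheduling is only the start; the harder automation problems are buffering the skip-path feature maps and sizing hardware for stages that now have multiple producers and consumers.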


Page 24:

Objectives of a CNN-to-FPGA Toolflow

Objective 1. Targeting next-generation CNN models.

Support for Recurrent Neural Networks (RNNs):
• TPU paper by Google: around 95% of datacentre NN workloads are non-CNN (MLPs and LSTMs).

Challenge:
• RNNs are memory-bound and require a different design approach to CNNs.
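Why RNNs are memory-bound follows from arithmetic intensity: a batch-1 matrix-vector product reads every weight exactly once, while a convolution reuses each weight across all output pixels. A back-of-the-envelope sketch (sizes and byte widths illustrative):

```python
def gemv_intensity(rows, cols, bytes_per_weight=2):
    """Ops per byte for y = Wx with no weight reuse (batch size 1)."""
    ops = 2 * rows * cols                    # one MAC = 2 ops
    traffic = rows * cols * bytes_per_weight
    return ops / traffic

def conv_intensity(k, out_pixels, bytes_per_weight=2):
    """Each conv weight is reused across every output pixel."""
    ops = 2 * k * k * out_pixels
    traffic = k * k * bytes_per_weight
    return ops / traffic

print(gemv_intensity(1024, 1024))   # 1 op/byte: bandwidth-bound
print(conv_intensity(3, 55 * 55))   # thousands of ops/byte: compute-bound
```

At roughly one operation per weight byte, an RNN layer saturates off-chip bandwidth long before it saturates the MAC array, so a CNN-oriented compute-heavy architecture is the wrong starting point.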


Page 25:

Objectives of a CNN-to-FPGA Toolflow

Objective 2. Support for compressed and sparse CNNs

Numerous schemes exploit redundancy in the model to reduce inference time:
• Low-rank approximations, pruning, sparsification, quantisation.
• ASIC accelerators have introduced designs for such networks (e.g. zero-skipping compute units, bit-serial arithmetic, etc.)

Challenge:
• Methods such as pruning can break the uniformity of computation and memory accesses.
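The uniformity problem shows up already in the classic zero-skipping formulation, a compressed sparse row (CSR) matrix-vector product: storage and compute shrink with the non-zeros, but the indexed accesses into the input vector become irregular. A small sketch (matrix values made up):

```python
def to_csr(dense):
    """Compressed sparse row form of a pruned weight matrix."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        rowptr.append(len(vals))
    return vals, cols, rowptr

def spmv(csr, x):
    """y = Wx touching only the non-zeros (zero-skipping)."""
    vals, cols, rowptr = csr
    return [sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

# A 75%-pruned toy matrix: compute drops 4x, but x is now accessed through
# an index list (cols) -- the irregular pattern that hardware must handle.
w = [[0, 2, 0, 0],
     [0, 0, 0, 3],
     [1, 0, 0, 0]]
csr = to_csr(w)
print(spmv(csr, [1, 1, 1, 1]))   # [2, 3, 1]
```

Turning those gather operations and variable-length rows into well-utilised, fixed-latency FPGA datapaths is exactly what a toolflow would have to automate.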


Page 26:

Objectives of a CNN-to-FPGA Toolflow

Objective 3. FPGA-based CNN training

GPUs are the current standard for CNN training.
• Next-generation FPGAs demonstrate promising performance and power efficiency (Stratix 10, UltraScale).
• Recent advances in low-precision NN training offer scope for the customisation and variable-precision arithmetic that suit FPGAs.
• So far only lightly explored, by the F-CNN framework.

Challenge:
• Demonstrate gains of FPGAs over GPUs for CNN training.


Page 27:

Objectives of a CNN-to-FPGA Toolflow

Objective 4. Hardware-Network co-design

End-to-end toolchain, from dataset and application to network and hardware design.
• Expose hardware performance and power consumption to the training phase; co-optimise the CNN model and the hardware under a holistic view.

Challenge:
• A long-term objective for the community towards the efficient hardware execution of high-performing neural networks.


Page 28:

Thank you - Questions


Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey

STYLIANOS I. VENIERIS, ALEXANDROS KOURIS AND CHRISTOS-SAVVAS BOUGANIS, Imperial College London

In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing CNN ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, an evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Neural networks; • Hardware → Reconfigurable logic and FPGAs; Electronic design automation;

Additional Key Words and Phrases: Convolutional Neural Networks, FPGA Toolflows, Deep Learning

ACM Reference format:
Stylianos I. Venieris, Alexandros Kouris and Christos-Savvas Bouganis. 2017. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey. ACM Comput. Surv. 1, 1, Article 1 (July 2017), 26 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Convolutional Neural Networks (CNNs) [22] have demonstrated remarkable performance in computer vision tasks. Being able to achieve high accuracy and frequently outperform traditional machine vision approaches, CNNs have been employed in a vast range of applications over the last decade, from object detection and classification to mobile robot navigation. While becoming the state-of-the-art algorithm in machine vision, CNNs are challenged to deal with tasks of continuously increasing complexity. This leads to the design of deeper, more expressive networks at the expense of an increase in the computational and memory requirements.

The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged. Authors' addresses: Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, London, UK; emails: {stylianos.venieris10, a.kouris16, christos-savvas.bouganis}@imperial.ac.uk.
© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.


…soon published in ACM Computing Surveys


Page 29:

References


• Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs. In FCCM. DOI:http://dx.doi.org/10.1109/fccm.2016.22

• Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only). In FPGA. DOI:http://dx.doi.org/10.1145/3020078.3021791

• Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. Latency-Driven Design for FPGA-based Convolutional Neural Networks. In FPL.

• Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. In DAC. Article 110. DOI:http://dx.doi.org/10.1145/2897937.2898003

• Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2016. Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware. In ISVLSI. DOI:http://dx.doi.org/10.1109/isvlsi.2016.129

• Jiantao Qiu and others. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In FPGA. ACM, 26–35. DOI:http://dx.doi.org/10.1145/2847263.2847265

• Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, and Sarma Vrudhula. 2016. Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA. In FPL. DOI:http://dx.doi.org/10.1109/FPL.2016.7577356

• Hardik Sharma, Jongse Park, Emmanuel Amaro, Bradley Thwaites, Praneetha Kotha, Anmol Gupta, Joon Kyung Kim, Asit Mishra, and Hadi Esmaeilzadeh. 2016. DnnWeaver: From High-level Deep Network Models to FPGA Acceleration.

Page 30:

References


• Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From High-Level Deep Neural Models to FPGAs. In MICRO. DOI:http://dx.doi.org/10.1109/micro.2016.7783720

• Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. In ICCAD. Article 12, 8 pages. DOI:http://dx.doi.org/10.1145/2966986.2967011

• Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In FPGA. DOI:http://dx.doi.org/10.1145/3020078.3021744

• Nicholas J. Fraser, Yaman Umuroglu, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. Scaling Binarized Neural Networks on Reconfigurable Logic. In PARMA-DITAM. DOI:http://dx.doi.org/10.1145/3029580.3029586

• Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. In FCCM. DOI:http://dx.doi.org/10.1109/fccm.2016.22

• Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In DAC. DOI:http://dx.doi.org/10.1145/3061639.3062207