deploying deep learning on embedded devices when fpgas ... · accelerated training interoperability...

23
Deploying Deep Learning on Embedded Devices When FPGAs Make Sense Jack Erickson HDL Technical Marketing

Upload: others

Post on 25-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Deploying Deep Learning on Embedded Devices

– When FPGAs Make Sense

Jack Erickson

HDL Technical Marketing

Page 2: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Deep Learning Inferencing on Embedded Devices

Airborne Image

Analysis

Autonomous Driving Industrial Inspection

Medical Image

Analysis

Wireless Modulation

Classification

Radar Signature

Classification

Page 3: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

System Requirements Drive Network Design

Systems

Engineer

Deep Learning

Practitioner

Hardware/Software

Engineers

Camera specs

Accuracy

Latency

Cost

Power

Industrial Inspection

Page 4: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Challenges of Deploying Deep Learning to FPGA Hardware:

Convolution

Each stride is an 11x11x3 matrix multiply-accumulate

→105M floating-point multiply operations!

55

55

96

filters

stride=4

224

224

11x11

96 filters of 11x11x3 of 32-bit parameters →140k bytes

11x11

→1.16M bytes of activations

Page 5: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Challenges of Deploying Deep Learning to FPGA Hardware

conv

1

conv

2

conv

3

conv

4

conv

5fc6 fc7 fc8input Total

140K 1.2M 3.5M 5.2M 1.8M 148M 64M 16MParameters

(Bytes)n/a 230 M

1.1M 728K 252K 252K 168K 16K 16K 4KActivations

(Bytes)588K 3.1 M

105M 223M 149M 112M 74M 37M 16M 4MFLOPs n/a 720 M

Off-chip RAM

Block RAM

DSP Slices

Page 6: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Deploying Deep Learning to FPGA Hardware Requires

Collaboration

140K 1.2M 3.5M 5.2M 1.8M 148M 64M 16MParameters

(Bytes)

conv

1

conv

2

conv

3

conv

4

conv

5fc6 fc7 fc8

n/a

input Total

230 M

1.1M 728K 252K 252K 168K 16K 16K 4KActivations

(Bytes)588K 3.1 M

105M 223M 149M 112M 74M 37M 16M 4MFLOPs n/a 720 M

ResizeAcquire

data

Output /

display

Mem i/f

Optimize

• Network / layers

• Fixed-point quantization

• Processor micro-architecture

Page 7: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

AI-Driven System Design

Model design and

tuning

Hardware

accelerated training

Interoperability

AI Modeling

Integration with

complex systems

System verification

and validation

System simulation

System Design

Data cleansing and

preparation

Simulation-

generated data

Human insight

Data Preparation

7

Enterprise systems

Embedded devices

Edge, cloud,

desktop

Deployment

Iteration and Refinement

Page 8: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Design and Analyze Your Networks in MATLAB

8

Classification Learner app to try different

classifiers and find the best fit for your data set

Deep Network Designer app to build, visualize,

and edit deep learning networks

Model design and

tuning

Hardware

accelerated training

Interoperability

AI Modeling

Page 9: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

MATLAB Interoperates with Other AI Frameworks

9

Caffe importer

Keras importerModel design and

tuning

Hardware

accelerated training

Interoperability

AI Modeling

Page 10: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Application

logic

Deploy from MATLAB to a Variety of Hardware Platforms

10

FPGA

CPU

GPU Enterprise systems

Embedded devices

Edge, cloud,

desktop

Deployment

Page 11: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

FPGA Deployment from MATLAB

▪ Prototype network on FPGA

▪ Assess memory usage, latency, and accuracy

– Adjust network and iterate

– Quantize to fixed-point

▪ Generate customized deep learning processor HDL

…all from within MATLAB!

Deep Learning HDL ToolboxTM

Application

logic

Page 12: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

HDL Coder

Deep Learning HDL Toolbox Components

Layer

control

instructions

Weights &

Activations

IP core interface

DL Processor

HDL

Application

logic

Deep Learning Processor

QuantizeCompile &

Deploy Network

FPGA Bitstream

Analyze

Profile

Fully

Connected

Module

Convolution

Module

Conv Module Control

Memory Access

Activa

tio

ns

Activa

tio

ns

Activa

tio

ns

FC Module Control

Activa

tio

ns

Memory Access

Build ProcessorCustomize

Estimate

Page 13: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Application

logic

Get Started Prototyping on FPGA with Deep Learning HDL ToolboxTM

Hardware support package

Deep learning processor with I/O and

external memory interfaces• Int8 or single

• Supported boards:• Xilinx: ZCU102 or ZC706

• Intel: Arria10 SoC

• http://mathworks.com/hardware-support.html

FPGA Bitstream

Layer

control

instructions

Weights &

Activations

Page 14: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Defect Detection Example

14

Application

logic

Pre-processing:

Extract regions and

resize

Post-processing:

Annotate and label

Inference: Predict

using trained networkFPGA

Page 15: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Run Deep Learning on FPGA from MATLAB

15

Page 16: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

>> deepNetworkDesigner

Application

logic

Profile FPGA Prototype and Iterate in MATLAB

Layer

control

instructions

Weights &

Activations

Re-t

rain

Page 17: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Design Exploration and Customization

Page 18: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Collaborate to Quantize Network

Systems

Engineer

Deep Learning

Practitioner

Hardware/Software

EngineersFully

Connected

Module

Convolution

Module

Processor Control

Memory Access

Activa

tio

ns

Activa

tio

ns

Activa

tio

ns

x

+/-

Σ

32/

x

+/-

Σ

8/

x

+/-

Σ

8/

x

+/-

Σ

8/

x

+/-

Σ

8/

Accuracy

Latency

Cost

Power

Page 19: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Int8 Quantization

19

Page 20: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Application

logic

Converge on an FPGA-Optimized Deep Learning Network

Layer

control

instructions

Weights &

Activations

Re-t

rain

Modify network

% Create target objecthTarget = dlhdl.Target(…)

% Create workflow object, using the targethW = dlhdl.Workflow(…);

% Compile the networkhW.compile;

% Program the bitstream and deploy the compiled network and weightshW.deploy;

% Run prediction[score, speed] = hW.predict(img, ‘Profile’, ‘on’);>> deepNetworkDesigner

Parameters Speed

140 MB 18 fps

84 MB 45 fps

68 MB 139 fps

Quantize

>> deepNetworkQuantizer

int8 Bitstream

Generate

HDL

Page 21: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Application

logic

Generate Custom Deep Learning Processor HDL and IP Core

% Create a custom processor objecthPC = dlhdl.ProcessorConfig;

% Customize processor characteristicshPC.setModuleProperty('conv', 'KernelDataType', 'int8');hPC.setModuleProperty('conv', 'ConvThreadNumber', 64);hPC.setModuleProperty('fc', 'KernelDataType', 'int8');hPC.setModuleProperty('fc', 'FCThreadNumber', 16);hPC.TargetFrequency = 300;

% Create workflow object for this config, estimate performancehW = dlhdl.Workflow('Network’,quantizer,'ProcessorConfig',hPC)hW.estimate('Performance');

% Generate HDL and IP core using HDL Coderdlhdl.buildProcessor(hPC);

HDL Coder

Custom Processor

IP core interface

DL Processor

HDL

• Configure processor settings

• Parallel threads, frequency, memory sizes

• Quantized or single precision floating point

• Target frequency

• Target any hardware

• Synthesizable RTL with AXI mappings

• Automatic Xilinx or Intel implementation

Page 22: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Application

logic

Collaborate to Converge on Deep Learning FPGA Implementation

FPGA

CPU

GPU

Tune for system requirements

Prototype from MATLAB

Configure and generate RTL

Deep Learning HDL Toolbox

AI Modeling

System Design

Deployment

Page 23: Deploying Deep Learning on Embedded Devices When FPGAs ... · accelerated training Interoperability AI Modeling Integration with complex systems System verification and validation

Learn More

▪ Deep Learning Solutions in MATLAB

https://www.mathworks.com/solutions/deep-learning.html

▪ Deep Learning HDL Toolbox

https://www.mathworks.com/products/deep-learning-hdl.html

▪ Onramp: Deep Learning in MATLAB

https://www.mathworks.com/learn/tutorials/deep-learning-onramp.html

▪ MathWorks FPGA Solutions Page

https://www.mathworks.com/solutions/fpga-asic-soc-development.html