Deploying Deep Learning Networks to Embedded GPUs and CPUs
Rishu Gupta, PhD, Senior Application Engineer, Computer Vision
© 2015 The MathWorks, Inc.


TRANSCRIPT

Page 1

© 2015 The MathWorks, Inc.

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Rishu Gupta, PhD
Senior Application Engineer, Computer Vision

Page 2

MATLAB Deep Learning Framework

Access Data → Design + Train → Deploy

▪ Manage large image sets

▪ Automate image labeling

▪ Easy access to models

▪ Acceleration with GPUs

▪ Scale to clusters

Page 3

Multi-Platform Deep Learning Deployment

Embedded and mobile: NVIDIA TX1, TX2, TK1; Raspberry Pi; BeagleBone

Desktop and data center

Page 4

Multi-Platform Deep Learning Deployment

▪ Need code that takes advantage of:
– NVIDIA® CUDA libraries, including cuDNN and TensorRT
– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) for Intel processors
– ARM® Compute Library for ARM processors

Page 5

Example targets: Intel Xeon desktop PC, Raspberry Pi board, Android phone, NVIDIA Jetson TX1 board

Multi-Platform Deep Learning Deployment

▪ Need code that takes advantage of:
– NVIDIA® CUDA libraries, including cuDNN and TensorRT
– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) for Intel processors
– ARM® Compute Library for ARM processors

Page 6

Algorithm Design to Embedded Deployment Workflow

Conventional Approach

1. Desktop GPU (C/C++): high-level language, deep learning framework, large and complex software stack
2. C++ low-level APIs: application-specific libraries
3. Embedded GPU (C/C++): target-optimized libraries; optimize for memory & speed

Challenges
• Integrating multiple libraries and packages
• Verifying and maintaining multiple implementations
• Algorithm & vendor lock-in

Page 7

Solution: GPU Coder for Deep Learning Deployment

Application logic → GPU Coder → Target libraries:
– NVIDIA TensorRT & cuDNN libraries
– ARM Compute Library
– Intel MKL-DNN library

Page 8

Deep Learning Deployment Workflows

INTEGRATED APPLICATION DEPLOYMENT: Pre-processing → Trained DNN → Post-processing → codegen → portable target code

INFERENCE ENGINE DEPLOYMENT: Trained DNN → cnncodegen → portable target code

Page 9

Workflow for Inference Engine Deployment

Steps for inference engine deployment:

1. Generate the code for the trained model:
   >> cnncodegen(net, 'targetlib', 'cudnn')
2. Copy the generated code onto the target board
3. Build the code for the inference engine:
   >> make -C ./codegen -f …mk
4. Use a hand-written main function to call the inference engine
5. Generate the executable and test it:
   >> make -C ./ ……

INFERENCE ENGINE DEPLOYMENT: Trained DNN → cnncodegen → portable target code

Page 10

How to Get a Trained DNN into MATLAB?

Three paths to a trained DNN:
▪ Train in MATLAB
▪ Import a reference model with the model importer
▪ Apply transfer learning to a reference model
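The transfer-learning path can be sketched in a few lines. A minimal example, assuming Deep Learning Toolbox with the AlexNet support package installed and an image datastore `imds` of labeled training images; the layer indices and training options are illustrative:

```matlab
% Load a pretrained reference model (requires the AlexNet support package)
net = alexnet;

% Keep all layers except the final, task-specific classification layers
layersTransfer = net.Layers(1:end-3);

% Append new layers sized for the new task
numClasses = numel(unique(imds.Labels));
layers = [layersTransfer
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

% Retrain on the new data with a small learning rate
opts = trainingOptions('sgdm', 'InitialLearnRate', 1e-4);
netTransfer = trainNetwork(imds, layers, opts);
```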

Page 11

Deep Learning Inference Deployment

Trained DNN (train in MATLAB, model importer, or transfer learning from a reference model) → Target libraries:
– NVIDIA TensorRT & cuDNN libraries
– ARM Compute Library
– Intel MKL-DNN library

Page 12

Building a DNN from Scratch

Steps: load training data → build layer architecture → set training options → train network

%% Create a datastore
imds = imageDatastore('Data', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
num_classes = numel(unique(imds.Labels));

%% Build layer architecture
layers = [imageInputLayer([64 32 3])
    convolution2dLayer(5,20)
    reluLayer()
    maxPooling2dLayer(2,'Stride',2)
    fullyConnectedLayer(512)
    fullyConnectedLayer(2)
    softmaxLayer()
    classificationLayer()];

%% Set training options
miniBatchSize = 128;  % example value; tune for the data set
trainOpts = trainingOptions('sgdm', ...
    'MiniBatchSize', miniBatchSize, ...
    'Plots', 'training-progress');

%% Train network
net = trainNetwork(imds, layers, trainOpts);
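Once trained, the network can be sanity-checked before deployment. A quick illustrative check, assuming a held-out image file `test.png` sized to match the 64x32x3 input layer (the file name is hypothetical):

```matlab
% Classify one held-out image
img = imread('test.png');      % expected size: 64x32x3
label = classify(net, img);

% Or inspect class scores directly
scores = predict(net, single(img));
```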

Page 13

Pedestrian Detection DNN Deployment on ARM Processor

layers = [imageInputLayer([64 32 3])
    convolution2dLayer(5,20)
    reluLayer()
    maxPooling2dLayer(2,'Stride',2)
    crossChannelNormalizationLayer(5,'K',1)
    convolution2dLayer(5,20)
    reluLayer()
    maxPooling2dLayer(2,'Stride',2)
    fullyConnectedLayer(512)
    fullyConnectedLayer(2)
    softmaxLayer()
    classificationLayer()];

Page 14

Pedestrian Detection DNN Deployment on ARM Processor

▪ ARM Neon instruction set architecture
– Example: ARM Cortex-A

▪ ARM Compute Library
– Low-level software functions
– Computer vision, machine learning, etc.

▪ Pedestrian detection on Raspberry Pi
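For the Raspberry Pi target, the inference-engine workflow shown earlier swaps the target library to the ARM Compute backend. A minimal sketch, assuming `net` is the trained pedestrian-detection network and that the ARM Compute target library option is available in this release:

```matlab
% Generate C++ inference code that calls the ARM Compute Library
cnncodegen(net, 'targetlib', 'arm-compute');

% The generated ./codegen folder is then copied to the board and
% built there with the supplied makefile, as in the earlier steps.
```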

Page 15

Deep Learning Inference Deployment

Trained DNN (train in MATLAB, model importer, or transfer learning from a reference model) → Target libraries:
– NVIDIA TensorRT & cuDNN libraries
– ARM Compute Library
– Intel MKL-DNN library

Page 16

Importing a DNN from an Open Source Framework

Caffe Model Importer (including Caffe Model Zoo):
▪ importCaffeLayers
▪ importCaffeNetwork

TensorFlow-Keras Model Importer:
▪ importKerasLayers
▪ importKerasNetwork

network = importCaffeNetwork(protofile, 'yolo.caffemodel');
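The Keras path is symmetric to the Caffe call above. A hedged example, assuming a Keras HDF5 file `model.h5` (a hypothetical file name) containing both architecture and weights:

```matlab
% Import a full TensorFlow-Keras model (architecture + weights)
net = importKerasNetwork('model.h5');

% Or import layers only, e.g. when the final layers must be replaced
layers = importKerasLayers('model.h5', 'ImportWeights', true);
```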

Page 17

Deep Learning Inference Deployment

Trained DNN (train in MATLAB, model importer, or transfer learning from a reference model) → Target libraries:
– NVIDIA TensorRT & cuDNN libraries
– ARM Compute Library
– Intel MKL-DNN library

Example application: object detection

Page 18

Deep Learning Inference Deployment (diagram as above: Trained DNN → NVIDIA TensorRT & cuDNN, ARM Compute Library, Intel MKL-DNN)

Page 19

Page 20

Layered Architecture for SegNet (Semantic Segmentation)

DAG network; total number of layers: 91

Page 21

NVIDIA TensorRT: programmable inference accelerator

Supported hardware: Tesla V100, DRIVE PX 2, Tesla P4, Jetson TX2, NVIDIA DLA

Page 22

Performance Summary (VGG-16) on Titan Xp

[Bar chart, scale 0 to 400, comparing: MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), GPU Coder (TensorRT int8)]

Page 23

How Good is Generated Code Performance?

▪ Performance of CNN inference (AlexNet) on Titan Xp GPU

▪ Performance of CNN inference (AlexNet) on Jetson (Tegra) TX2

Page 24

AlexNet Inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for GPU Coder + TensorRT (3.0.1, int8), GPU Coder + TensorRT (3.0.1), GPU Coder + cuDNN, MXNet (1.1.0), TensorFlow (1.6.0)]

Testing platform: CPU Intel(R) Xeon(R) E5-1650 v4 @ 3.60GHz; GPU NVIDIA Pascal Titan Xp; cuDNN v7

Page 25

VGG-16 Inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for GPU Coder + TensorRT (3.0.1, int8), GPU Coder + TensorRT (3.0.1), GPU Coder + cuDNN, MXNet (1.1.0), TensorFlow (1.6.0)]

Testing platform: CPU Intel(R) Xeon(R) E5-1650 v4 @ 3.60GHz; GPU NVIDIA Pascal Titan Xp; cuDNN v7

Page 26

AlexNet Inference on Jetson TX2: Frame-Rate Performance

[Bar chart: frames per second vs. batch size for GPU Coder + TensorRT, GPU Coder + cuDNN, C++ Caffe (1.0.0-rc5)]

Page 27

Brief Summary

DNN libraries are great for inference…

▪ GPU Coder generates code that takes advantage of:
– NVIDIA® CUDA libraries, including cuDNN and TensorRT
– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)
– ARM® Compute Library for mobile platforms

Page 28

Brief Summary (continued)

DNN libraries (cuDNN/TensorRT, MKL-DNN, ARM Compute) are great for inference… but applications require more than just inference.

Page 29

Deep Learning Workflows: Integrated Application Deployment

INTEGRATED APPLICATION DEPLOYMENT: Pre-processing → Trained DNN → Post-processing → codegen → portable target code
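In this workflow, `codegen` compiles the whole MATLAB function (pre-processing, network, and post-processing) rather than just the network. A minimal sketch, assuming a hypothetical entry-point function `myDetector.m` that takes a 224x224x3 single-precision image; the function name and input size are illustrative:

```matlab
% Configure GPU code generation with the cuDNN deep learning backend
cfg = coder.gpuConfig('lib');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

% Generate portable CUDA C++ for the full application function
codegen -config cfg myDetector -args {ones(224,224,3,'single')}
```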

Page 30

Traffic Sign Detection and Recognition

Pipeline: object detection DNN (YOLO) → strongest bounding box → classifier DNN (recognition net)

Page 31

Traffic sign detection and recognition

Page 32

Traffic sign detection and recognition

Page 33

GPU Coder Helps You Deploy Applications to GPUs Faster

GPU Coder
• CUDA kernel creation: library function mapping, loop optimizations, dependence analysis
• Memory allocation: data locality analysis, GPU memory allocation
• Data transfer minimization: data-dependence analysis, dynamic memcpy reduction

Page 34

CUDA Code Generation from GPU Coder app

Integrated editor and simplified workflow for code generation

Page 35

Summary: GPU Coder

Starting point: MATLAB algorithm (functional reference)

1. Functional test: desktop GPU; build type .mex; call CUDA from MATLAB directly
2. Deployment unit-test: desktop GPU, C++; build type .lib; call CUDA from (C++) hand-coded main()
3. Deployment integration-test: desktop GPU, C++; build type .lib; call CUDA from (C++) hand-coded main()
4. Real-time test: embedded GPU; build type cross-compiled .lib; call CUDA from (C++) hand-coded main()
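Stage 1 (the functional test via a .mex build) can be exercised directly from MATLAB. An illustrative sketch, assuming a hypothetical entry-point function `myAlgorithm.m` with a 224x224x3 single input:

```matlab
% Build a MEX target so the generated CUDA is callable from MATLAB
cfg = coder.gpuConfig('mex');
codegen -config cfg myAlgorithm -args {ones(224,224,3,'single')}

% Call the generated MEX and compare against the MATLAB reference
out_mex = myAlgorithm_mex(ones(224,224,3,'single'));
out_ref = myAlgorithm(ones(224,224,3,'single'));
```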

Page 36

MATLAB Deep Learning Framework

Access Data → Design + Train → Deploy

▪ Manage large image sets
▪ Automate image labeling
▪ Easy access to models
▪ Acceleration with GPUs
▪ Scale to clusters

DEPLOYMENT: NVIDIA TensorRT & cuDNN libraries, Intel MKL-DNN library, ARM Compute Library

Page 37

• Share your experience with MATLAB & Simulink on social media
▪ Use #MATLABEXPO
▪ I use #MATLAB because… Attending #MATLABEXPO
▪ Examples:
– I use #MATLAB because it helps me be a data scientist! Attending #MATLABEXPO
– Learning new capabilities in #MATLAB and #Simulink at #MATLABEXPO.

• Share your session feedback: please fill in the feedback form for this session

Speaker Details

Email: [email protected]

LinkedIn: https://www.linkedin.com/in/rishu-gupta-72148914/

Contact MathWorks India

Products/Training Enquiry Booth

Call: 080-6632-5749

Email: [email protected]