applying deep learning vision technology to low-cost/power embedded systems

© 2016 Synopsys, Inc.

Applying Deep Learning Vision Technology to Low-cost, Low-power Embedded Systems:

An Industrial Perspective

Pierre Paulin

Director of R&D

16 January 2016


Agenda

• Embedded Vision application trends and challenges

• Synopsys Embedded Vision Processor Overview

• Convolutional Neural Networks

– Applications, requirements

– Dedicated CNN engine for EV

– Competitive analysis

• Summary & Final Thoughts


Embedded Vision is Coming Fast

• Embedded Vision is the use of computer vision in embedded systems to interpret meaning from images or video

• In cars to improve safety

• Surveillance for detection and tracking

• In industrial automation to improve quality and control

• Estimated $300B+ market in 2020, 35% CAGR

[Chart: Vision Systems Shipments, in billions of dollars (axis from 0 to 350), 2013 to 2020]

Sources: ABI Research, Insight Media, Transparency Market Research, Markets And Markets, Synopsys
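The growth claim above can be checked with a short compound-growth calculation. This is a minimal sketch, assuming the 35% CAGR applies over the 2013 to 2020 window shown on the chart axis; the function names are illustrative, and the per-year values are derived from the two stated figures rather than read from the chart.

```python
def implied_start_value(final_value, cagr, years):
    """Starting value implied by a final value and a constant annual growth rate."""
    return final_value / (1.0 + cagr) ** years


def project(start_value, cagr, years):
    """Year-by-year values for constant compound growth, including year 0."""
    return [start_value * (1.0 + cagr) ** n for n in range(years + 1)]


FINAL_2020 = 300.0                   # from the slide: "$300B+ market in 2020"
CAGR = 0.35                          # from the slide: "35% CAGR"
FIRST_YEAR, LAST_YEAR = 2013, 2020   # assumption: the window of the chart axis

start = implied_start_value(FINAL_2020, CAGR, LAST_YEAR - FIRST_YEAR)
series = project(start, CAGR, LAST_YEAR - FIRST_YEAR)

for year, value in zip(range(FIRST_YEAR, LAST_YEAR + 1), series):
    print(f"{year}: ${value:6.1f}B")
```

Under these assumptions the implied 2013 market is roughly $37B, which then grows about eightfold by 2020.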

Wide Variety of Vision Applications

Cameras, Drones, Home Automation, Retail, Gaming, Infotainment, Augmented Reality, Mobile, Surveillance, ADAS


Autonomous Driving Buzz

1/14/2016 – U.S. Proposes Spending $4 Billion to Encourage Driverless Cars: Obama administration aims to remove hurdles to making autonomous cars more widespread (Wall Street Journal)

8/17/2016 – Ford's self-driving car 'coming in 2021' (BBC News)

8/24/2016 – Self-driving taxis roll out in Singapore - beating Uber to it (The Guardian)

10/20/2016 – Elon Musk: You'll be able to summon your driverless Tesla from cross-country (CNN Money)

10/25/2016 – Uber's Self-Driving Truck Makes Its First Delivery: 50,000 Beers (Wired)


Largest Embedded Vision Application Segment

Advanced Driver Assistance Systems Driven By Safety Concerns

Source: IC Market Drivers, IC Insights, January 2015 & Trends and Opportunities in Driver Assistance and Automated Driving, IHS Automotive Sep 2015


Video Surveillance Markets Growing Rapidly

• Global IP Video Surveillance Market expected to grow at a CAGR of 37.3% from 2012 to 2020

• Demand driven by

– Growing installations of IP cameras

– Need for surveillance cameras with better video quality

– Limited ability for real-time human analysis

http://www.alliedmarketresearch.com/IP-video-surveillance-VSaaS-market

3X Growth Forecast

2013 - 2019

Security (Airports, Govt, Banks, Casinos), Home Surveillance, Retail, Healthcare



EV Challenges Require Embedded Vision Processors

Less Efficient EV Options

• CPUs don't have the math horsepower for fast 2D vision processing

• GPUs have high performance but large area and higher power

• DSPs are designed for low-power audio and speech applications, not 2D video

• FPGAs are good for prototyping but are expensive and performance limited

Dedicated Embedded Vision Processors

• Higher performance

• Lower power

• Smaller area

• Can include a dedicated deep learning (CNN) engine

[Figure axes: Performance, Power, Area]


Embedded Vision Applications and Power, Performance and Area (PPA) Requirements


Vision Pipeline Example

Object detection pipeline:

Grayscale Conversion → Image Pyramid → Detecting Areas of Interest in a Frame → Non-max Suppression → Draw Box
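The non-learning stages of this pipeline can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the EV processor's kernel library: the detector itself is omitted and replaced by hand-written candidate boxes, the pyramid uses 2x2 block averaging as a simple stand-in for a Gaussian pyramid, and all function names are hypothetical.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to luma using ITU-R BT.601 weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights


def image_pyramid(gray, levels):
    """Build a pyramid by repeated 2x2 block averaging (halves each dimension)."""
    pyramid = [gray]
    for _ in range(levels - 1):
        g = pyramid[-1]
        h, w = (g.shape[0] // 2) * 2, (g.shape[1] // 2) * 2
        pyramid.append(g[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Synthetic frame (pure green) and hand-written candidate detections.
frame = np.zeros((64, 96, 3), dtype=np.uint8)
frame[..., 1] = 255
gray = to_grayscale(frame)
pyramid = image_pyramid(gray, levels=3)

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (60, 20, 90, 50)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print([p.shape for p in pyramid], kept)
```

The first two boxes overlap heavily, so only the higher-scoring one survives. The pre-processing steps are regular per-pixel operations, while NMS is irregular, data-dependent control flow.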


Vision Pipeline Example

Video surveillance pipeline

Stages shown in the diagram: Grayscale & Image Pyramid; Face Detection; Tracking & Detection Cascade; Fusion & Learning


Vision Algorithm Computation

Processing stages and typical algorithms:

• Pre-processing: noise reduction, color space conversion, gamma correction, image scaling, Gaussian pyramid

• Selecting Areas of Interest: object detection, background subtraction, feature extraction, image segmentation, connected component labeling

• Precise Processing of Selected Areas: object recognition, tracking, feature matching, gesture recognition, motion analysis

• Decision Making: match/no match, flag events

Types of computation:

• Simple Data-Level Parallelism (DLP): good spatial locality, good compute intensity, small context

• More Complex DLP: complex data structures, irregular compute intensity, larger context

• Scalar Processing: general purpose compute, thread level parallelism

Processing resources shown in the diagram: RISC scalar, Multi-core Gen1 EV SIMD processor, Multi-core Gen2 EV SIMD processor, Multi-core CNN Engine, CNN


Sample Power, Performance and Area Targets

• Intelligent video surveillance applications

– Face detection & tracking, pose detection, gaze estimation, gender recognition, age estimation

– People detection & counting for video surveillance

– Driver fatigue detection

– Advanced detection and tracking

• Performance range: 10-500 GOP/s

– Implementation on GPP and GP-GPU: 1-10 W, 50-100 mm2

– Typical customer targets for HD @30 fps: <500 mW, 1-2 mm2

Based on 28 nm process node


Sample Power, Performance and Area Targets

• ADAS

– Pedestrian, vehicle, traffic sign, lane detection

– Scene segmentation

• Performance range: 100-2000 GOP/s

– Implementation on GPP and GP-GPU: >100 W, >100 mm2

– Typical customer targets for HD @30 fps: 1-2 W, 2-5 mm2

Based on 28 nm process node


DesignWare® ARC EV6 Processor and CNN

- Vision-specific wide SIMD engine

- Optimized CNN engine

- Programming tools


EV6x Processor Objectives

• Low power: over 1000 GMAC/s/W in CNN engine

• Low area

• High-performance CNN: up to 880 MAC/cycle

• Highly scalable vector engine: 100 GOP/s, 620 GOP/s

• Most integrated solution: scalar, vector, CNN accelerator

• High productivity: standard programming model (C/C++, OpenCL C), embedded vision libraries

Preliminary – Subject to Change



EV Processor Solution: EV6x with CNN Engine

[Block diagram; recoverable labels are listed below]

• Vision CPU (1 to 4 cores): each core pairs a 32b scalar unit with a 512b vector DSP, with I$, D$ and VCCM

• CNN Engine (option): Convolution (ALU Conv. 2D, AGUs, CC MEMs) and Classification (ALU Conv. 1D, AGUs, CC MEMs)

• Peak figures: up to 880 MAC/cycle; up to 620 GOP/s at 800 MHz

• Shared infrastructure: cluster shared memory, Comm., DMA, coherency, ARConnect, sync, debug, power management, AXI interconnect

• Embedded Vision Programming Tools: C/C++ compiler; OpenCL C compiler with whole function vectorization; kernel library (K1 … Kn) and user kernels (Ui, Uj, Uk, Um) combined into a graph; CNN graph mapping tools for CNN graph nodes (Cn)

• Prototyping: virtual prototype; HAPS® rapid prototyping board
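The headline CNN figures can be related to each other with simple arithmetic. This is a back-of-the-envelope sketch, assuming the 800 MHz clock quoted on the slide also applies to the CNN engine, that one MAC counts as two operations, and that the "over 1000 GMAC/s/W" efficiency holds at peak throughput; none of these assumptions is confirmed by the slides.

```python
MACS_PER_CYCLE = 880               # from the slide: "Up to 880 MAC/cycle"
CLOCK_HZ = 800e6                   # from the slide: "at 800 MHz" (assumed to apply to the CNN engine)
EFFICIENCY_GMAC_PER_S_PER_W = 1000.0   # from the objectives slide: "Over 1000 GMAC/s/W"
OPS_PER_MAC = 2                    # assumption: one multiply plus one add


def peak_gmac_per_s(macs_per_cycle, clock_hz):
    """Peak multiply-accumulate throughput in GMAC/s."""
    return macs_per_cycle * clock_hz / 1e9


peak = peak_gmac_per_s(MACS_PER_CYCLE, CLOCK_HZ)
peak_gops = peak * OPS_PER_MAC
power_bound_w = peak / EFFICIENCY_GMAC_PER_S_PER_W

print(f"Peak CNN throughput: {peak:.0f} GMAC/s ({peak_gops:.0f} GOP/s)")
print(f"Implied CNN engine power at peak: under {power_bound_w:.2f} W")
```

Under these assumptions the engine peaks at 704 GMAC/s and would draw well under 1 W, which is the same order of magnitude as the 1-2 W ADAS customer target quoted earlier.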


CNN – Convolutional Neural Networks

Deep Learning Approach to Embedded Vision


CNN for a Wide Range of Vision Applications

• Image classification, search similar images

• Object detection, classification & localization

– Any type of object(s), depending on training phase

• Face recognition

• Visual attention

• Facial expression recognition

• Gesture recognition / hand tracking

• Resolution upscaling

• Scene recognition and labelling, semantic segmentation

– Sky, mountain, road, tree, building, …

• Recent advocates

– Nvidia, Microsoft, Google, Baidu, Adobe, Qualcomm, Yahoo …

– Mobileye for autonomous driving

[Figure: scene labelling example with regions labelled car, sky, building, road]


Pedestrian Detection: HoG vs. CNN


Computation Requirements for CNN

[Chart: accuracy vs. computational complexity, with complexity marks at 1 GOPs/frame and 10 GOPs/frame]

• LeNet (1994): 4 layers

• AlexNet (2012): 8 layers, 100 MByte

• VGG-19 (2014): 19 layers, 270 MByte

• GoogLeNet (2014): 22 layers, 20 MByte

• ResNet (2015): 152 layers, 10 MByte


Scene Segmentation

Source: Press Release by Toshiba and Denso, 17 Oct. 2016


Super resolution using CNN

[Figure: two example images, each shown as Source, Bicubic Interpolation, CNN, and Reference]

Source: "Image Super-Resolution Using Deep Convolutional Networks" (2016), C. Dong et al.


Super-Resolution using Convolutional Neural Networks

• CNNs deliver superior Super-Resolution for single image and video

• CNNs for Super-Resolution require a dedicated compute engine with high compute capacity

• Example: "Image Super-Resolution Using Deep Convolutional Networks" (2016), C. Dong et al., requires 600 GMAC for one 4K frame
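To put the 600 GMAC figure in perspective, the sketch below converts it into work per output pixel and into the sustained throughput needed for video. It assumes a 3840 x 2160 frame for "4K" and a 30 fps video rate (the rate used elsewhere in this deck for HD targets); both are assumptions, not figures from the cited paper.

```python
GMAC_PER_FRAME = 600.0      # from the slide: "600 GMAC for one 4K frame"
WIDTH, HEIGHT = 3840, 2160  # assumption: UHD "4K" resolution
FPS = 30                    # assumption: video rate


def macs_per_pixel(gmac_per_frame, width, height):
    """Average number of MAC operations spent on each output pixel."""
    return gmac_per_frame * 1e9 / (width * height)


def required_gmac_per_s(gmac_per_frame, fps):
    """Sustained throughput needed to process fps frames every second."""
    return gmac_per_frame * fps


per_pixel = macs_per_pixel(GMAC_PER_FRAME, WIDTH, HEIGHT)
sustained = required_gmac_per_s(GMAC_PER_FRAME, FPS)

print(f"About {per_pixel:,.0f} MACs per pixel")
print(f"{sustained:,.0f} GMAC/s needed for {FPS} fps")
```

Under these assumptions each pixel costs roughly 72,000 MACs, and real-time 4K video would need about 18,000 GMAC/s.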


CNN Graph Training and Porting

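[Flow diagram with two phases; recoverable labels are listed below]

• Training: image labeling; graph exploration and training on a GPU farm; outputs the CNN graph and coefficients

• Porting: code vectorization; outputs the object detection executable code

• Execution targets shown: GPP, GP-GPU, CNN-optimized processor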


CNN Computation

• Convolution of multiple inputs together

– Fixed kernel size

• Optional subsampling

– 1x, 2x, 4x

• Optional max-pooling

• Very regular, repetitive computation

– Dominated by MAC

– Deterministic

• Non-linear activation function

– Rectifier, Sigmoid, Hyperbolic tangent

[Diagram: M inputs I_0, I_1, …, I_{M-1} of size XI * YI; Z kernels of size K * K with associated weights; N outputs O_0, …, O_{N-1} of size XO * YO]

O_j = act(B_j + (I_v x K_w) + …), where x denotes convolution and act is the activation function (tanh, ReLU, …)
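The layer computation described above can be written out directly in NumPy. This is a minimal reference sketch, not the EV6x implementation. It assumes every output map is connected to every input map (so Z = M x N kernels), uses "valid" convolution without padding, and computes the convolution as a sliding dot product without flipping the kernel, as CNN frameworks usually do. All names are illustrative.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide one K x K kernel over one input map (no padding, stride 1)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * image[dy:dy + out_h, dx:dx + out_w]
    return out


def relu(x):
    """Rectifier activation."""
    return np.maximum(x, 0.0)


def max_pool(x, size):
    """Non-overlapping size x size max-pooling."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))


def conv_layer(inputs, kernels, biases, act=relu, pool=None):
    """inputs: (M, YI, XI); kernels: (N, M, K, K); biases: (N,).

    Computes O_j = act(B_j + sum over v of (I_v conv K_jv)), then optional pooling.
    """
    n_out, m_in = kernels.shape[:2]
    outputs = []
    for j in range(n_out):
        acc = sum(conv2d_valid(inputs[v], kernels[j, v]) for v in range(m_in))
        o = act(acc + biases[j])
        outputs.append(max_pool(o, pool) if pool else o)
    return np.stack(outputs)


def mac_count(m_in, n_out, k, yo, xo):
    """MACs for one fully connected convolution layer before pooling."""
    return m_in * n_out * k * k * yo * xo


inputs = np.ones((3, 8, 8))          # M = 3 input maps of 8 x 8
kernels = np.ones((2, 3, 3, 3))      # N = 2 outputs, 3 x 3 kernels
biases = np.array([0.0, -30.0])

out = conv_layer(inputs, kernels, biases)
pooled = conv_layer(inputs, kernels, biases, pool=2)
print(out.shape, pooled.shape, mac_count(3, 2, 3, 6, 6))
```

Each output pixel here accumulates 3 x 9 = 27 products, so the first map is 27 everywhere and the second is clamped to 0 by the rectifier. The MAC count formula shows why the workload is "dominated by MAC": it scales with the product of input maps, output maps, kernel area and output size.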


EV6x Second Generation CNN Engine for

Object Detection and Semantic Segmentation

- High performance, low power and area

- Fully programmable

[Figure: semantic segmentation example with regions labelled car, sky, building]


High-Performance EV6x CNN Engine

• Dedicated EV6x CNN Engine with performance equal to or better than GP-GPU

• Programmable to support the full range of fixed-point CNN graphs

• State-of-the-art power efficiency

• Real-time, high-quality image classification, object recognition, semantic segmentation

• Supports resolutions up to 4K

• Operates in parallel with Vision CPUs, increasing efficiency and throughput

[Block diagram: Vision CPU Core (32-bit RISC, 512-bit Vector DSP); CNN Engine with Convolution (ALU Conv. 2D, AGUs, CC MEMs) and Classification (ALU Conv. 1D, AGUs, CC MEMs); Cluster Shared Memory; DMA; ARConnect; AXI Interconnect]


AlexNet on ImageNet

Quantization opportunities for recognition tasks

32-bit floating point vs 16-bit fixed point [Moons WACV2016]

[Chart: recognition accuracy vs. fixed-point word length, with 12-bit highlighted]

• 12 bit is a good compromise between CNN recognition performance and hardware cost

– 8 bit will cause recognition rate loss on existing graphs

– A 12 bit multiplier is almost half the area of a 16 bit multiplier
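The trade-off between word length, precision and multiplier cost can be illustrated with a small experiment. This is a sketch under stated assumptions: the weights are synthetic uniform values in [-1, 1) rather than a trained network, the format is signed fixed point with one sign bit and the rest fractional bits, and the area model assumes multiplier area scales with the square of the word length (a common first-order approximation for array multipliers, not a figure from the slides).

```python
import numpy as np


def quantize_fixed_point(x, word_length, frac_bits):
    """Round to signed fixed point (word_length bits total) with saturation."""
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (word_length - 1))
    q_max = 2 ** (word_length - 1) - 1
    q = np.clip(np.round(x * scale), q_min, q_max)
    return q / scale


def multiplier_area_ratio(bits_a, bits_b):
    """First-order area ratio of two multipliers, assuming area ~ bits squared."""
    return (bits_a / bits_b) ** 2


rng = np.random.default_rng(0)
weights = rng.uniform(-1.0, 1.0, size=10_000)   # synthetic stand-in for CNN weights

errors = {}
for bits in (8, 12, 16):
    quantized = quantize_fixed_point(weights, bits, frac_bits=bits - 1)
    errors[bits] = float(np.max(np.abs(weights - quantized)))
    print(f"{bits:2d} bit: max abs error {errors[bits]:.2e}")

print(f"12 bit vs 16 bit multiplier area: {multiplier_area_ratio(12, 16):.4f}")
```

Under the quadratic area assumption a 12 bit multiplier takes about 56% of the area of a 16 bit one, in line with the "almost half" figure on the slide. The quantization error shown here is only a proxy; the effect on recognition accuracy has to be measured on the trained graph.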


CNN data precision – Qualcomm data


CNN Competitive Analysis


CNN Performance and Area Efficiency Comparison


[Bubble chart: area efficiency (GMAC/s/mm2) vs. performance (GMAC/s), log scales; circle area proportional to logic area. Plotted: first-generation vision processors, GP/GPU, EV6x Embedded Vision Processor with integrated CNN. Annotated ratios: 300X, 2X, 20X, 14X]


CNN Performance and Power Efficiency Comparison


[Bubble chart: power efficiency (GMAC/s/W) vs. performance (GMAC/s), log scales; circle area proportional to logic area. Plotted: first-generation vision processors, GP/GPU, EV6x Embedded Vision Processor with integrated CNN. Annotated ratios: 11X, 30X]


EV Challenges Require Embedded Vision Processors

Less Efficient EV Options

• CPUs don't have the math horsepower for fast 2D vision processing

• GPUs have high performance but large area and higher power

• DSPs are designed for low-power audio and speech applications, not 2D video

• FPGAs are good for prototyping but are expensive and performance limited

Dedicated Embedded Vision Processors

• High performance: 100-1000 GOP/s

• Lower power: 1000 GMACs/W

• Smaller area: few mm2

Dedicated deep learning (CNN) engine provides PPA numbers compatible with surveillance, ADAS and mobile targets

[Figure axes: Performance, Power, Area]

Thank You

Contact me at:

[email protected]