TRANSCRIPT
High Performance, Low Power, CNN Implementation on Ultra-Low Power Embedded Platforms
This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission
www.excess.eu | Copyright © 2013 - 2016 The EXCESS Consortium
Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months
Brendan Barry, VP of Engineering
Movidius
Why are CNNs interesting?
• Deep Learning: the Great Disruptor
• Companies using CNNs can enable sought-after features otherwise difficult or impossible
Why are CNNs interesting to Movidius?
DRONES
• Sense and avoid
• GPS-denied hovering
• Scene labeling

SERVICE ROBOTICS
• Volumetric mapping
• Indoor visual navigation

AR/VR
• 6DOF positional tracking
• Depth sensing
• Gesture & eye tracking
• See-through

SECURITY CAMERAS
• Detection
• Recognition
• Identification
Deep Convolutional Neural Networks (CNNs):
[Figure: CNN structure. Each layer applies convolution, max-pooling (blockwise maximum) and activation (pointwise ReLU), repeated for the N layers in the network. Inference runs the input forward through Layers 1…N with their weights to produce the output; training adds error backpropagation against the expected output (label). Weights come from DDR/RAM; each layer's inputs come from the previous layer or DDR/RAM, and its outputs feed the next layer.]
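As a concrete illustration of the per-layer pattern in the figure, here is a minimal NumPy sketch of one inference layer, using the common conv → ReLU → pool ordering. The shapes, filter size and pooling window are illustrative assumptions, not values from the talk.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D convolution: x is (H, W), w is (kH, kW)."""
    kh, kw = w.shape
    h, wd = x.shape
    out = np.zeros((h - kh + 1, wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def relu(x):
    """Pointwise activation."""
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    """Blockwise maximum over non-overlapping k x k tiles."""
    h, w = x.shape
    h, w = h - h % k, w - w % k          # trim to a multiple of k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

# One layer of the inference path: weights come from DDR/RAM,
# inputs come from the previous layer (or DDR/RAM for layer 1).
x = np.random.rand(28, 28)
w = np.random.rand(5, 5)
y = max_pool(relu(conv2d(x, w)))   # output feeds the next layer
print(y.shape)                     # (12, 12)
```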
• Training on HPC cluster
• Inference on embedded platform
• What about training on the embedded platform?
[Figure: AlexNet CNN visualization]
Complexity of a Standard Network: GoogLeNet
[Figure: GoogLeNet architecture with 9 inception layers (numbered 1–9); filter widths across the network are 256, 480, 480, 512, 512, 512, 832, 832 and 1024; node types are convolution, pooling, softmax and other.]
Computational cost is increased by less than 2x compared to AlexNet (<1.5 Bn operations/evaluation)
5M parameters
The Deeper the Better… Yet With Increased Complexity
Incremental Accuracy and the Power Efficiency Cost: Is It Worth It?
Where is the energy efficiency lost?
• Data movement burns much more power than arithmetic operations
• Focus is on reducing operations, not on reducing data movement
Energy cost (pJ, normalized to ADD = 1):
• Add = 1
• 64-word register file = 1.3
• Multiply = 3.4
• 4k SRAM = 44
• 32k SRAM = 61
• DRAM = 3556
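To make the gap concrete, here is a small back-of-the-envelope sketch using the normalized costs above. The accounting (two operand fetches per multiply-accumulate) is a simplifying assumption for illustration:

```python
# Normalized energy costs from the slide (pJ, ADD = 1).
COST = {"add": 1.0, "rf": 1.3, "mul": 3.4,
        "sram4k": 44.0, "sram32k": 61.0, "dram": 3556.0}

def mac_energy(operand_src):
    """One multiply-accumulate: two operand fetches + mul + add."""
    return 2 * COST[operand_src] + COST["mul"] + COST["add"]

print(mac_energy("rf"))     # 7.0    -> arithmetic dominates
print(mac_energy("dram"))   # 7116.4 -> data movement dominates, ~1000x
```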
Power cost of data movement
• Will this get better?
• Yes, but…
– Logic power decreases with each process node
– SRAM power decrease is less each time
– DRAM iterations are slower/less impactful
[Figure: power over time (process generations) for Logic, SRAM and LPDDR]
Solutions — 3 Pillars
• Optimized Software
• Appropriate Algorithms
• Processing Platform Architecture
Processing Platform Architecture
• Consider the processing platform vs. this pyramid
• More local memory allows expensive memory to be kept "Dark"
– Less power consumption
– More scope for software dataflow optimization
• A recent paper on "Dark Memory" [1] shows this formally
[1] http://arxiv.org/pdf/1602.04183v1.pdf
[Figure: two copies of the memory-hierarchy pyramid (Add, Multiply and register file at the tip; 4k SRAM, 32k SRAM and DRAM toward the base)]
Algorithms: Some are Better than Others
• Kernel level: ratio of processing cycles to load/store cycles (see the sketch after this list)
– Matrix-matrix is better than matrix-vector
– CNN example: batched AlexNet is better than AlexNet
– CNN example: GoogLeNet is better than AlexNet
• Optimize for locality of access
– Transform algorithm solutions to predictable dataflow
– Keep intermediate data as local as possible
– Move data once and complete all possible processing
• Cache systems can help, but can also be counterproductive
– Pre-fetching can burn more power
– Cache thrashing will also burn more power
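The matrix-matrix vs. matrix-vector point can be made concrete by counting arithmetic operations per word moved (arithmetic intensity). A sketch with illustrative sizes; the counting ignores caching effects:

```python
def intensity_gemv(n):
    """y = A @ x: 2*n*n flops over n*n + 2*n words moved (A, x, y)."""
    return (2 * n * n) / (n * n + 2 * n)

def intensity_gemm(n, batch):
    """Y = A @ X, X is n x batch: the weights A are reused batch times."""
    return (2 * n * n * batch) / (n * n + 2 * n * batch)

n = 1024
print(intensity_gemv(n))          # ~2.0 flops/word: load/store bound
print(intensity_gemm(n, 32))      # ~60 flops/word: compute bound
```

This is exactly why batching AlexNet helps: the same weights are moved once and reused across the whole batch.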
Optimized Software for energy efficiency?
• What is optimal energy efficiency?
• What do we want to do?
– Do as much compute as possible on the smallest span of data
– Keep higher-bandwidth data as local as we can
• How can we do this (quickly!)?
– Manually optimized code
– Directed Acyclic Graphs (DAGs)
– Deferred execution frameworks
• How do you know you are actually improving power efficiency?
EXCESS – Energy monitoring
• HPC (x86 and K40) system-level energy monitoring
• Tegra TK1 and TX1 energy monitoring
• Myriad 2 compute cluster energy monitoring
Directed Graph — Specifics
• OpenVX™: a Khronos industry standard supported by a variety of platforms
– Directed graph API
– Some built-in, vision-specific, common functions
• TensorFlow
– General compute API
– Directed-graph-style approach without a full graph API; the key ideas (see the sketch after this list):
· 'Process this'
· 'Process this with the output from that earlier task'
• Other viable directed graph approaches
– Android: RenderScript graph builder
– Various vendor-proprietary frameworks
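A minimal pure-Python sketch of the common idea behind these directed-graph APIs: declare named nodes and their inputs first ('process this with the output from that earlier task'), then let a scheduler walk the graph. Names and structure here are purely illustrative, not OpenVX's or TensorFlow's actual interfaces.

```python
class Graph:
    """Toy directed-graph builder: declare nodes first, run later."""
    def __init__(self):
        self.nodes = {}          # name -> (fn, names of input nodes)

    def node(self, name, fn, *inputs):
        self.nodes[name] = (fn, inputs)
        return name              # handle used to wire later nodes

    def run(self, name, cache=None):
        """Resolve dependencies depth-first, memoizing each result."""
        cache = {} if cache is None else cache
        if name not in cache:
            fn, inputs = self.nodes[name]
            cache[name] = fn(*(self.run(i, cache) for i in inputs))
        return cache[name]

g = Graph()
g.node("src", lambda: [1, 2, 3])                      # 'process this'
g.node("dbl", lambda xs: [2 * x for x in xs], "src")  # uses src's output
g.node("tot", sum, "dbl")                             # uses dbl's output
print(g.run("tot"))                                   # 12
```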
Deferred Execution
• Key idea: define the whole problem and hand it to a tool, which can then analyze the dataflow (a toy sketch follows this list)
– Pros: more abstract API
– Cons: the tool needs to do more work, and is less predictable
• Specific examples
– Halide for image processing
– OpenCL
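In the same spirit, a toy deferred-execution sketch (again illustrative; not Halide's or OpenCL's real API): operations are recorded rather than executed, so the tool sees the whole pipeline before anything runs and can apply all pointwise steps in a single fused pass over the data.

```python
class Lazy:
    """Records pointwise ops; nothing executes until .run()."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, fn):
        return Lazy(self.data, self.ops + (fn,))   # defer, don't compute

    def run(self):
        # The whole pipeline is visible here: one pass over the data,
        # with every deferred op applied per element (i.e. fused).
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

pipe = Lazy(range(5)).map(lambda x: x * x).map(lambda x: x + 1)
print(pipe.run())    # [1, 2, 5, 10, 17]
```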
Worked example: GoogLeNet Implementation
• GoogLeNet v1
– Lots of compute per input
– Lots of intermediate data
– Lots of weights (but pretty good in relative terms)
GoogLeNet Implementation
• Optimization 1: kernel fusion for key operations
– Conv + Bias + ReLU → single operation
– Conv + Bias + ReLU + Pool → more work but good returns
[Figure: fused CBRP kernel combining conv, bias, relu and pool into a single operation]
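A NumPy sketch of what the fused CBRP operation buys: the conv, bias and ReLU results are consumed immediately inside the pooling loop instead of being written to memory between kernels. Sizes and the pooling window are illustrative assumptions.

```python
import numpy as np

def cbrp_fused(x, w, b, pool=2):
    """Conv + bias + ReLU + pool in one pass: each conv output value is
    consumed immediately, so it never round-trips through DDR."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    ph, pw = oh // pool, ow // pool
    out = np.empty((ph, pw))
    for i in range(ph):
        for j in range(pw):
            best = 0.0                      # ReLU floor doubles as max seed
            for di in range(pool):
                for dj in range(pool):
                    r, c = i * pool + di, j * pool + dj
                    v = np.sum(x[r:r+kh, c:c+kw] * w) + b
                    best = max(best, v)     # relu(v) folded into the max
            out[i, j] = best
    return out

x, w = np.random.rand(12, 12), np.random.rand(3, 3)
print(cbrp_fused(x, w, b=-0.5).shape)       # (5, 5)
```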
GoogLeNet Implementation
• Optimization 2: 8-bit weights
• Optimization 3: keep activations in local SRAM
– Reduces DDR bandwidth to ~7 MB/sec/inference
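A sketch of the 8-bit-weights idea using symmetric linear quantization. This particular scheme is a common choice and an assumption here, not a statement of Movidius's exact method; the point is that weight traffic drops 4x versus float32 at the cost of a small, bounded error.

```python
import numpy as np

def quantize_weights(w):
    """Symmetric int8 quantization: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_weights(w)
err = np.abs(w - dequantize(q, s)).max()
print(q.nbytes, w.nbytes, err)   # 4096 bytes vs 16384; err <= scale/2
```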
CNN implementation problems
• Identified need
– General focus on performance, not energy efficiency
– Identified narrow industry solutions
– Identified issues with general solutions
• How does EXCESS help?
How Movidius is Deploying Standard CNNs at the Network Edge?
[Figure: deployment toolflow. A CNN model description and its weights for integration feed a tool that parses the CNN model and schedules it for efficient execution using kernel function libraries, targeting products such as drones and robotics; EXCESS influence feeds into this flow.]
GoogLeNet Relative GFLOPS/W Performance Results
GoogLeNet single inference (batch = 1), no heatsink or fan:

Platform | Process (nm) | fps | Power (W) | GFLOPS | GFLOPS/W
GPU-A    | 28           | 15  | 7.5       | 51.37  | 6.85
GPU-B    | 20           | 22  | 7.5       | 75.34  | 10.04
Myriad 2 | 28           | 25  | 1.2       | 85.61  | 71.34
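The three rows are mutually consistent with GoogLeNet v1 costing roughly 3.42 GFLOP per forward pass, a figure inferred from the table itself (GFLOPS divided by fps is the same for every row). A sketch of the arithmetic:

```python
GFLOP_PER_INFERENCE = 3.42   # inferred: GFLOPS / fps for each table row

def efficiency(fps, watts):
    """Derive throughput and energy efficiency from frame rate and power."""
    gflops = fps * GFLOP_PER_INFERENCE
    return gflops, gflops / watts

for name, fps, watts in [("GPU-A", 15, 7.5), ("GPU-B", 22, 7.5),
                         ("Myriad 2", 25, 1.2)]:
    print(name, *efficiency(fps, watts))  # matches the table to rounding
```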
Fathom Neural Compute Stick: Deep Learning’s Newest Form Factor
Where to next?
• CNNs are a very fast-changing research area
– New results frequently change research direction
• Exploration of optimized, energy-efficient inference
– Is it possible to change training to produce an energy-efficient inference?
• Is it possible to iterate training after each encounter with new data?
– Current systems are cloud-based
– What about live re-learning?
Acknowledgements
• Links / APIs
– http://www.tensorflow.org
– https://www.khronos.org/openvx/resources
– http://www.movidius.org
– http://halide-lang.org/
• Academic references
1. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era
2. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
3. DenseCap; http://arxiv.org/pdf/1511.07571v1.pdf
4. Embedded Deep Neural Networks: "The Cost of Everything and the Value of Nothing", David Moloney, CTO, Hot Chips 28, Cupertino, California