high performance, low power, cnn implementation on ultra-low...

26
High Performance, Low Power, CNN implementation on Ultra-Low Power embedded platforms This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit Communications Networks, Content and Technology DG European Commission www.excess.eu Copyright © 2013 - 2016 The EXCESS Consortium Contract Number: 611183 Total Cost [€]: 3.31 million Starting Date: 2013-09-01 Duration: 36 months Brendan Barry VP of Engineering Movidius

Upload: others

Post on 28-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

HighPerformance,LowPower,CNNimplementationonUltra-LowPower

embeddedplatforms

This project is part of the portfolio of theA.3 – Advanced Computing and Complex System UnitCommunications Networks, Content and Technology DGEuropean Commission

www.excess.euCopyright © 2013 - 2016 The EXCESS Consortium

Contract Number: 611183Total Cost [€]: 3.31 millionStarting Date: 2013-09-01

Duration: 36 months

BrendanBarryVPofEngineering

Movidius

Page 2: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

WhyareCNN’sinteresting?

n Deep-Learning:theGreatDisruptorn CompaniesusingCNNscanenablesoughtafterfeaturesotherwisedifficultor

impossible

Brendan Barry 2

Page 3: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Why are CNNs interesting to Movidius?

Brendan Barry 3

DRONES• Sense and Avoid• GPS-denied hovering• Scene labeling

SERVICE ROBOTICS

• Volumetric mapping• Indoor visual navigation

AR/VR• 6DOF positional tracking• Depth sensing• Gesture & Eye tracking• See-through

SECURITY CAMERAS• Detection• Recognition• Identification

Page 4: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

DeepConvolutional NeuralNetworks(CNNs):

Brendan Barry 4

Input

Output

Weights

Layer2

Layer3

Layer4

LayerN

Layer1

Inference

+Weights

ErrorBackpropagation

Expectedoutput(label)

ConvolutionMax-Pooling(blockwisemaximum)

Activation(pointwiseReLU)

RepeatedforNlayersinNetwork

WeightsfromDDR/RAM

InputsPreviouslayerorDDR/RAM

InputstoNextlayer

Training

n Training on HPC clustern Inference on embedded platformn What about training on embedded platform?

Page 5: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

AlexNet CNN (Visualization)

Brendan Barry 5

Page 6: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

ComplexityofaStandardNetwork:GoogLeNet

Brendan Barry 6

ConvolutionPoolingSoftmaxOther

1 2 34

5 67

8 9

9InceptionLayers

256 480 480 512512 512

832 832 1024

Widthfilters

Computationalcostisincreasedbylessthan2XcomparedtoAlexNet (<1.5Bnoperations/evaluation)

5Mparameters

Page 7: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

TheDeeperTheBetter…YetWithIncreasedComplexity

Brendan Barry 7

Page 8: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

IncrementalAccuracyandthePowerEfficiencyCost:IsItWorthIt?

Brendan Barry 8

Page 9: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Where is the energy efficiency lost?

n Data movement burns much more power than arithmetic operationsn Focus is on reducing operations, not on reducing data movement

(pJ / normalized to ADD=1)

Brendan Barry 9

Add

64wordRF=1.3

Multiply=3.4

4kSRAM=44

32kSRAM=61

DRAM=3556

Page 10: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Power cost of data movement

n Will this get better? n Yes but…

n Logic power decreases with each noden SRAM power decrease is less each timen DRAM iterations are slower/less impactful

Brendan Barry 10

LPDDR

SRAM

Logic

Time

Page 11: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Solutions — 3 Pillars

Brendan Barry 11

OptimizedSoftware

AppropriateAlgorithms

ProcessingPlatform

Architecture

Page 12: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Processing Platform Architecture

• Consider processing platform vs. this pyramid

• More local memory allows expensive memory to be kept “Dark”– Less power consumption– More scope for software dataflow

optimization

• Recent paper on “Dark Memory”[1]

shows this formally

Brendan Barry 12

[1]http://arxiv.org/pdf/1602.04183v1.pdf

Add

RegFile

Multiply

4kSRAM

32kSRAM

DRAM

RegFile

Multiply

4kSRAM

32kSRAM

DRAM

Add

Page 13: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Algorithms: Some are Better than Others

n Kernel Level: Ratio of processing cycles to load/store cyclesn Matrix-Matrix better than Matrix-Vectorn CNN Example: Batched AlexNet better than AlexNetn CNN Example: GoogleNet better than AlexNet

n Optimize for locality of accessn Transform algorithm solutions to predictable dataflown Keep intermediate data as local as possiblen Move data once and complete all processing possible

n Cache systems can help, but can also be counter productiven Pre-fetching can burn more powern Cache trashing will also burn more power

Brendan Barry 13

Page 14: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Optimized Software for energy efficiency?

n What is Optimal energy efficiency?

n What do we want to do?n Do as much compute as possible on as smallest span of datan Keep higher bandwidth data as local as we can

n How can we do this (quickly!)?n Manually optimized coden Directed Acyclic Graphs (DAG) n Deferred execution frameworks

n How do you know you are actually improving power efficiency?

Brendan Barry 14

Page 15: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

EXCESS – Energy monitoring

n HPC (x86 and K40) systems system energy monitoring

n Tegra TK1, TX1 energy monitoring

n Myriad2 compute cluster energy monitoring

Brendan Barry 15

Page 16: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Directed Graph — Specifics

n OpenVxTM: Khronos industry standard supported by variety of platformsn Directed graph APIn Some built-in, vision-specific, common functions

n Tensorflown General compute APIn Directed graph type approach without full graph API — key ideas

n ‘Process this’n ‘Process this with the output from that earlier task’

n Other viable Directed Graph approachesn Android: RenderScript Graph buildern Various vendor proprietary Frameworks

Brendan Barry 16

Page 17: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Deferred Execution

n Key idea: Define all of the problem and give it to a tool. Tool can analyze the data flown Pros: more abstract APIn Cons: tool needs to do more work, less predictable

n Specific examplesn Halide for image processingn OpenCL

Brendan Barry 17

Page 18: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Worked example: GoogleNet Implementation

n GoogLe-Net V1n Lots of compute per inputn Lots of intermediate datan Lots of weights (but pretty good in relative

terms)

Brendan Barry 18

Page 19: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

GoogleNet Implementation

n Optimization1: Kernel Fusion for Key operationsn Conv + Bias + RELU à Single Operationn Conv + Bias + RELU + Pool à More work but

good returns

Brendan Barry 19

CBRP

conv

bias

relu

pool

Page 20: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

GoogleNet Implementation

n Optimization2: 8 bit weightsn Optimization3: Keep activations in local SRAM

n Reduce DDR BW to ~7MB/sec/inference

Brendan Barry 20

Page 21: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

CNN implementation problems

n Identified needn General focus on performance not energy efficiencyn Identified industry narrow solutionsn Identified issues with general solutions

n How does EXCESS help?

Brendan Barry 21

Page 22: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

HowMovidius isDeployingStandardCNNsattheNetworkEdge?

Brendan Barry 22

CNN Model Description

WeightsFor Integration

ParsesCNNModelScheduler

Efficient Execution of Kernel Function Libraries

Drones

Robotics

EXCESSinfluence

Page 23: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

GoogleNet RelativeGFLOPS/WPerformanceResults

Brendan Barry 23

GoogLeNet Single Inference (Batch = 1) no Heatsink or Fan

nmProcess fps Power(W)

GFLOPS GFLOPS/W

GPU-A 28 15 7.5 51.37 6.85

GPU-B 20 22 7.5 75.34 10.04

Myriad2 28 25 1.2 85.61 71.34

Page 24: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Fathom Neural Compute Stick: Deep Learning’s Newest Form Factor

Brendan Barry 24

Page 25: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Where to next?

n CNN, very fast changing research arean New results frequently change research direction

n Exploration of optimised energy efficient inferencen Is it possible to change training to produce an energy efficient

inference?

n Is it possible to iterate training after each encounter of new data?n Current systems cloud basedn What above live re-learning?

Brendan Barry 25

Page 26: High Performance, Low Power, CNN implementation on Ultra-Low …excess-project.eu/excess_workshop/Chalmers16/EXCESS... · 2016-09-29 · Complexity of a Standard Network: GoogLeNet

Acknowledgements

n Links APIs n http://www.tensorflow.orgn https://www.khronos.org/openvx/resourcesn http://www.movidius.orgn http://halide-lang.org/

n Academic references1. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era2. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems3. Densecap ; http://arxiv.org/pdf/1511.07571v1.pdf4. EMBEDDED DEEP NEURAL NETWORKS “The Cost of Everything and the Value of

Nothing” David Moloney, CTO Hot Chips 28, Cupertino, California

Brendan Barry 26