TRANSCRIPT
High Performance, Low Power, CNN Implementation on Ultra-Low Power Embedded Platforms
This project is part of the portfolio of the A.3 – Advanced Computing and Complex System Unit, Communications Networks, Content and Technology DG, European Commission
www.excess.eu | Copyright © 2013 - 2016 The EXCESS Consortium
Contract Number: 611183
Total Cost [€]: 3.31 million
Starting Date: 2013-09-01
Duration: 36 months
Brendan Barry, VP of Engineering
Movidius
Why are CNNs interesting?
• Deep Learning: the Great Disruptor
• Companies using CNNs can enable sought-after features otherwise difficult or impossible
Why are CNNs interesting to Movidius?
DRONES
• Sense and avoid
• GPS-denied hovering
• Scene labeling

SERVICE ROBOTICS
• Volumetric mapping
• Indoor visual navigation

AR/VR
• 6DOF positional tracking
• Depth sensing
• Gesture & eye tracking
• See-through

SECURITY CAMERAS
• Detection
• Recognition
• Identification
Deep Convolutional Neural Networks (CNNs):
[Figure: CNN structure. Each layer applies convolution, max-pooling (blockwise maximum) and activation (pointwise ReLU), repeated for the N layers in the network. Inference runs the input forward through Layers 1…N with their weights to produce the output; training adds error backpropagation against the expected output (label). Weights come from DDR/RAM; each layer's inputs come from the previous layer or DDR/RAM, and its outputs feed the next layer.]
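As a concrete illustration of the per-layer pattern in the figure, here is a minimal NumPy sketch of one inference layer, using the common conv → ReLU → pool ordering. The shapes, filter size and pooling window are illustrative assumptions, not values from the talk.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D convolution: x is (H, W), w is (kH, kW)."""
    kh, kw = w.shape
    h, wd = x.shape
    out = np.zeros((h - kh + 1, wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def relu(x):
    """Pointwise activation."""
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    """Blockwise maximum over non-overlapping k x k tiles."""
    h, w = x.shape
    h, w = h - h % k, w - w % k          # trim to a multiple of k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

# One layer of the inference path: weights come from DDR/RAM,
# inputs come from the previous layer (or DDR/RAM for layer 1).
x = np.random.rand(28, 28)
w = np.random.rand(5, 5)
y = max_pool(relu(conv2d(x, w)))   # output feeds the next layer
print(y.shape)                     # (12, 12)
```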
• Training on HPC cluster
• Inference on embedded platform
• What about training on the embedded platform?
[Figure: AlexNet CNN visualization]
Complexity of a Standard Network: GoogLeNet
[Figure: GoogLeNet architecture with 9 inception layers (numbered 1–9); filter widths across the network are 256, 480, 480, 512, 512, 512, 832, 832 and 1024; node types are convolution, pooling, softmax and other.]
Computational cost is increased by less than 2x compared to AlexNet (<1.5 Bn operations/evaluation)
5M parameters
The Deeper the Better… Yet With Increased Complexity
Incremental Accuracy and the Power Efficiency Cost: Is It Worth It?
Where is the energy efficiency lost?
• Data movement burns much more power than arithmetic operations
• Focus is on reducing operations, not on reducing data movement
Energy cost (pJ, normalized to ADD = 1):
• Add = 1
• 64-word register file = 1.3
• Multiply = 3.4
• 4k SRAM = 44
• 32k SRAM = 61
• DRAM = 3556
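To make the gap concrete, here is a small back-of-the-envelope sketch using the normalized costs above. The accounting (two operand fetches per multiply-accumulate) is a simplifying assumption for illustration:

```python
# Normalized energy costs from the slide (pJ, ADD = 1).
COST = {"add": 1.0, "rf": 1.3, "mul": 3.4,
        "sram4k": 44.0, "sram32k": 61.0, "dram": 3556.0}

def mac_energy(operand_src):
    """One multiply-accumulate: two operand fetches + mul + add."""
    return 2 * COST[operand_src] + COST["mul"] + COST["add"]

print(mac_energy("rf"))     # 7.0    -> arithmetic dominates
print(mac_energy("dram"))   # 7116.4 -> data movement dominates, ~1000x
```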
Power cost of data movement
• Will this get better?
• Yes, but…
– Logic power decreases with each process node
– SRAM power decrease is less each time
– DRAM iterations are slower/less impactful
[Figure: power over time (process generations) for Logic, SRAM and LPDDR]
Solutions — 3 Pillars
• Optimized Software
• Appropriate Algorithms
• Processing Platform Architecture
Processing Platform Architecture
• Consider the processing platform vs. this pyramid
• More local memory allows expensive memory to be kept "Dark"
– Less power consumption
– More scope for software dataflow optimization
• A recent paper on "Dark Memory" [1] shows this formally
[1] http://arxiv.org/pdf/1602.04183v1.pdf
[Figure: two copies of the memory-hierarchy pyramid (Add, Multiply and register file at the tip; 4k SRAM, 32k SRAM and DRAM toward the base)]
Algorithms: Some are Better than Others
• Kernel level: ratio of processing cycles to load/store cycles (see the sketch after this list)
– Matrix-matrix is better than matrix-vector
– CNN example: batched AlexNet is better than AlexNet
– CNN example: GoogLeNet is better than AlexNet
• Optimize for locality of access
– Transform algorithm solutions to predictable dataflow
– Keep intermediate data as local as possible
– Move data once and complete all possible processing
• Cache systems can help, but can also be counterproductive
– Pre-fetching can burn more power
– Cache thrashing will also burn more power
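The matrix-matrix vs. matrix-vector point can be made concrete by counting arithmetic operations per word moved (arithmetic intensity). A sketch with illustrative sizes; the counting ignores caching effects:

```python
def intensity_gemv(n):
    """y = A @ x: 2*n*n flops over n*n + 2*n words moved (A, x, y)."""
    return (2 * n * n) / (n * n + 2 * n)

def intensity_gemm(n, batch):
    """Y = A @ X, X is n x batch: the weights A are reused batch times."""
    return (2 * n * n * batch) / (n * n + 2 * n * batch)

n = 1024
print(intensity_gemv(n))          # ~2.0 flops/word: load/store bound
print(intensity_gemm(n, 32))      # ~60 flops/word: compute bound
```

This is exactly why batching AlexNet helps: the same weights are moved once and reused across the whole batch.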
Optimized Software for energy efficiency?
• What is optimal energy efficiency?
• What do we want to do?
– Do as much compute as possible on the smallest span of data
– Keep higher-bandwidth data as local as we can
• How can we do this (quickly!)?
– Manually optimized code
– Directed Acyclic Graphs (DAGs)
– Deferred execution frameworks
• How do you know you are actually improving power efficiency?
EXCESS – Energy monitoring
• HPC (x86 and K40) system-level energy monitoring
• Tegra TK1 and TX1 energy monitoring
• Myriad 2 compute cluster energy monitoring
Directed Graph — Specifics
• OpenVX™: a Khronos industry standard supported by a variety of platforms
– Directed graph API
– Some built-in, vision-specific, common functions
• TensorFlow
– General compute API
– Directed-graph-style approach without a full graph API; the key ideas (see the sketch after this list):
· 'Process this'
· 'Process this with the output from that earlier task'
• Other viable directed graph approaches
– Android: RenderScript graph builder
– Various vendor-proprietary frameworks
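A minimal pure-Python sketch of the common idea behind these directed-graph APIs: declare named nodes and their inputs first ('process this with the output from that earlier task'), then let a scheduler walk the graph. Names and structure here are purely illustrative, not OpenVX's or TensorFlow's actual interfaces.

```python
class Graph:
    """Toy directed-graph builder: declare nodes first, run later."""
    def __init__(self):
        self.nodes = {}          # name -> (fn, names of input nodes)

    def node(self, name, fn, *inputs):
        self.nodes[name] = (fn, inputs)
        return name              # handle used to wire later nodes

    def run(self, name, cache=None):
        """Resolve dependencies depth-first, memoizing each result."""
        cache = {} if cache is None else cache
        if name not in cache:
            fn, inputs = self.nodes[name]
            cache[name] = fn(*(self.run(i, cache) for i in inputs))
        return cache[name]

g = Graph()
g.node("src", lambda: [1, 2, 3])                      # 'process this'
g.node("dbl", lambda xs: [2 * x for x in xs], "src")  # uses src's output
g.node("tot", sum, "dbl")                             # uses dbl's output
print(g.run("tot"))                                   # 12
```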
Deferred Execution
• Key idea: define the whole problem and hand it to a tool, which can then analyze the dataflow (a toy sketch follows this list)
– Pros: more abstract API
– Cons: the tool needs to do more work, and is less predictable
• Specific examples
– Halide for image processing
– OpenCL
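In the same spirit, a toy deferred-execution sketch (again illustrative; not Halide's or OpenCL's real API): operations are recorded rather than executed, so the tool sees the whole pipeline before anything runs and can apply all pointwise steps in a single fused pass over the data.

```python
class Lazy:
    """Records pointwise ops; nothing executes until .run()."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, fn):
        return Lazy(self.data, self.ops + (fn,))   # defer, don't compute

    def run(self):
        # The whole pipeline is visible here: one pass over the data,
        # with every deferred op applied per element (i.e. fused).
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

pipe = Lazy(range(5)).map(lambda x: x * x).map(lambda x: x + 1)
print(pipe.run())    # [1, 2, 5, 10, 17]
```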
Worked example: GoogLeNet Implementation
• GoogLeNet v1
– Lots of compute per input
– Lots of intermediate data
– Lots of weights (but pretty good in relative terms)
GoogLeNet Implementation
• Optimization 1: kernel fusion for key operations
– Conv + Bias + ReLU → single operation
– Conv + Bias + ReLU + Pool → more work but good returns
[Figure: fused CBRP kernel combining conv, bias, relu and pool into a single operation]
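A NumPy sketch of what the fused CBRP operation buys: the conv, bias and ReLU results are consumed immediately inside the pooling loop instead of being written to memory between kernels. Sizes and the pooling window are illustrative assumptions.

```python
import numpy as np

def cbrp_fused(x, w, b, pool=2):
    """Conv + bias + ReLU + pool in one pass: each conv output value is
    consumed immediately, so it never round-trips through DDR."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    ph, pw = oh // pool, ow // pool
    out = np.empty((ph, pw))
    for i in range(ph):
        for j in range(pw):
            best = 0.0                      # ReLU floor doubles as max seed
            for di in range(pool):
                for dj in range(pool):
                    r, c = i * pool + di, j * pool + dj
                    v = np.sum(x[r:r+kh, c:c+kw] * w) + b
                    best = max(best, v)     # relu(v) folded into the max
            out[i, j] = best
    return out

x, w = np.random.rand(12, 12), np.random.rand(3, 3)
print(cbrp_fused(x, w, b=-0.5).shape)       # (5, 5)
```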
GoogLeNet Implementation
• Optimization 2: 8-bit weights
• Optimization 3: keep activations in local SRAM
– Reduces DDR bandwidth to ~7 MB/sec/inference
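A sketch of the 8-bit-weights idea using symmetric linear quantization. This particular scheme is a common choice and an assumption here, not a statement of Movidius's exact method; the point is that weight traffic drops 4x versus float32 at the cost of a small, bounded error.

```python
import numpy as np

def quantize_weights(w):
    """Symmetric int8 quantization: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_weights(w)
err = np.abs(w - dequantize(q, s)).max()
print(q.nbytes, w.nbytes, err)   # 4096 bytes vs 16384; err <= scale/2
```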
CNN implementation problems
• Identified need
– General focus on performance, not energy efficiency
– Identified narrow industry solutions
– Identified issues with general solutions
• How does EXCESS help?
How Movidius is Deploying Standard CNNs at the Network Edge?
[Figure: deployment toolflow. A CNN model description and its weights for integration feed a tool that parses the CNN model and schedules it for efficient execution using kernel function libraries, targeting products such as drones and robotics; EXCESS influence feeds into this flow.]
GoogLeNet Relative GFLOPS/W Performance Results
GoogLeNet single inference (batch = 1), no heatsink or fan:

Platform | Process (nm) | fps | Power (W) | GFLOPS | GFLOPS/W
GPU-A    | 28           | 15  | 7.5       | 51.37  | 6.85
GPU-B    | 20           | 22  | 7.5       | 75.34  | 10.04
Myriad 2 | 28           | 25  | 1.2       | 85.61  | 71.34
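The three rows are mutually consistent with GoogLeNet v1 costing roughly 3.42 GFLOP per forward pass, a figure inferred from the table itself (GFLOPS divided by fps is the same for every row). A sketch of the arithmetic:

```python
GFLOP_PER_INFERENCE = 3.42   # inferred: GFLOPS / fps for each table row

def efficiency(fps, watts):
    """Derive throughput and energy efficiency from frame rate and power."""
    gflops = fps * GFLOP_PER_INFERENCE
    return gflops, gflops / watts

for name, fps, watts in [("GPU-A", 15, 7.5), ("GPU-B", 22, 7.5),
                         ("Myriad 2", 25, 1.2)]:
    print(name, *efficiency(fps, watts))  # matches the table to rounding
```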
Fathom Neural Compute Stick: Deep Learning’s Newest Form Factor
Where to next?
• CNNs are a very fast-changing research area
– New results frequently change research direction
• Exploration of optimized, energy-efficient inference
– Is it possible to change training to produce an energy-efficient inference?
• Is it possible to iterate training after each encounter with new data?
– Current systems are cloud-based
– What about live re-learning?
Acknowledgements
• Links / APIs
– http://www.tensorflow.org
– https://www.khronos.org/openvx/resources
– http://www.movidius.org
– http://halide-lang.org/
• Academic references
1. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era
2. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
3. DenseCap; http://arxiv.org/pdf/1511.07571v1.pdf
4. Embedded Deep Neural Networks: "The Cost of Everything and the Value of Nothing", David Moloney, CTO, Hot Chips 28, Cupertino, California