AI on the Edge - Cambridge Wireless · Cyrus M. Vahid, Principal Solutions Architect
TRANSCRIPT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cyrus M. Vahid, Principal Solutions Architect, Principal Deep Learning Solution Architect
Oct 2017
AI On the Edge
Motivation
Training vs. Inference
• Training is performed in the cloud.
• Inference is performed everywhere.
• Efficiency of inference is indispensable to address:
  • Latency
  • Connectivity
  • Cost
  • Privacy/Security
Motivation
Large DNNs require huge amounts of memory, e.g.:
• AlexNet Caffe model is over 200 MB
• VGG-16 Caffe model is over 500 MB
Complex computation makes apps power hungry.
Edge devices have low power and small memory capacity.
∴ To run models on the edge, we need to compress them significantly.
Motivating Examples From Customers
• Industrial IoT (Out of Distribution/Anomaly Detection)
• Real Time Filtering (Neural Style Transfer)
• Building a Better Hearing Aid (Recurrent Acoustic Models)
• Security Robots (Object Detection and Recognition)
Autonomous Vehicles
Model Compression
Computational Efficiency
• The goal is to reduce floating point operations and the number of parameters.
Fast Fourier Transform
• Most effective for larger kernels
• Complexity: 2N² → 2N ln(N)
Winograd
• 2.25× reduction in multiplications for F(2×2, 3×3)
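As a hedged illustration of the FFT speed-up (a minimal NumPy sketch, not code from the talk): convolving via the frequency domain replaces the O(N²) direct sum with O(N log N) transforms, and gives the same result as direct convolution.

```python
import numpy as np

def fft_convolve(x, k):
    """Linear convolution of x and k via the FFT (zero-padded to avoid wrap-around)."""
    n = len(x) + len(k) - 1
    X = np.fft.rfft(x, n)          # O(n log n) transform
    K = np.fft.rfft(k, n)
    return np.fft.irfft(X * K, n)  # pointwise product, then inverse transform

x = np.random.rand(64)
k = np.random.rand(9)
assert np.allclose(fft_convolve(x, k), np.convolve(x, k))
```

The payoff grows with kernel size, which is why the slide notes FFT is most effective for larger kernels.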
Tensor Contraction Layer
• Input I is an N×M matrix (x_{i,j}); filter F is a D×E matrix (f_{i,j}).
• The filter is factored as a tensor product of smaller factors: F = F₁ ⊗ … ⊗ F_D.
Separable Kernels
• O(wᵈ) → O(w×d)
• Very effective on CPU
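A minimal NumPy sketch of why separable kernels help (illustrative names, not the talk's code): a rank-1 w×w kernel is the outer product of two length-w vectors, so one O(w²)-per-pixel 2-D pass becomes two O(w)-per-pixel 1-D passes with identical output.

```python
import numpy as np

def corr2d(img, K):
    """Naive 2-D cross-correlation: O(kh*kw) multiplies per output pixel."""
    kh, kw = K.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * K)
    return out

def corr1d_rows(img, v):
    """Valid 1-D cross-correlation of each row with v."""
    return np.array([np.correlate(row, v, mode="valid") for row in img])

def separable_corr2d(img, u, v):
    """Two 1-D passes replace one 2-D pass when K == np.outer(u, v)."""
    tmp = corr1d_rows(img, v)        # pass along rows: O(w) per pixel
    return corr1d_rows(tmp.T, u).T   # pass along columns: O(w) per pixel

rng = np.random.default_rng(1)
img = rng.random((10, 12))
u, v = rng.random(3), rng.random(3)
assert np.allclose(corr2d(img, np.outer(u, v)), separable_corr2d(img, u, v))
```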
Model Compression: Pruning-Quantization-Encoding
arXiv:1510.00149v5
Model Compression: Pruning
• Pruning removes connections that contribute least to the computation of a network.
• After training, all weights smaller than a certain threshold are removed, and the model is retrained.
• A 9-13× reduction in the number of parameters without loss of accuracy has been shown. [arXiv:1510.00149v5]
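The threshold step above can be sketched in a few lines of NumPy (an illustration, not the paper's code); in a real pipeline the mask would also be applied during retraining so pruned weights stay at zero.

```python
import numpy as np

def prune(W, threshold):
    """Zero out weights with magnitude below the threshold; return the keep-mask."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # stand-in for a trained weight matrix
Wp, mask = prune(W, threshold=1.0)
sparsity = 1.0 - mask.mean()             # fraction of weights removed (~0.68 here)
```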
Model Compression: Quantization
• Quantization is about using fewer bits to express the same information.
• Weight sharing is one method of quantization, using centroids as shared weights.
[arXiv:1510.00149v5]: weight sharing through scalar quantization.
• Good for taking advantage of low-precision hardware acceleration.
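Weight sharing via scalar quantization can be sketched as 1-D k-means over the weights (a minimal illustration in the spirit of Deep Compression, not the paper's code): each weight is replaced by its nearest centroid, so only a small index per weight plus a tiny codebook must be stored.

```python
import numpy as np

def kmeans_quantize(W, k=16, iters=20):
    """Cluster weights into k shared centroids; return quantized weights and indices."""
    w = W.ravel()
    centroids = np.linspace(w.min(), w.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()  # centroid update step
    return centroids[idx].reshape(W.shape), idx.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Wq, idx = kmeans_quantize(W, k=16)
# 16 clusters -> a 4-bit index per weight instead of a 32-bit float.
```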
Model Compression: Huffman Coding
• A Huffman code is an optimal prefix code commonly used for lossless data compression.
• It uses variable-length code words to encode source symbols.
• More common symbols are represented with fewer bits.
[arXiv:1510.00149v5]
• Probability distribution of quantized weights and the sparse matrix index of the last fully connected layer in AlexNet.
• Most of the quantized weights are distributed around the two peaks; the sparse matrix index differences are rarely above 20.
• Experiments show that Huffman coding these non-uniformly distributed values saves 20-30% of network storage.
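A minimal Huffman coder (a standard-library sketch, not the paper's implementation) makes the "more common symbols get fewer bits" point concrete; the peaked distribution of quantized-weight indices below is illustrative data, not from the paper.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build an optimal prefix code: frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap of (frequency, tiebreaker, {symbol: partial codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Peaked distribution of quantized-weight indices (illustrative):
indices = [7] * 50 + [8] * 40 + [3] * 6 + [12] * 4
codes = huffman_codes(indices)
```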
BMXNet: collaborators in the MXNet community brought this to binary weights. https://github.com/hpi-xnor/BMXNet
Reduced Architecture
SqueezeNet: AlexNet Accuracy with 50x Fewer Parameters
Good for devices with low RAM that cannot hold all the weights of larger models in memory concurrently.
Student/Teacher training
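The slide gives only the title; as a hedged illustration (cf. Kim and Rush in the references), knowledge distillation trains a small student to match the large teacher's temperature-softened output distribution. A minimal NumPy sketch of the distillation loss, with illustrative logits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; larger T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.5, -1.0]])   # illustrative logits
student = np.array([[1.5, 0.7, -0.9]])
loss = distillation_loss(student, teacher)
```

In practice this term is mixed with the usual hard-label cross-entropy; the loss is minimized when the student's softened distribution matches the teacher's.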
Comparing Techniques

|                                    | Winograd Convolutions | Separable Convolutions | Quantization | Tensor Contractions | Sparsity Exploitation | Weight Sharing |
|------------------------------------|-----------------------|------------------------|--------------|---------------------|-----------------------|----------------|
| CPU Acceleration                   | +                     | ++                     | =            | ++                  | +                     | +              |
| GPU Acceleration                   | +                     | +                      | +            | +                   | =                     | +              |
| Model Size                         | =                     | =                      | -            | -                   | -                     | -              |
| Model Accuracy                     | =                     | -                      | -            | -                   | -                     | -              |
| Specialized Hardware Acceleration  | +                     | +                      | ++           | +                   | +                     | +              |
Edge Compute Models – AWS IoT
Key Functions
• Data Ingest
• Compressed Inference
• Full Inference / Trained Model Query
• Model Training
Deployment Models
• Cloud ↔ Edge
• Cloud ↔ Hub ↔ Edge
Edge Analytics Trends: Reduce Latency, Reduce Transfer Costs
AWS Deep Learning Infrastructure Tools
• P2 Instances: up to 40K CUDA cores
• Deep Learning AMI, preconfigured for deep learning: MXNet, TensorFlow, …
• CFM Template: Deep Learning Cluster
Apache MXNet
Most Open, Best On AWS
• Optimized for deep learning on AWS
• Accepted into the Apache Incubator
[Chart: multi-GPU scaling efficiency across 1-256 GPUs for Inception v3, ResNet, and AlexNet versus ideal scaling; ~88% efficiency]
Amazon AI: Scaling With MXNet
Manage and Monitor Models on The Fly
• Captured data is uploaded as tagged data to AWS.
• Inference escalates to an AI service or to a custom model on P2.
• Models are deployed and managed from the cloud.
• Local learning loop: poorly classified data → updated model → fine-tune the model with accurate classification.
References
• arXiv:1510.00149v5: Deep Compression; Han, Mao, and Dally
• arXiv:1509.09308v2: Fast Algorithms for Convolutional Neural Networks; Lavin and Gray
• arXiv:1706.00439v1: Tensor Contraction Layers; Anima Anandkumar et al.
• arXiv:1606.09274v1: Compression of NMT via Pruning; See, Luong, and Manning
• http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf: Pruning Winograd- and FFT-based algorithms; Liu and Turakhia
• https://colfaxresearch.com/falcon-library/
• https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
• https://en.wikipedia.org/wiki/Fast_Fourier_transform
• https://arxiv.org/pdf/1611.06321.pdf: Learning the Number of Neurons in Deep Networks
• https://aclweb.org/anthology/D16-1139: Sequence Level Knowledge Distillation; Kim and Rush
Thank you!
Cyrus M. [email protected]