AI on the Edge - Cambridge Wireless · Cyrus M. Vahid, Principal Solutions Architect
TRANSCRIPT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cyrus M. Vahid, Principal Solutions Architect, Principal Deep Learning Solution Architect
Oct 2017
AI On the Edge
Motivation
Training vs. Inference
• Training is performed in the cloud.
• Inference is performed everywhere.
• Efficiency of inference is indispensable to address:
  • Latency
  • Connectivity
  • Cost
  • Privacy/Security
Motivation
Large DNNs require huge amounts of memory, e.g.:
• AlexNet Caffe model is over 200 MB
• VGG-16 Caffe model is over 500 MB
Complex computation makes apps power hungry.
Edge devices have low power and small memory capacity.
∴ To run models on the edge, we need to compress them significantly.
Motivating Examples From Customers
• Industrial IoT (Out of Distribution/Anomaly Detection)
• Real Time Filtering (Neural Style Transfer)
• Building a Better Hearing Aid (Recurrent Acoustic Models)
• Security Robots (Object Detection and Recognition)
Autonomous Vehicles
Model Compression
Computational Efficiency
• The goal is to reduce floating point operations and the number of parameters.
Fast Fourier Transform
• Most effective for larger kernels
• Complexity: 2N² → 2N ln(N)
Winograd
• 2.25× reduction in multiplications for F(2×2, 3×3)
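As a hedged illustration of the FFT speed-up (a minimal NumPy sketch, not code from the talk): convolving via the frequency domain replaces the O(N²) direct sum with O(N log N) transforms, and gives the same result as direct convolution.

```python
import numpy as np

def fft_convolve(x, k):
    """Linear convolution of x and k via the FFT (zero-padded to avoid wrap-around)."""
    n = len(x) + len(k) - 1
    X = np.fft.rfft(x, n)          # O(n log n) transform
    K = np.fft.rfft(k, n)
    return np.fft.irfft(X * K, n)  # pointwise product, then inverse transform

x = np.random.rand(64)
k = np.random.rand(9)
assert np.allclose(fft_convolve(x, k), np.convolve(x, k))
```

The payoff grows with kernel size, which is why the slide notes FFT is most effective for larger kernels.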
Tensor Contraction Layer
• Input I is an N×M matrix (x_{i,j}); filter F is a D×E matrix (f_{i,j}).
• The filter is factored as a tensor product of smaller factors: F = F₁ ⊗ … ⊗ F_D.
Separable Kernels
• O(wᵈ) → O(w×d)
• Very effective on CPU
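A minimal NumPy sketch of why separable kernels help (illustrative names, not the talk's code): a rank-1 w×w kernel is the outer product of two length-w vectors, so one O(w²)-per-pixel 2-D pass becomes two O(w)-per-pixel 1-D passes with identical output.

```python
import numpy as np

def corr2d(img, K):
    """Naive 2-D cross-correlation: O(kh*kw) multiplies per output pixel."""
    kh, kw = K.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * K)
    return out

def corr1d_rows(img, v):
    """Valid 1-D cross-correlation of each row with v."""
    return np.array([np.correlate(row, v, mode="valid") for row in img])

def separable_corr2d(img, u, v):
    """Two 1-D passes replace one 2-D pass when K == np.outer(u, v)."""
    tmp = corr1d_rows(img, v)        # pass along rows: O(w) per pixel
    return corr1d_rows(tmp.T, u).T   # pass along columns: O(w) per pixel

rng = np.random.default_rng(1)
img = rng.random((10, 12))
u, v = rng.random(3), rng.random(3)
assert np.allclose(corr2d(img, np.outer(u, v)), separable_corr2d(img, u, v))
```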
Model Compression: Pruning-Quantization-Encoding
arXiv:1510.00149v5
Model Compression: Pruning
• Pruning removes connections that contribute least to the computation of a network.
• After training, all weights smaller than a certain threshold are removed, and the model is retrained.
• A 9-13× reduction in the number of parameters without loss of accuracy has been shown. [arXiv:1510.00149v5]
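The threshold step above can be sketched in a few lines of NumPy (an illustration, not the paper's code); in a real pipeline the mask would also be applied during retraining so pruned weights stay at zero.

```python
import numpy as np

def prune(W, threshold):
    """Zero out weights with magnitude below the threshold; return the keep-mask."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # stand-in for a trained weight matrix
Wp, mask = prune(W, threshold=1.0)
sparsity = 1.0 - mask.mean()             # fraction of weights removed (~0.68 here)
```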
Model Compression: Quantization
• Quantization is about using fewer bits to express the same information.
• Weight sharing is one method of quantization, using centroids as shared weights.
[arXiv:1510.00149v5]: weight sharing through scalar quantization.
• Good for taking advantage of low-precision hardware acceleration.
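Weight sharing via scalar quantization can be sketched as 1-D k-means over the weights (a minimal illustration in the spirit of Deep Compression, not the paper's code): each weight is replaced by its nearest centroid, so only a small index per weight plus a tiny codebook must be stored.

```python
import numpy as np

def kmeans_quantize(W, k=16, iters=20):
    """Cluster weights into k shared centroids; return quantized weights and indices."""
    w = W.ravel()
    centroids = np.linspace(w.min(), w.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()  # centroid update step
    return centroids[idx].reshape(W.shape), idx.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Wq, idx = kmeans_quantize(W, k=16)
# 16 clusters -> a 4-bit index per weight instead of a 32-bit float.
```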
Model Compression: Huffman Coding
• A Huffman code is an optimal prefix code commonly used for lossless data compression.
• It uses variable-length code words to encode source symbols.
• More common symbols are represented with fewer bits.
[arXiv:1510.00149v5]
• Probability distribution of quantized weights and the sparse matrix index of the last fully connected layer in AlexNet.
• Most of the quantized weights are distributed around the two peaks; the sparse matrix index differences are rarely above 20.
• Experiments show that Huffman coding these non-uniformly distributed values saves 20-30% of network storage.
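A minimal Huffman coder (a standard-library sketch, not the paper's implementation) makes the "more common symbols get fewer bits" point concrete; the peaked distribution of quantized-weight indices below is illustrative data, not from the paper.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build an optimal prefix code: frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap of (frequency, tiebreaker, {symbol: partial codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Peaked distribution of quantized-weight indices (illustrative):
indices = [7] * 50 + [8] * 40 + [3] * 6 + [12] * 4
codes = huffman_codes(indices)
```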
BMXNet: collaborators in the MXNet community brought this to binary weights. https://github.com/hpi-xnor/BMXNet
Reduced Architecture
SqueezeNet: AlexNet Accuracy with 50x Fewer Parameters
Good for devices with low RAM that cannot hold all the weights of larger models in memory concurrently.
Student/Teacher training
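The slide gives only the title; as a hedged illustration (cf. Kim and Rush in the references), knowledge distillation trains a small student to match the large teacher's temperature-softened output distribution. A minimal NumPy sketch of the distillation loss, with illustrative logits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; larger T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.5, -1.0]])   # illustrative logits
student = np.array([[1.5, 0.7, -0.9]])
loss = distillation_loss(student, teacher)
```

In practice this term is mixed with the usual hard-label cross-entropy; the loss is minimized when the student's softened distribution matches the teacher's.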
Comparing Techniques

|                                    | Winograd Convolutions | Separable Convolutions | Quantization | Tensor Contractions | Sparsity Exploitation | Weight Sharing |
|------------------------------------|-----------------------|------------------------|--------------|---------------------|-----------------------|----------------|
| CPU Acceleration                   | +                     | ++                     | =            | ++                  | +                     | +              |
| GPU Acceleration                   | +                     | +                      | +            | +                   | =                     | +              |
| Model Size                         | =                     | =                      | -            | -                   | -                     | -              |
| Model Accuracy                     | =                     | -                      | -            | -                   | -                     | -              |
| Specialized Hardware Acceleration  | +                     | +                      | ++           | +                   | +                     | +              |
Edge Compute Models – AWS IoT
Key Functions
• Data Ingest
• Compressed Inference
• Full Inference / Trained Model Query
• Model Training
Deployment Models
• Cloud ↔ Edge
• Cloud ↔ Hub ↔ Edge
Edge Analytics Trends: Reduce Latency, Reduce Transfer Costs
AWS Deep Learning Infrastructure Tools
• P2 Instances: up to 40K CUDA cores
• Deep Learning AMI, preconfigured for deep learning: MXNet, TensorFlow, …
• CFM Template: Deep Learning Cluster
Apache MXNet
Most Open, Best On AWS
• Optimized for deep learning on AWS
• Accepted into the Apache Incubator
[Chart: multi-GPU scaling efficiency across 1-256 GPUs for Inception v3, ResNet, and AlexNet versus ideal scaling; ~88% efficiency]
Amazon AI: Scaling With MXNet
Manage and Monitor Models on The Fly
• Captured data is uploaded as tagged data to AWS.
• Inference escalates to an AI service or to a custom model on P2.
• Models are deployed and managed from the cloud.
• Local learning loop: poorly classified data → updated model → fine-tune the model with accurate classification.
References
• arXiv:1510.00149v5: Deep Compression; Han, Mao, and Dally
• arXiv:1509.09308v2: Fast Algorithms for Convolutional Neural Networks; Lavin and Gray
• arXiv:1706.00439v1: Tensor Contraction Layers; Anima Anandkumar et al.
• arXiv:1606.09274v1: Compression of NMT via Pruning; See, Luong, and Manning
• http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf: Pruning Winograd- and FFT-based algorithms; Liu and Turakhia
• https://colfaxresearch.com/falcon-library/
• https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
• https://en.wikipedia.org/wiki/Fast_Fourier_transform
• https://arxiv.org/pdf/1611.06321.pdf: Learning the Number of Neurons in Deep Networks
• https://aclweb.org/anthology/D16-1139: Sequence Level Knowledge Distillation; Kim and Rush
Thank you!
Cyrus M. [email protected]