(based on andres rodriguez’ course “deep learning 101 ... · 3 deep learning • a branch of...

Post on 09-Jul-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Prepared for Intel Delta 7 series.

(based on Andres Rodriguez’ course “Deep Learning 101” https://software.intel.com/en-us/ai/academy )

2

• Deep learning overview and usages

• Intel optimized deep learning environment

• hardware, software, tools

• Training models

• Getting started resources

Content Outline

3

Deep Learning

• A branch of machine learning

• Data is passed through multiple non-linear transformations

• Objective: Learn the parameters of the transformation that minimize a cost function

4

Types of Deep Learning

• Supervised learning

• Data -> Labels

• Unsupervised learning

• No labels; Clustering; Reducing dimensionality

• Reinforcement learning

• Reward actions (e.g., robotics)

5

Types of Deep Learning

• Supervised learning

• Data -> Labels

• Unsupervised learning

• No labels; Clustering; Reducing dimensionality

• Reinforcement learning

• Reward actions (e.g., robotics)

6

Step 1: Training (Over Hours/Days/Weeks)

Supervised Learning

Person

90% person 8% traffic light

Input data

Output Classification

Create Deep network

Step 2: Inference (Real Time)

New input from camera and

sensors

Output Classification

Trained neural network model

97% person

Trained Model

Bigger Data Better Hardware Smarter Algorithms

7

Why Now?

Image: 1000 KB / picture

Audio: 5000 KB / song

Video: 5,000,000 KB / movie

Transistor density doubles every 18 months

Cost / GB in 1995: $1000.00

Cost / GB in 2015: $0.03

Advances in algorithm innovation, including neural networks, leading to better accuracy in training models

Why machine learning?

• The explosive growth of smart and connected devices. – agriculture, retail, industrial and automated driving (by 2020 all cars sold in USA will

have automated control regime )

• computer vision, using technology like Intel® RealSense™,

• Exponential growth of data generated: 1.5GB/day/human and upto 4 TB/day/device by 2020

• Cloud services running on customer data : Amazon, Google, Picasa etc etc.

8 Intel Confidential

Classification

Label the image

• Person

• Motorcyclist

• Bike https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

9

Detection

Detect and

label objects

https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

10

Semantic Segmentation

Label every pixel

https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf

11

Natural Language Object Retrieval

http://arxiv.org/pdf/1511.04164v3.pdf

12

Visual and Textual Question Answering

https://arxiv.org/pdf/1603.01417v1.pdf

13

Visuomotor Control

https://arxiv.org/pdf/1504.00702v5.pdf

14

Speech Recognition

The same architecture is used for English and Mandarin Chinese speech recognition

http://svail.github.io/mandarin/

15

Q&A Natural Language Understanding

https://arxiv.org/pdf/1506.07285.pdf

16

Personal Assistant

17

18

Deep Learning Use Cases

Cloud Service Providers

Financial Services

Healthcare

Automotive

• Personal assistant

• Image & Video recognition/tagging

• Natural language processing

• Automatic Speech recognition

• Targeted Ads

• Fraud / face detection

• Gaming, check processing

• Computer server monitoring

• Financial forecasting and prediction

• Network intrusion detection

• Recommender Systems

19

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source deep learning frameworks

Intel® Deep Learning SDK

Intel® Omni-Path Architecture (Intel® OPA)

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

+

Training Inference

Intel® MKL-DNN

+

20

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source machine learning frameworks

Intel® Deep Learning SDK

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

Training

Intel® MKL-DNN

Inference

+

Intel® Omni-Path Architecture (Intel® OPA)

+

21

• Knights Landing

• Up to >6 peak SP TFLOPs per socket

• Binary-compatible with Intel® Xeon® processors

• Up to 72 cores, 512-bit SIMD vectors with 2 VPU/core

• Integrated memory delivers superior bandwidth

• Integrated Intel® Omni-Path fabric (dual-port; 50 Gb/s )

• Distributes the training workload

Intel® Xeon Phi™ + Omni-Path Deep Learning Training Scalable System

48p Sw

48p Sw

48p Sw

48p Sw

48p Sw

4x KNL

2

48p Sw

L1

L2

“12-12"

48p Sw

4x KNL

4x KNL

0 1 11

48p Sw

12 13

48p Sw

*

. .

48p Sw

“4-4-4-4-4-4"

4x KNL

5

4x KNL

4x KNL

3 4

4x KNL

4x KNL

4x KNL

4x KNL

4x KNL

4x KNL

14 15 16

Storage

server

“2-2"

22

2013 2016

1ST GENERATION

XEON PHI

2ND GENERATION

XEON PHI

2017

Sin

gle

-Pre

cisi

on

Te

rafl

op

s

Knights Mill

Optimized for Deep Learning

Optimized for scale-out

Flexible, high capacity memory

Enhanced variable precision

Improved efficiency

Knights Mill: Next Gen Intel® Xeon Phi™ COMING 2017

23

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source machine learning frameworks

Intel® Deep Learning SDK

Intel® Omni-Path Architecture (Intel® OPA)

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

+

Training Inference

Intel® MKL-DNN

+

Diversity in Deep Networks

VVariety in Network Topology

Recurrent NNs common for NLP/ASR, DAG for GoogLeNet, Networks with memory…

BBut there are a few well defined building blocks

Convolutions common for image recognition tasks

GEMMs for recurrent network layers—could be sparse

ReLU, tanh, softmax

GoogLeNet

Recurrent NN

CNN - AlexNet

24

25

26

Optimized Deep Learning Environment

Intel® Omni-Path Architecture (Intel® OPA)

Intel® Math Kernel Library (Intel® MKL)

+

Intel® MKL-DNN

Intel® MKL-DNN – free open source DNN functions designed for max Intel HW performance and high-velocity integration with DL frameworks

• Open source DNN functions included in MKL 2017

• IA optimizations contributed by community

• Binary GEMM functions

• Apache* 2 license

Targeted release: Q4’ 2016 (APIs and preview Q3’ 2016)

+

Intel® Math Kernel Library (Intel® MKL)

• Optimized AVX-2 and AVX-512 instructions

• Intel® Xeon® and Intel® Xeon Phi™ processors

• Supports all common layers types

• Coming soon: Winograd-based convolutions

27

Naïve Convolution

https://en.wikipedia.org/wiki/Convolutional_neural_network

28

Cache Friendly Convolution

arxiv.org/pdf/1602.06709v1.pdf

29

30

Recent MKL SGEMM Improvements

• Up to 4× performance improvements for small sizes

• New APIs to eliminate data packing overheads if A or B matrices are re-used

• Pack once and use multiple times

• Up to 1.2× additional performance improvements over SGEMM

• Available in Intel MKL 2017 (released: September 6, 2016)

31

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source machine learning frameworks

Intel® Deep Learning SDK

Intel® Omni-Path Architecture (Intel® OPA)

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

Intel® Data Analytics Acceleration Library

(Intel® DAAL)

+

Training Inference

Intel® MKL-DNN

Drive optimizations across open source machine learning frameworks

+

Deep Learning Tools

Programming languages

Top Frameworks

Caffe

C/C++

32

MLlib

INTEL OPTIMIZED Caffe

•All the goodness of BVLC Caffe* + Integrated with Intel® MKL 2017 Multi-node distributed training

Forrest Iandola, et al., “Scaling DNN Training on Intel Platforms.” 2016

33

34

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source machine learning frameworks

Intel® Deep Learning SDK

Intel® Omni-Path Architecture (Intel® OPA)

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

Intel® Data Analytics Acceleration Library

(Intel® DAAL)

+

Training Inference

Intel® MKL-DNN

Drive optimizations across open source machine learning frameworks

35 Real-time graphs to see how the accuracy rate is trending

36

High-Level Workflow

.prototxt

.caffemodel

Trained Model

Model Optimizer

FP Quantize

Model Compress

End-Point

SW Developer

Import

Inference Run-Time

OpenVX

Application Logic

Forward Result

Real-time Data Validation Data

Model Analysis

MKL-DNN

Deploy-ready model

INFERENCE ON THE END POINT

37

Optimized Deep Learning Environment

Fuel the development of vertical solutions

Deliver best single node and multi-node performance

Accelerate design, training, and deployment

Drive optimizations across open source machine learning frameworks

Intel® Deep Learning SDK

Intel® Omni-Path Architecture (Intel® OPA)

Maximum performance on Intel architecture Intel® Math Kernel

Library (Intel® MKL)

+

Training Inference

Intel® MKL-DNN

+

38

Training

• Gradient descent and variants

• Batch sizes

• Distributed training

• Challenges

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝑑𝐽 𝒘(0)

𝑑𝒘

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝒘(1) = 𝒘(0) − 𝑑𝐽 𝒘(0)

𝑑𝒘

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝒘(1) = 𝒘(0) − 𝛼𝑑𝐽 𝒘(0)

𝑑𝒘

learning rate

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝒘(1) = 𝒘(0) − 𝛼𝑑𝐽 𝒘(0)

𝑑𝒘

𝒘(1)

too small

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝒘(1) = 𝒘(0) − 𝛼𝑑𝐽 𝒘(0)

𝑑𝒘

𝒘(1)

too large

Gradient Descent 𝐽 𝒘(0) = 𝑐𝑜𝑠𝑡(𝒘(0), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(0)

𝒘(1) = 𝒘(0) − 𝛼𝑑𝐽 𝒘(0)

𝑑𝒘

𝒘(1)

good enough

Gradient Descent 𝐽 𝒘(1) = 𝑐𝑜𝑠𝑡(𝒘(1), 𝒙𝑖)

𝑁

𝑖=1

𝒘 𝒘(2)

𝒘(2) = 𝒘(1) − 𝛼𝑑𝐽 𝒘(1)

𝑑𝒘

𝒘(1)

Gradient Descent 𝐽 𝒘(2) = 𝑐𝑜𝑠𝑡(𝒘(2), 𝒙𝑖)

𝑁

𝑖=1

𝒘

𝒘(3) = 𝒘(2) − 𝛼𝑑𝐽 𝒘(2)

𝑑𝒘

𝒘(2)

𝒘(3)

Gradient Descent 𝐽 𝒘(3) = 𝑐𝑜𝑠𝑡(𝒘(3), 𝒙𝑖)

𝑁

𝑖=1

𝒘

𝒘(4) = 𝒘(3) − 𝛼𝑑𝐽 𝒘(3)

𝑑𝒘

𝒘(4)

𝒘(3)

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝒘

Saddle Point

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Gradient Descent

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD)

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

Stochastic Gradient Descent (SGD) + Momentum

𝐽 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑁

𝑖=1

𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

⋮ 𝑀 = 𝑁/(𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒)

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

𝒘(𝑡+1) = 𝒘(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

(𝑡)

𝑑𝒘

SGD SGD + Momentum

Momentum

79

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Momentum

80

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Momentum

81

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Momentum

82

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Momentum

83

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Momentum

84

gradient

velocity 𝒘

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Batch Size 𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑇𝑖𝑚𝑒 𝑇𝑜 𝑇𝑟𝑎𝑖𝑛 (𝑇𝑇𝑇)

𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒

Batch Size 𝐽𝑏1 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏1

𝐽𝑏𝑀 𝒘 = 𝑐𝑜𝑠𝑡(𝒘, 𝒙𝑖)

𝑖∈𝑏𝑀

𝑇𝑖𝑚𝑒 𝑇𝑜 𝑇𝑟𝑎𝑖𝑛 (𝑇𝑇𝑇)

𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒

sweet spot

Overfitting

• Larger networks have a lot of weights

• The network can memorize the weights and do excellent on the training data and very poor on the validation data

• The networks does not generalize

Overfitting

• Larger networks have a lot of weights

• The network can memorize the weights and do excellent on the training data and very poor on the validation data

• The networks does not generalize

Overfitting

• Larger networks have a lot of weights

• The network can memorize the weights and do excellent on the training data and very poor on the validation data

• The networks does not generalize

Overfitting

• Larger networks have a lot of weights

• The network can memorize the weights and do excellent on the training data and very poor on the validation data

• The networks does not generalize

Overfitting

• Larger networks have a lot of weights

• The network can memorize the weights and do excellent on the training data and very poor on the validation data

• The networks does not generalize

Overfitting -- Solutions

• More training data

• Stop training when validation performance gets worse

• Penalize large weights – weight_decay

• Randomly ignored some weights in fully connected layers – dropout_ratio

• Training with single precision floating points

92

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

93

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

94

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

95

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

96

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

97

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

1 epoch = 1 cycle through all training samples

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

98

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

1 epoch = 1 cycle through all training samples

=10M iter ∗ 32 imgs/iter

1.28M imgs/epoch = 250 epochs

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

99

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Caffe solver.prototxt

net: "models/bvlc_googlenet/train_val.prototxt"

base_lr: 0.01

lr_policy: "step"

stepsize: 320000

gamma: 0.96

max_iter: 10000000

momentum: 0.9

weight_decay: 0.0002

100

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/solver.prototxt

batch_size: 32

𝒗(𝑡+1) = 𝜇𝒗(𝑡) − 𝛼𝑑𝐽𝑏𝑚 𝒘

𝑡

𝑑𝒘

𝒘(𝑡+1) = 𝒘(𝑡) + 𝒗(𝑡+1)

Caffe solver.prototxt cont’…

test_iter: 1000

test_interval: 4000

snapshot: 40000

snapshot_prefix: "models/bvlc_googlenet/bvlc_googlenet“

solver_mode: CPU

101

Caffe solver.prototxt cont’…

test_iter: 1000

test_interval: 4000

snapshot: 40000

snapshot_prefix: "models/bvlc_googlenet/bvlc_googlenet“

solver_mode: CPU

102

Caffe solver.prototxt cont’…

test_iter: 1000

test_interval: 4000

snapshot: 40000

snapshot_prefix: "models/bvlc_googlenet/bvlc_googlenet“

solver_mode: CPU

103

Choosing hyperparams

• Experience. Experience. Experience

• Look through examples and models and practice modifying them

104

105

Multi-node Distributed training

Model Parallelism

• Break the model into N nodes

• The same data is in all the nodes

Data Parallelism

• Break the dataset into N nodes

• The same model is in all the nodes

• Good for networks with few weights, e.g., GoogleNet

• Intel Optimized Caffe uses this. Model+Data Parallelism is work in progress.

106

Data Parallelism

Forrest Iandola, et al., “Scaling DNN Training on Intel Platforms.” 2016

107

Scaling Efficiency: Intel® Xeon Phi™ Processor Deep Learning Image Classification Training Performance - MULTI-NODE Scaling

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance . *Other names and brands may be property of others Configurations: • Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework

0

10

20

30

40

50

60

70

80

90

100

1 2 4 8 16 32 64 128

SCA

LIN

G E

FFIC

IEN

CY

%

# OF INTEL® XEON PHI™ PROCESSOR 7250 (68-CORES, 1.4 GHZ, 16 GB) NODES

OverFeat AlexNet VGG-A GoogLeNet

62

87

108

Challenges • Very large batch sizes train too slow and may

not reach the same accuracy

• GoogleNet performance decreases for batch sizes > 1024 𝑇𝑖𝑚𝑒 𝑇𝑜 𝑇𝑟𝑎𝑖𝑛 (𝑇𝑇𝑇)

𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒

sweet spot

109

Challenges • Total batch size = 1024:

• 1024 nodes each w/node batch size 1 (too much communication)

• 256 nodes each w/node batch size 4

• 64 nodes each w/node batch size 16 (most communication is hidden in the computation)

𝑇𝑖𝑚𝑒 𝑇𝑜 𝑇𝑟𝑎𝑖𝑛 (𝑇𝑇𝑇)

𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒

sweet spot

110

Challenges • Total batch size = 1024:

• 1024 nodes each w/node batch size 1 (too much communication)

• 256 nodes each w/node batch size 4

• 64 nodes each w/node batch size 16 (most communication is hidden in the computation)

• Guideline: Batch size is correlated with learning rate (not always linearly but it’s a good place to start)

𝑇𝑖𝑚𝑒 𝑇𝑜 𝑇𝑟𝑎𝑖𝑛 (𝑇𝑇𝑇)

𝑏𝑎𝑡𝑐ℎ 𝑠𝑖𝑧𝑒

sweet spot

111

https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture

112

Resources • Getting started with Caffe tutorial

• https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture

• Deep learning SDK

• https://software.intel.com/en-us/machine-learning/deep-learning/sdk-signup

• Intel Optimized Frameworks

• https://github.com/intel/caffe

• https://github.com/intel/theano

• https://github.com/intel/torch

• other frameworks coming soon…

For questions contact Nadya Plotnikova, Sr SW Engineer, CRT/DCG Russia

Nadezhda.Plotnikova@intel.com

113

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Legal Notices & Disclaimers

top related