
Page 1: Intro to deep learning with GPUs

March 2016

Deep Learning on GPUs

Page 2: Intro to deep learning with GPUs


AGENDA

What is Deep Learning?

GPUs and DL

DL in practice

Scaling up DL

Page 3: Intro to deep learning with GPUs


What is Deep Learning?

Page 4: Intro to deep learning with GPUs

DEEP LEARNING EVERYWHERE

INTERNET & CLOUD
Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation

MEDIA & ENTERTAINMENT
Video Captioning, Video Search, Real-Time Translation

AUTONOMOUS MACHINES
Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

SECURITY & DEFENSE
Face Detection, Video Surveillance, Satellite Imagery

MEDICINE & BIOLOGY
Cancer Cell Detection, Diabetic Grading, Drug Discovery

Page 5: Intro to deep learning with GPUs

Traditional machine perception: hand-crafted feature extractors

Pipeline: Raw data → Feature extraction → Classifier/detector → Result

Typical classifiers and models: SVM, shallow neural net, HMM, clustering, LDA, LSA

Example tasks: speaker ID, speech transcription, topic classification, machine translation, sentiment analysis, …

Page 6: Intro to deep learning with GPUs

Deep learning approach

Train: feed labeled examples (dog, cat, raccoon, honey badger) through the MODEL, compare predictions against the labels, and use the errors to update the model.

Deploy: the trained MODEL classifies new inputs (dog, cat, honey badger).

Page 7: Intro to deep learning with GPUs

Artificial neural network: a collection of simple, trainable mathematical units that collectively learn complex functions

Input layer → Hidden layers → Output layer

Given sufficient training data, an artificial neural network can approximate very complex functions mapping raw data to output decisions.

Page 8: Intro to deep learning with GPUs

Artificial neurons

Biological neuron vs. artificial neuron (from Stanford CS231n lecture notes)

Artificial neuron: inputs x1, x2, x3 with weights w1, w2, w3 produce output

y = F(w1*x1 + w2*x2 + w3*x3), with activation F(x) = max(0, x)
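The artificial neuron above fits in a few lines of code. This is a minimal sketch; the input and weight values are made up for illustration, not from the slides:

```python
import numpy as np

def neuron(x, w):
    """Single artificial neuron: weighted sum followed by ReLU.

    y = F(w1*x1 + w2*x2 + w3*x3), with F(x) = max(0, x)
    """
    return np.maximum(0.0, np.dot(w, x))

x = np.array([1.0, 2.0, -1.0])   # illustrative inputs
w = np.array([0.5, -0.25, 2.0])  # illustrative weights
print(neuron(x, w))  # weighted sum is -2.0, ReLU clamps it to 0.0
```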

Page 9: Intro to deep learning with GPUs

Deep neural network (DNN)

Raw data → low-level features → mid-level features → high-level features → result

Application components:
- Task objective, e.g. identify faces
- Training data: 10-100M images
- Network architecture: ~10 layers, 1B parameters
- Learning algorithm: ~30 exaflops, ~30 GPU-days

Page 10: Intro to deep learning with GPUs

Deep learning benefits

- Robust
  - No need to design the features ahead of time; features are automatically learned to be optimal for the task at hand
  - Robustness to natural variations in the data is automatically learned
- Generalizable
  - The same neural net approach can be used for many different applications and data types
- Scalable
  - Performance improves with more data, and the method is massively parallelizable

Page 11: Intro to deep learning with GPUs


Baidu Deep Speech 2

English and Mandarin speech recognition

Transition from English to Mandarin made simpler by end-to-end DL

No feature engineering or Mandarin-specifics required

More accurate than humans

Error rate 3.7% vs. 4% for human tests

http://svail.github.io/mandarin/

http://arxiv.org/abs/1512.02595

End-to-end Deep Learning for English and Mandarin Speech Recognition

Page 12: Intro to deep learning with GPUs


AlphaGo

Training DNNs: 3 weeks, 340 million training steps on 50 GPUs

Play: Asynchronous multi-threaded search

Simulations on CPUs, policy and value DNNs in parallel on GPUs

Single machine: 40 search threads, 48 CPUs, and 8 GPUs

Distributed version: 40 search threads, 1202 CPUs and 176 GPUs

Outcome: Beat both European and World Go champions in best of 5 matches

First Computer Program to Beat a Human Go Professional

http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
http://deepmind.com/alpha-go.html

Page 13: Intro to deep learning with GPUs


Deep Learning for Autonomous vehicles

Page 14: Intro to deep learning with GPUs


Deep Learning Synthesis

Texture synthesis and transfer using CNNs. Timo Aila et al., NVIDIA Research

Page 15: Intro to deep learning with GPUs

THE AI RACE IS ON

[Chart: ImageNet classification accuracy rate, 2009-2016, with deep learning overtaking traditional computer vision]

Milestones:
- IBM Watson achieves breakthrough in natural language processing
- Facebook launches Big Sur
- Baidu Deep Speech 2 beats humans
- Google launches TensorFlow
- Microsoft & U. Science & Tech, China beat humans on IQ
- Toyota invests $1B in AI labs

Page 16: Intro to deep learning with GPUs


The Big Bang in Machine Learning

“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

DNN + GPU + BIG DATA

Page 17: Intro to deep learning with GPUs


GPUs and DL

USE MORE PROCESSORS TO GO FASTER

Page 18: Intro to deep learning with GPUs


Deep learning development cycle

Page 19: Intro to deep learning with GPUs


Three Kinds of Networks

DNN – all fully connected layers

CNN – some convolutional layers

RNN – recurrent neural network, LSTM

Page 20: Intro to deep learning with GPUs

DNN: key operation is dense M x V (matrix-vector multiply)

Backpropagation uses dense matrix-matrix multiply, starting from the softmax scores.
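The forward pass of one fully connected layer is exactly this dense M x V product. A minimal NumPy sketch, with illustrative sizes and random values:

```python
import numpy as np

# Forward pass through one fully connected layer: a dense matrix-vector
# product (M x V), followed by a ReLU nonlinearity.
M, N = 4, 3                      # output and input sizes (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((M, N))  # weight matrix
x = rng.standard_normal(N)       # one input vector

h = np.maximum(0.0, W @ x)       # M x V multiply, then ReLU
print(h.shape)  # (4,)
```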

Page 21: Intro to deep learning with GPUs

DNN: batching for training and latency-insensitive inference makes the key operation M x M

Batched operation is M x M, which gives re-use of the weights; without batching, each element of the weight matrix would be used only once. Modern compute architectures want 10-50 arithmetic operations per memory fetch.
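The weight re-use argument can be checked with a quick arithmetic-intensity estimate. A sketch with illustrative sizes (not from the slides):

```python
import numpy as np

# Batching B input vectors into a matrix turns the M x V multiply into a
# matrix-matrix multiply, so each weight is reused across the whole batch.
M, N, B = 1024, 1024, 32
rng = np.random.default_rng(1)
W = rng.standard_normal((M, N)).astype(np.float32)
X = rng.standard_normal((N, B)).astype(np.float32)  # batch of B columns

H = W @ X  # one GEMM for the whole batch

# Rough arithmetic intensity: 2*M*N*B flops over (M*N + N*B + M*B) values.
flops = 2 * M * N * B
words = W.size + X.size + H.size
print(round(flops / words, 1))  # 60.2 ops per value touched; grows with B
```

With B = 1 the same estimate drops to roughly 1 op per value, which is why the unbatched M x V case is memory-bound.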

Page 22: Intro to deep learning with GPUs

CNN: requires convolution and M x V

Filters are conserved across the image plane.

Multiply-limited, even without batching.
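A naive sketch of the convolution itself, showing how the same filter weights are reused at every output location (the image and filter values are made up):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in CNNs).

    Every output pixel reuses the same filter weights, which is why
    convolutions stay multiply-limited even without batching.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2)) / 4.0          # 2x2 averaging filter
print(conv2d_valid(img, k))        # 3x3 output of window averages
```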

Page 23: Intro to deep learning with GPUs

Other operations to finish building a DNN

These are not limiting factors with appropriate GPU use

Complex networks have hundreds of millions of weights.

Page 24: Intro to deep learning with GPUs


Lots of Parallelism Available in a DNN

Page 25: Intro to deep learning with GPUs

TESLA M40: World's Fastest Accelerator for Deep Learning Training

[Chart: Caffe training time in days; GPU server with 4x Tesla M40 vs. dual-CPU server; 13x faster training]

Specifications:
- CUDA cores: 3072
- Peak SP: 7 TFLOPS
- GDDR5 memory: 12 GB
- Bandwidth: 288 GB/s
- Power: 250 W
- Efficiency: 28 GFLOP/W

Reduce training time from 13 days to just 1 day.

Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5 GHz CPUs, 128 GB system memory, Ubuntu 14.04.

Page 26: Intro to deep learning with GPUs

Comparing server-class CPU and GPU: Xeon E5-2698 and Tesla M40

NVIDIA Whitepaper “GPU based deep learning inference: A performance and power analysis.”

Page 27: Intro to deep learning with GPUs


DL in practice

Page 28: Intro to deep learning with GPUs

NVIDIA GPU PLATFORM: The Engine of Modern AI

Used across education, start-ups (Vitruvian, Schults Laboratories), and industry.

Frameworks and platforms: CNTK, TensorFlow, DL4J, Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Chainer, Big Sur, Watson, OpenDeep, Keras

*U. Washington, CMU, Stanford, TuSimple, NYU, Microsoft, U. Alberta, MIT, NYU Shanghai

Page 29: Intro to deep learning with GPUs

CUDA for Deep Learning Development

TITAN X | DEVBOX | GPU CLOUD

DEEP LEARNING SDK: DIGITS, cuDNN, cuBLAS, cuSPARSE, NCCL

Page 30: Intro to deep learning with GPUs

cuDNN: Deep Learning Primitives, Accelerating Artificial Intelligence

- GPU-accelerated deep learning subroutines
- High-performance neural network training
- Accelerates major deep learning frameworks: Caffe, Theano, Torch, TensorFlow
- Up to 3.5x faster AlexNet training in Caffe than baseline GPU
- Tiled FFT up to 2x faster than FFT

[Chart: millions of images trained per day, cuDNN 1 through cuDNN 4, with relative speedup from 0.0x to 2.5x]

developer.nvidia.com/cudnn

Page 31: Intro to deep learning with GPUs

CUDA BOOSTS DEEP LEARNING: 5X PERFORMANCE IN 2 YEARS

[Chart: Caffe performance, 11/2013 to 12/2015: K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4]

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12-core 2.5 GHz, 128 GB system memory, Ubuntu 14.04.

Page 32: Intro to deep learning with GPUs

NVIDIA DIGITS: Interactive Deep Learning GPU Training System

Process Data → Configure DNN → Monitor Progress → Visualize Layers → Test Image

developer.nvidia.com/digits

Page 33: Intro to deep learning with GPUs

ONE ARCHITECTURE, END-TO-END AI

- DRIVE PX for auto
- Titan X for PC gaming
- Tesla for cloud
- Jetson for embedded

Page 34: Intro to deep learning with GPUs


Scaling DL

Page 35: Intro to deep learning with GPUs

Scaling Neural Networks: Data Parallelism

Machine 1 processes Image 1 and Machine 2 processes Image 2; each holds a full copy of the weights W, synchronized after each step.

Notes:
- Need to sync the model across machines.
- The largest models do not fit on one GPU.
- Requires a P-fold larger batch size.
- Works across many nodes (parameter-server approach) with linear speedup.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
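The sync step can be sketched as gradient averaging across replicas. Worker count and values here are illustrative, not from the slides:

```python
import numpy as np

# Data parallelism sketch: each of P workers computes a gradient on its own
# mini-batch shard; gradients are averaged (the "sync" step), and every
# replica applies the same update, keeping the weight copies identical.
P = 4                               # number of workers (illustrative)
rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3))     # shared model weights
lr = 0.1

local_grads = [rng.standard_normal(W.shape) for _ in range(P)]
avg_grad = sum(local_grads) / P     # all-reduce / parameter-server sync
W -= lr * avg_grad                  # identical update on every replica
print(W.shape)  # (3, 3)
```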

Page 36: Intro to deep learning with GPUs

Multiple GPUs: near-linear scaling, data parallel

Ren Wu et al, Baidu, “Deep Image: Scaling up Image Recognition.” arXiv 2015

Page 37: Intro to deep learning with GPUs

Scaling Neural Networks: Model Parallelism

Machine 1 and Machine 2 each hold part of the weights W and process the same Image 1.

Notes:
- Allows for larger models than fit on one GPU.
- Requires much more frequent communication between GPUs.
- Most commonly used within a node (GPU P2P).
- Effective for the fully connected layers.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
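A minimal sketch of splitting one fully connected layer across two devices; the shapes and the two-way split are illustrative:

```python
import numpy as np

# Model parallelism sketch: the weight matrix W of a fully connected layer
# is split by output rows across two devices; each computes its slice of
# the output, and the halves are exchanged and concatenated (the frequent
# communication step noted above).
M, N = 8, 4
rng = np.random.default_rng(3)
W = rng.standard_normal((M, N))
x = rng.standard_normal(N)

W0, W1 = W[:M // 2], W[M // 2:]     # "device 0" and "device 1" shards
y0, y1 = W0 @ x, W1 @ x             # computed in parallel, one per device
y = np.concatenate([y0, y1])        # exchange partial results

assert np.allclose(y, W @ x)        # same answer as the unsplit layer
print(y.shape)  # (8,)
```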

Page 38: Intro to deep learning with GPUs

Scaling Neural Networks: Hyperparameter Parallelism

Try many alternative neural networks in parallel, on different CPUs / GPUs / machines. Probably the most obvious and effective way!
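A toy sketch of this idea: run several candidate settings concurrently and keep the best. The objective here is a made-up stand-in for a real training run:

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(lr):
    # Stand-in for a full training run: returns a made-up validation score
    # that peaks near lr = 0.1 (illustrative only).
    return -(lr - 0.1) ** 2

# Each candidate is independent, so the sweep parallelizes trivially
# across CPUs, GPUs, or machines.
candidates = [0.001, 0.01, 0.1, 1.0]
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(train_and_score, candidates))

best = candidates[scores.index(max(scores))]
print(best)  # 0.1
```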

Page 39: Intro to deep learning with GPUs


Deep Learning Everywhere

NVIDIA Titan X

NVIDIA Jetson

NVIDIA Tesla

NVIDIA DRIVE PX

Contact: [email protected]