Differences of Deep Learning Frameworks


Page 1: Differences of Deep Learning Frameworks

Tutorial: Deep Learning Implementations and Frameworks

Seiya Tokui*, Kenta Oono*, Atsunori Kanemura+, Toshihiro Kamishima+

*Preferred Networks, Inc. (PFN) {tokui,oono}@preferred.jp

+National Institute of Advanced Industrial Science and Technology (AIST) [email protected], [email protected]

2nd session


Page 2: Differences of Deep Learning Frameworks

Overview of this tutorial

• 1st session (KO, 8:30 – 10:00)
  • Introduction
  • Basics of neural networks
  • Common design of neural network implementations

• 2nd session (ST, 10:30 – 12:30)
  • Differences of deep learning frameworks
  • Coding examples of frameworks
  • Conclusion

Page 3: Differences of Deep Learning Frameworks

Differences of Deep Learning Frameworks

Seiya Tokui

Preferred Networks, Inc.


Page 4: Differences of Deep Learning Frameworks

Objective of this part

• List the design choices of NN frameworks

• Introduce the objective differences between existing frameworks on these choices
  • Two or more choices for each topic
  • Pros/cons of each choice


Page 5: Differences of Deep Learning Frameworks

Outline

•Recall the steps of training NNs

•Quick comparison of existing frameworks

•Details of design choices


Page 6: Differences of Deep Learning Frameworks

Outline

•Recall the steps of training NNs

•Quick comparison of existing frameworks

•Details of design choices


Page 7: Differences of Deep Learning Frameworks

Steps for Training Neural Networks

[Flowchart]

Prepare the training dataset

Initialize the NN parameters

Repeat until meeting some criterion:
  • Prepare the next (mini)batch
  • Define how to compute the loss of this batch
  • Compute the loss (forward prop)
  • Compute the gradient (backprop)
  • Update the NN parameters

Save the NN parameters
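A minimal NumPy sketch of this loop may help connect the steps. Everything below (the one-layer linear model, the toy data, the learning rate, the file name params.npz) is an illustrative stand-in, not any particular framework's API.

```python
import numpy as np

rng = np.random.RandomState(0)

# Prepare the training dataset (toy regression data; shapes are illustrative).
X, T = rng.randn(1000, 10), rng.randn(1000, 1)

# Initialize the NN parameters (a single linear layer).
W, b = 0.01 * rng.randn(10, 1), np.zeros(1)

for epoch in range(10):                        # repeat until meeting some criterion
    for i in range(0, len(X), 100):            # prepare the next (mini)batch
        x, t = X[i:i + 100], T[i:i + 100]
        y = x.dot(W) + b                       # compute the loss (forward prop)
        loss = ((y - t) ** 2).mean()
        gy = 2 * (y - t) / len(x)              # compute the gradient (backprop)
        gW, gb = x.T.dot(gy), gy.sum(axis=0)
        W -= 0.01 * gW                         # update the NN parameters
        b -= 0.01 * gb

np.savez('params.npz', W=W, b=b)               # save the NN parameters
```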


Page 8: Differences of Deep Learning Frameworks

Training of Neural Networks

[Flowchart: the same training steps as the previous slide, with some of the steps marked "automated", i.e., handled by the framework.]


Page 9: Differences of Deep Learning Frameworks

Training of Neural Networks

[Flowchart: the same training steps again, highlighting the steps that frameworks automate.]


Page 10: Differences of Deep Learning Frameworks

Framework Design Choices

• The most crucial part of an NN framework is
  • how to define the parameters, and
  • how to define the loss function of the parameters (= how to write computational graphs)

• These also influence the APIs for forward prop, backprop, and parameter updates (i.e., numerical optimization)

• And all of these are determined by how computational graphs are implemented

• Other parts are also important, but they are mostly common to implementations of other types of machine learning methods


Page 11: Differences of Deep Learning Frameworks

Outline

•Recall the steps of training NNs

•Quick comparison of existing frameworks

•Details of design choices


Page 12: Differences of Deep Learning Frameworks

List of Frameworks (not exhaustive)

• Torch.nn

• Theano and ones on top of it (Keras, Blocks, Lasagne, etc.)
  • We omit an introduction of each of these frameworks here, since 1) there are too many frameworks on top of Theano, and 2) most of them share characteristics derived from Theano

• Caffe

• autograd (NumPy, Torch)

• Chainer

• MXNet

• TensorFlow


Page 13: Differences of Deep Learning Frameworks

Torch.nn


• MATLAB-like environment built on LuaJIT

• Fast scripting, CPU/GPU support with unified array backend

Page 14: Differences of Deep Learning Frameworks

Theano (and ones on top of it)


• Support computational optimizations and compilations

• Python package to build computational graphs

Page 15: Differences of Deep Learning Frameworks

Caffe

• Fast implementation of NNs in C++

• Mainly focusing on computer vision applications


Page 16: Differences of Deep Learning Frameworks

autograd (NumPy, Torch)

• The original version adds automatic differentiation on top of the NumPy API

• It is also ported to Torch


Page 17: Differences of Deep Learning Frameworks

Chainer

• Support backprop through dynamically constructed graphs

• It also provides a NumPy-compatible GPU array backend


Page 18: Differences of Deep Learning Frameworks

MXNet

• Mixed paradigm support (symbolic/imperative computations)

• It also supports distributed computations


Page 19: Differences of Deep Learning Frameworks

TensorFlow

• Fast execution by distributed computations

• It also supports some control flows on top of the graphs


Page 20: Differences of Deep Learning Frameworks

Framework Comparison: Basic information*

Viewpoint: Torch.nn** | Theano*** | Caffe | autograd (NumPy, Torch) | Chainer | MXNet | TensorFlow

GitHub stars: 4,719 | 3,457 | 9,590 | NumPy: 654 / Torch: 554 | 1,295 | 3,316 | 20,981

Started from: 2002 | 2008 | 2013 | 2015 | 2015 | 2015 | 2015

Open issues/PRs: 97/26 | 525/105 | 407/204 | NumPy: 9/0 / Torch: 3/1 | 95/25 | 271/18 | 330/33

Main developers: Facebook, Twitter, Google, etc. | Université de Montréal | BVLC (U.C. Berkeley) | NumPy: HIPS (Harvard Univ.) / Torch: Twitter | Preferred Networks | DMLC | Google

Core languages: C/Lua | C/Python | C++ | Python/Lua | Python | C++ | C++/Python

Supported languages: Lua | Python | C++/Python/MATLAB | Python/Lua | Python | C++/Python/R/Julia/Go, etc. | C++/Python

* Data was taken on Apr. 12, 2016
** Includes statistics of Torch7
*** There are many frameworks on top of Theano, though we omit them due to space constraints


Page 21: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 22: Differences of Deep Learning Frameworks

Framework Comparison: Design Choices

Design choice: Torch.nn | Theano-based | Caffe | autograd (NumPy, Torch) | Chainer | MXNet | TensorFlow

1. NN definition: Script (Lua) | Script* (Python) | Data (protobuf) | Script (Python, Lua) | Script (Python) | Script (many) | Script (Python)

2. Graph construction: Prebuild | Prebuild | Prebuild | Dynamic | Dynamic | Prebuild** | Prebuild

3. Backprop: Through graph | Extended graph | Through graph | Extended graph | Through graph | Through graph | Extended graph

4. Parameters: Hidden in operators | Separate nodes | Hidden in operators | Separate nodes | Separate nodes | Separate nodes | Separate nodes

5. Update formula: Outside of graphs | Part of graphs | Outside of graphs | Outside of graphs | Outside of graphs | Outside of graphs** | Part of graphs

6. Optimization: - | Advanced optimization | - | - | - | - | Simple optimization

7. Parallel computation: Multi GPU | Multi GPU (libgpuarray) | Multi GPU | Multi GPU (Torch) | Multi GPU | Multi node / Multi GPU | Multi node / Multi GPU

* Some Theano-based frameworks use data (e.g., YAML)
** Dynamic dependency analysis and optimization is supported (no autodiff support)

Page 23: Differences of Deep Learning Frameworks

Outline

• Recall the steps of training NNs

• Quick comparison of existing frameworks

• Details of design choices


Page 24: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 25: Differences of Deep Learning Frameworks

How to write NNs in text format

Write NNs in declarative configuration files

Framework builds layers of NNs as written in the files (e.g. prototxt, YAML).

E.g.: Caffe (prototxt), Pylearn2 (YAML)


Write NNs by procedural scripting

Framework provides APIs of scripting languages to build NNs.

E.g.: most other frameworks
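To make the contrast concrete, here is a sketch of the same small NN written both ways. The YAML-like text and the Affine/ReLU classes are hypothetical stand-ins, not the actual schema or API of Caffe, Pylearn2, or any other framework.

```python
import numpy as np

# Declarative style: the NN is data that the framework parses and interprets
# (this YAML-like text is illustrative only).
config_text = """
layers:
  - {type: affine, in: 784, out: 100}
  - {type: relu}
  - {type: affine, in: 100, out: 10}
"""

# Procedural style: the NN is built by calling APIs of a scripting language.
class Affine:
    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_in, n_out)
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return x.dot(self.W) + self.b

class ReLU:
    def __call__(self, x):
        return np.maximum(x, 0)

layers = [Affine(784, 100), ReLU(), Affine(100, 10)]

def forward(x):
    for layer in layers:
        x = layer(x)
    return x

y = forward(np.random.randn(32, 784))   # a (32, 10) batch of scores
```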

Page 26: Differences of Deep Learning Frameworks

How to write NNs in text format

Write NNs in declarative configuration files

High portability: The configuration files are easy to parse and to reuse in other frameworks.

Low flexibility: Most static data formats do not support structured programming, so it is hard to write complex NNs.


Write NNs by procedural scripting

Low portability: It requires much effort to port NNs to other frameworks.

High flexibility: Users can use the abstraction power of the scripting language when building NNs.

Page 27: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 28: Differences of Deep Learning Frameworks

2. How to build computational graphs

[Flowchart: the training-loop flowchart shown twice. Left, "Build once, run several times": "Define how to compute the loss" sits outside the loop, before it starts. Right, "Build one at every iteration": "Define how to compute the loss" sits inside the loop, executed at every iteration.]

Page 29: Differences of Deep Learning Frameworks

2. How to build computational graphs


Build once, run several times

Computational graphs are built once before entering the loop.

E.g.: most frameworks (Torch.nn, Theano, Caffe, TensorFlow, MXNet, etc.)

Build one at every iteration

Computational graphs are rebuilt at every iteration.

E.g.: autograd, Chainer

Page 30: Differences of Deep Learning Frameworks

2. How to build computational graphs


Build once, run several times

Easy to optimize the computations: The framework can optimize the computational graphs when constructing them.

Low flexibility and usability: Users cannot build different graphs for different iterations using the host language's syntax.

Build one at every iteration

Hard to optimize the computations: Optimizing the graph at every iteration is generally too costly.

High flexibility and usability: Users can build different graphs for different iterations using the host language's syntax.

Page 31: Differences of Deep Learning Frameworks

Flexibility and availability of the host language's syntax at runtime

Example: recurrent nets for variable length sequences

[Figure: four minibatches (Batch 1–4) of sequences with different lengths, each requiring a differently unrolled recurrent graph.]

In the “build once” approach, we must build all possible graphs beforehand, or use framework-specific “control flow operators”.

In the “build every time” approach, we can use the for loops of the underlying language to build such graphs, with data-dependent termination conditions.
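A small sketch of the “build every time” idea for this case. The toy RNN below uses plain NumPy, with the understanding that a dynamic-graph framework would record the operations as they execute, so that backprop follows the same data-dependent loop; the parameter names and sizes are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
W_x, W_h = 0.1 * rng.randn(5, 8), 0.1 * rng.randn(8, 8)   # toy RNN parameters

def forward(sequence):
    # An ordinary Python for loop: the number of unrolled steps is decided
    # by the length of this particular sequence, at run time.
    h = np.zeros(8)
    for x_t in sequence:
        h = np.tanh(x_t.dot(W_x) + h.dot(W_h))
    return h

# Batches with different sequence lengths need no pre-built set of graphs.
for length in (3, 7, 5):
    h = forward(rng.randn(length, 5))
    print(length, h[:3])
```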

Page 32: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 33: Differences of Deep Learning Frameworks

3. How to compute backprop


Backprop through graphs

The framework only builds the graph for forward prop, and does backprop by backtracking the graph.

E.g.: Torch.nn, Caffe, MXNet, Chainer

Backprop as extended graphs

Framework builds graphs for backprop as well as those for forward prop.

E.g.: Theano, TensorFlow

[Figure: a forward graph computing y = mul(a, b) and z = sub(y, c). In the extended-graph approach, backprop appends new nodes to the graph: dz = 1 (∇z z = 1), dy = id(dz), dc = neg(dz), da = mul(dy, b), db = mul(dy, a).]
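A toy scalar autodiff sketch of the “backprop through graphs” approach for the figure's example (z = a·b − c). The Node/mul/sub names are illustrative, and the simple traversal below is only valid for tree-shaped graphs like this one; the “extended graph” approach would instead append new nodes that compute these gradients.

```python
class Node:
    def __init__(self, value, parents=(), grad_fn=None):
        self.value, self.parents, self.grad_fn = value, parents, grad_fn
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, (a, b), lambda g: (g * b.value, g * a.value))

def sub(a, b):
    return Node(a.value - b.value, (a, b), lambda g: (g, -g))

def backward(out):
    out.grad = 1.0                       # dz/dz = 1
    stack = [out]
    while stack:                         # backtrack the recorded forward graph
        node = stack.pop()
        if node.grad_fn is None:
            continue                     # leaf variable: nothing to propagate
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            parent.grad += g
            stack.append(parent)

a, b, c = Node(2.0), Node(3.0), Node(1.0)
z = sub(mul(a, b), c)                    # z = a*b - c, as in the figure
backward(z)
print(a.grad, b.grad, c.grad)            # 3.0, 2.0, -1.0
```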

Page 34: Differences of Deep Learning Frameworks

3. How to compute backprop


Backprop through graphs

Easy and simple to implement: Backprop computation need not be defined as a graph.

Low flexibility: Features available for graphs may not apply to backprop computations (e.g., applying additional backprop through them, computational optimizations, etc.).

Backprop as extended graphs

Implementation gets complicated

High flexibility: Any feature available for graphs can also be applied to backprop computations.

Page 35: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 36: Differences of Deep Learning Frameworks

4. How to represent parameters


Parameters as part of operator nodes

Parameters are owned by operator nodes (e.g., convolution layers) and do not directly appear in the graphs.

E.g.: Torch.nn, Caffe, MXNet

Parameters as separate nodes in the graphs

Parameters are represented as separate variable nodes.

E.g.: Theano, Chainer, TensorFlow

[Figure: left, x → Affine (owns W and b) → y. Right, x, W, and b all appear as input nodes feeding Affine → y.]
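An illustrative Python sketch of the two representations; the AffineLayer class and the affine function below are hypothetical, not any real framework's API.

```python
import numpy as np

# Left of the figure: the operator owns its parameters.
class AffineLayer:
    def __init__(self, n_in, n_out):
        self.W = 0.01 * np.random.randn(n_in, n_out)   # hidden inside the operator
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return x.dot(self.W) + self.b

# Right of the figure: parameters are ordinary inputs (separate nodes), so whatever
# works for variables (gradients, sharing, feeding computed values) works for them too.
def affine(x, W, b):
    return x.dot(W) + b

x = np.random.randn(4, 3)
y1 = AffineLayer(3, 2)(x)

W, b = 0.01 * np.random.randn(3, 2), np.zeros(2)
y2 = affine(x, W, b)
```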

Page 37: Differences of Deep Learning Frameworks

4. How to represent parameters


Parameters as part of operator nodes

Intuitiveness: This representation resembles the classical formulation of NNs.

Low flexibility and reusability: We cannot do to the parameters the same things that can be done to variable nodes.

Parameters as separate nodes in the graphs

High flexibility and reusability: We can apply to the parameters any operation that can be applied to variable nodes.

Page 38: Differences of Deep Learning Frameworks

5. How to update parameters


Update parameters by own routines outside of the graphs

Update formulae are implemented directly using the backend array libraries.

E.g.: Torch.nn, Caffe, MXNet, Chainer

Represent update formulae as a part of the graphs

Update formulae are built as a part of computational graphs.

E.g.: Theano, TensorFlow
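A rough sketch of the two styles. The names below (sgd_update, AssignSub) are hypothetical, and the in-graph variant is only a caricature of how Theano's updates or TensorFlow's assign operations embed the update into the graph.

```python
import numpy as np

def sgd_update(params, grads, lr=0.01):
    # Outside-of-graph update: the optimizer manipulates the arrays directly.
    for p, g in zip(params, grads):
        p -= lr * g                  # plain in-place array update, no graph involved

class AssignSub:
    # In-graph update (illustrative): the update is itself an operation that the
    # framework stores once and executes whenever the graph runs.
    def __init__(self, target, lr=0.01):
        self.target, self.lr = target, lr
    def run(self, grad):
        self.target -= self.lr * grad

W = np.zeros((3, 2))
gW = np.ones((3, 2))

sgd_update([W], [gW])            # style 1: update performed immediately
update_op = AssignSub(W)         # style 2: update op "built" like part of a graph
update_op.run(gW)                # ...and executed when the graph is run
```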

Page 39: Differences of Deep Learning Frameworks

5. How to update parameters


Update parameters by own routines outside of the graphs

Easy to implement: We can use any feature of the array backend when writing update formulae.

Low integration: Update formulae are not integrated into the computational graphs.

Represent update formulae as a part of the graphs

Implementation gets complicated: The framework must support assign or update operations within the computational graphs.

High integration: We can apply, e.g., graph optimizations to the update formulae as well.

Page 40: Differences of Deep Learning Frameworks

List of Important Design Choices

Programming paradigms

1. How to write NNs in text format

2. How to build computational graphs

3. How to compute backprop

4. How to represent parameters

5. How to update parameters

Performance improvements

6. How to achieve the computational performance

7. How to scale the computations


Page 41: Differences of Deep Learning Frameworks

6. How to achieve the computational performance


Transform the graphs to optimize the computations

There are many ways to optimize the computations.

Theano supports various optimizations.

TensorFlow does simple ones.

Provide easy ways to write custom operator nodes

Users can write their own operator nodes optimized for their purposes.

Torch, MXNet, and Chainer provide ways to write a single piece of code that runs on both CPU and GPU.

Chainer also provides ways to write custom CUDA kernels without manual compilation steps.
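As a sketch of what a custom operator node looks like, here is a user-defined LeakyReLU with its own forward and backward. The interface is generic and hypothetical, not the actual extension API of Torch, MXNet, or Chainer; the CPU/GPU-unified code and custom CUDA kernel facilities mentioned above go further, but the forward/backward contract is the common core.

```python
import numpy as np

class LeakyReLU:
    def __init__(self, slope=0.1):
        self.slope = slope
    def forward(self, x):
        self.mask = x > 0                        # saved for the backward pass
        return np.where(self.mask, x, self.slope * x)
    def backward(self, grad_out):
        return np.where(self.mask, grad_out, self.slope * grad_out)

op = LeakyReLU()
x = np.random.randn(4, 4)
y = op.forward(x)
gx = op.backward(np.ones_like(y))                # gradient w.r.t. the input
```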

Page 42: Differences of Deep Learning Frameworks

7. How to scale the computations


Multi-GPU parallelizations

Nowadays, most popular frameworks are starting to support multi-GPU computation.

Multi-GPU (one machine) is enough for most use cases today.

Distributed computations (i.e., multi-node parallelizations)

Some frameworks also support distributed computations to further scale the learning.

MXNet uses a simple distributed key-value store.

TensorFlow uses gRPC. It will also support easy-to-use cloud environments.

CNTK uses simple MPI.
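A minimal sketch of the data-parallel pattern behind these systems: every worker computes gradients on its own minibatch, and the gradients are averaged across workers before an identical update. It uses mpi4py only as a generic example; MXNet's key-value store, TensorFlow's gRPC workers, and CNTK's MPI builds each differ in detail.

```python
# Run with e.g.: mpiexec -n 4 python train_parallel.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
W = np.zeros((10, 10))                     # every worker holds a parameter replica

local_grad = np.random.randn(*W.shape)     # gradient from this worker's minibatch
avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= comm.Get_size()                # average over all workers

W -= 0.01 * avg_grad                       # identical SGD update on every worker
```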

Page 43: Differences of Deep Learning Frameworks

Ease and comfort of writing NNs

• I mainly explained the abilities of each framework

• But this does not cover many other aspects of comparing frameworks

• The choice of framework actually depends on the ease and comfort of writing NNs in it
  • Many people choose Torch for research because Lua is simple and fast, so in most cases they do not have to care about performance

• Trial and error is important here again (just as it is in deep learning research itself)
  • The choice of framework ultimately depends on your preference
  • The capabilities are still important for satisfying your demands


Page 44: Differences of Deep Learning Frameworks

Summary

• The important points of framework differences are in the ways to define computational graphs and how to use them

• There are several design choices in framework development
  • Each of them influences the framework's performance and flexibility (i.e., the range of NNs and learning procedures that can be easily expressed)

• Once your demands are satisfied, choose the one that you feel comfortable with (it strongly depends on your own preferences!)


Page 45: Differences of Deep Learning Frameworks

Conclusion

• We introduced the basics of NNs, typical designs of their implementations, and pros/cons of various design choices.

• Deep learning is an emerging field whose development keeps accelerating, so quick trial and error is crucial for research and development in this field

• In that sense, frameworks that provide highly reusable building blocks of NNs are important

• There is a growing number of frameworks, each with different characteristics, so it is also important to choose one appropriate for your purpose
