simple and efficient learning with automatic operation...
TRANSCRIPT
![Page 1: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/1.jpg)
Simple and Efficient Learning with Automatic
Operation BatchingGraham Neubig
http://dynet.io/autobatch/
joint work w/ Yoav Goldberg and Chris Dyer
in https://github.com/neubig/howtocode-2017
![Page 2: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/2.jpg)
Neural Networks w/ Complicated Structures
Phrases
Words Sentences
Alice gave a message to Bob
PPNP
VP
VP
S
Dynamic Decisionsa=1 a=1 a=2
![Page 3: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/3.jpg)
Neural Net Programming Paradigms
![Page 4: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/4.jpg)
What is Necessary for Neural Network Training
• define computation
• add data
• calculate result (forward)
• calculate gradients (backward)
• update parameters
![Page 5: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/5.jpg)
Paradigm 1: Static Graphs(Tensorflow, Theano)
• define
• for each data point:
• add data
• forward
• backward
• update
![Page 6: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/6.jpg)
Advantages/Disadvantages of Static Graphs
• Advantages:
• Can be optimized at definition time
• Easy to feed data to GPUs, etc., via data iterators
• Disadvantages:
• Difficult to implement nets with varying structure (trees, graphs, flow control)
• Need to learn big API that implements flow control in the “graph” language
![Page 7: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/7.jpg)
Paradigm 2: Dynamic+Eager Evaluation
(PyTorch, Chainer)
• for each data point:
• define/add data/forward
• backward
• update
![Page 8: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/8.jpg)
Advantages/Disadvantages of Dynamic+Eager Evaluation• Advantages:
• Easy to implement nets with varying structure, API is closer to standard Python/C++
• Easy to debug because errors occur immediately • Disadvantages:
• Cannot be optimized at definition time • Hard to serialize graphs w/o program logic,
decide device placement, etc.
![Page 9: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/9.jpg)
Paradigm 3: Dynamic+Lazy Evaluation (DyNet)
• for each data point:
• define/add data
• forward
• backward
• update
![Page 10: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/10.jpg)
Advantages/Disadvantages of Dynamic+Lazy Evaluation• Advantages:
• Easy to implement nets with varying structure,API is closer to standard Python/C++
• Can be optimized at definition time (this presentation!)
• Disadvantages:• Harder to debug because errors occur immediately • Still hard to serialize graphs w/o program logic,
decide device placement, etc.
![Page 11: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/11.jpg)
Efficiency Tricks: Operation Batching
![Page 12: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/12.jpg)
Efficiency Tricks: Mini-batching
• On modern hardware 10 operations of size 1 is much slower than 1 operation of size 10
• Minibatching combines together smaller operations into one big one
![Page 13: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/13.jpg)
Minibatching
![Page 14: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/14.jpg)
Manual Mini-batching• DyNet has special minibatch operations for lookup
and loss functions, everything else automatic
• You need to:
• Group sentences into a mini batch (optionally, for efficiency group sentences by length)
• Select the “t”th word in each sentence, and send them to the lookup and loss functions
![Page 15: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/15.jpg)
Example Task: Sentiment
I hate this movie
I love this movie
I do n’t hate this movie
very good good
neutral bad
very bad
very good good
neutral bad
very bad
very good good
neutral bad
very bad
![Page 16: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/16.jpg)
Continuous Bag of Words (CBOW)
I hate this movie
+
bias
=
scores
+ + +
lookup lookup lookuplookup
W
=
![Page 17: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/17.jpg)
I
Batching CBOW
I hate this movie
+ + +
lookup lookup lookuplookup
love that movie
![Page 18: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/18.jpg)
Mini-batched Code Example
![Page 19: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/19.jpg)
Mini-batching Sequencesthis is an example </s>this is another </s> </s>
PaddingLoss Calculation
Mask
11� 1
1� 11� 1
1� 10�
Take Sum
![Page 20: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/20.jpg)
Bi-directional LSTMI hate this movie
+
bias
=
scores
W
LSTM LSTM LSTM LSTM
LSTM LSTM LSTM LSTM
concat
![Page 21: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/21.jpg)
Tree-structured RNN/LSTMI hate this movie
+
bias
=
scores
W
RNN
RNN
RNN
![Page 22: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/22.jpg)
And What About These?
Phrases
Words Sentences
Alice gave a message to Bob
PPNP
VP
VP
S
Dynamic Decisionsa=1 a=1 a=2
![Page 23: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/23.jpg)
Automatic Operation Batching
![Page 24: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/24.jpg)
Automatic Mini-batching!
• Innovatd by TensorFlow Fold (faster than unbatched, but implementation relatively complicated)
• DyNet Autobatch (basically effortless implementation)
![Page 25: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/25.jpg)
Programming Paradigm
for minibatch in training_data: loss_values = [] for x, y in minibatch: loss_values.append(calculate_loss(x,y)) loss_sum = sum(loss_values) loss_sum.forward() loss_sum.backward() trainer.update()
Just write a for loop!
Batching occurs here
![Page 26: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/26.jpg)
Under the Hood• Each node has “profile”, same profile → batchable
• Batch and execute items with their dependencies satisfied
![Page 27: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/27.jpg)
Challenges
• This goes in your training loop: must be blazing fast!
• DyNet’s C++ implementation is highly optimized
• Profiles stored as hash functions
• Minimize memory allocation overhead
![Page 28: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/28.jpg)
Synthetic Experiments• Fixed-length RNN → ideal case for manual batching
• How close can we get?
![Page 29: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/29.jpg)
Real NLP Tasks• Variably Lengthed RNN, RNN w/ character
embeddings, tree LSTM, dependency parser
![Page 30: Simple and Efficient Learning with Automatic Operation …phontron.com/slides/neubig17howtocode.pdf · Dynamic Decisions a=1 a=1 a=2. ... of Static Graphs ... • Variably Lengthed](https://reader034.vdocuments.us/reader034/viewer/2022051508/5aa1c8747f8b9a84398c2786/html5/thumbnails/30.jpg)
Let’s Try it Out!http://dynet.io/autobatch/
https://github.com/neubig/howtocode-2017