Peltarion, Holländargatan 17, SE-111 60 Stockholm, Sweden
Transformers and Attention: ID2223 Scalable Machine Learning and Deep Learning
Karl Fredrik Erliksson, Industrial PhD Candidate, KTH and Peltarion
November 25, 2020
Peltarion - the operational AI platform
Roadmap
01 Contextualized Embeddings
02 From RNNs to Transformers
03 Transformers Step-by-Step
04 BERT
05 Distillation and Practical Example
Acknowledgements
Material based on:
● Christopher Manning’s NLP Lectures at Stanford
● The Illustrated Transformer by Jay Alammar
● Slides from Jacob Devlin
● Self-attention Video from Peltarion
01 / Contextualized Embeddings
Background to Natural Language Processing (NLP)
● Word embeddings are the basis of NLP
● Popular embeddings like GloVe and Word2Vec are pre-trained on large text corpora based on co-occurrence statistics
● “A word is characterized by the company it keeps” [Firth, 1957]
[Peltarion, 2020]
Word Embeddings
Problem: Word embeddings are context-free
Solution: Create contextualized representation
[Peltarion, 2020]
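To make the problem concrete, here is a minimal sketch of a static embedding lookup. The toy vocabulary and the 3-dimensional vectors are made up for illustration (they are not real GloVe or Word2Vec values). Because the lookup depends only on the word itself, "bank" receives the same vector in both sentences.

```python
import numpy as np

# Hypothetical toy embedding table; real GloVe/Word2Vec vectors have
# hundreds of dimensions and are learned from co-occurrence statistics.
EMBEDDINGS = {
    "river": np.array([0.9, 0.1, 0.0]),
    "bank": np.array([0.5, 0.5, 0.2]),
    "money": np.array([0.0, 0.2, 0.9]),
    "the": np.array([0.1, 0.1, 0.1]),
}

def embed(sentence):
    """Look up one static vector per word, ignoring the surrounding words."""
    return [EMBEDDINGS[word] for word in sentence.split()]

river_sentence = embed("the river bank")
money_sentence = embed("the money bank")

# The vector for "bank" is identical in both contexts.
same_vector = np.array_equal(river_sentence[2], money_sentence[2])
print(same_vector)
```

A contextualized model would instead compute the vector for "bank" from the whole sentence, so the two occurrences could differ.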
02 / From RNNs to Transformers
Problems with RNNs - Motivation for Transformers
● Sequential computation prevents parallelization
● Despite GRUs and LSTMs, RNNs still need attention mechanisms to deal with long-range dependencies
● Attention gives us access to any state… Maybe we don’t need the costly recursion? 🤔
● Then NLP can have deep models, which solves our computer vision envy!
Attention is all you need! [Vaswani et al., 2017]
● Sequence-to-sequence model for Machine Translation
● Encoder-decoder architecture
● Multi-headed self-attention
○ Models context and has no locality bias
03 / Transformers Step-by-Step
Understanding the Transformer: Step-by-Step
No recursion, instead stacking encoder and decoder blocks
● Originally: 6 layers
● BERT base: 12 layers
● BERT large: 24 layers
● GPT2-XL: 48 layers
● GPT3: 96 layers
[Alammar, 2018]
The Encoder Block
[Alammar, 2018]
The Encoder and Decoder Blocks
[Alammar, 2018]
Attention Preliminaries
Mimics the retrieval of a value v_i for a query q based on a key k_i in a database, but in a probabilistic fashion
Figure labels: Query q, Database, Value v_i
Dot-Product Attention
● Queries, keys and values are vectors
● Output is a weighted sum of the values
● Weights are computed as the scaled dot-product (similarity) between the query and the keys
○ Output is a row-vector
● Self-attention: Let the word embeddings be the queries, keys and values, i.e. let the words select each other
● Can stack multiple queries into a matrix Q
○ Output is again a matrix
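The bullets above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention in the form softmax(Q K^T / sqrt(d_k)) V from Vaswani et al. (2017), not the lecture's own code. The toy sizes and random inputs are assumptions, and the learned projection matrices that a real Transformer applies to obtain Q, K and V are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, embedding size 8 (toy sizes)

# Self-attention: the same embeddings act as queries, keys and values.
output, weights = scaled_dot_product_attention(X, X, X)

# A single query gives a single row vector as output.
single_output, _ = scaled_dot_product_attention(X[:1], X, X)
```

Stacking the queries into a matrix computes all rows in one matrix product, which is what makes the operation easy to parallelize.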
Self-Attention Mechanism
[Alammar, 2018]
Self-Attention Mechanism in Matrix Notation
[Alammar, 2018]
Multi-Headed Self-Attention
[Alammar, 2018]
Self-Attention: Putting It All Together
[Alammar, 2018]
Attention visualized
The Full Encoder Block
Encoder block consisting of:
● Multi-headed self-attention
● Feedforward NN (FC 2 layers)
● Skip connections
● Layer normalization - Similar to batch normalization, but the statistics are computed over the features of each token in a single sample rather than across the batch
[Alammar, 2018]
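A minimal NumPy sketch of the non-attention parts of the encoder block listed above: a two-layer feedforward network wrapped in a skip connection and layer normalization. The toy sizes, the random weights and the ReLU activation are assumptions for illustration. The learned gain and bias of layer normalization are omitted, and the order (add, then normalize) follows the original Transformer paper rather than anything stated on the slide.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector over its feature dimension (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers with a ReLU in between."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_output):
    """Skip connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 32, 4        # toy sizes, chosen for illustration
x = rng.normal(size=(n_tokens, d_model))  # stand-in for the self-attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
```

Each row of `y` has roughly zero mean and unit variance, independently of the other tokens and of any other sample in the batch.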
Encoder-Decoder Architecture - Small Example
[Alammar, 2018]
Positional Encodings
● Attention mechanism has no locality bias - no notion of word order
● Add positional encodings to input embeddings to let model learn relative positioning
[Alammar, 2018]
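The slide does not say which encoding is used, so the sketch below assumes the sinusoidal variant from Vaswani et al. (2017), where even feature indices use a sine and odd indices a cosine of the position scaled by 10000^(2i/d_model). The sequence length and embedding size are toy values.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(n_positions)[:, None]            # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)                     # even feature indices
    encoding[:, 1::2] = np.cos(angles)                     # odd feature indices
    return encoding

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 16))                # 5 tokens, toy d_model = 16
positional = sinusoidal_positional_encoding(5, 16)

# The encodings are simply added to the input embeddings.
inputs = token_embeddings + positional
```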
Positional Encodings
[Kazemnejad, 2019]
Let’s start the encoding!
[Alammar, 2018]
Decoding procedure
[Alammar, 2018]
Producing the output text
● The output from the decoder is passed through a final fully connected linear layer with a softmax activation function
● Produces a probability distribution over the pre-defined vocabulary of output words (tokens)
● Greedy decoding picks the word with the highest probability at each time step
[Alammar, 2018]
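One decoding step as described above can be sketched as follows. The five-word vocabulary, the random weights of the final linear layer and the random decoder output are hypothetical stand-ins; only the linear layer, softmax and argmax structure comes from the slide.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

vocabulary = ["<eos>", "i", "am", "a", "student"]    # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8
W_out = rng.normal(size=(d_model, len(vocabulary)))  # final linear layer (random here)
b_out = np.zeros(len(vocabulary))

decoder_output = rng.normal(size=d_model)            # stand-in for one decoder time step
probabilities = softmax(decoder_output @ W_out + b_out)

# Greedy decoding: take the single most probable token at this step.
next_token = vocabulary[int(np.argmax(probabilities))]
```

In a full decoder the chosen token would be fed back as input for the next time step until an end-of-sequence token is produced.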
Training Objective
[Alammar, 2018]
Complexity Comparison
[Vaswani et al., 2017]
04 / BERT
BERT
Bidirectional Encoder Representations from Transformers
● Self-supervised pre-training of a Transformer encoder for language understanding
● Fine-tuning for a specific downstream task
BERT Training Procedure
[Devlin et al., 2018]
BERT Training Objectives
● Masked Language Modelling
● Next Sentence Prediction
[Devlin et al., 2018]
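As an illustration of the Masked Language Modelling objective, the sketch below corrupts a sentence following the recipe in Devlin et al. (2018): about 15% of the tokens are selected for prediction, and of those 80% become [MASK], 10% become a random token and 10% stay unchanged. Whitespace tokenization, the per-token coin flip and the helper name `mask_tokens` are simplifying assumptions; BERT itself works on WordPiece tokens.

```python
import random

def mask_tokens(tokens, vocabulary, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                 # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")       # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocabulary))  # 10%: random token
            else:
                corrupted.append(token)          # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

tokens = "the man went to the store to buy a gallon of milk".split()
corrupted, labels = mask_tokens(tokens, vocabulary=sorted(set(tokens)))
```

The training loss is then computed only at the positions where `labels` is not `None`.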
BERT Fine-Tuning Examples
● Sentence Classification
● Question Answering
● Named Entity Recognition
[Devlin et al., 2018]
Exploring the Limits of Transfer Learning (T5)
● Scaling up model size and amount of training data helps a lot
● Best model is 11B (!!) parameters
● Exact pre-training objective (MLM, NSP, corruption) doesn’t matter too much
● SuperGLUE benchmark:
[Raffel et al., 2019]
05 / Practical Examples
BERT in low-latency production settings
[Devlin, 2020]
Distillation
● Modern pre-trained language models are huge and very computationally expensive
● How are these companies applying them to low-latency applications?
● Distillation!
○ Train SOTA teacher model (pre-training + fine-tuning)
○ Train smaller student model that mimics the teacher’s output on a large dataset of unlabeled data
● Why does it work so well?
[Turc, 2020]
[Devlin, 2020]
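"Mimicking the teacher's output" is commonly implemented as a cross-entropy between the teacher's and the student's predicted distributions. The sketch below shows that loss on random logits. The `temperature` parameter and the toy shapes are assumptions for illustration and are not taken from the slides or the cited work.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional softening temperature."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean cross-entropy between teacher and student output distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(6, 2))   # teacher outputs on 6 unlabeled examples
far_student = rng.normal(size=(6, 2))      # an untrained student
close_student = teacher_logits + 0.01 * rng.normal(size=(6, 2))

loss_far = distillation_loss(far_student, teacher_logits)
loss_close = distillation_loss(close_student, teacher_logits)
```

No human labels are needed: the teacher's soft predictions serve as targets, which is why a large unlabeled dataset is enough to train the student.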
Transformers in TensorFlow using HuggingFace 🤗
● The HuggingFace Library contains a majority of the recent pre-trained state-of-the-art NLP models, as well as over 4 000 community uploaded models
● Works with both TensorFlow and PyTorch
from transformers import BertTokenizerFast, TFBertForSequenceClassification
from datasets import load_dataset
import tensorflow as tf

# Load and shuffle IMDB, then load the pre-trained tokenizer and model
dataset = load_dataset("imdb").shuffle()
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize the training texts and wrap them in a tf.data.Dataset
train_encodings = tokenizer(dataset['train']['text'], truncation=True, padding=True)
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), dataset['train']['label']))
val_dataset = ...  # Analogously

# compute_loss follows the HuggingFace docs at the time of the lecture (2020); newer releases may differ
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
# The dataset is already batched, so batch_size must not be passed to fit()
model.fit(train_dataset.batch(16), epochs=3)
model.evaluate(val_dataset.batch(16), verbose=0)
06 / Wrap Up
Summary
● Transformers have blown other architectures out of the water for NLP
● Get rid of recurrence and rely on self-attention
● NLP pre-training using Masked Language Modelling
● Most recent improvements using larger models and more data
● Distillation can make model serving and inference more tractable
Thanks
November 25, 2020
Karl Fredrik Erliksson, PhD Candidate, KTH and Peltarion
[email protected]
Questions?