GPT-3 and the future of language modeling · miyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5.
TRANSCRIPT
![Page 1: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/1.jpg)
GPT-3 and the future of language modeling
CS685 Fall 2020
Advanced Natural Language Processing
Mohit Iyyer
College of Information and Computer Sciences
University of Massachusetts Amherst
![Page 2: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/2.jpg)
Stuff from last time
• How is the [CLS] token pretrained (e.g., how does it learn a contextualized vector during pretraining?) Is it shared across all pretraining sentences?
• We get multiple embeddings per token in ELMo and BERT (one per layer). How do we choose which to use?
• Project proposal feedback by the end of the week!
• Practice exams available on Piazza
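On the question of which layer's embedding to use: one common answer, the one ELMo takes, is to not pick a single layer but to take a softmax-weighted sum of all layers, with the weights learned for each downstream task. The sketch below is a minimal illustration only; `mix_layers`, `layer_scores`, and `gamma` are hypothetical names, and the scores are fixed by hand here rather than learned.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of one token's per-layer vectors.

    layer_vectors: one vector per layer, all for the same token.
    layer_scores:  one scalar per layer (learned in practice, fixed here).
    gamma:         overall scale factor (also learned in practice).
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# Two layers, equal scores: the result is the plain average of the layers.
mixed = mix_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With equal scores every layer contributes equally; training the scores lets each task lean on the layers that help it most.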
![Page 3: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/3.jpg)
Today, an alternative to “pretrain+finetune”, which involves simply getting rid of fine-tuning
“Language models are few-shot learners”, Brown et al., 2020
![Page 4: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/4.jpg)
The language model “scaling wars”!
ELMo: 93M params, 2-layer biLSTM
BERT-base: 110M params, 12-layer Transformer
BERT-large: 340M params, 24-layer Transformer
![Page 5: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/5.jpg)
![Page 6: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/6.jpg)
![Page 7: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/7.jpg)
The language model “scaling wars”!
ELMo: 1B training tokens
BERT: 3.3B training tokens
RoBERTa: ~30B training tokens
![Page 8: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/8.jpg)
![Page 9: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/9.jpg)
The language model “scaling wars”!
![Page 10: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/10.jpg)
The language model “scaling wars”!
Log scale!
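The plot itself is not in this transcript, but the numbers listed on the earlier slides are enough to see why a log axis is the natural choice. The sketch below uses only those figures (the RoBERTa token count is the approximate ~30B from the slide); `growth_factor` and `log10_positions` are hypothetical helper names.

```python
import math

# Figures as listed on the earlier slides.
params = {"ELMo": 93e6, "BERT-base": 110e6, "BERT-large": 340e6}
tokens = {"ELMo": 1e9, "BERT": 3.3e9, "RoBERTa": 30e9}  # RoBERTa is approximate


def growth_factor(values):
    """Ratio between the largest and smallest entry."""
    return max(values.values()) / min(values.values())


def log10_positions(values):
    """Where each entry lands on a log10 axis (one unit = a factor of 10)."""
    return {name: math.log10(v) for name, v in values.items()}


for label, values in (("params", params), ("tokens", tokens)):
    print(label, "growth factor:", round(growth_factor(values), 2))
    for name, position in log10_positions(values).items():
        print(f"  {name}: 10^{position:.2f}")
```

On a linear axis the smallest model would be squashed against zero as soon as anything an order of magnitude larger is added; on a log axis each factor of 10 gets the same amount of space.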
![Page 11: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/11.jpg)
so… what does all of this scaling buy us?
![Page 12: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/12.jpg)
Downstream training data
Downstream test data
![Page 13: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/13.jpg)
![Page 14: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/14.jpg)
No fine-tuning!!! Literally just take a pretrained LM and give it the following prefix:
“Translate English to French: cheese =>”
![Page 15: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/15.jpg)
No fine-tuning!!! Literally just take a pretrained LM and give it the following prefix:
“Translate English to French: sea otter => loutre de mer, cheese =>”
![Page 16: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/16.jpg)
No fine-tuning!!! Literally just take a pretrained LM and give it the following prefix:
“Translate English to French: sea otter => loutre de mer, peppermint => … (few more examples), cheese =>”
Max of 100 examples fed into the prefix in this way
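The prefixes on the last three slides can be assembled mechanically. The sketch below is a minimal illustration of that string construction, assuming the `source => target` format shown above; `build_prompt` and its parameter names are hypothetical, and the 100-demonstration cap simply mirrors the number quoted on the slide. Nothing here calls a model: the returned string is what would be handed to a pretrained LM for continuation.

```python
def build_prompt(task, demonstrations, query, max_demonstrations=100):
    """Assemble a zero-, one-, or few-shot prompt as a single string.

    task:           natural-language task description.
    demonstrations: list of (source, target) pairs placed in the prefix.
    query:          the input the model should complete.
    """
    if len(demonstrations) > max_demonstrations:
        raise ValueError("too many demonstrations for the prefix")
    parts = [f"{source} => {target}" for source, target in demonstrations]
    parts.append(f"{query} =>")  # left open for the model to continue
    return f"{task}: " + ", ".join(parts)


task = "Translate English to French"
zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, [("sea otter", "loutre de mer")], "cheese")
print(zero_shot)
print(one_shot)
```

The only difference between zero-shot, one-shot, and few-shot is how many pairs are passed in `demonstrations`; the model's weights are never updated.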
![Page 17: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/17.jpg)
How does this new paradigm compare to “pretrain + finetune”?
![Page 18: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/18.jpg)
TriviaQA
![Page 19: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/19.jpg)
![Page 20: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/20.jpg)
What does this mean?
![Page 21: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/21.jpg)
What about translation? (7% of GPT-3’s training data is in languages other than English)
![Page 22: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/22.jpg)
![Page 23: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/23.jpg)
Improvements haven’t plateaued!
![Page 24: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/24.jpg)
What about reading comprehension QA?
![Page 25: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/25.jpg)
Struggles on “harder” datasets
![Page 26: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/26.jpg)
![Page 27: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/27.jpg)
Data contamination
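Contamination here means that benchmark test examples also appear in the LM's pretraining text, so a high score may partly reflect memorization. One simple way to flag it is to look for long word n-grams shared between a test example and the training documents. The sketch below is a minimal illustration of that idea; `ngrams`, `is_contaminated`, the whitespace tokenization, and the default `n=8` are all assumptions chosen for the example, not the procedure used in any particular paper.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_example, training_documents, n=8):
    """True if the test example shares any n-gram with any training document."""
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_documents)


training_documents = [
    "the quick brown fox jumps over the lazy dog near the river bank today"
]
leaked = "a quick brown fox jumps over the lazy dog near here"
clean = "an entirely different sentence about language models and octopuses today"

print(is_contaminated(leaked, training_documents))
print(is_contaminated(clean, training_documents))
```

A larger `n` makes the check stricter (fewer false alarms from common phrases), while a smaller `n` flags more examples; at web scale the training n-grams would have to be indexed rather than recomputed for every test example.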
![Page 28: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/28.jpg)
![Page 29: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/29.jpg)
![Page 30: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/30.jpg)
So… should we drop everything and focus all of our efforts on training bigger and bigger LMs?
“Climbing towards NLU…”, Bender & Koller, ACL 2020
![Page 31: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/31.jpg)
Distinction between “form” and “meaning”
• Form: the characters / words making up some text (or the sounds, etc., for spoken language)
• Meaning: How the form of a given text relates to something outside of language (e.g., grounded in some world)
![Page 32: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/32.jpg)
Distinction between “form” and “meaning”
• Thought experiment (from Emily Bender):
• Training data: All well-formed Java code on GitHub, but only the text of the code; no output; no understanding of what unit tests mean
• Test input: A single Java program, possibly even from the training data
• Expected output: Result of executing that program
![Page 33: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/33.jpg)
What’s missing is the meaning… what is the program supposed to do, given just the form (code)?
![Page 34: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/34.jpg)
The octopus test
A B
I’m stranded here… it sucks
Same… luckily we can talk to each other!
![Page 35: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/35.jpg)
The octopus test
A B
O
Any plans to escape?
Nope. Just gonna lie here.
![Page 36: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/36.jpg)
The octopus test
So where are you from?
Los Angeles, it’s got great weather
![Page 37: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/37.jpg)
The octopus test
Help! I’m being chased by a bear! All I have is a stick, what do I do?
Not sure, sorry!
(No idea what a bear or stick is…)
![Page 38: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/38.jpg)
O did not learn “meaning”
• O only observed form, without any grounding in the world on these islands
• A could find meaning from O’s utterances, even though O did not “understand” what it was saying
• What if B didn’t know what a bear was either? They might respond similarly to O. However, B can ground their responses in their own world/experience, and as such are formulating their response totally differently from O
![Page 39: GPT-3 and the future of language modelingmiyyer/cs685/slides/11-gpt3.pdf · 2020. 10. 5. · GPT-3 and the future of language modeling CS685 Fall 2020 Advanced Natural Language Processing](https://reader036.vdocuments.us/reader036/viewer/2022071510/612f4a891ecc515869435940/html5/thumbnails/39.jpg)
So what now?
• We need more datasets that are grounded in different modalities and ways of interaction!
• We need ways to test a model’s ability to generalize or adapt to new tasks
• Take some inspiration from human language learning: children do not learn from form alone, so why should we force our machines to do so?