UCL x DeepMind lecture series

TRANSCRIPT

Page 1: UCL x DeepMind lecture series

In this lecture series, leading research scientists from the AI research lab DeepMind will give 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks via advanced ideas around memory, attention, and generative modelling to the important topic of responsible innovation.

Please join us for a deep dive lecture series into Deep Learning!

#UCLxDeepMind

WELCOME TO THE UCL x DeepMind lecture series

Page 2: UCL x DeepMind lecture series

TODAY’S SPEAKER

Felix Hill

Felix Hill has been a Research Scientist at DeepMind since 2016. He did his undergraduate and master's degrees in pure maths at the University of Oxford, and his PhD in Computer Science at the University of Cambridge. His PhD focused on unsupervised learning from text in neural networks. Since joining DeepMind, he has worked on abstract and relational reasoning, and on situated models of human language learning and usage, primarily in simulated environments.

Page 3: UCL x DeepMind lecture series

TODAY’S LECTURE

Deep Learning for Language Understanding

Since the late 1970s, a key motivation for research into artificial neural networks has been to model the highly context-dependent nature of semantic cognition. Today, their intrinsically parallel, interactive and context-dependent processing allows deep nets to power many state-of-the-art language technology applications. In this lecture we explain why neural networks can be such an effective tool for modelling language, focusing on the Transformer, a recently proposed architecture for sequence processing. We also dive into recent research on developing embodied neural-network 'agents' that can relate language to the world around them.

Page 4: UCL x DeepMind lecture series

UCL x DeepMind Lectures

Deep Learning for Language Understanding

Felix Hill

Page 5: UCL x DeepMind lecture series

Plan for this Lecture

01 Background: deep learning and language

02 The Transformer

03 Unsupervised and transfer learning with BERT

04 Grounded language learning at DeepMind: towards language understanding in a situated agent

Content: High-level plan of the lecture

Goal: Prepare people for the overall structure and make sure they know what to expect

Page 6: UCL x DeepMind lecture series

What is not covered in this lecture

01 Recurrent networks

- Seq-to-seq models and neural MT

- Speech recognition or synthesis

02 Many NLP tasks and applications

- Machine comprehension

- Question answering

- Dialogue

03 Grounding in image/video

- Visual question-answering or captioning

- CLEVR and visual reasoning

- Video captioning

Content: Neural networks in the wild

Goal: Provide people with a slightly broader perspective, and a sneak peek into future lectures

Page 7: UCL x DeepMind lecture series

1. Background: Deep learning and language


Page 8: UCL x DeepMind lecture series
Page 9: UCL x DeepMind lecture series

Content: Various successes

Goal: Provide a bit of extra excitement, and make sure people see applications before getting into technicalities

Machine Translation

Speech recognition

Question answering

Speech synthesis

Summarization

Text classification

Search / information retrieval

Dialogue systems

Home assistants

(These applications range from very neural to less / non-neural.)

Page 10: UCL x DeepMind lecture series


Page 11: UCL x DeepMind lecture series


Page 12: UCL x DeepMind lecture series

Why is Deep Learning such an effective tool for language processing?


Page 13: UCL x DeepMind lecture series

Why is Deep Learning such an effective tool for language processing?


Can it be improved?

Page 14: UCL x DeepMind lecture series

Some things about language...


Page 15: UCL x DeepMind lecture series

1. Words are not discrete symbols


Page 16: UCL x DeepMind lecture series

● Did you see the look on her face1 ?

● We could see the clock face2 from below

● It could be time to face3 his demons

● There are a few new faces4 in the office today

Page 17: UCL x DeepMind lecture series

● Did you see the look on her face1 ?

● We could see the clock face2 from below

● It could be time to face3 his demons

● There are a few new faces4 in the office today

Face 1

● The most important side (of the head)
● Represents you / yourself
● Used to inform / communicate
● Points forward when you address/confront something

Page 18: UCL x DeepMind lecture series

● Did you see the look on her face1 ?

● We could see the clock face2 from below

● It could be time to face3 his demons

● There are a few new faces4 in the office today

Face 1

● The most important side (of the head)
● Represents you / yourself
● Used to inform / communicate
● Points forward when you address/confront something

Face 2

● The most important side (of the head)
● Used to inform / communicate

Page 19: UCL x DeepMind lecture series

● Did you see the look on her face1 ?

● We could see the clock face2 from below

● It could be time to face3 his demons

● There are a few new faces4 in the office today

Face 1

● The most important side (of the head)
● Represents you / yourself
● Used to inform / communicate
● Points forward when you address/confront something

Face 2

● The most important side (of the head)
● Used to inform / communicate

Face 3

● Points forward when you address/confront something

Page 20: UCL x DeepMind lecture series

● Did you see the look on her face1 ?

● We could see the clock face2 from below

● It could be time to face3 his demons

● There are a few new faces4 in the office today

Face 1

● The most important side (of the head)
● Represents you / yourself
● Used to inform / communicate
● Points forward when you address/confront something

Face 2

● The most important side (of the head)
● Used to inform / communicate

Face 3

● Points forward when you address/confront something

Face 4

● Represents you / yourself

Page 21: UCL x DeepMind lecture series


1. Words are not discrete symbols

2. Disambiguation depends on context

Page 22: UCL x DeepMind lecture series

Rumelhart, D. E. (1976). Toward an interactive model of reading

Page 23: UCL x DeepMind lecture series
Page 24: UCL x DeepMind lecture series


1. Words are not discrete symbols

2. Disambiguation depends on context

3. Important interactions can be non-local

Page 25: UCL x DeepMind lecture series

The man who ate the pepper sneezed

Page 26: UCL x DeepMind lecture series

The man who ate the pepper sneezed

Page 27: UCL x DeepMind lecture series

But note: The cat who bit the dog barked

The man who ate the pepper sneezed

Page 28: UCL x DeepMind lecture series

The cat who bit the dog barked

The man who ate the pepper sneezed

Page 29: UCL x DeepMind lecture series


1. Words are not discrete symbols

2. Disambiguation depends on context

3. Important interactions can be non-local

4. How meanings combine depends on those meanings

Page 30: UCL x DeepMind lecture series
Page 31: UCL x DeepMind lecture series
Page 32: UCL x DeepMind lecture series
Page 33: UCL x DeepMind lecture series

Pet

PET: brown, white, black

Page 34: UCL x DeepMind lecture series

Pet Fish

PET: brown, white, black

FISH: silver, grey

Page 35: UCL x DeepMind lecture series

PET FISH: orange, green, blue, purple, yellow

Page 36: UCL x DeepMind lecture series

PLANT: green, leaves, grows

Page 37: UCL x DeepMind lecture series

CARNIVORE: eats meat, sharp teeth

Page 38: UCL x DeepMind lecture series

CARNIVOROUS PLANT: green, leaves, grows, sharp teeth, eats insects

Page 39: UCL x DeepMind lecture series

1. Words have many related senses


2. Disambiguation depends on context

3. Important interactions can be non-local

4. How meanings combine depends on those meanings

Page 40: UCL x DeepMind lecture series

2. The Transformer


Vaswani et al. 2017

Page 41: UCL x DeepMind lecture series
Page 42: UCL x DeepMind lecture series

Distributed representations of words

the → e_the

Page 43: UCL x DeepMind lecture series

Distributed representations of words

the girl → e_the, e_girl

Page 44: UCL x DeepMind lecture series

Distributed representations of words

the girl smiled → e_the, e_girl, e_smiled

Input layer has V x D weights (vocabulary size = V units, embedding dimension = D units)
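The input layer described above is just an embedding lookup. Below is a minimal NumPy sketch of that lookup; the vocabulary size, embedding dimension and toy sentence are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

V, D = 10_000, 512                          # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(V, D))     # input layer: a V x D weight matrix

vocab = {"the": 0, "girl": 1, "smiled": 2}  # toy vocabulary
tokens = ["the", "girl", "smiled"]

# Multiplying a one-hot input by the weight matrix is just a row lookup:
embeddings = E[[vocab[t] for t in tokens]]  # shape (3, D): e_the, e_girl, e_smiled
print(embeddings.shape)
```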

Page 45: UCL x DeepMind lecture series

Miikkulainen & Dyer, 1991

The Transformer builds on solid foundations

Page 46: UCL x DeepMind lecture series

Elman, Finding Structure in Time, 1990

Emergent semantic and syntactic structure in distributed representations

Early contextual word embeddings

Page 47: UCL x DeepMind lecture series

The Transformer: Self-attention over word input embeddings

the beetle drove off

q = e_beetle W_q
k = e_beetle W_k
v = e_beetle W_v

Page 48: UCL x DeepMind lecture series

Self-attention over word input embeddings

the beetle drove off

[Figure: query q, key k and value v are computed from e_beetle via the matrices W_q, W_k, W_v]

Page 49: UCL x DeepMind lecture series

Self-attention over word input embeddings

the beetle drove off

[Figure: the query for "beetle" is compared with each word's key, giving attention weights λ1, λ2, λ3, λ4]

Page 50: UCL x DeepMind lecture series

Self-attention over word input embeddings

the beetle drove off

[Figure: attention weights λ1, λ2, λ3, λ4 over the four input words]

Page 51: UCL x DeepMind lecture series

Self-attention over word input embeddings

the beetle drove off

[Figure: the output v' for "beetle" is the sum of the value vectors weighted by λ1, λ2, λ3, λ4]
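The self-attention computation built up over the last few slides can be written in a few lines of NumPy. This is a sketch for a single attention head with illustrative dimensions and random weights; the scaling of the scores by √D follows Vaswani et al. 2017.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 512                         # sequence length ("the beetle drove off") and model dimension
X = rng.normal(size=(T, D))           # input embeddings e_the, e_beetle, e_drove, e_off

Wq = rng.normal(scale=D**-0.5, size=(D, D))
Wk = rng.normal(scale=D**-0.5, size=(D, D))
Wv = rng.normal(scale=D**-0.5, size=(D, D))

q = X[1] @ Wq                         # query for "beetle"
K = X @ Wk                            # keys for every word
V = X @ Wv                            # values for every word

scores = K @ q / np.sqrt(D)           # one score per word
lam = np.exp(scores - scores.max())
lam = lam / lam.sum()                 # attention weights λ1..λ4 (softmax)

v_prime = lam @ V                     # v': weighted sum of values, the new representation of "beetle"
print(lam.round(3), v_prime.shape)
```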

Page 52: UCL x DeepMind lecture series

Parallel self-attention

Compute self-attention for all words in the input (in parallel)

the beetle drove off

Page 53: UCL x DeepMind lecture series

Multi-head self-attention (H = 4)

the beetle drove off

[Figure: each of H = 4 heads has its own W_q, W_k, W_v, projecting the D-unit embedding e_beetle down to D / 4 units]

Page 54: UCL x DeepMind lecture series

Multi-head self-attention (H = 4)

the beetle drove off

Page 55: UCL x DeepMind lecture series

Multi-head self-attention (H = 4)

the beetle drove off

Page 56: UCL x DeepMind lecture series

Multi-head self-attention (H = 4)

the beetle drove off

Page 57: UCL x DeepMind lecture series

the beetle drove off

[Figure: the head outputs for e_beetle are concatenated (D units) and projected with W_O back to D units]
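A sketch of the multi-head computation shown on these slides: H = 4 heads, each with its own projections of D / 4 units, concatenated and projected back to D units with W_O. Dimensions and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 512, 4
d_h = D // H                                    # each head uses D / 4 units

X = rng.normal(size=(T, D))                     # input embeddings for "the beetle drove off"
Wq = rng.normal(scale=D**-0.5, size=(H, D, d_h))
Wk = rng.normal(scale=D**-0.5, size=(H, D, d_h))
Wv = rng.normal(scale=D**-0.5, size=(H, D, d_h))
Wo = rng.normal(scale=D**-0.5, size=(D, D))     # output projection back to D units

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for h in range(H):                              # each head attends with its own projections
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]   # shapes (T, d_h)
    lam = softmax(Q @ K.T / np.sqrt(d_h))       # (T, T) attention weights
    heads.append(lam @ V)                       # (T, d_h) per-head outputs

out = np.concatenate(heads, axis=-1) @ Wo       # concatenate to (T, D), then project with W_O
print(out.shape)                                # (4, 512)
```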

Page 58: UCL x DeepMind lecture series

Feedforward layer

the beetle drove off

[Figure: at every position, the output of self-attention passes through the same two-layer feedforward network: W1, ReLU, W2]
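A sketch of the position-wise feedforward layer: the same W1 and W2, with a ReLU in between, are applied to every position in parallel. The hidden width of 2048 is the value used by Vaswani et al. 2017, assumed here for illustration; biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, D_ff = 4, 512, 2048
X = rng.normal(size=(T, D))                 # one vector per word after self-attention

W1 = rng.normal(scale=D**-0.5, size=(D, D_ff))
W2 = rng.normal(scale=D_ff**-0.5, size=(D_ff, D))

# Applied position-wise: the same W1, W2 transform every word vector in parallel.
hidden = np.maximum(0, X @ W1)              # ReLU non-linearity
out = hidden @ W2                           # back to D units per position
print(out.shape)                            # (4, 512)
```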

Page 59: UCL x DeepMind lecture series

A complete Transformer block

the beetle drove off

Multi-head self-attention

Page 60: UCL x DeepMind lecture series

A complete Transformer block

the beetle drove off

Multi-head self-attention → Add (skip connection) → Layer Norm

Page 61: UCL x DeepMind lecture series

A complete Transformer block

the beetle drove off

Multi-head self-attention → Add (skip connection) → Layer Norm → Feedforward layer

Page 62: UCL x DeepMind lecture series

A complete Transformer block

the beetle drove off

Multi-head self-attention → Add (skip connection) → Layer Norm → Feedforward layer → Add (skip connection) → Layer Norm

Skip connections provide a path for "top-down" influences
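Putting the pieces together, a complete block applies multi-head self-attention with a skip connection and layer norm, then the feedforward layer with another skip connection and layer norm. The sketch below is a simplified, single-sequence version (no dropout, no biases, no learned layer-norm gain) under the same illustrative dimensions as above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 512, 4
d_h = D // H

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean / unit variance (gain and bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    heads = []
    for h in range(H):
        Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        lam = softmax(Q @ K.T / np.sqrt(d_h))
        heads.append(lam @ V)
    return np.concatenate(heads, axis=-1) @ Wo

def feedforward(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2

def transformer_block(x, params):
    # Sub-layer 1: multi-head self-attention, skip connection, layer norm.
    x = layer_norm(x + multi_head_attention(x, *params["attn"]))
    # Sub-layer 2: position-wise feedforward layer, skip connection, layer norm.
    x = layer_norm(x + feedforward(x, *params["ffn"]))
    return x

params = {
    "attn": (rng.normal(scale=D**-0.5, size=(H, D, d_h)),
             rng.normal(scale=D**-0.5, size=(H, D, d_h)),
             rng.normal(scale=D**-0.5, size=(H, D, d_h)),
             rng.normal(scale=D**-0.5, size=(D, D))),
    "ffn": (rng.normal(scale=D**-0.5, size=(D, 2048)),
            rng.normal(scale=2048**-0.5, size=(2048, D))),
}

X = rng.normal(size=(T, D))                 # "the beetle drove off"
print(transformer_block(X, params).shape)   # (4, 512)
```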

Page 63: UCL x DeepMind lecture series

The Transformer: Position encoding of words

the beetle drove off

● Add a fixed quantity to the embedding activations (e.g. to e_beetle)

● The quantity added to each input embedding unit ∈ [-1, 1] depends on:

  ○ The dimension of the unit within the embedding
  ○ The (absolute) position of the words in the input
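One common concrete choice consistent with the slide is the fixed sinusoidal encoding of Vaswani et al. 2017, sketched below: every added value lies in [-1, 1] and depends on the unit's dimension and the word's absolute position.

```python
import numpy as np

def position_encoding(T, D):
    # pos: absolute word position; i: dimension index within the embedding.
    pos = np.arange(T)[:, None]                       # (T, 1)
    i = np.arange(D)[None, :]                         # (1, D)
    angle = pos / np.power(10_000, (2 * (i // 2)) / D)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dimensions use cosine
    return pe                                         # every entry lies in [-1, 1]

T, D = 4, 512
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))                           # embeddings for "the beetle drove off"
X = X + position_encoding(T, D)                       # added to the embeddings, not concatenated
```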

Page 64: UCL x DeepMind lecture series


1. Words are not discrete symbols

2. Disambiguation depends on context

3. Important interactions can be non-local

4. How meanings combine depends on those meanings

Page 65: UCL x DeepMind lecture series


2. Disambiguation depends on context

Multi-head processing

4. How meanings combine depends on those meanings

1. Words are not discrete symbols → Distributed representations

3. Important interactions can be non-local

Page 66: UCL x DeepMind lecture series


2. Disambiguation depends on context

Multi-head processing

Self-attention

4. How meanings combine depends on those meanings

1. Words are not discrete symbols → Distributed representations

3. Important interactions can be non-local

Page 67: UCL x DeepMind lecture series


2. Disambiguation depends on context

Multi-head processing

Self-attention

Self-attention

4. How meanings combine depends on those meanings

1. Words are not discrete symbols → Distributed representations

3. Important interactions can be non-local → Multiple layers

Page 68: UCL x DeepMind lecture series


2. Disambiguation depends on context

Multi-head processing

Self-attention

Self-attention

Distributed representations

4. How meanings combine depends on those meanings

1. Words are not discrete symbols → Distributed representations

Skip connections

3. Important interactions can be non-local → Multiple layers

Page 69: UCL x DeepMind lecture series

the beetle drove off

+

Layer Norm

Feedforward layer

+

Layer Norm

Skip connections

Skip connections

Skip-connections - for "top-down" influences

Multi-head self-attention

Page 70: UCL x DeepMind lecture series

3. Unsupervised Learning with Transformers (BERT)


Page 71: UCL x DeepMind lecture series

Time flies like an arrow

Page 72: UCL x DeepMind lecture series

Time flies like an arrow

Fruit flies like a banana

Page 73: UCL x DeepMind lecture series

Time flies like an arrow

Fruit flies like a banana

John works like a trojan

The trains run like clockwork

Page 74: UCL x DeepMind lecture series

Time flies like an arrow

Fruit flies like a banana

Fido likes having his tummy rubbed

Grandma likes a good cuppa

John works like a trojanThe trains run like clockwork

Page 75: UCL x DeepMind lecture series

1. Words have many related senses


2. Disambiguation depends on context

3. Relevant context can be non-local

4. 'Composition' depends on what words mean

5. Understanding is balancing input with knowledge

Page 76: UCL x DeepMind lecture series

sucking up knowledge from words ...

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)

Page 77: UCL x DeepMind lecture series

BERT (Devlin et al. 2019)

sucking up _[MASK]_ from words ...

knowledge

15% of words masked - predict them!

Masked language model pretraining

Page 78: UCL x DeepMind lecture series

sucking up knowledge from words ...

knowledge

15% of words masked - predict them!

10% of these instances - leave word in place

BERT (Devlin et al. 2019)

Masked language model pretraining
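A sketch of the masking scheme described on these slides and in Devlin et al. 2019: roughly 15% of tokens become prediction targets; of those, 80% are replaced by [MASK], 10% by a random word, and 10% are left in place. The token IDs and vocabulary size below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 30_000                                    # vocabulary size (illustrative)
MASK_ID = 103                                 # id of the [MASK] token (illustrative)

def mask_tokens(token_ids, mask_prob=0.15):
    token_ids = np.array(token_ids)
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, -100)    # -100 marks positions that are not targets

    targets = rng.random(len(token_ids)) < mask_prob             # choose ~15% of positions
    labels[targets] = token_ids[targets]                         # the model must predict these

    r = rng.random(len(token_ids))
    inputs[targets & (r < 0.8)] = MASK_ID                        # 80%: replace with [MASK]
    random_idx = targets & (r >= 0.8) & (r < 0.9)
    inputs[random_idx] = rng.integers(0, V, random_idx.sum())    # 10%: replace with a random word
    # remaining 10%: leave the original word in place
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251, 1012])
```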

Page 79: UCL x DeepMind lecture series

Next sentence prediction pretraining

_[CLS]_ Sid went outside . _[SEP]_ It began to rain .

BERT (Devlin et al. 2019)

Page 80: UCL x DeepMind lecture series

Next sentence prediction pretraining

_[CLS]_ Sid went outside . _[SEP]_ Unfortunately it wasn't .

BERT (Devlin et al. 2019)
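A sketch of how next-sentence-prediction examples like the two above can be constructed: half the time the second segment really follows the first (label 1), half the time it is a randomly chosen sentence (label 0), and the [CLS] position is trained to predict the label. The toy corpus is illustrative.

```python
import random

random.seed(0)
sentences = ["Sid went outside .", "It began to rain .", "Unfortunately it wasn't .",
             "Masses of text .", "From the internet ."]          # illustrative corpus

def make_nsp_example(i):
    first = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        second, label = sentences[i + 1], 1                      # true next sentence
    else:
        second, label = random.choice(sentences), 0              # random sentence
    text = f"[CLS] {first} [SEP] {second}"
    return text, label    # the [CLS] representation is trained to predict `label`

print(make_nsp_example(0))
```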

Page 81: UCL x DeepMind lecture series

BERT pretraining

_[CLS]_ Masses of text . _[SEP]_ From the internet .

BERT (Devlin et al. 2019)

Page 82: UCL x DeepMind lecture series

BERT fine-tuning

BERT (Devlin et al. 2019)

_[CLS]_ A small amount of . _[SEP]_ Task-specific data
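A sketch of what fine-tuning adds: a small task-specific head on the [CLS] representation, trained on a small amount of labelled data together with the pretrained weights. The `pretrained_bert` stand-in, hidden size and token IDs below are hypothetical placeholders, not a real API.

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_classes = 768, 2                       # BERT-base hidden size; binary task (illustrative)

def pretrained_bert(token_ids):
    # Stand-in for the pretrained encoder: returns one D-dim vector per token.
    # In practice this is the full stack of pretrained Transformer blocks.
    return rng.normal(size=(len(token_ids), D))

W_cls = rng.normal(scale=D**-0.5, size=(D, num_classes))   # new, task-specific head

def classify(token_ids):
    hidden = pretrained_bert(token_ids)       # (T, D)
    cls_vec = hidden[0]                       # representation at the [CLS] position
    logits = cls_vec @ W_cls                  # task-specific scores
    return logits

# Fine-tuning = training W_cls *and* the encoder on a small labelled dataset
# (cross-entropy loss on `logits`), rather than training everything from scratch.
print(classify([101, 2023, 3185, 2001, 2307, 102]))
```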

Page 83: UCL x DeepMind lecture series


BERT supercharges transfer learning

Page 84: UCL x DeepMind lecture series

1. Words have many related senses


2. Disambiguation depends on context

3. Relevant context can be non-local

4. 'Composition' depends on what words mean

5. Understanding is balancing input with knowledge

Page 85: UCL x DeepMind lecture series

1. Words have many related senses


2. Disambiguation depends on context

3. Relevant context can be non-local

4. 'Composition' depends on what words mean

Transfer with unsupervised learning

5. Understanding is balancing input with knowledge

Page 86: UCL x DeepMind lecture series

4. Extracting language-relevant knowledge from the environment


Page 87: UCL x DeepMind lecture series

Building a multi-sensory understanding of the world

Language Vision Actions

Peters, Matthew E., et al. "Deep contextualized word representations." arXiv:1802.05365 (2018).
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv:1810.04805 (2018).
Pathak, Deepak, et al. "Context encoders: Feature learning by inpainting." CVPR 2016.
Pathak, Deepak, et al. "Learning features by watching objects move." CVPR 2017.
Wayne, Greg, et al. "Unsupervised predictive memory in a goal-directed agent." arXiv:1803.10760 (2018).
Ha, David, and Jürgen Schmidhuber. "World models." arXiv:1803.10122 (2018).

http://jalammar.github.io/illustrated-bert

Page 88: UCL x DeepMind lecture series

Knowledge aggregation from prediction

Helmholtz
William James
P. Elias: predictive coding
H. Barlow: cortex as a model builder
U. Neisser: analysis by synthesis
McClelland and Rumelhart: interactive activation model
Rao and Ballard
Dayan and Hinton: Helmholtz machine
J. Elman: Finding Structure in Time

Page 89: UCL x DeepMind lecture series
Page 90: UCL x DeepMind lecture series

Questions to diagnose knowledge acquisition

Page 91: UCL x DeepMind lecture series

Predictive agents

Page 92: UCL x DeepMind lecture series

Results

Oracle:
● Without stop-gradient

Baselines:
● Question only
● LSTM

Predictive models:
● SimCore
● CPC|A

Page 94: UCL x DeepMind lecture series

To conclude

Page 95: UCL x DeepMind lecture series

1. Words have many related senses

2. Disambiguation depends on context

3. Relevant context can be non-local and non-linguistic

4. 'Composition' depends on what words mean

5. Understanding requires background knowledge... not always from language

Page 96: UCL x DeepMind lecture series

2. Disambiguation depends on context

3. Relevant context can be non-local and non-linguistic

4. 'Composition' depends on what words mean

5. Understanding requires background knowledge... not always from language

1. Words have many related senses

Transformers

Page 97: UCL x DeepMind lecture series

2. Disambiguation depends on context

3. Relevant context can be non-local and non-linguistic

4. 'Composition' depends on what words mean

5. Understanding requires background knowledge... not always from language

1. Words have many related senses

Self-supervised / unsupervised learning

Transformers

Page 98: UCL x DeepMind lecture series

2. Disambiguation depends on context

3. Relevant context can be non-local and non-linguistic

4. 'Composition' depends on what words mean

5. Understanding requires background knowledge... not always from language

Transformers

1. Words have many related senses

Self-supervised / unsupervised learning

Embodied learning

Page 99: UCL x DeepMind lecture series

Understanding intentions

Social learning

Event cognition

Fast-mapping

Goal-directed dialogue

Page 100: UCL x DeepMind lecture series

letters → words → syntax → meaning → prediction/response

The 'pipeline' view of language processing

Page 101: UCL x DeepMind lecture series

An alternative schematic model of language processing

[Figure: letters / sounds, words, syntax and context (the external stimulus) interact with meaning and with general knowledge / a model of the world (the knowledge we have) to produce a prediction/response]

See e.g. McClelland et al. 2019

Page 102: UCL x DeepMind lecture series

Selected references

Early treatment of distributed representations in neural language models:
Natural language processing with modular PDP networks and distributed lexicon. Miikkulainen, Risto, and Michael G. Dyer. Cognitive Science, 1991.

The Transformer architecture:
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. NeurIPS 2017.

BERT:
BERT: Pre-training of deep bidirectional transformers for language understanding. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. NAACL 2019.

Embodied language learning at DeepMind:
Environmental Drivers of Generalization and Systematicity in the Situated Agent. Hill et al. ICLR 2020.

Robust Instruction-Following in a Situated Agent via Transfer Learning from Text. Hill et al. Under review.

Probing emergent semantic knowledge in predictive agents via question answering. Carnevale et al. Under review.

Page 103: UCL x DeepMind lecture series

Thank you