large pretrained models - nlp.cs.hku.hk


Lingpeng Kong

Department of Computer Science, The University of Hong Kong
Many materials from Stanford CS224n, with special thanks!

Large Pretrained Models — COMP3361, Week 9

Pretrained Models in the Past Four Years

Microsoft Research Blog. Oct 6, 2021.


Pretrained Models are Expensive

A single training run:

552 metric tons of carbon dioxide (the annual emissions of about 120 cars)

~$12 million

Pretraining and Contextualized Word Representations

Transformer

[CLS] I feel like eating [MASK] [SEP] What [MASK] you want ? [SEP]

NSP and MLM prediction heads on top of the Transformer

$\mathbb{E}_{p(x_i, \hat{x}_i)}\left[\, p(x_i \mid \hat{x}_i) \,\right]$

Pretraining and Contextualized Word Representations

Jurassic Park lacks the emotional unity of Spielberg’s classics .

Neural Network Encoder (LSTMs, Transformers, etc.)

contextualized word representation

Implicit linguistic knowledge

Pretraining and Fine-tuning

Jurassic Park lacks the emotional unity of Spielberg’s classics .

Neural Network Encoder (LSTMs, Transformers, etc.)

hundreds of millions of parameters

MLP Layer

hundreds of parameters

$7,079

Key Elements in BERT

Transformer — neural representation learner

Masked Language Modeling (MLM), Next Sentence Prediction (NSP) — pretraining objective

Bidirectional Encoder — type of architecture

Neural Representation Learners

LSTM: ELMo

Transformer: BERT, GPT-2, GPT-3, BART, T5, XLNet

Why Transformers?

[Figure: stacked computing blocks, each containing an FFN, map an input $x_t$ to an output $x_{t+1}$]

Why Transformers?

[Figure: an unrolled recurrent network — a shared cell A reads $x_{t-1}, x_t, x_{t+1}$ one step at a time, producing hidden states $h_{t-1}, h_t, h_{t+1}$]

Why Transformers?

self-attention

Direct pair-wise interaction between any tokens in the sequence

Pretraining Objective

x: I feel like eating <MASK> today. What <MASK> you want to eat?
y: noodles, do

training instance (NSP):

x: I feel like eating <MASK> today. ||| What <MASK> you want to eat?
y: True
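The two instance types above can be sketched in code. This is a minimal illustration, not BERT's exact recipe: the helper names are invented, and real BERT masks ~15% of tokens but sometimes keeps the original token or substitutes a random one instead of always writing `<MASK>`.

```python
import random

MASK = "<MASK>"

def make_mlm_instance(tokens, mask_prob=0.15, seed=0):
    """MLM instance: mask random tokens; targets are the originals."""
    rng = random.Random(seed)
    x, y = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            x.append(MASK)   # corrupted input keeps the position
            y.append(tok)    # target: the word that was masked out
        else:
            x.append(tok)
    return x, y

def make_nsp_instance(sent_a, sent_b, corpus, p_next=0.5, seed=0):
    """NSP instance: keep the true next sentence (label True) or
    swap in a random sentence from the corpus (label False)."""
    rng = random.Random(seed)
    if rng.random() < p_next:
        return (sent_a, sent_b), True
    return (sent_a, rng.choice(corpus)), False

x, y = make_mlm_instance("I feel like eating noodles today".split(),
                         mask_prob=0.3, seed=1)
assert x.count(MASK) == len(y)   # one target per masked position
```

Note that neither objective needs human labels: both x and y are derived mechanically from raw text.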

Pretraining Objective

What makes a good pretraining objective?

1. No human labeling should be involved.

2. Leads to good representations. (How and why?)

Mutual Information

$I(A, B) = H(A) - H(A \mid B) = H(B) - H(B \mid A)$
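The identity can be checked numerically on a toy joint distribution (the 2x3 table below is arbitrary, chosen only for illustration):

```python
import numpy as np

# Toy joint distribution p(a, b) over a 2x3 outcome space (values arbitrary).
p_ab = np.array([[0.10, 0.25, 0.05],
                 [0.30, 0.10, 0.20]])

def H(p):
    """Shannon entropy (bits) of a distribution given as an array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_a = p_ab.sum(axis=1)              # marginal over A
p_b = p_ab.sum(axis=0)              # marginal over B

# Chain rule: H(A|B) = H(A,B) - H(B), and symmetrically for B.
I_1 = H(p_a) - (H(p_ab) - H(p_b))   # I(A,B) = H(A) - H(A|B)
I_2 = H(p_b) - (H(p_ab) - H(p_a))   # I(A,B) = H(B) - H(B|A)
assert abs(I_1 - I_2) < 1e-12       # both decompositions give the same value
```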

Goal of Training:

$$I(A, B) \;\ge\; \mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \mathbb{E}_{q(\tilde{B})}\!\left[ \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \right] + \log |\tilde{B}|$$

$$\mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \qquad \text{Cross Entropy (Softmax)}$$

$$f_\theta(a, b) = g_\psi(b)^\top g_\omega(a), \qquad \theta = \{\omega, \psi\}$$

InfoNCE (Logeswaran & Lee, 2018; van den Oord et al., 2019)
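The cross-entropy form of InfoNCE can be sketched with in-batch negatives: each row's positive pair sits on the diagonal of a score matrix, and the other columns act as the negatives $\tilde{b}$. The helper name and matrix layout are illustrative assumptions, not from the slides.

```python
import numpy as np

def info_nce_loss(scores):
    """InfoNCE as a cross entropy: scores[i, j] = f_theta(a_i, b_j).

    Row i's positive pair sits on the diagonal; every other b_j in the
    batch acts as a negative. Returns the mean of
    log sum_j exp f(a_i, b_j) - f(a_i, b_i), i.e. the negated bound
    up to the constant log |B~|.
    """
    f = np.asarray(scores, dtype=float)
    m = f.max(axis=1, keepdims=True)                    # stabilize exp
    lse = m[:, 0] + np.log(np.exp(f - m).sum(axis=1))   # log-sum-exp per row
    return float(np.mean(lse - np.diag(f)))

# Well-separated positives score a lower loss than uninformative scores.
assert info_nce_loss([[5.0, 0.0], [0.0, 5.0]]) < info_nce_loss([[0.0, 0.0], [0.0, 0.0]])
```

Minimizing this loss therefore maximizes the mutual-information lower bound on the slide, since the two differ only by the constant $\log|\tilde{B}|$ and a sign.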

Mutual Information

$$\mathbb{E}_{p(a,b)}\!\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \qquad \text{Cross Entropy (Softmax)}$$

[Figure: the score $f$ compares view $a$ against candidate words $b$ such as “Hope” and “Fear”]

Masked Language Modeling

View a — the corrupted context of word i, encoded by $g_\omega(a)$

View b — word i itself, encoded by $g_\psi(b)$

[Figure: a Transformer reads “What [MASK] you want ?” and predicts the masked word “do”]

Next Sentence Prediction

Transformer

[CLS] I feel like eating ramen [SEP] What do you want ? [SEP] → NSP

Binary Classification — “local” NCE (Gutmann and Hyvärinen, 2012)

“global” NCE

[Figure: “global” NCE — the true pair “[CLS] I feel like eating ramen [SEP] What do you want ? [SEP]” is scored against $|\tilde{B}|$ alternative second sentences, each encoded by its own Transformer pass]

Connections with Computer Vision

Deep InfoMax (DIM; Hjelm et al., 2019)

Type of Architecture

Encoders

Encoder-Decoders

Decoders

Parameters are what we get from the pretraining process.

Pros for the “encoders” architecture:

Gets bidirectional context.

Easy to use in language understanding tasks!

Other members in the family:

BERT for Understanding

BERT

[CLS] This must be the greatest movie ever !

Positive / Negative

BERT for Generation

<MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK>

What

BERT

BERT for Generation

What <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK>

do

Input has been changed. The representations will need to be recomputed!

Not a very good idea…

BERT

Pretrained Models

— pretraining objective

— neural representation learner

— type of architecture

$\mathbb{E}_{p(x_i, \hat{x}_i)}\left[\, p(x_i \mid \hat{x}_i) \,\right]$

GPT (Generative Pretrained Transformer)

Radford et al., 2018

Decoders

Transformer as Decoder

Happy mid autumn festival

Need to prevent attention to future words.

[Figure: causal attention — given the prefix “&lt;s&gt; Happy mid autumn”, each position attends only to itself and earlier words and predicts the next word, producing “Happy mid autumn festival”]
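The causal mask can be sketched in a few lines; `causal_attention_weights` is an illustrative helper that operates directly on a raw score matrix (a real decoder computes scores from queries and keys first).

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax after masking future positions.

    scores[t, s] is the raw attention score from query position t to key
    position s; entries with s > t get -inf so position t can only attend
    to itself and earlier words.
    """
    scores = np.asarray(scores, dtype=float)
    T = scores.shape[0]
    allowed = np.tril(np.ones((T, T), dtype=bool))      # lower triangle
    masked = np.where(allowed, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
assert np.allclose(w[0], [1, 0, 0, 0])                  # first word sees only itself
assert np.allclose(w[3], [0.25, 0.25, 0.25, 0.25])      # last word sees everything
```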

GPT (Generative Pretrained Transformer)

Radford et al., 2018

[Animation: the Transformer reads the previous context through an embedding lookup table and outputs a distribution over the next word]

GIF credit: Lena Voita

GPT for Understanding

GPT

This must be the greatest movie ever !

Positive / Negative

GPT for Generation

GPT

This must be the greatest movie

ever

GPT for Generation

GPT

This must be the greatest movie ever

!
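The two generation slides above reduce to a loop: condition on the prefix, take the most likely next word, append it, repeat. A toy sketch, with an invented bigram table standing in for GPT's softmax over the vocabulary:

```python
# A toy bigram table stands in for GPT's next-word distribution;
# the table and the sentence are invented for illustration.
bigram = {
    "<s>": "This", "This": "must", "must": "be", "be": "the",
    "the": "greatest", "greatest": "movie", "movie": "ever",
    "ever": "!", "!": "</s>",
}

def generate(next_word, max_len=16):
    """Greedy decoding: repeatedly append the model's top next word."""
    out = ["<s>"]
    while len(out) < max_len:
        nxt = next_word[out[-1]]    # real GPT: argmax over a softmax
        if nxt == "</s>":
            break
        out.append(nxt)
    return " ".join(out[1:])

assert generate(bigram) == "This must be the greatest movie ever !"
```

Unlike BERT, no representation needs to be recomputed as the sequence grows: each new word only extends the context.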

Just “grow” the transformer!

T5 (Text-to-Text Transfer Transformer)

Raffel et al., 2020

Encoder-Decoders

T5 (Text-to-Text Transfer Transformer)

Raffel et al., 2020

Input: Thank you <X> me to your party <Y> week.

Target: <X> for inviting <Y> last <Z>
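The corruption step can be sketched as follows; `span_corrupt` and the explicit span positions are illustrative (T5 samples the spans at random):

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption: replace each (start, end) span with a
    sentinel <X>, <Y>, ... and emit the dropped spans as the target.

    Span positions are given explicitly here; T5 samples them at random.
    """
    sentinels = ["<X>", "<Y>", "<Z>"]
    inp, tgt, prev = [], [], 0
    for sent, (s, e) in zip(sentinels, spans):
        inp += tokens[prev:s] + [sent]      # keep text, drop span, mark it
        tgt += [sent] + tokens[s:e]         # target lists the dropped text
        prev = e
    inp += tokens[prev:]
    tgt += [sentinels[len(spans)]]          # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
assert inp == "Thank you <X> me to your party <Y> week ."
assert tgt == "<X> for inviting <Y> last <Z>"
```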


ELMo (Embeddings from Language Models)

Encoders

Bidirectional Language Model

Peters et al., 2018


BART (Denoising Sequence-to-Sequence Pre-training)

Lewis et al., 2019

Encoder-Decoders


InfoWord

Kong et al., 2019

[Figure: InfoWord — a global view (the full masked sentence, encoded by one Transformer) is contrasted with local views (masked spans, each encoded by a Transformer), and the model learns to score “Real” pairs above “Fake” ones]
