large pretrained models - nlp.cs.hku.hk
TRANSCRIPT
Lingpeng Kong
Department of Computer Science, The University of Hong Kong Many materials from Stanford CS224n with special thanks!
Large Pretrained Models
COMP3361 — Week 9
Pretrained Models in the Past Four Years
Microsoft Research Blog. Oct 6, 2021.
Pretrained Models are Expensive
One single training run
552 metric tons of carbon dioxide (roughly the annual emissions of 120 cars)
$12 million
Pretraining and Contextualized Word Representations
Transformer
[CLS] I feel like eating [MASK] [SEP] What [MASK] you want ? [SEP]
NSP head, MLM head
$\mathbb{E}_{p(x_i, \hat{x}_i)}\left[p(x_i \mid \hat{x}_i)\right]$
Pretraining and Contextualized Word Representations
Jurassic Park lacks the emotional unity of Spielberg’s classics .
Neural Network Encoder (LSTMs, Transformers, etc.)
contextualized word representation
Implicit linguistic knowledge
Pretraining and Fine-tuning
Jurassic Park lacks the emotional unity of Spielberg’s classics .
Neural Network Encoder (LSTMs, Transformers, etc.)
hundreds of millions of parameters
MLP Layer
hundreds of parameters
$7,079
Key Elements in BERT
Transformer
Masked Language Modeling (MLM), Next Sentence Prediction (NSP)
— pretraining objective
— neural representation learner
Bidirectional Encoder — type of architecture
Neural Representation Learners
Transformer: BERT, GPT-2, GPT-3, BART, T5, XLNet, …
LSTM: ELMo
Why Transformers?
[Figure: stacked computing blocks 1, 2, …, i — each block (an FFN) maps its input x_t to an output x_{t+1}, and all positions in the sequence can be processed in parallel.]
Why Transformers?
[Figure: an RNN — the same cell A is applied step by step, computing h_{t-1} from x_{t-1}, then h_t from x_t, then h_{t+1} from x_{t+1}; each step must wait for the previous hidden state.]
Why Transformers?
self-attention
Direct pair-wise interaction between any two tokens in the sequence
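That direct pairwise interaction is exactly what scaled dot-product self-attention computes. A minimal single-head sketch in plain Python (toy dimensions; no multi-head attention, masking, or biases — just the mechanism):

```python
import math

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention on toy inputs.

    X: list of n token vectors. Every output position mixes information
    from every input position in one step -- the direct pairwise
    interaction, with no recurrence over time.
    """
    def matmul(A, B):  # (n x d) @ (d x m), pure-Python
        return [[sum(a[k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for a in A]

    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    out = []
    for q in Q:
        # score q against EVERY key, then softmax
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        weights = [e / Z for e in exps]
        # weighted average of ALL value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identity projection matrices, each token attends most strongly to itself but still receives a nonzero contribution from every other position.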
Pretraining Objective
I feel like eating <MASK> today. What <MASK> you want to eat?
training instance (MLM):
x: I feel like eating <MASK> today. What <MASK> you want to eat?
y: noodles, do
training instance (NSP):
x: I feel like eating <MASK> today. ||| What <MASK> you want to eat?
y: True
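Building the MLM half of such an instance can be sketched as follows (a simplified recipe and a hypothetical function name — BERT's actual procedure also keeps or randomizes some of the selected tokens instead of always masking):

```python
import random

def make_mlm_instance(tokens, mask_rate=0.15, seed=0):
    """Replace ~mask_rate of tokens with <MASK> to form x;
    y records the original token at each masked position, in order."""
    rng = random.Random(seed)
    x, y = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            x.append("<MASK>")
            y.append(tok)
        else:
            x.append(tok)
    return x, y

x, y = make_mlm_instance("i feel like eating noodles today".split(),
                         mask_rate=0.3)
```

No human labeling is involved: x and y both come mechanically from the raw text, which is the first property of a good pretraining objective listed below.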
Pretraining Objective
What makes a good pretraining objective?
1. No human labeling should be involved.
2. Leads to good representations. (How and why?)
Mutual Information
$I(A, B) = H(A) - H(A \mid B) = H(B) - H(B \mid A).$
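The identity is easy to verify numerically on a toy joint distribution, using the chain rule H(A | B) = H(A, B) − H(B) (all names below are ours, for illustration):

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(a, b): A and B are correlated.
joint = {("hope", "real"): 0.4, ("hope", "fake"): 0.1,
         ("fear", "real"): 0.1, ("fear", "fake"): 0.4}

p_a, p_b = {}, {}
for (a, b), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p
    p_b[b] = p_b.get(b, 0.0) + p

H_A, H_B, H_AB = H(p_a.values()), H(p_b.values()), H(joint.values())

# H(A|B) = H(A,B) - H(B) and H(B|A) = H(A,B) - H(A),
# so both forms of the identity give the same number:
I_1 = H_A - (H_AB - H_B)   # H(A) - H(A|B)
I_2 = H_B - (H_AB - H_A)   # H(B) - H(B|A)
```

Here I_1 and I_2 agree, and both are positive because knowing B reduces uncertainty about A.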
Goal of Training:
$I(A, B) \geq \mathbb{E}_{p(a,b)}\left[ f_\theta(a, b) - \mathbb{E}_{q(\tilde{B})}\left[ \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right] \right] + \log |\tilde{B}|,$
$\mathbb{E}_{p(a,b)}\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right].$
Cross Entropy (Softmax)
$f_\theta(a, b) = g_\psi(b)^\top g_\omega(a), \qquad \theta = \{\omega, \psi\}$
InfoNCE (Logeswaran & Lee, 2018; van den Oord et al., 2019)
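A one-sample version of this objective is short to write down. In the sketch below (helper names are ours), the scores are the dot products f_θ(a, b) = g_ψ(b)ᵀ g_ω(a), and the returned value is exactly the log softmax probability of picking the true b out of the candidate set B̃:

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def info_nce(g_omega_a, g_psi_b, g_psi_negatives):
    """InfoNCE for one (a, b) pair: score of the true b minus
    log-sum-exp over the candidate set (true b plus negatives).
    Maximizing this lower-bounds I(A, B), up to log |B~|."""
    scores = [dot(g_omega_a, g_psi_b)] + \
             [dot(g_omega_a, g) for g in g_psi_negatives]
    m = max(scores)  # stabilize the log-sum-exp
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - lse  # log P(true b | a, candidates)
```

The value is always negative (it is a log-probability) and grows toward zero as the true pair's score dominates the negatives.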
Mutual Information
$\mathbb{E}_{p(a,b)}\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \tilde{B}} \exp f_\theta(a, \tilde{b}) \right].$
Cross Entropy (Softmax)
“Hope” “Fear”
…
a b
Masked Language Modeling
$g_\omega(a)$   $g_\psi(b)$
View a — corrupted context of word i
View b — word i
…
Transformer
What [MASK] you want ? …  →  do
Next Sentence Prediction
Transformer
[CLS] I feel like eating ramen [SEP] What do you want ? [SEP]
NSP head
Binary Classification — “local” NCE (Gutmann and Hyvärinen, 2012)
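The “local” variant reduces to a per-pair logistic loss — one real-vs-fake decision per example instead of a softmax over a large candidate set (a sketch; the function name is ours):

```python
import math

def local_nce_loss(score, is_real):
    """'Local' NCE as in NSP: a single binary (real vs. fake)
    logistic decision on the pair's score f(a, b)."""
    p_real = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    return -math.log(p_real if is_real else 1.0 - p_real)
```

At score 0 the model is maximally unsure, giving loss ln 2; confident correct scores drive the loss toward zero.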
“global” NCE
Transformer
I feel like eating [SEP][CLS] ramen
What you want ? [SEP]do[CLS]
Transformer
[SEP][CLS]
Transformer
[SEP][CLS]
Transformer
[SEP][CLS]
Transformer
...<latexit sha1_base64="C+LMdhPjPVUFsZ7cnLmlQNDRtHs=">AAAB7XicbVA9SwNBEJ2LXzF+RS1tFoNgFe5ioWXQxjKC+YDkCHt7m2TN3u2xOxcIR/6DjYUitv4fO/+Nm+QKTXww8Hhvhpl5QSKFQdf9dgobm1vbO8Xd0t7+weFR+fikZVSqGW8yJZXuBNRwKWLeRIGSdxLNaRRI3g7Gd3O/PeHaCBU/4jThfkSHsRgIRtFKrd4kVGj65YpbdRcg68TLSQVyNPrlr16oWBrxGJmkxnQ9N0E/oxoFk3xW6qWGJ5SN6ZB3LY1pxI2fLa6dkQurhGSgtK0YyUL9PZHRyJhpFNjOiOLIrHpz8T+vm+Lgxs9EnKTIY7ZcNEglQUXmr5NQaM5QTi2hTAt7K2EjqilDG1DJhuCtvrxOWrWqd1WtPdQq9ds8jiKcwTlcggfXUId7aEATGDzBM7zCm6OcF+fd+Vi2Fpx85hT+wPn8Acwdj0Q=</latexit>
| B̃ |<latexit sha1_base64="lK4Vk0reLBOJvh3QwDhGbj4CvI4=">AAACIHicbVBNS8NAEN3Urxq/qh69BIsgHkpSD/VY6sVjBfsBTSibzaRdutmE3Y1QQn6KF/+KFw+K6E1/jZu2grY+GHi8N8PMPD9hVCrb/jRKa+sbm1vlbXNnd2//oHJ41JVxKgh0SMxi0fexBEY5dBRVDPqJABz5DHr+5Lrwe/cgJI35nZom4EV4xGlICVZaGlYarg8jyjPM6Ihf5KYb0cCNsBoTzDJXURZA1srzQjZd4MFP47BStWv2DNYqcRakihZoDysfbhCTNAKuCMNSDhw7UV6GhaKEgV6cSkgwmeARDDTlOALpZbMHc+tMK4EVxkIXV9ZM/T2R4UjKaeTrzuJ2uewV4n/eIFXhlZdRnqQKOJkvClNmqdgq0rICKoAoNtUEE0H1rRYZY4GJ0pmaOgRn+eVV0q3XnMta/bZebbYWcZTRCTpF58hBDdREN6iNOoigB/SEXtCr8Wg8G2/G+7y1ZCxmjtEfGF/fMLSkNw==</latexit>
Connections with Computer Vision
Deep InfoMax (DIM; Hjelm et al., 2019)
Type of Architecture
Encoders
Encoder-Decoders Decoders
Parameters are what we get from the pretraining process.
Pros for the “encoders” architecture:
Gets bidirectional context.
Easy to use in language understanding tasks!
Other members in the family:
BERT for Understanding
BERT
[CLS] This must be the greatest movie ever !
Positive / Negative
BERT for Generation
<MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK>
What
BERT
BERT for Generation
<MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK> <MASK>
What
What
do
Input has been changed. The representations will need to be recomputed!
Not a very good idea…
BERT
Pretrained Models
— pretraining objective
— neural representation learner
— type of architecture
$\mathbb{E}_{p(x_i, \hat{x}_i)}\left[p(x_i \mid \hat{x}_i)\right]$
GPT (Generative Pretrained Transformer)
Radford et al., 2018
Decoders
Transformer as Decoder
Happy mid autumn festival
Need to prevent attention to the future words.
[Figure: generating “Happy mid autumn festival” left to right — the input <s> Happy mid autumn predicts Happy mid autumn festival, with each position attending only to earlier positions.]
causal attention
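Causal attention is typically implemented as an additive mask on the attention scores: entries for future positions are set to −∞ so that, after the softmax, they receive zero weight. A minimal sketch:

```python
import math

def causal_mask(n):
    """n x n additive attention mask: position i may attend
    to positions j <= i only. Added to the raw attention scores
    before the softmax, -inf entries become zero attention weight."""
    return [[0.0 if j <= i else -math.inf for j in range(n)]
            for i in range(n)]
```

For example, row 0 (the first token) can only see itself, while the last row can see the whole prefix.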
GPT (Generative Pretrained Transformer)
Radford et al., 2018
…
Previous Context
Next Word
lookup table
Transformer
GIF credit: Lena Voita
GPT for Understanding
GPT
This must be the greatest movie ever !
Positive / Negative
GPT for Generation
GPT
This must be the greatest movie
ever
GPT for Generation
GPT
This must be the greatest movie ever
!
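The decoding loop behind these two slides is simple: predict the next word from the previous context, append it, repeat. A sketch with a hypothetical toy “model” that has memorized a single sentence (no real language model is involved):

```python
def greedy_generate(next_word, prompt, max_new=10, eos="!"):
    """GPT-style greedy decoding sketch: `next_word` stands in
    for a trained language model mapping a context to one word."""
    tokens = list(prompt)
    for _ in range(max_new):
        w = next_word(tokens)
        tokens.append(w)   # feed the prediction back as context
        if w == eos:
            break
    return tokens

# Hypothetical "model" that memorized one sentence:
SENT = "this must be the greatest movie ever !".split()
toy_lm = lambda context: SENT[len(context)]
```

Starting from the prompt “this must be the greatest movie”, the loop appends “ever” and then “!”, stopping at the end-of-sequence token — mirroring the two generation steps shown above.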
Just “grow” the transformer!
T5 (Text-to-Text Transfer Transformer)
Raffel et al., 2020
Encoder-Decoders
T5 (Text-to-Text Transfer Transformer)
Raffel et al., 2020
Input:  Thank you <X> me to your party <Y> week.
Target: <X> for inviting <Y> last <Z>
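The span-corruption preprocessing that produces such input/target pairs can be sketched as follows (a hypothetical function with fixed spans for clarity — T5's actual pipeline samples span positions and lengths randomly):

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption: each (start, length) span is replaced
    in the input by one sentinel <X>, <Y>, ...; the target lists each
    sentinel followed by the dropped tokens, ending with a final
    sentinel."""
    sentinels = ["<X>", "<Y>", "<Z>", "<W>"]
    inp, tgt, cut = [], [], set()
    for s, (start, length) in zip(sentinels, spans):
        tgt.append(s)
        tgt.extend(tokens[start:start + length])
        cut.update(range(start, start + length))
    si = iter(sentinels)
    prev_cut = False
    for i, tok in enumerate(tokens):
        if i in cut:
            if not prev_cut:        # one sentinel per contiguous span
                inp.append(next(si))
            prev_cut = True
        else:
            inp.append(tok)
            prev_cut = False
    tgt.append(sentinels[len(spans)])  # closing sentinel
    return inp, tgt
```

Running it on the slide's sentence with spans covering “for inviting” and “last” reproduces the input/target pair shown above.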
T5 (Text-to-Text Transfer Transformer)
Raffel et al., 2020
T5 (Text-to-Text Transfer Transformer)
Raffel et al., 2020
ELMo (Embeddings from Language Models)
Encoders
Bidirectional Language Model
Peters et al., 2018
ELMo (Embeddings from Language Models)
BART (Denoising Sequence-to-Sequence Pre-training)
Lewis et al., 2019
Encoder-Decoders
BART (Denoising Sequence-to-Sequence Pre-training)
Lewis et al., 2019
InfoWord
Kong et al., 2019
Transformer
[Figure: one Transformer encodes the global view (the full sentence) while others encode local views (spans); global–local pairs are contrasted as “Real” vs. “Fake”.]
Global View
Local View
“Real” “Fake”