Slide transcript: Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language (dialog-21.ru)
TRANSCRIPT
Adaptation of Deep Bidirectional Multilingual
Transformers for Russian Language
Kuratov Yuri, Arkhipov Mikhail
Neural Networks and Deep Learning Lab,
Moscow Institute of Physics and Technology
Plan
● Transfer learning and pretraining in NLP
  ○ BERT, ELMo, GPT
● RuBERT - transfer from Multilingual BERT model
● Evaluation of RuBERT on:
  ○ Classification (Paraphrase Identification and Sentiment Analysis)
  ○ Question Answering on SDSJ Task B (SQuAD)
● Results and conclusions
Transfer learning and pretraining in NLP
● Word Embeddings (w2v, GloVe)
  ○ word vectors are independent of context
● Language Model Pretraining
  ○ task is to predict the next word
  ○ P(Wi | W1, …, Wi-1)
  ○ ELMo, OpenAI GPT
● Masked Language Model Pretraining
  ○ P(Wi | W1, …, Wi-1, Wi+1, …, WN)
● Masked Language Model Pretraining and auxiliary tasks
  ○ Next sentence prediction (BERT)
● Combining Language Model, Masked Language Model and Seq2Seq
  ○ Unified Language Model Pre-training for Natural Language Understanding and Generation: https://arxiv.org/abs/1905.03197

BERT: https://arxiv.org/abs/1810.04805
ELMo: https://arxiv.org/abs/1802.05365
GPT: https://github.com/openai/gpt-2
Language Modeling
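The practical difference between the two pretraining objectives is what each token is allowed to condition on: a unidirectional language model only looks left, while a masked language model conditions on both sides. An illustrative numpy sketch of the two attention masks (not from the talk):

```python
import numpy as np

seq_len = 5

# Unidirectional LM (forward ELMo, GPT): token i attends only to positions
# 1..i, so it models P(W_i | W_1, ..., W_{i-1}).
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked LM (BERT): every token attends to the whole sequence, so it can
# model P(W_i | W_1, ..., W_{i-1}, W_{i+1}, ..., W_N).
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# The third token sees 3 positions under the causal mask ...
print(causal_mask[2].sum())         # 3
# ... but all 5 positions under the bidirectional mask.
print(bidirectional_mask[2].sum())  # 5
```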
Transfer learning and pretraining in NLP
BERT paper: https://arxiv.org/abs/1810.04805
Illustrated BERT, ELMo, GPT: http://jalammar.github.io/illustrated-bert/
Figure: Language Model Pretraining (Unidirectional) vs. Masked Language Model
BERT: Bidirectional Encoder Representations from Transformers
BERT paper: https://arxiv.org/abs/1810.04805
Figure: several input tokens are replaced by [mask]; the model predicts P(is_next_sentence) from the [CLS] position and recovers the masked tokens, e.g. P(dog | W0, …, W10) and P(he | W0, …, W10).
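BERT's masked-LM corruption follows a fixed recipe: 15% of input tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. A minimal toy sketch of that recipe (illustrative, not the original implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy BERT masked-LM corruption; returns (corrupted inputs, targets)."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                   # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep original
        else:
            targets.append(None)                  # position is not predicted
            inputs.append(tok)
    return inputs, targets

vocab = ["the", "dog", "he", "ran", "fast"]
inputs, targets = mask_tokens(["the", "dog", "ran", "fast"] * 5, vocab)
```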
Multilingual BERT: Why do we need BERT for Russian?
Three key motivations:
● BERT-based models show state-of-the-art performance on a wide range of NLP tasks
● Multilingual BERT model was trained for 104 languages on Wikipedia
  ○ vocabulary size: ~120k subtokens
  ○ only ~25k subtokens (~20%) in the vocabulary are related to the Russian language
  ○ the model has 180M parameters, and half of them are used by subtoken embeddings
  ○ so only 50% + 50% · 20% = 60% of total model parameters could be used for Russian texts
● Single-language BERT models outperform the Multilingual BERT model:
  ○ this was explored for English and Chinese BERT models by Google Research
Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
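The parameter-budget arithmetic above can be spelled out explicitly (numbers taken from the slide; the 180M total is approximate):

```python
# Rough share of Multilingual BERT parameters usable for Russian text.
total_params = 180e6
embedding_share = 0.5               # ~half of all parameters are subtoken embeddings
russian_vocab_share = 25e3 / 120e3  # ~25k of ~120k subtokens relate to Russian

# Non-embedding half is fully usable; of the embedding half, only the
# Russian-related ~20% of rows is ever touched by Russian input.
usable_share = (1 - embedding_share) + embedding_share * russian_vocab_share
print(f"{usable_share:.0%} of parameters usable for Russian text")  # ~60%
print(f"≈ {total_params * usable_share / 1e6:.0f}M of {total_params / 1e6:.0f}M parameters")
```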
RuBERT: Building Russian vocabulary
● We applied Subword NMT to Russian Wiki (80%) and news data (20%)
● Effect of the new Russian vocabulary:
  ○ 120k subtokens for the Russian language
  ○ ~1.5 times longer sequences can fit into the model
RuBERT: Transfer from Multilingual BERT model
http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz
How to initialize the RuBERT model?
● Random initialization
● Initialize with the multilingual model, random init for new subtokens
● Can we do better?
RuBERT: Transfer from Multilingual BERT model
Can we do better?
● Initialize with multilingual model and assemble embeddings for new subtokens:
bird = bi ##rd
Emb(bird) := Emb(bi) + Emb(##rd)
250k steps ≈ 2 days of computation on 8× Tesla P100 16GB
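The embedding-assembly idea above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code; the vocabulary layout and `split_fn` helper are assumptions:

```python
import numpy as np

def assemble_embeddings(old_emb, old_vocab, new_vocab, split_fn, rng):
    """Build an embedding matrix for new_vocab from a trained one.

    Subtokens shared with the old vocabulary keep their vectors; a new
    subtoken gets the sum of the vectors of the old subtokens it splits
    into (e.g. Emb(bird) := Emb(bi) + Emb(##rd)); anything unsplittable
    falls back to random initialization.
    """
    dim = old_emb.shape[1]
    new_emb = np.empty((len(new_vocab), dim))
    for i, tok in enumerate(new_vocab):
        if tok in old_vocab:
            new_emb[i] = old_emb[old_vocab[tok]]
        else:
            pieces = [p for p in split_fn(tok) if p in old_vocab]
            if pieces:
                new_emb[i] = sum(old_emb[old_vocab[p]] for p in pieces)
            else:
                new_emb[i] = rng.normal(scale=0.02, size=dim)
    return new_emb

# Toy vocabularies: "bird" is new and splits into old pieces "bi" + "##rd".
old_vocab = {"bi": 0, "##rd": 1, "cat": 2}
old_emb = np.arange(9, dtype=float).reshape(3, 3)
split = {"bird": ["bi", "##rd"]}.get
new_emb = assemble_embeddings(
    old_emb, old_vocab, ["cat", "bird"],
    split_fn=lambda t: split(t, []), rng=np.random.default_rng(0))
print(new_emb[1])  # Emb(bi) + Emb(##rd) = [0,1,2] + [3,4,5] = [3. 5. 7.]
```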
RuBERT: Training details
The model was trained in two stages:
● train full BERT on sequences of 128-subtoken length
● train only positional embeddings on 512-length sequences

We used the following hyperparameters:
● batch size: 256
● learning rate: 2·10⁻⁵
● optimizer: Adam
● L2 regularization: 10⁻²
To support multi-GPU training we made a fork of the original TensorFlow BERT repo:
https://github.com/deepmipt/bert/tree/feat/multi_gpu
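The second training stage updates only the positional embeddings while the rest of the network stays frozen. A minimal sketch of that kind of selective update (toy parameter shapes, not the authors' training code):

```python
import numpy as np

# Toy parameter dict standing in for a BERT checkpoint.
rng = np.random.default_rng(0)
params = {
    "token_emb": rng.normal(size=(100, 4)),
    "pos_emb": rng.normal(size=(512, 4)),
    "encoder_w": rng.normal(size=(4, 4)),
}

def sgd_step(params, grads, lr, trainable):
    """Update only the parameters named in `trainable`."""
    for name in params:
        if name in trainable:
            params[name] -= lr * grads[name]

grads = {k: np.ones_like(v) for k, v in params.items()}
before = {k: v.copy() for k, v in params.items()}

# Stage 1 (seq len 128) would pass trainable=set(params);
# Stage 2 (seq len 512) freezes everything except positional embeddings.
sgd_step(params, grads, lr=0.1, trainable={"pos_emb"})

print(np.allclose(params["token_emb"], before["token_emb"]))  # True (frozen)
print(np.allclose(params["pos_emb"], before["pos_emb"]))      # False (updated)
```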
RuBERT for Classification
RuBERT for Classification: Paraphrase Identification
● ParaPhraser - dataset for Russian paraphrase detection (~7k training pairs)
  ○ "9 мая метрополитен Петербурга будет работать круглосуточно" ("On May 9, the St. Petersburg metro will run around the clock")
  ○ "Петербургское метро в ночь на 10 мая будет работать круглосуточно" ("The St. Petersburg metro will run around the clock on the night of May 10")
  ○ domain: news
We compare BERT-based models with other models in the non-standard run setting, where all resources were allowed.
http://paraphraser.ru/download/
[1] Pivovarova, L., et al. (2017). ParaPhraser: Russian paraphrase corpus and shared task.
[2] Kravchenko, D. (2017). Paraphrase detection using machine translation and textual similarity algorithms.
| Model | F-1 | Accuracy |
| --- | --- | --- |
| Classifier + linguistic features [1] | 81.10 | 77.39 |
| Machine Translation + Semantic similarity [2] | 78.51 | 81.41 |
| BERT multilingual | 85.48 ± 0.19 | 81.66 ± 0.38 |
| RuBERT | 87.73 ± 0.26 | 84.99 ± 0.35 |
RuBERT for Classification: Sentiment Analysis
● RuSentiment - dataset for sentiment analysis of posts from VKontakte
● domain: social networks
http://text-machine.cs.uml.edu/projects/rusentiment/
http://docs.deeppavlov.ai/en/master/intro/features.html#classification-component
Question Answering on SDSJ Task B (SQuAD)
● Context:
In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.
● Question:
Where do water droplets collide with ice crystals to form precipitation?
● datasets: Stanford Question Answering Dataset (SQuAD), Natural Questions, SDSJ 2017 Task B
● SDSJ Task B: ~50k context-question-answer triplets
SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Natural Questions: https://ai.google.com/research/NaturalQuestions
SDSJ 2017: https://sdsj.sberbank.ai/2017/ru/contest.html
http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b
BERT for Question Answering
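For extractive QA, BERT adds two output heads that score each context position as the start or the end of the answer span; the prediction is the highest-scoring span with start ≤ end. A minimal numpy sketch of that span-selection step (the logits below are made up for illustration):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], s <= e < s + max_len."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["precipitation", "forms", "within", "a", "cloud"]
start_logits = np.array([0.1, 0.2, 2.5, 0.3, 0.1])  # made-up model outputs
end_logits = np.array([0.1, 0.0, 0.2, 0.4, 3.0])
s, e = best_span(start_logits, end_logits)
print(tokens[s:e + 1])  # ['within', 'a', 'cloud']
```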
RuBERT for Question Answering on SDSJ Task B
http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b
Results
● ParaPhraser and SDSJ Task B: +4-6 F-1
● RuSentiment: +1 F-1 improvement over the previous state-of-the-art
● ParaPhraser and SDSJ Task B share the same domain as RuBERT's pretraining data (wiki, news)
● RuSentiment is more challenging due to domain shift, but we were still able to show good results
Beyond this work
We trained a single BERT model for Slavic languages (ru, bg, cs, pl) in the same manner as for Russian and evaluated it on the BSNLP 2019 Shared Task on Multilingual Named Entity Recognition.
These results were obtained on the validation set.
Results are from Arkhipov M., Trofimova M., Kuratov Y., Sorokin A., "Tuning Multilingual Transformers for Named Entity Recognition on Slavic Languages".
Results and conclusions
● We trained the RuBERT model for the Russian language
● We achieved significant improvements on several Russian datasets with the RuBERT model
● RuBERT, SlavicBERT and all pre-trained models are open-sourced, e.g.:
http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz
python -m deeppavlov install squad_ru_rubert
python -m deeppavlov download squad_ru_rubert
python -m deeppavlov interact/riseapi squad_ru_rubert
from deeppavlov import build_model, configs
model = build_model(configs.squad.squad_ru_rubert, download=True)
model(['DeepPavlov это библиотека для NLP и диалоговых систем.'], ['Что такое DeepPavlov?'])
>> [['библиотека для NLP и диалоговых систем'], [15], [2758812.25]]
github.com/deepmipt/DeepPavlov
docs.deeppavlov.ai