Character-Level Models
Hinrich Schütze
Center for Information and Language Processing, LMU Munich
2019-08-29
Overview
1 Motivation
2 fastText
3 CNNs
4 FLAIR
5 Summary
Typical NLP pipeline: Tokenization
Mr. O’Neill thinks that Brazil’s capital is Rio.
Mr.|O’Neill|thinks|that|Brazil|’s|capital|is|Rio|.
Typical NLP pipeline: Morphological analysis
For example: lemmatization
Mr. O’Neill knows that the US has fifty states
Mr. O’Neill know that the US have fifty state
Preprocessing in the typical NLP pipeline
Tokenization
Morphological analysis
Later today: BPEs
What is the problem with this?
Problems with typical preprocessing in NLP
Rules do not capture structure within tokens.
Regular morphology, e.g., compounding: “Staubecken” can mean “Staub-Ecken” (dusty corners) or “Stau-Becken” (dam reservoir)
Non-morphological, semi-regular productivity: cooooooooooool, fancy-shmancy, Watergate/Irangate/Dieselgate
Blends: Obamacare, mockumentary, brunch
Onomatopoeia, e.g., “oink”, “sizzle”, “tick tock”
Certain named entity classes: What is “lisinopril”?
Noise due to spelling errors: “signficant”
Noise that affects token boundaries, e.g., in OCR: “run fast” → “runfast”
Problems with typical preprocessing in NLP
Rules do not capture structure across tokens.
Noise that affects token boundaries, e.g., in OCR: “guacamole” → “guaca” “mole”
Recognition of names / multiphrase expressions: “San Francisco-Los Angeles flights”
“Nonsegmented” languages: Chinese, Thai, Burmese
Pipelines in deep learning (and StatNLP in general)
We have a pipeline consisting of two different subsystems:
A preprocessing component: tokenization, morphology, BPEs
The deep learning model that is optimized for a particular objective
The preprocessing component is not optimal for the objective and there are many cases where it’s outright harmful.
If we replace the preprocessing component with a character-level layer, we can train the architecture end2end and get rid of the pipeline.
Advantages of end2end vs. pipeline
End2end optimizes all parameters of a deep learning model directly for the learning objective, including “first-layer” parameters that connect the raw input representation to the first layer of internal representations of the network.
Pipelines generally don’t allow “backtracking” if an error has been made in the first element of the pipeline.
In character-level models, there is no such thing as an out-of-vocabulary word (OOV analysis).
Character-level models can generate words / units that did not occur in the training set (OOV generation).
End2end can deal better with human productivity (e.g., “brunch”), misspellings etc.
Three character-level models
fastText: bag of character ngrams
Character-aware CNN (Kim, Jernite, Sontag, Rush, 2015): CNN
FLAIR: character-level BiLSTM
fastText
FastText is an extension of word2vec.
It computes embeddings for character ngrams
A word’s embedding is the sum of its character ngram embeddings.
Parameters: minimum ngram length: 3, maximum ngram length: 6
The embedding of “dendrite” will be the sum of the following ngrams (“@” marks the word boundary):
whole word: @dendrite@
3-grams: @de den end ndr dri rit ite te@
4-grams: @den dend endr ndri drit rite ite@
5-grams: @dend dendr endri ndrit drite rite@
6-grams: @dendr dendri endrit ndrite drite@
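The ngram extraction described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not fastText's actual implementation: the function name `char_ngrams` is hypothetical, and the boundary symbol “@” follows the slide's notation, which is an assumption about presentation only.

```python
def char_ngrams(word, minn=3, maxn=6, boundary="@"):
    """Return the character ngrams of `word`, plus the whole padded word.

    `boundary` follows the "@" notation of the slide (an assumption of this sketch).
    """
    padded = f"{boundary}{word}{boundary}"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    if padded not in grams:
        grams.append(padded)  # the full word is kept as its own unit
    return grams


ngrams = char_ngrams("dendrite")
print(len(ngrams))  # 8 + 7 + 6 + 5 ngrams plus the word itself = 27
print(ngrams[:8])   # the eight 3-grams
```

Summing one embedding vector per returned string would then give the word embedding described above.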
fastText: Example for benefits
Embedding for character ngram “dendri” → “dendrite” and “dendritic” are similar
word2vec: no guarantee, especially for rare words
fastText paper
fastText objective
$$
\sum_{t=1}^{T} \sum_{c \in C_t} -\log p(w_c \mid w_t)
$$
$T$: length of the training corpus in tokens
$C_t$: words surrounding word $w_t$
Probability of a context word: softmax?
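$$
p(w_c \mid w_t) = \frac{\exp(s(w_t, w_c))}{\sum_{j=1}^{W} \exp(s(w_t, w_j))}
$$
$s(w_t, w_c)$: scoring function that maps a word pair to $\mathbb{R}$
$W$: size of the vocabulary
Problem: too expensive, because the denominator sums over the entire vocabulary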
Instead of softmax:
Negative sampling and binary logistic loss
$$
\log(1 + \exp(-s(w_t, w_c))) + \sum_{n \in N_{t,c}} \log(1 + \exp(s(w_t, w_n)))
$$
$$
= \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n))
$$
$N_{t,c}$: set of negative examples sampled from the vocabulary
$\ell(x) = \log(1 + \exp(-x))$ (logistic loss)
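As a small worked example of this loss for a single (target, context) pair, the following Python sketch evaluates the logistic loss on one positive score and a handful of negative scores. The vectors and the helper names (`dot`, `logistic_loss`, `negative_sampling_loss`) are made up for illustration; only the formula itself comes from the slide.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(x):
    """l(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))


def negative_sampling_loss(score_pos, scores_neg):
    """l(s(w_t, w_c)) + sum over negatives n of l(-s(w_t, w_n))."""
    return logistic_loss(score_pos) + sum(logistic_loss(-s) for s in scores_neg)


# Toy vectors (hypothetical values, chosen only for illustration).
u_target = [1.0, 0.0]
v_context = [2.0, 0.0]
v_negatives = [[-2.0, 0.0], [0.0, 1.0]]

loss = negative_sampling_loss(
    dot(u_target, v_context),
    [dot(u_target, v_n) for v_n in v_negatives],
)
print(loss)
```

The loss is small when the true context word scores high and the sampled negatives score low, and its cost depends on the number of negatives rather than on the vocabulary size.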
Binary logistic loss for corpus
$$
\sum_{t=1}^{T} \left[ \sum_{c \in C_t} \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, w_n)) \right]
$$
$\ell(x) = \log(1 + \exp(-x))$
Scoring function
$$
s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}
$$
$u_{w_t}$: the input vector of $w_t$
$v_{w_c}$: the output vector (or context vector) of $w_c$
Subword model
$$
s(w_t, w_c) = \frac{1}{|G_{w_t}|} \sum_{g \in G_{w_t}} z_g^{\top} v_{w_c}
$$
$G_{w_t}$: set of ngrams of $w_t$ and $w_t$ itself
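The subword scoring function can be illustrated with a toy computation, where z_g denotes the vector stored for ngram g. The ngram table `z`, the context vector, and the function name `subword_score` below are hypothetical; the sketch only mirrors the averaged sum in the formula on this slide.

```python
def subword_score(ngrams, z, v_context):
    """s(w_t, w_c) = (1 / |G|) * sum over g in G of (z_g . v_c)."""
    total = 0.0
    for g in ngrams:
        total += sum(a * b for a, b in zip(z[g], v_context))
    return total / len(ngrams)


# Toy ngram vectors for the word "cat" (values invented for illustration).
z = {
    "@ca": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "at@": [1.0, 1.0],
    "@cat@": [2.0, 0.0],  # the word itself is part of the set G
}
v_context = [1.0, 2.0]

score = subword_score(list(z), z, v_context)
print(score)  # (1 + 2 + 3 + 2) / 4 = 2.0
```

Because the score only needs vectors for ngrams, the same function applies to a word that never occurred in training, as long as its ngrams have vectors.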
fastText: Summary
Basis: word2vec skipgram
Objective: includes character ngrams as well as word itself
Result: word embeddings that combine word-level and character-level information
We can compute an embedding for any unseen word (OOV).
Letter n-gram generalization can be good
word2vec
1.000 automobile 779 mid-size 770 armored 763 seaplane 754 bus 754 jet 751 submarine 750 aerial 744 improvised 741 anti-aircraft
fastText
1.000 automobile 976 automobiles 929 Automobile 858 manufacturing 853 motorcycles 849 Manufacturing 848 motorcycle 841 automotive 814 manufacturer 811 manufacture
Letter n-gram generalization can be bad
word2vec
1.000 Steelers 884 Expos 865 Cubs 848 Broncos 831 Dinneen 831 Dolphins 827 Pirates 826 Copley 818 Dodgers 814 Raiders
fastText
1.000 Steelers 893 49ers 883 Steele 876 Rodgers 857 Colts 852 Oilers 851 Dodgers 849 Chalmers 849 Raiders 844 Coach
Letter n-gram generalization: no-brainer for unknowns (OOVs)
word2vec
(“video-conferences” did not occur in corpus)
fastText
1.000 video-conferences 942 conferences 872 conference 870 Conferences 823 inferences 806 Questions 805 sponsorship 800 References 797 participates 796 affiliations
fastText extensions (Mikolov et al, 2018)
Position-dependent features
Phrases (like word2vec)
cbow
Pretrained word vectors for 157 languages
fastText evaluation
Code
fastText: https://fasttext.cc
gensim: https://radimrehurek.com/gensim/
Pretrained fasttext embeddings
Afrikaans, Albanian, Alemannic, Amharic, Arabic, Aragonese, Armenian, Assamese, Asturian, Azerbaijani, Bashkir,
Basque, Bavarian, Belarusian, Bengali, Bihari, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese,
Catalan, Cebuano, Central Bicolano, Chechen, Chinese, Chuvash, Corsican, Croatian, Czech, Danish, Divehi, Dutch,
Eastern Punjabi, Egyptian Arabic, Emilian-Romagnol, English, Erzya, Esperanto, Estonian, Fiji Hindi, Finnish,
French, Galician, Georgian, German, Goan Konkani, Greek, Gujarati, Haitian, Hebrew, Hill Mari, Hindi, Hungarian,
Icelandic, Ido, Ilokano, Indonesian, Interlingua, Irish, Italian, Japanese, Javanese, Kannada, Kapampangan, Kazakh,
Khmer, Kirghiz, Korean, Kurdish (Kurmanji), Kurdish (Sorani), Latin, Latvian, Limburgish, Lithuanian, Lombard,
Low Saxon, Luxembourgish, Macedonian, Maithili, Malagasy, Malay, Malayalam, Maltese, Manx, Marathi,
Mazandarani, Meadow Mari, Minangkabau, Mingrelian, Mirandese, Mongolian, Nahuatl, Neapolitan, Nepali,
Newar, North Frisian, Northern Sotho, Norwegian (Bokmal), Norwegian (Nynorsk), Occitan, Oriya, Ossetian,
Palatinate German, Pashto, Persian, Piedmontese, Polish, Portuguese, Quechua, Romanian, Romansh, Russian,
Sakha, Sanskrit, Sardinian, Scots, Scottish Gaelic, Serbian, Serbo-Croatian, Sicilian, Sindhi, Sinhalese, Slovak,
Slovenian, Somali, Southern Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar,
Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Upper Sorbian, Urdu, Uyghur, Uzbek, Venetian, Vietnamese,
Volapuk, Walloon, Waray, Welsh, West Flemish, West Frisian, Western Punjabi, Yiddish, Yoruba, Zazaki, Zeelandic
fastText skipgram parameters
-input <path>: training file path
-output <path>: output file path
-lr (0.05): learning rate
-lrUpdateRate (100): rate of updates for the learning rate
-dim (100): dimensionality of word embeddings
-ws (5): size of the context window
-epoch (5): number of epochs
-minCount (5): minimal number of word occurrences
-neg (5): number of negatives sampled
-wordNgrams (1): max length of word ngram
-loss (ns): loss function ∈ { ns, hs, softmax }
-bucket (2,000,000): number of buckets
-minn (3): min length of char ngram
-maxn (6): max length of char ngram
-thread (12): number of threads
-t (0.0001): sampling threshold
-label <string>: labels prefix
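To show how these flags fit together, here is a hedged Python sketch that assembles a `fasttext skipgram` command line from the defaults listed on the slides. It only builds the argument list and does not run anything. The helper name `skipgram_command` and the file names `corpus.txt` and `vectors` are hypothetical, and the sketch assumes the `fasttext` binary would be on the PATH.

```python
import shlex

# Defaults copied from the parameter slides above.
SKIPGRAM_DEFAULTS = {
    "lr": 0.05,
    "dim": 100,
    "ws": 5,
    "epoch": 5,
    "minCount": 5,
    "neg": 5,
    "loss": "ns",
    "minn": 3,
    "maxn": 6,
}


def skipgram_command(input_path, output_path, **overrides):
    """Build the argv list for a fastText skipgram training run."""
    params = dict(SKIPGRAM_DEFAULTS)
    params.update(overrides)
    argv = ["fasttext", "skipgram", "-input", input_path, "-output", output_path]
    for name, value in params.items():
        argv += [f"-{name}", str(value)]
    return argv


# Hypothetical file names; shorter ngrams (minn=2) as an example override.
argv = skipgram_command("corpus.txt", "vectors", minn=2)
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` once a training corpus exists; that step is left out here so the sketch has no external dependencies.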
Convolutional Neural Networks (CNNs): Basic idea
We learn feature detectors.
Each feature detector has a fixed size, e.g., a width of three characters.
We slide the feature detector over the input (e.g., an input word).
The feature detector indicates for each point in the input the activation of the feature at that point.
Then we pass to the next layer the highest activation we’ve found.
Example task in following slides: detect capitalization
Convolution & pooling architecture
Figure (built up over four slides: architecture, input layer, convolution layer, convolution layer with filter size 3): the input layer holds the characters “@ M c C a i n @ l o s e s @”, the convolution layer sits above it, and the pooling layer sits on top.
Convolution layer (filter size 3): each activation is computed as a = g(H ⊙ X); the figure shows an example activation of 0.1.
![Page 45: Character-Level Modelshs/ranlp/pchar.flat.pdf · tokenization, morphology, BPEs The deep learning model that is optimized for a particular objective The preprocessing component is](https://reader033.vdocuments.us/reader033/viewer/2022052810/607f7524a5edb302696ed3d8/html5/thumbnails/45.jpg)
Convolutional layer: configuration
Convolutional filter: $a = g(H \odot X + b)$
$g$: nonlinearity (e.g., sigmoid)
$H$: filter parameters
$X$ is the input to the filter, of dimensionality $D \times k$
Kernel size (or filter size) $k$: length of subsequence
$D$ is the dimensionality of the embeddings.
$\odot$ is the (Frobenius) inner product: $H \odot X = \sum_{(i,j)} H_{ij} X_{ij}$
$H$ also has dimensionality $D \times k$.
Number of kernels/filters
Usually: mix of filters of different sizes
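The filter and pooling steps can be sketched in plain Python on the capitalization example. Everything specific here is an assumption made for illustration: the two-dimensional character features in `embed` (D = 2), the hand-set filter `H` of width k = 3, and the bias `b`. In a real model the embeddings and filter parameters would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def embed(ch):
    """Toy character embedding with D = 2: [is boundary, is uppercase]."""
    return [1.0 if ch == "@" else 0.0, 1.0 if ch.isupper() else 0.0]


def conv_max_pool(text, H, b):
    """Slide one filter over `text` and max-pool its activations.

    H is a D x k matrix, given as D rows with k entries each.
    """
    k = len(H[0])
    columns = [embed(ch) for ch in text]       # one D-dim vector per character
    activations = []
    for start in range(len(columns) - k + 1):
        window = columns[start:start + k]      # the D x k input X, stored column-wise
        inner = sum(H[i][j] * window[j][i]     # Frobenius inner product of H and X
                    for i in range(len(H)) for j in range(k))
        activations.append(sigmoid(inner + b))  # a = g(H (.) X + b)
    return max(activations), activations


# Hand-set filter: fires on "boundary followed by an uppercase letter".
H = [[4.0, 0.0, 0.0],
     [0.0, 4.0, 0.0]]
b = -6.0

pooled_cap, acts = conv_max_pool("@McCain@loses@", H, b)
pooled_low, _ = conv_max_pool("@mccain@loses@", H, b)
print(pooled_cap, pooled_low)
```

Max pooling keeps only the strongest response, so the pooled value signals that a capitalized word occurs somewhere in the input without recording where.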
dos Santos and Zadrozny (2014)
CNN for generating word embeddings
character embeddings, dimensionality $d^{chr}$
convolutional filter $W^0$ (here: width $k^{chr} = 3$)
input to filter: $z_m$, concatenation of $k^{chr} = 3$ character embeddings
output of filter: $g(W^0 \odot z_m + b^0)$
one output vector per position; here: $M = 9 - k^{chr} + 1 = 7$
max pooling: $[r^{wch}]_j = \max_{1 \le m \le M} [g(W^0 \odot z_m + b^0)]_j$
$r^{wch}$ is the character-based embedding of the input word “clearly”
Examples
Character-based word embeddings are trained end2end, here for a part-of-speech (POS) tagging task.
It’s apparent that the word embeddings reflect similarity for the POS task.
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| $d^{chr}$ | 10 | dimensionality of character embeddings |
| $k^{chr}$ | 5 | width of convolutional filters |
| $n$ | 50 | number of convolutional filters (= dimensionality of character-based word embeddings) |
POS performance of character-based word embeddings
If regular word embeddings (WNN; e.g., word2vec) are available (OOSV), then character-based word embeddings do not help.
If not (OOUV), then character-based word embeddings perform best.
Overall performance is also the best.
Hand-engineered features slightly worse than character-based word embeddings.
Kim et al. 2016 (AAAI)
Extensions
Convolutional filters of many sizes
Highway network
Hyperparameters
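| Parameter | Value | Description |
|---|---|---|
| $d^{chr}$ | 15 | dimensionality of character embeddings |
| $k^{chr}$ | 1, 2, 3, 4, 5, 6, 7 | width of convolutional filters |
| $n$ | $\min(200, 50 \cdot k^{chr})$ | number of filters per width |
| | 1100 | dimensionality of char-based word embeddings |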
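The figure of 1100 follows from the other rows: each filter contributes one max-pooled value, so the embedding dimensionality is the total number of filters across all widths. A short check in Python (variable names are only illustrative):

```python
widths = [1, 2, 3, 4, 5, 6, 7]

# Number of filters for each width: min(200, 50 * k).
filters_per_width = {k: min(200, 50 * k) for k in widths}

# One pooled value per filter, concatenated over all widths.
total_dim = sum(filters_per_width.values())

print(filters_per_width)
print(total_dim)  # 50 + 100 + 150 + 200 + 200 + 200 + 200 = 1100
```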
Language modeling results (perplexity)
Character-based model is on par with state of the art at the time
But has a smaller number of parameters
Examples
Here the objective is language modeling, so we get more than POS similarity (“advertised” / “advertising”)
Big difference before/after highway
Highway copies over useful character information (computer-aided / computer-guided) and filters out misleading character-based similarity (“loooook” / “cook”).
FLAIR: Akbik et al (2018)
FLAIR embeddings
First layer: biLSTM
Second layer: FLAIR word embedding: concatenation of fLM hidden state after last character and bLM hidden state before first character
(fLM = forward language model)
(bLM = backward language model)
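The indexing behind this concatenation can be illustrated with a toy sketch. Note the assumptions: `run_char_rnn` is an untrained, hand-written recurrence standing in for the character-level LSTM language models (it is not FLAIR's model), and “bLM hidden state before first character” is read here as the state the backward model holds once it has read the sentence right-to-left through the word's first character. The function names are hypothetical.

```python
import math


def run_char_rnn(chars, dim=4):
    """Toy stand-in for a character LM: one hidden state per character.

    NOT an LSTM; a fixed, untrained recurrence used only to show the indexing.
    """
    h = [0.0] * dim
    states = []
    for ch in chars:
        code = ord(ch)
        h = [math.tanh(0.5 * h[i] + math.sin(code * (i + 1))) for i in range(dim)]
        states.append(h)
    return states


def flair_style_embedding(sentence, start, end):
    """Embed the word sentence[start:end] from forward and backward states."""
    forward = run_char_rnn(sentence)               # reads left to right
    backward = run_char_rnn(sentence[::-1])[::-1]  # reads right to left, re-aligned
    h_fwd = forward[end - 1]  # forward state after the word's last character
    h_bwd = backward[start]   # backward state after reading back to the first character
    return h_fwd + h_bwd      # concatenation of the two state vectors


s1 = "George Washington"
s2 = "Washington State"
e1 = flair_style_embedding(s1, 7, 17)  # "Washington" with left context
e2 = flair_style_embedding(s2, 0, 10)  # "Washington" with right context
print(len(e1), e1 != e2)
```

Because both states summarize everything read so far in their direction, the same word receives different vectors in different sentences, which is the contextual behaviour FLAIR embeddings are designed to have.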
Motivation for FLAIR embeddings
FLAIR embeddings are contextual.
The same word has different FLAIR embeddings in different contexts!
Hope: context (e.g., “George”) incorporated into FLAIR embedding (e.g., “Washington”) in the context “George Washington”
Pretraining: In contrast to character embeddings learned for a specific task, FLAIR embeddings can be pretrained on huge unlabeled corpora
Plus: FLAIR embeddings have all the other advantages of character-based word embeddings, e.g., robustness against noise and no out-of-vocabulary words.
Typical use of FLAIR embeddings:
Sequence labeling
Extensions
Stacking embeddings: a FLAIR embedding can be extended with other word embeddings: word2vec, fastText etc.
Also: a FLAIR embedding can be extended with a task-trained embedding, i.e., trained end2end on the training set of the task
Nearest neighbors in FLAIR embedding space
This demonstrates that FLAIR embeddings indeed capture valuable context, so they are contextualized embeddings.
Performance of FLAIR
FLAIR beats three strong baselines – new state of the art.
Word embeddings give a big boost for some tasks.
Stacked embeddings are better than FLAIR embeddings only.
How much information do FLAIR Embeddings contain?
Only use a linear map from embeddings
| Model | NER English | NER German | chunking | POS |
|---|---|---|---|---|
| FLAIR, map | 81.42 | 73.90 | 90.50 | 97.26 |
| FLAIR, full model | 91.97 | 85.78 | 96.68 | 97.73 |
| word, map | 48.79 | 57.43 | 65.01 | 89.58 |
| word, full model | 88.54 | 82.32 | 95.40 | 96.94 |

Using FLAIR embeddings directly, without a sequence labeling model, performs surprisingly well (but big gap to full model).
In particular, FLAIR/map is much better than word/map.
| | fastText | CNN | FLAIR |
|---|---|---|---|
| architecture | ngram embed’s | CNN | biLSTM |
| | pipeline | end2end | mixed |
| language modeling | BOW/fixed pos. | filters | sequential |
| efficient to train? | + | − | − |
| pretrained available? | + | − | + |
| expressivity | − | + | + |
| combinable with word embeddings | + | + | + |
| within-token, OOVs | + | + | + |
| cross-token | inflexible | − | + |
Resources
See “references.pdf” document
Questions?