TRANSCRIPT
NLU lecture 6: Compositional character representations
Adam Lopez
Credits: Clara Vania · 2 Feb 2018
Let's revisit an assumption in language modeling (& word2vec).
When does this assumption make sense for language modeling?
But words are not a finite set!
• Bengio et al.: "Rare words with frequency ≤ 3 were merged into a single symbol, reducing the vocabulary size to |V| = 16,383."
• Bahdanau et al.: "we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK])."

--------------------------------------------------
Src | …
Ref | the main crop of japan is rice .
Hyp | the _UNK is popular of _UNK . _EOS
--------------------------------------------------
What if we could scale softmax to the training data vocabulary? Would that help?
SOFTMAX ALL THE WORDS
Idea: scale by partitioning
• Partition the vocabulary into smaller pieces.
Class-based LM: assign each word w_i to a class c_i, then

  p(w_i | h_i) = p(c_i | h_i) · p(w_i | c_i, h_i)

(a small sketch follows below)
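A minimal sketch of this factorization, assuming plain NumPy and a single flat class assignment (the helper names are hypothetical, not the lecture's code). The point is that normalization is only over the classes and over the words inside one class, never over the full vocabulary.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def class_factored_logprob(h, w, class_of, words_in_class, W_class, W_word):
    """log p(w | h) = log p(c | h) + log p(w | c, h) for a class-based LM.

    h              : hidden state / history representation, shape (d,)
    w              : index of the target word
    class_of       : array mapping word index -> class index
    words_in_class : dict mapping class index -> array of word indices in that class
    W_class        : class output embeddings, shape (num_classes, d)
    W_word         : word output embeddings, shape (|V|, d)
    """
    c = class_of[w]
    p_class = softmax(W_class @ h)                   # normalize over classes only
    members = words_in_class[c]
    p_word_in_class = softmax(W_word[members] @ h)   # normalize within the class only
    w_pos = int(np.where(members == w)[0][0])
    return np.log(p_class[c]) + np.log(p_word_in_class[w_pos])
```

With roughly √|V| classes of equal size, each probability costs O(√|V|) instead of O(|V|).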
• Partition the vocabulary into smaller pieces hierarchically (hierarchical softmax).
Brown clustering: hard clustering based on mutual information.
Idea: scale by partitioning
• Differentiated softmax: assign more parameters to more frequent words, fewer to less frequent words.
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
Partitioning helps
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015

Partitioning helps… but could be better: noise contrastive estimation
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
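Noise contrastive estimation sidesteps the softmax normalization by training the model to distinguish the observed word from k samples drawn from a noise distribution q. The sketch below shows the standard per-step NCE objective; it is a generic illustration (scores and noise probabilities are passed in), not the specific setup evaluated by Chen et al.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(score_true, scores_noise, logq_true, logq_noise, k):
    """Negative NCE log-likelihood for one observed word and k noise samples.

    score_true   : unnormalized model score s(w, h) of the observed word
    scores_noise : scores s(w', h) for k words sampled from the noise distribution q
    logq_true    : log q(w) for the observed word
    logq_noise   : log q(w') for each noise sample
    """
    # P(the observed word came from the data) = sigmoid(s - log(k * q))
    pos = np.log(sigmoid(score_true - np.log(k) - logq_true))
    # Noise samples should be classified as noise.
    neg = np.sum(np.log(sigmoid(-(scores_noise - np.log(k) - logq_noise))))
    return -(pos + neg)
```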
Partitioning helps… but could be better: skip the normalization step altogether
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015

Partitioning helps… but could be better: room for improvement
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
V is not finite
• Practical problem: softmax computation is linear in vocabulary size.
• Theorem. The vocabulary of word types is infinite.
  Proof 1: productive morphology, loanwords, "fleek".
  Proof 2: 1, 2, 3, 4, …
What set is finite? Characters. More precisely, Unicode code points.
Are you sure?!
Not all characters are the same, because not all languages have alphabets. Some have syllabaries (e.g. Japanese kana) and/or logographies (Chinese hànzì).
Rather than look up word representations…
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015

Compose character representations into word representations with LSTMs
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015
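A minimal sketch of this composition in PyTorch (the toolkit is an assumption; module names and sizes are illustrative, not Ling et al.'s code): embed each character, run a bidirectional LSTM over the character sequence, and concatenate the final forward and backward states to get the word vector.

```python
import torch
import torch.nn as nn

class CharBiLSTMWordEncoder(nn.Module):
    """Compose character embeddings into a word vector with a bi-LSTM
    (in the spirit of Ling et al., 2015)."""

    def __init__(self, num_chars, char_dim=50, word_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        # Half the word dimension per direction, so the concatenation has word_dim.
        self.bilstm = nn.LSTM(char_dim, word_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) character indices of each word
        # (padding handling omitted for brevity).
        x = self.char_emb(char_ids)                   # (batch, word_len, char_dim)
        _, (h_n, _) = self.bilstm(x)                  # h_n: (2, batch, word_dim // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)    # (batch, word_dim)
```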
Compose character representations into word representations with CNNs
Source: Character-aware neural language models, Kim et al. 2015
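And a comparable sketch of the CNN composition (in the spirit of Kim et al.'s character-aware LM; again, PyTorch, the filter widths, and the sizes are assumptions): convolve filters of several widths over the character embeddings and max-pool each filter over time.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Compose character embeddings into a word vector with a CNN + max-over-time pooling."""

    def __init__(self, num_chars, char_dim=15, filter_widths=(2, 3, 4), filters_per_width=64):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, filters_per_width, kernel_size=w) for w in filter_widths]
        )

    def forward(self, char_ids):
        # char_ids: (batch, word_len); word_len must be at least max(filter_widths)
        x = self.char_emb(char_ids).transpose(1, 2)        # (batch, char_dim, word_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values    # max over time for each filter
                  for conv in self.convs]
        return torch.cat(pooled, dim=-1)                   # (batch, len(widths) * filters_per_width)
```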
Character models actually work. Train them long enough, they generate words
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015

anterest, artifactive, capacited, capitaling, compensive, dermitories, despertator, dividement, extremilated, faxemary, follect, hamburgo, identimity, ipoteca, nightmale, orience, patholicism, pinguenas, sammitment, tasteman, understrumental, wisholver

"Wow, the disconversated vocabulations of their system are fantastics!"
— Sharon Goldwater
How good are character-level NLP models?
Implied(?): character-level neural models learn everything they need to know about language.

Word embeddings have obvious limitations:
• Closed vocabulary assumption
• Cannot exploit functional relationships in learning
And we know a lot about linguistic structure
Morpheme: the smallest meaningful unit of language
"loves" → love + s
  root/stem: love
  affix: -s
  morph. analysis: 3rd.SG.PRES
The ratio of morphemes to words varies by language
Analytic languages: one morpheme per word. Synthetic languages: many morphemes per word.
(Spectrum from analytic to synthetic: Vietnamese, English, Turkish, West Greenlandic.)
Morphology can change the syntax or semantics of a word
"love" (VB)
  Inflectional morphology: love (VB), loves (VB), loving (VB), loved (VB)
  Derivational morphology: lover (NN), lovely (ADJ), lovable (ADJ)
Morphemes can represent one or more features
Fusional languages: many features per morpheme
  (English) read-s   read-3SG.PRES   'reads'
Agglutinative languages: one feature per morpheme
  (Turkish) oku-r-sa-m   read-AOR.COND.1SG   'If I read …'
Words can have more than one stem
Affixation: one stem per word
  (English) studying = study + ing
Compounding: many stems per word
  (German) Rettungshubschraubernotlandeplatz = Rettung + s + hubschrauber + not + lande + platz
           = rescue + LNK + helicopter + emergency + landing + place
           'Rescue helicopter emergency landing pad'
Inflection is not limited to affixation
Base modification (English): drink, drank, drunk
Root & pattern (Arabic): k(a)t(a)b(a)   write-PST.3SG.M   'he wrote'
Reduplication (Indonesian): kemerah~merahan   red-ADJ   'reddish'
There are many different ways to compute word representations from subwords
Basic units of representation:
• Characters (Ling et al., 2015; Kim et al., 2016; Lee et al., 2016)
• Character n-grams (Sperr et al., 2013; Wieting et al., 2016; Bojanowski et al., 2016)
• Morphemes (Luong et al., 2013; Botha & Blunsom, 2014; Sennrich et al., 2016)
• Morphological analysis (Cotterell & Schütze, 2015; Kann & Schütze, 2016)
Compositional function: addition, bidirectional LSTMs, convolutional NNs, …
We've reviewed morphology, so we have some questions about character models:
• How do representations based on morphemes compare with those based on characters?
• What is the best way to compose subword representations?
• Do character-level models have the same predictive utility as models with knowledge of morphology?
• How do different representations interact with languages of different morphological typologies?
Prediction problem: neural language modeling
What we vary: the representation of the (open-vocabulary) history. Prediction stays closed-vocabulary (a word-level softmax).
Open-vocabulary prediction is interesting, but our goal is to understand representations, not build a better neural LM.
Variable: subword unit (example word: "wants")

Unit         | Examples
Morfessor    | ^want, s$
BPE          | ^w, ants$
char-trigram | ^wa, wan, ant, nts, ts$
character    | ^, w, a, n, t, s, $
analysis     | want+VB, +3rd, +SG, +Pres

The first four rows are approximations to morphology; the analysis row uses annotated morphology.
The last row is part of an oracle experiment: suppose you had an oracle that could tell you the true morphology. In this case, the oracle is a human annotator.
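To make the units concrete, here is a small sketch of how the character and char-trigram rows of the table can be produced for "wants", with ^ and $ as word-boundary markers (Morfessor, BPE, and morphological analysis need trained segmenters or human annotation, so they are not reproduced here):

```python
def character_units(word):
    """Character units with boundary markers: 'wants' -> ['^','w','a','n','t','s','$']."""
    return ['^'] + list(word) + ['$']

def char_trigram_units(word):
    """Overlapping character trigrams over the marked word: 'wants' -> ['^wa','wan','ant','nts','ts$']."""
    marked = '^' + word + '$'
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

print(character_units('wants'))     # ['^', 'w', 'a', 'n', 't', 's', '$']
print(char_trigram_units('wants'))  # ['^wa', 'wan', 'ant', 'nts', 'ts$']
```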
Variable: composition function
• Vector addition (except for characters; a small sketch follows below)
• Bidirectional LSTMs
• Convolutional NNs
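The simplest of these, vector addition, just sums the embeddings of a word's subword units (it is not used for single characters, where summing would discard their order). A minimal sketch, assuming a lookup table of subword embeddings; the bi-LSTM and CNN alternatives are the encoders sketched earlier.

```python
import numpy as np

def compose_by_addition(units, unit_emb):
    """Word vector = sum of the vectors of its subword units.

    units    : list of subword unit strings, e.g. ['^wa', 'wan', 'ant', 'nts', 'ts$']
    unit_emb : dict mapping unit string -> np.ndarray embedding
    """
    return np.sum([unit_emb[u] for u in units], axis=0)
```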
Variable: language typology
Fusional (English): read-s   read-3SG.PRES   'reads'
Agglutinative (Turkish): oku-r-sa-m   read-AOR.COND.1SG   'If I read …'
Root & pattern (Arabic): k(a)t(a)b(a)   write-PST.3SG.M   'he wrote'
Reduplication (Indonesian): anak~anak   child-PL   'children'
Summary of perplexity: use bi-LSTMs over character trigrams

Language   | word  | character        | char-trigrams    | BPE              | Morfessor        | %imp
           |       | bi-LSTM | CNN    | add    | bi-LSTM | add    | bi-LSTM | add    | bi-LSTM |
Czech      | 41.46 | 34.25   | 36.6   | 42.73  | 33.59   | 49.96  | 33.74   | 47.74  | 36.87   | 18.98
English    | 46.4  | 43.53   | 44.67  | 45.41  | 42.97   | 47.51  | 43.3    | 49.72  | 49.72   | 7.39
Russian    | 34.93 | 28.44   | 29.47  | 35.15  | 27.72   | 40.1   | 28.52   | 39.6   | 31.31   | 20.64
Finnish    | 24.21 | 20.05   | 20.29  | 24.89  | 18.62   | 26.77  | 19.08   | 27.79  | 22.45   | 23.09
Japanese   | 98.14 | 98.14   | 91.63  | 101.99 | 101.09  | 126.53 | 96.8    | 111.97 | 99.23   | 6.63
Turkish    | 66.97 | 54.46   | 55.07  | 50.07  | 54.23   | 59.49  | 57.32   | 62.2   | 62.7    | 25.24
Arabic     | 48.2  | 42.02   | 43.17  | 50.85  | 39.87   | 50.85  | 42.79   | 52.88  | 45.46   | 17.28
Hebrew     | 38.23 | 31.63   | 33.19  | 39.67  | 30.4    | 44.15  | 32.91   | 44.94  | 34.28   | 20.48
Indonesian | 46.07 | 45.47   | 46.6   | 58.51  | 45.96   | 59.17  | 43.37   | 59.33  | 44.86   | 5.86
Malay      | 54.67 | 53.01   | 50.56  | 68.51  | 50.74   | 68.99  | 51.21   | 68.2   | 52.5    | 7.52
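A note on the %imp column (this reading is inferred from the numbers, not stated on the slide): it matches the relative perplexity reduction of the best subword configuration over the word-level baseline, %imp = 100 · (PPL_word − PPL_best) / PPL_word. For Czech, 100 · (41.46 − 33.59) / 41.46 ≈ 18.98.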
Still lots of work to do on unsupervised morphology…

Do character-level models have the predictive utility of models with access to actual morphology?
Perplexity: no morphology (character, char-trigram) vs. actual morphology (morph. analysis)

        | character (^, r, e, a, d, s, $) | char-trigram | morph. analysis (read, VB, 3rd, SG, Present)
Czech   | 34.3                            | 33.6         | 30.1
Russian | 28.4                            | 27.7         | 26.4

Answer: NO.
[Chart: test perplexity of word, char-trigram, and char-CNN models trained on 1M, 5M, and 10M tokens, compared with a morph. analysis model trained on ~1M tokens (perplexity 28.8).]

Can we close that gap by training character-level models on far more data? NO.
How do we know that it's the morphological annotations that make the difference?
• Measure targeted perplexity: perplexity on a specific subset of words in the test data (see the sketch below).
• Analyze perplexities when the inflected words of interest (nouns and verbs) are in the most recent history:
  Green tea or white tea?
  The sushi is great, and they have a great selection.
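A minimal sketch of targeted perplexity as described above: ordinary perplexity, but averaged only over a chosen subset of test positions (here picked out by a boolean mask); the function name and masking convention are illustrative, not the paper's code.

```python
import numpy as np

def targeted_perplexity(logprobs, target_mask):
    """Perplexity restricted to a subset of test tokens.

    logprobs    : per-token natural-log probabilities log p(w_t | history)
    target_mask : boolean array, True for the tokens to evaluate
                  (e.g. tokens whose recent history contains an inflected noun or verb)
    """
    selected = logprobs[target_mask]
    return float(np.exp(-selected.mean()))

# Example: only 2 of 4 test tokens are targeted.
lp = np.log(np.array([0.25, 0.10, 0.50, 0.05]))
mask = np.array([False, True, False, True])
print(targeted_perplexity(lp, mask))  # ≈ 14.14
```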
Targeted perplexity of Czech nouns is lower when we use morphology
[Bar chart: targeted perplexity (0 to 80) of Czech nouns for word, character, char-trigram, BPE, and morph. analysis models, broken down into all, frequent, and rare nouns; morph. analysis is lowest.]
Targeted perplexity of Czech verbs is lower when we use morphology
[Bar chart: targeted perplexity (0 to 100) of Czech verbs for word, character, char-trigram, BPE, and morph. analysis models, broken down into all, frequent, and rare verbs; morph. analysis is lowest.]

Character models are good at reduplication (no oracle, though)
Percentage of full reduplication in the training data:
Language   | type-level (%) | token-level (%)
Indonesian | 1.1            | 2.6
Malay      | 1.3            | 2.9

[Bar chart: targeted perplexity (0 to 160) of reduplicated words for word, character, and BPE models, broken down into all, frequent, and rare.]
Different representations make different neighbors
Target words: frequent: man, including; rare: unconditional, hydroplane; unknown (OOV): uploading, foodism.

Model                | man                       | including                       | unconditional                                   | hydroplane                           | uploading                      | foodism
word                 | person, anyone, children  | like, featuring, include        | nazi, fairly, joints                            | molybdenum, your, imperial           | -                              | -
BPE bi-LSTM          | ii, hill, text            | called, involve, like           | unintentional, ungenerous, unanimous            | emphasize, heartbeat, hybridized     | upbeat, uprising, handling     | vigilantism, pyrethrum, pausanias
char-trigram bi-LSTM | mak, vill, cow            | include, includes, undermining  | unconstitutional, constitutional, unimolecular  | selenocysteine, guerrillas, scrofula | drifted, affected, conflicted  | tuaregs, quft, subjectivism
char bi-LSTM         | mayr, many, may           | inclusion, insularity, include  | relates, unmyelinated, uncoordinated            | hydrolyzed, hydraulics, hysterotomy  | musagte, mutualism, mutualist  | formulas, formally, fecal
char CNN             | mtn, mann, nun            | include, includes, excluding    | unconventional, unintentional, unconstitutional | hydroxyproline, hydrate, hydrangea   | unloading, loading, upgrading  | fordham, dadaism, popism
Good at frequent words: maybe they learn "word classes"?
Character NLMs learn word boundaries.
Source: Yova Kementchedjhieva, Morpho-syntactic awareness in a character-level language model, 2017 Informatics M.Sc. thesis

…and memorize POS tags
Source: Yova Kementchedjhieva, Morpho-syntactic awareness in a character-level language model, 2017 Informatics M.Sc. thesis
What do NLMs learn about morphology?
• Character-level NLMs are great! Across typologies, but especially for agglutinative morphology.
• However, they do not match the predictive accuracy of models with explicit knowledge of morphology (or POS).
• Qualitative analyses suggest that they learn orthographic similarity of affixes, and forget the meaning of root morphemes.
• More generally, they appear to memorize frequent subpatterns.
What do we know about what NNs know about language?
• Still very little.
• Evidence suggests: nothing surprising. Lots of memorization, local generalization.
• NNs are great for simplicity of specification and end-to-end learning.
• But these things are not magic! We still don't have enough data, and these models could be better if they knew about morphology.
• But how do we do that?