TRANSCRIPT
NLU lecture 6: Compositional character representations
Adam Lopez
Credits: Clara Vania · 2 Feb 2018
Let's revisit an assumption in language modeling (& word2vec).
When does this assumption make sense for language modeling?
But words are not a finite set!
• Bengio et al.: "Rare words with frequency ≤ 3 were merged into a single symbol, reducing the vocabulary size to |V| = 16,383."
• Bahdanau et al.: "we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK])."

--------------------------------------------------
Src | …
Ref | the main crop of japan is rice .
Hyp | the _UNK is popular of _UNK . _EOS
--------------------------------------------------
What if we could scale softmax to the training data vocabulary? Would that help?
SOFTMAX ALL THE WORDS
Idea: scale by partitioning
• Partition the vocabulary into smaller pieces.
Class-based LM: assign each word w_i to a class c_i, then

  p(w_i | h_i) = p(c_i | h_i) · p(w_i | c_i, h_i)

(a small sketch follows below)
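A minimal sketch of this factorization, assuming plain NumPy and a single flat class assignment (the helper names are hypothetical, not the lecture's code). The point is that normalization is only over the classes and over the words inside one class, never over the full vocabulary.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def class_factored_logprob(h, w, class_of, words_in_class, W_class, W_word):
    """log p(w | h) = log p(c | h) + log p(w | c, h) for a class-based LM.

    h              : hidden state / history representation, shape (d,)
    w              : index of the target word
    class_of       : array mapping word index -> class index
    words_in_class : dict mapping class index -> array of word indices in that class
    W_class        : class output embeddings, shape (num_classes, d)
    W_word         : word output embeddings, shape (|V|, d)
    """
    c = class_of[w]
    p_class = softmax(W_class @ h)                   # normalize over classes only
    members = words_in_class[c]
    p_word_in_class = softmax(W_word[members] @ h)   # normalize within the class only
    w_pos = int(np.where(members == w)[0][0])
    return np.log(p_class[c]) + np.log(p_word_in_class[w_pos])
```

With roughly √|V| classes of equal size, each probability costs O(√|V|) instead of O(|V|).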
• Partition the vocabulary into smaller pieces hierarchically (hierarchical softmax).
Brown clustering: hard clustering based on mutual information.
Idea: scale by partitioning
• Differentiated softmax: assign more parameters to more frequent words, fewer to less frequent words.
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
Partitioning helps
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015

Partitioning helps… but could be better: noise contrastive estimation
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
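Noise contrastive estimation sidesteps the softmax normalization by training the model to distinguish the observed word from k samples drawn from a noise distribution q. The sketch below shows the standard per-step NCE objective; it is a generic illustration (scores and noise probabilities are passed in), not the specific setup evaluated by Chen et al.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(score_true, scores_noise, logq_true, logq_noise, k):
    """Negative NCE log-likelihood for one observed word and k noise samples.

    score_true   : unnormalized model score s(w, h) of the observed word
    scores_noise : scores s(w', h) for k words sampled from the noise distribution q
    logq_true    : log q(w) for the observed word
    logq_noise   : log q(w') for each noise sample
    """
    # P(the observed word came from the data) = sigmoid(s - log(k * q))
    pos = np.log(sigmoid(score_true - np.log(k) - logq_true))
    # Noise samples should be classified as noise.
    neg = np.sum(np.log(sigmoid(-(scores_noise - np.log(k) - logq_noise))))
    return -(pos + neg)
```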
Partitioning helps… but could be better: skip the normalization step altogether
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015

Partitioning helps… but could be better: room for improvement
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
V is not finite
• Practical problem: softmax computation is linear in vocabulary size.
• Theorem. The vocabulary of word types is infinite.
  Proof 1: productive morphology, loanwords, "fleek".
  Proof 2: 1, 2, 3, 4, …
What set is finite? Characters. More precisely, Unicode code points.
Are you sure?!
Not all characters are the same, because not all languages have alphabets. Some have syllabaries (e.g. Japanese kana) and/or logographies (Chinese hànzì).
Rather than look up word representations…
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015

Compose character representations into word representations with LSTMs
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015
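A minimal sketch of this composition in PyTorch (the toolkit is an assumption; module names and sizes are illustrative, not Ling et al.'s code): embed each character, run a bidirectional LSTM over the character sequence, and concatenate the final forward and backward states to get the word vector.

```python
import torch
import torch.nn as nn

class CharBiLSTMWordEncoder(nn.Module):
    """Compose character embeddings into a word vector with a bi-LSTM
    (in the spirit of Ling et al., 2015)."""

    def __init__(self, num_chars, char_dim=50, word_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        # Half the word dimension per direction, so the concatenation has word_dim.
        self.bilstm = nn.LSTM(char_dim, word_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) character indices of each word
        # (padding handling omitted for brevity).
        x = self.char_emb(char_ids)                   # (batch, word_len, char_dim)
        _, (h_n, _) = self.bilstm(x)                  # h_n: (2, batch, word_dim // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)    # (batch, word_dim)
```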
Compose character representations into word representations with CNNs
Source: Character-aware neural language models, Kim et al. 2015
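And a comparable sketch of the CNN composition (in the spirit of Kim et al.'s character-aware LM; again, PyTorch, the filter widths, and the sizes are assumptions): convolve filters of several widths over the character embeddings and max-pool each filter over time.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Compose character embeddings into a word vector with a CNN + max-over-time pooling."""

    def __init__(self, num_chars, char_dim=15, filter_widths=(2, 3, 4), filters_per_width=64):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, filters_per_width, kernel_size=w) for w in filter_widths]
        )

    def forward(self, char_ids):
        # char_ids: (batch, word_len); word_len must be at least max(filter_widths)
        x = self.char_emb(char_ids).transpose(1, 2)        # (batch, char_dim, word_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values    # max over time for each filter
                  for conv in self.convs]
        return torch.cat(pooled, dim=-1)                   # (batch, len(widths) * filters_per_width)
```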
Character models actually work. Train them long enough, they generate words
Source: Finding function in form: compositional character models for open vocabulary word representation, Ling et al. 2015

anterest, artifactive, capacited, capitaling, compensive, dermitories, despertator, dividement, extremilated, faxemary, follect, hamburgo, identimity, ipoteca, nightmale, orience, patholicism, pinguenas, sammitment, tasteman, understrumental, wisholver

"Wow, the disconversated vocabulations of their system are fantastics!"
— Sharon Goldwater
How good are character-level NLP models?
Implied(?): character-level neural models learn everything they need to know about language.

Word embeddings have obvious limitations:
• Closed vocabulary assumption
• Cannot exploit functional relationships in learning
And we know a lot about linguistic structure
Morpheme: the smallest meaningful unit of language
"loves" → love + s
  root/stem: love
  affix: -s
  morph. analysis: 3rd.SG.PRES
The ratio of morphemes to words varies by language
Analytic languages: one morpheme per word. Synthetic languages: many morphemes per word.
(Spectrum from analytic to synthetic: Vietnamese, English, Turkish, West Greenlandic.)
Morphology can change the syntax or semantics of a word
"love" (VB)
  Inflectional morphology: love (VB), loves (VB), loving (VB), loved (VB)
  Derivational morphology: lover (NN), lovely (ADJ), lovable (ADJ)
Morphemes can represent one or more features
Fusional languages: many features per morpheme
  (English) read-s   read-3SG.PRES   'reads'
Agglutinative languages: one feature per morpheme
  (Turkish) oku-r-sa-m   read-AOR.COND.1SG   'If I read …'
Words can have more than one stem
Affixation: one stem per word
  (English) studying = study + ing
Compounding: many stems per word
  (German) Rettungshubschraubernotlandeplatz = Rettung + s + hubschrauber + not + lande + platz
           = rescue + LNK + helicopter + emergency + landing + place
           'Rescue helicopter emergency landing pad'
Inflection is not limited to affixation
Base modification (English): drink, drank, drunk
Root & pattern (Arabic): k(a)t(a)b(a)   write-PST.3SG.M   'he wrote'
Reduplication (Indonesian): kemerah~merahan   red-ADJ   'reddish'
There are many different ways to compute word representations from subwords
Basic units of representation:
• Characters (Ling et al., 2015; Kim et al., 2016; Lee et al., 2016)
• Character n-grams (Sperr et al., 2013; Wieting et al., 2016; Bojanowski et al., 2016)
• Morphemes (Luong et al., 2013; Botha & Blunsom, 2014; Sennrich et al., 2016)
• Morphological analysis (Cotterell & Schütze, 2015; Kann & Schütze, 2016)
Compositional function: addition, bidirectional LSTMs, convolutional NNs, …
We've reviewed morphology, so we have some questions about character models:
• How do representations based on morphemes compare with those based on characters?
• What is the best way to compose subword representations?
• Do character-level models have the same predictive utility as models with knowledge of morphology?
• How do different representations interact with languages of different morphological typologies?
Prediction problem: neural language modeling
What we vary: the representation of the (open-vocabulary) history. Prediction stays closed-vocabulary (a word-level softmax).
Open-vocabulary prediction is interesting, but our goal is to understand representations, not build a better neural LM.
Variable: subword unit (example word: "wants")

Unit         | Examples
Morfessor    | ^want, s$
BPE          | ^w, ants$
char-trigram | ^wa, wan, ant, nts, ts$
character    | ^, w, a, n, t, s, $
analysis     | want+VB, +3rd, +SG, +Pres

The first four rows are approximations to morphology; the analysis row uses annotated morphology.
The last row is part of an oracle experiment: suppose you had an oracle that could tell you the true morphology. In this case, the oracle is a human annotator.
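To make the units concrete, here is a small sketch of how the character and char-trigram rows of the table can be produced for "wants", with ^ and $ as word-boundary markers (Morfessor, BPE, and morphological analysis need trained segmenters or human annotation, so they are not reproduced here):

```python
def character_units(word):
    """Character units with boundary markers: 'wants' -> ['^','w','a','n','t','s','$']."""
    return ['^'] + list(word) + ['$']

def char_trigram_units(word):
    """Overlapping character trigrams over the marked word: 'wants' -> ['^wa','wan','ant','nts','ts$']."""
    marked = '^' + word + '$'
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

print(character_units('wants'))     # ['^', 'w', 'a', 'n', 't', 's', '$']
print(char_trigram_units('wants'))  # ['^wa', 'wan', 'ant', 'nts', 'ts$']
```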
Variable: composition function
• Vector addition (except for characters; a small sketch follows below)
• Bidirectional LSTMs
• Convolutional NNs
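The simplest of these, vector addition, just sums the embeddings of a word's subword units (it is not used for single characters, where summing would discard their order). A minimal sketch, assuming a lookup table of subword embeddings; the bi-LSTM and CNN alternatives are the encoders sketched earlier.

```python
import numpy as np

def compose_by_addition(units, unit_emb):
    """Word vector = sum of the vectors of its subword units.

    units    : list of subword unit strings, e.g. ['^wa', 'wan', 'ant', 'nts', 'ts$']
    unit_emb : dict mapping unit string -> np.ndarray embedding
    """
    return np.sum([unit_emb[u] for u in units], axis=0)
```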
Variable: language typology
Fusional (English): read-s   read-3SG.PRES   'reads'
Agglutinative (Turkish): oku-r-sa-m   read-AOR.COND.1SG   'If I read …'
Root & pattern (Arabic): k(a)t(a)b(a)   write-PST.3SG.M   'he wrote'
Reduplication (Indonesian): anak~anak   child-PL   'children'
Summary of perplexity: use bi-LSTMs over character trigrams

Language   | word  | character        | char-trigrams    | BPE              | Morfessor        | %imp
           |       | bi-LSTM | CNN    | add    | bi-LSTM | add    | bi-LSTM | add    | bi-LSTM |
Czech      | 41.46 | 34.25   | 36.6   | 42.73  | 33.59   | 49.96  | 33.74   | 47.74  | 36.87   | 18.98
English    | 46.4  | 43.53   | 44.67  | 45.41  | 42.97   | 47.51  | 43.3    | 49.72  | 49.72   | 7.39
Russian    | 34.93 | 28.44   | 29.47  | 35.15  | 27.72   | 40.1   | 28.52   | 39.6   | 31.31   | 20.64
Finnish    | 24.21 | 20.05   | 20.29  | 24.89  | 18.62   | 26.77  | 19.08   | 27.79  | 22.45   | 23.09
Japanese   | 98.14 | 98.14   | 91.63  | 101.99 | 101.09  | 126.53 | 96.8    | 111.97 | 99.23   | 6.63
Turkish    | 66.97 | 54.46   | 55.07  | 50.07  | 54.23   | 59.49  | 57.32   | 62.2   | 62.7    | 25.24
Arabic     | 48.2  | 42.02   | 43.17  | 50.85  | 39.87   | 50.85  | 42.79   | 52.88  | 45.46   | 17.28
Hebrew     | 38.23 | 31.63   | 33.19  | 39.67  | 30.4    | 44.15  | 32.91   | 44.94  | 34.28   | 20.48
Indonesian | 46.07 | 45.47   | 46.6   | 58.51  | 45.96   | 59.17  | 43.37   | 59.33  | 44.86   | 5.86
Malay      | 54.67 | 53.01   | 50.56  | 68.51  | 50.74   | 68.99  | 51.21   | 68.2   | 52.5    | 7.52
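A note on the %imp column (this reading is inferred from the numbers, not stated on the slide): it matches the relative perplexity reduction of the best subword configuration over the word-level baseline, %imp = 100 · (PPL_word − PPL_best) / PPL_word. For Czech, 100 · (41.46 − 33.59) / 41.46 ≈ 18.98.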
Still lots of work to do on unsupervised morphology…

Do character-level models have the predictive utility of models with access to actual morphology?
Perplexity: no morphology (character, char-trigram) vs. actual morphology (morph. analysis)

        | character (^, r, e, a, d, s, $) | char-trigram | morph. analysis (read, VB, 3rd, SG, Present)
Czech   | 34.3                            | 33.6         | 30.1
Russian | 28.4                            | 27.7         | 26.4

Answer: NO.
[Chart: test perplexity of word, char-trigram, and char-CNN models trained on 1M, 5M, and 10M tokens, compared with a morph. analysis model trained on ~1M tokens (perplexity 28.8).]

Can we close that gap by training character-level models on far more data? NO.
How do we know that it's the morphological annotations that make the difference?
• Measure targeted perplexity: perplexity on a specific subset of words in the test data (see the sketch below).
• Analyze perplexities when the inflected words of interest (nouns and verbs) are in the most recent history:
  Green tea or white tea?
  The sushi is great, and they have a great selection.
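A minimal sketch of targeted perplexity as described above: ordinary perplexity, but averaged only over a chosen subset of test positions (here picked out by a boolean mask); the function name and masking convention are illustrative, not the paper's code.

```python
import numpy as np

def targeted_perplexity(logprobs, target_mask):
    """Perplexity restricted to a subset of test tokens.

    logprobs    : per-token natural-log probabilities log p(w_t | history)
    target_mask : boolean array, True for the tokens to evaluate
                  (e.g. tokens whose recent history contains an inflected noun or verb)
    """
    selected = logprobs[target_mask]
    return float(np.exp(-selected.mean()))

# Example: only 2 of 4 test tokens are targeted.
lp = np.log(np.array([0.25, 0.10, 0.50, 0.05]))
mask = np.array([False, True, False, True])
print(targeted_perplexity(lp, mask))  # ≈ 14.14
```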
Targeted perplexity of Czech nouns is lower when we use morphology
[Bar chart: targeted perplexity (0 to 80) of Czech nouns for word, character, char-trigram, BPE, and morph. analysis models, broken down into all, frequent, and rare nouns; morph. analysis is lowest.]
Targeted perplexity of Czech verbs is lower when we use morphology
[Bar chart: targeted perplexity (0 to 100) of Czech verbs for word, character, char-trigram, BPE, and morph. analysis models, broken down into all, frequent, and rare verbs; morph. analysis is lowest.]

Character models are good at reduplication (no oracle, though)
Percentage of full reduplication in the training data:
Language   | type-level (%) | token-level (%)
Indonesian | 1.1            | 2.6
Malay      | 1.3            | 2.9

[Bar chart: targeted perplexity (0 to 160) of reduplicated words for word, character, and BPE models, broken down into all, frequent, and rare.]
Different representations make different neighbors
Target words: frequent: man, including; rare: unconditional, hydroplane; unknown (OOV): uploading, foodism.

Model                | man                       | including                       | unconditional                                   | hydroplane                           | uploading                      | foodism
word                 | person, anyone, children  | like, featuring, include        | nazi, fairly, joints                            | molybdenum, your, imperial           | -                              | -
BPE bi-LSTM          | ii, hill, text            | called, involve, like           | unintentional, ungenerous, unanimous            | emphasize, heartbeat, hybridized     | upbeat, uprising, handling     | vigilantism, pyrethrum, pausanias
char-trigram bi-LSTM | mak, vill, cow            | include, includes, undermining  | unconstitutional, constitutional, unimolecular  | selenocysteine, guerrillas, scrofula | drifted, affected, conflicted  | tuaregs, quft, subjectivism
char bi-LSTM         | mayr, many, may           | inclusion, insularity, include  | relates, unmyelinated, uncoordinated            | hydrolyzed, hydraulics, hysterotomy  | musagte, mutualism, mutualist  | formulas, formally, fecal
char CNN             | mtn, mann, nun            | include, includes, excluding    | unconventional, unintentional, unconstitutional | hydroxyproline, hydrate, hydrangea   | unloading, loading, upgrading  | fordham, dadaism, popism
Good at frequent words: maybe they learn "word classes"?
Character NLMs learn word boundaries.
Source: Yova Kementchedjhieva, Morpho-syntactic awareness in a character-level language model, 2017 Informatics M.Sc. thesis

…and memorize POS tags
Source: Yova Kementchedjhieva, Morpho-syntactic awareness in a character-level language model, 2017 Informatics M.Sc. thesis
What do NLMs learn about morphology?
• Character-level NLMs are great! Across typologies, but especially for agglutinative morphology.
• However, they do not match the predictive accuracy of models with explicit knowledge of morphology (or POS).
• Qualitative analyses suggest that they learn orthographic similarity of affixes, and forget the meaning of root morphemes.
• More generally, they appear to memorize frequent subpatterns.
What do we know about what NNs know about language?
• Still very little.
• Evidence suggests: nothing surprising. Lots of memorization, local generalization.
• NNs are great for simplicity of specification and end-to-end learning.
• But these things are not magic! We still don't have enough data, and these models could be better if they knew about morphology.
• But how do we do that?