Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
HELSINKI UNIVERSITY OF TECHNOLOGY
NEURAL NETWORKS RESEARCH CENTRE
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
{ Mathias.Creutz, Krista.Lagus }@hut.fi
International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
Espoo, 17 June 2005
kahvi + n + juo + ja + lle + kin
nyky + ratkaisu + i + sta + mme
tietä + isi + mme + kö + hän
open + mind + ed + ness
un + believ + able

Challenge for NLP: too many words
• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
• Huge number of different possible word forms
• Important to know the inner structure of words
• The number of morphemes per word varies greatly

Goal
• Learn representations of
– the smallest individually meaningful units of language (morphemes)
– and their interaction
– in an unsupervised and data-driven manner from raw text
– making as general and language-independent assumptions as possible.
Morfessor

State of the art
• Rule-based systems
– accurate, language-dependent, adaptivity issues
• Unsupervised word segmentation (Morfessor Baseline)
– sentences can be of different length
– context-insensitive, hence poor modeling of syntax:
  • undersegmentation of frequent strings (“forthepurposeof”)
  • oversegmentation of rare strings (“in + s + an + e”)
  • no syntactic / morphotactic constraints (“s + can”)

State of the art (cont’d)
• Morphology learning
– Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
– Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
– Learning of paradigms (e.g., John Goldsmith’s Linguistica)
[Figure: example paradigm in which the stems believ, hop, liv, mov, us combine with the suffixes e, ed, es, ing]
Very restricted syntax / morphotactics in terms of number of morphemes per word form!
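
The paradigm example on this slide can be read as a set of stems that all combine with the same set of suffixes. The sketch below expands such a paradigm into word forms. It is only an illustration of the idea, not Linguistica's algorithm: the function name `expand_signature` is hypothetical, and the split into stems (believ, hop, liv, mov, us) and suffixes (e, ed, es, ing) is an interpretation of the slide's figure.

```python
from itertools import product


def expand_signature(stems, suffixes):
    """Return every stem + suffix combination licensed by one paradigm."""
    return [stem + suffix for stem, suffix in product(stems, suffixes)]


# Stems and suffixes as read from the slide's figure (an interpretation).
stems = ["believ", "hop", "liv", "mov", "us"]
suffixes = ["e", "ed", "es", "ing"]

word_forms = expand_signature(stems, suffixes)
print(len(word_forms))   # 5 stems x 4 suffixes
print(word_forms[:4])
```

Each word form here has exactly one stem and one suffix, which is the restriction the slide points out: a word such as kahvi + n + juo + ja + lle + kin does not fit this two-slot pattern.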

Morfessor with morpheme categories
• Lexicon / Grammar dualism
– Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
– Morph sequences (words) are generated by a Hidden Markov model:
[Figure: HMM generating the morph sequence # over simpl ific ation s #. Transition probs, e.g., P(STM | PRE), P(SUF | SUF). Emission probs, e.g., P(’over’ | PRE), P(’s’ | SUF).]
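
The two ingredients on this slide, the regular expression over categories and the HMM over morphs, can be sketched as follows. This is a minimal illustration under stated assumptions: the one-letter encoding of categories, the function names, and every probability value in the toy tables are invented for the example and do not come from the presentation.

```python
import math
import re

# word = ( prefix* stem suffix* )+ with one letter per category.
# The letter codes are an assumption made for this sketch.
CATEGORY_CODE = {"PRE": "P", "STM": "S", "SUF": "X"}
WORD_PATTERN = re.compile(r"(?:P*SX*)+")


def is_valid_tag_sequence(tags):
    """Check a category sequence against the word-structure regex."""
    encoded = "".join(CATEGORY_CODE[tag] for tag in tags)
    return WORD_PATTERN.fullmatch(encoded) is not None


def sequence_log_prob(morphs, tags, transitions, emissions):
    """Log-probability of a tagged morph sequence under a first-order HMM.

    '#' marks the word boundary at both ends of the sequence.
    """
    log_p = 0.0
    previous = "#"
    for morph, tag in zip(morphs, tags):
        log_p += math.log(transitions[(previous, tag)])  # P(tag | previous)
        log_p += math.log(emissions[(tag, morph)])       # P(morph | tag)
        previous = tag
    log_p += math.log(transitions[(previous, "#")])      # end of word
    return log_p


# Toy probabilities, invented for illustration only.
transitions = {("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.5,
               ("SUF", "SUF"): 0.3, ("SUF", "#"): 0.6}
emissions = {("PRE", "over"): 0.1, ("STM", "simpl"): 0.01,
             ("SUF", "ific"): 0.02, ("SUF", "ation"): 0.05, ("SUF", "s"): 0.2}

morphs = ["over", "simpl", "ific", "ation", "s"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]

print(is_valid_tag_sequence(tags))
print(sequence_log_prob(morphs, tags, transitions, emissions))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for words with many morphs.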

Lexicon
Each morph is stored with its “Meaning” (frequency, right perplexity, left perplexity) and its “Form” (length, string):

| Frequency | Right perplexity | Left perplexity | Length | String |
|---|---|---|---|---|
| 14029 | 136 | 1 | 4 | over |
| 41 | 4 | 1 | 5 | simpl |
| 17259 | 1 | 4618 | 1 | s |
| ... | | | | |

How meaning affects morphotactic role
• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
• Assume asymmetries between the categories:
[Figure: three plots with values between 0 and 1. Suffix-likeness as a function of left perplexity; prefix-likeness as a function of right perplexity; stem-likeness as a function of morph length.]
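
The three likeness measures can be sketched as functions that map a morph's stored features to a score between 0 and 1. The slide does not state the functional form, so the logistic (sigmoid) curve used below is an assumption, and the `threshold` and `slope` values are hypothetical parameters chosen only to make the example concrete.

```python
import math


def sigmoid_likeness(value, threshold, slope):
    """Map a feature value to a score in (0, 1) with a logistic curve.

    The logistic form is an assumption; the slide only shows the plots.
    """
    return 1.0 / (1.0 + math.exp(-slope * (value - threshold)))


def prefix_likeness(right_perplexity, threshold=10.0, slope=0.5):
    # Many different continuations to the right suggest a prefix.
    return sigmoid_likeness(right_perplexity, threshold, slope)


def suffix_likeness(left_perplexity, threshold=10.0, slope=0.5):
    # Many different contexts to the left suggest a suffix.
    return sigmoid_likeness(left_perplexity, threshold, slope)


def stem_likeness(length, threshold=3.0, slope=2.0):
    # Longer morphs are more likely to be stems.
    return sigmoid_likeness(length, threshold, slope)


# Feature values for 'over' as listed in the lexicon slide:
# right perplexity 136, left perplexity 1, length 4.
print(prefix_likeness(136), suffix_likeness(1), stem_likeness(4))
```

With these illustrative parameters, 'over' scores high on prefix-likeness and low on suffix-likeness, which matches the asymmetry the slide describes.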

How meaning affects role (cont’d)
• There is an additional non-morpheme category for cases where none of the proper classes is likely:

$$P(\mathrm{NON} \mid \text{'over'}) = [1 - \mathrm{Prefixlike}(\text{'over'})] \cdot [1 - \mathrm{Stemlike}(\text{'over'})] \cdot [1 - \mathrm{Suffixlike}(\text{'over'})]$$

• Distribute remaining probability mass proportionally, e.g.,

$$P(\mathrm{PRE} \mid \text{'over'}) = \frac{\mathrm{Prefixlike}(\text{'over'})^{q} \cdot [1 - P(\mathrm{NON} \mid \text{'over'})]}{\mathrm{Prefixlike}(\text{'over'})^{q} + \mathrm{Stemlike}(\text{'over'})^{q} + \mathrm{Suffixlike}(\text{'over'})^{q}}$$
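
The two formulas on this slide translate directly into a small function. The sketch below is an illustration under assumptions: the function name `category_priors` is hypothetical, the value of the exponent `q` is not given on the slide (2.0 is an arbitrary choice), and the likeness scores passed in are made-up inputs.

```python
def category_priors(prefix_like, stem_like, suffix_like, q=2.0):
    """Return P(category | morph) for PRE, STM, SUF and NON.

    P(NON) is the product of the three (1 - likeness) terms; the rest of
    the probability mass is shared in proportion to likeness ** q.
    """
    p_non = (1.0 - prefix_like) * (1.0 - stem_like) * (1.0 - suffix_like)
    weights = {
        "PRE": prefix_like ** q,
        "STM": stem_like ** q,
        "SUF": suffix_like ** q,
    }
    total = sum(weights.values())
    priors = {"NON": p_non}
    for category, weight in weights.items():
        # If all likeness values are zero, p_non is 1 and nothing remains.
        priors[category] = weight * (1.0 - p_non) / total if total > 0 else 0.0
    return priors


# Made-up likeness scores for a prefix-like morph such as 'over'.
priors = category_priors(prefix_like=0.99, stem_like=0.5, suffix_like=0.01)
print(priors)
```

By construction the four values sum to 1, since the three proper categories share exactly the mass left over by the non-morpheme category.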

Maximum a posteriori optimization
Morfessor Categories-MAP:

$$\arg\max_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus}) = \arg\max_{\mathrm{Lexicon}} P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})$$

Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)
[Figure: the lexicon entries (over, simpl, s, ...) and the HMM morph sequence # over simpl ific ation s # from the earlier slides.]
Balance accuracy of representation of data against size of lexicon
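
The trade-off between representing the data accurately and keeping the lexicon small can be made concrete with a toy cost function, where the cost is the negative logarithm of P(Corpus | Lexicon) · P(Lexicon). The sketch below is a deliberately simplified stand-in, not the Categories-MAP model: it replaces the HMM with a unigram model over morphs and codes each lexicon string letter by letter with uniform letter probabilities. The function name `map_cost`, the `alphabet_size` parameter and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def map_cost(segmented_corpus, alphabet_size=27):
    """Return -log P(Corpus | Lexicon) - log P(Lexicon) for one segmentation.

    Simplifications (assumptions of this sketch): unigram morph
    probabilities for the corpus, and a uniform letter model plus an
    end-of-morph symbol for the lexicon.
    """
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    corpus_cost = -sum(c * math.log(c / total) for c in counts.values())
    lexicon_cost = sum((len(m) + 1) * math.log(alphabet_size) for m in counts)
    return corpus_cost + lexicon_cost


# Two candidate analyses of the same toy corpus.
unsplit = [["hand"], ["hands"], ["walk"], ["walks"], ["talk"], ["talks"]]
split = [["hand"], ["hand", "s"], ["walk"], ["walk", "s"], ["talk"], ["talk", "s"]]

best = min([unsplit, split], key=map_cost)
print(map_cost(unsplit), map_cost(split))
```

In this toy setting the split analysis pays slightly more for the corpus (more tokens) but much less for the lexicon (four short entries instead of six), so it has the lower total cost.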

Over- and undersegmentation still a problem?
• Probability of adding an entry to the lexicon:

$$P(\text{'morgana'}) = P(\mathrm{Freq} = 1) \cdot P(\mathrm{RightPpl} = 1) \cdot P(\mathrm{LeftPpl} = 1) \cdot P(\mathrm{Length} = 7) \cdot P(\text{'m'}) \cdot P(\text{'o'}) \cdot P(\text{'r'}) \cdot P(\text{'g'}) \cdot P(\text{'a'}) \cdot P(\text{'n'}) \cdot P(\text{'a'})$$

Rare strings are split into smaller parts (e.g., morgan + a)
• Probability of sequences in the corpus: # hands # vs. # hand s #
Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)

Solution: Hierarchical structures in lexicon
[Figure: tree for oppositio + kansanedustaja. oppositio splits into op + positio; kansanedustaja splits into kansan + edustaja, which split further into kansa + n and edusta + ja. Legend: non-morpheme, stem, suffix.]
• Make morphs consist of submorphs.
• Expand the tree when performing morpheme segmentation.
• Do not expand morphs consisting of non-morphemes.
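
The three rules above can be sketched as a recursive tree expansion. This is a minimal illustration, not the presented implementation: the `Morph` class and `expand` function are hypothetical names, and the category labels assigned to the nodes of the oppositio + kansanedustaja example are an interpretation of the slide's figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: List["Morph"] = field(default_factory=list)


def expand(morph):
    """Expand a morph into its finest segmentation.

    A morph is kept whole if it has no submorphs or if any of its
    submorphs is a non-morpheme.
    """
    if not morph.children:
        return [morph.string]
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]
    segments = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Tree as read from the slide; category labels are an interpretation.
oppositio = Morph("oppositio", "STM",
                  [Morph("op", "NON"), Morph("positio", "NON")])
kansan = Morph("kansan", "STM", [Morph("kansa", "STM"), Morph("n", "SUF")])
edustaja = Morph("edustaja", "STM", [Morph("edusta", "STM"), Morph("ja", "SUF")])
kansanedustaja = Morph("kansanedustaja", "STM", [kansan, edustaja])
word = Morph("oppositiokansanedustaja", "STM", [oppositio, kansanedustaja])

print(expand(word))
```

Storing frequent strings such as kansanedustaja as single lexicon entries while still recording their substructure lets the model keep them unsplit in the lexicon and recover the inner morphemes at segmentation time.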

Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs
• Covers
– 1.4 million Finnish word forms
– 120 000 English word forms
• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.

Evaluation against the Hutmegs Gold Standard
[Figure: two plots, one for Finnish and one for English, showing F-measure [%] against corpus size [1000 words], with corpus sizes from 10 up to 12000 in one panel and up to 16000 in the other. Compared methods: context-insensitive (Baseline), paradigms (Linguistica), heuristic (Categories-ML), and Categories-MAP.]
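
An F-measure for segmentation is commonly computed from the precision and recall of morph boundary positions. The sketch below shows that calculation. It is an assumption that the evaluation on this slide is boundary-based, and the function names and the two-word example are hypothetical.

```python
def boundaries(segmentation):
    """Return the set of character offsets where a word is split."""
    positions = set()
    offset = 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f_measure(predicted, gold):
    """Harmonic mean of boundary precision and recall over a word list."""
    correct = n_predicted = n_gold = 0
    for pred_word, gold_word in zip(predicted, gold):
        pred_b, gold_b = boundaries(pred_word), boundaries(gold_word)
        correct += len(pred_b & gold_b)
        n_predicted += len(pred_b)
        n_gold += len(gold_b)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical system output and gold standard for two words.
predicted = [["un", "believ", "able"], ["hands"]]
gold = [["un", "believ", "able"], ["hand", "s"]]

f_score = boundary_f_measure(predicted, gold)
print(f_score)
```

In this example both predicted boundaries are correct (precision 1) but one of the three gold boundaries is missed (recall 2/3), which is the undersegmentation pattern discussed earlier for frequent strings such as hands.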

Example segmentations

| Finnish | English |
|---|---|
| [ aarre kammio ] issa | [ accomplish es ] |
| [ aarre kammio ] on | [ accomplish ment ] |
| bahama laiset | [ beautiful ly ] |
| bahama [ saari en ] | [ insur ed ] |
| [ epä [ [ tasa paino ] inen ] ] | [ insure s ] |
| maclare n | [ insur ing ] |
| [ nais [ autoili ja ] ] a | [ [ [ photo graph ] er ] s ] |
| [ sano ttiin ] ko | [ present ly ] found |
| töhri ( mis istä ) | [ re siding ] |
| [ [ voi mme ] ko ] | [ [ un [ expect ed ] ] ly ] |

Discussion
• Possibility to extend the model
– rudimentary features used for “meaning”
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
• Already now useful in applications
– automatic speech recognition (Finnish, Turkish)

Morpho project page: http://www.cis.hut.fi/projects/morpho/
Demo 6
http://www.cis.hut.fi/projects/morpho/
Demo 7