1 csc 9010- nlp - 3: morphology, finite state transducers csc 9010 natural language processing...

Post on 19-Dec-2015

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1CSC 9010- NLP - 3: Morphology, Finite State Transducers

CSC 9010Natural Language Processing

Lecture 3: Morphology, Finite State Transducers

Paula MatuszekMary-Angela Papalaskari

Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ andJim Martin: http://www.cs.colorado.edu/~martin/csci5832.html

2CSC 9010- NLP - 3: Morphology, Finite State Transducers

Today

Elementary MorphologyComputational morphologyFinite State TransducersLexicon-only schemesRule-only schemesLab: Introduction to NLTK

3CSC 9010- NLP - 3: Morphology, Finite State Transducers

MorphologyMorphology:

The study of the way words are built up from smaller meaning units.Morphemes:

The smallest meaningful unit in the grammar of a language.Contrasts:

Derivational vs. InflectionalRegular vs. IrregularConcatinative vs. Templatic (root-and-pattern)

A useful resource:Glossary of linguistic terms by Eugene Looshttp://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

4CSC 9010- NLP - 3: Morphology, Finite State Transducers

Examples (English)

“unladylike”3 morphemes, 4 syllables

un- ‘not’lady ‘(well behaved) female adult human’-like ‘having the characteristics of’

Can’t break any of these down further without distorting the meaning of the units

“technique”1 morpheme, 2 syllables

“dogs”2 morphemes, 1 syllable

-s, a plural marker on nouns

5CSC 9010- NLP - 3: Morphology, Finite State Transducers

Morpheme DefinitionsRoot

The portion of the word that:– is common to a set of derived or inflected forms, if any, when all affixes

are removed – is not further analyzable into meaningful elements– carries the principal portion of meaning of the words

StemThe root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.

AffixA bound morpheme that is joined before, after, or within a root or stem.

Clitica morpheme that functions syntactically like a word, but does not appear as an independent phonological word

– Spanish: un beso, las aguas– English: Hal’s (genetive marker) – Proto-European: Kwe -que (Latin), te (Greek), and –ca (Sanskrit)

6CSC 9010- NLP - 3: Morphology, Finite State Transducers

Inflectional vs. Derivational

Word ClassesParts of speech: noun, verb, adjectives, etc.Word class dictates how a word combines with morphemes to form new words

Inflection:Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.

– Doesn’t change the word class– Usually produces a predictable, non-idiosyncratic change of

meaning.

Derivation:The formation of a new word or inflectable stem from another word or stem.

7CSC 9010- NLP - 3: Morphology, Finite State Transducers

Inflectional Morphology

Adds: tense, number, person, mood, aspect

Word class doesn’t changeWord serves new grammatical roleExamples

come is inflected for person and number:The pizza guy comes at noon.

las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by –s

las manzanas rojas (‘the red apples’)

8CSC 9010- NLP - 3: Morphology, Finite State Transducers

Derivational Morphology

Nominalization (formation of nouns from other parts of speech, primarily verbs in English):

computerizationappointeekillerfuzziness

Formation of adjectives (primarily from nouns) computationalcluelessEmbraceable

Diffulcult cases:building from which sense of “build”?

9CSC 9010- NLP - 3: Morphology, Finite State Transducers

Concatinative MorphologyMorpheme+Morpheme+Morpheme+…Stems: also called lemma, base form, root, lexeme

hope+ing hoping hop hopping

AffixesPrefixes: AntidisestablishmentarianismSuffixes: AntidisestablishmentarianismInfixes: hingi (borrow) – humingi (borrower) in TagalogCircumfixes: sagen (say) – gesagt (said) in German

Agglutinative Languagesuygarlaştıramadıklarımızdanmışsınızcasına (Turkish)uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casınaBehaving as if you are among those whom we could not cause to become civilized

10CSC 9010- NLP - 3: Morphology, Finite State Transducers

Templatic MorphologyRoots and Patterns

Example: Hebrew verbsRoot:

– Consists of 3 consonants CCC– Carries basic meaning

Template:– Gives the ordering of consonants and vowels– Specifies semantic information about the verb

Active, passive, middle voiceExample:

– lmd (to learn or study) CaCaC -> lamad (he studied) CiCeC -> limed (he taught) CuCaC -> lumad (he was taught)

11CSC 9010- NLP - 3: Morphology, Finite State Transducers

Nouns and Verbs (in English)

Nouns have simple inflectional morphologycatcat+s, cat+’s

Verbs have more complex morphology

12CSC 9010- NLP - 3: Morphology, Finite State Transducers

Nouns and Verbs (in English)

NounsHave simple inflectional morphologyCat/CatsMouse/Mice, Ox, Oxen, Goose, Geese

VerbsMore complex morphologyWalk/WalkedGo/Went, Fly/Flew

13CSC 9010- NLP - 3: Morphology, Finite State Transducers

Regular (English) Verbs

Morphological Form Classes Regularly Inflected Verbs

Stem walk merge try map

-s form walks merges tries maps

-ing form walking merging trying mapping

Past form or –ed participle walked merged tried mapped

14CSC 9010- NLP - 3: Morphology, Finite State Transducers

Irregular (English) Verbs

Morphological Form Classes Irregularly Inflected Verbs

Stem eat catch cut

-s form eats catches cuts

-ing form eating catching cutting

Past form ate caught cut

-ed participle eaten caught cut

15CSC 9010- NLP - 3: Morphology, Finite State Transducers

“To love” in Spanish

16CSC 9010- NLP - 3: Morphology, Finite State Transducers

Syntax and Morphology

Phrase-level agreementSubject-Verb

– John studies hard (STUDY+3SG)

Noun-Adjective– Las vacas hermosas

Sub-word phrasal structuresנויספרבש

נו+ים+ספר+ב+ש

That+in+book+PL+Poss:1PLWhich are in our books

17CSC 9010- NLP - 3: Morphology, Finite State Transducers

Phonology and Morphology

Script Limitations

Spoken English has 14 vowels– heed hid hayed head had hoed hood who’d hide

how’d taught Tut toy enough

English Alphabet has 5– Use vowel combinatios: far fair fare– Consonantal doubling (hopping vs. hoping)

18CSC 9010- NLP - 3: Morphology, Finite State Transducers

Computational MorphologyApproaches

Lexicon onlyRules onlyLexicon and Rules

– Finite-state Automata– Finite-state Transducers

SystemsWordNet’s morphyPCKimmo

– Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay

– Accurate but complex– http://www.sil.org/pckimmo/

Two-level morphology– Commercial version available from InXight Corp.

BackgroundChapter 3 of Jurafsky and MartinA short history of Two-Level Morphology

– http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

19CSC 9010- NLP - 3: Morphology, Finite State Transducers

Computational MorphologyWORD STEM (+FEATURES)*

cats cat +N +PLcat cat +N +SGcities city +N +PLgeese goose +N +PLducks (duck +N +PL) or

(duck +V +3SG)merging merge +V +PRES-PARTcaught (catch +V +PAST-PART) or

(catch +V +PAST)

20CSC 9010- NLP - 3: Morphology, Finite State Transducers

FSAs and the Lexicon

First we’ll capture the morphotacticsThe rules governing the ordering of affixes in a language.

Then we’ll add in the actual words

21CSC 9010- NLP - 3: Morphology, Finite State Transducers

Simple Rules

22CSC 9010- NLP - 3: Morphology, Finite State Transducers

Adding the Words

23CSC 9010- NLP - 3: Morphology, Finite State Transducers

Derivational Rules

24CSC 9010- NLP - 3: Morphology, Finite State Transducers

Parsing/Generation vs. Recognition

Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing)Or we have some structure and we want to produce a surface form (production/generation)

ExampleFrom “cats” to “cat +N +PL” and back

Morphological analysis

25CSC 9010- NLP - 3: Morphology, Finite State Transducers

Finite State Transducers

The simple storyAdd another tapeAdd extra symbols to the transitions

On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

26CSC 9010- NLP - 3: Morphology, Finite State Transducers

FSTs

27CSC 9010- NLP - 3: Morphology, Finite State Transducers

Transitions

c:c means read a c on one tape and write a c on the other+N:ε means read a +N symbol on one tape and write nothing on the other+PL:s means read +PL and write an s

c:c a:a t:t +N:ε +PL:s

28CSC 9010- NLP - 3: Morphology, Finite State Transducers

Ambiguity

Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.

Didn’t matter which path was actually traversed

In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result

29CSC 9010- NLP - 3: Morphology, Finite State Transducers

Ambiguity

What’s the right parse forUnionizableUnion-ize-ableUn-ion-ize-able

Each represents a valid path through the derivational morphology machine.

30CSC 9010- NLP - 3: Morphology, Finite State Transducers

Ambiguity

There are a number of ways to deal with this problem

Simply take the first output foundFind all the possible outputs (all paths) and return them all (without choosing)Bias the search so that only one or a few likely paths are explored

31CSC 9010- NLP - 3: Morphology, Finite State Transducers

The Gory Details

Of course, its not as easy as “cat +N +PL” <-> “cats”

As we saw earlier there are geese, mice and oxenBut there are also a whole host of spelling/pronunciation changes that go along with inflectional changes

Cats vs Dogs

Multi-tape machines

32CSC 9010- NLP - 3: Morphology, Finite State Transducers

Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

33CSC 9010- NLP - 3: Morphology, Finite State Transducers

Lexical to Intermediate Level

34CSC 9010- NLP - 3: Morphology, Finite State Transducers

Intermediate to Surface

The add an “e” rule as in fox^s# <-> foxes

35CSC 9010- NLP - 3: Morphology, Finite State Transducers

Foxes

36CSC 9010- NLP - 3: Morphology, Finite State Transducers

Foxes

37CSC 9010- NLP - 3: Morphology, Finite State Transducers

FST Review

FSTs allow us to take an input and deliver a structure based on itOr… take a structure and create a surface formOr take a structure and create another structure

In many applications its convenient to decompose the problem into a set of cascaded transducers where

The output of one feeds into the input of the next.We’ll see this scheme again for deeper semantic processing.

38CSC 9010- NLP - 3: Morphology, Finite State Transducers

Overall Plan

39CSC 9010- NLP - 3: Morphology, Finite State Transducers

Lexicon-only Morphology

acclaim acclaim $N$

acclaim acclaim $V+0$

acclaimed acclaim $V+ed$

acclaimed acclaim $V+en$

acclaiming acclaim $V+ing$

acclaims acclaim $N+s$

acclaims acclaim $V+s$

acclamation acclamation $N$

acclamations acclamation $N+s$

acclimate acclimate $V+0$

acclimated acclimate $V+ed$

acclimated acclimate $V+en$

acclimates acclimate $V+s$

acclimating acclimate $V+ing$

• The lexicon lists all surface level and lexical level pairs

• No rules …

• Analysis/Generation is easy

• Very large for English

• What about

•Arabic or

•Turkish or

• Chinese?

40CSC 9010- NLP - 3: Morphology, Finite State Transducers

Stemming vs Morphology

Sometimes you just need to know the stem of a word and you don’t care about the structure.In fact you may not even care if you get the right stem, as long as you get a consistent string.This is stemming… it most often shows up in IR applications

41CSC 9010- NLP - 3: Morphology, Finite State Transducers

Stemming in IR

Run a stemmer on the documents to be indexedRun a stemmer on users queriesMatch

This is basically a form of hashing

Example: Computerizationization -> -ize computerizeize -> ε computer

42CSC 9010- NLP - 3: Morphology, Finite State Transducers

Porter StemmerStep 4: Derivational Morphology I: Multiple Suffixes (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible

43CSC 9010- NLP - 3: Morphology, Finite State Transducers

Porter

No lexicon neededBasically a set of staged sets of rewrite rules that strip suffixesHandles both inflectional and derivational suffixes

Doesn’t guarantee that the resulting stem is really a stem (see first bullet)

Lack of guarantee doesn’t matter for IR

44CSC 9010- NLP - 3: Morphology, Finite State Transducers

Porter StemmerErrors of Omission

European Europeanalysis analyzesmatrices matrixnoise noisyexplain explanation

Errors of Commissionorganization organdoing doegeneralization genericnumerical numerousuniversity universe

45CSC 9010- NLP - 3: Morphology, Finite State Transducers

Soundex

You work as the Villanova telephone operator. Someone calls looking for:

Dr Papalarsky or Dr Matuzka

???????? What do you type as your query string?

46CSC 9010- NLP - 3: Morphology, Finite State Transducers

Soundex

1. Keep the first letter2. Drop non-initial occurrences of vowels, h, w

and y3. Replace the remaining letters with numbers

according to group (e.g.. b, f, p, and v -> 14. Replace strings of identical numbers with a

single number (333 -> 3)5. Drop any numbers beyond a third one

47CSC 9010- NLP - 3: Morphology, Finite State Transducers

Soundex

Effect is to map (hash) all similar sounding transcriptions to the same code.Structure your directory so that it can be accessed by code as well as by correct spellingUsed for census records, phone directories, author searches in libraries etc.

top related