dictionaries and grammar do we include all forms of a particular word, or do we include only the...

62
Dictionaries and Grammar Do we include all forms of a particular word, or do we include only the base word and derive its forms? How are the grammatical rules of a language represented? How do we represent the parts of speech that go with particular grammatical rules? Questions to Address

Upload: rolando-timbs

Post on 14-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Dictionaries and Grammar

• Do we include all forms of a particular word, or do we include only the base word and derive its forms?

• How are the grammatical rules of a language represented?

• How do we represent the parts of speech that go with particular grammatical rules?

Questions to Address

Morphology

• Morphology – The study of the patterns used to form words– E.g. inflection, derivation, and compounds

• Morpheme - Minimal meaning-bearing unit– Could be a stem or an affix

• Stem {“unthinkable” “realization” “distrust”}– The part of a word that contains the root meaning (E.g. cat)

• Affixes {-s, un-, de-, -en, -able, -ize, -hood}– a linguistic element added to a word modify the meaning– E.g.: prefix (unbuckle), suffix (buckled), infix

(absobloodylutely), and circumfix (gesagt in German for said).– Affixes can attach to other affixes (boyishness)

Definitions

Knowing Words• When we know a word, we know its

1. Phonological sound sequences2. Semantic meanings3. Morphological relationships4. Syntactic categories and proper structure of a sentence

• Morphological relationships adjust word meanings– Person Jill waits.– Number Jill carried two buckets.– Case The chair’s leg is broken.– Tense Jill is waiting there now.– Degree Jill ran faster than Jack.– Gender Jill is female– Part of Speech Jill is a proper noun

These are the kind of things we want our computers to figure out

Units of Meaning

• How many morphemes do each of the following sentences have?– “I have two cats”– “She wants to leave soon”– “He walked across the room”– “Her behavior was unbelievable”

• Free Morphemes {eye, think, run, apple}• Bound Morphemes {-able, un-, -s, -tion, -ly}

Affix Examples

• Prefixes from Karuk, a Hokan language of California [pasip] “Shoot!” [nipasip] “I shoot” [/upasip] “She/he shoots”

• Suffixes from Mende spoken in Liberia and Sierra Leone [pElE] “house” [pElEi] “the house” [mEmE] “glass” [mEmEi] “the glass”

• Infixes from Bontoc spoken in the Phillipines [fikas] “strong” [fumikas] “she is becoming strong” [fusul] “enemy” [fumusal] “she is becoming an enemy”

Turkish Morpology

Uygarlastiramadiklarimizdanmissinizcasina

Meaning: `behaving as if you are among those whom we could not civilize’

• Uygar `civilized’ + las `become’ + tir `cause’ • + ama `not able’ + dik `past’ • + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ • + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

How does the Mind Store Meanings?• Hypotheses

– Full listing: We store all words individually – Minimum redundancy: We store morphemes and how they relate

• Analysis– Determine if people understand new words based on root meanings– Observe whether children have difficulty learning exceptions– Regular form: government/govern, Irregular form: department/depart

• Evidence suggests – The mind represents words and affix meanings separately– Linguists observe that affixes were originally separate words that

speakers slur together over time

General Observations about Lexicons

• Meanings are continually changing

• Roots and Morphemes do not have to occur in a fixed position in relation to other elements.

• How many words do people know?– Shakespeare uses 15,000 words– A typical high school student knows 60,000

(learning 10 words a day from 12 months to 18 years)

• How many English words are there?– Over 300,000 words without Morphemes in 1988

Computational Morphology

• Consider all of the morphemes of the word ‘true’– true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully,

untruthfulness– Untruthfulness = un- + true + -th + -ful + -ness

• Productive morphemes– An affix that at a point in time spread rapidly through the language– Consider goose and geese versus cat and cats

• The former was an older way to indicate plurals• The latter is a more recent way that spread throughout

• If we store morpheme rules, not all words, we can – Reduce storage requirements and simplify creating entire dictionaries– More closely mimic how the mind does it– Be able to automatically understand newly encountered word forms

Speech recognition requires a language dictionaryHow many words would it contain?

Morphology Rules• There are rules used to form complex words from their roots

– ‘re-’ only precedes verbs (rerun, release, return)– ‘-s’ indicates plurals– ‘-ed’ indicates past tense

• Affix Rules– Regular: follow productive affix rules– Irregular: don’t follow productive affix rules

• Nouns– Regular: (cat, thrush), (cats, thrushes), (cat’s thrushes’)– Irregular: (mouse, ox), (mice, oxen)

Observation: More frequent words resist changes that result fromproductive affixes and take irregular forms (E.g. am, is, are).

Exceptions: A singer sings, and a writer writes. Why doesn’t a whisker whisk, a spider spid, or a finger fing?

Parsing

• Morphological parsing – Identifies stem and affixes and how they relate– Example:

• fish fish + Noun + Singular or goose + Verb• fish fish +Noun +Plural• fish fish +Verb +Singular

– Bracketing: indecipherable [in [[de [cipher]] able]]• Why do we parse?

– spell-checking: Is muncheble a real word?– Identify a word’s part-of-speech (pos)– Sentence parsing and machine translation– Identify word stems for data mining search operations– Speech recognition and text to speech

Identify components and underlying structure

Parsing Applications

• Lexicon– Create a word list– Include both stems and affixes (with the part of speech)

• Morphotactics – Models how morphemes can be affixed to a stem. – E.g., plural morpheme follows noun in English

• Orthographic rules– Defines spelling modifications during affixation– E.g. true tru in context of true truthfully

Grammatical Morphemes

• New forms are rarely added to closed morpheme classes

• Examples– prepositions at, for, by– articles a, the– conjunctions and, but, or

Morphological Parsing (stemming)

• Goal: Break the surface input into morphemes

• foxes – Fox is a noun stem– It has -es as a plural suffix

• rewrites – Write is the verb stem– It has re- as a prefix meaning to do again– It has a –s suffix indicating a continuing activity

Inflectional Morphology

• Nouns– plural marker: -s (dog + s = dogs)– possessive marker: -’s (dog + ’s = dog’s)

• Verbs– 3rd person present singular: -s (walk + s = walks)– past tense: -ed (walk + ed = walked)– progressive: -ing (walk + ing = walking)– past participle: -en or -ed (eat + en = eaten)

• Adjectives– comparative: -er (fast + er = faster)– superlative: -est (fast + est = fastest)

• In English– Meaning transformations are predictable– All inflectional affixes are suffices– Inflectional affixes are attached after any derivational (next slide) affixes

• E.g. modern + ize + s = modernizes; not modern + s + ize

Does not change the grammatical category

Concatenative and Non-concatenative • Concatenative morphology combines by concatentation

– prefixes and suffixes• Non-concatentative morphology combines in complex ways

– circumfixes and infixes– templatic morphology

• words change by internal changes to the root• E.g. (Arabic, Hebrew) ktb (write), kuttib (will have been written)

C V C C V C

k t b

u i

kuttib

Templative Example

Verbal Inflective Morphology• Verbal inflection

– Main verbs (sleep, like, fear) are relatively regularStandard morphemes: -s, ing, ed These morphemes are productive: Emails, Emailing, Emailed

– Combination with nouns for syntactical agreement I am, we are, they were

• There are exceptions– Eat (will eat, eats, eating, ate) – Catch (will catch, catches, catching, caught)– Be (will be, is, being, was)– Have (will have, has, having, had)

• General Observations about English– There are approximately 250 Irregular verbs that occur– Other languages have more complex verbal inflection rules

Nominal Inflective Morphology

• Plural forms (s or es)• Possessives (cat’s or cats’)• Regular Nouns

– Singular (cat, bush)– Plural (cats, bushes)– Possessive (cat’s bushes’)

• Irregular Nouns– Singular (mouse, ox)– Plural (mice, oxen)

Derivational Morphology

• Word stem combines with grammatical morpheme– Usually produces word of different class– Complex rules that are less productive with many exceptions– Sometimes meanings of derived terms are hard to predict (E.g. hapless)

• Examples: verbs to nouns– generalize, realize generalization, realization– Murder, spell murderer, speller

• Examples: verbs and nouns to adjectives– embrace, pity embraceable, pitiable– care, wit careless, witless

• Example: adjectives adverbs– happy happily

• More complicated to model than inflection– Less productive: science-less, concern-less, go-able, sleep-able

Derivational Morphology Examples

Level 1• Examples: ize, ization,

ity, ic, al, ity, ion, y, ate, ous, ive, ation

• Observations– Can attach to non-words

(e.g. fratern-al, paternal)– Often changes stem’s

stress and vowel quality

Level 2• Examples: hood, ness,

ly, s, ing, ish, ful, ly, less, y (adj.)

• Observations– Never precede Level 1

suffixes– Never change stress or

vowel quality– Almost always attach to

words that exists

Level 1 + Level 1: histor-ic-al, illumina-at-tion, indetermin-at-y;Level 1 + Level 2: fratern-al-ly, transform-ate-ion-less;Level 2 + Level 2: weight-less-nessBig one: antidisestablishmenterrianism (if I spelled it right)

Adjective Morphology• Standard Forms

– Big, bigger, biggest– Cool, cooler, coolest, cooly– Red, redder, reddest– Clear, clearer, clearest, clearly, unclear, unclearly– Happy, happier, happiest, happily– Unhappy, unhappier, unhappiest, unhappily– Real, unreal, really

• Exceptions: unbig, redly, realest

Identify and Classify Morphemes• In each group

– Two words have a different morphological structure– One word has a different type of suffix– One word has no suffix at all

• Perform the following tasks– 1.Isolate the suffix that two of the words share.– 2.Identify whether it is (i) free or bound; (ii) prefix, infix, suffix;

(iii) inflectional or derivational.– 3.Give its function/meaning.– 4.Identify the word that has no suffix– 5.Identify the word that has a suffix which is different from the

others in each group.

a. b. c. d.rider tresses running tablescolder melodies foundling lenssilver Bess’s handling witchesactor guess fling calculates

Computational Techniques

• Regular Grammars

• Finite State Automata

• Finite State Transducer

• Parsing – Top down and bottom up

Regular Grammars

• Grammar: Rules that define legal characters strings

• A regular grammar accepts regular expressions

• A regular expression must satisfy the following:– The grammar with no strings is regular – The grammar that accepts the empty string is regular– A single character is a regular grammar– If r1 and r2 are regular grammars, then r1 union r2, and r1

concatenated with r2 are regular grammars– If r is a regular grammar, then r* ( where * means zero or more

occurrences) is regular

Notations to Express Regular Expressions

• Conjunction: abc• Disjunction: [a-zA-Z], gupp(y|ies)• Counters: a*, a+, ?, a{5}, a{5,8}, a{5,}• Any character: a.b• Not: [^0-9]• Anchors: /^The dog\.$/

– Note: the backslash before the period is an escape character– Other escape characters include \*, \?, \n, \t, \\, \[, \], etc.

• Operators– \d equivalent to [0-9], \D equivalent to [^0-9]– \w equivalent to [a-zA-z0-9 ], \W equivalent to [^\w]– \s equivalent to [ \r\t\n\f], \S equivalent to [^s]

• Substitute one regular expression for another: s/regExp1/regExp2/

Examples of Regular Expressions

• All strings ending with two zeroes• All strings containing three consecutive zeroes• All strings that every block of five consecutive

symbols have at least two zeroes• All strings that the tenth symbol from the right is a

one • The set of all modular five numbers

Finite State Automata (FSA)

Definition: A FSA consists of1. a set of states (Σ)

2. a starting state (q0)

3. a set of final or accepting states (F Q)

4. a finite set of symbols (Q)

5. a transition function ((q,i) ) that maps QxΣ to Q. It switches from a from-state to a to-state, based on one of the valid symbols

Synonyms: Finite Automata, Finite State Machine

FSA’s recognize grammars that are regular

Recognition

• Traditionally, Turing used a tape reader to depict a FSA• Algorithm

– Begin in the start state– Examine the current input character– Consult the table– Go to a new state and update the tape pointer.– Until you run out of tape.– The machine accepts the string processing stops in a final state

Determine if the machine accepts a particular stringi.e. Is a string in the language?

Graphs and State Transition Tables• What can we can say about this machine?

– It has 5 states– At least b,a, and ! are in its alphabet– q0 is the start state– q4 is an accept state– It has 5 transitions

• Questions– Which strings does it accept? baaaa, aaabaaa, ba– Is this the only FSA that can accept this language?

An FSA only can accept regular strings. Question: Can you think of a string that is not regular?

State Transition Table

Annotated Directed Graph

Recognizer Implementation

index = beginning of tapestate = start stateDO IF transition[index, tape[index]] is empty RETURN false state = transition[index, tape[index]] index = index + 1UNTIL end of tap is reachedIF state is a final state RETURN trueELSE RETURN false

Key Points Regarding FSAs

• This algorithm is a state-space search algorithm– Implementation uses simple table lookups– Success occurs when at the end of a string, we reach a final state

• The results are always deterministic– There is one unique choice at each step– The algorithm recognizes all regular languages

• Perl, Java, etc. use a regular expression algorithm– Create a state transition table from the expression– pass the table to the FSA interpreter

• FSA algorithms– Recognizer: determines if a string is in the language– Generator: Generates all strings in the language

Non-Deterministic FSA• Deterministic: Given a state and symbol, only one transition is possible• Nondeterministic:

– Given a state and a symbol, multiple transitions are possible– Epsilon transitions: those which DO NOT examine or advance the tape

• The Nondeterministic FSA recognizes a string if:– At least one transition sequence ends at a final state– Note: all sequences DO NOT have to end at a final state– Note: String rejection occurs only when NO sequence ends at a final state

ε

Examples

Concatenation

Closure Closure

Union

Using NFSAsInput

State b a ! e

0 1 0 0 0

1 0 2 0 0

2 0 2,3 0 0

3 0 0 4 0

4 0 0 0 0

NFSA Recognition of “baaa!”

Breadth-first Recognition of “baaa!”

Nondeterministic FSA Example

b a a a ! \

q0 q1 q2 q2 q3 q4

Other FSA ExamplesDollars and Cents

Exercise: Create a FSA for the following regular expressions(0|1)*[a-f1-9]abc{5}

Non Deterministic FSA RecognizerRecognizer (index, state)

LOOP IF end of tape THEN IF state is final RETURN true

ELSE RETURN false

IF no possible transitions RETURN false IF there is only one transition state = transition[index, tape[index]] IF not an epsilon transition THEN index++ ELSE FOR each possible transition not considered result = CALL recognizer(nextState,nextIndex) IF result = true RETURN trueEND LOOPRETURN false

FSA’s and Morphology• Apply an FSA to each word in the dictionary to capture the

morphological forms.• Groups of words with common morphology can share FSAs

Building a Lexicon with a FSA

Derivational Rules

Simple Morphology Example

)()(

ly

est

er

rootsadjun

q0 q1 q2 q3

un- adj-root -er –est -ly

e From To Output

0 1 un

0 1 NULL

1 2 adj-root-list

2 3 er;est;lyStop states: q2 and q3

An Extended ExampleFrom To Output

0 1 un

0 3 NULL

1 2 adj-root-list-1

2 5 er;est;ly

3 2 adj-root-list-1

3 4 adj-root-list-2

4 5 er;est

q0q1 q2

q5

un- -er –est -ly

eq4 -er –estq3

adj-root-2

adj-root-1

adj-root-1

)()( 1

ly

est

er

rootsadjun )(2

est

errootsadj

Adj-root1: clear, happy, realAdj-root2: big, red

Representing Derivational Rules

ation ize

noun

adj

verb

adverb

er

nouns

ativeive

able

lyly

ity, ness

ful

verbs

adjectives

adverbs

Finite State Transducer (FST)• Definition: A FST is a 5-tuple consisting of

– Q: set of states {q0,q1,q2,q3,q4}– : an alphabet of complex symbols

• Each complex symbol contains two simple symbols• The first symbol is from an input alphabet i I• The second symbol is from an output alphabet o O• is in I x O, ε is the null character

– q0: a start state– F: a set of final states in Q {q4}– (q,i:o): a transition function mapping Q x to Q

• Concept: Translates and writes to a second tape

a:o

q0 q4q1 q2 q3

b:m a:o a:o !:?

Example: baaaamoooo

Transition Example

• c:c means read a c on one tape and write a c on the other

• +N:ε means read a +N symbol on one tape and write nothing on the other

• +PL:s means read +PL and write an s

c:c a:a t:t +N:ε +PL:s

On-line demos

• Finite state automata demoshttp://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html

• Finite state morphologyhttp://www.xrce.xerox.com/competencies/content-analysis/demos/english

• Some other downloadable FSA tools:http://www.research.att.com/sw/tools/fsm/

Lexicon for L0

Rule based languages

Top Down Parsing

S

NP VP

NP

Nom

NounNounDetVerbPro

flightmorningapreferI

[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]

S → NP VP, NP→Pro, Pro→I, VP→V NP, V→prefer, NP→Det Nom, Det→a, Nom→Noun Nom, Noun→morning, Noun→flight

Driven by the grammar, working down

Bottom Up ParsingDriven by the words, working up

The Grammar

0) S E $ 1)E E + T | E - T | T 2)T T * F | T / F | F 3) F num | id

The Bottom Up Parse

1)id - num * id2)F - num * id3)T - num * id4)E - num * id5)E - F * id 6)E - T * id7)E - T * F8)E - T9)E10)S correct sentence

Note: If there is no rule that applies, backtracking is necessary

Top-Down and Bottom-Up

• Top-down– Advantage: Searches only trees that are legal– Disadvantage: Tries trees that don’t match the words

• Bottom-up– Advantage: Only forms trees matching the words– Disadvantage: Tries trees that make no sense globally

• Efficient combined algorithms – Link top-down expectations with bottom-up data– Example: Top-down parsing with bottom-up filtering

Stochastic Language Models

• Problems– A Language model cannot cover all grammatical rules– Spoken language is often ungrammatical

• Solution– Constrain search space emphasizing likely word sequences– Enhance the grammar to recognize intended sentences even

when the sequence doesn't satisfy the rules

A probabilistic view of language modeling

Probabilistic Context-Free Grammars (PCFG)

• Definition: G = (VN, VT, S, P, p);

VN = non-terminal set of symbols

VT = terminal set of symbols

S = start symbol

p = set of rule probabilities

R = set of rules

P(S ->W |G): S is the start symbol, W = expression in grammar G

• Training the Grammar: Count rule occurrences in a training corpusP(R | G) = Count(R) / ∑C(R)

Goal: Assist in discriminating among competing choices

PFSA (Probabilistic Finite State Automata)

• A PFSA is a type of Probabilistic Context Free Grammar– The states are the non-terminals in a production rule– The output symbols are the observed outputs– The arcs represent a context-free rule– The path through the automata represent a parse tree

• A PCFG considers state transitions and the transition path

S1

a S2

b S3

ε

S1 S2 S3

a b

Probabilistic Finite State Machines

• Probabilistic models determine weights of the transitions

• The sum of weights leaving a state total to unity

• Operations– Consider the weights to

compute the probability of a given string or most likely path.

– The machine can ‘learn’ the weights over time

Canine

Companion

Tooth

.01

.0035

.001

Another Example

Pronunciation decoding

[n iy]

Merging the machines together[n iy]

Another Example