fall 2005 lecture notes #4
![Page 1: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/1.jpg)
Fall 2005 Lecture Notes #4
EECS 595 / LING 541 / SI 661
Natural Language Processing
![Page 2: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/2.jpg)
Features and unification
![Page 3: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/3.jpg)
Introduction
• Grammatical categories have properties
• Constraint-based formalisms
• Example: this flights: agreement is difficult to handle at the level of grammatical categories
• Example: many water: count/mass nouns
• Sample rule that takes into account features: S → NP VP (but only if the number of the NP is equal to the number of the VP)
![Page 4: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/4.jpg)
Feature structures

[CAT NP
 NUMBER SINGULAR
 PERSON 3]

[CAT NP
 AGREEMENT [NUMBER SG
            PERSON 3]]

Feature paths: {x agreement number}
![Page 5: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/5.jpg)
Unification
[NUMBER SG] ⊔ [NUMBER SG] +
[NUMBER SG] ⊔ [NUMBER PL] -
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
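The four cases on this slide can be tried out with a small sketch. It assumes feature structures are represented as nested Python dicts with string atoms, that an empty dict stands for the unspecified value [], and that `None` signals failure; the function name `unify` is illustrative. It does not model reentrancy (shared substructures), which a full unification algorithm would need.

```python
FAIL = None  # sentinel returned when two structures cannot be unified

def unify(f, g):
    """Unify two feature structures (nested dicts with string atoms)."""
    # An empty structure [] is unspecified and unifies with anything.
    if f == {}:
        return g
    if g == {}:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)
        for feature, value in g.items():
            if feature in result:
                sub = unify(result[feature], value)
                if sub is FAIL:
                    return FAIL          # conflicting values for the same feature
                result[feature] = sub
            else:
                result[feature] = value  # feature only present in g
        return result
    # Two atoms (or an atom and a structure): they must be identical.
    return f if f == g else FAIL

print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))
print(unify({"NUMBER": "SG"}, {"PERSON": "3"}))
```

Under this representation, structures with disjoint features do not conflict, so the last call merges them into a single structure carrying both NUMBER and PERSON.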
![Page 6: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/6.jpg)
Agreement
• S → NP VP
  {NP AGREEMENT} = {VP AGREEMENT}
• Does this flight serve breakfast?
• Do these flights serve breakfast?
• S → Aux NP VP
  {Aux AGREEMENT} = {NP AGREEMENT}
![Page 7: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/7.jpg)
Agreement
• These flights
• This flight
• NP → Det Nominal
  {Det AGREEMENT} = {Nominal AGREEMENT}
• Verb → serve
  {Verb AGREEMENT NUMBER} = PL
• Verb → serves
  {Verb AGREEMENT NUMBER} = SG
![Page 8: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/8.jpg)
Subcategorization
• VP → Verb
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = INTRANS
• VP → Verb NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = TRANS
• VP → Verb NP NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = DITRANS
![Page 9: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/9.jpg)
Regular Expressions and Automata
![Page 10: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/10.jpg)
Regular expressions
• Searching for “woodchuck”
• Searching for “woodchucks” with an optional final “s”
• Regular expressions
• Finite-state automata (singular: automaton)
![Page 11: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/11.jpg)
Regular expressions
• Basic regular expression patterns
• Perl-based syntax (slightly different from other notations for regular expressions)
• Disjunctions [abc]
• Ranges [A-Z]
• Negations [^Ss]
• Optional characters ? and *
• Wild cards .
• Anchors ^ and $, also \b and \B
• Disjunction, grouping, and precedence |
![Page 12: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/12.jpg)
Writing correct expressions
• Exercise: write a Perl regular expression to match the English article “the”:

/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
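The progression of patterns for “the” can be checked against a few sample strings. This is a sketch in Python rather than Perl; the constructs used here behave the same way in Python's `re` module. The sample strings and the helper name `matches` are made up for illustration.

```python
import re

# The five attempts from the slide, from least to most precise.
PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

SAMPLES = ["the cat", "The cat", "other", "see the_end", "saw the cat"]

def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern in PATTERNS:
    hits = [s for s in SAMPLES if matches(pattern, s)]
    print(f"{pattern!r:40} matches {hits}")
```

The output shows why each refinement is needed: the first pattern fires inside “other”, the word-boundary version misses “the_end” because the underscore counts as a word character, and the fourth version misses a line-initial “the” because it requires a preceding character.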
![Page 13: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/13.jpg)
A more complex example
• Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\b\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
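As a rough check of the price, clock-speed, and disk-size patterns, the sketch below runs them over a made-up advertisement string. Two assumptions are worth flagging: the dollar sign is escaped so that it is read literally, and the leading `\b` of the price pattern is dropped, because no word boundary exists between a space and `$`, so a pattern starting with `\b\$` cannot match a price that follows a space. The disk pattern allows multi-digit sizes with an optional decimal part.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

# Hypothetical advertisement text, used only for illustration.
ad = "Desktop PC, 800 MHz, 40 Gb disk, only $899.99"

for name, pattern in [("price", PRICE), ("speed", SPEED), ("disk", DISK)]:
    found = pattern.search(ad)
    print(name, "->", found.group(0) if found else None)
```

The regular expressions only locate the relevant substrings; deciding whether the speed exceeds 500 MHz or the price is below $1000 still requires converting the matched digits to numbers and comparing them.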
![Page 14: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/14.jpg)
Advanced operators
![Page 15: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/15.jpg)
Substitutions and memory
• Substitutions
• Memory (\1, \2, etc. refer back to matches)
s/colour/color/
s/([0-9]+)/<\1>/
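The two substitutions can be reproduced in Python with `re.sub`. This is a sketch, not Perl: `count=1` is passed because Perl's `s///` without the `g` flag replaces only the first match, whereas `re.sub` replaces every match by default. The function names are illustrative.

```python
import re

def americanize(text):
    """Equivalent of s/colour/color/ (first occurrence only)."""
    return re.sub(r"colour", "color", text, count=1)

def bracket_number(text):
    """Equivalent of s/([0-9]+)/<\\1>/: \\1 refers back to the first group."""
    return re.sub(r"([0-9]+)", r"<\1>", text, count=1)

print(americanize("the colour red"))
print(bracket_number("the 35 boxes"))
```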
![Page 16: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/16.jpg)
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
![Page 17: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/17.jpg)
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
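The three steps can be sketched as a tiny responder. Everything beyond the slide's patterns is an assumption made for illustration: the pronoun rewrite list, the padding of the input with spaces so that the space-delimited patterns can match at the edges, the use of list order as a stand-in for the scoring in step 3, and the fallback reply.

```python
import re

# Step 1: first-person -> second-person rewrites (hypothetical, minimal list).
PRONOUN_SWAPS = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Step 2: reply rules from the slide. Step 3 is approximated by list order:
# the first rule that matches wins.
REPLY_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def reply(utterance):
    text = " " + utterance + " "  # pad so " word " patterns match at the edges
    for pattern, replacement in PRONOUN_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)  # fills in \1 from the match
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides

print(reply("Men are all alike"))
print(reply("He says I'm depressed much of the time"))
```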
![Page 18: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/18.jpg)
Finite-state automata
• Finite-state automata (FSA)
• Regular languages
• Regular expressions
![Page 19: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/19.jpg)
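Finite-state automata (machines)
Sheep language: baa!, baaa!, baaaa!, baaaaa!, ...
Regular expression: /baa+!/
[Figure: states q0, q1, q2, q3, q4 connected by transitions b (q0 to q1), a (q1 to q2), a (q2 to q3), a (self-loop on q3), and ! (q3 to q4); q4 is the final state. Legend: state, transition, final state.]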
![Page 20: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/20.jpg)
Input tape
a b a ! b
q0
![Page 21: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/21.jpg)
Finite-state automata
• Q: a finite set of N states q0, q1, … qN
• Σ: a finite input alphabet of symbols
• q0: the start state
• F: the set of final states
• δ(q,i): transition function
![Page 22: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/22.jpg)
State-transition tables
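| State \ Input | b | a | ! |
|---|---|---|---|
| 0 | 1 | ∅ | ∅ |
| 1 | ∅ | 2 | ∅ |
| 2 | ∅ | 3 | ∅ |
| 3 | ∅ | 3 | 4 |
| 4 | ∅ | ∅ | ∅ |

(∅ = no transition)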
![Page 23: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/23.jpg)
The FSM toolkit and friends
• Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
• Download:
  http://www.research.att.com/sw/tools/fsm/tech.html
  http://www.research.att.com/sw/tools/lextools/
• Tutorial available
• 4 useful parts: FSM, Lextools, GRM, Dot (separate)
  – /data2/tools/fsm-3.6/bin
  – /data2/tools/lextools/bin
  – /data2/tools/dot/bin
![Page 24: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/24.jpg)
D-RECOGNIZE

    function D-RECOGNIZE (tape, machine) returns accept or reject
      index ← Beginning of tape
      current-state ← Initial state of machine
      loop
        if End of input has been reached then
          if current-state is an accept state then
            return accept
          else
            return reject
        elsif transition-table [current-state, tape[index]] is empty then
          return reject
        else
          current-state ← transition-table [current-state, tape[index]]
          index ← index + 1
      end
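A runnable version of D-RECOGNIZE might look like the sketch below. It assumes the transition table is a dict keyed by (state, symbol) pairs, with missing keys playing the role of empty table cells, and it hard-codes the sheep-language machine for /baa+!/ as the default; the names are illustrative.

```python
# Sheep-language machine: (state, input symbol) -> next state.
# Pairs that are absent correspond to empty cells in the transition table.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False               # empty table cell: reject immediately
        state = transitions[(state, symbol)]
    return state in accept             # end of input: accept only in a final state

for word in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(word, d_recognize(word))
```

The `for` loop replaces the explicit index variable of the pseudocode; reaching the end of the loop corresponds to the "End of input has been reached" branch.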
![Page 25: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/25.jpg)
Adding a failing state
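[Figure: the sheep-language automaton (states q0 to q4, transitions b, a, a, !, with a self-loop on a) extended with a fail state qF; the transitions on a, b, and ! that are missing from the original machine lead to qF.]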
![Page 26: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/26.jpg)
Languages and automata
• Formal languages: regular languages, non-regular languages
• Deterministic vs. non-deterministic FSAs
• Epsilon (ε) transitions
![Page 27: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/27.jpg)
Using NFSAs to accept strings
• Backup: add markers at choice points, then possibly revisit underexplored markers
• Look-ahead: look ahead in input
• Parallelism: look at alternatives in parallel
![Page 28: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/28.jpg)
Using NFSAs
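| State \ Input | b | a | ! | ε |
|---|---|---|---|---|
| 0 | 1 | ∅ | ∅ | ∅ |
| 1 | ∅ | 2 | ∅ | ∅ |
| 2 | ∅ | 2,3 | ∅ | ∅ |
| 3 | ∅ | ∅ | 4 | ∅ |
| 4 | ∅ | ∅ | ∅ | ∅ |

(∅ = no transition)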
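Recognition with this non-deterministic table can be sketched as a search over (state, position) pairs kept on an agenda. The sketch assumes the table maps each (state, symbol) pair to a set of next states, that state 4 is the only final state, and that there are no ε-transitions (as in the table, where that column is empty). Popping from the end of the agenda gives depth-first search; popping from the front gives breadth-first search.

```python
# Non-deterministic sheep-language machine: state 2 has two choices on "a".
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
ACCEPT = {4}

def nd_recognize(tape, transitions=NFSA, start=0, accept=ACCEPT, depth_first=True):
    """Search the space of (state, tape position) pairs for an accepting path."""
    agenda = [(start, 0)]
    while agenda:
        state, index = agenda.pop() if depth_first else agenda.pop(0)
        if index == len(tape):
            if state in accept:
                return True
            continue                    # dead end: try the next agenda item
        for next_state in transitions.get((state, tape[index]), ()):
            agenda.append((next_state, index + 1))
    return False                        # agenda exhausted without success

for word in ["baa!", "baaa!", "ba!", "baa"]:
    print(word, nd_recognize(word), nd_recognize(word, depth_first=False))
```

The agenda plays the role of the markers in the backup strategy: each entry records a choice that has not yet been explored.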
![Page 29: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/29.jpg)
More about FSAs
• Transducers
• Equivalence of DFSAs and NFSAs
• Recognition as search: depth-first, breadth-first search
![Page 30: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/30.jpg)
Recognition using NFSAs
![Page 31: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/31.jpg)
Regular languages
• Operations on regular languages and FSAs: concatenation, closure, union
• Properties of regular languages (closed under concatenation, union, disjunction, intersection, difference, complementation, reversal, Kleene closure)
![Page 32: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/32.jpg)
An exercise
• J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
![Page 33: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/33.jpg)
Morphology and Finite-State Transducers
![Page 34: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/34.jpg)
Morphemes
• Stems, affixes
• Affixes: prefixes, suffixes, infixes: hingi (borrow) – humingi (agent) in Tagalog, circumfixes: sagen – gesagt in German
• Concatenative morphology
• Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
![Page 35: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/35.jpg)
Morphological analysis
• rewrites
• unbelievably
![Page 36: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/36.jpg)
Inflectional morphology
• Tense, number, person, mood, aspect
• Five verb forms in English
• 40+ forms in French
• Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html
• Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
![Page 37: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/37.jpg)
Derivational morphology
• Nominalization: computerization, appointee, killer, fuzziness
• Formation of adjectives: computational, embraceable, clueless
![Page 38: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/38.jpg)
Finite-state morphological parsing
• Cats: cat +N +PL
• Cat: cat +N +SG
• Cities: city +N +PL
• Geese: goose +N +PL
• Ducks: (duck +N +PL) or (duck +V +3SG)
• Merging: merge +V +PRES-PART
• Caught: (catch +V +PAST-PART) or (catch +V +PAST)
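To make the input/output behaviour of a morphological parser concrete, here is a toy sketch that reproduces the analyses above. It is a plain dictionary-and-suffix lookup, not a finite-state transducer, and the stem lists, the irregular-form table, and the function name `parse` are all assumptions chosen to cover just these examples.

```python
# Toy lexicon (hypothetical, covering only the examples on the slide).
NOUN_STEMS = {"cat", "city", "goose", "duck"}
VERB_STEMS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}

def parse(word):
    """Return every analysis of the word as 'stem +features' strings."""
    word = word.lower()
    if word in IRREGULAR:
        return list(IRREGULAR[word])
    parses = []
    if word in NOUN_STEMS:
        parses.append(f"{word} +N +SG")
    # Candidate stems for an -s ending, including the y -> ies spelling change.
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("s"):
        candidates.append(word[:-1])
    for stem in candidates:
        if stem in NOUN_STEMS:
            parses.append(f"{stem} +N +PL")
        if stem in VERB_STEMS:
            parses.append(f"{stem} +V +3SG")
    # -ing ending, allowing for a deleted stem-final e (merge -> merging).
    if word.endswith("ing"):
        for stem in (word[:-3], word[:-3] + "e"):
            if stem in VERB_STEMS:
                parses.append(f"{stem} +V +PRES-PART")
    return parses

for w in ["Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"]:
    print(w, parse(w))
```

A real system would encode the lexicon, the morphotactics, and the spelling rules as transducers rather than as ad hoc string tests.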
![Page 39: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/39.jpg)
Principles of morphological parsing
• Lexicon
• Morphotactics (e.g., plural follows noun)
• Orthography (easy → easier)
• Irregular nouns: e.g., geese, sheep, mice
• Irregular verbs: e.g., caught, ate, eaten
![Page 40: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/40.jpg)
FSA for adjectives
• Big, bigger, biggest
• Cool, cooler, coolest, coolly
• Red, redder, reddest
• Clear, clearer, clearest, clearly, unclear, unclearly
• Happy, happier, happiest, happily
• Unhappy, unhappier, unhappiest, unhappily
• What about: unbig, redly, and realest?
![Page 41: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/41.jpg)
Using FSA for recognition
• Is a string a legitimate word or not?
• Two-level morphology: lexical level + surface level (Koskenniemi 1983)
• Finite-state transducers (FST) – used for regular relations
• Inversion and composition of FST
![Page 42: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/42.jpg)
Orthographic rules
• Beg/begging
• Make/making
• Watch/watches
• Try/tries
• Panic/panicked

ε → e / {x, s, z} ^ __ s #

a → b / c __ d
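A spelling rule of this kind can be sketched with a single regular-expression substitution. The sketch assumes the first rule on the slide is the e-insertion rule ε → e / {x, s, z} ^ __ s #, where ^ marks a morpheme boundary and # marks the end of the word, and that it applies to intermediate forms such as fox^s#. The function names are illustrative.

```python
import re

def e_insertion(intermediate):
    """Insert e after a morpheme boundary that follows x, s, or z and precedes s#."""
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)

def surface(intermediate):
    """Apply the rule, then erase the boundary symbols to get the surface form."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")

for form in ["fox^s#", "cat^s#", "buzz^s#"]:
    print(form, "->", surface(form))
```

As written, the rule mentions only x, s, and z, so pairs such as watch/watches would need an extended left context.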
![Page 43: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/43.jpg)
Combining FST lexicon and rules
• Cascades of transducers: the output of one becomes the input of another
![Page 44: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/44.jpg)
Weighted Automata
![Page 45: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/45.jpg)
Phonetic symbols
• IPA
• Arpabet
• Examples
![Page 46: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/46.jpg)
Using WFST for language modeling
• Phonetic representation
• Part-of-speech tagging
![Page 47: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/47.jpg)
Word Classes and Part-of-Speech Tagging
![Page 48: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/48.jpg)
Some POS statistics
• Preposition list from COBUILD
• Single-word particles
• Conjunctions
• Pronouns
• Modal verbs
![Page 49: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/49.jpg)
Tagsets for English
• Penn Treebank
• Other tagsets (see Week 1 slides)
![Page 50: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/50.jpg)
POS ambiguity
• Degrees of ambiguity (DeRose 1988)
• Rule-based POS tagging
  – ENGTWOL (Voutilainen et al. )
  – Sample rule:
    • Adverbial-That rule (“it isn’t that odd”)
      Given input: “that”
      if
        (+1 A/ADV/QUANT);
        (+2 SENT-LIM);
        (NOT –1 SVOC/A); (not a verb like “consider”)
      then eliminate non-ADV tags
      else eliminate ADV tag
![Page 51: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/51.jpg)
Evaluating POS taggers
• Percent correct
• What is the lower bound on a system’s performance?
• What about the upper bound?
![Page 52: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/52.jpg)
Kappa
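• N: number of items (index i)
• n: number of categories (index j)
• k: number of annotators
• m_ij: number of annotators who assigned item i to category j
• when κ > .8, agreement is considered high

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

$$P(A) = \frac{1}{N k (k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^{2} - \frac{1}{k-1}$$

$$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{N k} \right)^{2}$$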
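A minimal sketch of the kappa computation for several annotators, assuming the input is an N × n table `m` in which `m[i][j]` counts the annotators who assigned item i to category j, and that every item is labeled by the same number k of annotators. P(A) is the observed agreement, P(E) the agreement expected by chance, and kappa is (P(A) − P(E)) / (1 − P(E)). The function name and the example tables are made up for illustration.

```python
def kappa(m):
    """Kappa for k annotators; m[i][j] = annotators putting item i in category j."""
    N = len(m)          # number of items
    n = len(m[0])       # number of categories
    k = sum(m[0])       # annotators per item (assumed constant across items)

    # Observed agreement P(A).
    sum_sq = sum(count * count for row in m for count in row)
    p_a = sum_sq / (N * k * (k - 1)) - 1 / (k - 1)

    # Chance agreement P(E): squared share of all labels in each category.
    p_e = sum((sum(row[j] for row in m) / (N * k)) ** 2 for j in range(n))

    return (p_a - p_e) / (1 - p_e)

# Three annotators, two categories, four items.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]   # everyone agrees on every item
mixed = [[2, 1], [1, 2], [2, 1], [1, 2]]     # a 2-vs-1 split on every item
print(kappa(perfect))
print(kappa(mixed))
```

The sketch divides by 1 − P(E), so it is undefined when all labels fall into a single category.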
![Page 53: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/53.jpg)
Midterm reading list
• Chapter 1 – Introduction
• Chapter 2 – Regular expressions and automata
• Chapter 3 – Morphology and finite-state transducers + FSM tutorial
• Chapter 8 – Word classes and POS tagging
• Chapter 9 – Context-free grammars for English
• Chapter 10 – Parsing with context-free grammars
• Chapter 11 – Features and unification
![Page 54: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/54.jpg)
Syntaxscape
• Written by Juno Suk of Lucent
• http://www.cs.columbia.edu/~radev/syntaxscape/
![Page 55: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/55.jpg)
![Page 56: Fall 2005 Lecture Notes #4](https://reader033.vdocuments.us/reader033/viewer/2022051317/56816005550346895dcf06bb/html5/thumbnails/56.jpg)
Read by yourselves
• 9.9. Spoken language syntax
• 9.10. Grammar equivalence
• 9.11. Finite-state and context-free grammars