October 2007 Natural Language Processing 1
CSA3050: Natural Language Algorithms
Words and
Finite State Machinery
Acknowledgement
Material derived from/copied from
– Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
– Richard Sproat, Lecture notes
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• The smallest meaningful element of language. When written it stands alone with a space on either side of it.
Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words
Properties of Words
• Sequence
  – characters: pollution
  – phonemes
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words
Complex Words
• Complex words have subparts, e.g. "enlargement": en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways:
  – (en + large) + ment
  – en + (large + ment)
Morphological Processes
• affixation
  – prefix
  – suffix
  – circumfix: għandi - mgħandix
  – infix: phenidine phenetidine
• other morphological processes
  – redoubling (mexa; mexxa)
  – vowel change (swim; swam)
Complex Words Formed by Concatenation
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly
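The prefix + root + suffix scheme above can be sketched in a few lines of Python. This is only an illustration: the empty string standing for "no prefix" or "no suffix" is an assumption added here, and the function name `concatenations` is hypothetical.

```python
from itertools import product

# "" stands for "no prefix" / "no suffix" (an assumption, not on the slide).
PREFIXES = ["", "dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["", "ed", "ing", "ee", "er", "ly"]

def concatenations(prefixes, roots, suffixes):
    """Return every prefix + root + suffix string."""
    return [p + r + s for p, r, s in product(prefixes, roots, suffixes)]

candidates = concatenations(PREFIXES, ROOTS, SUFFIXES)
print(len(candidates))              # 150 (5 * 5 * 6)
print("enlarge" in candidates)      # True
print("disinfected" in candidates)  # True
print("chargeed" in candidates)     # True: plain concatenation ignores spelling rules
```

Plain concatenation overgenerates ("chargeed", "uncodely"), so a realistic model would also need spelling rules and restrictions on which affixes combine with which roots.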
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration
• Regular Language; Regular Sets
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
Regular Languages
• A regular language is a language over a finite alphabet that can be constructed from the empty language, the empty string and single symbols using the following operations:
  – Set union
  – Concatenation
  – Kleene star (zero or more repetitions)
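As a rough illustration, the three operations can be tried out on small finite languages represented as Python sets of strings. The Kleene star yields an infinite set in general, so the hypothetical `star` helper below takes a `max_repeats` bound; that bound is an assumption made only so the result can be printed.

```python
def union(a, b):
    """Set union of two languages."""
    return a | b

def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def star(a, max_repeats):
    """Bounded Kleene star: strings built from 0..max_repeats pieces of a."""
    result = {""}
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, a)
        result |= layer
    return result

A = {"a"}
B = {"b"}
print(sorted(union(A, B)))   # ['a', 'b']
print(sorted(concat(A, B)))  # ['ab']
print(sorted(star(A, 3)))    # ['', 'a', 'aa', 'aaa']
```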
Some things that are regular languages
• Zero or more a’s followed by zero or more b’s
• The set of words in an English dictionary
• Dates
• URLs
• English?
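Two of the examples above can be checked with Python's `re` module. This is a sketch only: `SIMPLE_DATE` is a hypothetical, simplified DD/MM/YYYY format chosen for illustration, and it does not check month lengths or leap years.

```python
import re

# Zero or more a's followed by zero or more b's.
A_STAR_B_STAR = re.compile(r"a*b*")

# Hypothetical simplified date format DD/MM/YYYY (assumption, not from the slides).
SIMPLE_DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}")

def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None

print(in_language(A_STAR_B_STAR, "aabbb"))     # True
print(in_language(A_STAR_B_STAR, "aba"))       # False
print(in_language(SIMPLE_DATE, "24/10/2007"))  # True
print(in_language(SIMPLE_DATE, "32/10/2007"))  # False
```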
Some things that are not regular languages
• Zero or more a’s followed by exactly the same number of b’s
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all sentences of the form
  – the cat slept
  – the cat the dog bit slept
  – the cat the dog the man fed bit slept
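The first example can be made concrete. The sketch below (function name `is_anbn` is hypothetical) recognises n a's followed by exactly n b's, but it does so by counting, and n has no upper bound, which is exactly what a machine with a fixed finite number of states cannot do. The regular expression a*b* accepts a strictly larger language.

```python
import re

def is_anbn(s):
    """True if s is n a's followed by exactly n b's, for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(is_anbn("aabb"))                         # True
print(is_anbn("aab"))                          # False
print(re.fullmatch(r"a*b*", "aab") is not None)  # True: a*b* cannot enforce equal counts
```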
Some special regular languages
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language is not the same as the empty string
Some closure properties of regular languages
• Intersection
• Complementation
• Difference
• Reversal
• Power
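Two of these closure properties have short constructive proofs on deterministic automata, sketched below. The tuple layout `(states, alphabet, delta, start, finals)` and the example machines `EVEN_A` and `ENDS_B` are assumptions made for this illustration; both constructions require complete DFAs over the same alphabet.

```python
def accepts(dfa, s):
    """Run a complete DFA on s (s must only use symbols of the alphabet)."""
    states, alphabet, delta, start, finals = dfa
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q in finals

def complement(dfa):
    """Complementation: swap final and non-final states."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)

def intersection(d1, d2):
    """Intersection: product automaton that runs both DFAs in parallel."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = {(p, q) for p in s1 for q in s2}
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for p in f1 for q in f2}
    return (states, alphabet, delta, (q1, q2), finals)

ALPHABET = {"a", "b"}
# Strings with an even number of a's.
EVEN_A = ({0, 1}, ALPHABET,
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# Strings that end in b.
ENDS_B = ({0, 1}, ALPHABET,
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

BOTH = intersection(EVEN_A, ENDS_B)
print(accepts(BOTH, "aab"))               # True
print(accepts(BOTH, "ab"))                # False
print(accepts(complement(EVEN_A), "a"))   # True
```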
Characterising Classes of Set
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but similar function.
Regular Expressions
a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
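For comparison, here is how the same operators look in Python's `re` syntax. This is a sketch, not the notation of the slides: Python writes concatenation by juxtaposition without a space, and its `re` module has no intersection operator, so the hypothetical helper `in_intersection` simply tests both patterns.

```python
import re

def matches(pattern, s):
    """True if the whole string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None

def in_intersection(pattern1, pattern2, s):
    """Emulate A & B by requiring s to match both patterns."""
    return matches(pattern1, s) and matches(pattern2, s)

print(matches(r"a", "a"))             # True   (simple symbol)
print(matches(r"ab", "ab"))           # True   (concatenation)
print(matches(r"a|b", "b"))           # True   (alternation)
print(matches(r"(ab)*", "ababab"))    # True   (Kleene star)
# a*b* intersected with "strings of even length"
print(in_intersection(r"a*b*", r"(..)*", "aabb"))  # True
print(in_intersection(r"a*b*", r"(..)*", "aab"))   # False
```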
Characterising Classes of Set
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite Automata
Finite Automaton
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
  – Q is a finite set of states
  – Σ is an alphabet of symbols
  – q0 ∈ Q is a start state
  – F ⊆ Q is the set of final states
  – δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
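The quintuple translates almost directly into a data structure. In the sketch below the class name `FA`, the field names and the example machine `ENDS_IN_AB` are all assumptions made for illustration; δ is stored as a set of (q, σ, q') triples, so the machine may be non-deterministic, and acceptance is decided by tracking the set of states reachable so far.

```python
from typing import NamedTuple

class FA(NamedTuple):
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: str           # q0
    finals: frozenset    # F
    delta: frozenset     # δ as a set of (q, σ, q') triples

def accepts(fa, s):
    """True if some path labelled s leads from the start state to a final state."""
    current = {fa.start}
    for symbol in s:
        current = {q2 for (q1, sym, q2) in fa.delta
                   if q1 in current and sym == symbol}
    return bool(current & fa.finals)

# Example: strings over {a, b} that end in "ab".
ENDS_IN_AB = FA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"a", "b"}),
    start="q0",
    finals=frozenset({"q2"}),
    delta=frozenset({("q0", "a", "q0"), ("q0", "b", "q0"),
                     ("q0", "a", "q1"), ("q1", "b", "q2")}),
)

print(accepts(ENDS_IN_AB, "aab"))  # True
print(accepts(ENDS_IN_AB, "aba"))  # False
```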
Representation of FSAs: State Diagram
State Table
Mr. Kleene
Kleene’s theorem
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
http://www.cs.may.ie/~jpower/Courses/parsing/node6.html
Converting a Regular Expression to an NFA
• The NFA representing the empty string is: 1 --ε--> 2
• The NFA representing a single character is: 1 --a--> 2
Regular Expression to NFA
Diagram from Leonidas Fegaras, Univ. Texas
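The diagram itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the usual Thompson-style construction, assuming that is what the diagram showed. An NFA fragment is a hypothetical triple `(start, end, transitions)`, with the label `None` standing for ε; all function names are illustrative.

```python
from itertools import count

_fresh = count(1)  # supplies new state numbers

def symbol(a):
    """NFA for a single character a (pass None for the empty string)."""
    s, e = next(_fresh), next(_fresh)
    return (s, e, [(s, a, e)])

def concat(m, n):
    s1, e1, t1 = m
    s2, e2, t2 = n
    return (s1, e2, t1 + t2 + [(e1, None, s2)])

def union(m, n):
    s1, e1, t1 = m
    s2, e2, t2 = n
    s, e = next(_fresh), next(_fresh)
    return (s, e, t1 + t2 + [(s, None, s1), (s, None, s2),
                             (e1, None, e), (e2, None, e)])

def star(m):
    s1, e1, t1 = m
    s, e = next(_fresh), next(_fresh)
    return (s, e, t1 + [(s, None, s1), (e1, None, e),
                        (s, None, e), (e1, None, s1)])

def eps_closure(states, trans):
    """All states reachable from `states` using only ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (src, label, dst) in trans:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, s):
    start, end, trans = nfa
    current = eps_closure({start}, trans)
    for ch in s:
        moved = {dst for (src, label, dst) in trans
                 if src in current and label == ch}
        current = eps_closure(moved, trans)
    return end in current

# NFA for the regular expression (a|b)*c
NFA = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))
print(accepts(NFA, "abbac"))  # True
print(accepts(NFA, "ab"))     # False
```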
Deterministic Finite Automata
• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
• In other words, δ is a function
• Why do we care about DFAs?
• EFFICIENCY!! (recognition needs only one transition lookup per input symbol)
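A small sketch of why δ being a function helps: the recogniser keeps a single current state and does one dictionary lookup per input symbol. The machine below, which accepts strings over {a, b} ending in "ab", and the names `DELTA`, `START`, `FINALS` and `dfa_accepts` are assumptions chosen for illustration.

```python
# δ as a function: each (state, symbol) pair maps to exactly one next state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
START = "q0"
FINALS = {"q2"}

def dfa_accepts(s):
    """One lookup per symbol; s must only contain 'a' and 'b'."""
    state = START
    for symbol in s:
        state = DELTA[(state, symbol)]
    return state in FINALS

print(dfa_accepts("aab"))  # True
print(dfa_accepts("aba"))  # False
```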
Equivalence of NFAs and DFAs
Subset Construction for Determinisation
• States which are connected by an ε transition will be represented by the same state in the DFA.
• If there are multiple transitions on the same symbol, we can regard a transition as moving from a state to a set of states (i.e. the set of all states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
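The points above can be sketched as follows. The representation is an assumption made for this illustration: `delta` maps (state, symbol) to a set of states, the symbol `None` stands for ε, each DFA state is a frozenset of NFA states, and the function names are hypothetical.

```python
def eps_closure(states, delta):
    """All NFA states reachable from `states` by ε transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for dst in delta.get((q, None), set()):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def determinise(alphabet, delta, start, finals):
    """Subset construction; returns (states, delta, start, finals) of the DFA."""
    d_start = eps_closure({start}, delta)
    d_delta, todo, seen = {}, [d_start], {d_start}
    while todo:
        subset = todo.pop()
        for a in alphabet:
            moved = set()
            for q in subset:
                moved |= delta.get((q, a), set())
            target = eps_closure(moved, delta)
            d_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    d_finals = {subset for subset in seen if subset & set(finals)}
    return seen, d_delta, d_start, d_finals

# NFA for strings ending in "ab", with one ε transition from state 2 to state 3.
NFA_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, None): {3}}
states, d_delta, d_start, d_finals = determinise({"a", "b"}, NFA_DELTA, 0, {3})
print(len(states))                             # 3
print(sorted(d_delta[(frozenset({0, 1}), "b")]))  # [0, 2, 3]
```

In this example NFA states 2 and 3 always end up in the same DFA state because they are joined by an ε transition, and states 0 and 1 are merged because both are reachable from 0 on the symbol a.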
Subset construction for determinization