Finite State Automata and Tries, Sambhav Jain, IIIT Hyderabad
Finite State Automata and Tries 2
Think !!!
• How do we store a dictionary in a computer?
• How do we search for an entry in that dictionary?
– Say each word is exactly 10 characters long and each character can be any letter from ‘a-z’
E.g. aaaaaaaaaa, abcdefghij, ... etc. As a regular expression: Language = [a-z]{10}
A Simple Way
• aaaaaaaaaa
• aaaaaaaaab
• aaaaaaaaac
• ...
• zzzzzzzzzz
A Linear Sorted List of Entries
Entries to be stored = 26^10 = 1.41167096 × 10^14
Each character takes 1 byte
~ 141 TB at one byte per entry; with 10 characters per entry, about 1.4 PB
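The size of the flat list can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative and not from the slides. Note that 26^10 counts the entries, so spelling out every 10-character entry takes ten times as many bytes.

```python
ALPHABET_SIZE = 26  # letters 'a'..'z'
WORD_LENGTH = 10    # every entry is exactly 10 characters long

# Number of distinct words in the language [a-z]{10}
entries = ALPHABET_SIZE ** WORD_LENGTH

# Bytes needed if every entry is written out, at 1 byte per character
flat_bytes = entries * WORD_LENGTH

print(f"entries      = {entries:.8e}")
print(f"flat storage = {flat_bytes / 1e12:.0f} TB")
```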
Smart Way !
[Figure: rows of nodes labelled a b c d ... w x y z, one row per character position]
• Total storage = 26 × 10 = 260 bytes
• Traverse 10 nodes
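The layered idea can be sketched as follows: one level per character position, each level holding the 26 allowed letters, and a lookup that visits one level per character. The names `LEVELS` and `accepts` are illustrative assumptions, and this is only one possible encoding of the structure in the figure.

```python
import string

WORD_LENGTH = 10

# One level per character position; each level holds the 26 allowed letters.
LEVELS = [frozenset(string.ascii_lowercase) for _ in range(WORD_LENGTH)]


def accepts(word):
    """Return True if word is in [a-z]{10}, visiting one level per character."""
    if len(word) != len(LEVELS):
        return False
    return all(ch in level for ch, level in zip(word, LEVELS))


# Total letters stored across all levels: 26 per level, 10 levels.
storage_cells = sum(len(level) for level in LEVELS)
```

A lookup touches exactly 10 levels for a valid word, regardless of how many words the language contains.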
Does it work for Natural Language?
• Oxford Advanced English Learner 20th Edition
– A quarter of a million distinct English words, excluding inflections, and words from technical and regional vocabulary not covered by the OED
• After inflections? – eat, eats, eaten, eating ...
• What about after multiple inflections? – beauty, beautiful, beautifully ...
Inflectional morphology
• Deals with word forms of a root, when there is no change in lexical category.
• Each word form gives different values of features like gender, number, person, etc.
Paradigm
• For a given root, there are many word forms with different features.
• Ex. Forms of Hindi root laDakA (boy)
            Direct    Oblique
Singular    laDakA    laDake
Plural      laDake    laDakoM
Paradigm
• 'laDakoM' is plural with oblique case, given by the feature structure {num=pl, case=obl}
• 'laDake' stands for two feature structures:
– Singular oblique (Ex. laDake ne kahA ...), where oblique means 'laDake' is followed by a postposition marker
– Plural direct case (Ex. laDake Aye)
Paradigm
• Paradigms
– What operation is done on the root to obtain the word forms
– Model using pairs: (delete string, add string), where 0 stands for the empty string

        direct    oblique
sg      (0,0)     (A,e)
pl      (A,e)     (A,oM)

• List roots with the paradigms they follow:
– ghoDA follows paradigm laDakA
– charkhA follows paradigm laDakA
– laDakA follows paradigm laDakA
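The (delete string, add string) model can be sketched in a few lines. The table and the root list come from the slide; the names `PARADIGMS`, `ROOT_TO_PARADIGM` and `word_forms` are illustrative assumptions, and the empty pair is written as two empty strings.

```python
# Paradigm table: (number, case) -> (delete string, add string)
PARADIGMS = {
    "laDakA": {
        ("sg", "direct"): ("", ""),
        ("sg", "oblique"): ("A", "e"),
        ("pl", "direct"): ("A", "e"),
        ("pl", "oblique"): ("A", "oM"),
    }
}

# Roots listed with the paradigm they follow.
ROOT_TO_PARADIGM = {
    "laDakA": "laDakA",
    "ghoDA": "laDakA",
    "charkhA": "laDakA",
}


def word_forms(root):
    """Generate every word form of root by applying its paradigm's pairs."""
    forms = {}
    for features, (delete, add) in PARADIGMS[ROOT_TO_PARADIGM[root]].items():
        if not root.endswith(delete):
            raise ValueError(f"{root!r} does not end with {delete!r}")
        stem = root[: len(root) - len(delete)]  # drop the delete string
        forms[features] = stem + add            # attach the add string
    return forms
```

For example, `word_forms("ghoDA")` yields ghoDA, ghoDe, ghoDe and ghoDoM for the four cells of the table.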
[Figure: trie with two branches from the root, spelling l-a-D-a-k and k-a-p-a-D, each continuing into nodes for the endings A, I, e and o-M]
Abstracting out suffixes
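[Figure: the same trie with the shared suffix subtrees replaced by the marker #1; the branch k-a-p-a-D ends in #1, and the branch l-a-D splits into a-k-(#1), A and I]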
#1: Corresponds to paradigm for 'laDakA'
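A minimal sketch of this abstraction, assuming a nested-dict trie: only stems are stored, and the last node of each stem carries a marker pointing to a shared suffix table instead of its own suffix subtree. The helper names, the `MARK` key and the second root 'kapaDA' (read off the k-a-p-a-D branch of the figure) are assumptions for illustration.

```python
# Suffix table for paradigm "#1" (laDakA): suffix -> feature structures.
PARADIGM_SUFFIXES = {
    "#1": {
        "A": [{"num": "sg", "case": "dir"}],
        "e": [{"num": "sg", "case": "obl"}, {"num": "pl", "case": "dir"}],
        "oM": [{"num": "pl", "case": "obl"}],
    }
}

MARK = "paradigm"  # multi-character key, so it cannot clash with a letter


def insert_stem(trie, stem, paradigm_id, root):
    """Store a stem; its last node carries the paradigm marker and the root."""
    node = trie
    for ch in stem:
        node = node.setdefault(ch, {})
    node[MARK] = (paradigm_id, root)


def analyze(trie, word):
    """Return (root, features) for every stem + paradigm-suffix split of word."""
    results = []
    node = trie
    for i, ch in enumerate(word):
        if ch not in node:
            break
        node = node[ch]
        if MARK in node:
            paradigm_id, root = node[MARK]
            suffix = word[i + 1:]
            for features in PARADIGM_SUFFIXES[paradigm_id].get(suffix, []):
                results.append((root, features))
    return results


trie = {}
insert_stem(trie, "laDak", "#1", "laDakA")
insert_stem(trie, "kapaD", "#1", "kapaDA")  # assumed second root
```

The suffix table is stored once per paradigm, so adding a new root costs only the nodes of its stem.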
• Can we further optimize our search?
– Use knowledge of paradigms
– Use a suffix tree
• Store the suffix tree in main memory
• Store the rest, categorized by paradigm, on the hard disk
• Do a backward search in the suffix tree
• Identify the paradigm
• Search only in that paradigm set
• E.g. if ‘-ing’ occurs, you won't first be searching words like home, cat, god ...
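The steps above can be sketched as follows. The suffix trie is built over reversed suffixes so that it can be walked from the last character of the word, and the per-paradigm stem sets stand in for the data kept on disk. The paradigm names "verb" and "noun", the toy suffix list and all function names are hypothetical, chosen only to mirror the ‘-ing’ example.

```python
MARK = "paradigms"  # multi-character key, so it cannot clash with a letter


def build_suffix_trie(suffix_to_paradigms):
    """Index each suffix by its characters in reverse order."""
    root = {}
    for suffix, paradigms in suffix_to_paradigms.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node.setdefault(MARK, set()).update(paradigms)
    return root


def candidate_splits(suffix_trie, word):
    """Walk word backward; return (stem, suffix, paradigm) candidates."""
    node = suffix_trie
    out = [(word, "", p) for p in sorted(node.get(MARK, ()))]
    for k in range(1, len(word) + 1):
        ch = word[-k]
        if ch not in node:
            break
        node = node[ch]
        for p in sorted(node.get(MARK, ())):
            out.append((word[:-k], word[-k:], p))
    return out


def lookup(word, suffix_trie, stems_by_paradigm):
    """Search a stem only in the paradigm sets its suffix points to."""
    return [
        (stem, suffix, p)
        for stem, suffix, p in candidate_splits(suffix_trie, word)
        if stem in stems_by_paradigm[p]
    ]


SUFFIXES = {"": {"verb", "noun"}, "s": {"verb", "noun"}, "ing": {"verb"}}
STEMS = {"verb": {"eat", "walk"}, "noun": {"home", "cat", "god"}}
suffix_trie = build_suffix_trie(SUFFIXES)
```

With this toy data, looking up "eating" consults only the "verb" set for the stem "eat"; the noun stems home, cat and god are never compared against it.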
Finite State Automata
• A trie is a data structure
• An FSA is the computational approach
• Slight difference in representation
– Putting characters on edges rather than nodes
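[Figure: FSA with characters on the edges. From the start state, one path reads l-a-D-a-k and another reads k-a-p-a-D; both join a common state through edges labelled 0. From that state, edge A and edge e lead to accepting states, and edge o followed by edge M leads to an accepting state. Accepting states are drawn as (+).]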
FSA
• A deterministic finite-state machine is formally defined by:
– Q: A finite set of states (Ex.: {q0, q1, q2})
– SIGMA: A finite input alphabet (Ex.: {a, b, c})
– Start state: A state in Q, from which the machine starts (Ex.: q0)
– F: A set of accepting states, F SUBSET Q (Ex.: {q2})
– DELTA(q, i): A transition function or transition matrix, where q MEMBER Q, i MEMBER SIGMA, and DELTA(q, i) MEMBER Q
Thus, DELTA: Q x SIGMA --> Q
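The five components map directly onto a small class. This is a minimal sketch, not a reference implementation: the transition table below is invented for illustration (the slide gives only the example sets), and a missing table entry is treated as a move to a non-accepting dead state so that DELTA need not be written out in full.

```python
class DFA:
    """Deterministic finite-state machine (Q, SIGMA, start, F, DELTA)."""

    def __init__(self, states, alphabet, start, accepting, delta):
        self.states = set(states)        # Q
        self.alphabet = set(alphabet)    # SIGMA
        self.start = start               # start state, a member of Q
        self.accepting = set(accepting)  # F
        self.delta = dict(delta)         # DELTA as a table: (q, i) -> q'

    def accepts(self, string):
        """Return True if the machine ends in an accepting state."""
        q = self.start
        for i in string:
            if (q, i) not in self.delta:
                return False  # missing entry: implicit dead state
            q = self.delta[(q, i)]
        return q in self.accepting


# Example sets from the definition; the transitions are an assumption.
example = DFA(
    states={"q0", "q1", "q2"},
    alphabet={"a", "b", "c"},
    start="q0",
    accepting={"q2"},
    delta={("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "c"): "q2"},
)
```

With these transitions the machine accepts "ab" followed by any number of "c" and rejects everything else.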
RECOGNITION Problem
• Until now we have handled only the RECOGNITION problem
• If the FSA is in a final state at the end of the input string, the word EXISTS
• Otherwise it does NOT
• But we seek analyzed output
• We want the machine to tell
– Root
– Gender
– Number
– Person
– Case
– Etc ...
Finite State Transducer
An FST is like the finite state automaton defined earlier, except that each arc is labelled by a pair of symbols i:o, where
– i: symbol in the input string
– o: symbol output by the FST when the arc is taken
• Ex. arc in a finite state transducer corresponding to 'e' in 'laDake'

          e : ((+pl, +dir), (+sg, -dir))
q1 +----------------->--------------------+ q2

The pair of symbols i : o on this arc:
– i is: 'e'
– o is: '((+pl, +dir), (+sg, -dir))', i.e. plural direct or singular oblique, as in the paradigm of 'laDakA'
• Ex. Morph Analyzer: Match the input with i; if successful, go ahead and produce o in the output
• Formally: Finite state transducer
– Q: Finite set of states q0, ..., qN
– SIGMA_IN: Finite set of input symbols
– SIGMA_OUT: Finite set of output symbols
– q0: Start state (q0 IN Q)
– F: Set of final accepting states (F SUBSET Q)
– DELTA(q, i:o): For every state q, gives the set of states that can be reached from q with i in SIGMA_IN and o in SIGMA_OUT
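A minimal sketch of such a transducer used as a morph analyzer, assuming DELTA is stored as a table from (state, input symbol) to a list of (output symbol, next state) pairs. The state numbering, the output strings and the function names are illustrative assumptions; the feature values follow the paradigm table for 'laDakA' (laDake is plural direct or singular oblique).

```python
def build_example():
    """Build a toy FST for the forms of 'laDakA'; returns (delta, q0, F)."""
    delta = {}
    state = 0
    for ch in "laDak":  # stem arcs copy the input symbol to the output
        delta[(state, ch)] = [(ch, state + 1)]
        state += 1
    # Suffix arcs leave state 5; their output completes the root 'laDakA'
    # and appends the feature structures.
    delta[(5, "A")] = [("A ((+sg, +dir))", 6)]
    delta[(5, "e")] = [("A ((+pl, +dir), (+sg, -dir))", 6)]
    delta[(5, "o")] = [("", 7)]  # no output until 'M' is seen
    delta[(7, "M")] = [("A ((+pl, -dir))", 6)]
    return delta, 0, {6}


def transduce(delta, start, finals, word):
    """Return the output of every accepting path that reads word."""
    results = []
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word):
            if state in finals:
                results.append(out)
            continue
        # Match the input symbol with i; on success emit o and move on.
        for o, nxt in delta.get((state, word[pos]), []):
            stack.append((nxt, pos + 1, out + o))
    return results


delta, q0, finals = build_example()
```

Because DELTA returns a set of states, the traversal keeps a stack of pending paths rather than a single current state, even though this toy machine never branches.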
Tools for FSA
• Lex
• OpenFST – (www.openfst.org/)
• AT&T FSM Toolkit – (http://www2.research.att.com/~fsmtools/fsm/)