October 2004 CSA3050 NL Algorithms 1
CSA3050: Natural Language Algorithms
Words, Strings and
Regular Expressions
Finite State Automata
This lecture
• Outline
  – Words
  – The language of words
  – FSAs in Prolog
• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz: NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/
What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• A number of bytes processed as a unit.
Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words
Properties of Words
• Sequence
  – characters, e.g. pollution
  – phonemes
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words
Complex Words
• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)
• affixation
  – prefix
  – suffix
  – infix
Sets Underlie the Formation of Complex Words
prefixes: dis, re, un, en
roots: large, charge, infect, code, decide
suffixes: ed, ing, ee, er, ly
prefixes + roots + suffixes
Structure of Complex Words
• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes
• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
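The concatenation scheme on this slide can be sketched in Prolog. This is a minimal illustration only: the predicate names prefix/1, root/1, suffix/1 and candidate/1 are assumptions made for the example, and the sets are small samples. Plain concatenation overgenerates (it also yields forms such as recodement), so this describes candidate words rather than the valid words of English.

```prolog
% Sample sets (illustrative, not exhaustive).
prefix(en).
prefix(re).
root(large).
root(code).
suffix(ment).
suffix(ed).

% candidate(?Word): Word is a prefix, a root and a suffix
% concatenated into a single atom.
candidate(Word) :-
    prefix(P),
    root(R),
    suffix(S),
    atom_concat(P, R, PR),
    atom_concat(PR, S, Word).
```

The query ?- candidate(W). enumerates all eight combinations on backtracking, starting with W = enlargement.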
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)
• Regular Language; Regular Sets
Characterising Classes of Set
[Diagram: CLASS OF SETS or LANGUAGES, linked to NOTATION and MACHINE]
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but similar function.
October 2004 CSA3050 NL Algorithms 12
Regular Expressions
a a simple symbol
A B concatenation
A | B alternation operator
A & B intersection operator
A* Kleene star
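The operators in this table can be given a direct reading as a membership test. The following is a sketch under stated assumptions, not the notation of any particular tool: regular expressions are written as Prolog terms sym/1, cat/2, alt/2, and/2 and star/1 (names invented for this example), strings are lists of symbols, and match/2 expects the string to be fully instantiated.

```prolog
:- use_module(library(lists)).

% match(+Regex, +String): String (a list of symbols) belongs to
% the set denoted by Regex.
match(sym(S), [S]).                          % a simple symbol
match(cat(A, B), String) :-                  % A B  concatenation
    append(Front, Back, String),
    match(A, Front),
    match(B, Back).
match(alt(A, _), String) :- match(A, String).   % A | B  alternation
match(alt(_, B), String) :- match(B, String).
match(and(A, B), String) :-                  % A & B  intersection
    match(A, String),
    match(B, String).
match(star(_), []).                          % A*  zero repetitions
match(star(A), String) :-                    % A*  one or more repetitions
    append([X|Front], Back, String),         % non-empty first chunk
    match(A, [X|Front]),
    match(star(A), Back).
```

For example, (a|b)* c is written cat(star(alt(sym(a), sym(b))), sym(c)), and the query ?- match(cat(star(alt(sym(a), sym(b))), sym(c)), [a,b,a,c]). succeeds.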
October 2004 CSA3050 NL Algorithms 13
Characterising Classes of Set
[Diagram: CLASS OF SETS or LANGUAGES, linked to NOTATION and MACHINE]
October 2004 CSA3050 NL Algorithms 14
Finite Automaton
• A finite automaton comprises
  – A finite set of states Q
  – An alphabet of symbols I
  – A start state q0 ∈ Q
  – A set of final states F ⊆ Q
  – A transition function δ(q,i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q
October 2004 CSA3050 NL Algorithms 15
Encoding FSAs in Prolog
• Three predicates
  – initial/1: initial(s) – s is an initial state
  – final/1: final(f) – f is a final state
  – arc/3: arc(s,t,c) – there is an arc from s to t labelled c
October 2004 CSA3050 NL Algorithms 16
Example 1: FSA
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).
[Diagram: states 1 (initial), 2, 3, 4 (final); arcs 1 -h-> 2, 2 -a-> 3, 3 -!-> 4, 3 -h-> 2]
October 2004 CSA3050 NL Algorithms 17
Example 2: FSA with jump arc
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).
[Diagram: states 1 (initial), 2, 3, 4 (final); arcs 1 -h-> 2, 2 -a-> 3, 3 -!-> 4, and a jump arc 3 -#-> 1]
October 2004 CSA3050 NL Algorithms 18
Example 3: NDA
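initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).
[Diagram: states 1 (initial), 2, 3, 4 (final); arcs 1 -h-> 2, 2 -a-> 3, 3 -!-> 4, 2 -a-> 1; state 2 has two outgoing arcs labelled a]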
October 2004 CSA3050 NL Algorithms 19
A Recogniser
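% recognize1(+Node, ?String): String is accepted starting from Node.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

% traverse1(+Label, ?String, ?Rest): consume Label from the front of String.
traverse1(Label, [Label|Symbols], Symbols).

% test1/1 is the driver that appears in the trace: start from an initial state.
test1(Symbols) :-
    initial(Node),
    recognize1(Node, Symbols).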
October 2004 CSA3050 NL Algorithms 20
Trace
Call: (7) test1([h, a, !])
Call: (8) initial(_L181)
Exit: (8) initial(1)
Call: (8) recognize1(1, [h, a, !])
Call: (9) arc(1, _L199, _L200)
Exit: (9) arc(1, 2, h)
Call: (9) traverse1(h, [h, a, !], _L201)
Exit: (9) traverse1(h, [h, a, !], [a, !])
Call: (9) recognize1(2, [a, !])
Call: (10) recognize1(3, [!])
Call: (11) recognize1(4, [])
Call: (12) final(4)
Exit: (12) final(4)
Exit: (11) recognize1(4, [])
Exit: (10) recognize1(3, [!])
Exit: (9) recognize1(2, [a, !])
Exit: (8) recognize1(1, [h, a, !])
Exit: (7) test1([h, a, !])
October 2004 CSA3050 NL Algorithms 21
Generation
• test1(X)
• X = [h, a, !] ;
• X = [h, a, h, a, !] ;
• X = [h, a, h, a, h, a, !] ;
• X = [h, a, h, a, h, a, h, a, !] ;
• etc.
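Because test1(X) with an unbound X enumerates an infinite set, it can be useful to bound the length of the generated strings. The sketch below repeats the Example 1 network and the recogniser so that it is self-contained; strings_upto/2 is a helper name assumed for this example, and between/3 is assumed to be available (it is in SWI-Prolog).

```prolog
% Example 1 network.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

% Recogniser and driver.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

traverse1(Label, [Label|Symbols], Symbols).

test1(Symbols) :-
    initial(Node),
    recognize1(Node, Symbols).

% strings_upto(+Max, -Strings): all accepted strings of length =< Max,
% shortest first. Fixing the length before calling test1/1 keeps each
% search finite.
strings_upto(Max, Strings) :-
    findall(X,
            ( between(0, Max, N), length(X, N), test1(X) ),
            Strings).
```

With this network, ?- strings_upto(5, S). gives S = [[h,a,!], [h,a,h,a,!]].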
October 2004 CSA3050 NL Algorithms 22
3 Related Frameworks
[Diagram: REGULAR LANGS/SETS, REGULAR EXPRESSIONS and FINITE STATE NETWORKS, connected by arrows labelled "describe" and "recognise"]
October 2004 CSA3050 NL Algorithms 23
Regular Operations
• Operations
  – Concatenation
  – Union
  – Closure
• Over What
  – Language
  – Expressions
  – FS Automata
October 2004 CSA3050 NL Algorithms 24
Concatenation over Reg. Expression and Language
Regular Expression
E1 := [a|b]
E2 := [c|d]
E1 E2 = [a|b] [c|d]
Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
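Concatenation over finite languages can be sketched directly in Prolog. This is an illustration only: languages are represented as lists of atoms, and concat_langs/3 is a name assumed for the example.

```prolog
:- use_module(library(lists)).

% concat_langs(+L1, +L2, -L): L holds every string of L1 followed by
% every string of L2, in the order the pairs are found.
concat_langs(L1, L2, L) :-
    findall(W,
            ( member(U, L1), member(V, L2), atom_concat(U, V, W) ),
            L).
```

The query ?- concat_langs([a,b], [c,d], L). gives L = [ac, ad, bc, bd], matching L1 L2 on this slide.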
October 2004 CSA3050 NL Algorithms 25
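Concatenation over FS Automata
[Diagram: a network with arcs a, b and a network with arcs c, d, concatenated (⌣) into a single network with arcs a, b followed by arcs c, d]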
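One common way to concatenate two networks is to link every final state of the first to the initial state of the second with a jump arc. The sketch below assumes a representation that is not in the slides: a network is a term fsa(Initial, Finals, Arcs), where Arcs is a list of arc(From, To, Label) terms, '#' marks a jump arc, and the two networks use disjoint state names. The names concat_fsa/3, accepts/2, walk/4 and step/3 are invented for this example.

```prolog
:- use_module(library(lists)).

% concat_fsa(+M1, +M2, -M): M starts where M1 starts, ends where M2 ends,
% and has a jump arc from each final state of M1 to the initial state of M2.
concat_fsa(fsa(I1, F1, A1), fsa(I2, F2, A2), fsa(I1, F2, Arcs)) :-
    findall(arc(F, I2, '#'), member(F, F1), Jumps),
    append(A1, Jumps, A1Jumps),
    append(A1Jumps, A2, Arcs).

% accepts(+FSA, ?String): String is accepted by the network.
accepts(fsa(Initial, Finals, Arcs), String) :-
    walk(Initial, String, Finals, Arcs).

walk(Node, [], Finals, _) :-
    memberchk(Node, Finals).
walk(Node, String, Finals, Arcs) :-
    member(arc(Node, Next, Label), Arcs),
    step(Label, String, Rest),
    walk(Next, Rest, Finals, Arcs).

% A jump arc consumes no input; any other arc consumes its label.
step('#', String, String).
step(Label, [Label|Rest], Rest) :-
    Label \== '#'.
```

For example, with M1 = fsa(1, [2], [arc(1,2,a), arc(1,2,b)]) and M2 = fsa(3, [4], [arc(3,4,c), arc(3,4,d)]), the query ?- concat_fsa(M1, M2, M), accepts(M, [a,c]). succeeds, while accepts(M, [a]) fails.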
October 2004 CSA3050 NL Algorithms 26
Issues
• Handling jump arcs.
• Handling non-determinism
• Computing operations over networks.
• Maintaining multiple states in DB
• Representation.
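As an illustration of the first issue, the recogniser can be extended so that an arc labelled # is followed without consuming input. This is a sketch in the spirit of the Blackburn and Striegnitz material, not a definitive solution: recognize2/2, traverse2/3 and test2/1 are names assumed here, the network is Example 2, and a network containing a cycle made only of jump arcs would make this version loop.

```prolog
% Example 2 network, with a jump arc from 3 back to 1.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,'#').

recognize2(Node, []) :-
    final(Node).
recognize2(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    recognize2(Node2, NewString).

% A jump arc leaves the input untouched; any other arc consumes its label.
traverse2('#', String, String).
traverse2(Label, [Label|Symbols], Symbols) :-
    Label \== '#'.

test2(Symbols) :-
    initial(Node),
    recognize2(Node, Symbols).
```

Here ?- test2([h,a,h,a,!]). succeeds by taking the jump arc once, and ?- test2([h,a,h,!]). fails.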