Instructor: Nick Cercone - 3050 CSEB - [email protected]

CSE6390 3.0 Special Topics in AI & Interactive Systems II
Introduction to Computational Linguistics

Fall Semester, 2010

NL Grammar Hierarchies

Regular Expressions, Finite State Automata, Markov Algorithms

Regular Expressions

• Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. The following definition is standard and is found in most textbooks on formal language theory. Given a finite alphabet ∑, the following constants are defined:

• (empty set) ∅ denoting the set ∅.
• (empty string) ε denoting the set containing only the "empty" string, which has no characters at all.
• (literal character) a in ∑ denoting the set containing only the character a.

Regular Expressions

The following operations are defined:

• (concatenation) RS denoting the set { ab | a in R and b in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.

• (alternation) R | S denoting the set union of R and S. For example {"ab", "c"}|{"ab", "d", "ef"} = {"ab", "c", "d", "ef"}.
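
The two operations above can be sketched directly on Python sets of strings. This is only an illustration of the definitions; the function names `concat` and `alternation` are hypothetical and do not come from the slides.

```python
def concat(R, S):
    """Concatenation RS: each string of R followed by each string of S."""
    return {a + b for a in R for b in S}


def alternation(R, S):
    """Alternation R | S: the set union of R and S."""
    return R | S


# The two examples from the slide.
print(sorted(concat({"ab", "c"}, {"d", "ef"})))
print(sorted(alternation({"ab", "c"}, {"ab", "d", "ef"})))
```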

Regular Expressions

• (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ... }.
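
Since R* is infinite whenever R contains a non-empty string, a program can only enumerate a finite part of it. The sketch below grows the set until it is closed under appending strings of R, subject to a length bound. The name `kleene_star` and the `max_len` cutoff are assumptions made for the illustration, not part of the definition.

```python
def kleene_star(R, max_len):
    """Return the strings of R* whose length is at most max_len."""
    result = {""}      # zero concatenations give the empty string ε
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Append one more string of R to everything found last round,
        # keeping only strings that are new and within the length bound.
        frontier = {w + r for w in frontier for r in R
                    if len(w + r) <= max_len} - result
        result |= frontier
    return result


print(sorted(kleene_star({"ab", "c"}, 3), key=lambda w: (len(w), w)))
```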

Regular Expressions

Regular expressions are defined recursively as follows:
• ∅ is a regular expression
• ε is a regular expression
• if a ∈ ∑ is a letter then a is a regular expression
• if r1 and r2 are regular expressions then so are (r1 + r2) and (r1 · r2)
• if r is a regular expression then so is (r)*
• nothing else is a regular expression over ∑
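
The recursive definition maps naturally onto a recursive data type with one constructor per clause. The following is a minimal sketch, with hypothetical class names (`Empty`, `Epsilon`, `Lit`, `Alt`, `Cat`, `Star`) and a deliberately naive `matches` function that tries every way of splitting the input; it is meant to mirror the definition, not to be efficient.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the six clauses of the recursive definition."""


@dataclass(frozen=True)
class Empty(Regex):        # ∅
    pass


@dataclass(frozen=True)
class Epsilon(Regex):      # ε
    pass


@dataclass(frozen=True)
class Lit(Regex):          # a single letter a ∈ ∑
    char: str


@dataclass(frozen=True)
class Alt(Regex):          # (r1 + r2)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Cat(Regex):          # (r1 · r2)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):         # (r)*
    inner: Regex


def matches(r, s):
    """Decide whether string s belongs to the language denoted by r."""
    if isinstance(r, Empty):
        return False
    if isinstance(r, Epsilon):
        return s == ""
    if isinstance(r, Lit):
        return s == r.char
    if isinstance(r, Alt):
        return matches(r.left, s) or matches(r.right, s)
    if isinstance(r, Cat):
        # Try every split of s into a left part and a right part.
        return any(matches(r.left, s[:i]) and matches(r.right, s[i:])
                   for i in range(len(s) + 1))
    if isinstance(r, Star):
        # Either zero repetitions, or a non-empty prefix matching the
        # inner expression followed by more repetitions.
        return s == "" or any(
            matches(r.inner, s[:i]) and matches(r, s[i:])
            for i in range(1, len(s) + 1))
    raise TypeError(f"not a regular expression: {r!r}")


# ((a + b))* · c
example = Cat(Star(Alt(Lit("a"), Lit("b"))), Lit("c"))
print(matches(example, "abac"), matches(example, "abca"))
```

Requiring a non-empty prefix in the `Star` case keeps the recursion from looping on the empty string.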

Finite State Automata

• Automata are models of computation: they compute languages.

• A finite-state automaton is a five-tuple <Q, q0, ∑, δ, F>, where ∑ is a finite set of alphabet symbols, Q is a finite set of states, q0 ∈ Q is the initial state, F ⊆ Q is a set of final (accepting) states and δ ⊆ Q × ∑ × Q is a relation from states and alphabet symbols to states.

Finite State Automata

• Example: Finite-state automaton
• Q = {q0, q1, q2, q3}
• ∑ = {c, a, t, r}
• F = {q3}
• δ = {<q0, c, q1>, <q1, a, q2>, <q2, t, q3>, <q2, r, q3>}
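
The example automaton can be written down as plain data and simulated. Because the transition component is a relation rather than a function, the sketch below tracks a set of current states. The names `DELTA`, `START`, `FINAL` and `accepts` are hypothetical choices for this illustration.

```python
# The five components of the example automaton.
Q = {"q0", "q1", "q2", "q3"}
SIGMA = {"c", "a", "t", "r"}
START = "q0"
FINAL = {"q3"}
DELTA = {
    ("q0", "c", "q1"),
    ("q1", "a", "q2"),
    ("q2", "t", "q3"),
    ("q2", "r", "q3"),
}


def accepts(word, delta=DELTA, start=START, final=FINAL):
    """Return True if some path labelled by word leads from start to a final state."""
    current = {start}
    for symbol in word:
        # Follow every transition on this symbol from every current state.
        current = {target for (source, label, target) in delta
                   if source in current and label == symbol}
    return bool(current & final)


for w in ["cat", "car", "ca", "cart"]:
    print(w, accepts(w))
```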

Finite State Automata

• The reflexive transitive extension of the transition relation δ is a new relation, δ̂, defined as follows:
– for every state q ∈ Q, (q, ε, q) ∈ δ̂
– for every string w ∈ ∑* and letter a ∈ ∑, if (q, w, q′) ∈ δ̂ and (q′, a, q′′) ∈ δ then (q, w · a, q′′) ∈ δ̂.

Finite State Automata

• Example: Paths

For the finite-state automaton with δ = {<q0, c, q1>, <q1, a, q2>, <q2, t, q3>, <q2, r, q3>}, δ̂ is the following set of triples:

<q0, ε, q0>, <q1, ε, q1>, <q2, ε, q2>, <q3, ε, q3>,

<q0, c, q1>, <q1, a, q2>, <q2, t, q3>, <q2, r, q3>,

<q0, ca, q2>, <q1, at, q3>, <q1, ar, q3>,

<q0, cat, q3>, <q0, car, q3>
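
The two clauses of the reflexive transitive extension translate into a short loop: start from the empty-string triples, then repeatedly extend each known path by one transition. The sketch below assumes the cat/car automaton from the earlier example and bounds the path length with a hypothetical `max_len` parameter, since the extension is infinite for automata with cycles.

```python
STATES = {"q0", "q1", "q2", "q3"}
DELTA = {
    ("q0", "c", "q1"),
    ("q1", "a", "q2"),
    ("q2", "t", "q3"),
    ("q2", "r", "q3"),
}


def extend(states, delta, max_len):
    """Triples (q, w, q') of the extended relation with len(w) <= max_len."""
    hat = {(q, "", q) for q in states}     # clause 1: (q, ε, q)
    frontier = set(hat)
    for _ in range(max_len):
        # Clause 2: extend each path found in the last round by one letter.
        frontier = {(q, w + a, q2)
                    for (q, w, q1) in frontier
                    for (p, a, q2) in delta if p == q1}
        hat |= frontier
    return hat


for triple in sorted(extend(STATES, DELTA, 3), key=lambda t: (len(t[1]), t)):
    print(triple)
```

This automaton has no cycles, so raising `max_len` beyond 3 adds no further triples.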

Finite State Automata

An extension: ε-moves.

• The transition relation δ is extended to:

δ ⊆ Q × (∑ ∪ {ε}) × Q

Example: Automata with ε-moves - an automaton accepting the language {do, undo, done, undone}:

[Figure: automaton with ε-moves]
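
One possible automaton for this language can be sketched in code. The state numbering and transition set below are an assumption made for illustration and are not necessarily the automaton drawn on the slide: an ε-move skips the optional prefix "un" and another skips the optional suffix "ne". The empty string stands for ε in the transition triples.

```python
# Hypothetical automaton for {do, undo, done, undone}; "" labels an ε-move.
DELTA = {
    (0, "u", 1), (1, "n", 2),   # optional prefix "un"
    (0, "", 2),                 # ε-move: skip the prefix
    (2, "d", 3), (3, "o", 4),   # the stem "do"
    (4, "n", 5), (5, "e", 6),   # optional suffix "ne"
    (4, "", 6),                 # ε-move: skip the suffix
}
START = 0
FINAL = {6}


def eps_closure(states, delta):
    """All states reachable from the given states using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for (source, label, target) in delta:
            if source == q and label == "" and target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(word, delta=DELTA, start=START, final=FINAL):
    """Simulate the automaton, taking ε-moves for free between letters."""
    current = eps_closure({start}, delta)
    for symbol in word:
        moved = {target for (source, label, target) in delta
                 if source in current and label == symbol}
        current = eps_closure(moved, delta)
    return bool(current & final)


for w in ["do", "undo", "done", "undone", "un", "don"]:
    print(w, accepts(w))
```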

Formal language theory – definitions

• If L is a language then the reversal of L, denoted L^R, is the language {w | w^R ∈ L}.

• If L1 and L2 are languages, then L1 · L2 = {w1 · w2 | w1 ∈ L1 and w2 ∈ L2}.

• Example: Language operations
– Let L1 = {i, you, he, she, it, we, they}, L2 = {smile, sleep}.
– Then L1^R = {i, uoy, eh, ehs, ti, ew, yeht} and L1 · L2 = {ismile, yousmile, hesmile, shesmile, itsmile, wesmile, theysmile, isleep, yousleep, hesleep, shesleep, itsleep, wesleep, theysleep}.
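
For finite languages, both operations are one-line set comprehensions. The sketch below reproduces the example; the function names are hypothetical.

```python
def reverse_language(L):
    """L^R: the reversal of every word in L."""
    return {w[::-1] for w in L}


def concat_languages(L1, L2):
    """L1 · L2: every word of L1 followed by every word of L2."""
    return {w1 + w2 for w1 in L1 for w2 in L2}


L1 = {"i", "you", "he", "she", "it", "we", "they"}
L2 = {"smile", "sleep"}

print(sorted(reverse_language(L1)))
print(sorted(concat_languages(L1, L2)))
```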

Formal language theory – definitions

• If L is a language then L^0 = {ε}.
• Then, for i > 0, L^i = L · L^(i−1).
• Example: Language exponentiation

Let L be the set of words {bau, haus, hof, frau}. Then L^0 = {ε}, L^1 = L and L^2 = {baubau, bauhaus, bauhof, baufrau, hausbau, haushaus, haushof, hausfrau, hofbau, hofhaus, hofhof, hoffrau, fraubau, frauhaus, frauhof, fraufrau}.
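
The recursive definition of exponentiation carries over to code almost word for word. A minimal sketch, with `power` as a hypothetical function name:

```python
def power(L, i):
    """L^i, following the definition: L^0 = {ε} and L^i = L · L^(i-1)."""
    if i == 0:
        return {""}
    return {w + v for w in L for v in power(L, i - 1)}


L = {"bau", "haus", "hof", "frau"}
print(power(L, 0))
print(sorted(power(L, 2)))
```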

Formal language theory – definitions

• The Kleene closure of L is denoted L* and is defined as L* = ⋃_{i=0}^{∞} L^i.

• L+ = ⋃_{i=1}^{∞} L^i

• Example: Kleene closure
Let L = {dog, cat}. Observe that L^0 = {ε}, L^1 = {dog, cat}, L^2 = {catcat, catdog, dogcat, dogdog}, etc. Thus L* contains, among its infinite set of strings, the strings ε, cat, dog, catcat, catdog, dogcat, dogdog, catcatcat, catdogcat, dogcatcat, dogdogcat, etc.

• The notation for ∑* should now become clear: it is simply a special case of L*, where L = ∑.
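
The infinite unions cannot be computed in full, but truncating them at some exponent k gives a finite approximation that shows the difference between L* and L+. In the sketch below, `powers`, `star_up_to`, `plus_up_to` and the cutoff `k` are hypothetical names introduced for the illustration.

```python
def powers(L, k):
    """Yield L^0, L^1, ..., L^k in turn."""
    current = {""}                 # L^0 = {ε}
    for _ in range(k + 1):
        yield current
        current = {w + v for w in L for v in current}


def star_up_to(L, k):
    """Union of L^i for 0 <= i <= k, a finite part of L*."""
    return set().union(*powers(L, k))


def plus_up_to(L, k):
    """Union of L^i for 1 <= i <= k, a finite part of L+."""
    return set().union(*list(powers(L, k))[1:])


L = {"dog", "cat"}
print(sorted(star_up_to(L, 2), key=lambda w: (len(w), w)))
print("" in star_up_to(L, 2), "" in plus_up_to(L, 2))
```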

Markov Algorithms

A Markov Algorithm is a finite sequence P1, P2,...,Pn of Markov productions to be applied to strings in a given alphabet according to the following rules. Let S be a given string. The sequence is searched to find the first production Pi whose antecedent occurs in S. If no such production exists, the operation of the algorithm halts without change in S. If there is a production in the algorithm whose antecedent occurs in S, the first such production is applied to S. If this is a conclusive production, the operation of the algorithm halts without further change in S. If this is a simple production, a new search is conducted using the string S' into which S has been transformed. If the operation of the algorithm ultimately ceases with a string S*, we say that S* is the result of applying the algorithm to S.

Markov Algorithms

Example:

Take the alphabet to be {a, b, c, d}. The algorithm is given below, where Λ denotes the empty string.

• Algorithm M1
M11: [conclusive] a d → d c
M12: [simple] b a → Λ
M13: [simple] a → b c
M14: [simple] b c → b b a
M15: [simple] Λ → a

Taking S = “dcb” we apply the algorithm:
by M15, dcb becomes adcb
by M11, adcb becomes dccb and halts.
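
The rules for running a Markov algorithm fit in a short interpreter. The sketch below is one possible rendering, not the course's own code: productions are tuples `(name, antecedent, consequent, conclusive)`, the empty string `""` plays the role of the empty antecedent or consequent, and the leftmost occurrence of the antecedent is the one replaced, which is the usual convention. The names `run_markov` and `max_steps` are hypothetical, and the step limit is only a guard against algorithms that never halt.

```python
def run_markov(productions, s, max_steps=1000):
    """Apply a Markov algorithm to s; return the result and a trace of steps."""
    trace = []
    for _ in range(max_steps):
        for name, lhs, rhs, conclusive in productions:
            if lhs in s:                       # "" occurs in every string
                s = s.replace(lhs, rhs, 1)     # rewrite the leftmost occurrence
                trace.append((name, s))
                break
        else:
            return s, trace                    # no production applies: halt
        if conclusive:
            return s, trace                    # conclusive production: halt
    raise RuntimeError("step limit reached without halting")


# Algorithm M1 from the slide, with "" standing for the empty string.
M1 = [
    ("M11", "ad", "dc", True),
    ("M12", "ba", "", False),
    ("M13", "a", "bc", False),
    ("M14", "bc", "bba", False),
    ("M15", "", "a", False),
]

result, trace = run_markov(M1, "dcb")
print(result)
for name, string in trace:
    print("by", name, "=>", string)
```

A production with an empty antecedent always applies, and replacing the leftmost occurrence of the empty string inserts the consequent at the front of the string.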

Markov Algorithms

Example:

Let α be a marker not in the alphabet. If S is a string in the alphabet, the result of applying algorithm M3 to S is the string SA.

Algorithm M3
M31: [interchange] αξ → ξα, ξ a member of the alphabet
M32: [conclusive] α → A
M33: Λ → α

Since S initially does not contain α, the third production, whose antecedent Λ is the empty string, is used first and places α at the front of S. The first production is then used to move α past the symbols in S. If S contains n occurrences of symbols, then after n applications of the first production we obtain the string Sα. At this point the first production no longer applies, and the second production produces SA. Since this production is conclusive, the string SA is then the result.

Markov Algorithms

In the preceding example, we have introduced a new notation. Namely, in the first production we have used the variable ξ, which ranges over the symbols in the alphabet. Thus the first line is not really a production, but rather a production schema, denoting all the productions which can be obtained by substituting symbols of the alphabet for ξ.

Because of the manner in which Markov algorithms are used, the order in which the productions are written is vital. If the first two lines of algorithm M3 were interchanged, the result would be to transform S into AS, rather than into SA, and the productions represented by αξ → ξα would never be used.
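
Both points, schema expansion and order sensitivity, can be checked with a small experiment. The sketch below makes several assumptions: the marker is written `α`, the schema variable is expanded over a hypothetical alphabet `"ABC"`, productions are `(antecedent, consequent, conclusive)` tuples with `""` as the empty string, and the leftmost occurrence of the antecedent is rewritten.

```python
def run_markov(productions, s, max_steps=1000):
    """Apply the first applicable production repeatedly until halting."""
    for _ in range(max_steps):
        for lhs, rhs, conclusive in productions:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)   # leftmost occurrence
                break
        else:
            return s                         # nothing applies: halt
        if conclusive:
            return s
    raise RuntimeError("step limit reached without halting")


ALPHABET = "ABC"   # hypothetical alphabet for the illustration

# The schema "marker followed by a symbol" becomes one production per symbol.
interchange = [("α" + x, x + "α", False) for x in ALPHABET]
to_A = [("α", "A", True)]          # conclusive: turn the marker into A
introduce = [("", "α", False)]     # put the marker at the front

M3 = interchange + to_A + introduce
M3_swapped = to_A + interchange + introduce   # first two lines interchanged

print(run_markov(M3, "CAB"))
print(run_markov(M3_swapped, "CAB"))
```

With the original order the marker travels to the right end before becoming A; with the first two lines interchanged it is converted immediately, so the interchange productions are never reached.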

Markov Algorithms

Example:

Another procedure which is quite common is that of reversing a string of characters. We do this by moving the first character to the end as before, then moving the next character down to the position just preceding the first character, and so on. Markers: α, β

Algorithm M10 (ξ, φ members of the alphabet; Λ the empty string)
M101: [conclusive] αβ → Λ
M102: αξβ → βξ
M103: αξφ → φαξ
M104: α → β
M105: Λ → α

Markov Algorithms

Illustrating this algorithm on the string “ABCD”, with α and β as the two markers, we have

by M105 => α A B C D
by M103 => B α A C D
by M103 => B C α A D
by M103 => B C D α A
by M104 => B C D β A
by M105 => α B C D β A
by M103 => C α B D β A
by M103 => C D α B β A
by M102 => C D β B A
by M105 => α C D β B A
by M103 => D α C β B A
by M102 => D β C B A
by M105 => α D β C B A
by M102 => β D C B A
by M105 => α β D C B A
by M101 => D C B A
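
The trace can be replayed mechanically. The sketch below rests on a reading of algorithm M10 that is consistent with the sixteen steps listed: two markers written `α` and `β`, schema variables expanded over the alphabet `"ABCD"`, `""` as the empty string, leftmost-occurrence rewriting, and M101 treated as conclusive so that the run stops once both markers are erased. These symbol choices and the helper `run_markov` are assumptions of the illustration.

```python
def run_markov(productions, s, max_steps=1000):
    """Run a Markov algorithm; return the result and the (name, string) trace."""
    trace = []
    for _ in range(max_steps):
        for name, lhs, rhs, conclusive in productions:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)   # leftmost occurrence
                trace.append((name, s))
                break
        else:
            return s, trace
        if conclusive:
            return s, trace
    raise RuntimeError("step limit reached without halting")


ALPHABET = "ABCD"

M10 = (
    # M101: both markers adjacent, erase them and stop.
    [("M101", "αβ", "", True)]
    # M102: the carried symbol has reached β; drop it just after β.
    + [("M102", "α" + x + "β", "β" + x, False) for x in ALPHABET]
    # M103: carry symbol x one position to the right, past symbol y.
    + [("M103", "α" + x + y, y + "α" + x, False)
       for x in ALPHABET for y in ALPHABET]
    # M104: the first carried symbol reached the end; turn α into β.
    + [("M104", "α", "β", False)]
    # M105: introduce a fresh α at the front.
    + [("M105", "", "α", False)]
)

result, trace = run_markov(M10, "ABCD")
for name, string in trace:
    print("by", name, "=>", " ".join(string))
print(result)
```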

Other Concluding Remarks

A PSYCHOLOGICAL TIP

Whenever you're called on to make up your mind,
and you're hampered by not having any,
the best way to solve the dilemma, you'll find,
is simply by spinning a penny.

No -- not so that chance shall decide the affair
while you're passively standing there moping;
but the moment the penny is up in the air,
you suddenly know what you're hoping.