CMSC 723: Intro to Computational Linguistics

Lecture 2: February 4, 2004

Regular Expressions and Finite State Automata

Professor Bonnie J. Dorr

Dr. Nizar Habash

TAs: Nitin Madnani and Nate Waisbrot

Regular Expressions and Finite State Automata

• REs: Language for specifying text strings

• Search for documents containing a string

– Searching for “woodchuck”

– Searching for “woodchucks” with an optional final “s”

• Finite-state automata (FSA) (singular: automaton)

• How much wood would a woodchuck chuck if a woodchuck would chuck wood?

Regular Expressions

• Basic regular expression patterns

• Perl-based syntax (slightly different from other notations for regular expressions)

• Disjunctions /[wW]oodchuck/

Regular Expressions

• Ranges [A-Z]

• Negations [^Ss]

Regular Expressions

• Optional and repeated characters: ?, * and +

– ? (0 or 1)

• /colou?r/ matches color or colour

– * (0 or more)

• /oo*h!/ matches oh! or ooh! or ooooh!

– + (1 or more)

• /o+h!/ matches oh! or ooh! or ooooh!

– * and + are the Kleene star and Kleene plus, named after Stephen Cole Kleene

• Wild cards .

– /beg.n/ matches begin or began or begun
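The counters and the wild card above can be checked mechanically. The following is a minimal sketch in Python, whose `re` module accepts the same Perl-style syntax for these operators; the helper name `all_match` and the `examples` dictionary are illustrative and not part of the lecture.

```python
import re

# Each pattern from the slide, paired with strings it should match in full.
examples = {
    r"colou?r": ["color", "colour"],
    r"oo*h!": ["oh!", "ooh!", "ooooh!"],
    r"o+h!": ["oh!", "ooh!", "ooooh!"],
    r"beg.n": ["begin", "began", "begun"],
}

def all_match(pattern, strings):
    """Return True if the pattern matches every string in its entirety."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)

results = {pattern: all_match(pattern, strings)
           for pattern, strings in examples.items()}
print(results)
```

`re.fullmatch` is used rather than `re.search` so that a pattern cannot succeed by matching only part of the string.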

Regular Expressions

• Anchors ^ and $

– /^[A-Z]/ matches “Ramallah, Palestine”

– /^[^A-Z]/ matches “¿verdad?” and “really?”

– /\.$/ matches “It is over.”

– /.$/ matches what?

• Boundaries \b and \B

– /\bon\b/ matches “on my way” but not “Monday”

– /\Bon\b/ matches “automaton”

• Disjunction |

– /yours|mine/ matches “it is either yours or mine”
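A short Python sketch can confirm how the anchors, boundaries and disjunction behave on the slide's own examples. The helper `found` is a hypothetical name introduced here; `re.search` plays the role of Perl's `=~` match.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^[A-Z]", "Ramallah, Palestine"))  # starts with a capital letter
print(found(r"^[^A-Z]", "¿verdad?"))            # starts with a non-capital
print(found(r"\.$", "It is over."))             # ends with a literal period
print(found(r"\bon\b", "on my way"))            # "on" as a whole word
print(found(r"\bon\b", "Monday"))               # "on" inside a word
print(found(r"\Bon\b", "automaton"))            # "on" ending a longer word
print(found(r"yours|mine", "it is either yours or mine"))
```

Only the "Monday" case is expected to fail, because there is no word boundary between "M" and "o".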

Disjunction, Grouping, Precedence

• Column 1 Column 2 Column 3 … How do we express this?

– /Column [0-9]+ */

– /(Column [0-9]+ *)*/

• Precedence (highest to lowest)

– Parenthesis ()

– Counters * + ? {}

– Sequences and anchors: the ^my end$

– Disjunction |

• REs are greedy!
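The effect of grouping and of greedy matching is easiest to see by comparing what each pattern actually consumes. This is a minimal Python sketch; the "once upon a time" string is an illustrative input chosen here, not taken from the slides.

```python
import re

text = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the preceding space,
# so the pattern covers a single column.
single = re.match(r"Column [0-9]+ *", text).group(0)

# With parentheses, * repeats the whole group, so all columns are covered.
repeated = re.match(r"(Column [0-9]+ *)*", text).group(0)

# Greedy matching: [a-z]* takes the longest run of letters it can.
greedy = re.match(r"[a-z]*", "once upon a time").group(0)

print(repr(single))
print(repr(repeated))
print(repr(greedy))
```

Because counters bind more tightly than sequences, the parentheses are what make the `*` apply to the whole "Column N" unit.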

Perl Commands

# Print every input line that contains "the"
while ($line = <STDIN>) {
    if ($line =~ /the/) {
        print "MATCH: $line";
    }
}

Writing correct expressions

Exercise: write a Perl regular expression to match the English article “the”:

/the/

/[tT]he/

/\b[tT]he\b/

/[^a-zA-Z][tT]he[^a-zA-Z]/

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
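One way to see why each attempt improves on the previous one is to run all five patterns over a few test lines. The sketch below uses Python's `re` module; the four test lines are hypothetical, chosen here to expose false positives ("another") and false negatives ("The" at line start, "the_end").

```python
import re

patterns = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

# Hypothetical test lines, not from the lecture.
lines = ["The cat sat.", "on the mat", "another one", "the_end here"]

# For each pattern, collect the lines in which it finds a match.
hits_by_pattern = {
    pattern: [line for line in lines if re.search(pattern, line)]
    for pattern in patterns
}

for pattern, hits in hits_by_pattern.items():
    print(pattern, "->", hits)
```

The `\b` version skips "the_end" because an underscore counts as a word character, which is the motivation for the explicit `[^a-zA-Z]` versions.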

A more complex example

Exercise: write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:

/\$[0-9]+/

/\$[0-9]+\.[0-9][0-9]/

/\b\$[0-9]+(\.[0-9][0-9])?\b/

/\b\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/

/\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/

/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/

/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
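The price, speed and disk patterns can be tried on a sample advertisement. This Python sketch rests on several assumptions that are not in the slides: the ad text is invented, the dollar sign is escaped (an unescaped `$` is the end-of-line anchor), the leading `\b` before the dollar sign is dropped (there is no word boundary between a space and `$`), and the disk pattern allows several digits with an optional decimal part.

```python
import re

# Hypothetical ad text, not from the lecture.
ad = "Fast PC: 600 MHz, 40 Gb disk, now only $999.99"

# Dollar amount with optional cents.
price = re.search(r"\$[0-9]+(\.[0-9][0-9])?", ad)

# Processor speed in MHz or GHz, abbreviated or spelled out.
speed = re.search(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b", ad)

# Disk space in gigabytes, with an optional decimal part.
disk = re.search(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b", ad)

print(price.group(0), speed.group(0), disk.group(0), sep=" | ")
```

The patterns only locate the quantities. Deciding whether a price is "less than $1000" still needs a numeric comparison on the extracted text.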

Advanced operators

Substitutions and Memory

• Substitutions

– s/colour/color/

– s/colour/color/g (substitute as many times as possible!)

– s/colour/color/i (case insensitive matching)

• Memory (\1, \2, etc. refer back to matches)

– s/([0-9]+)/<\1>/

– /the (.*)er they were, the \1er they will be/

– /the (.*)er they (.*), the \1er they \2/
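Substitution and memory carry over directly to Python's `re.sub`, with one difference worth noting: `re.sub` replaces every occurrence by default, so Perl's plain `s///` corresponds to `count=1` and `s///g` to the default. The input strings below are illustrative, not from the lecture.

```python
import re

# s/colour/color/ : replace only the first occurrence.
first_only = re.sub(r"colour", "color", "colour and colour", count=1)

# s/colour/color/g : replace every occurrence.
every = re.sub(r"colour", "color", "colour and colour")

# s/colour/color/i : ignore case while matching.
ignore_case = re.sub(r"colour", "color", "Colour chart",
                     count=1, flags=re.IGNORECASE)

# s/([0-9]+)/<\1>/ : wrap the first number using memory register \1.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes", count=1)

# \1 inside the pattern requires the same text to appear twice.
pattern = r"the (.*)er they were, the \1er they will be"
same = re.search(pattern, "the bigger they were, the bigger they will be") is not None
different = re.search(pattern, "the bigger they were, the faster they will be") is not None

print(first_only, every, ignore_case, bracketed, same, different, sep=" | ")
```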

Eliza [Weizenbaum, 1966]

User: Men are all alike

ELIZA: IN WHAT WAY

User: They’re always bugging us about something or other

ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?

User: Well, my boyfriend made me come here

ELIZA: YOUR BOYFRIEND MADE YOU COME HERE

User: He says I’m depressed much of the time

ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions

Step 1: replace first person references with second person references

s/\bI('m| am)\b/YOU ARE/g

s/\bmy\b/YOUR/g

s/\bmine\b/YOURS/g

Step 2: use additional regular expressions to generate replies

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 3: use scores to rank possible transformations
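The three steps can be strung together into a tiny responder. This is a simplified Python sketch, not Weizenbaum's program: the function name `eliza_reply` is hypothetical, the apostrophe is assumed to be a plain ASCII `'`, and Step 3's scoring is approximated by taking the first rule that matches in list order.

```python
import re

def eliza_reply(utterance):
    """Apply the three-step recipe to one user utterance (simplified)."""
    # Step 1: replace first person references with second person ones.
    text = re.sub(r"\bI('m| am)\b", "YOU ARE", utterance)
    text = re.sub(r"\bmy\b", "YOUR", text)
    text = re.sub(r"\bmine\b", "YOURS", text)

    # Step 2: reply-generating rules. Step 3 (scoring) is approximated by
    # rule order: the first matching rule wins.
    rules = [
        (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* all .*", "IN WHAT WAY"),
        (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, reply in rules:
        if re.search(pattern, text):
            return re.sub(pattern, reply, text, count=1)

    # No rule matched: echo the transformed utterance back.
    return text.upper()

print(eliza_reply("Men are all alike"))
print(eliza_reply("He says I'm depressed much of the time"))
```

The captured word comes back in its original case ("depressed"), since `\1` copies the matched text unchanged.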

Finite-state Automata

• Finite-state automata (FSA)

• Regular languages

• Regular expressions

Finite-state Automata (Machines)

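[Figure: FSA for /baa+!/. States q0 through q4 are connected in sequence by transitions labeled b, a, a and !; q3 also has a self-loop labeled a. q0 is the start state and q4 is the final state. The figure labels a state, a transition and the final state.]

Strings accepted: baa! baaa! baaaa! baaaaa! ...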

[Figure: input tape “a b a ! b”. The machine starts in q0, which has no transition on a, so the input is rejected.]

REJECT

[Figure: input tape “b a a a !”. The machine passes through q0, q1, q2, q3, q3 and q4 and ends in the final state, so the input is accepted.]

ACCEPT

Finite-state Automata

• Q: a finite set of N states

– Q = {q0, q1, q2, q3, q4}

• Σ: a finite input alphabet of symbols

– Σ = {a, b, !}

• q0: the start state

• F: the set of final states

– F = {q4}

• δ(q,i): transition function

– Given state q and input symbol i, return new state q'

– δ(q3,!) → q4

State-transition Tables

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅

(∅ marks a missing transition; the colon marks the final state.)

D-RECOGNIZE

function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
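The D-RECOGNIZE pseudocode translates almost line for line into Python. In this sketch the transition table for /baa+!/ is stored as a dictionary keyed by (state, symbol), with missing keys standing for empty cells; the names `TRANSITIONS`, `START_STATE`, `ACCEPT_STATES` and `d_recognize` are choices made here, and the machine is fixed rather than passed in as an argument.

```python
# Transition table for /baa+!/; a missing entry means "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}

def d_recognize(tape):
    """Return True (accept) or False (reject) for the input string."""
    index = 0
    current_state = START_STATE
    while True:
        if index == len(tape):                 # end of input reached
            return current_state in ACCEPT_STATES
        key = (current_state, tape[index])
        if key not in TRANSITIONS:             # empty table cell
            return False
        current_state = TRANSITIONS[key]
        index += 1

print(d_recognize("baaa!"))
print(d_recognize("aba!b"))
```

The two calls correspond to the accepted tape “b a a a !” and the rejected tape “a b a ! b”.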

Adding a failing state

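[Figure: the /baa+!/ FSA extended with a fail state qF. Each state gets explicit arcs to qF for the input symbols (a, b or !) on which it has no other transition.]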

Adding an “all else” arc

[Figure: the /baa+!/ FSA with the fail state qF, where the individual failing arcs are replaced by “all else” arcs leading to qF.]

Languages and Automata

• Can use FSA as a generator as well as a recognizer

• Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language.

– L(M) = {baa!, baaa!, baaaa!, …}

• Regular languages vs. non-regular languages
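Using an FSA as a generator amounts to walking its arcs and writing down the symbols along every path that ends in a final state. The sketch below enumerates L(M) for the /baa+!/ machine up to a length bound, since the language itself is infinite; the function name `generate` and the bound parameter are assumptions made for illustration.

```python
from collections import deque

# Transition table for /baa+!/; a missing entry means "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}

def generate(max_length):
    """List every accepted string of at most max_length symbols, shortest first."""
    accepted = []
    queue = deque([(START_STATE, "")])     # (current state, string so far)
    while queue:
        state, prefix = queue.popleft()
        if state in ACCEPT_STATES:
            accepted.append(prefix)
        if len(prefix) < max_length:
            # Extend the string along every arc leaving this state.
            for (source, symbol), target in TRANSITIONS.items():
                if source == state:
                    queue.append((target, prefix + symbol))
    return accepted

print(generate(6))
```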

Languages and Automata

• Deterministic vs. Non-deterministic FSAs

• Epsilon (ε) transitions

Using NFSAs to accept strings

• Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.

• Look-ahead: look ahead in input

• Parallelism: look at alternatives in parallel

Using NFSAs

         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4:       ∅    ∅      ∅    ∅

(∅ marks a missing transition; the colon marks the final state.)
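The parallelism strategy can be sketched by tracking the set of all states the NFSA could be in after each input symbol. This Python sketch encodes the table above as a dictionary from (state, symbol) to a set of next states. It assumes there are no ε-transitions, which holds for this table, and the names `NFSA` and `nd_recognize_parallel` are introduced here.

```python
# NFSA for /baa+!/: state 2 has two possible successors on "a".
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
NFSA_START = 0
NFSA_ACCEPT = {4}

def nd_recognize_parallel(tape):
    """Follow all alternatives at once; accept if any ends in a final state."""
    current = {NFSA_START}
    for symbol in tape:
        following = set()
        for state in current:
            following |= NFSA.get((state, symbol), set())
        current = following
        if not current:                 # every alternative has died out
            return False
    return bool(current & NFSA_ACCEPT)

print(nd_recognize_parallel("baaa!"))
print(nd_recognize_parallel("ba!"))
```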

More about FSAs

• Transducers

• Equivalence of DFSAs and NFSAs

• Recognition as search: depth-first, breadth-first

Recognition using NFSAs

NFSA Recognition of “baaa!”

Breadth-first Recognition of “baaa!”
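Depth-first and breadth-first recognition differ only in which pending alternative is explored next. The following Python sketch keeps an agenda of (state, tape position) pairs and treats it as a stack for depth-first search or as a queue for breadth-first search. It assumes the ε-free NFSA for /baa+!/; the function name and the `breadth_first` flag are illustrative.

```python
from collections import deque

# NFSA for /baa+!/: state 2 has two possible successors on "a".
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
NFSA_ACCEPT = {4}

def nd_recognize(tape, breadth_first=False):
    """Search over (state, index) pairs until one reaches a final state."""
    agenda = deque([(0, 0)])            # start state, beginning of tape
    while agenda:
        # Queue behaviour gives breadth-first, stack behaviour depth-first.
        state, index = agenda.popleft() if breadth_first else agenda.pop()
        if index == len(tape):
            if state in NFSA_ACCEPT:
                return True
            continue                    # dead end: try another alternative
        for next_state in NFSA.get((state, tape[index]), ()):
            agenda.append((next_state, index + 1))
    return False

print(nd_recognize("baaa!"))
print(nd_recognize("baaa!", breadth_first=True))
```

Both orders give the same answer. They differ in how many alternatives are held on the agenda at once and in which order dead ends are discovered.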

Regular languages

• Regular languages are characterized by FSAs

• For every NFSA, there is an equivalent DFSA.

• Regular languages are closed under concatenation, Kleene closure, union.
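The equivalence of NFSAs and DFSAs is usually shown with the subset construction, in which each DFSA state stands for a set of NFSA states. The slides do not spell the construction out, so the following is a minimal Python sketch of the standard technique. It assumes an NFSA without ε-transitions, and the names `determinize` and `run` are hypothetical.

```python
def determinize(nfsa, start, accept):
    """Subset construction: each DFSA state is a frozenset of NFSA states."""
    start_set = frozenset([start])
    dfsa = {}
    seen = {start_set}
    pending = [start_set]
    while pending:
        subset = pending.pop()
        # Symbols on which at least one member state has a transition.
        symbols = {sym for (state, sym) in nfsa if state in subset}
        for sym in symbols:
            target = frozenset(t for s in subset
                               for t in nfsa.get((s, sym), ()))
            dfsa[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    # A subset is final if it contains any final NFSA state.
    dfsa_accept = {subset for subset in seen if subset & accept}
    return dfsa, start_set, dfsa_accept

def run(dfsa, start_set, dfsa_accept, tape):
    """Deterministically run the constructed DFSA over the input string."""
    state = start_set
    for sym in tape:
        if (state, sym) not in dfsa:
            return False
        state = dfsa[(state, sym)]
    return state in dfsa_accept

# NFSA for /baa+!/ with a nondeterministic choice in state 2.
NFSA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
dfsa, start_set, dfsa_accept = determinize(NFSA, 0, {4})
print(run(dfsa, start_set, dfsa_accept, "baaa!"))
```

For this machine the construction yields five DFSA states, {0}, {1}, {2}, {2,3} and {4}, so the result is no larger than the NFSA it came from.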

Concatenation

Kleene Closure

Union

Readings for next time

• J&M Chapter 3