Regular Expressions for NLP

Regular Expressions & Finite State Automata Lecture 1

Upload: dennis-flynn

Post on 07-Dec-2015

DESCRIPTION

Details how to use regular expressions for natural language processing

TRANSCRIPT

Page 1: Regular Expressions for NLP

Regular Expressions & Finite State Automata: Lecture 1

What is a Regular Expression

• Notation for specifying a set of strings
• Used for search
• Corpus: text(s) to search through / learn from
• Used to define a (formal) language

Creating a Regular Expression

• Perl notation uses / / around regexes
• Expressions are composed of:

Category            Symbols                    Example          Example Matches
Literal characters                             /the/            the, other (but not The)
Character sets      . [ ] \d \D \w \W \s \S    /[a-zA-Z]/       A, a, t, S, Z (one character per match, so ab is two matches)
Disjunction         |                          /(T|t)he/        The, the
Boundaries          \b \B ^ $ \n \t            /\bthe\b/        the, the. (but not other)
Quantifiers         * + ? { }                  /colou?r/        color, colour
Special characters  \                          /.+\.com/        Yahoo.com
Capturing           ( ) \1                     /(\d{5}).+\1/    the same zip code twice
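
The rows of the table can be tried out directly. The following is a minimal sketch using Python's `re` module; the test strings are made up for illustration, and Python writes patterns as raw strings rather than between slashes.

```python
import re

# Each entry: (pattern from the table, made-up test string).
examples = [
    (r"the", "The other one"),         # literal characters are case-sensitive
    (r"[a-zA-Z]", "ab"),               # a character set matches one character
    (r"\bthe\b", "the other, the."),   # word boundaries exclude "other"
    (r"colou?r", "color colour"),      # ? makes the preceding "u" optional
    (r".+\.com", "Yahoo.com"),         # \. is a literal dot
]

for pattern, text in examples:
    print(pattern, "->", re.findall(pattern, text))

# Capturing: \1 must repeat exactly what the group (\d{5}) matched.
same_zip = re.search(r"(\d{5}).+\1", "95211 and again 95211")
print(same_zip.group(1))
```

`re.findall` lists every non-overlapping match, which makes it easy to see what a pattern does and does not pick up.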

Creating a Regular Expression

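• Defining a regex involves iteratively improving:
  • Accuracy/Precision: minimizing false positives
    • e.g. /the/ → /\bthe\b/
  • Coverage/Recall: minimizing false negatives
    • e.g. /the/ → /(T|t)he/ (written as /T|the/, the pattern would mean "T" or "the", because | binds more loosely than concatenation)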
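
One way to see the trade-off is to run successive versions of the pattern over the same text. This is a small sketch only: the sentence and the helper `words_hit` are made up for illustration.

```python
import re

text = "The other day the cat saw the others bathe."

def words_hit(pattern, text):
    """Hypothetical helper: the words of `text` in which `pattern` matches."""
    return [word for word in text.split() if re.search(pattern, word)]

# Too loose: false positives such as "other" and "bathe.", and "The" is missed.
print(words_hit(r"the", text))

# Word boundaries remove the false positives (better precision).
print(words_hit(r"\bthe\b", text))

# Allowing either case for the first letter recovers "The" (better recall).
print(words_hit(r"\b(T|t)he\b", text))
```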

Using Regular Expressions

• Generally used to search or replace:
• Perl:
    $str = "other people";
    if ($str =~ /the/) …
• Java (the backslash must be doubled inside a Java string literal):
    import java.util.regex.*;
    …
    Pattern r = Pattern.compile("\\d");
    Matcher m = r.matcher("D0es th1s c0nta1n d1g1ts?");
    if (m.find()) …
• Python:
    import re
    searchObj = re.search(r'the', "other people")
    phone = "Tel: 209-867-5309"
    re.sub(r'\d', '#', phone)    # replaces every digit with '#'

References

• Good tutorials and cheat sheets available online:
  • http://regexone.com/lesson
  • http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
  • http://donovanh.com/pages/regex_list.html
• Textbook also has a cheat sheet on its cover

ELIZA (1966)

• Cascading regexes to simulate a Rogerian psychologist
• Available online: http://nlp-addiction.com/eliza/
• Often cited as an illustration of Searle’s “Chinese Room” argument

ELIZA

• Cascading regexes to simulate a Rogerian psychologist
  • s/I'm/YOU ARE/
  • s/(M|m)y/YOUR/

ELIZA

• Cascading regexes to simulate a Rogerian psychologist
  • s/YOU ARE (depressed|sad)/I AM SORRY TO HEAR YOU ARE \1/
  • s/YOU ARE (depressed|sad)/WHY DO YOU THINK THAT YOU ARE \1/

ELIZA

• Cascading regexes to simulate a Rogerian psychologist
  • s/\ball\b/IN WHAT WAY/
  • s/\balways\b/CAN YOU THINK OF A SPECIFIC EXAMPLE/
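
The cascade can be sketched in a few lines of Python. This is an illustration rather than the original ELIZA program: the rule list, the function name `respond`, the straight apostrophe in I'm, and the `.*` padding that makes a keyword rule replace the whole utterance are all assumptions added here.

```python
import re

# Ordered (pattern, replacement) rules adapted from the slides. Earlier
# rewrites feed later ones, which is what makes this a cascade.
RULES = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\b(M|m)y\b", "YOUR"),
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Apply every substitution in order and return the final string."""
    for pattern, replacement in RULES:
        utterance = re.sub(pattern, replacement, utterance)
    return utterance

print(respond("I'm sad"))            # I AM SORRY TO HEAR YOU ARE sad
print(respond("They're all alike"))  # IN WHAT WAY
```

Rule order matters: "I'm" has to be rewritten to "YOU ARE" before the "YOU ARE (depressed|sad)" rule has anything to match.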

Finite State Automata

Finite State Automata (FSAs)

• Regular expressions are a convenient way to describe an FSA:
  • Sheep language: /baa+!/
• FSAs and their probabilistic cousins (Markov models) are used extensively in NLP.
  • Perfectly capture regular languages
  • Capture parts of natural languages: phonology, morphology, syntax.

FSA representation

• States are represented by circles
• q0, or the state whose incoming arrow has no source state: start state
• Double-circled states: final/accepting states
• Directed links: transitions between states
• Imagine a tape holding the input; at each step, try to match the next symbol to a transition:

Formal Representation

• Specify the following:
  • Q = {q0, q1, …, qN-1}: a finite set of N states
  • Σ: a finite input alphabet of symbols (symbols can have internal structure)
  • q0: the start state
  • F: the set of final states, F ⊆ Q
  • δ(q, i): a transition function that maps Q × Σ to Q

Transition Table

• Convenient for computer representation, too:

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅
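
The five components and the table map onto a small data structure. The sketch below is one possible encoding, not a standard one: the class name `FSA`, its field names, and the choice of state 4 as the only accepting state of the sheep-language machine are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FSA:
    states: frozenset      # Q
    alphabet: frozenset    # Σ
    start: int             # q0
    finals: frozenset      # F, a subset of Q
    delta: dict            # δ as (state, symbol) -> state; a missing key is ∅

# Sheep language /baa+!/, one dictionary entry per non-empty table cell.
sheep = FSA(
    states=frozenset(range(5)),
    alphabet=frozenset("ba!"),
    start=0,
    finals=frozenset({4}),
    delta={(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
)

assert sheep.finals <= sheep.states    # F must be a subset of Q
print(sheep.delta.get((3, "a")))       # 3: the self-loop that implements a+
print(sheep.delta.get((0, "a")))       # None: the ∅ cell for state 0, input a
```

Storing only the non-empty cells keeps the table sparse; a lookup that returns `None` means the machine has no legal move.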

D-Recognize

• Deterministic: no choice points
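
A table-driven recognizer for a deterministic FSA can be sketched as follows. The function name `d_recognize`, its default arguments, and the dictionary encoding of the sheep-language table are choices made for this example.

```python
# Transition table for /baa+!/ as (state, symbol) -> next state.
# A missing key plays the role of ∅ in the table.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def d_recognize(tape, table, start=0, finals=frozenset({4})):
    """Return True if the whole tape leads from start to a final state."""
    state = start
    for symbol in tape:
        state = table.get((state, symbol))
        if state is None:        # no transition for this input: reject
            return False
    return state in finals       # accept only if the input ends in a final state

for tape in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(tape, d_recognize(tape, SHEEP))
```

Because there is exactly one possible move per state and symbol, the loop never has to back up: it either advances or rejects.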

Generative Uses

• Any model that recognizes a formal language (FSA, regex, CFG) can be used to generate valid strings.• Starting in q0, select random transitions until reach final state.
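
Random generation from the sheep-language FSA might look like the sketch below. It assumes every non-final state has at least one outgoing transition and stops at the first final state reached; the names `SHEEP` and `generate` are made up for this example.

```python
import random
import re

# Transition table for /baa+!/ as (state, symbol) -> next state.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def generate(table, start, finals, rng=random):
    """Follow randomly chosen transitions from start until a final state."""
    state, symbols = start, []
    while state not in finals:
        # All (symbol, destination) pairs that leave the current state.
        options = [(sym, dst) for (src, sym), dst in table.items() if src == state]
        symbol, state = rng.choice(options)
        symbols.append(symbol)
    return "".join(symbols)

rng = random.Random(0)    # seeded so that a run is repeatable
samples = [generate(SHEEP, 0, {4}, rng) for _ in range(5)]
print(samples)
# Every generated string should be accepted by the equivalent regex.
print(all(re.fullmatch(r"baa+!", s) for s in samples))
```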

Non-Deterministic FSAs

• More than one transition possible for a particular state and input combination:

• Or uses epsilon transitions, where no input characters are read:

Non-Deterministic FSAs

• In an NFSA, there exists at least one path through the machine for any string in the language defined by the machine.

• Not all paths directed through the machine for an acceptable string lead to an accept state.

• No paths through the machine lead to an accept state for a string not in the language.

• Challenge: what to do if we make the wrong transition choice?

Resolving Non-Determinism

• Backup: when we reach a choice point, mark the state and input position (search-state), then roll back to it if needed.

• Look-Ahead: Look at the following input symbols to try to choose the correct transition.

• Parallelism: Follow each of the transition options in parallel.

• Convert: All NFSAs can be converted to an equivalent deterministic FSA.
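
The parallelism option can be sketched by tracking the set of all states the machine could be in after each symbol. The table below is an assumed encoding of a sheep-language NFSA with two choices on "a" in state 2; the empty string stands for an ε-transition, and the function names are made up for this example.

```python
# NFSA table: (state, symbol) -> set of next states; "" marks an ε-transition.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},    # the choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}

def eps_closure(states, table):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in table.get((stack.pop(), ""), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def parallel_recognize(tape, table, start, finals):
    """Advance every possible current state at once, one symbol at a time."""
    current = eps_closure({start}, table)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= table.get((state, symbol), set())
        current = eps_closure(moved, table)
        if not current:              # every path has died: reject
            return False
    return bool(current & finals)    # accept if any surviving path is final

print(parallel_recognize("baaa!", NFSA, 0, {4}))   # True
print(parallel_recognize("ba!", NFSA, 0, {4}))     # False
```

The sets of states that appear in `current` are the states of the equivalent deterministic machine, which is the idea behind the conversion option as well.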

Backup

• Need to modify the transition table:
  • Add an epsilon (ε) transition column
  • Allow multiple destination states for a given search-state.

         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4        ∅    ∅      ∅    ∅

         Input
State    b    a    !    ε
0        1    ∅    ∅    ∅
1        ∅    2    ∅    ∅
2        ∅    3    ∅    ∅
3        ∅    ∅    4    2
4        ∅    ∅    ∅    ∅

NFSA Search: BFS or DFS

Keep a stack (depth-first search) or queue (breadth-first search) of search-states remaining to explore.
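
An agenda-based search over search-states, meaning (state, input position) pairs, can be sketched as below. The table encodes a sheep-language NFSA with an ε-arc from state 3 back to state 2, which is an assumption of this example, as are the function name and the `seen` set used to avoid revisiting a search-state.

```python
from collections import deque

# NFSA table: (state, symbol) -> set of next states; "" marks an ε-transition.
NFSA_EPS = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, ""): {2},    # ε-arc back to state 2, giving the "a+" loop
}

def nd_recognize(tape, table, start, finals, depth_first=True):
    """Search (state, position) pairs; stack gives DFS, queue gives BFS."""
    agenda = deque([(start, 0)])
    seen = {(start, 0)}
    while agenda:
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape) and state in finals:
            return True
        # ε-moves keep the input position; ordinary moves consume one symbol.
        successors = [(nxt, pos) for nxt in table.get((state, ""), ())]
        if pos < len(tape):
            successors += [(nxt, pos + 1)
                           for nxt in table.get((state, tape[pos]), ())]
        for search_state in successors:
            if search_state not in seen:
                seen.add(search_state)
                agenda.append(search_state)
    return False    # agenda exhausted without reaching an accepting search-state

for tape in ["baa!", "baaa!", "ba!"]:
    print(tape, nd_recognize(tape, NFSA_EPS, 0, {4}),
          nd_recognize(tape, NFSA_EPS, 0, {4}, depth_first=False))
```

Both search orders give the same answer; they differ only in the order in which search-states are explored.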

Computing Theory …

• You may recall from (or learn in) COMP 147:

The class of languages definable by regular expressions is the same as the class definable by FSAs. These are called regular languages.

Your Turn …

• Lab 1: Regular Expression Practice

• Project 1: ELIZA reborn