regular expressions chapter 6. regular languages regular language regular expression finite state...

Regular Expressions Chapter 6 Slide 2 Regular Languages Regular Language Regular Expression Finite State Machine L Accepts Slide 3 Regular Expressions The regular expressions over an alphabet are all and only the strings that can be obtained as follows: 1. is a regular expression. 2. is a regular expression. 3. Every element of is a regular expression. 4. If , are regular expressions, then so is . 5. If , are regular expressions, then so is . 6. If is a regular expression, then so is *. 7. is a regular expression, then so is +. 8. If is a regular expression, then so is ( ). Slide 4 Regular Expression Examples If = { a, b }, the following are regular expressions: a ( a b )* abba Slide 5 Regular Expressions Define Languages Define L, a semantic interpretation function for regular expressions: 1. L( ) = . 2. L( ) = { }. 3. L(c), where c = {c}. 4. L( ) = L( ) L( ). 5. L( ) = L( ) L( ). 6. L( *) = (L( ))*. 7. L( + ) = L( *) = L( ) (L( ))*. If L( ) is equal to , then L( + ) is also equal to . Otherwise L( + ) is the language that is formed by concatenating together one or more strings drawn from L( ). 8. L(( )) = L( ). Slide 6 The Role of the Rules Rules 1, 3, 4, 5, and 6 give the language its power to define sets. Rule 8 has as its only role grouping other operators. Rules 2 and 7 appear to add functionality to the regular expression language, but they dont. 2. is a regular expression. 7. is a regular expression, then so is +. Slide 7 Analyzing a Regular Expression L(( a b )* b ) = L(( a b )*) L( b ) = (L(( a b )))* L( b ) = (L( a ) L( b ))* L( b ) = ({ a } { b })* { b } = { a, b }* { b }. Slide 8 Examples L( a * b * ) = L( ( a b )* ) = L( ( a b )* a * b * ) = L( ( a b )* abba ( a b )* ) = Slide 9 Going the Other Way L = {w { a, b }*: |w| is even} Slide 10 Going the Other Way L = {w { a, b }*: |w| is even} ( a b ) ( a b ))* ( aa ab ba bb )* Slide 11 Going the Other Way L = {w { a, b }*: |w| is even} ( a b ) ( a b ))* ( aa ab ba bb )* L = {w { a, b }*: w contains an odd number of a s} Slide 12 Going the Other Way L = {w { a, b }*: |w| is even} ( a b ) ( a b ))* ( aa ab ba bb )* L = {w { a, b }*: w contains an odd number of a s} b * ( ab * ab *)* a b * b * a b * ( ab * ab *)* Slide 13 More Regular Expression Examples L ( ( aa *) ) = L ( ( a )* ) = L = {w { a, b }*: there is no more than one b in w} L = {w { a, b }* : no two consecutive letters in w are the same} Slide 14 Common Idioms ( ) ( a b )* Slide 15 Operator Precedence in Regular Expressions RegularArithmeticExpressions HighestKleene starexponentiation concatenationmultiplication Lowestunionaddition a b * c d *x y 2 + i j 2 Slide 16 The Details Matter a * b * ( a b )* ( ab )* a * b * Slide 17 The Details Matter L 1 = {w { a, b }* : every a is immediately followed a b } A regular expression for L 1 : A FSM for L 1 : L 2 = {w { a, b }* : every a has a matching b somewhere} A regular expression for L 2 : A FSM for L 2 : Slide 18 Kleenes Theorem Finite state machines and regular expressions define the same class of languages. To prove this, we must show: Theorem: Any language that can be defined with a regular expression can be accepted by some FSM and so is regular. Theorem: Every regular language (i.e., every language that can be accepted by some DFSM) can be defined with a regular expression. Slide 19 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : Slide 20 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : Slide 21 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : A single element of : Slide 22 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : A single element of : Slide 23 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : A single element of : ( *): Slide 24 For Every Regular Expression There is a Corresponding FSM Well show this by construction. An FSM for: : A single element of : ( *): Slide 25 Union If is the regular expression and if both L( ) and L( ) are regular: Slide 26 Concatenation If is the regular expression and if both L( ) and L( ) are regular: Slide 27 Kleene Star If is the regular expression * and if L( ) is regular: Slide 28 An Example (b ab )* An FSM for b An FSM for a An FSM for b An FSM for ab : Slide 29 An Example ( b ab )* An FSM for ( b ab ): Slide 30 An Example ( b ab )* An FSM for ( b ab )*: Slide 31 The Algorithm regextofsm regextofsm( : regular expression) = Beginning with the primitive subexpressions of and working outwards until an FSM for all of has been built do: Construct an FSM as described above. Slide 32 For Every FSM There is a Corresponding Regular Expression Well show this by construction. The key idea is that well allow arbitrary regular expressions to label the transitions of an FSM. Slide 33 A Simple Example Let M be: Suppose we rip out state 2: Slide 34 The Algorithm fsmtoregexheuristic fsmtoregexheuristic(M: FSM) = 1. Remove unreachable states from M. 2. If M has no accepting states then return . 3. If the start state of M is part of a loop, create a new start state s and connect s to Ms start state via an -transition. 4. If there is more than one accepting state of M or there are any transitions out of any of them, create a new accepting state and connect each of Ms accepting states to it via an -transition. The old accepting states no longer accept. 5. If M has only one state then return . 6. Until only the start state and the accepting state remain do: 6.1 Select rip (not s or an accepting state). 6.2 Remove rip from M. 6.3 *Modify the transitions among the remaining states so M accepts the same strings. 7. Return the regular expression that labels the one remaining transition from the start state to the accepting state. Slide 35 An Example 1. Create a new initial state and a new, unique accepting state, neither of which is part of a loop. Slide 36 An Example, Continued 2. Remove states and arcs and replace with arcs labelled with larger and larger regular expressions. Slide 37 An Example, Continued Remove state 3: Slide 38 An Example, Continued Remove state 2: Slide 39 An Example, Continued Remove state 1: Slide 40 When Its Hard M = Slide 41 When Its Hard A regular expression for M: a ( aa )* ( aa )* b ( b ( aa )* b )* ba ( aa )* [ a ( aa )* b ( b a ( aa )* b ) ( b ( aa )* b )* ( a ba ( aa )* b )] [ b ( aa )* b ( a b ( aa )* ab ) ( b ( aa )* b )* ( a ba ( aa )* b )]* [ b ( aa )* ( a b ( aa )* ab ) ( b ( aa )* b )* ba ( aa )*] Slide 42 Further Modifications to M Before We Start We require that, from every state other than the accepting state there must be exactly one transition to every state (including itself) except the start state. And into every state other than the start state there must be exactly one transition from every state (including itself) except the accepting state. 1. If there is more than one transition between states p and q, collapse them into a single transition: becomes: Slide 43 Further Modifications to M Before We Start 2. If any of the required transitions are missing, add them: becomes: Slide 44 Ripping Out States 3. Choose a state. Rip it out. Restore functionality. Suppose we rip state 2. Slide 45 What Happens When We Rip? Consider any pair of states p and q. Once we remove rip, how can M get from p to q? It can still take the transition that went directly from p to q, or It can take the transition from p to rip. Then, it can take the transition from rip back to itself zero or more times. Then it can take the transition from rip to q. Slide 46 Defining R(p, q) After removing rip, the new regular expression that should label the transition from p to q is: R(p, q) /* Go directly from p to q /* or R(p, rip) /* Go from p to rip, then R(rip, rip)* /* Go from rip back to itself any number of times, then R(rip, q) /* Go from rip to q Without the comments, we have: R = R(p, q) R(p, rip) R(rip, rip)* R(rip, q) Slide 47 Returning to Our Example R = R(p, q) R(p, rip) R(rip, rip)* R(p, rip) Let rip be state 2. Then: R (1, 3) = R(1, 3) R(1, rip)R(rip, rip)*R(rip, 3) = R(1, 3) R(1, 2)R(2, 2)*R(2, 3) = a b * a = ab * a Slide 48 The Algorithm fsmtoregex fsmtoregex(M: FSM) = 1. M = standardize(M: FSM). 2. Return buildregex(M). standardize(M: FSM) = 1. Remove unreachable states from M. 2. If necessary, create a new start state. 3. If necessary, create a new accepting state. 4. If there is more than one transition between states p and q, collapse them. 5. If any transitions are missing, create them with label . Slide 49 The Algorithm fsmtoregex buildregex(M: FSM) = 1. If M has no accepting states then return . 2. If M has only one state, then return . 3. Until only the start and accepting states remain do: 3.1 Select some state rip of M. 3.2 For every transition from p to q, if both p and q are not rip then do Compute the new label R for the transition from p to q: R (p, q) = R(p, q) R(p, rip) R(rip, rip)* R(rip, q) 3.3 Remove rip and all transitions into and out of it. 4. Return the regular expression that labels the transition from the start state to the accepting state. Slide 50 A Special Case of Pattern Matching Suppose that we want to match a pattern that is composed of a set of keywords. Then we can write a regular expression of the form: ( * (k 1 k 2 k n ) *) + For example, suppose we want to match: * finite state machine FSM finite state automaton * We can use regextofsm to build an FSM. But We can instead use buildkeywordFSM. Slide 51 {cat, bat, cab} The single keyword cat: Slide 52 {cat, bat, cab} Adding bat : Slide 53 {cat, bat, cab} Adding cab : Slide 54 A Biology Example BLAST Given a protein or DNA sequence, find others that are likely to be evolutionarily close to it. ESGHDTTTYYNKNRYPAGWNNHHDQMFFWV Build a DFSM that can examine thousands of other sequences and find those that match any of the selected patterns. Slide 55 Regular Expressions in Perl SyntaxNameDescription abcConcatenationMatches a, then b, then c, where a, b, and c are any regexs a | b | cUnion (Or)Matches a or b or c, where a, b, and c are any regexs a*a*Kleene starMatches 0 or more as, where a is any regex a+a+At least oneMatches 1 or more as, where a is any regex a?a?Matches 0 or 1 as, where a is any regex a{n, m}ReplicationMatches at least n but no more than m as, where a is any regex a*?ParsimoniousTurns off greedy matching so the shortest match is selected a+? .Wild cardMatches any character except newline ^Left anchorAnchors the match to the beginning of a line or string $Right anchorAnchors the match to the end of a line or string [a-z][a-z] Assuming a collating sequence, matches any single character in range [^ a - z ] Assuming a collating sequence, matches any single character not in range \d\d Digit Matches any single digit, i.e., string in [ 0 - 9 ] \D\D Nondigit Matches any single nondigit character, i.e., [^ 0 - 9 ] \w\w Alphanumeric Matches any single word character, i.e., [ a - zA - Z0 - 9 ] \W\W Nonalphanumeric Matches any character in [^ a - zA - Z0 - 9 ] \s\s White spaceMatches any character in [space, tab, newline, etc.] Slide 56 SyntaxNameDescription \S\S Nonwhite space Matches any character not matched by \ s \n\n NewlineMatches newline \r\r ReturnMatches return \t\t TabMatches tab \f\f FormfeedMatches formfeed \b\b BackspaceMatches backspace inside [] \b\b Word boundaryMatches a word boundary outside [] \B\B Nonword boundaryMatches a non-word boundary \0\0 NullMatches a null character \ nnn OctalMatches an ASCII character with octal value nnn \ x nn HexadecimalMatches an ASCII character with hexadecimal value nn \cX\cX ControlMatches an ASCII control character \ char QuoteMatches char ; used to quote symbols such as. and \ (a)(a)StoreMatches a, where a is any regex, and stores the matched string in the next variable \1VariableMatches whatever the first parenthesized expression matched \2Matches whatever the second parenthesized expression matched For all remaining variables Regular Expressions in Perl Slide 57 Using Regular Expressions in the Real World Matching numbers: -? ([0-9]+(\.[0-9]*)? | \.[0-9]+) Matching ip addresses: S! ([0-9]{1,3} (\. [0-9] {1,3}){3}) ! $1 ! Finding doubled words: \ From Friedl, J., Mastering Regular Expressions, OReilly,1997. Slide 58 More Regular Expressions Identifying spam: \badv$?ert$?\b Trawl for email addresses: \b[A-Za-z0-9_%-]+@[A-Za-z0-9_%-]+ (\.[A-Za- z]+){1,4}\b Slide 59 Using Substitution Building a chatbot: On input: is the chatbot will reply: Why is ? Slide 60 Chatbot Example The food there is awful Why is the food there awful? Assume that the input text is stored in the variable $text : $text =~ s/^([A-Za-z]+)\sis\s([A-Za-z]+)\.?$/ Why is \1 \2?/ ; Slide 61 Simplifying Regular Expressions Regexs describe sets: Union is commutative: = . Union is associative: ( ) = ( ). is the identity for union: = = . Union is idempotent: = . Concatenation: Concatenation is associative: ( ) = ( ). is the identity for concatenation: = = . is a zero for concatenation: = = . Concatenation distributes over union: ( ) = ( ) ( ). ( ) = ( ) ( ). Kleene star: * = . * = . ( *)* = *. * * = *. ( )* = ( * *)*.

regular expressions chapter 6. regular languages regular language regular expression finite state...

Documents