Regular Expressions: Closure and Decision Properties, and Applications

Closure properties

Closure properties are theorems showing that the class of regular languages is closed under a given operation. The theorems have the form "if certain languages are regular, and a language L is formed from them by a certain operation such as union or intersection, then L is also regular". In general, closure properties express the fact that when one (or several) languages are regular, certain related languages are also regular.

The principal closure properties of regular languages are:

1. The union of two regular languages is regular.

If L and M are regular languages, then so is L ∪ M.

2. The intersection of two regular languages is regular.

If L and M are regular languages, then so is L ∩ M.

3. The complement of a regular language is regular.

If L is a regular language over alphabet Σ, then Σ* - L is also a regular language.

4. The difference of two regular languages is regular.

If L and M are regular languages, then so is L - M.

5. The reversal of a regular language is regular.

The reversal of a string is the string written backward, e.g. the reversal of abcde is edcba.

The reversal of a language L, written L^R, is the language consisting of the reversals of all its strings, e.g. if L = {001, 110} then L^R = {100, 011}.

6. The closure (star) of a regular language is regular.

If L is a regular language, then so is L*.

7. The concatenation of regular languages is regular.

If L and M are regular languages, then so is L M.

8. The homomorphic image of a regular language is regular.

A homomorphism is a substitution of a string for each symbol. Let the function h be defined by h(0) = a and h(1) = b; then h applied to 0011 is simply aabb.

If h is a homomorphism on alphabet Σ and w = abcd…z is a string of symbols, then h(w) = h(a) h(b) h(c) h(d) … h(z).

The mathematical definition of a homomorphism is

h: Σ* → Γ* such that ∀ x, y ∈ Σ*, h(xy) = h(x) h(y)

A homomorphism can also be applied to a language by applying it to each of the strings in the language. Let L be a language over alphabet Σ and let h be a homomorphism on Σ; then

h(L) = { h(w) | w is in L }

The theorem can be stated as "If L is a regular language over alphabet Σ, and h is a homomorphism on Σ, then h(L) is also regular".

9. The inverse homomorphic image of a regular language is regular.

Suppose h is a homomorphism from some alphabet Σ to strings in another alphabet T, and L is a language over T. Then the inverse homomorphic image of L, written h⁻¹(L), is the set of strings w in Σ* such that h(w) is in L.

The theorem states that "If h is a homomorphism from alphabet Σ to alphabet T, and L is a regular language over T, then h⁻¹(L) is also a regular language".
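The two homomorphism properties can be tried out on small inputs. The sketch below is only an illustration: the function names, the second homomorphism g, and the two-state DFA are assumptions made for the example, not part of the notes. It applies h to strings and to a finite sample of a language, and builds the transitions of a DFA for an inverse image by letting each symbol a act like the string it maps to.

```python
def apply_hom(h, w):
    """Apply homomorphism h (dict: symbol -> string) to the string w."""
    return "".join(h[c] for c in w)

def hom_language(h, language):
    """h(L) = { h(w) | w in L }, for a finite sample of L."""
    return {apply_hom(h, w) for w in language}

def run(delta, state, w):
    """Extended transition function: follow the string w from state."""
    for c in w:
        state = delta[(state, c)]
    return state

def inverse_hom_transitions(delta, states, h):
    """Transitions of a DFA for the inverse image of L(M):
    on symbol a it moves as the original DFA moves on the string h(a)."""
    return {(q, a): run(delta, q, h[a]) for q in states for a in h}

# The homomorphism from the text: h(0) = a, h(1) = b.
h = {"0": "a", "1": "b"}
print(apply_hom(h, "0011"))  # aabb

# Hypothetical DFA over {a, b} accepting the strings that end in b.
states, start, finals = {"s", "t"}, "s", {"t"}
delta = {("s", "a"): "s", ("s", "b"): "t",
         ("t", "a"): "s", ("t", "b"): "t"}

# Hypothetical homomorphism g with g(0) = ab and g(1) = empty string.
g = {"0": "ab", "1": ""}
inv_delta = inverse_hom_transitions(delta, states, g)
```

With these assumptions the new DFA keeps the same states, start state and final states, and it accepts a string w over {0, 1} exactly when g(w) ends in b, which here means w contains at least one 0.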

Figure: Homomorphism applied in forward direction.

Figure: Homomorphism applied in inverse direction.
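Union, intersection, difference and complement from the list above can be made concrete with the usual product construction on DFAs. The following is a minimal sketch, assuming both DFAs are complete and share one alphabet; the tuple layout (states, alphabet, delta, start, finals) and the two example automata are assumptions chosen for illustration.

```python
def product_dfa(d1, d2, combine):
    """Product of two complete DFAs over the same alphabet.
    combine(in_F1, in_F2) decides which state pairs are final."""
    states1, alphabet, delta1, start1, finals1 = d1
    states2, _, delta2, start2, finals2 = d2
    states = {(p, q) for p in states1 for q in states2}
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states
              if combine(p in finals1, q in finals2)}
    return states, alphabet, delta, (start1, start2), finals

def complement(d):
    """Swap final and non-final states of a complete DFA."""
    states, alphabet, delta, start, finals = d
    return states, alphabet, delta, start, states - finals

def accepts(d, w):
    """Run the DFA on w and report whether it ends in a final state."""
    _, _, delta, state, finals = d
    for c in w:
        state = delta[(state, c)]
    return state in finals

# Hypothetical languages over {0, 1}:
# even0 = strings with an even number of 0s, end1 = strings ending in 1.
even0 = ({"e", "o"}, {"0", "1"},
         {("e", "0"): "o", ("e", "1"): "e", ("o", "0"): "e", ("o", "1"): "o"},
         "e", {"e"})
end1 = ({"n", "y"}, {"0", "1"},
        {("n", "0"): "n", ("n", "1"): "y", ("y", "0"): "n", ("y", "1"): "y"},
        "n", {"y"})

union = product_dfa(even0, end1, lambda x, y: x or y)
inter = product_dfa(even0, end1, lambda x, y: x and y)
diff = product_dfa(even0, end1, lambda x, y: x and not y)
```

Only the choice of final pairs differs between the three operations, which is why one helper with a combine function is enough here.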

Decision Properties of Regular Languages

Testing Emptiness of Regular Languages

Suppose R is a regular expression. The base cases are immediate: ∅ denotes the empty language, while ε and a single symbol denote non-empty languages. There are four cases to consider, corresponding to the ways that R could be constructed.

R = R1 + R2: L(R) is empty if and only if both L(R1) and L(R2) are empty.

R = R1R2: L(R) is empty if and only if either L(R1) or L(R2) is empty.

R = R1*: L(R) is not empty; it always includes at least ε.

R = (R1): L(R) is empty if and only if L(R1) is empty, since they are the same language.

Testing Membership in a Regular Language

We have a regular language and an input string w, and we want to check whether the language contains w. If the regular language is represented by the finite automaton (Q, ∑, δ, q0, F), then w is in the language if and only if δ^(q0, w) ∈ F, where δ^ is the transition function extended to strings.
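Both tests translate directly into short programs. The sketch below assumes a regular expression is given as a nested tuple (an encoding chosen only for this example; parentheses need no case of their own because grouping is already expressed by the nesting) and a DFA is given by a transition dictionary. The example DFA, which accepts strings ending in 01, is hypothetical.

```python
def is_empty(r):
    """Decide whether L(r) is empty by recursion on the expression."""
    kind = r[0]
    if kind == "empty":             # the expression for the empty language
        return True
    if kind in ("eps", "sym"):      # epsilon or a single symbol
        return False
    if kind == "union":             # R1 + R2
        return is_empty(r[1]) and is_empty(r[2])
    if kind == "concat":            # R1 R2
        return is_empty(r[1]) or is_empty(r[2])
    if kind == "star":              # R1* always contains epsilon
        return False
    raise ValueError(f"unknown node: {kind}")

def accepts(delta, start, finals, w):
    """Membership test: compute the extended transition function on w."""
    state = start
    for c in w:
        state = delta[(state, c)]
    return state in finals

EMPTY, EPS = ("empty",), ("eps",)
a = ("sym", "a")

# Hypothetical DFA over {0, 1} for the strings that end in 01.
delta = {("A", "0"): "B", ("A", "1"): "A",
         ("B", "0"): "B", ("B", "1"): "C",
         ("C", "0"): "B", ("C", "1"): "A"}

print(is_empty(("concat", a, EMPTY)))      # True
print(is_empty(("star", EMPTY)))           # False
print(accepts(delta, "A", {"C"}, "1101"))  # True
```

The membership test takes time proportional to the length of w, since each symbol costs one dictionary lookup.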

Myhill-Nerode Theorem

The Myhill-Nerode theorem underlies minimization, which eliminates useless states. The theorem says the following three statements are equivalent:

1) The set L ⊆ ∑* is accepted by some FA. (We know this means L is a regular language.)

2) L is the union of some of the equivalence classes of a right invariant (with respect to concatenation) equivalence relation of finite index.

3) Let the equivalence relation RL be defined by: x RL y if and only if for all z in ∑*, xz is in L exactly when yz is in L. Then RL is of finite index.
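Statement 3 can be explored experimentally. The sketch below is a bounded approximation, not a decision procedure: it compares strings only on suffixes z up to a fixed length, so it can merge classes that a longer suffix would separate, and it only looks at prefixes up to a fixed length. The language (binary strings ending in 01) and all function names are assumptions made for the example.

```python
from itertools import product

def strings_up_to(alphabet, n):
    """All strings over alphabet of length 0..n, shortest first."""
    for k in range(n + 1):
        for symbols in product(alphabet, repeat=k):
            yield "".join(symbols)

def approx_classes(in_lang, alphabet, max_prefix, max_suffix):
    """Group strings x by which suffixes z put xz into the language.
    Strings with the same signature are not separated by any tested z."""
    suffixes = list(strings_up_to(alphabet, max_suffix))
    classes = {}
    for x in strings_up_to(alphabet, max_prefix):
        signature = tuple(in_lang(x + z) for z in suffixes)
        classes.setdefault(signature, []).append(x)
    return list(classes.values())

def ends_in_01(w):
    """Hypothetical example language: binary strings ending in 01."""
    return w.endswith("01")

classes = approx_classes(ends_in_01, "01", max_prefix=3, max_suffix=2)
for members in classes:
    print(members)
print(len(classes))  # 3
```

For this language the three groups correspond to "no useful suffix seen", "last symbol was 0" and "ends in 01", which matches the three states of the smallest DFA for it.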

The notation RL means an equivalence relation R defined by the language L. The notation RM means an equivalence relation R defined by a machine M. We know that for every regular language L there is a machine M that accepts exactly the strings in L.

Think of an equivalence relation as being true or false for a specific pair of strings x and y. Thus xRy is true for some set of pairs x and y. We will use a relation R such that:

xRy <=> yRx: x has a relation to y if and only if y has the same relation to x. This is known as symmetric.

xRy and yRz implies xRz. This is known as transitive.

xRx is true. This is known as reflexive.

Our RL is defined by x RL y <=> for all z in ∑* (xz in L <=> yz in L).

Our RM is defined by x RM y <=> δ(q0, x) = δ(q0, y). RM is right invariant: if x RM y then xz RM yz for all z in ∑*, because δ(q0, xz) = δ(δ(q0, x), z) = δ(δ(q0, y), z) = δ(q0, yz) for x, y and z strings in ∑*.

RM divides the set ∑* into equivalence classes, one class for each state reachable in M from the starting state q0. To recover L from these classes we take only the classes of the reachable final states of M. From this theorem it follows that every regular language has a smallest FA, i.e. one with the fewest states.

Table filling algorithm

Start with a machine M = (Q, ∑, δ, q0, F) as usual. Remove from Q, F and δ all states that cannot be reached from q0. Remember that a DFA is a directed graph with states as nodes, so a depth-first search can mark all the reachable states. The unreachable states, if any, are then eliminated and the algorithm proceeds.

1) For p in F and q in Q-F put an "X" in the table at (p, q). This is the initialization step. Do not write over dashes (matrix locations that are not used); these locations will never change. An X or x at (p, q) in the matrix means states p and q are distinct in the minimum machine.

2) Take a string w (preferably a single character or ε). Apply it starting from two states p and q. If, after processing w, exactly one of the two states reached is in F, put an "X" in the table at (p, q). The states p and q are then distinguishable.

3) After all pairs (p, q) are checked, follow the recursive rule: if applying w to the states (p, q) leads to (r, s) respectively, where r and s are both in F or both not in F, but (r, s) was earlier proved to be distinguishable, then put an "X" in the table at (p, q).

As an example, take the machine with

Q = {q0, q1, q2, q3, q4, q5, q6, q7, q8}
∑ = {a, b}
start state q0
F = {q2, q3, q5, q6}

note Q-F = {q0, q1, q4, q7, q8}

δ     a     b
q0    q1    q4
q1    q2    q3
q2    q7    q8
q3    q8    q7
q4    q5    q6
q5    q7    q8
q6    q7    q8
q7    q7    q7
q8    q8    q8

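Now build the table, labeling the rows q1, ..., q8 and the columns q0, ..., q7, so that every pair of distinct states has exactly one entry. Initially no pair is marked (shown here as .).

q1  .
q2  .  .
q3  .  .  .
q4  .  .  .  .
q5  .  .  .  .  .
q6  .  .  .  .  .  .
q7  .  .  .  .  .  .  .
q8  .  .  .  .  .  .  .  .
    q0 q1 q2 q3 q4 q5 q6 q7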

Now fill in for step 1) the pairs (p, q) such that p is in F and q is in Q-F:

{ (q2, q0), (q2, q1), (q2, q4), (q2, q7), (q2, q8), (q3, q0), (q3, q1), (q3, q4), (q3, q7), (q3, q8), (q5, q0), (q5, q1), (q5, q4), (q5, q7), (q5, q8), (q6, q0), (q6, q1), (q6, q4), (q6, q7), (q6, q8) }

This gives the following table (a . is a pair that is not yet marked):

q1  .
q2  X  X
q3  X  X  .
q4  .  .  X  X
q5  X  X  .  .  X
q6  X  X  .  .  X  .
q7  .  .  X  X  .  X  X
q8  .  .  X  X  .  X  X  .
    q0 q1 q2 q3 q4 q5 q6 q7

Now fill in more x's by checking all the cases in step 2 and applying step 3. Finish by filling in the blank table locations with "O". For example, (r, s) = (δ(q0, a), δ(q1, a)), so r = q1 and s = q2. Note that (q1, q2) has an X, thus (q0, q1) gets an "x".

q1  x
q2  X  X
q3  X  X  O
q4  x  O  X  X
q5  X  X  O  O  X
q6  X  X  O  O  X  O
q7  x  x  X  X  x  X  X
q8  x  x  X  X  x  X  X  O
    q0 q1 q2 q3 q4 q5 q6 q7

The "O" at (q1, q4) means {q1, q4} is a state in the minimum machine. The "O" for (q2, q3), (q2, q5) and (q2, q6) means they are one state {q2, q3, q5, q6} in the minimum machine. Many other "O" entries just confirm this. The "O" at (q7, q8) means {q7, q8} is one state in the minimum machine.

The resulting minimum machine is M' = (Q′, ∑, δ′, q0′, F′) with

Q′ = { {q0}, {q1,q4}, {q2,q3,q5,q6}, {q7,q8} } (four states)
F′ = { {q2,q3,q5,q6} } (only one final state)
q0′ = {q0}

δ′               a                b
{q0}             {q1,q4}          {q1,q4}
{q1,q4}          {q2,q3,q5,q6}    {q2,q3,q5,q6}
{q2,q3,q5,q6}    {q7,q8}          {q7,q8}
{q7,q8}          {q7,q8}          {q7,q8}
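The worked example can be checked mechanically. The following sketch encodes the nine-state machine from the example and runs a straightforward version of table filling: mark the final/non-final pairs, then keep marking pairs whose successors on some symbol are already marked, until nothing changes. The function names and the dictionary encoding are assumptions made for this illustration, and the loop is written for clarity rather than efficiency.

```python
from itertools import combinations

alphabet = "ab"
finals = {"q2", "q3", "q5", "q6"}
delta = {
    "q0": {"a": "q1", "b": "q4"}, "q1": {"a": "q2", "b": "q3"},
    "q2": {"a": "q7", "b": "q8"}, "q3": {"a": "q8", "b": "q7"},
    "q4": {"a": "q5", "b": "q6"}, "q5": {"a": "q7", "b": "q8"},
    "q6": {"a": "q7", "b": "q8"}, "q7": {"a": "q7", "b": "q7"},
    "q8": {"a": "q8", "b": "q8"},
}

def reachable(delta, start):
    """Depth-first search for the states reachable from start."""
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for r in delta[p].values():
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def table_filling(states, alphabet, delta, finals):
    """Return the set of distinguishable (marked) pairs of states."""
    # Step 1: a final and a non-final state are distinguishable.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if (p in finals) != (q in finals)}
    # Steps 2 and 3: propagate marks backwards until a fixed point.
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            for a in alphabet:
                r, s = delta[p][a], delta[q][a]
                if r != s and frozenset((r, s)) in marked:
                    marked.add(frozenset((p, q)))
                    changed = True
                    break
    return marked

def merge_classes(states, marked):
    """Group states whose pair was never marked (the "O" entries)."""
    classes = []
    for p in states:
        for c in classes:
            if frozenset((p, c[0])) not in marked:
                c.append(p)
                break
        else:
            classes.append([p])
    return classes

live = reachable(delta, "q0")
states = [q for q in sorted(delta) if q in live]
marked = table_filling(states, alphabet, delta, finals)
classes = merge_classes(states, marked)
print(classes)
# [['q0'], ['q1', 'q4'], ['q2', 'q3', 'q5', 'q6'], ['q7', 'q8']]
```

The four groups printed are the four states of the minimum machine M' given above.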


Note: Fill in the first column of states first. Check that every state occurs in some set and in only one set. Since this is a DFA, the next columns must use exactly the state names found in the first column, e.g. q0 with input "a" goes to q1, but q1 is now {q1,q4}.

At the heart of the algorithm is the following: the sets Q-F and F are disjoint, thus the pairs of states in (Q-F) × F are distinguishable and are marked X. For pairs of states (p, q) and (r, s) where r = δ(p, a) and s = δ(q, a): if r is distinguishable from s, then p is distinguishable from q, thus mark (p, q) with an x.

Testing equivalence of regular languages

To test the equivalence of two regular languages we can make use of the table-filling algorithm. First convert each regular expression to a DFA. Imagine a single DFA whose states are the union of the states of the two DFAs obtained from the regular expressions. This DFA has two start states; either of them can be taken as the start state of the new DFA. Now test the two start states using the table-filling algorithm. If they are equivalent, we can conclude that the regular expressions are also equivalent. The following example makes the concept clear. Consider the following DFAs having the regular expressions

Let's imagine that these represent a single DFA with states A to E (A is taken as the start state). Applying the table-filling algorithm we get:

The table shows that the two start states A and C are equivalent, so we conclude that the two DFAs accept the same language, i.e. the regular languages are equivalent.

Can we apply the table-filling algorithm to minimize every NFA? The following example shows that we cannot:

Applying the table-filling algorithm:

State C is redundant, but this cannot be concluded from the table. So an NFA cannot, in general, be minimized using the table-filling algorithm.

Minimisation of Finite State Automata

Algorithm

The following algorithm generates a total FSA equivalent to the one we start off with but with the least possible number of states. Note, however, that useless and unreachable states would first have to be removed for the algorithm to work. Also, the finite state automaton must be deterministic.

1. First make the FSA total (see the relevant algorithm);

2. Relabel the nodes 1, 2, ... n (where n is therefore the number of states);

3. Construct a table such that the entry (i, j) is TRUE if one of the states i, j is final while the other is not. The entry is FALSE otherwise (both i, j are final or both are non-final states);

4. Proceed with the table construction: if there is an input a such that state i with input a leads to a state i′, state j with input a leads to a state j′, and the entry (i′, j′) is TRUE, then mark the entry (i, j) TRUE. Otherwise, the entry (i, j) remains marked FALSE. Apply this rule repeatedly until no entry changes;

5. State i is indistinguishable from state j if and only if (i, j) is FALSE. Join together the indistinguishable states.
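Step 1 refers to an algorithm that is not given here. A common way to make a deterministic FSA total, assumed in the sketch below, is to add one non-final sink state and send every missing transition to it. The function name, the sink label and the small example automaton are hypothetical.

```python
def make_total(states, alphabet, delta, sink="sink"):
    """Return (states, delta) in which every (state, symbol) pair has
    a transition. Missing transitions go to a new non-final sink state.
    Assumes the label in `sink` is not already used as a state name."""
    new_delta = dict(delta)
    missing = [(q, a) for q in states for a in alphabet
               if (q, a) not in delta]
    if not missing:
        return set(states), new_delta   # already total, nothing to add
    for q, a in missing:
        new_delta[(q, a)] = sink
    for a in alphabet:
        new_delta[(sink, a)] = sink     # the sink loops on every symbol
    return set(states) | {sink}, new_delta

# Hypothetical partial DFA accepting only the string "ab".
states = {0, 1, 2}
alphabet = "ab"
delta = {(0, "a"): 1, (1, "b"): 2}

total_states, total_delta = make_total(states, alphabet, delta)
print(len(total_states), len(total_delta))  # 4 8
```

Because the sink is not final, the accepted language is unchanged; the later steps then have a defined successor for every pair (i, j) and every input.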

Applications of Regular Expressions

1. Regular expressions in Unix

In the UNIX operating system, various commands use an extended regular-expression language that provides shorthands for many common expressions. In particular, we can write character classes to represent large sets of characters. (A character class is a pattern that defines a set of characters and matches exactly one character from that set.) There are some rules for forming character classes:

The dot symbol (.) represents 'any character'.

The regular expression a+b+c+…+z is represented by [abc…z].

Within a character class, - can be used to define a set of characters in terms of a range. For example, a-z defines the set of lower-case letters and A-Z defines the set of upper-case letters. Some tools accept the endpoints of a range in either order (i.e. both 0-9 and 9-0 define the set of digits), but others, such as GNU grep, reject a reversed range, so it is safest to write the lower endpoint first.

If the class must contain a character such as minus, place it first or last to avoid confusion with the range specifier, e.g. [-.0-9].

Special characters can be used as ordinary characters by means of the \ symbol, i.e. \ provides the usual escapes within character class brackets. Thus [[\]] matches either [ or ], because \ causes the first ] in the character class to be taken as a normal character rather than the closing bracket of the class.

Special notations

[:digit:]   same as [0-9]
[:alpha:]   same as [A-Za-z]
[:alnum:]   same as [A-Za-z0-9]

Operators

|     used in place of +
?     0 or 1 of: R? means 0 or 1 occurrence of R
+     1 or more of: R+ means 1 or more occurrences of R
{n}   n copies of: R{3} means RRR
^     complement of

If the first character after the opening bracket of a character class is ^, the set defined by the remainder of the class is complemented with respect to the computer's character set. Using this notation, the character class represented by '.' can be described as [^\n]. If ^ appears anywhere in a class other than first, it is not considered to be an operator. Thus [^abc] matches any character except a, b, or c, but [a^bc] or [abc^] matches a, b, c or ^.
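Most of this notation can be tried out quickly. The sketch below uses Python's re module as a stand-in, which is an assumption of convenience: it is not one of the UNIX commands, and it does not support the [:digit:] style class names, so those are left out. Each entry pairs a pattern with a test string and the result expected under the rules above.

```python
import re

# (pattern, text, should the whole text match?)
examples = [
    (r"[-.0-9]+", "-3.14", True),   # minus placed first is a literal
    (r"[^abc]", "d", True),         # leading ^ complements the class
    (r"[^abc]", "a", False),
    (r"[a^bc]", "^", True),         # ^ is literal when it is not first
    (r"colou?r", "color", True),    # ? means 0 or 1 occurrence
    (r"(ab)+", "ababab", True),     # + means 1 or more occurrences
    (r"(ab){3}", "abab", False),    # {3} means exactly three copies
    (r"cat|dog", "dog", True),      # | is used in place of +
    (r".", "\n", False),            # . behaves like [^\n] by default
]

results = [bool(re.fullmatch(pattern, text)) == expected
           for pattern, text, expected in examples]
print(all(results))  # True
```

re.fullmatch is used so that each pattern has to describe the entire test string, which is closer to language membership than a substring search would be.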

When more than one expression can match the current character sequence, a choice is made as follows:

1. The longest match is preferred.

2. Among rules that match the same number of characters, the rule given first is preferred.

2. Lexical analysis

Compilers in a nutshell

Purpose: translate a program in some language (the source language) into a lower-level language (the target language).

Phases:

Lexical Analysis: Converts a sequence of characters into words, or tokens

Syntax Analysis: Converts a sequence of tokens into a parse tree

Semantic Analysis: Manipulates parse tree to verify symbol and type information

Intermediate Code Generation: Converts parse tree into a sequence of intermediate code instructions

Optimization: Manipulates intermediate code to produce a more efficient program

Final Code Generation: Translates intermediate code into final (machine/assembly) code

Overview of Lexical Analysis

• Convert character sequence into tokens, skip comments & whitespace
• Handle lexical errors
• Efficiency is crucial
• Tokens are specified as regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
• Lexical analyzers are implemented using finite automata built from those regular expressions.

One problem is that more than one token may be recognized at once. For example, the string else matches the regular expression for the keyword else as well as the expression for identifiers. This problem is resolved by giving priority to the expression listed first.
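The two disambiguation rules (longest match first, then the earliest rule) can be sketched as a tiny tokenizer. This is an illustrative sketch only: the rule list, the token names other than IDENTIFIER, and the use of Python's re module in place of a generated automaton are all assumptions made for the example.

```python
import re

# Rule order encodes priority: keywords are listed before IDENTIFIER.
RULES = [
    ("ELSE", r"else"),
    ("IF", r"if"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("SKIP", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def tokenize(text):
    """Split text into (token name, lexeme) pairs."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise SyntaxError(f"lexical error at position {pos}")
        if best_name != "SKIP":          # whitespace is matched but dropped
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("else elsewhere 42"))
# [('ELSE', 'else'), ('IDENTIFIER', 'elsewhere'), ('NUMBER', '42')]
```

The input else is claimed by the keyword rule because of the tie-break, while elsewhere becomes an identifier because the identifier rule matches more characters.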
