Post on 13-Jan-2016

Languages, Grammars, and Regular Expressions

Chuck Cusack

• Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5th edition, by Kenneth Rosen

Alphabets and Languages

• Definition: A vocabulary (or alphabet) V is a finite, nonempty set of symbols.

• Definition: A word or sentence over V is a finite string of symbols from V.

• Definition: The empty string or null string, denoted by λ, is the string containing no symbols.

• Definition: The set of all words over V is denoted by V*.

• Definition: A language over V is a subset of V*.

Language Examples

• Let V={0,1}

• 00110, 11111, 00, and 11 are words over V

• 012, a234, and 222 are not words over V

• V*={λ,0,1,00,01,10,11,000,…}

• In other words, V* is the set of all binary strings, including the empty string λ

• The set of strings consisting of only 0s is a language over V

• {1,10,100,1000,10000,…} is a language over V

Concatenation

• Definition: Let V be a vocabulary, and A and B be subsets of V*. The concatenation of A and B, denoted by AB, is the set of all strings of the form xy, where x∈A and y∈B.

• Example: Let A={0, 10}, and B={1,12}. Then

– AB={01, 012, 101, 1012}

– BA={10, 110, 120, 1210}

– AA={00, 010, 100, 1010}

– AAA=A(AA)={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
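The concatenation example can be checked mechanically. The following is a minimal Python sketch; the function name `concat` is illustrative and not part of the slides, and Python string sets stand in for sets of words.

```python
def concat(a, b):
    """Return AB = {xy | x in A, y in B} for two sets of strings."""
    return {x + y for x in a for y in b}

A = {"0", "10"}
B = {"1", "12"}

AB = concat(A, B)
BA = concat(B, A)
AA = concat(A, A)
AAA = concat(A, AA)

print(sorted(AB))
print(sorted(BA))
print(sorted(AAA))
```

Note that AB and BA differ, so concatenation of sets is not commutative in general.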

Concatenation: A^n

• Definition: Let V be a vocabulary, and A a subset of V*. Then A^0={λ}, and for n>0, we define A^n=A^(n-1)A.

• Example: Let A={0, 10}. Then

– A^0={λ}

– A^1=A^0A={λ}A=A={0,10}

– A^2=A^1A={00, 010, 100, 1010}

– A^3=A^2A={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
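The recursive definition of the n-th power translates directly into a loop. This is a sketch under the assumption that the empty Python string `""` stands in for λ; the names `concat` and `power` are illustrative only.

```python
def concat(a, b):
    """Return AB = {xy | x in A, y in B}."""
    return {x + y for x in a for y in b}

def power(a, n):
    """Return A^n, starting from A^0 = {λ} and concatenating A on the right n times."""
    result = {""}  # A^0 = {λ}; "" plays the role of λ
    for _ in range(n):
        result = concat(result, a)
    return result

A = {"0", "10"}
for n in range(4):
    print(n, sorted(power(A, n)))
```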

Kleene Closure

• Definition: Let V be a vocabulary, and A a subset of V*. The Kleene closure of A, denoted by A*, is the set consisting of concatenations of an arbitrary number of strings from A. That is,

$$A^* = \bigcup_{k=0}^{\infty} A^k$$

• Definition: A+ is the set of nonempty strings over A. In other words,

$$A^+ = A^* - \{\lambda\} = \bigcup_{k=1}^{\infty} A^k$$

where the last equality holds whenever λ is not itself an element of A.

Kleene Closure Example

• Example: Let A={0, 1}. Then

– A^0={λ}

– A^1={0,1}

– A^2={00, 01, 10, 11}

– A^3={000, 001, 010, 011, 100, 101, 110, 111}

– A*={0,1}*={All binary strings}

• Example: Let B={111}. Then

– B^0={λ}, B^1={111}, B^2={111111}

– B^3={111111111}

– B* is the set of strings consisting of 3n 1s, for every integer n≥0
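Since A* is usually infinite, a program can only enumerate a finite part of it. The sketch below collects the union of the powers of A from 0 up to a chosen bound; the function name `kleene_up_to` and the bound `max_k` are assumptions made for illustration.

```python
def kleene_up_to(a, max_k):
    """Return the union of A^0, A^1, ..., A^max_k (a finite slice of A*)."""
    result = set()
    layer = {""}  # A^0 = {λ}; "" plays the role of λ
    for _ in range(max_k + 1):
        result |= layer
        layer = {x + y for x in layer for y in a}  # next power of A
    return result

binary = kleene_up_to({"0", "1"}, 2)
triples = kleene_up_to({"111"}, 3)

print(sorted(binary, key=lambda w: (len(w), w)))
print(sorted(triples, key=len))
```

Every string in the second result has a length that is a multiple of 3, matching the description of B*.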

Regular Sets

• Definition: A regular set is a set that can be generated starting from the empty set, empty string, and single elements from the vocabulary, using concatenations, unions, and Kleene closures in arbitrary order.

• We will give a more precise definition after we define a regular expression.

Regular Expressions

• Definition: The regular expressions over a set I are defined recursively by:

– ∅ (the empty set) is a regular expression,

– λ (the set containing the empty string) is a regular expression,

– x is a regular expression for all x∈I,

– (AB), (A∪B), and A* are regular expressions if A and B are regular expressions

• Definition: A regular set is a set represented by a regular expression.

• Examples: 001*, 1(0(01)*)11, and AB*C are regular expressions

Regular Expression Example

• The regular set defined by the regular expression 01* is the set of strings starting with a 0 followed by 0 or more 1s.

• The regular set defined by (10)* is the set of strings containing 0 or more copies of 10.

• The regular set defined by 0(0∪1)*1 is the set of all binary strings beginning with 0 and ending with 1.

• The regular set defined by (0∪1)1(0∪1) is the set of strings {010, 011, 110, 111}.
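These examples can be tested with Python's `re` module, which writes union as `|`. The sketch below is only a bounded check over short binary strings, not a proof; the helper `binary_strings` is a name introduced here for illustration.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# 0(0∪1)*1 in Python syntax; fullmatch requires the whole string to match.
begins0_ends1 = re.compile(r"0(0|1)*1")
matched = {w for w in binary_strings(5) if begins0_ends1.fullmatch(w)}
expected = {w for w in binary_strings(5) if w.startswith("0") and w.endswith("1")}

# (0∪1)1(0∪1): any bit, then 1, then any bit.
middle_one = {w for w in binary_strings(4) if re.fullmatch(r"(0|1)1(0|1)", w)}

print(matched == expected)
print(sorted(middle_one))
```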

Regular Expression Applications

• Regular expressions are used quite often in computer science.

• For instance, if you are editing a file with vi, and want to see if it contains the string blah followed by a number followed by any character followed by the letter Q, you can use the regular expression

blah[0-9][0-9]*.Q

• This works because vi uses regular expressions for searching.
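The constructs used in this pattern (character classes, `*`, and `.`) are written the same way in Python's `re` module, so the search can be tried outside vi. This is a sketch with made-up sample lines; `re.search` looks for the pattern anywhere in the text, much like an editor search.

```python
import re

pattern = re.compile(r"blah[0-9][0-9]*.Q")

samples = ["xx blah42_Q yy", "blah7Q", "blah77Q", "blah_Q"]
results = {text: pattern.search(text) is not None for text in samples}

for text, found in results.items():
    print(repr(text), found)
```

Note that "blah7Q" does not match because no character sits between the number and Q, while "blah77Q" does match because the second 7 can serve as the "any character".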

Grammars and Languages

• Many languages can be defined by grammars.

• We are particularly interested in phrase-structure grammars.

• Before we can define phrase-structure grammars, we need to define a few more terms.

Special Symbols

• Definition: A nonterminal symbol (or just nonterminal) is a symbol which can be replaced by other symbols.

• Definition: A terminal symbol (or just terminal) is a symbol which cannot be replaced by other symbols.

• Definition: The start symbol is a special symbol, usually denoted by S.

• The set of terminals is denoted by T, and the set of nonterminals by N.

• S is a nonterminal.

Productions

• Definition: A production is a rule which tells how to replace one string from V* with another string.

• Productions are denoted by a→b, which denotes that a can be replaced by b.

• Example

– Let S→A0, A→A1, and A→0 be productions

– Then I can replace S with A0

– Since I can replace A with A1, A0 can become A10

– Since I can replace A with 0, A10 can become 010

– Thus, I can replace S with 010
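The three replacement steps in this example can be replayed in code. This is a minimal sketch assuming each symbol is a single character and that the leftmost occurrence is the one replaced; `apply_production` is a name chosen for illustration.

```python
def apply_production(s, lhs, rhs):
    """Replace the leftmost occurrence of lhs in s with rhs."""
    if lhs not in s:
        raise ValueError(f"{lhs!r} does not occur in {s!r}")
    return s.replace(lhs, rhs, 1)

step1 = apply_production("S", "S", "A0")    # use S -> A0
step2 = apply_production(step1, "A", "A1")  # use A -> A1
step3 = apply_production(step2, "A", "0")   # use A -> 0

print(step1, step2, step3)
```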

Phrase-Structure Grammars

• Definition: A phrase-structure grammar is a 4-tuple G=(V,T,S,P), where

– V is a vocabulary

– T⊆V is a set of terminals

– S∈V is a start symbol

– P is a set of productions

• N=V-T is the set of nonterminals

• Each production contains at least one nonterminal on its left side.

• We will always use S as the start symbol.

Direct Derivations

• Let G=(V,T,S,P) be a phrase-structure grammar.

• Let A=lar and B=lbr, where l, a, b, r ∈ V*.

• Let a→b be a production.

• Then we can derive B from A.

• Thus we say that B is directly derivable from A.

• We write this as A⇒B
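One way to make direct derivation concrete is to compute every string that is directly derivable from a given string. The sketch below assumes productions are given as (left, right) pairs of Python strings with nonempty left sides; the function name `directly_derivable` is illustrative.

```python
def directly_derivable(s, productions):
    """Return all strings obtained from s by applying one production at one position."""
    results = set()
    for lhs, rhs in productions:
        start = s.find(lhs)
        while start != -1:
            # s = l + lhs + r becomes l + rhs + r
            results.add(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return results

# The productions S -> A0, A -> A1, A -> 0 from the Productions slide.
P = [("S", "A0"), ("A", "A1"), ("A", "0")]

print(directly_derivable("S", P))
print(sorted(directly_derivable("A10", P)))
```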

Derivations

• Let G=(V,T,S,P) be a phrase-structure grammar

• Let A1, A2,…,An ∈ V* be such that

A1⇒A2⇒…⇒An

• Then we say that An is derivable from A1.

• We write A1 ⇒* An

• The sequence of productions used is called a derivation.

Generating Languages

• Let G=(V,T,S,P) be a grammar

• Definition: The language generated by G, denoted L(G) , is the set of all strings of terminals that are derivable from S.

• Put another way,

L(G)={w ∈ T* | S ⇒* w}

Example 1

Let G be the grammar with

– V={S,0,1}

– T={0,1}

– P={S→S0, S→0}

• Clearly S⇒0, so 0∈L(G)

• Also, S⇒S0⇒00, so 00∈L(G)

• And, S⇒S0⇒S00⇒000, so 000∈L(G)

• It is not hard to see that L(G) is the language consisting of all strings with 1 or more 0s.

Example 2

Let G be the grammar with V={S,0,1}, T={0,1}, and P={S→SS, S→1, S→0}

• Clearly S⇒0, so 0∈L(G)

• Also, S⇒1, so 1∈L(G)

• Since S⇒SS⇒S1⇒01, we have 01∈L(G)

• In general, we can get a sequence of Ss, and replace each with either 0 or 1.

• Given this fact, it is easy to see that L(G)={0,1}+, the set of all non-empty binary strings

Example 3

Let G be the grammar with V={S,A,B,0,1}, T={0,1}, and P={S→AB, B→BB, A→AA, A→0, B→1}

• Clearly S⇒AB⇒0B⇒01, so 01∈L(G)

• Also, S⇒AB⇒AAB⇒0AB⇒00B⇒001, so 001∈L(G)

• Similarly, we can get 011, 0011, 0001, etc.

• In general, we can get a sequence of n 0s followed by m 1s, where n>0, m>0.

• Thus L(G)={0^n 1^m | m and n are positive integers}
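The languages in Examples 1 and 3 can be explored by searching over derivations. The sketch below does a breadth-first search from S and keeps only strings up to a length bound. It relies on an assumption that holds for these examples but not for every grammar: no production makes a string shorter, so longer intermediate strings can be discarded safely. The function name `generate` and its parameters are illustrative.

```python
from collections import deque

def generate(productions, terminals, start="S", max_len=4):
    """Return all words of L(G) with length <= max_len (non-shrinking productions assumed)."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        s = queue.popleft()
        if all(ch in terminals for ch in s):
            words.add(s)  # only terminals left, so s is a word of L(G)
            continue
        for lhs, rhs in productions:
            i = s.find(lhs)
            while i != -1:
                t = s[:i] + rhs + s[i + len(lhs):]
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    queue.append(t)
                i = s.find(lhs, i + 1)
    return words

T = {"0", "1"}
example1 = generate([("S", "S0"), ("S", "0")], T, max_len=3)
example3 = generate(
    [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")], T, max_len=4
)

print(sorted(example1, key=len))
print(sorted(example3, key=lambda w: (len(w), w)))
```

The bounded results agree with the descriptions above: only 0s in Example 1, and a block of 0s followed by a block of 1s in Example 3.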

Type 0 Grammars

• Type 0 grammars have no restrictions on the types of productions that are allowed.

• Thus type 0 grammars are just phrase-structure grammars.

• This is not too exciting, so we will move on to type 1 grammars.

Type 1 Grammars

• In a type 1 grammar, productions are of the form

– aXb→acb, where X∈N and a,b,c∈V* with c≠λ

– (or S→λ, but ignore this for now)

• Thus, a production can only be applied if the symbol X is surrounded by a and b.

• In other words, the production can only be applied in a certain context.

• This is why type 1 grammars are also called context-sensitive grammars.

Type 2 Grammars

• Productions are of the form

– X→a, where X∈N and a∈V*.

• Thus, if X is in a string, we can replace X with a no matter what surrounds X.

• In other words, the context in which X appears does not matter.

• This is why type 2 grammars are called context-free grammars.

• Context-free grammars produce context-free languages.

Type 3 Grammars

• Productions are of the form

– X→a, where X∈N and a∈T

– X→aY, where X,Y∈N and a∈T

– S→λ

• Type 3 grammars are called regular grammars.

• Regular grammars produce regular languages.

• It is easy to see that a type 3 grammar is a type 2 grammar.

Types of Grammars

Type   Productions allowed

0      Almost any kind allowed

1      aXb→acb, where X∈N, a,b,c∈V*, c≠λ
       S→λ

2      X→a, where X∈N and a∈V*

3      X→a, where X∈N and a∈T
       X→aY, where X,Y∈N and a∈T
       S→λ

Types of Grammars

• The following summarizes the relationships between the types of grammars

Type 0: phrase-structure

Type 1: context-sensitive

Type 2: context-free

Type 3: regular

Regular Grammar Example

• Let G be the grammar with

– V={S,A,0,1},

– T={0,1}, and

– P={S→0A, A→0A, A→1A, A→1}

• We can determine what the language is by constructing a few words.

– S⇒0A⇒01

– S⇒0A⇒00A⇒001, S⇒0A⇒01A⇒011

– S⇒0A⇒00A⇒000A⇒0001, S⇒0A⇒00A⇒001A⇒0011

– S⇒0A⇒01A⇒010A⇒0101, S⇒0A⇒01A⇒011A⇒0111

• We can see that in general, L(G) is the set of binary strings beginning with 0 and ending with 1.
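Because every production in a regular grammar emits one terminal and leaves at most one nonterminal, membership can be tested by reading the word left to right while tracking which nonterminals are possible. The sketch below is one way to do this, not a method taken from the slides; productions are encoded as hypothetical triples (X, a, Y), with Y set to `None` for a production of the form X→a, and the S→λ case is ignored.

```python
from itertools import product

def accepts(word, productions, start="S"):
    """Return True if word is derivable in a regular grammar given as (X, a, Y) triples."""
    current = {start}
    for i, ch in enumerate(word):
        is_last = i == len(word) - 1
        following = set()
        for x, a, y in productions:
            if x in current and a == ch:
                if y is None:
                    if is_last:
                        following.add(None)  # X -> a finishes the derivation
                else:
                    following.add(y)         # X -> aY continues with Y
        current = following
    return None in current

# P = {S -> 0A, A -> 0A, A -> 1A, A -> 1}
P = [("S", "0", "A"), ("A", "0", "A"), ("A", "1", "A"), ("A", "1", None)]

words = ["".join(bits) for n in range(7) for bits in product("01", repeat=n)]
agrees = all(
    accepts(w, P) == (w.startswith("0") and w.endswith("1")) for w in words
)
print(agrees)
```

For all binary strings up to length 6, the grammar accepts exactly those beginning with 0 and ending with 1, which supports the general claim without proving it.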

Regular Languages and Sets

• Theorem: Let A be a subset of V* . Then A is a regular language if and only if A is a regular set.

• In other words, a language defined by a regular grammar can also be defined by a regular expression, and vice-versa.

• Example: We just saw that the grammar with V={S,A,0,1}, T={0,1}, and P={S→0A, A→0A, A→1A, A→1} generates the set of binary strings beginning with 0 and ending with 1.

• Recall that the regular set defined by 0(0∪1)*1 is also the set of all binary strings beginning with 0 and ending with 1.

Grammar Applications

• Context-free grammars are used to define the syntax of most programming languages.

• Regular grammars are used in several applications, including the following

– Searching text for patterns

– Lexical analysis (during program compilation)

• Efficient algorithms exist to determine if a string is in a context-free or regular language.

• This is important for tasks like determining whether or not a program is syntactically valid.

Backus-Naur Form

• Backus-Naur form (BNF) is a more compact representation of productions in a type 2 grammar.

• All productions with the same left hand side are combined into one production

• The symbol → is replaced with ::=

• All nonterminals are enclosed in < and >

• The right hand sides of the various productions are combined, and separated by |

Backus-Naur Form Example

• Consider the set of productions

– S→AB

– B→BB

– A→AA

– A→0

– B→1

• In BNF, they are represented by

– <S> ::= <A><B>

– <B> ::= <B><B> | 1

– <A> ::= <A><A> | 0
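The conversion steps can be sketched as a small function: group productions by left hand side, wrap nonterminals in angle brackets, and join the alternatives with |. The function name `to_bnf` is illustrative, and the sketch assumes every symbol is a single character.

```python
def to_bnf(productions, nonterminals):
    """Render (lhs, rhs) productions of a type 2 grammar as BNF lines."""
    grouped = {}
    for lhs, rhs in productions:
        grouped.setdefault(lhs, []).append(rhs)  # combine rules with the same left side

    def render(symbols):
        # Nonterminals get angle brackets; terminals are written as they are.
        return "".join(f"<{s}>" if s in nonterminals else s for s in symbols)

    lines = []
    for lhs, alternatives in grouped.items():
        right = " | ".join(render(alt) for alt in alternatives)
        lines.append(f"{render(lhs)} ::= {right}")
    return lines

P = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
bnf = to_bnf(P, nonterminals={"S", "A", "B"})
for line in bnf:
    print(line)
```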

Backus-Naur Form Example 2

• The Backus-Naur form for the production of a signed integer is

– <signed integer> ::= <sign><integer>

– <sign> ::= + | -

– <integer> ::= <digit> | <digit><integer>

– <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
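Each BNF rule above can be turned into one small recognizer function, which is the idea behind recursive-descent parsing. The following is a sketch with illustrative function names; it only answers whether a string fits the grammar and does not compute a value.

```python
DIGITS = set("0123456789")

def is_digit(s):
    # <digit> ::= 0 | 1 | ... | 9
    return len(s) == 1 and s in DIGITS

def is_integer(s):
    # <integer> ::= <digit> | <digit><integer>
    if len(s) == 0:
        return False
    if len(s) == 1:
        return is_digit(s)
    return is_digit(s[0]) and is_integer(s[1:])

def is_sign(s):
    # <sign> ::= + | -
    return s in ("+", "-")

def is_signed_integer(s):
    # <signed integer> ::= <sign><integer>
    return len(s) >= 2 and is_sign(s[0]) and is_integer(s[1:])

for text in ["+42", "-0", "42", "+", "+4a"]:
    print(repr(text), is_signed_integer(text))
```

Under this grammar the sign is mandatory, so "42" is rejected even though it is a valid <integer>.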

Backus-Naur Form Applications

• Specifying the syntax for programming languages including

– Java

– LISP

• Specifying database languages

– SQL

• Specifying markup languages

– XML
