
Assignment No.

Aim: Assignment to understand the basic syntax of LEX specifications, built-in functions and variables

Theory:

LEX helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

LEX is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching and produces a program, in a general-purpose language, which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to LEX. The code written by LEX recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The LEX source file associates the regular expressions with the program fragments: as each expression appears in the input to the program written by LEX, the corresponding fragment is executed.

The user supplies the additional code, beyond expression matching, needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general-purpose programming language employed for the user's program fragments. Thus, a high-level expression language is provided for writing the string expressions to be matched, while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same, and often inappropriate, string handling language.


LEX is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called "host languages." Just as general-purpose languages can produce code to run on different computer hardware, LEX can write code in different host languages.

LEX turns the user's expressions and actions (called the source) into the host general-purpose language; the generated program is named yylex. The yylex program recognizes expressions in a stream (called the input) and performs the specified actions for each expression as it is detected.

2. LEX Source.

The general format of LEX source is:

{definitions}

%%

{rules}

%%

{user subroutines}


where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum LEX program is thus a single %% line (no definitions, no rules), which translates into a program that copies the input to the output unchanged.
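To make the three sections concrete, a minimal complete specification might look like the following sketch (the word-counting rule and names are only illustrative; it is built with lex file.l and cc lex.yy.c -ll as described later):

%{
/* definitions section: C declarations copied into the generated lex.yy.c */
#include <stdio.h>
int words = 0;                      /* illustrative counter */
%}
%%
[a-zA-Z]+   { words++; }            /* rules section: count each word */
.|\n        ;                       /* ignore every other character */
%%
int main(void)                      /* user subroutines section */
{
    yylex();                        /* run the generated scanner on stdin */
    printf("words: %d\n", words);
    return 0;
}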

3. LEX Regular Expressions.

A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters.

The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters.
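For instance (an illustrative sketch, not from the assignment code), either of the following rules matches the literal two-character sequence ++ instead of treating + as the repetition operator:

"++"    printf("increment operator\n");
\+\+    printf("increment operator\n");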

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored; only three characters are special: \ - and ^. The - character indicates ranges. For example,

[a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets, and underscore. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message.

Alternation and Grouping. The operator | indicates alternation:

(ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary at the outside level;

ab|cd

would have sufficed.

4. LEX Actions.

When an expression written as above is matched, LEX executes the corresponding action. This section describes some features of LEX which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched.

One of the simplest things that can be done is to ignore the input. Specifying the C null statement ; as an action causes this result. A frequent rule is

[ \t\n]    ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

" "     |
"\t"    |
"\n"    ;

with the same result, although in a different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. LEX leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

[a-z]+    printf("%s", yytext);

will print the string held in yytext.
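Alongside yytext, LEX also maintains the built-in variable yyleng, the length of the matched string. A small illustrative rule set (the messages are only examples):

[a-z]+      printf("word \"%s\" of length %d\n", yytext, yyleng);
[0-9]+      printf("number %s\n", yytext);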


5. Usage.

There are two steps in compiling a LEX source program. First, the LEX source must be turned into a generated program in the host general-purpose language. Then this program must be compiled and loaded, usually with a library of LEX subroutines. The generated program is placed in a file named lex.yy.c. The I/O library is defined in terms of the C standard library.

Command:

$ lex a.l

$ gcc lex.yy.c -o op.out -ll

$ ./op.out a.c


Assignment No.

Aim: Implement a lexical analyser for a subset of C using LEX; the implementation should support error handling.

Theory:

A LEX program consists of three parts:

declarations

%%

translation rules

%%

auxiliary procedures

The translation rules of a LEX program are statements of the form

p1 { action1 }
p2 { action2 }
...
pn { actionn }

where each pi is a regular expression and each actioni is a program fragment describing what action the lexical analyzer should take when pattern pi matches a lexeme. In LEX, the actions are written in C; in general, however, they can be in any implementation language.

The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.

LEX is generally used in the following manner. First, a specification of a lexical analyzer is prepared by creating a program, say lex.l, in the LEX language. Then, lex.l is run through the LEX compiler to produce a C program lex.yy.c. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with regular expressions in lex.l are pieces of C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

//LEX Program

//File name : analyze.l

%{
#include<stdio.h>
#include<string.h>
struct st {
    char lexeme[25];
    char name[25];
} ST[100];
int cnt = 0;
%}
ID      [a-zA-Z][a-zA-Z0-9]*
DIGIT   [0-9]
%%
{DIGIT}+                 {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"const integer literal"); cnt++;}
{DIGIT}+"."{DIGIT}+      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"const float literal"); cnt++;}
"#include"|"#define"     {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"pp directive"); cnt++;}
{ID}".h"                 {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"include file"); cnt++;}
main|void|switch|case|continue|break|do|while|for|if|else|int|float|char {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"keyword"); cnt++;}
"\""{ID}"\""             {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"string literal"); cnt++;}


{ID}                     {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"identifier"); cnt++;}
"+"|"-"|"*"|"/"|"%"      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Arithmetic OP"); cnt++;}
"&"|"|"|"^"|"~"          {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Bitwise OP"); cnt++;}
"<<"|">>"                {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Bitwise Shift OP"); cnt++;}
"&&"|"||"|"!"            {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Logical OP"); cnt++;}
"<"|">"|"<="|">="        {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Relational OP"); cnt++;}
"=="|"!="                {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Equality OP"); cnt++;}
"["                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"OSB"); cnt++;}
"]"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"CSB"); cnt++;}
"{"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"OCB"); cnt++;}
"}"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"CCB"); cnt++;}
"("                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"ORB"); cnt++;}
")"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"CRB"); cnt++;}
";"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Semicolon"); cnt++;}
"++"                     {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Inc OP"); cnt++;}
"--"                     {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Dec OP"); cnt++;}
"?"                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Ternary OP"); cnt++;}
"="                      {strcpy(ST[cnt].lexeme,yytext); strcpy(ST[cnt].name,"Assignment OP"); cnt++;}


%%

int main(int argc, char *argv[])
{
    int i = 0;
    yyin = fopen(argv[1], "r");
    yylex();
    printf("\nTOKEN TABLE");
    printf("\nLEXEME\t\t\tNAME\n");
    printf("___________\t\t_________________\n");
    for (i = 0; i < cnt; i++) {
        printf("\n%s", ST[i].lexeme);
        printf("\t\t\t%s", ST[i].name);
    }
    printf("\n");
    return 0;
}

// Input
// File name : subset_C.c

#include<stdio.h>
#include<string.h>
#define a 20;
void main()
{
    int a = 0;
    char a[20] = "hello";
    for (a = 0; a < 100; a++) {
        a = a + 1;
    }
}

// Output

[root@localhost ~]# lex analyze.l
[root@localhost ~]# cc lex.yy.c -ll
[root@localhost ~]# ./a.out subset_C.c

TOKEN TABLE

LEXEME NAME


___________             _________________

#include                pp directive
<                       Relational OP
stdio.h                 include file
>                       Relational OP
#include                pp directive
<                       Relational OP
string.h                include file
>                       Relational OP
#define                 pp directive
a                       identifier
20                      const integer literal
;                       Semicolon
void                    keyword
main                    keyword
(                       ORB
)                       CRB
{                       OCB
int                     keyword
a                       identifier
=                       Assignment OP
0                       const integer literal
;                       Semicolon
char                    keyword
a                       identifier
[                       OSB
20                      const integer literal
]                       CSB
=                       Assignment OP
"hello"                 string literal
;                       Semicolon
for                     keyword
(                       ORB
a                       identifier
=                       Assignment OP
0                       const integer literal
;                       Semicolon
a                       identifier
<                       Relational OP
100                     const integer literal
;                       Semicolon
a                       identifier
++                      Inc OP
)                       CRB
{                       OCB
a                       identifier
=                       Assignment OP
a                       identifier
+                       Arithmetic OP
1                       const integer literal
;                       Semicolon
}                       CCB
}                       CCB


Assignment No.

Aim: Assignment to understand the basic syntax of YACC specifications, built-in functions and variables.

Theory:

Yacc stands for "yet another compiler-compiler," reflecting the popularity of parser generators in the early 1970s when the first version of Yacc was created by S. C. Johnson. Yacc is available as a command on the UNIX system, and has been used to help implement hundreds of compilers.

Structure of a Yacc Grammar

A yacc grammar consists of three sections: the definition section, the rules section, and the user subroutines section.

... definition section ...

%%

... rules section ...

%%

... user subroutines section ...

The sections are separated by lines consisting of two percent signs. The first two sections are required, although a section may be empty. The third section and the preceding "%%" line may be omitted.

Symbols

A yacc grammar is constructed from symbols, the "words" of the grammar. Symbols are strings of letters, digits, periods, and underscores that do not start with a digit. The symbol error is reserved for error recovery; otherwise, yacc attaches no a priori meaning to any symbol.


Symbols produced by the lexer are called terminal symbols or tokens. Those that are defined on the left-hand side of rules are called nonterminal symbols or non-terminals. Tokens may also be literal quoted characters. (See "Literal Tokens.") A widely followed convention makes token names all uppercase and non-terminals lowercase.

Definition Section

The definition section can include a literal block, C code copied verbatim to the beginning of the generated C file, usually containing declaration and #include lines. There may be %union, %start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration," "Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator Declarations.") It can also contain comments in the usual C format, surrounded by "/*" and "*/". All of these are optional, so in a very simple parser the definition section may be completely empty.
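A small illustrative definition section (the token name and union member are only examples, not taken from the assignment code) might look like:

%{
/* literal block: C declarations copied to the top of the generated parser */
#include <stdio.h>
%}
%union { double dval; }      /* possible value types for symbols */
%token <dval> NUMBER          /* a token whose value is a double */
%left '+' '-'                 /* lowest precedence, left associative */
%left '*' '/'                 /* higher precedence */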

Rules Section

The rules section contains grammar rules and actions containing C code.

User Subroutines Section

Yacc copies the contents of the user subroutines section verbatim to the C file. This section typically includes routines called from the actions. In a large program, it is sometimes more convenient to put the supporting code in a separate source file to minimize the amount of material recompiled when you change the yacc file.

Actions

An action is C code executed when yacc matches a rule in the grammar. The action must be a C compound statement, e.g.:

date: month '/' day '/' year { printf("date found"); } ;

The action can refer to the values associated with the symbols in the rule by using a dollar sign followed by a number, with the first symbol after the colon being number 1, e.g.:


date: month '/' day '/' year
          { printf("date %d-%d-%d found", $1, $3, $5); }

The name "$$" refers to the value of the symbol to the left of the colon. Symbol values can have different C types. For rules with no action, yacc uses a default of:

{ $$ = $1; }

Start Declaration

Normally, the start rule, the one that the parser starts trying to parse, is the one named in the first rule. If you want to start with some other rule, in the declaration section you can write:

%start somename

to start with rule somename. In most cases the clearest way to present the grammar is top-down, with the start rule first, so no %start is needed.

Symbol Values

Every symbol in a yacc parser, both tokens and non-terminals, can have a value associated with it. If the token were NUMBER, the value might be the particular number; if it were STRING, the value might be a pointer to a copy of the string; and if it were SYMBOL, the value might be a pointer to an entry in the symbol table that describes the symbol. Each of these kinds of value corresponds to a different C type: int or double for the number, char * for the string, and a pointer to a structure for the symbol. Yacc makes it easy to assign types to symbols so that it automatically uses the correct type for each symbol.

Declaring Symbol Types

Internally, yacc declares each value as a C union that includes all of the types. You list all of the types in a %union declaration, q.v. Yacc turns this into a typedef for a union type called YYSTYPE. Then, for each symbol whose value is set or used in action code, you have to declare its type. Use %type for non-terminals. Use %token, %left, %right, or %nonassoc for tokens, to give the name of the union field corresponding to its type. Then, whenever you refer to a value using $$, $1, etc., yacc automatically uses the appropriate field of the union.
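For instance, an illustrative sketch (the symbol names are hypothetical, though the same pattern appears in the desk calculator later in this document):

%union {
    double dval;
    char  *sval;
}
%token <dval> NUMBER        /* NUMBER carries the dval field of YYSTYPE */
%token <sval> STRING        /* STRING carries a char * */
%type  <dval> expression    /* the nonterminal expression also yields a double */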


Yacc Library

Every implementation comes with a library of helpful routines. You can include the library by giving the -ly flag at the end of the cc command line on UNIX systems, or the equivalent on other systems. The contents of the library vary among implementations, but it always contains main() and yyerror().

main()

All versions of yacc come with a minimal main program which is sometimes useful for quickie programs and for testing. It's so simple we can reproduce it here:

main(ac, av)

{

yyparse();

return 0;

}

As with any library routine, you can provide your own main(). In nearly any useful application you will want to provide a main() that accepts command-line arguments and flags, opens files, and checks for errors.
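A sketch of such a main(), under the assumption that the lexer reads from the standard yyin file pointer (as the LEX-generated scanners in this document do):

#include <stdio.h>
extern FILE *yyin;                   /* defined in the LEX-generated scanner */
int yyparse(void);

int main(int argc, char *argv[])
{
    if (argc > 1) {
        yyin = fopen(argv[1], "r");  /* parse a named file ...            */
        if (yyin == NULL) {
            perror(argv[1]);
            return 1;
        }
    }                                /* ... or default to standard input  */
    return yyparse();                /* 0 on success, non-zero on failure */
}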

yyerror()

All versions of yacc also provide a simple error reporting routine. It's also simple enough to list in its entirety:

yyerror(char *errmsg)

{

fprintf(stderr, "%s\n", errmsg);

}

This sometimes suffices, but a better error routine that reports at least the line number and the most recent token will make your parser much more usable.
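One possible improvement, sketched under the assumption that the scanner provides the common yylineno line counter and a pointer-style yytext (true for flex; historical AT&T lex declares yytext as a char array instead):

#include <stdio.h>
extern int yylineno;      /* line number maintained by the scanner */
extern char *yytext;      /* text of the most recently matched token */

int yyerror(char *errmsg)
{
    fprintf(stderr, "line %d: %s near '%s'\n", yylineno, errmsg, yytext);
    return 0;
}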


YYABORT

The special statement

YYABORT;

in an action makes the parser routine yyparse() return immediately with a non-zero value, indicating failure. It can be useful when an action routine detects an error so severe that there is no point in continuing the parse. Since the parser may have a one-token lookahead, the rule action containing the YYABORT may not be reduced until the parser has read another token.

YYACCEPT

The special statement

YYACCEPT;

in an action makes the parser routine yyparse() return immediately with a value of 0, indicating success. It can be useful in a situation where the lexer cannot tell when the input data ends, but the parser can.

YYERROR

Sometimes your action code can detect context-sensitive syntax errors that the parser itself cannot. If your code detects a syntax error, you can call the macro YYERROR to produce exactly the same effect as if the parser had read a token forbidden by the grammar. As soon as you invoke YYERROR, the parser calls yyerror() and goes into error recovery mode, looking for a state where it can shift an error token.
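A purely illustrative rule sketch showing all three macros in actions (the tokens QUIT and BAD_INPUT and the range check are hypothetical, not part of the assignment grammar):

line: NAME '=' expression
          { if ($3 > 1e9) YYERROR; }  /* treat an out-of-range value as a syntax error */
    | QUIT
          { YYACCEPT; }               /* the parser, not the lexer, decides the input is done */
    | BAD_INPUT
          { YYABORT; }                /* too broken to continue; yyparse() returns non-zero */
    ;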


Assignment No.

Aim: Program for a desk calculator using the LEX and YACC tools.

Theory:

Yacc stands for "yet another compiler-compiler," reflecting the popularity of parser generators in the early 1970s when the first version of Yacc was created by S. C. Johnson. Yacc is available as a command on the UNIX system, and has been used to help implement hundreds of compilers.

A Yacc source program has three parts:

declarations

%%

translation rules

%%

supporting C routines

A lexical analyzer created by LEX behaves in concert with a parser in the following manner. When activated by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it has found the longest prefix of the input that is matched by one of the regular expressions pi. Then, it executes actioni. Typically, actioni will return control to the parser. However, if it does not, then the lexical analyzer proceeds to find more lexemes, until an action causes control to return to the parser. The repeated search for lexemes until an explicit return allows the lexical analyzer to process white space and comments conveniently. The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute value with information about the lexeme, we can set a global variable called yylval.


The UNIX command

yacc translate.y

transforms the file translate.y into a C program called y.tab.c. The program y.tab.c is a representation of an LALR parser written in C, along with other C routines that the user may have prepared. By compiling y.tab.c along with the ly library, which contains the LR parsing program, using the command

cc y.tab.c -ly

we obtain the desired object program a.out that performs the translation specified by the original Yacc program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program.

// LEX program desk.l

%{

#include "y.tab.h"

#include<math.h>

%}

%%

[0-9]+ { yylval.dval = atof(yytext);

return NUMBER;


}

[0-9]+\.[0-9]+ { yylval.dval = atof(yytext);

return NUMBER;

}

[a-z] { yylval.vblname = yytext[0];

return NAME;

}

[ \t] { }

\n { return 0; }

. { return yytext[0]; }

%%


// Parser program desk.y

%{
#include<math.h>
#include<stdio.h>
#include<stdlib.h>   /* for exit() used in yyerror() below */
%}

%union

{

double dval;

char vblname;

}

%token <vblname> NAME

%token <dval> NUMBER

%left '+' '-'

%left '*' '/'

%nonassoc UMINUS

%type <dval> expression

%%

statement: NAME '=' expression { printf("%c = %g \n",$1,$3); }

| expression { printf("= %g \n",$1); }

;

expression: expression '+' expression { $$ = $1 + $3; }

| expression '-' expression { $$ = $1 - $3; }

| expression '*' expression { $$ = $1 * $3; }

| expression '/' expression { if($3 == 0.0)

{


yyerror("Divide by zero");

}

else

$$ = $1 / $3;

}

| '(' expression ')' { $$ = $2; }

| '-' expression %prec UMINUS { $$ = -$2; }

| NUMBER { $$ = $1; }

;

%%

main()

{

yyparse();

}

int yyerror (char *s)

{

printf("%s\n",s);

exit(0);

}


[root@localhost ~]# lex desk.l

[root@localhost ~]# yacc -d desk.y

[root@localhost ~]# gcc -o output lex.yy.c y.tab.c -ll

[root@localhost ~]# ./output

((2+3) + (4+5))

= 14
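As another illustrative run (not from the original output), entering an assignment such as

x = 3*4+5

should print

x = 17

since the statement rule NAME '=' expression echoes the variable name and the computed value, and '*' binds tighter than '+' because of the %left precedence declarations.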
