As 1 Compiler

  • 7/31/2019 As 1 Compiler


    Assignment No.

Aim: Assignment to understand basic syntax of LEX specifications, built-in functions, and variables.

    Theory:

    LEX helps write programs whose control flow is directed by instances of regular

    expressions in the input stream. It is well suited for editor-script type transformations and for

    segmenting input in preparation for a parsing routine.

    LEX is a program generator designed for Lexical processing of character input streams. It

    accepts a high-level, problem oriented specification for character string matching, and produces a

    program in a general purpose language which recognizes regular expressions. The regular

    expressions are specified by the user in the source specifications given to LEX. The LEX written

code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the

    user are executed. The LEX source file associates the regular expressions and the program

    fragments. As each expression appears in the input to the program written by LEX, the

corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the

    expressions is generated in the general purpose programming language employed for the user's

    program fragments. Thus, a high level expression language is provided to write the string

    expressions to be matched while the user's freedom to write actions is unimpaired. This avoids


    forcing the user who wishes to use a string manipulation language for input analysis to write

processing programs in the same and often inappropriate string handling language.

    LEX is not a complete language, but rather a generator representing a new language

    feature which can be added to different programming languages, called ``host languages.'' Just as

    general purpose languages can produce code to run on different computer hardware, LEX can

write code in different host languages.

LEX turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize

    expressions in a stream (called input in this memo) and perform the specified actions for each

    expression as it is detected. See Figure 1.



    2. LEX Source.

    The general format of LEX source is:

    {definitions}

    %%

    {rules}

    %%

    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum LEX program is thus

%%

(no definitions, no rules), which translates into a program that copies the input to the output unchanged.
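As an illustration, a complete (if tiny) LEX program with one rule might look like the sketch below; the replacement string printed here is made up for the example:

```
%%
username printf("jsmith");
```

Everything not matched by the rule is copied to the output by the default action, so running this over a text file copies it through with each occurrence of username replaced by jsmith.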

    3. LEX Regular Expressions.

    A regular expression specifies a set of strings to be matched. It contains text characters (which

    match the corresponding characters in the strings being compared) and operator characters

    (which specify repetitions, choices, and other features). The letters of the alphabet and the digits

    are always text characters.

    The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

    and if they are to be used as text characters, an escape should be used. The quotation mark

    operator (") indicates that whatever is contained between a pair of quotes is to be taken as text

characters.


    Character classes. Classes of characters can be specified using the operator pair []. The

    construction [abc] matches a single character, which may be a, b, or c. Within square brackets,

    most operator meanings are ignored. Only three characters are special: these are \ - and ^. The -

    character indicates ranges. For example,

[a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message.

    Alternation and Grouping. The operator | indicates alternation:

    (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary at the outermost level;

ab|cd

would have sufficed.

    4. LEX Actions.

    When an expression written as above is matched, LEX executes the corresponding action. This

    section describes some features of LEX which aid in writing actions. Note that there is a default

    action, which consists of copying the input to the output. This is performed on all strings not

    otherwise matched.

    One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ;

    as an action causes this result. A frequent rule is


    [ \t\n] ;

    which causes the three spacing characters (blank, tab, and newline) to be ignored.

    Another easy way to avoid writing actions is the action character |, which indicates that the

    action for this rule is the action for the next rule. The previous example could also have been

    written

    " "

    "\t"

    "\n"

    with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some

    expression like [a-z]+. LEX leaves this text in an external character array named yytext. Thus, to

print the name found, a rule like

[a-z]+ printf("%s", yytext);

will print the string in yytext.

    5. Usage.

    There are two steps in compiling a LEX source program. First, the LEX source must be turned

    into a generated program in the host general purpose language. Then this program must be

compiled and loaded, usually with a library of LEX subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library.

    Command:

$ lex a.l


$ gcc lex.yy.c -o op.out -ll

$ ./op.out a.c


    Assignment No.

Aim: Implement a lexical analyzer for a subset of C using LEX. The implementation should support error handling.

    Theory:

    A LEX program consists of three parts:

    declarations

    %%

    translation rules

    %%

    auxiliary procedures

    The translation rules of a LEX program are statements of the form

p1 { action1 }

p2 { action2 }

...


pn { actionn }

    where each pi is a regular expression and each actioni is a program fragment describing what

action the lexical analyzer should take when pattern pi matches a lexeme. In LEX, the actions

    are written in C; in general, however, they can be in any implementation language.

    The third section holds whatever auxiliary procedures are needed by the actions. Alternatively,

    these procedures can be compiled separately and loaded with the Lexical analyzer.

LEX is generally used in the manner depicted in the figure above. First, a specification of a lexical analyzer is prepared by creating a program lex.l in the LEX language. Then, lex.l is run through the LEX compiler to produce a program lex.yy.c. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with regular expressions in lex.l are pieces of code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

    //LEX Program

    //File name : analyze.l

    %{

#include <stdio.h>

#include <string.h>

    struct st

    {

    char LEXeme[25];

    char name[25];

}ST[100];

int cnt = 0;


    "{" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"OCB");cnt++;}

    "}" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"CCB");cnt++;}

    "(" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"ORB");cnt++;}

    ")" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"CRB");cnt++;}

    ";"{strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"Semicolon");cnt++;}

    "++" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"IncOP");cnt++;}

    "--" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"DecOP");cnt++;}

    "?" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"TernaryOP");cnt++;}

    "=" {strcpy(ST[cnt].LEXeme,yytext);strcpy(ST[cnt].name,"AssignmentOP");cnt++;}

    %%

    main(int argc,char *argv[])

    {

    int i=0;

    yyin=fopen(argv[1],"r");

yylex();

    printf("\nTOKEN TABLE");

    printf("\nLEXEME\t\t\tNAME\n");


    printf("___________\t\t_________________\n");

for(i=0;i<cnt;i++)


    {

    a=a+1;

    }

    }

    // Output

[root@localhost ~]# lex analyze.l

[root@localhost ~]# cc lex.yy.c -ll

    [root@localhost ~]# ./a.out subset_C.c

    TOKEN TABLE

    LEXEME NAME

    ___________ _________________

    #include pp directive

    < Relational OP

    stdio.h include file

    > Relational OP

    #include pp directive

    < Relational OP


    string.h include file

    > Relational OP

    #define pp directive

    a identifier

    20 const integer literal

    ; Semicolon

    void keyword

    main keyword

    ( ORB

    ) CRB

    { OCB

    int keyword

    a identifier

    = Assignment OP

    0 const integer literal

    ; Semicolon

    char keyword

    a identifier

    [ OSB

    20 const integer literal

    ] CSB

    = Assignment OP

    "hello" string literal

    ; Semicolon

    for keyword


    Assignment No.

Aim: Assignment to understand basic syntax of YACC specifications, built-in functions, and variables.


    Theory:

Yacc stands for "yet another compiler-compiler," reflecting the popularity of parser generators in the early 1970's when the first version of Yacc was created by S. C. Johnson. Yacc is available as

    a command on the UNIX system, and has been used to help implement hundreds of compilers.

    Structure of a Yacc Grammar

    A yacc grammar consists of three sections: the definition section, the rules section, and the user

    subroutines section.

    ... definition section ...

    %%

    ... rules section ...

    %%

    ... user subroutines section ...

    The sections are separated by lines consisting of two percent signs. The first two sections are

    required, although a section may be empty. The third section and the preceding "%%" line may

    be omitted.

    Symbols


    A yacc grammar is constructed from symbols, the "words" of the grammar. Symbols are strings

    of letters, digits, periods, and underscores that do not start with a digit. The symbol error is

    reserved for error recovery, otherwise yacc attaches no a priori meaning to any symbol.

Symbols produced by the lexer are called terminal symbols or tokens. Those that are defined

    on the left-hand side of rules are called nonterminal symbols or non-terminals. Tokens may also

    be literal quoted characters. (See "Literal Tokens.") A widely-followed convention makes token

    names all uppercase and non-terminals lowercase.

    Definition Section

    The definition section can include a literal block, C code copied verbatim to the beginning of the

    generated C file, usually containing declaration and #include lines. There may be %union,

    %start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration,"

    "Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator

    Declarations.") It can also contain comments in the usual C format, surrounded by "/*" and "*/".

    All of these are optional, so in a very simple parser the definition section may be completely

    empty.

    Rules Section

    The rules section contains grammar rules and actions containing C code.

    User Subroutines Section

    Yacc copies the contents of the user subroutines section verbatim to the C file. This section

    typically includes routines called from the actions. In a large program, it is sometimes more


    convenient to put the supporting code in a separate source file to minimize the amount of

    material recompiled when you change the yacc file.

    Actions

    An action is C code executed when yacc matches a rule in the grammar.

    The action must be a C compound statement, e.g.:

    date: month '/' day '/' year { printf("date found"); } ;

    The action can refer to the values associated with the symbols in the rule by

    using a dollar sign followed by a number, with the first symbol after the colon being number 1,

    e.g.:

date: month '/' day '/' year

    { printf("date %d-%d-%d found", $1, $3, $5); }

    The name "$$" refers to the value for the symbol to the left of the colon. Symbol values can have

    different C types. For rules with no action, yacc uses a default of:

    { $$ = $1; }

    Start Declaration

    Normally, the start rule, the one that the parser starts trying to parse, is the one named in the first

    rule. If you want to start with some other rule, in the declaration section you can write:

    %start somename


    to start with rule somename.

    In most cases the clearest way to present the grammar is top-down, with the start rule first, so no

    %start is needed.

    Symbol Values

    Every symbol in a yacc parser, both tokens and non-terminals, can have a value associated with

    it. If the token were NUMBER, the value might be the particular number, if it were STRING, the

    value might be a pointer to a copy of the string, and if it were SYMBOL, the value might be a

    pointer to an entry in the symbol table that describes the symbol. Each of these kinds of value

    corresponds to a different C type, int or double for the number, char * for the string, and a

    pointer to a structure for the symbol. Yacc makes it easy to assign types to symbols so that it

    automatically uses the correct type for

    each symbol.

    Declaring Symbol Types

    Internally, yacc declares each value as a C union that includes all of the types. You list all of the

types in a %union declaration, q.v. Yacc turns this into a typedef for a union type called YYSTYPE. Then for each symbol whose value is set or used in action code, you have to declare

    its type. Use %type for non-terminals. Use %token, %left, %right, or %nonassoc for tokens, to

    give the name of the union field corresponding to its type.

    Then, whenever you refer to a value using $$, $1, etc., yacc automatically uses the appropriate

    field of the union.
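Put together, the type machinery for a parser whose numbers carry double values and whose strings carry char * pointers might be declared as in this sketch (the field and token names are illustrative):

```
%union {
    double dval;   /* value field for NUMBER tokens and expressions */
    char  *sval;   /* value field for STRING tokens */
}
%token <dval> NUMBER
%token <sval> STRING
%type  <dval> expression
```

With these declarations, writing $1 for a NUMBER automatically reads the dval member of YYSTYPE, and assigning $$ in an expression rule writes it.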

    Yacc Library


    Every implementation comes with a library of helpful routines. You can include the library by

giving the -ly flag at the end of the cc command line on UNIX systems, or the equivalent on

    other systems. The contents of the library vary among implementations, but it always contains

    main() and yyerror().

    main()

    All versions of yacc come with a minimal main program which is sometimes useful for quickie

    programs and for testing. It's so simple we can reproduce it here:

    main(ac, av)

    {

    yyparse();

    return 0;

    }

    As with any library routine, you can provide your own main(). In nearly any useful application

    you will want to provide a main() that accepts command-line arguments and flags, opens files,

    and checks for errors.

    yyerror()

    All versions of yacc also provide a simple error reporting routine. It's also simple enough to list

    in its entirety:

    yyerror(char *errmsg)


    {

    fprintf(stderr, "%s\n", errmsg);

    }

    This sometimes suffices, but a better error routine that reports at least the line number and the

    most recent token will make your parser much more usable.

    YYABORT

    The special statement

    YYABORT;

    in an action makes the parser routine yyparse() return immediately with a non-zero value,

    indicating failure. It can be useful when an action routine detects an error so severe that there is

    no point in continuing the parse.

Since the parser may have a one-token lookahead, the rule containing the YYABORT may not be reduced until the parser has read another token.

    YYACCEPT

    The special statement

    YYACCEPT;


    in an action makes the parser routine yyparse() return immediately with a value 0, indicating

    success.

It can be useful in a situation where the lexer cannot tell when the input data ends, but the

    parser can.

    YYERROR

    Sometimes your action code can detect context-sensitive syntax errors that the parser itself

    cannot. If your code detects a syntax error, you can call the macro YYERROR to produce

exactly the same effect as if the parser had read a token forbidden by the grammar. As soon as you invoke YYERROR the parser calls yyerror() and goes into error recovery mode looking for

    a state where it can shift an error token.


    Assignment No.

    Aim: Program for Desk calculator using LEX and YACC tool.

    Theory:

    Yacc stands for "yet another compiler-compiler," reflecting the popularity of parser generators in

    the early 1970's when the first version of Yacc was created by S. C. Johnson. Yacc is available as

    a command on the UNIX system, and has been used to help implement hundreds of compilers.

    A Yacc source program has three parts:

    declarations

    %%

    translation rules

    %%

    supporting -routines

    A Lexical analyzer created by LEX behaves in concert with a parser in the following manner.

    When activated by the parser, the Lexical analyzer begins reading its remaining input, one

    character at a time, until it has found the longest prefix of the input that is matched by one of the

    regular expressions pi. Then, it executes actioni. Typically, actioni will return control to the


parser. However, if it does not, then the lexical analyzer proceeds to find more lexemes, until an action causes control to return to the parser. The repeated search for lexemes until an explicit return allows the lexical analyzer to process white space and comments conveniently. The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute value with information about the lexeme, we can set a global variable called yylval.

    The UNIX command

yacc translate.y

transforms the file translate.y into a program called y.tab.c. The program y.tab.c is a representation of an LALR parser written in C, along with other routines that the user may have prepared. By compiling y.tab.c along with the ly library that contains the LR parsing program, using the command

cc y.tab.c -ly

we obtain the desired object program a.out that performs the translation specified by the original Yacc program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any program.


    // LEX program desk.l

    %{

    #include "y.tab.h"

#include <stdlib.h>

    %}

    %%

    [0-9]+ { yylval.dval = atof(yytext);

    return NUMBER;

    }

    [0-9]+\.[0-9]+ { yylval.dval = atof(yytext);

    return NUMBER;

    }

    [a-z] { yylval.vblname = yytext[0];


    return NAME;

    }

    [ \t] { }

    \n { return 0; }

    . { return yytext[0]; }

    %%


    // Parser program desk.y

    %{

#include <stdio.h>

#include <stdlib.h>

    %}

    %union

    {

    double dval;

    char vblname;

    }

%token <vblname> NAME

%token <dval> NUMBER


    %left '+' '-'

    %left '*' '/'

    %nonassoc UMINUS

%type <dval> expression

    %%

    statement: NAME '=' expression { printf("%c = %g \n",$1,$3); }

    | expression { printf("= %g \n",$1); }

    ;

    expression: expression '+' expression { $$ = $1 + $3; }

    | expression '-' expression { $$ = $1 - $3; }

    | expression '*' expression { $$ = $1 * $3; }

    | expression '/' expression { if($3 == 0.0)

    {

    yyerror("Divide by zero");

    }

    else


    $$ = $1 / $3;

    }

    | '(' expression ')' { $$ = $2; }

    | '-' expression %prec UMINUS { $$ = -$2; }

    | NUMBER { $$ = $1; }

    ;

    %%

    main()

    {

    yyparse();

    }

    int yyerror (char *s)

    {

    printf("%s\n",s);

    exit(0);

    }


[root@localhost ~]# lex desk.l

[root@localhost ~]# yacc -d desk.y

[root@localhost ~]# gcc -o output lex.yy.c y.tab.c -ll