Lexical Analyzer - Unit 2

Upload: pooja-jain

Post on 06-Nov-2015

  • Lexical Analyzer - Unit 2

  • Outline
    - Informal sketch of lexical analysis: identifying tokens in the input string
    - Issues in lexical analysis: lookahead, ambiguities
    - Specifying lexemes: regular expressions, examples of regular expressions

  • Lexical Analyzer
    Functions:
    - Grouping input characters into tokens
    - Stripping out comments and white space
    - Correlating error messages with the source program
    Issues (why separate lexical analysis from parsing):
    - Simpler design
    - Compiler efficiency
    - Compiler portability (e.g. Linux to Windows)

  • The Role of the Lexical Analyzer
    Lexical analyzers are divided into two processes:
    - Scanning: no tokenization of the input; deletion of comments, compaction of white-space characters.
    - Lexical analysis: producing tokens.

  • Scanning Perspective
    Purpose: transform a stream of symbols into a stream of tokens.

  • Lexical Analyzer Responsibilities
    The lexical analyzer (scanner):
    - Scans the input
    - Removes white space
    - Removes comments
    - Manufactures tokens
    - Generates lexical errors
    - Passes tokens to the parser

  • Compiler Construction: The Role of a Lexical Analyzer
    [diagram: the lexical analyzer reads characters from the source program (and may put a character back), passes a token and attribute value to the parser on each "get next" request, and both components access the symbol table; the entire program is read into memory]


  • Lexical Analysis
    What do we want to do? Example:
        if (i == j)
            z = 0;
        else
            z = 1;
    The input is just a string of characters:
        \t if (i == j) \n \t \t z = 0; \n \t else \n \t \t z = 1;
    Goal: partition the input string into substrings, where the substrings are tokens.


  • What's a Token?
    A syntactic category.
    - In English: noun, verb, adjective, ...
    - In a programming language: Identifier, Integer, Keyword, Whitespace, ...


  • What Are Tokens For?
    - To classify program substrings according to role.
    - The output of lexical analysis is a stream of tokens, which is the input to the parser.
    - The parser relies on token distinctions: an identifier is treated differently than a keyword.


  • Tokens
    Tokens correspond to sets of strings:
    - Identifier: strings of letters or digits, starting with a letter
    - Integer: a non-empty string of digits
    - Keyword: "else" or "if" or "begin" or ...
    - Whitespace: a non-empty sequence of blanks, newlines, and tabs
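    These token classes can be checked with simple character tests; a minimal C sketch (the function names `is_identifier` and `is_integer` are illustrative, not from the slides):

    ```c
    #include <ctype.h>
    #include <stdio.h>

    /* Identifier: a letter followed by letters or digits. */
    static int is_identifier(const char *s) {
        if (!isalpha((unsigned char)s[0])) return 0;
        for (size_t i = 1; s[i]; i++)
            if (!isalnum((unsigned char)s[i])) return 0;
        return 1;
    }

    /* Integer: a non-empty string of digits. */
    static int is_integer(const char *s) {
        if (!s[0]) return 0;
        for (size_t i = 0; s[i]; i++)
            if (!isdigit((unsigned char)s[i])) return 0;
        return 1;
    }

    int main(void) {
        printf("%d %d %d\n", is_identifier("x1"), is_integer("345"), is_identifier("1x"));
        return 0;
    }
    ```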


  • Typical Tokens in a PL
    - Symbols: +, -, *, /, =, ->, ...
    - Keywords: if, while, struct, float, int, ...
    - Integer and real (floating-point) literals: 123, 123.45
    - Char (string) literals
    - Identifiers
    - Comments
    - White space


  • Terminology
    - Token: a classification for a common set of strings. Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, ...
    - Pattern: the rules that characterize the set of strings for a token; a description of the form that the lexemes of a token may take. Example: [0-9]+
    - Lexeme: the actual sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. Examples: Identifier: Name, Data, x; Integer: 345, 2, 0, 629, ...


  • Lexical Errors
    Error handling is very localized with respect to the input source. Example: "fi(a==f(x))" generates no lexical error in C.
    In what situations do errors occur? When no prefix of the remaining input matches any defined token.
    Possible error-recovery actions:
    - Deleting or inserting input characters
    - Replacing or transposing characters
    - Or, skipping ahead to the next separator to ignore the problem

  • Basic Scanning Technique
    - Use one character of lookahead; obtain the character with getc().
    - Do a case analysis, based on the lookahead character and on the current lexeme.
    - Outcome: if the character can extend the lexeme, all is well, go on. If it cannot, figure out what the complete lexeme is, return its token, and put the lookahead character back into the input stream.
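    The steps above can be sketched in C. This is a minimal illustration, not the slides' code: the names `input`, `pos`, `nextchar`, `putback`, and `scan_number` are made up, and a plain string stands in for the getc() stream.

    ```c
    #include <ctype.h>
    #include <stdio.h>

    static const char *input;   /* stand-in for the character stream */
    static size_t pos;

    static int nextchar(void) { return input[pos] ? input[pos++] : EOF; }
    static void putback(void) { if (pos > 0) pos--; }

    /* Accumulate digits; stop (and put back) at the first character
     * that cannot extend the lexeme. */
    static long scan_number(void) {
        long value = 0;
        int c;
        while ((c = nextchar()) != EOF && isdigit(c))
            value = value * 10 + (c - '0');
        if (c != EOF) putback();  /* lookahead char belongs to the next lexeme */
        return value;
    }

    int main(void) {
        input = "123+45";
        pos = 0;
        long a = scan_number();   /* consumes "123", puts back '+' */
        int op = nextchar();      /* '+' */
        long b = scan_number();   /* consumes "45" */
        printf("%ld %c %ld\n", a, op, b);
        return 0;
    }
    ```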

  • Token Attributes
    Example: E = C1 ** 10

    Token   Attribute
    -----   ---------
    ID      index of symbol-table entry for E
    =       -
    ID      index of symbol-table entry for C1
    **      -
    NUM     10


  • Lexical Errors
    It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. E.g., in "fi ( a == f(x)) ...", is "fi" a misspelling of the keyword "if", or an identifier?

  • Suppose a situation in which none of the patterns for tokens matches a prefix of the remaining input, e.g. "$%#if a>0 a+=1;".

  • The simplest recovery strategy is panic-mode recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token. This technique may occasionally confuse the parser, but in an interactive computing environment it may be quite adequate.
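    Panic-mode deletion can be sketched in a few lines of C. This is an illustration only: the function name `panic_skip` is made up, and "can start a token" is simplified here to "letter, digit, or white space".

    ```c
    #include <ctype.h>
    #include <stdio.h>

    /* Delete successive characters from the remaining input until a
     * plausible token start (letter, digit, or white space) is found. */
    static const char *panic_skip(const char *p) {
        while (*p && !isalnum((unsigned char)*p) && !isspace((unsigned char)*p))
            p++;
        return p;
    }

    int main(void) {
        /* the "$%#if a>0 a+=1;" input from the slide above */
        const char *rest = panic_skip("$%#if a>0 a+=1;");
        printf("%s\n", rest);  /* scanning resumes at the "if" */
        return 0;
    }
    ```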

  • Other possible error-recovery actions:
    - Delete one extraneous character from the remaining input.
    - Insert a missing character into the remaining input.
    - Replace a character by another character.
    - Transpose two adjacent characters.

  • Lexical Error and Recovery
    - Error detection
    - Error reporting
    - Error recovery: delete the current character and restart scanning at the next character, or delete the first character read by the scanner and resume scanning at the character following it.
    How about runaway strings and comments?


  • Specification of Tokens
    Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying the kinds of patterns we actually need for tokens.



  • Input Buffering
    The lexical analyzer (LA) reads the source program character by character from secondary storage. Reading the input character by character is costly, so a block of data is first read into a buffer and then scanned by the LA. Input buffering is used in order to move back and forth in the input stream efficiently. Buffering may be done in various ways:
    - One-buffer scheme: if a lexeme crosses the boundary of the buffer, the buffer has to be refilled in order to scan the rest of the lexeme, and the first part of the lexeme is overwritten in the process.
    - Two-buffer scheme: see Buffer Pairs below.

  • Buffer Pairs
    Two buffers of the same size, say 4096 characters, are alternately reloaded. Two pointers into the input are maintained:
    - lexeme_begin marks the beginning of the current lexeme.
    - forward scans ahead until a pattern match is found.
    The string of characters between the two pointers is the current lexeme. Initially both pointers point to the beginning of the lexeme. The forward pointer then scans ahead until a match for a pattern is found. When the end of the lexeme is identified, forward is set to the character at its right end, and the token and attribute corresponding to this lexeme are returned. Both pointers are then set to the beginning of the next lexeme, which starts at the character immediately past the current one.
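    The two-pointer idea can be sketched without the actual disk reads. This simplified C version uses a single in-memory buffer, so the reload logic is omitted; the names `buffer`, `lexeme_begin`, `forward`, and `get_lexeme` are illustrative.

    ```c
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* lexeme_begin marks the start of the current lexeme; forward scans
     * ahead until the pattern (here: a run of letters) no longer matches. */
    static char buffer[] = "count = 10";
    static size_t lexeme_begin = 0, forward = 0;

    static size_t get_lexeme(char *out, size_t outsz) {
        lexeme_begin = forward;
        while (isalpha((unsigned char)buffer[forward]))
            forward++;                        /* scan ahead */
        size_t len = forward - lexeme_begin;  /* chars between the two pointers */
        if (len >= outsz) len = outsz - 1;
        memcpy(out, buffer + lexeme_begin, len);
        out[len] = '\0';
        return len;
    }

    int main(void) {
        char lex[32];
        size_t n = get_lexeme(lex, sizeof lex);
        printf("lexeme=\"%s\" length=%zu\n", lex, n);
        return 0;
    }
    ```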

  • Buffering
    Input buffering raises some efficiency issues. A two-buffer input scheme is useful when lookahead on the input is necessary to identify tokens; techniques such as sentinels marking the buffer end speed up the lexical analyzer further.
    There are three general approaches to the implementation of a lexical analyzer:
    1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
    2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
    3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
    Buffer pairs: because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the overhead required to process an input character. The scheme discussed here consists of a buffer divided into two N-character halves.

  • N is the number of characters in one disk block, e.g., 1024 or 4096. N characters are read into each half of the buffer with one system read command. If fewer than N characters remain in the input, then eof is read into the buffer after the input characters. Two pointers into the buffer are maintained; the string of characters between them is the current lexeme. Initially both pointers point to the first character of the next lexeme to be found. The forward pointer scans ahead until a match for a pattern is found.

  • Once the next lexeme is determined, the forward pointer is set to the character at its right end. If the forward pointer is about to move past the halfway mark, the right half is filled with N new input characters. If it is about to move past the right end of the buffer, the left half is filled with N new characters and the forward pointer wraps around to the beginning of the buffer.
    Disadvantage of this scheme: it works well most of the time, but the amount of lookahead is limited. This may make it impossible to recognize tokens when the forward pointer must travel farther than the length of the buffer. For example, given "DECLARE ( ARG1, ARG2, ..., ARGn )" in a PL/I program, we cannot determine whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis.

  • Whenever we advance the forward pointer we must check whether we have reached the end of a buffer half; in that case we have to refill the other half before advancing. So we have to make three tests:
    1. Check whether the end of the first buffer half has been reached.
    2. Check whether the end of the second buffer half has been reached.
    3. Check for the end of the input (eof).

  • if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half;
    end
    else if forward points to eof then
        terminate scanning;
    else
        forward := forward + 1;

  • Sentinels
    We can reduce these three tests to one by introducing a special character, called a sentinel, at the end of each buffer half; we choose eof as the sentinel.
    [figure: the buffer holding the input "E = M * C ** 2", with a sentinel eof at the end of each half and a final eof marking the true end of the input]

  • Sentinels: in the previous scheme, each time we advance the forward pointer we must check that we have not moved off one half of the buffer; if we have, we must reload the other half. Thus the ends of the buffer halves require two tests for each advance of the forward pointer. We can reduce the two tests to one if we extend each buffer half to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program (the eof character is used). Most of the time only one test is performed, to see whether forward points to an eof; only when the end of a buffer half or the end of input is reached are more tests performed. Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.

  • forward := forward + 1;
    if forward points to eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1;
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half;
        end
        else  /* eof within a buffer half marks the end of the input */
            terminate lexical analysis;
    end

  • A lexeme that is longer than the buffers remains a problem in the two-buffer scheme, e.g.:
    DECLARE(ARG1, ARG2, ..., ARGn)
    E.g., when a function name is overloaded in C++, one name may represent several functions, and long lookahead may be needed to tell them apart.

  • 3.3 Specification of Tokens
    Regular expressions are an important notation for specifying token patterns. We study formal notations for regular expressions; these expressions are used in lexical-analyzer generators. Sec. 3.7 shows how to build a lexical analyzer by converting regular expressions to automata.

  • 1. Regular Definition of Tokens
    Tokens are defined by regular expressions. E.g., an identifier can be defined by the regular grammar:
    id → letter ( letter | digit )*
    letter → A | B | ... | Z | a | b | ... | z
    digit → 0 | 1 | 2 | ... | 9
    An identifier can also be expressed by the following regular expression:
    ( A | B | ... | Z | a | b | ... | z ) ( A | B | ... | Z | a | b | ... | z | 0 | 1 | 2 | ... | 9 )*

  • Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings.

  • 2. Regular Expressions & Regular Languages
    - Regular expression: a notation that allows us to define a pattern in a high-level language.
    - Regular language: each regular expression r denotes a language L(r), the set of sentences matching the regular expression r.

  • Each token in a program can be expressed by a regular expression.

  • 3. Construction rules for regular expressions over an alphabet Σ:
    1) ε is a regular expression, denoting the regular language {ε}.
    2) If a is a symbol in Σ, then a is a regular expression, denoting the regular language {a}.

  • 3) Suppose r and s are regular expressions. Then r|s, rs, (r), r*, and r+ are also regular expressions, with
    L(r|s) = L(r) ∪ L(s)
    L(rs) = L(r)L(s)
    L((r)) = L(r)
    L(r*) = {ε} ∪ L(r) ∪ L(r)L(r) ∪ L(r)L(r)L(r) ∪ ...
    L(r+) = L(r)L(r*)

  • 4. Algebraic laws of regular expressions
    1) r|s = s|r
    2) r|(s|t) = (r|s)|t,  (rs)t = r(st)
    3) r(s|t) = rs|rt,  (s|t)r = sr|tr
    4) εr = r,  rε = r
    5) (r*)* = r*
    6) r* = (r|ε)*
    7) (r|s)* = (r*|s*)* = (r*s*)*

  • 8) If ε ∉ L(s), then the equation r = sr|t has the unique solution r = s*t, and r = rs|t has the unique solution r = ts*.
    Notes: we assume that * has the highest precedence, | the lowest, and that the operators are left-associative.

  • Example: unsigned numbers such as 5280, 39.37, 6.336E4, 1.894E-4
    digit → 0 | 1 | ... | 9
    digits → digit digit*
    optional_fraction → . digits | ε
    optional_exponent → ( E ( + | - | ε ) digits ) | ε
    num → digits optional_fraction optional_exponent

  • 5. Notational Shorthands
    a) One or more instances: (r)+, e.g. digit+
    b) Zero or one instance: r? is a shorthand for r|ε, e.g. (E(+|-)?digits)?
    c) Character classes: [a-z] denotes a|b|c|...|z; similarly [A-Za-z] and [A-Za-z0-9]
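    Using these shorthands, the unsigned-number pattern from the earlier example collapses to digit+ (. digit+)? (E(+|-)? digit+)?. A sketch of checking it with the POSIX regex.h API (the function name `is_num` is illustrative; this assumes a POSIX system):

    ```c
    #include <regex.h>
    #include <stdio.h>

    /* num = digit+ ( . digit+ )? ( E (+|-)? digit+ )?
     * written as an anchored POSIX extended regular expression. */
    static int is_num(const char *s) {
        regex_t re;
        int ok = 0;
        if (regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$", REG_EXTENDED) == 0) {
            ok = (regexec(&re, s, 0, NULL, 0) == 0);
            regfree(&re);
        }
        return ok;
    }

    int main(void) {
        printf("%d %d %d %d\n",
               is_num("5280"), is_num("39.37"), is_num("1.894E-4"), is_num("abc"));
        return 0;
    }
    ```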

  • 3.4 Recognition of Tokens
    1. The task of token recognition in a lexical analyzer:
    - Isolate the lexeme for the next token in the input buffer.
    - Produce as output a pair consisting of the appropriate token and attribute value, such as <token, attribute-value>, using the translation table given in the figure on the next page.

  • Translation table:

    Regular expression   Token   Attribute-value
    if                   if      -
    id                   id      pointer to table entry
    =                    =       -

    Notes: in the transition diagrams, * indicates states on which input retraction must take place; start and accepting states are marked in the diagrams.

  • 4. Implementing a Transition Diagram
    - Each state gets a segment of code.
    - If there are edges leaving a state, its code reads a character and selects an edge to follow, if possible.
    - nextchar() is used to read the next character from the input buffer.

  • while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;
            }
            else if (c == /* ... */)  /* character elided in the original slide */
                state = 6;
            else
                state = fail();
            break;
        /* ... */
        case 9:
            c = nextchar();
            if (isletter(c))
                state = 10;
            else
                state = fail();
            break;
        }
    }

  • 5. A generalized transition diagram is a finite automaton, which may be deterministic or non-deterministic. Non-deterministic means that more than one transition out of a state may be possible on the same input symbol.

  • 6. The model of recognition of tokens

  • E.g., the FA simulator for identifiers implements the rule: identifier = letter ( letter | digit )*
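    A minimal C sketch of such a simulator (the function name `matches_identifier` and the state numbering are illustrative, not from the slides):

    ```c
    #include <ctype.h>
    #include <stdio.h>

    /* States of the identifier automaton: 0 = start, 1 = in identifier
     * (accepting), 2 = dead (reject). */
    static int matches_identifier(const char *s) {
        int state = 0;
        for (; *s; s++) {
            unsigned char c = (unsigned char)*s;
            switch (state) {
            case 0: state = isalpha(c) ? 1 : 2; break;  /* first char: letter */
            case 1: state = isalnum(c) ? 1 : 2; break;  /* then: letter|digit */
            }
            if (state == 2) return 0;  /* dead state: no way to accept */
        }
        return state == 1;  /* accept only in state 1 */
    }

    int main(void) {
        printf("%d %d %d\n",
               matches_identifier("x9"), matches_identifier("9x"), matches_identifier(""));
        return 0;
    }
    ```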