Tokenizing
19 March 2013 OSU CSE 1
BL Compiler Structure
string of characters (source code) → Tokenizer → string of tokens (“words”) → Parser → abstract program → Code Generator → string of integers (object code)
The tokenizer is relatively easy.
Aside: Characters vs. Tokens
• In the examples of CFGs, we dealt with languages over the alphabet of individual characters (e.g., Java’s char values)
  – Σ = character
• Now, we deal with languages over an alphabet of tokens, each of which is a unit that you want to consider as a single entity in the language
  – Choice of tokens is a design decision
Example: Expression CFG
expr → expr add-op term | term
term → term mult-op factor | factor
factor → ( expr ) | digit-seq
add-op → + | -
mult-op → * | DIV | REM
digit-seq → digit digit-seq | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Appropriate tokens for this CFG are “words” consisting of strings of consecutive terminal symbols (characters) that “belong together”, e.g., "+", "DIV", "5".
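One way to make this choice of tokens concrete is a small membership test. The following is only a sketch under stated assumptions: the class and method names (`ExprTokens`, `isDigitSeq`, `isToken`) are hypothetical, and treating a whole digit-seq as a single token is a design decision suggested by the example "5", not something the grammar itself requires.

```java
import java.util.Set;

// Hypothetical helper: decides whether a string is one token of the
// expression language, assuming tokens are the operators, the
// parentheses, and whole digit sequences.
final class ExprTokens {

    // Tokens with a fixed spelling, taken from the CFG's terminal symbols.
    private static final Set<String> FIXED =
            Set.of("(", ")", "+", "-", "*", "DIV", "REM");

    // True if s is a non-empty string of the digits 0..9 (a digit-seq).
    static boolean isDigitSeq(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // True if s is a fixed-spelling token or a digit sequence.
    static boolean isToken(String s) {
        return FIXED.contains(s) || isDigitSeq(s);
    }
}
```

Under these assumptions, `isToken("DIV")` and `isToken("42")` hold, while `isToken("DI")` and `isToken("4+")` do not, because neither string is a complete unit of the language.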
Tokenizer
• The job of the tokenizer is to transform a string of characters into a string of tokens
• Example:
  – Input: "4 + (7 DIV 3) REM 5"
characters used as terminal symbols of the language
whitespace characters
Mathematically, input is a string of character
  – Output: <"4", "+", "(", "7", "DIV", "3", ")", "REM", "5">
Mathematically, output is a string of string of character
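The transformation in this example can be sketched in Java as a single left-to-right scan. This is a minimal illustration, not the course's implementation: the class name `ExprTokenizer` and method `tokenize` are hypothetical, and the rules for grouping characters (a run of digits is one token, a run of letters is one token, any other non-whitespace character is a token by itself) are assumptions chosen to fit the expression grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer for the expression language: turns a string of
// characters into a list of tokens (each token is itself a string).
final class ExprTokenizer {

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace separates tokens but is never part of one.
                i++;
            } else if (Character.isDigit(c) || Character.isLetter(c)) {
                // Assumption: a maximal run of digits (or of letters)
                // forms one token, e.g. "42" or "DIV".
                boolean digits = Character.isDigit(c);
                int start = i;
                while (i < source.length()
                        && (digits
                                ? Character.isDigit(source.charAt(i))
                                : Character.isLetter(source.charAt(i)))) {
                    i++;
                }
                tokens.add(source.substring(start, i));
            } else {
                // Any other character, such as "(" or "+", is a
                // one-character token.
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [4, +, (, 7, DIV, 3, ), REM, 5]
        System.out.println(tokenize("4 + (7 DIV 3) REM 5"));
    }
}
```

Note that "(7" and "3)" contain no whitespace, so splitting on whitespace alone would not produce the output shown above; the scan has to recognize parentheses as tokens in their own right. This sketch also does not reject strings such as "FOO" that are not tokens of the language; that check is left to a later stage.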
Another Example: BL
• In BL, tokens can be the “words” such as "IF", "next-is-empty", etc.
• A BL tokenizer is then easy: it can simply treat strings of consecutive whitespace characters as separators between tokens
  – This makes it easy for the language to allow line separators, extra spaces and tabs used for indentation, etc., to have no impact on the legality of a program
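A whitespace-separated tokenizer of this kind can be sketched in a few lines of Java. The class name `WhitespaceTokenizer` and the sample program text in the usage note are illustrative assumptions, not taken from the course materials.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical BL-style tokenizer: every maximal run of non-whitespace
// characters is one token; runs of whitespace only separate tokens.
final class WhitespaceTokenizer {

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        // "\\s+" matches one or more consecutive whitespace characters
        // (spaces, tabs, line separators).
        for (String word : source.trim().split("\\s+")) {
            // An empty or all-whitespace input yields one empty string
            // from split; skip it so the result is an empty list.
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

For instance, `WhitespaceTokenizer.tokenize("IF next-is-empty THEN\n\tmove\nEND IF")` yields the six tokens "IF", "next-is-empty", "THEN", "move", "END", "IF", and the result is the same however the program is indented or broken across lines.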
Resources
• Wikipedia: Lexical Analysis
  – http://en.wikipedia.org/wiki/Lexical_analysis