Build code with lex and yacc, Part 1: Introduction

Meet lex, yacc, flex, and bison

Level: Introductory

Peter Seebach ([email protected])

Freelance writer

11 Aug 2004

Lex and yacc are tools to automatically build C code suitable for parsing things in simple languages. These tools are most often used for parts of compilers or interpreters, or for reading configuration files. In the first of two articles, Peter Seebach explains what lex and yacc actually do and shows how to use them for simple tasks.

Most people never need to know what lex and yacc do. You occasionally need to have them installed to compile something you downloaded, but for the most part, it just works. Maybe an occasional README file refers to "shift/reduce" conflicts. However, these tools remain a valuable part of the Unix toolbox, and a bit of familiarity with them will go a long way.

In fact, while lex and yacc are almost always mentioned in the same breath, they can be used separately. A number of very entertaining if trivial programs have been written entirely in lex (see Resources for links to those). Programs using yacc, but not lex, are rarer.

Throughout these articles, the names "lex" and "yacc" are used to include flex and bison, the GNU versions of these utilities. The code provided should work with all of the major versions, such as MKS yacc. It's all one big happy family!

This is a two-part series. This first article introduces lex and yacc in general terms, exploring what they do and how they do it; the second shows an actual application built using them.

What are lex and yacc, and why do people refer to them together?

Lex and yacc are a matched pair of tools. Lex breaks down files into sets of "tokens," roughly analogous to words. Yacc takes sets of tokens and assembles them into higher-level constructs, analogous to sentences. Yacc is designed to work with the output of Lex, although you can write your own code to fill that gap. Likewise, lex's output is mostly designed to be fed into some kind of parser.

They're for reading files that have reasonably structured formats. For instance, code in many programming languages can be read using lex and yacc. Many data files have predictable enough formats to be read using them, as well. Lex and yacc can be used to parse fairly simple and regular grammars. Natural languages are beyond their scope, but most computer programming languages are within their bounds.

Lex and yacc are tools for building programs. Their output is itself code, which needs to be fed into a compiler; typically, additional user code is added to use the code generated by lex and/or yacc. Some simple programs can get by on almost no additional code; others use a parser as a tiny portion of a much larger and more complicated program.

A more detailed look at each of these programs is in order.

Lex: A lexical analyzer generator

A lexical analyzer isn't a handheld gizmo you get on a sci-fi show. It's a program that breaks input into recognized pieces. For instance, a simple lexical analyzer might count the words in its input. Lex takes a specification file and builds a corresponding lexical analyzer, coded in C.

Perhaps the best way to approach this is to look at an example. Here's a simple lex program, taken from the man page for flex.

Listing 1. Counting words using lex

        int num_lines = 0, num_chars = 0;

%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%

main() {
        yylex();
        printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
}

This program has three sections, separated by %% markers. The first and last sections are plain old C code. The middle section is the interesting one. It consists of a series of rules that lex translates into the lexical analyzer. Each rule, in turn, consists of a regular expression and some code to be run when that regular expression is matched. Any text that isn't matched is simply copied to standard output. So, if your program is trying to parse a language, it's important to make sure that all possible inputs are caught by the lexer; otherwise, you get whatever's left over displayed to the user as though it were a message.

In fact, the code in Listing 1 is a complete program; if you run it through lex, compile it, and run the result, it'll do exactly what it looks like it does.

In this case, it's very easy to see what happens. A newline always matches the first rule. Any other character matches the second. Lex works through its rules, consuming the longest stretch of input it can match. If something doesn't match any rule at all, lex will just copy it to standard output; this behavior is often undesirable. A simple solution is to add a last rule that matches anything, and either does nothing (if you're lazy) or emits some kind of diagnostic (a catch-all rule of that kind is sketched after the next example). Note that lex gives preference to longer matches, even if they come from later rules. So, given these rules:

u { printf("You!\n"); }

uu { printf("W!\n"); }

and "uuu" for input, Lex will match the second rule first, consuming the first two letters, then the first rule. However, if something can match either of two rules, order in the lex specification determines which one is used. Some versions of lex may warn you when a rule cannot be matched.

What makes lex useful and interesting is that it can handle much more complicated rules. For instance, a rule to recognize a C identifier might look like

[a-zA-Z_][0-9a-zA-Z_]* { return IDENTIFIER; }

The syntax used is plain old regular expression syntax. There are a couple of extensions. One is that you can give names to common patterns. In the first section of the lex program, before the first %%, you can define names for a few of these:

DIGIT   [0-9]
ALPHA   [a-zA-Z]
ALNUM   [0-9a-zA-Z]
IDENT   [0-9a-zA-Z_]

Then, you can refer back to them by putting their names in braces in the rules section:

({ALPHA}|_){IDENT}* { return IDENTIFIER; }

Each rule has corresponding code that is executed when the regular expression is matched. The code can do any necessary processing and optionally return a value. The value will be used if a parser is using lex's output. For instance, in the case of the simple line-counting program, there's no real need for a parser. If you're using a parser to interpret code in a new language, you will need to return something to the parser to tell it what tokens you've found. You can just use an enum or a series of #define directives in a shared include file, or you can have yacc generate a list of predefined values for you.
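If you manage the token values yourself, the shared include file can be nothing more than a handful of #define directives; this is essentially what yacc -d generates for you. The names and values here are only illustrative:

/* tokens.h -- shared between the lexer and the parser */
#define IDENTIFIER      257
#define NUMBER          258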

Lex, by default, reads from standard input. You can point it at another file quite easily; it's a bit harder to, for instance, read from a buffer. There is no completely standardized way to do this; the most portable solution is to open a temporary file, write data to it, and hand it to the lexer. Here's a sample bit of code to do this:

Listing 2. Handing a memory buffer to the lexer

/* Assumes <stdio.h>, <string.h>, <unistd.h>, <stdlib.h>, and <errno.h>
   have been included; yyin is the lexer's input stream (a FILE *). */
int
doparse(char *s) {
        char buf[16];
        int fd;

        if (!s) {
                return 0;
        }
        /* mkstemp() requires at least six trailing X's in its template */
        strcpy(buf, "/tmp/lex.XXXXXX");
        fd = mkstemp(buf);
        if (fd < 0) {
                fprintf(stderr, "couldn't create temporary file: %s\n",
                        strerror(errno));
                return 0;
        }
        unlink(buf);
        write(fd, s, strlen(s));
        lseek(fd, 0, SEEK_SET);
        yyin = fdopen(fd, "r");
        yylex();
        fclose(yyin);
        return 1;
}

This code carefully unlinks the temporary file, leaving it open but already removed, to clean up automatically. A more careful programmer, or one not writing for an audience with limited space for sample code, might wish to consider the implications of the user's choice of TMPDIR.
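For what it's worth, honoring TMPDIR takes only a couple of extra lines, along these lines (a sketch, not part of the sample program; it needs <stdlib.h>, <stdio.h>, and <limits.h> for PATH_MAX):

const char *tmpdir = getenv("TMPDIR");
char path[PATH_MAX];

snprintf(path, sizeof(path), "%s/lex.XXXXXX", tmpdir ? tmpdir : "/tmp");
fd = mkstemp(path);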

Yacc: Yet another compiler compiler

So, you've broken your input into a stream of tokens. Now you need some way to recognize higher-level patterns. This is where yacc comes in: yacc lets you describe what you want to do with tokens. A yacc grammar tends to look sort of like this:

Listing 3. A simple yacc grammar

value:
          VARIABLE
        | NUMBER

expression:
          value '+' value
        | value '-' value

This means that an expression can take any of several forms; for instance, a variable, a plus sign, and a number could be an expression. The pipe character (|) indicates alternatives. The symbols produced by the lexer are called terminals or tokens. The things assembled from them are called non-terminals. So, in this example, NUMBER is a terminal; the lexer produces it. By contrast, value is a non-terminal, created by assembling terminals (and, in general, other non-terminals) according to the rules.

Yacc files, like lex files, come in sections separated by %% markers. As with a lex file, a yacc file comes in three sections, the last of which is optional, and contains just plain C code to be incorporated in the generated file.

Yacc can recognize patterns of tokens; for instance, as in the example above, it can recognize that an expression can consist of a value, either a plus or minus sign, and another value. It can also take actions; blocks of code enclosed in {} pairs will be executed when the parser reaches that point in an expression. For instance, one might write:

expression:
          value '+' value       { printf("Matched a '+' expression.\n"); }

The first section of a yacc file defines the objects that the parser will manipulate and generate. In some cases, it could be empty, but most often, it will contain at least a few %token directives. These directives are used to define the tokens the lexer can return. When yacc is run with the -d option, it generates a header file defining constants. Thus, our earlier lex example using:

({ALPHA}|_){IDENT}* { return IDENTIFIER; }

might have a corresponding yacc grammar containing the line:

%token IDENTIFIER

Yacc would create a header file (called y.tab.h by default) containing a line something like:

#define IDENTIFIER 257

These numbers will be outside the range of possible valid characters; thus, the lexer can return individual characters as themselves, or special tokens using these defined values. This can create a problem when porting code from one system to another: generally, you are best off re-running lex and yacc on a new platform, rather than porting the generated code.

By default, the yacc-generated parser will start by trying to parse an instance of the first rule found in the rules section of the file. You can change this by specifying a different rule with %start, but most often it's more reasonable to put the top-level rule at the top of the section.
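For example, if the top-level rule were named input (an arbitrary name), the declaration would read:

%start input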

Tokens, types, and values, oh my!

The next question is how to do anything with the components of an expression. The general way this is done is to define a data type to contain objects that yacc will be manipulating. This data type is a C union object, and is declared in the first section of a yacc file using a %union declaration. When tokens are defined, they can have a type specified. For a toy programming language, for instance, you might do something like this:

Listing 4. A minimalist %union declaration

%union {
        long value;
}

%token <value> NUMBER
%type <value> expression

This indicates that, whenever the parser gets a NUMBER token returned by the lexer, it can expect that the member named value of the global variable yylval has been filled in with a meaningful value. Of course, your lexer has to handle this in some way:

Listing 5. Making a lexer use yylval

[0-9]+  {
                yylval.value = strtol(yytext, 0, 10);
                return NUMBER;
        }

Yacc allows you to refer to the components of an expression by symbolic names. When a non-terminal is being parsed, the components that went into it are named $1, $2, and so on; the value it will pass back up to a higher-level parser is called $$. For instance:

Listing 6. Using variables in yacc

expression:
          NUMBER '+' NUMBER     { $$ = $1 + $3; }

Note that the literal plus sign is $2 and has no meaningful value, but it still takes up a place. There's no need to specify a "return" or anything else: just assign to the magic name $$. The %type declaration specifies that an expression non-terminal also uses the value member of the union.

In some cases, it may be useful to have multiple types of objects in the %union declaration; at this point, you have to make sure that the types you declare in %type and %token declarations are the ones you really use. For instance, if you have a %union declaration that includes both integer and pointer values, and you declare a token to use the pointer value, but your lexer fills in the integer value... Bad Things can happen.
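For example, a %union with two members might be declared like this; the point is simply that the member named in each %token or %type declaration has to be the one the lexer and the actions actually fill in (the token names are illustrative):

%union {
        long value;
        char *string;
}

%token <value>  NUMBER
%token <string> IDENTIFIER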

Of course, this doesn't solve one last problem: the starting non-terminal expression has nowhere to return to, so you will need to do something with the values it produces. One way to do this is to make sure that you do all the work you need to do as you go; another is to build a single giant object (say, a linked list of items) and assign a pointer to it into a global variable at the end of the starting rule. So, for instance, if you were to build the above expression parser into a general-purpose calculator, just writing the parser would leave you with a program that very carefully parsed expressions, then did nothing at all with them. This is academically very interesting and it has a certain aesthetic appeal, but it's not particularly useful.
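In yacc terms, the second approach is just one extra action on the start rule; a sketch, where expr_list and parse_result are hypothetical names for the list non-terminal and the global variable that receives it:

input:
          expr_list             { parse_result = $1; }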

This basic introduction should leave you qualified to fool around a little with lex and yacc on your own time; perhaps building a simple calculator that does do something with the expressions it parses. One thing you'll find if you play around a bit is that lex and yacc can be confusing to troubleshoot. In the next installment, we'll look at troubleshooting techniques, and we'll build a larger and more powerful parser for a real-world task.

Resources

• Part 2 of this series covers a couple of concrete applications of lex and yacc, showing how to build a simple calculator and a program to parse e-mail messages.

• You can find the sample code used for this article posted at Peter's Web site.

• The Lex & Yacc Page has a number of interesting historical references, as well as very good lex and yacc documentation.

• The GNU versions of lex and yacc are flex and bison and, as with all things GNU, have excellent documentation including complete user manuals in a variety of formats.

• John Levine's book lex & yacc, 2nd Edition (O'Reilly & Associates, 1992) remains the definitive resource.

• Jumpstart your yacc...and lex too! (developerWorks, November 2000) gives some more introductory background on everyone's favorite lexer/parser combo, and also has an example of a wordcount program.

• Using SDL, Part 4: lex and yacc (developerWorks, May 2000) of the Pirates Ho! game-writing saga shows how the authors put together a configuration file parser for the game using lex and yacc.

• Find more resources for Linux developers in the developerWorks Linux zone.

• Purchase Linux books at discounted prices in the Linux section of the Developer Bookstore.

• Develop and test your Linux applications using the latest IBM tools and middleware with a developerWorks Subscription: you get IBM software from WebSphere, DB2, Lotus, Rational, and Tivoli, and a license to use the software for 12 months, all for less money than you might think.

• Download no-charge trial versions of selected developerWorks Subscription products that run on Linux, including WebSphere Studio Site Developer, WebSphere SDK for Web services, WebSphere Application Server, DB2 Universal Database Personal Developers Edition, Tivoli Access Manager, and Lotus Domino Server, from the Speed-start your Linux app section of developerWorks. For an even speedier start, help yourself to a product-by-product collection of how-to articles and tech support.

Build code with lex and yacc, Part 2: Development and troubleshooting

Get started actually building something

Level: Introductory

Peter Seebach ([email protected])

Freelance writer

24 Aug 2004

The second article of this two-part series explores more advanced lex/yacc development and introduces basic troubleshooting techniques. See e-mail headers parsed before your very eyes! Marvel at cryptic error messages! See a computer actually compute something!

Building on the ground we covered in Part 1 of this two-part series, this article covers a couple of concrete applications of lex and yacc and discusses some common pitfalls. The examples -- a simple calculator and a program to parse e-mail messages -- are simple enough to follow easily, but still show the tools doing real work.

Lex-only code

While lex can be used to generate sequences of tokens for a parser, it can also be used to perform simple input operations. A number of programs have been written to convert simple English text into dialects using only lex, for instance. The most famous is probably the classic Swedish Chef translator, widely distributed on the Internet. These work by recognizing simple bits and pieces of patterns, and acting on them immediately. A good lexer example can help a lot with learning how to write a tokenizer. A link to the Swedish Chef program (sources are available, so you can see exactly how it's done) is in the Resources section of this article.

Most simple programming projects, of course, can get by with very trivial lexers. A bit of care put into designing a language can help simplify this substantially, by guaranteeing that tokens can be recognized reliably without needing to know about context. Of course, not all languages are this congenial; for instance, lex has a hard time with C string constants, where special characters modify other characters.

Toy example: A calculator

Here's a simple example of a program that can be written in a page of yacc and a half-page of lex code. It illustrates a few interesting points, and it's even sort of handy... occasionally.

Listing 1. Lexer for a simple calculator

%{
#include "calc.h"
#include "y.tab.h"
%}

NUMBER  [0-9]+
OP      [-+*/]

%%

{NUMBER}        { yylval.value = strtol(yytext, 0, 10); return NUMBER; }
({OP}|\n)       { return yytext[0]; }
.               { ; }

All this does is match numbers, operators, and newlines. Numbers have the special token NUMBER returned; everything else is just returned as plain characters. Unmatched characters are silently ignored.

Listing 2. Parser for a simple calculator

%{
#include <stdio.h>
#include "calc.h"
%}

%union {
        long value;
}

%type <value> expression add_expr mul_expr
%token <value> NUMBER

%%

input:
          expression
        | input expression

expression:
          expression '\n'               { printf("%ld\n", $1); }
        | expression '+' add_expr       { $$ = $1 + $3; }
        | expression '-' add_expr       { $$ = $1 - $3; }
        | add_expr                      { $$ = $1; }

add_expr:
          add_expr '*' mul_expr         { $$ = $1 * $3; }
        | add_expr '/' mul_expr         { $$ = $1 / $3; }
        | mul_expr                      { $$ = $1; }

mul_expr:
          NUMBER                        { $$ = $1; }

This parser illustrates one way in which you can handle precedence and associativity without explicitly using precedence declarations (such as %prec) to guide yacc in resolving conflicts. If looser-binding operators take as operands expressions built from tighter-binding operators, precedence is handled automatically. This is how, for instance, the C standard defines its formal grammar.
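The other common approach, which this parser deliberately avoids, is to declare precedence and associativity up front and let yacc resolve the resulting conflicts itself. A minimal sketch of what that looks like (just the relevant fragment, not a drop-in replacement for Listing 2):

%left '+' '-'
%left '*' '/'

%%

expression:
          expression '+' expression     { $$ = $1 + $3; }
        | expression '-' expression     { $$ = $1 - $3; }
        | expression '*' expression     { $$ = $1 * $3; }
        | expression '/' expression     { $$ = $1 / $3; }
        | NUMBER                        { $$ = $1; }

Because '*' and '/' are declared after '+' and '-', they bind more tightly, and %left makes all four operators left-associative.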

Note that, when a newline is seen after an expression, the value of the expression is simply printed. No long-term storage is required, no variables are modified. This works fine for a quick desktop calculator.

Note also the use of left recursion. Yacc prefers left-recursion. You can write the rule for "input" just as correctly as:

input:
          expression
        | expression input

However, if you do this, each expression you write will cause another layer of recursion in the parser; as written originally, each expression is reduced to an input object before the next one needs to be parsed. Left-recursion is more efficient and, for large input sets, may be the only viable option.

Concrete example: Parsing e-mail

One likely task is reading a file containing e-mail messages and extracting their headers and contents. This is an interesting example, because the rules for reading headers are complicated enough to make it more practical to use start states for some of them. The lexer actually does some of the work of figuring out where in a message it is, but the parser still ties everything together.

The sample program can read individual messages, or the Berkeley "mbox" format. It uses a line starting with "From " as a message separator; this could be adapted, but it works well enough.
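In lex terms, spotting that separator can be as simple as an anchored rule along these lines (a sketch; the sample program's actual rule may differ):

^"From ".*\n            { return FROMLINE; }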

The support code for this is fairly straightforward. The "sz" type is a code convenience, a string library that hides all the work of reallocating strings when they grow. The structures are fairly minimalist and not very well generalized; this is to make the code smaller and easier to read. Real code would be a bit more general.

The grammar is mostly fairly simple. A file consists of one or more messages. Left recursion is used so that the rule can be reduced after each message. A message consists of a header, an empty line, and a body. The body is easy enough to read: a bunch of lines. You may note that the body has no explicit terminator. Rather, the first thing that isn't part of the body of a message must match the beginnings of the next possible object; a bit of reading back and forth reveals that it must be part of a header. Since only LINE tokens can occur in a body, the next other token that shows up will be part of a header -- or a syntax error. As a special case (implicit in yacc, rather than stated in the grammar), a 0 token indicates that there are no more tokens, and ends parsing. If the start rule (in this case, file) has successfully reduced, the parse is considered successful.

The header grammar is fairly complicated, although not as complicated as it could be. A header may start with a headerline, or with a FROMLINE token. The headerline non-terminal, in turn, requires a header name, a header body, and optional continuations. Once again, the continuation lines are handled using left recursion to keep things efficient. Note that, in theory, this grammar would accept a header with multiple "From " lines in it; this could be fixed, but there's no obvious reason to worry about it.
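In outline, that part of the grammar looks something like this sketch; the rule and token names follow the description above, and the sample program's actual rules may differ in detail:

file:
          message
        | file message

message:
          header '\n' body

header:
          headerline
        | FROMLINE
        | header headerline

headerline:
          HEADERNAME headerbody continuations

continuations:
          /* empty */
        | continuations CONTINUATION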

The lexer for this example is more complicated than the one for the calculator; quite a bit so. It uses start states -- a feature allowing the lexer to match some rules only some of the time. The two states it uses are named BODY and HEADER. The HEADER state is used when parsing the value part of a given name/value pair, not for parsing the entire message header. The reason that there are HEADER and BODY rules for matching the LINE pattern is to make sure that the lexer identifies header names and continuations before it starts just handing back full lines that the parser would then have to analyze to decide what to do with them.

This does mean that the lexer and the parser both have to know that an empty line separates a header from a body. Similarly, the lexer has to know about the structure of continuations; the parser only knows that it sometimes gets additional text to add on to a header.

In this program, the parser uses global variables to track the list of messages parsed. This is how the test program gets access to the list of messages; the call to yyparse() returns only a success or failure indication, not any objects it may have created. (The return value is zero for a successful parse, and non-zero otherwise.)
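So the driver ends up looking roughly like this (the names here are hypothetical; the sample program differs only in detail):

#include <stdio.h>

extern int yyparse(void);
extern struct message *messages;        /* hypothetical: filled in by the parser's actions */

int main(void) {
        if (yyparse() != 0) {
                fprintf(stderr, "parse failed\n");
                return 1;
        }
        /* walk the messages list and use each header and body here */
        return 0;
}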

Troubleshooting common problems

Lex and yacc are very good at parsing reasonably simple file formats. The biggest limitation is that yacc is limited to one token of lookahead: if the parser would have to read two tokens before it knew which action to take, yacc cannot handle the grammar. Consider this example:

Listing 3. Uncle Shelby's ABZ, as a yacc grammar

a_f:
          'a' 'b' 'c' 'd' 'e' 'f'       { printf("alphabet\n"); }
        | 'a' 'b' 'z' 'd' 'e' 'f'       { printf("Silverstein\n"); }

In this case, yacc can tell, by the time it has to take an action, which alphabet it's parsing -- the regular one, or Uncle Shelby's ABZ. On the other hand, if you wrote the rules like this, it wouldn't work:

Listing 4. An unparsable grammar

a_f:
          'a'   { printf("alphabet\n"); }       'b' 'c' 'd' 'e' 'f'
        | 'a'   { printf("Silverstein\n"); }    'b' 'z' 'd' 'e' 'f'

In this case, the only token yacc would have available to decide which of these two actions to take would be the letter "b," which is the same in both rules. This grammar is ambiguous; yacc can't figure out what's going on.

Yacc can look ahead one character, though. So:

Listing 5. A parsable grammar

a_f:
          'a' 'b'       { printf("alphabet\n"); }       'c' 'd' 'e' 'f'
        | 'a' 'b'       { printf("Silverstein\n"); }    'z' 'd' 'e' 'f'

This version is fine. When yacc gets the "c" or "z," it knows enough to determine which rule to use. This is normally enough flexibility to handle reasonably simple file formats.

Conflict resolution

Most people eventually run into either a shift/reduce conflict or a reduce/reduce conflict. The first is less problematic, the second generally more severe. The most well-known example of a shift/reduce conflict is the "dangling else" ambiguity in C.

Imagine a yacc grammar something like this:

Listing 6. Dangling else

statement if '(' condition ')' statement

| if '(' condition ')' statement else statement

| printf '(' string ')' ';'

This is fairly similar to the actual C grammar, if perhaps a bit simplified. Now, consider what happens to this in the case where a conditional statement is nested within another conditional statement:

Listing 7. Nested conditional

if (a)
        if (b)
                printf("a and b\n");
else
        printf("???\n");

Does the else token get attached to the inner if (if (b)) or the outer one? Both are possible. Either way, there's one instance of the if/else construct, and one plain if. This is called a shift/reduce conflict. When yacc sees the else token, it can either reduce the sequence (from if (b) through the semicolon) into a statement (already having reduced the whole printf line into a statement), or it can shift the else token onto it, making it into the first half of a longer pattern. In fact, yacc's interpretation will be this one:

Listing 8. The dangling else, resolved

if (a)
        if (b)
                printf("a and b\n");
        else
                printf("a and not-b\n");

The default behavior is to prefer a shift over a reduction, and this is most often what you want. This behavior can be enforced by setting the precedence of the different patterns. Another option is to design your grammar to exclude this ambiguity: for instance, in Perl, the braces around the body of an if statement are not optional, making it easy to tell whether the else is part of the inner if statement or the outer one.
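If you would rather state the intent explicitly and silence the warning, the usual idiom is to give the else token higher precedence than an if with no else. A sketch, with hypothetical token names:

%nonassoc IF_WITHOUT_ELSE
%nonassoc ELSE

%%

statement:
          IF '(' condition ')' statement %prec IF_WITHOUT_ELSE
        | IF '(' condition ')' statement ELSE statement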

Lexers can be problematic too. A common problem is having a rule match too much text; worse, since lex prefers the longest match it can find, this can result in a totally inappropriate rule being matched. Start states can help with this a lot. A more powerful tool is exclusive states, in which only rules matching the exclusive state can be matched (this is not supported in all versions of lex, but should be available in anything modern).

If your version of lex lacks exclusive states, you can qualify every rule with states, and switch between them. Use a %{...%} block to BEGIN the state you want to use as a default state, and it'll work as though all other states were exclusive.
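For reference, here is what an exclusive state looks like in flex, using the classic skip-a-C-comment idiom; %x declares the state, and BEGIN switches into and out of it:

%x COMMENT

%%

"/*"                    { BEGIN(COMMENT); }
<COMMENT>"*/"           { BEGIN(INITIAL); }
<COMMENT>.|\n           { /* discard comment text */ }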

Multiple parsers and lexers

It used to be that using multiple parsers within a single program required a fair amount of careful tuning of the files generated by lex and yacc. Conveniently, this has been corrected: modern versions of both lex and yacc allow you to specify a prefix to use on the names of symbols in generated code. Just do that -- it's easier. In flex, the option is -Pprefix; in modern yacc, it's most often -p prefix. You can almost always arrange to use flex or bison, if the default lex and yacc available to you don't support this.
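For instance, assuming a second parser for configuration files built from config.l and config.y (hypothetical file names), the build might look something like this; afterwards the generated entry points are cfglex() and cfgparse() rather than yylex() and yyparse():

flex -Pcfg config.l
bison -p cfg -d -o cfg.tab.c config.y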

Going further

One of the best things you can do to learn more about lex and yacc is to write a few toy programs. Sample programs are also easy to find on the Internet: a bit of searching will find all sorts of cool stuff. A few of the links at the end of the article might help get you started.

Whenever possible, try to develop separate tests for your lexer and your parser. When you have a problem, knowing what went wrong is crucial; knowing whether it's the wrong token or an error in your grammar is a good starting point. A bit of work put into writing a good yyerror routine will help a lot. As a grammar gets more complicated, it gets more important to give clear and meaningful diagnostics of the errors encountered.
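Even a minimal yyerror can say where things went wrong. Here's a sketch; with flex, yylineno is only maintained if you ask for it with %option yylineno, and yytext holds the text of the offending token:

#include <stdio.h>

extern int yylineno;    /* flex: requires %option yylineno */
extern char *yytext;

void yyerror(const char *msg) {
        fprintf(stderr, "line %d: %s near '%s'\n", yylineno, msg, yytext);
}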

Resources

• Part 1 of this series introduces lex, yacc, flex, and bison; check it out before reading this installment.

• You can find the sample code used for this article posted at Peter's Web site.

• The Lex & Yacc Page, or "The asteroid to kill this dinosaur is still in orbit," has a number of interesting historical references as well as very good lex and yacc documentation.

• The GNU versions of lex and yacc are flex and bison. As with all things GNU, they have excellent documentation including complete user manuals in a variety of formats.

• John Levine's book lex & yacc, 2nd Edition (O'Reilly & Associates, 1992) remains the definitive resource.

• The beginning of the "ABZ" grammar used to illustrate the way to resolve conflicts and otherwise troubleshoot problems in lex and yacc comes from Uncle Shelby's ABZ's (Simon and Schuster, 1961) by Shel Silverstein.

• Bork bork bork! Chef/Jive/ValSpeak/Pig is a translator that speaks Swedish Chef, Valley Girl, Jive, and Pig Latin: written in pure lex code. Put in "lex and yacc are fun" and it comes out "lex und yecc ere-a foon" (The Muppet Show's Swedish Chef is the default).

• ANSI C yacc grammar is a reasonably complete C89 grammar, done in yacc. Complete with matching lexer. Write your own C compiler! Just needs a little work. :)

• Jumpstart your yacc...and lex too! (developerWorks, 2000) gives some more introductory background on everyone's favorite lexer/parser combo, and also has an example of a wordcount program.

• Using SDL, Part 4: lex and yacc of the Pirates Ho! game-writing saga shows how the authors put together a configuration file parser for the game using lex and yacc.

• Find more resources for Linux developers in the developerWorks Linux zone.

• Purchase Linux books at discounted prices in the Linux section of the Developer Bookstore.

• Develop and test your Linux applications using the latest IBM tools and middleware with a developerWorks Subscription: you get IBM software from WebSphere, DB2, Lotus, Rational, and Tivoli, and a license to use the software for 12 months, all for less money than you might think.

• Download no-charge trial versions of selected developerWorks Subscription products that run on Linux, including WebSphere Studio Site Developer, WebSphere SDK for Web services, WebSphere Application Server, DB2 Universal Database Personal Developers Edition, Tivoli Access Manager, and Lotus Domino Server, from the Speed-start your Linux app section of developerWorks. For an even speedier start, help yourself to a product-by-product collection of how-to articles and tech support.

About the author

Peter Seebach first encountered lex in the form of the Swedish Chef program and has been a fascinated student ever since. He's used lex and yacc to develop several toy languages for various purposes and only regrets it a little. You can reach him at [email protected].

http://www.ibm.com/developerworks/