
The Art Of Disassembly

http://aod.anticrack.de and http://board.anticrack.de

A project by: Zero, CuTedEvil, Crick

CHAPTER 0 Welcome To The Art Of Disassembly


What is AoD - The Art of Disassembly?

The Art of Disassembly sees itself as a handbook for writing a disassembler.

Is this book like "Art of Assembly"? No. We really do not want to mess with this great free ebook.

So this book is very long. Did we really write it all alone? No, not really. Especially for the theoretical part we have included some of the best articles we found. Among them are the SEH article, the PE tutorial and "how to build a disassembler". A very long addition I have made is the chapter "Let's build a compiler", which alone runs to 350 pages. There was really no need to "rewrite" these articles.

For the theoretical sources we included some good code snippets we found on the web and at http://board.win32asmcommunity.net.

Of course we respect the work of these authors and do not claim their work as ours. You should do the same. Therefore we always place a footnote naming the author and/or the location where we found the article or source.

The practical part of the disassembler was developed during an online course/discussion at http://board.anticrack.de and has many contributors. We hope to mention them all.

In the end we hope that this book will be a complete handbook for building a disassembler in assembly language for the win32 environment.

As development language we decided to use MASM32v7 because it is free and well supported at http://board.win32asmcommunity.net. But of course you can use (after reading this book) TASM, NASM, FASM or whatever.

How much will this book cost and where can you get a paperback version ?

First, this book will always be free. There is no need to pay for it.

Second, there will never be a printed version. Like Art of Assembly, this book is only available in PDF format. Of course we will never sell articles made by others, even though we have included them here.


Are we allowed to include the articles made by others ?

Yes, if we respect the authors, add footnotes naming them and never claim their work as ours. Please see this book as an academic publication meant to increase knowledge.

Zero - Main Author

CuTedEvil - Main Coder

Crick - Main Coder




Licensing

No Licensing.

No Freeware or shareware or whatever.

No copyright, copyleft, copytop or copydown… just a little copycenter :D

For included articles not by us please respect their copyrights !

This document IS ABSOLUTELY FREE !!!

So we call it learn-ware.

You are allowed to do with this document whatever you want, as long as you keep it as it is. Do not extract parts and pass them off as a full tutorial. People are always happy when they find the full document.

So you are allowed to teach your grandma, print this document or take it into a pub and place your beer on it.


Disclaimer

- We disclaim ourselves.

- We are not responsible for damage to your computer when you use the information described in this book.

- We are not responsible if you lose your hair as a result of this heavy and complex material.

- We are not responsible for whatever you do with the knowledge you gain from here.

- We are not responsible for the contents of articles by other people that we have included.




The Software You Really Need !

There are some tools we will need for writing our disassembler. All the tools you need to download are free and can be used for our disassembler without licensing.

- MASM32v7 package as our assembler

- RadAsm as IDE for development

- OllyDbg for debugging

Anyway you may need some more links to get informed:

- http://aod.anticrack.de
This is the main site for this project and this document. Check it to get the latest release. We will offer links to all necessary tools as well as a change-log of this document.

- http://www.anticrack.de
All the information you need for coding assembly and reverse-engineering.

- http://board.anticrack.de
This is the place where our disassembler and this document are developed. Please check out the disassembler forum!

- http://board.win32asmcommunity.net
The main place for asking questions about MASM and assembly coding. No reverse-engineering topics here please!

- http://www.cs.vu.nl/~dick/PTAPG.html
Parsing Techniques - A Practical Guide by Dick Grune and Ceriel J.H. Jacobs

That's it!

Free tools, a free book and a free mind will take us on the road of wasting time…



CHAPTER 1 Basic knowledge you need

First we need to have a look at some very important basic knowledge. It is very important that you understand the first lesson before you start coding a disassembler.

In this chapter we will first take a short journey into the PE file structure of win32 applications. Then we will discuss some coding techniques which can make your life easier when you are coding the disassembler engine. So we will have a look at modular and procedural coding, and after this we will have a short overview of object oriented programming (OOP). This is no handbook for good coding, so you should already know some parts of these coding concepts. Therefore we will not go into deep detail on these topics. Next we will discuss linked lists and tree/graph structures. Linked lists are especially important when you load a file into memory and want to parse it. Combining linked lists with OOP can be a very powerful tool. After understanding this we have a look at parsing problems and how to loop through the bytes in memory. At the end of this chapter we will need to look at the opcode and mnemonic concept, which is one of the foundations of our disassembler.

This chapter is for very inexperienced users and should give you the background knowledge you need to build your own disassembler engine.



Lesson 1 - A little journey into the PE-Filestructure

The PE header is the most important thing you have to understand. It defines the structure of a normal (PE) file in the win32 environment.

When you are coding a disassembler you have to play with it. You need to detect whether it has a valid structure, inspect the different sections, look into the import and export tables and find the entry point of the application. The next lessons are the original tutorials by Iczelion. They are the best you can find for getting a good overview of the PE file structure in a win32 assembly environment.

There was really no need to write our own PE tutorial. Most beginning assembly coders have learned from these tutorials. We respect the work of Iczelion and you should do the same!



Overview of the PE-File format [1]

PE stands for Portable Executable. It's the native file format of Win32. Its specification is derived somewhat from the Unix COFF (Common Object File Format). The meaning of "portable executable" is that the file format is universal across win32 platforms: the PE loader of every win32 platform recognizes and uses this file format even when Windows is running on CPU platforms other than Intel. It doesn't mean your PE executables would be able to port to other CPU platforms without change. Every win32 executable (except VxDs and 16-bit DLLs) uses the PE file format. Even NT's kernel mode drivers use the PE file format. Thus studying the PE file format gives you valuable insights into the structure of Windows.

Let's jump into the general outline of the PE file format without further ado.

[1] This is the original tutorial by Iczelion.




The listing below shows the general layout of a PE file. All PE files (even 32-bit DLLs) must start with a simple DOS MZ header. We usually aren't much interested in this structure. It's provided for the case when the program is run from DOS, so DOS can recognize it as a valid executable and can thus run the DOS stub which is stored next to the MZ header. The DOS stub is actually a valid EXE that is executed in case the operating system doesn't know about the PE file format. It can simply display a string like "This program requires Windows" or it can be a full-blown DOS program, depending on the intent of the programmer. We are also not very interested in the DOS stub: it's usually provided by the assembler/compiler. In most cases, it simply uses int 21h, service 9 to print a string saying "This program cannot run in DOS mode".

DOS MZ header

DOS stub

PE header

Section table

Section 1

Section 2

Section ...

Section n



After the DOS stub comes the PE header. The PE header is a general term for the PE-related structure named IMAGE_NT_HEADERS. This structure contains many essential fields that are used by the PE loader. You will become quite familiar with it as you learn more about the PE file format. When the program is executed in an operating system that knows about the PE file format, the PE loader can find the starting offset of the PE header from the DOS MZ header. Thus it can skip the DOS stub and go directly to the PE header, which is the real file header.

The real content of the PE file is divided into blocks called sections. A section is nothing more than a block of data with common attributes such as code/data, read/write etc. You can think of a PE file as a logical disk. The PE header is the boot sector and the sections are files on the disk. The files can have different attributes such as read-only, system, hidden, archive and so on. I want to make it clear from this point onwards that the grouping of data into a section is done on the basis of common attributes, not on a logical basis. It doesn't matter how the code/data are used: if the data/code in the PE file have the same attributes, they can be lumped together in a section. You should not think of a section as "data", "code" or some other logical concept: sections can contain both code and data provided that they have the same attributes. If you have a block of data that you want to be read-only, you can put that data in a section that is marked as read-only. When the PE loader maps the sections into memory, it examines the attributes of the sections and gives the memory blocks occupied by the sections the indicated attributes.

If we view the PE file format as a logical disk, the PE header as the boot sector and the sections as files, we still don't have enough information to find out where the files reside on the disk, i.e. we haven't discussed the directory equivalent of the PE file format. Immediately following the PE header is the section table, which is an array of structures. Each structure contains the information about one section in the PE file such as its attributes, its file offset and its virtual offset. If there are 5 sections in the PE file, there will be exactly 5 members in this structure array. We can then view the section table as the root directory of the logical disk. Each member of the array is equivalent to a directory entry in the root directory.




That's all about the physical layout of the PE file format. I'll summarize the major steps in loading a PE file into memory below:

1. When the PE file is run, the PE loader examines the DOS MZ header for the offset of the PE header. If found, it skips to the PE header.

2. The PE loader checks if the PE header is valid. If so, it goes to the end of the PE header.

3. Immediately following the PE header is the section table. The PE loader reads information about the sections and maps those sections into memory using file mapping. It also gives each section the attributes specified in the section table.

4. After the PE file is mapped into memory, the PE loader concerns itself with the logical parts of the PE file, such as the import table.

The above steps are an oversimplification and are based on my own observation. There may be some inaccuracies, but they should give you a clear picture of the process. You should download LUEVELSMEYER's description of the PE file format. It's very detailed and you should keep it as a reference.



Detecting a valid PE-File [2]

Theory

How can you verify if a given file is a PE file? That question is difficult to answer. It depends on the lengths you want to go to. You can verify every data structure defined in the PE file format, or you can be satisfied with verifying only the crucial ones. Most of the time, it's pretty pointless to verify every single structure in the file. If the crucial structures are valid, we can assume that the file is a valid PE. And we will use that assumption.

The essential structure we will verify is the PE header itself. So we need to know a little about it, programmatically. The PE header is actually a structure called IMAGE_NT_HEADERS. It has the following definition:

IMAGE_NT_HEADERS STRUCT
    Signature dd ?
    FileHeader IMAGE_FILE_HEADER <>
    OptionalHeader IMAGE_OPTIONAL_HEADER32 <>
IMAGE_NT_HEADERS ENDS

Signature is a dword that contains the bytes 50h, 45h, 00h, 00h. In more human terms, it contains the text "PE" followed by two terminating zeroes. This member is the PE signature, so we will use it to verify whether a given file is a valid PE one.

FileHeader is a structure that contains information about the physical layout of the PE file, such as the number of sections, the machine the file is targeted at and so on.

OptionalHeader is a structure that contains information about the logical layout of the PE file. Despite the "Optional" in its name, it's always present.





Our goal is now clear. If the value of the Signature member of IMAGE_NT_HEADERS is equal to "PE" followed by two zeroes, then we treat the file as a valid PE. In fact, for comparison purposes, Microsoft has defined a constant named IMAGE_NT_SIGNATURE which we can readily use.

IMAGE_DOS_SIGNATURE equ 5A4Dh
IMAGE_OS2_SIGNATURE equ 454Eh
IMAGE_OS2_SIGNATURE_LE equ 454Ch
IMAGE_VXD_SIGNATURE equ 454Ch
IMAGE_NT_SIGNATURE equ 4550h
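The constants can look reversed next to the text they stand for. The following small sketch, written in Python rather than MASM so it can be tried on any machine, shows how the two-character signatures turn into these values when the bytes are read as little-endian numbers. It is only an illustration and not part of the original tutorial sources.

```python
import struct

# "MZ" read as a little-endian 16-bit word
IMAGE_DOS_SIGNATURE = struct.unpack("<H", b"MZ")[0]

# "PE" followed by two zero bytes, read as a little-endian 32-bit dword
IMAGE_NT_SIGNATURE = struct.unpack("<I", b"PE\x00\x00")[0]

print(hex(IMAGE_DOS_SIGNATURE))  # 0x5a4d
print(hex(IMAGE_NT_SIGNATURE))   # 0x4550
```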

The next question: how can we know where the PE header is? The answer is simple: the DOS MZ header contains the file offset of the PE header. The DOS MZ header is defined as the IMAGE_DOS_HEADER structure. You can check it out in windows.inc. The e_lfanew member of the IMAGE_DOS_HEADER structure contains the file offset of the PE header.

The steps are now as follows:

1. Verify that the given file has a valid DOS MZ header by comparing the first word of the file with the value IMAGE_DOS_SIGNATURE.

2. If the file has a valid DOS header, use the value in the e_lfanew member to find the PE header.

3. Compare the first dword of the PE header with the value IMAGE_NT_SIGNATURE. If both values match, then we can assume that the file is a valid PE.
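Before the MASM listing, the three steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it works on a byte buffer already read from disk, it uses an explicit bounds check on e_lfanew instead of exception handling, and the names is_valid_pe, make_fake_image and E_LFANEW_OFFSET are made up for this illustration. The offset 3Ch is the position of e_lfanew inside the 64-byte IMAGE_DOS_HEADER.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D   # "MZ"
IMAGE_NT_SIGNATURE = 0x4550    # "PE" followed by two zero bytes
E_LFANEW_OFFSET = 0x3C         # position of e_lfanew inside IMAGE_DOS_HEADER


def is_valid_pe(data: bytes) -> bool:
    """Return True if data starts with an MZ header that points to a PE signature."""
    # Step 1: the DOS header must be complete and start with "MZ".
    if len(data) < E_LFANEW_OFFSET + 4:
        return False
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != IMAGE_DOS_SIGNATURE:
        return False
    # Step 2: e_lfanew holds the file offset of the PE header.
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if e_lfanew + 4 > len(data):
        return False  # the offset points outside the file
    # Step 3: the first dword of the PE header must be the PE signature.
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    return signature == IMAGE_NT_SIGNATURE


def make_fake_image(e_lfanew: int = 0x80) -> bytes:
    """Build a tiny synthetic buffer holding only the fields checked above (test data)."""
    image = bytearray(e_lfanew + 4)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, E_LFANEW_OFFSET, e_lfanew)
    image[e_lfanew:e_lfanew + 4] = b"PE\x00\x00"
    return bytes(image)


print(is_valid_pe(make_fake_image()))       # True
print(is_valid_pe(b"MZ" + b"\x00" * 100))   # False: e_lfanew is 0 and points back at "MZ"
print(is_valid_pe(b"not an executable"))    # False: too short for a DOS header
```

Like the tutorial, this only checks the two signatures; it says nothing about whether the rest of the file is well formed.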



Example

.386

.model flat,stdcall

option casemap:none

include \masm32\include\windows.inc

include \masm32\include\kernel32.inc

include \masm32\include\comdlg32.inc

include \masm32\include\user32.inc

includelib \masm32\lib\user32.lib

includelib \masm32\lib\kernel32.lib

includelib \masm32\lib\comdlg32.lib

SEH struct

PrevLink dd ? ; the address of the previous seh structure

CurrentHandler dd ? ; the address of the exception handler

SafeOffset dd ? ; The offset where it's safe to continue execution

PrevEsp dd ? ; the old value in esp

PrevEbp dd ? ; The old value in ebp

SEH ends

.data

AppName db "PE tutorial no.2",0

ofn OPENFILENAME <>

FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0

db "All Files",0,"*.*",0,0

FileOpenError db "Cannot open the file for reading",0

FileOpenMappingError db "Cannot open the file for memory mapping",0

FileMappingError db "Cannot map the file into memory",0

FileValidPE db "This file is a valid PE",0

FileInValidPE db "This file is not a valid PE",0

.data?

buffer db 512 dup(?)

hFile dd ?

hMapping dd ?




pMapping dd ?

ValidPE dd ?

.code

start proc

LOCAL seh:SEH

mov ofn.lStructSize,SIZEOF ofn

mov ofn.lpstrFilter, OFFSET FilterString

mov ofn.lpstrFile, OFFSET buffer

mov ofn.nMaxFile,512

mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY

invoke GetOpenFileName, ADDR ofn

.if eax==TRUE

invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL

.if eax!=INVALID_HANDLE_VALUE

mov hFile, eax

invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0

.if eax!=NULL

mov hMapping, eax

invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0

.if eax!=NULL

mov pMapping,eax

assume fs:nothing

push fs:[0]

pop seh.PrevLink

mov seh.CurrentHandler,offset SEHHandler

mov seh.SafeOffset,offset FinalExit

lea eax,seh

mov fs:[0], eax

mov seh.PrevEsp,esp

mov seh.PrevEbp,ebp

mov edi, pMapping

assume edi:ptr IMAGE_DOS_HEADER

.if [edi].e_magic==IMAGE_DOS_SIGNATURE



add edi, [edi].e_lfanew

assume edi:ptr IMAGE_NT_HEADERS

.if [edi].Signature==IMAGE_NT_SIGNATURE

mov ValidPE, TRUE

.else

mov ValidPE, FALSE

.endif

.else

mov ValidPE,FALSE

.endif

FinalExit:

.if ValidPE==TRUE

invoke MessageBox, 0, addr FileValidPE, addr AppName, MB_OK+MB_ICONINFORMATION

.else

invoke MessageBox, 0, addr FileInValidPE, addr AppName, MB_OK+MB_ICONINFORMATION

.endif

push seh.PrevLink

pop fs:[0]

invoke UnmapViewOfFile, pMapping

.else

invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR

.endif

invoke CloseHandle,hMapping

.else

invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR

.endif

invoke CloseHandle, hFile

.else

invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR

.endif

.endif

invoke ExitProcess, 0




start endp

SEHHandler proc C uses edx pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD

mov edx,pFrame

assume edx:ptr SEH

mov eax,pContext

assume eax:ptr CONTEXT

push [edx].SafeOffset

pop [eax].regEip

push [edx].PrevEsp

pop [eax].regEsp

push [edx].PrevEbp

pop [eax].regEbp

mov ValidPE, FALSE

mov eax,ExceptionContinueExecution

ret

SEHHandler endp

end start



Analysis:

The program opens a file and checks whether the DOS header is valid. If it is, it checks whether the PE header is valid. If it is, then it assumes the file is a valid PE. In this example, I use structured exception handling (SEH) so that we don't have to check for every possible error: if a fault occurs, we assume that it's because the file is not a valid PE and therefore gave our program wrong information. Windows itself uses SEH heavily in its parameter validation routines. If you're interested in SEH, read the article by Jeremy Gordon.

The program displays an open file common dialog to the user and when the user chooses an executable file, it opens the file and maps it into memory. Before it goes on with the verification, it sets up a SEH:

assume fs:nothing

push fs:[0]

pop seh.PrevLink

mov seh.CurrentHandler,offset SEHHandler

mov seh.SafeOffset,offset FinalExit

lea eax,seh

mov fs:[0], eax

mov seh.PrevEsp,esp

mov seh.PrevEbp,ebp

We start by assuming the use of the fs register as nothing. This must be done because MASM assumes the use of the fs register to ERROR. Next we store the address of the previous SEH handler in our structure for use by Windows. We store the address of our SEH handler, the address where execution can safely resume if a fault occurs, and the current values of esp and ebp so that our SEH handler can get the state of the stack back to normal before it resumes the execution of our program.

mov edi, pMapping

assume edi:ptr IMAGE_DOS_HEADER

.if [edi].e_magic==IMAGE_DOS_SIGNATURE

After we are done with setting up SEH, we continue with the verification. We put the address of the first byte of the target file in edi, which is the first byte of the DOS header. For ease of comparison, we tell the assembler that it can assume edi as pointing to the IMAGE_DOS_HEADER structure (which is the truth). We then compare the first word of the DOS header with the string "MZ", which is defined in windows.inc as a constant named IMAGE_DOS_SIGNATURE. If the comparison is ok, we continue to the PE header. If not, we set the value in ValidPE to FALSE, meaning that the file is not a valid PE.

add edi, [edi].e_lfanew

assume edi:ptr IMAGE_NT_HEADERS

.if [edi].Signature==IMAGE_NT_SIGNATURE

mov ValidPE, TRUE

.else

mov ValidPE, FALSE

.endif

To get to the PE header, we need the value in e_lfanew of the DOS header. This field contains the file offset of the PE header, relative to the beginning of the file. Thus we add this value to edi and we get to the first byte of the PE header. This is the place where a fault may occur. If the file is really not a PE file, the value in e_lfanew will be incorrect and thus using it amounts to using a wild pointer. If we don't use SEH, we must check the value of e_lfanew against the file size, which is ugly. If all goes well, we compare the first dword of the PE header with the string "PE". Again there is a handy constant named IMAGE_NT_SIGNATURE which we can use. If the result of the comparison is true, we assume the file is a valid PE.



If the value in e_lfanew is incorrect, a fault may occur and our SEH handler will get control. It simply restores the stack pointer and the base pointer and resumes execution at the safe offset, which is at the FinalExit label.

FinalExit:

.if ValidPE==TRUE

invoke MessageBox, 0, addr FileValidPE, addr AppName, MB_OK+MB_ICONINFORMATION

.else

invoke MessageBox, 0, addr FileInValidPE, addr AppName, MB_OK+MB_ICONINFORMATION

.endif

The above code is simplicity itself. It checks the value in ValidPE and displays a message to the user accordingly.

push seh.PrevLink

pop fs:[0]

When the SEH is no longer used, we dissociate it from the SEH chain.




File-Header [3]

Let's summarize what we have learned so far:

- The DOS MZ header is called IMAGE_DOS_HEADER. Only two of its members are important to us: e_magic, which contains the string "MZ", and e_lfanew, which contains the file offset of the PE header.

- We use the value in e_magic to check if the file has a valid DOS header by comparing it to the value IMAGE_DOS_SIGNATURE. If both values match, we can assume that the file has a valid DOS header.

- In order to go to the PE header, we must move the file pointer to the offset specified by the value in e_lfanew.

- The first dword of the PE header should contain the string "PE" followed by two zeroes. We compare the value in this dword to the value IMAGE_NT_SIGNATURE. If they match, then we can assume that the PE header is valid.

We will learn more about the PE header in this tutorial. The official name of the PE header is IMAGE_NT_HEADERS. To refresh your memory, I show it below.

IMAGE_NT_HEADERS STRUCT

Signature dd ?

FileHeader IMAGE_FILE_HEADER <>

OptionalHeader IMAGE_OPTIONAL_HEADER32 <>

IMAGE_NT_HEADERS ENDS




Signature is the PE signature, "PE" followed by two zeroes. You already know and use this member.

FileHeader is a structure that contains the information about the physical layout/properties of the PE file in general.

OptionalHeader is also a structure that contains the information about the logical layout inside the PE file.

The most interesting information is in OptionalHeader. However, some fields in FileHeader are also important. We will learn about FileHeader in this tutorial so we can move on to study OptionalHeader in the next tutorials.

IMAGE_FILE_HEADER STRUCT

Machine WORD ?

NumberOfSections WORD ?

TimeDateStamp dd ?

PointerToSymbolTable dd ?

NumberOfSymbols dd ?

SizeOfOptionalHeader WORD ?

Characteristics WORD ?

IMAGE_FILE_HEADER ENDS




In summary, only three members are somewhat useful to us: Machine, NumberOfSections and Characteristics. You would normally not change the values of Machine and Characteristics, but you must use the value in NumberOfSections when you're walking the section table.

I'm jumping the gun here, but in order to illustrate the use of NumberOfSections, I need to digress briefly to the section table.

TABLE 1. The File-Header

Field Name: Meaning

Machine: The CPU platform the file is intended for. For the Intel platform, the value is IMAGE_FILE_MACHINE_I386 (14Ch). I tried to use 14Dh and 14Eh as stated in the pe.txt by LUEVELSMEYER but Windows refused to run it. This field is rarely of interest to us except as a quick way of preventing a program from being executed.

NumberOfSections: The number of sections in the file. We will need to modify the value in this member if we add or delete a section from the file.

TimeDateStamp: The date and time the file was created. Not useful to us.

PointerToSymbolTable: Used for debugging.

NumberOfSymbols: Used for debugging.

SizeOfOptionalHeader: The size of the OptionalHeader member that immediately follows this structure. Must be set to a valid value.

Characteristics: Contains flags for the file, such as whether this file is an exe or a dll.
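To make the field layout concrete, here is a hedged Python sketch that unpacks the seven members of IMAGE_FILE_HEADER from a byte buffer. The helper name parse_file_header and the hand-built sample bytes are assumptions made for this illustration; the sample is not taken from a real executable. The values 14Ch for IMAGE_FILE_MACHINE_I386 and 2000h for the DLL flag come from the Windows headers.

```python
import struct
from collections import namedtuple

# Two words, three dwords, two words: 20 bytes, in the order of IMAGE_FILE_HEADER.
FILE_HEADER_FORMAT = "<HHIIIHH"
FileHeader = namedtuple(
    "FileHeader",
    "Machine NumberOfSections TimeDateStamp PointerToSymbolTable "
    "NumberOfSymbols SizeOfOptionalHeader Characteristics",
)

IMAGE_FILE_MACHINE_I386 = 0x014C
IMAGE_FILE_DLL = 0x2000  # Characteristics flag: the file is a DLL


def parse_file_header(data: bytes, pe_offset: int) -> FileHeader:
    """Unpack IMAGE_FILE_HEADER, which starts 4 bytes after the PE signature."""
    fields = struct.unpack_from(FILE_HEADER_FORMAT, data, pe_offset + 4)
    return FileHeader._make(fields)


# Made-up sample: a PE signature followed by a hand-built file header.
sample = b"PE\x00\x00" + struct.pack(
    FILE_HEADER_FORMAT, IMAGE_FILE_MACHINE_I386, 3, 0, 0, 0, 0xE0, 0x010F
)
header = parse_file_header(sample, 0)
print(header.NumberOfSections)                        # 3
print(header.Machine == IMAGE_FILE_MACHINE_I386)      # True
print(bool(header.Characteristics & IMAGE_FILE_DLL))  # False
```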



The section table is an array of structures. Each structure contains the information about one section. Thus if there are 3 sections, there will be 3 members in this array. You need the value in NumberOfSections so you know how many members there are in the array. You would think that checking for a structure with all zeroes in its members would help. Windows does use this approach. You can verify this fact by setting the value in NumberOfSections to a value higher than the real value: Windows still runs the file without problem. From my observation, I think Windows reads the value in NumberOfSections and examines each structure in the section table. If it finds a structure that contains all zeroes, it terminates the search. Otherwise it continues until the number of structures specified in NumberOfSections is reached. Why can't we ignore the value in NumberOfSections? For several reasons. The PE specification doesn't say that the section table array must end with an all-zero structure. Thus there may be a situation where the last array member is contiguous with the first section, without any empty space at all. Another reason has to do with bound imports. The new-style binding puts its information immediately after the last member of the section table array. Thus you still need NumberOfSections.




Optional Headers [4]

We have learned about the DOS header and some members of the PE header. Here's the last, the biggest and probably the most important member of the PE header, the optional header.

To refresh your memory, the optional header is a structure that is the last member of IMAGE_NT_HEADERS. It contains information about the logical layout of the PE file. There are 31 fields in this structure. Some of them are crucial and some are not useful. I'll explain only those fields that are really useful.

There is a term that's used frequently in relation to the PE file format: RVA.

RVA stands for relative virtual address. You know what a virtual address is. RVA is a daunting term for such a simple concept. Simply put, an RVA is a distance from a reference point in the virtual address space. I bet you're familiar with file offsets: an RVA is exactly the same thing as a file offset. However, it's relative to a point in the virtual address space, not to a file. I'll show you an example. If a PE file loads at 400000h in the virtual address (VA) space and the program starts execution at the virtual address 401000h, we can say that the program starts execution at RVA 1000h. An RVA is relative to the starting VA of the module.

Why does the PE file format use RVAs? It's to help reduce the load on the PE loader. Since a module can be relocated anywhere in the virtual address space, it would be hell for the PE loader to fix every relocatable item in the module. In contrast, if all relocatable items in the file use RVAs, there is no need for the PE loader to fix anything: it simply relocates the whole module to a new starting VA. It's like the concept of relative path and absolute path: an RVA is akin to a relative path, a VA is like an absolute path.
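The arithmetic behind the example is a single subtraction or addition. The two helper names below are made up for this illustration; the first call repeats the numbers from the text, and the second uses a different, arbitrarily chosen base address to show that the same RVA stays valid when the module is loaded elsewhere.

```python
def va_to_rva(va: int, image_base: int) -> int:
    """Distance of a virtual address from the start of the module."""
    return va - image_base


def rva_to_va(rva: int, image_base: int) -> int:
    """Virtual address of an RVA once the module base is known."""
    return image_base + rva


# The example from the text: module at 400000h, execution starts at 401000h.
print(hex(va_to_rva(0x401000, 0x400000)))   # 0x1000

# The same RVA in a module that was loaded at another base (arbitrary value).
print(hex(rva_to_va(0x1000, 0x10000000)))   # 0x10001000
```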




TABLE 2. Optional Header

Field: Meaning

AddressOfEntryPoint: It's the RVA of the first instruction that will be executed when the PE loader is ready to run the PE file. If you want to divert the flow of execution right from the start, you need to change the value in this field to a new RVA and the instruction at the new RVA will be executed first.

ImageBase: It's the preferred load address for the PE file. For example, if the value in this field is 400000h, the PE loader will try to load the file into the virtual address space starting at 400000h. The word "preferred" means that the PE loader may not load the file at that address if some other module already occupies that address range.

SectionAlignment: The granularity of the alignment of the sections in memory. For example, if the value in this field is 4096 (1000h), each section must start at a multiple of 4096 bytes. If the first section is at 401000h and its size is 10 bytes, the next section must be at 402000h even though the address space between 401000h and 402000h will be mostly unused.

FileAlignment: The granularity of the alignment of the sections in the file. For example, if the value in this field is 512 (200h), each section must start at a multiple of 512 bytes. If the first section is at file offset 200h and its size is 10 bytes, the next section must be located at file offset 400h: the space between file offsets 522 and 1024 is unused/undefined.

MajorSubsystemVersion, MinorSubsystemVersion: The win32 subsystem version. If the PE file is designed for Win32, the subsystem version must be 4.0, else the dialog won't have the 3-D look.

SizeOfImage: The overall size of the PE image in memory. It's the sum of all headers and sections aligned to SectionAlignment.

SizeOfHeaders: The size of all headers plus the section table. In short, this value is equal to the file size minus the combined size of all sections in the file. You can also use this value as the file offset of the first section in the PE file.

Subsystem: Tells which of the NT subsystems the PE file is intended for. For most win32 programs, only two values are used: Windows GUI and Windows CUI (console).

DataDirectory: An array of IMAGE_DATA_DIRECTORY structures. Each structure gives the RVA of an important data structure in the PE file such as the import address table.
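The two alignment examples in the table can be checked with a small rounding helper. This is a sketch under the assumption that the alignment is a power of two, which holds for the values used in the table; the name align_up is made up for this illustration.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


# SectionAlignment 1000h: a 10-byte section at 401000h pushes the next one to 402000h.
print(hex(align_up(0x401000 + 10, 0x1000)))  # 0x402000

# FileAlignment 200h: a 10-byte section at file offset 200h pushes the next one to 400h.
print(hex(align_up(0x200 + 10, 0x200)))      # 0x400
```

A value that is already aligned stays where it is, so align_up(0x400, 0x200) returns 400h again.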




Section Table [5]

Theory:

Up to this tutorial, we have learned about the DOS header and the PE header. What remains is the section table. A section table is actually an array of structures immediately following the PE header. The number of array members is determined by the NumberOfSections field in the file header (IMAGE_FILE_HEADER) structure. The structure is called IMAGE_SECTION_HEADER.

IMAGE_SIZEOF_SHORT_NAME equ 8

IMAGE_SECTION_HEADER STRUCT

Name1 db IMAGE_SIZEOF_SHORT_NAME dup(?)

union Misc

PhysicalAddress dd ?

VirtualSize dd ?

ends

VirtualAddress dd ?

SizeOfRawData dd ?

PointerToRawData dd ?

PointerToRelocations dd ?

PointerToLinenumbers dd ?

NumberOfRelocations dw ?

NumberOfLinenumbers dw ?

Characteristics dd ?

IMAGE_SECTION_HEADER ENDS

Again, not all members are useful. I'll describe only the ones that are really important.




Now that we know about the IMAGE_SECTION_HEADER structure, let's see how we can emulate the PE loader's job:

1. Read NumberOfSections in IMAGE_FILE_HEADER so we know how many sections there are in the file.

2. Find the section table, which immediately follows the PE header: its file offset is the offset of the optional header plus the value in SizeOfOptionalHeader. Move the file pointer to that offset. (SizeOfHeaders is not the offset of the section table; it covers all headers including the section table and gives the file offset of the first section.)

3. Walk the structure array, examining each member.

4. For each structure, we obtain the value in PointerToRawData and move the file pointer to that offset. Then we read the value in SizeOfRawData so we know how many bytes we should map into memory. Read the value in VirtualAddress and add the value in ImageBase to it to get the virtual address the section should start at. Then we are ready to map the section into memory and mark the attributes of the memory according to the flags in Characteristics.

5. Walk the array until all the sections are processed.

Note that we didn't make use of the name of the section: it's not really necessary.
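The walk can be sketched in Python as follows. This is not the loader's real code, only an illustration under stated assumptions: the section table is located directly after the optional header, as the theory part describes, nothing is actually mapped into memory, and the function name walk_sections and the hand-built sample bytes (two sections, no optional header) are made up for the example.

```python
import struct

SECTION_FORMAT = "<8sIIIIIIHHI"            # IMAGE_SECTION_HEADER
SECTION_SIZE = struct.calcsize(SECTION_FORMAT)  # 40 bytes


def walk_sections(data: bytes, pe_offset: int, image_base: int):
    """Yield (name, file offset, raw size, virtual address, flags) per section."""
    # NumberOfSections and SizeOfOptionalHeader live in IMAGE_FILE_HEADER,
    # which starts right after the 4-byte PE signature.
    number_of_sections, = struct.unpack_from("<H", data, pe_offset + 6)
    size_of_optional_header, = struct.unpack_from("<H", data, pe_offset + 20)
    # The section table follows the optional header:
    # 4 bytes of signature + 20 bytes of file header + the optional header.
    table = pe_offset + 24 + size_of_optional_header
    for index in range(number_of_sections):
        (name, _virtual_size, rva, raw_size, raw_pointer,
         _relocs, _lines, _num_relocs, _num_lines, flags) = struct.unpack_from(
            SECTION_FORMAT, data, table + index * SECTION_SIZE)
        # Name1 is zero-padded, not zero-terminated.
        text = name.rstrip(b"\x00").decode("ascii", errors="replace")
        yield text, raw_pointer, raw_size, image_base + rva, flags


# Made-up sample: signature, file header with 2 sections and no optional header.
sample = b"PE\x00\x00" + struct.pack("<HHIIIHH", 0x14C, 2, 0, 0, 0, 0, 0)
sample += struct.pack(SECTION_FORMAT, b".text", 0x10, 0x1000, 0x200, 0x200,
                      0, 0, 0, 0, 0x60000020)
sample += struct.pack(SECTION_FORMAT, b".data", 0x10, 0x2000, 0x200, 0x400,
                      0, 0, 0, 0, 0xC0000040)

sections = list(walk_sections(sample, 0, 0x400000))
for name, raw_pointer, raw_size, va, flags in sections:
    print(name, hex(raw_pointer), hex(raw_size), hex(va), hex(flags))
```

A real loader would now copy raw_size bytes from raw_pointer to the computed virtual address and set the page protection from the flags.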

TABLE 3. Section Table

Field Meaning

Name1 Actually the name of this field is "name" but the word "name" is a MASM keyword so we have to use "Name1" instead. This member contains the name of the section. Note that the maximum length is 8 bytes. The name is just a label, nothing more. You can use any name or even leave this field blank. Note that there is no mention of the terminating null. The name is not an ASCIIZ string so don't expect it to be terminated with a null.

VirtualAddress The RVA of the section. The PE loader examines and uses the value in this field when it's mapping the section into memory. Thus if the value in this field is 1000h and the PE file is loaded at 400000h, the section will be loaded at 401000h.

SizeOfRawData The size of the section's data rounded up to the next multiple of file alignment. The PE loader examines the value in this field so it knows how many bytes in the section it should map into memory.

PointerToRawData The file offset of the beginning of the section. The PE loader uses the value in this field to find where the data in the section is in the file.

Characteristics Contains flags such as whether this section contains executable code, initialized data or uninitialized data, and whether it can be written to or read from.
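The fields in the table can be checked quickly outside of MASM. What follows is only a sketch in Python (chosen here because it is short to run, not because the book uses it): it unpacks one 40-byte IMAGE_SECTION_HEADER and shows the two points that trip people up, the name that may lack a terminating null and the RVA that must be added to the image base. The function name parse_section_header and the dictionary keys are made up for this illustration.

```python
import struct

# IMAGE_SECTION_HEADER, 40 bytes, little-endian:
# Name1[8], VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
# PointerToRelocations, PointerToLinenumbers,
# NumberOfRelocations (word), NumberOfLinenumbers (word), Characteristics
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")

# A few Characteristics flags from the PE specification.
IMAGE_SCN_CNT_CODE = 0x00000020
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def parse_section_header(raw, image_base=0x400000):
    """Decode one section header (hypothetical helper for illustration)."""
    (name, vsize, vaddr, raw_size, raw_ptr,
     _relocs, _lines, _nrelocs, _nlines, flags) = SECTION_HEADER.unpack(raw)
    return {
        # The name is null-padded, but an 8-character name has no terminator.
        "name": name.rstrip(b"\0").decode("ascii", "replace"),
        "virtual_size": vsize,
        "load_address": image_base + vaddr,   # RVA + ImageBase
        "raw_size": raw_size,
        "raw_offset": raw_ptr,
        "executable": bool(flags & IMAGE_SCN_MEM_EXECUTE),
        "writable": bool(flags & IMAGE_SCN_MEM_WRITE),
    }
```

With VirtualAddress 1000h and the default image base 400000h the sketch reports a load address of 401000h, matching the example in the table.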




Example:

This example opens a PE file and walks the section table, showing the information about the sections in a listview control.

.386

.model flat,stdcall

option casemap:none





include \masm32\include\comctl32.inc

includelib \masm32\lib\comctl32.lib




IDD_SECTIONTABLE equ 104

IDC_SECTIONLIST equ 1001

SEH struct


CurrentHandler dd ? ; the address of the new exception handler




SEH ends

.data


ofn OPENFILENAME <>


db "All Files",0,"*.*",0,0






FileInValidPE db "This file is not a valid PE",0

template db "%08lx",0

SectionName db "Section",0

VirtualSize db "V.Size",0

VirtualAddress db "V.Address",0

SizeOfRawData db "Raw Size",0

RawOffset db "Raw Offset",0

Characteristics db "Characteristics",0

.data?

hInstance dd ?


hFile dd ?

hMapping dd ?

pMapping dd ?

ValidPE dd ?

NumberOfSections dd ?

.code

start proc

LOCAL seh:SEH

invoke GetModuleHandle,NULL

mov hInstance,eax







.if eax==TRUE



mov hFile, eax





.if eax!=NULL

mov hMapping, eax


.if eax!=NULL

mov pMapping,eax

assume fs:nothing

push fs:[0]

pop seh.PrevLink



lea eax,seh

mov fs:[0], eax

mov seh.PrevEsp,esp

mov seh.PrevEbp,ebp

mov edi, pMapping






mov ValidPE, TRUE

.else

mov ValidPE, FALSE

.endif

.else

mov ValidPE,FALSE

.endif

FinalExit:

push seh.PrevLink

pop fs:[0]

.if ValidPE==TRUE

call ShowSectionInfo

.else




.endif


.else


.endif


.else


.endif


.else


.endif

.endif


invoke InitCommonControls

start endp

SEHHandler proc C pExcept:DWORD,pFrame:DWORD,pContext:DWORD,pDispatch:DWORD

mov edx,pFrame

assume edx:ptr SEH

mov eax,pContext



pop [eax].regEip

push [edx].PrevEsp

pop [eax].regEsp

push [edx].PrevEbp

pop [eax].regEbp

mov ValidPE, FALSE


ret




SEHHandler endp

DlgProc proc uses edi esi hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD

LOCAL lvc:LV_COLUMN

LOCAL lvi:LV_ITEM

.if uMsg==WM_INITDIALOG

mov esi, lParam

mov lvc.imask,LVCF_FMT or LVCF_TEXT or LVCF_WIDTH or LVCF_SUBITEM

mov lvc.fmt,LVCFMT_LEFT

mov lvc.lx,80

mov lvc.iSubItem,0

mov lvc.pszText,offset SectionName

invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,0,addr lvc

inc lvc.iSubItem

mov lvc.fmt,LVCFMT_RIGHT

mov lvc.pszText,offset VirtualSize

invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,1,addr lvc

inc lvc.iSubItem

mov lvc.pszText,offset VirtualAddress


inc lvc.iSubItem

mov lvc.pszText,offset SizeOfRawData


inc lvc.iSubItem

mov lvc.pszText,offset RawOffset


inc lvc.iSubItem

mov lvc.pszText,offset Characteristics


mov eax,NumberOfSections ; stored as a dword, so no zero-extension is needed here

mov edi,eax

mov lvi.imask,LVIF_TEXT

mov lvi.iItem,0

assume esi:ptr IMAGE_SECTION_HEADER



.while edi>0

mov lvi.iSubItem,0

invoke RtlZeroMemory,addr buffer,9

invoke lstrcpyn,addr buffer,addr [esi].Name1,9 ; the count includes the terminating null, so 9 keeps all 8 name bytes

lea eax,buffer

mov lvi.pszText,eax

invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTITEM,0,addr lvi

invoke wsprintf,addr buffer,addr template,[esi].Misc.VirtualSize

lea eax,buffer

mov lvi.pszText,eax

inc lvi.iSubItem

invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi

invoke wsprintf,addr buffer,addr template,[esi].VirtualAddress

lea eax,buffer

mov lvi.pszText,eax

inc lvi.iSubItem


invoke wsprintf,addr buffer,addr template,[esi].SizeOfRawData

lea eax,buffer

mov lvi.pszText,eax

inc lvi.iSubItem


invoke wsprintf,addr buffer,addr template,[esi].PointerToRawData

lea eax,buffer

mov lvi.pszText,eax

inc lvi.iSubItem


invoke wsprintf,addr buffer,addr template,[esi].Characteristics

lea eax,buffer

mov lvi.pszText,eax

inc lvi.iSubItem


inc lvi.iItem

dec edi

add esi, sizeof IMAGE_SECTION_HEADER




.endw

.elseif uMsg==WM_CLOSE

invoke EndDialog,hDlg,NULL

.else

mov eax,FALSE

ret

.endif

mov eax,TRUE

ret

DlgProc endp

ShowSectionInfo proc uses edi

mov edi, pMapping




mov ax,[edi].FileHeader.NumberOfSections

movzx eax,ax

mov NumberOfSections,eax

add edi,sizeof IMAGE_NT_HEADERS

invoke DialogBoxParam, hInstance, IDD_SECTIONTABLE,NULL, addr DlgProc, edi

ret

ShowSectionInfo endp

end start



Analysis:

This example reuses the code of the example in PE tutorial 2. After it verifies that the file is a valid PE, it calls a function, ShowSectionInfo.

ShowSectionInfo proc uses edi

mov edi, pMapping




We use edi as the pointer to the data in the PE file. At first, we initialize it to the value of pMapping, which is the address of the DOS header. Then we add the value in e_lfanew to it so it now contains the address of the PE header.

mov ax,[edi].FileHeader.NumberOfSections

movzx eax,ax

mov NumberOfSections,eax

Since we need to walk the section table, we must obtain the number of sections in this file. That's the value in the NumberOfSections member of the file header. Don't forget that this member is of word size, which is why it is zero-extended before being stored in the dword variable.

add edi,sizeof IMAGE_NT_HEADERS

Edi currently contains the address of the PE header. Adding the size of the PE header to it will make it point at the section table.

invoke DialogBoxParam, hInstance, IDD_SECTIONTABLE,NULL, addr DlgProc, edi

Call DialogBoxParam to show the dialog box containing the listview control. Note that we pass the address of the section table as its last parameter. This value will be available in lParam during the WM_INITDIALOG message.




In the dialog box procedure, in response to the WM_INITDIALOG message, we store the value of lParam (address of the section table) in esi, the number of sections in edi and then dress up the listview control. When everything is ready, we enter a loop which will insert the info about each section into the listview control. This part is very simple.

.while edi>0

mov lvi.iSubItem,0

Put this string in the first column.

invoke RtlZeroMemory,addr buffer,9

invoke lstrcpyn,addr buffer,addr [esi].Name1,9 ; the count includes the terminating null, so 9 keeps all 8 name bytes

lea eax,buffer

mov lvi.pszText,eax

We will display the name of the section but we must convert it to an ASCIIZ string first.

invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTITEM,0,addr lvi

Then we display it in the first column.

We continue with this scheme until the last value we want to display for this section is displayed. Then we must move to the next structure.

dec edi

add esi, sizeof IMAGE_SECTION_HEADER

.endw

We decrement the value in edi for each section processed. And we add the size of IMAGE_SECTION_HEADER to esi so it contains the address of the next IMAGE_SECTION_HEADER structure.



The steps in walking the section table are:

1.Verify that the file is a valid PE

2.Go to the beginning of the PE header

3.Obtain the number of sections from NumberOfSections field in the file header.

4.Go to the section table by adding the size of the PE header to the address of the PE header. (The section table immediately follows the PE header.) If you don't use file mapping, you need to move the file pointer to the section table using SetFilePointer. The file offset of the section table is the value of e_lfanew plus the size of the PE header. SizeOfHeaders (a member of IMAGE_OPTIONAL_HEADER) is the combined, aligned size of all headers including the section table, so it is not the offset of the table.

5.Process each IMAGE_SECTION_HEADER structure.
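The five steps can be sketched compactly. The following is an illustrative Python version (not the MASM listing above, and not the only way to do it) that works on the raw bytes of a file. The function name section_table is an assumption of this sketch. Unlike the listing, it locates the table with SizeOfOptionalHeader from the file header instead of sizeof IMAGE_NT_HEADERS, which gives the same result for a standard 32-bit header.

```python
import struct

SIZEOF_FILE_HEADER = 20      # IMAGE_FILE_HEADER
SIZEOF_SECTION_HEADER = 40   # IMAGE_SECTION_HEADER


def section_table(data):
    """Yield (name, VirtualAddress, SizeOfRawData, PointerToRawData)."""
    # Step 1: verify that the file is a valid PE.
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # Step 2: go to the PE header through e_lfanew (offset 3Ch).
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a valid PE")
    file_header = e_lfanew + 4
    # Step 3: NumberOfSections is a word at offset 2 of the file header.
    (count,) = struct.unpack_from("<H", data, file_header + 2)
    # Step 4: the table follows the optional header.
    (opt_size,) = struct.unpack_from("<H", data, file_header + 16)
    table = file_header + SIZEOF_FILE_HEADER + opt_size
    # Step 5: process each IMAGE_SECTION_HEADER.
    for i in range(count):
        offset = table + i * SIZEOF_SECTION_HEADER
        name, _vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<8s4I", data, offset)
        yield (name.rstrip(b"\0").decode("ascii", "replace"),
               vaddr, raw_size, raw_ptr)
```

A typical use would be `for row in section_table(open(path, "rb").read()): print(row)`, where path is whatever PE file you want to inspect.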




Import Table6

We will learn about the import table in this tutorial. Let me warn you first. This tutorial is a long and difficult one for those who aren't familiar with the import table. You may need to read this tutorial several times and may even have to examine the related structures under a debugger.

Theory:

First of all, you should know what an import function is. An import function is a function that is not in the caller's module but is called by the module, thus the name "import". The import functions actually reside in one or more DLLs. Only the information about the functions is kept in the caller's module. That information includes the function names and the names of the DLLs in which they reside.

Now how can we find out where in the PE file the information is kept? We must turn to the data directory for the answer. I'll refresh your memory a bit. Below is the PE header:

IMAGE_NT_HEADERS STRUCT

Signature dd ?

FileHeader IMAGE_FILE_HEADER <>

OptionalHeader IMAGE_OPTIONAL_HEADER <>

IMAGE_NT_HEADERS ENDS

The last member of the optional header is the data directory:

IMAGE_OPTIONAL_HEADER32 STRUCT

....

LoaderFlags dd ?

NumberOfRvaAndSizes dd ?

DataDirectory IMAGE_DATA_DIRECTORY 16 dup(<>)

IMAGE_OPTIONAL_HEADER32 ENDS




The data directory is an array of IMAGE_DATA_DIRECTORY structures, 16 members in total. If you remember the section table as the root directory of the sections in a PE file, you should also think of the data directory as the root directory of the logical components stored inside those sections. To be precise, the data directory contains the locations and sizes of the important data structures in the PE file. Each member contains information about an important data structure.

Member Info inside
0 Export symbols
1 Import symbols
2 Resources
3 Exception
4 Security
5 Base relocation
6 Debug
7 Copyright string
8 Unknown
9 Thread local storage (TLS)
10 Load configuration
11 Bound Import
12 Import Address Table
13 Delay Import
14 COM descriptor




Only the members painted in gold are known to me. Now that you know what each member of the data directory contains, we can learn about the member in detail. Each member of the data directory is a structure called IMAGE_DATA_DIRECTORY which has the following definition:

IMAGE_DATA_DIRECTORY STRUCT

VirtualAddress dd ?

isize dd ?

IMAGE_DATA_DIRECTORY ENDS

VirtualAddress is actually the relative virtual address (RVA) of the data structure. For example, if this structure is for import symbols, this field contains the RVA of the IMAGE_IMPORT_DESCRIPTOR array.

isize contains the size in bytes of the data structure referred to by VirtualAddress.

Here's the general scheme on finding important data structures in a PE file:

1.From the DOS header, you go to the PE header

2.Obtain the address of the data directory in the optional header.

3.Multiply the size of IMAGE_DATA_DIRECTORY by the member index you want: for example, if you want to know where the import symbols are, you must multiply the size of IMAGE_DATA_DIRECTORY (8 bytes) by 1.

4.Add the result to the address of the data directory and you have the address of the IMAGE_DATA_DIRECTORY structure that contains the info about the desired data structure.
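The four steps amount to one small address calculation. Here is a hedged Python sketch of it (the function name data_directory_entry is invented for this example). It assumes the 32-bit optional header described in this book, in which NumberOfRvaAndSizes sits at offset 92 and DataDirectory at offset 96 of the optional header; the 64-bit header uses different offsets and is rejected.

```python
import struct

SIZEOF_FILE_HEADER = 20
SIZEOF_DATA_DIRECTORY = 8      # VirtualAddress (dword) + isize (dword)
PE32_MAGIC = 0x10B             # Magic value of the 32-bit optional header
IMAGE_DIRECTORY_ENTRY_IMPORT = 1


def data_directory_entry(data, index):
    """Return (VirtualAddress, isize) of one data directory member."""
    # Step 1: from the DOS header, go to the PE header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # Step 2: the optional header follows the signature and file header.
    optional_header = e_lfanew + 4 + SIZEOF_FILE_HEADER
    (magic,) = struct.unpack_from("<H", data, optional_header)
    if magic != PE32_MAGIC:
        raise ValueError("only the 32-bit optional header is handled here")
    (count,) = struct.unpack_from("<I", data, optional_header + 92)
    if index >= count:
        raise IndexError("data directory has no such member")
    # Steps 3 and 4: index * 8, added to the address of the data directory.
    entry = optional_header + 96 + index * SIZEOF_DATA_DIRECTORY
    return struct.unpack_from("<II", data, entry)
```

Calling it with IMAGE_DIRECTORY_ENTRY_IMPORT returns the RVA and size of the import table, which is where the discussion continues.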

Now we will enter into the real discussion about the import table. The address of the import table is contained in the VirtualAddress field of the second member of the data directory. The import table is actually an array of IMAGE_IMPORT_DESCRIPTOR structures. Each structure contains information about a DLL the PE file imports symbols from. For example, if the PE file imports functions from 10 different DLLs, there will be 10 members in this array. The array is terminated by a member which contains all zeroes. Now we can examine the structure in detail:

IMAGE_IMPORT_DESCRIPTOR STRUCT

union

Characteristics dd ?

OriginalFirstThunk dd ?

ends

TimeDateStamp dd ?

ForwarderChain dd ?

Name1 dd ?

FirstThunk dd ?

IMAGE_IMPORT_DESCRIPTOR ENDS

The first member of this structure is a union. Actually, the union only provides an alias for OriginalFirstThunk, so you can call it "Characteristics". This member contains the RVA of an array of IMAGE_THUNK_DATA structures.

What is IMAGE_THUNK_DATA? It's a union of dword size. Usually, we interpret it as the pointer to an IMAGE_IMPORT_BY_NAME structure. Note that IMAGE_THUNK_DATA contains the pointer to an IMAGE_IMPORT_BY_NAME structure, not the structure itself.

Look at it this way: there are several IMAGE_IMPORT_BY_NAME structures. We collect the RVAs of those structures (IMAGE_THUNK_DATAs) into an array and terminate it with 0. Then we put the RVA of the array into OriginalFirstThunk.




The IMAGE_IMPORT_BY_NAME structure contains information about an import function. Now let's see what the IMAGE_IMPORT_BY_NAME structure looks like:

IMAGE_IMPORT_BY_NAME STRUCT

Hint dw ?

Name1 db ?

IMAGE_IMPORT_BY_NAME ENDS

Hint contains the index into the export table of the DLL the function resides in. This field is for use by the PE loader so it can look up the function in the DLL's export table quickly. This value is not essential and some linkers may set the value in this field to 0.

Name1 contains the name of the import function. The name is an ASCIIZ string. Note that Name1's size is defined as byte but it's really a variable-sized field. It's just that there is no way to represent a variable-sized field in a structure. The structure is provided so that you can refer to the data structure with descriptive names.

Back to the remaining members of IMAGE_IMPORT_DESCRIPTOR. TimeDateStamp and ForwarderChain are advanced stuff: we will talk about them after you have a firm grasp of the other members.

Name1 (the member of IMAGE_IMPORT_DESCRIPTOR, not the one in IMAGE_IMPORT_BY_NAME) contains the RVA of the name of the DLL, in short, the pointer to the name of the DLL. The string is an ASCIIZ one.

FirstThunk is very similar to OriginalFirstThunk, i.e. it contains the RVA of an array of IMAGE_THUNK_DATA structures (a different array though).

Ok, if you're still confused, look at it this way: there are several IMAGE_IMPORT_BY_NAME structures. You create two arrays, then fill them with the RVAs of those IMAGE_IMPORT_BY_NAME structures, so both arrays contain exactly the same values (i.e. exact duplicates). Now you assign the RVA of the first array to OriginalFirstThunk and the RVA of the second array to FirstThunk.



Now you should be able to understand what I mean. Don't be confused by the name IMAGE_THUNK_DATA: it's only an RVA of an IMAGE_IMPORT_BY_NAME structure. If you replace the word IMAGE_THUNK_DATA with RVA in your mind, you'll perhaps see it more clearly. The number of array elements in the OriginalFirstThunk and FirstThunk arrays depends on the functions the PE file imports from the DLL. For example, if the PE file imports 10 functions from kernel32.dll, Name1 in the IMAGE_IMPORT_DESCRIPTOR structure will contain the RVA of the string "kernel32.dll" and there will be 10 IMAGE_THUNK_DATAs in each array.

OriginalFirstThunk          IMAGE_IMPORT_BY_NAME          FirstThunk
        |                                                      |
IMAGE_THUNK_DATA    --->    Function 1    <---    IMAGE_THUNK_DATA
IMAGE_THUNK_DATA    --->    Function 2    <---    IMAGE_THUNK_DATA
IMAGE_THUNK_DATA    --->    Function 3    <---    IMAGE_THUNK_DATA
IMAGE_THUNK_DATA    --->    Function 4    <---    IMAGE_THUNK_DATA
...                 --->    ...           <---    ...
IMAGE_THUNK_DATA    --->    Function n    <---    IMAGE_THUNK_DATA





The next question is: why do we need two arrays that are exactly the same? To answer that question, we need to know that when the PE file is loaded into memory, the PE loader will look at the IMAGE_THUNK_DATAs and IMAGE_IMPORT_BY_NAMEs and determine the addresses of the import functions. Then it replaces the IMAGE_THUNK_DATAs in the array pointed to by FirstThunk with the real addresses of the functions. Thus when the PE file is ready to run, the above picture is changed to:

The array of RVAs pointed to by OriginalFirstThunk remains unchanged so that if the need arises to find the names of import functions, the PE loader can still find them.

There is a little twist on this *straightforward* scheme. Some functions are exported by ordinal only. It means you don't call the functions by their names: you call them by their positions. In this case, there will be no IMAGE_IMPORT_BY_NAME structure for that function in the caller's module. Instead, the IMAGE_THUNK_DATA for that function will contain the ordinal of the function in the low word and the most significant bit (MSB) of IMAGE_THUNK_DATA set to 1. For example, if a function is exported by ordinal only and its ordinal is 1234h, the IMAGE_THUNK_DATA for that function will be 80001234h. Microsoft provides a handy constant for testing the MSB of a dword, IMAGE_ORDINAL_FLAG32. It has the value of 80000000h.
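The ordinal test is a single bit check. As a small illustration in Python (the function name decode_thunk and the returned tags are invented for this sketch), a 32-bit IMAGE_THUNK_DATA value can be classified like this:

```python
IMAGE_ORDINAL_FLAG32 = 0x80000000


def decode_thunk(thunk):
    """Classify a 32-bit IMAGE_THUNK_DATA value (illustrative helper)."""
    if thunk & IMAGE_ORDINAL_FLAG32:
        # MSB set: import by ordinal, the ordinal is in the low word.
        return ("ordinal", thunk & 0xFFFF)
    # MSB clear: the value is the RVA of an IMAGE_IMPORT_BY_NAME structure.
    return ("name_rva", thunk)
```

For the value from the text, decode_thunk(0x80001234) gives the ordinal 1234h, while a value with the MSB clear is returned unchanged as an RVA.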

OriginalFirstThunk          IMAGE_IMPORT_BY_NAME          FirstThunk
        |                                                      |
IMAGE_THUNK_DATA    --->    Function 1            Address of Function 1
IMAGE_THUNK_DATA    --->    Function 2            Address of Function 2
IMAGE_THUNK_DATA    --->    Function 3            Address of Function 3
IMAGE_THUNK_DATA    --->    Function 4            Address of Function 4
...                 --->    ...                   ...
IMAGE_THUNK_DATA    --->    Function n            Address of Function n



Suppose that we want to list the names of ALL import functions of a PE file. We need to follow the steps below:

1.Verify that the file is a valid PE

2.From the DOS header, go to the PE header

3.Obtain the address of the data directory in OptionalHeader

4.Go to the 2nd member of the data directory. Extract the value of VirtualAddress

5.Use that value to go to the first IMAGE_IMPORT_DESCRIPTOR structure

6.Check the value of OriginalFirstThunk. If it's not zero, follow the RVA in OriginalFirstThunk to the RVA array. If OriginalFirstThunk is zero, use the value in FirstThunk instead. Some linkers generate PE files with 0 in OriginalFirstThunk. This is considered a bug. Just to be on the safe side, we check the value in OriginalFirstThunk first.

7.For each member in the array, we check the value of the member against IMAGE_ORDINAL_FLAG32. If the most significant bit of the member is 1, then the function is exported by ordinal and we can extract the ordinal number from the low word of the member.

8.If the most significant bit of the member is 0, use the value in the member as the RVA of the IMAGE_IMPORT_BY_NAME, skip Hint, and you're at the name of the function.

9.Skip to the next array member, and retrieve the names until the end of the array is reached (it's null-terminated). Now we are done extracting the names of the functions imported from a DLL. We go to the next DLL.

10.Skip to the next IMAGE_IMPORT_DESCRIPTOR and process it. Do that until the end of the array is reached (the IMAGE_IMPORT_DESCRIPTOR array is terminated by a member with all zeroes in its fields).
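Steps 5 to 10 can be condensed into two nested loops. The sketch below is in Python rather than MASM and is only an illustration: list_imports, read_asciiz and the to_offset parameter are names made up here. to_offset stands for whatever converts an RVA into a file offset. By default it returns the RVA unchanged, which is only right when the bytes are laid out as in memory.

```python
import struct

IMAGE_ORDINAL_FLAG32 = 0x80000000
SIZEOF_IMPORT_DESCRIPTOR = 20   # five dwords


def read_asciiz(data, offset):
    """Read a null-terminated ASCII string starting at offset."""
    end = data.index(b"\0", offset)
    return data[offset:end].decode("ascii", "replace")


def list_imports(data, import_rva, to_offset=lambda rva: rva):
    """Return {dll name: [function name or '#ordinal', ...]}."""
    result = {}
    desc = to_offset(import_rva)                      # step 5
    while True:
        oft, stamp, chain, name_rva, ft = struct.unpack_from("<5I", data, desc)
        if not (oft or stamp or chain or name_rva or ft):
            break                                     # all-zero terminator
        names = result.setdefault(read_asciiz(data, to_offset(name_rva)), [])
        thunk_off = to_offset(oft or ft)              # step 6: fall back to FirstThunk
        while True:
            (thunk,) = struct.unpack_from("<I", data, thunk_off)
            if thunk == 0:
                break                                 # step 9: end of the array
            if thunk & IMAGE_ORDINAL_FLAG32:          # step 7
                names.append("#%d" % (thunk & 0xFFFF))
            else:                                     # step 8: skip the 2-byte Hint
                names.append(read_asciiz(data, to_offset(thunk) + 2))
            thunk_off += 4
        desc += SIZEOF_IMPORT_DESCRIPTOR              # step 10
    return result
```

When the file is read from disk instead of being loaded by the PE loader, a section-aware conversion has to be passed as to_offset.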

Example:

This example opens a PE file and reads the names of all import functions of that file into an edit control. It also shows the values in the IMAGE_IMPORT_DESCRIPTOR structures.




.386

.model flat,stdcall

option casemap:none








IDD_MAINDLG equ 101

IDC_EDIT equ 1000

IDM_OPEN equ 40001

IDM_EXIT equ 40003

DlgProc proto :DWORD,:DWORD,:DWORD,:DWORD

ShowImportFunctions proto :DWORD

ShowTheFunctions proto :DWORD,:DWORD

AppendText proto :DWORD,:DWORD

SEH struct


CurrentHandler dd ? ; the address of the new exception handler




SEH ends

.data


ofn OPENFILENAME <>


db "All Files",0,"*.*",0,0






NotValidPE db "This file is not a valid PE",0

CRLF db 0Dh,0Ah,0

ImportDescriptor db 0Dh,0Ah,"================[ IMAGE_IMPORT_DESCRIPTOR ]=============",0

IDTemplate db "OriginalFirstThunk = %lX",0Dh,0Ah

db "TimeDateStamp = %lX",0Dh,0Ah

db "ForwarderChain = %lX",0Dh,0Ah

db "Name = %s",0Dh,0Ah

db "FirstThunk = %lX",0

NameHeader db 0Dh,0Ah,"Hint Function",0Dh,0Ah

db "-----------------------------------------",0

NameTemplate db "%u %s",0

OrdinalTemplate db "%u (ord.)",0

.data?


hFile dd ?

hMapping dd ?

pMapping dd ?

ValidPE dd ?

.code

start:


invoke DialogBoxParam, eax, IDD_MAINDLG,NULL,addr DlgProc, 0


DlgProc proc hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD


invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETLIMITTEXT,0,0

.elseif uMsg==WM_CLOSE

invoke EndDialog,hDlg,0

.elseif uMsg==WM_COMMAND




.if lParam==0

mov eax,wParam

.if ax==IDM_OPEN

invoke ShowImportFunctions,hDlg

.else ; IDM_EXIT

invoke SendMessage,hDlg,WM_CLOSE,0,0

.endif

.endif

.else

mov eax,FALSE

ret

.endif

mov eax,TRUE

ret

DlgProc endp

SEHHandler proc C pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD

mov edx,pFrame

assume edx:ptr SEH

mov eax,pContext



pop [eax].regEip

push [edx].PrevEsp

pop [eax].regEsp

push [edx].PrevEbp

pop [eax].regEbp

mov ValidPE, FALSE


ret

SEHHandler endp

ShowImportFunctions proc uses edi hDlg:DWORD

LOCAL seh:SEH

mov ofn.lStructSize,SIZEOF ofn

mov ofn.lpstrFilter, OFFSET FilterString





.if eax==TRUE



mov hFile, eax


.if eax!=NULL

mov hMapping, eax


.if eax!=NULL

mov pMapping,eax

assume fs:nothing

push fs:[0]

pop seh.PrevLink



lea eax,seh

mov fs:[0], eax

mov seh.PrevEsp,esp

mov seh.PrevEbp,ebp

mov edi, pMapping






mov ValidPE, TRUE

.else

mov ValidPE, FALSE

.endif




.else

mov ValidPE,FALSE

.endif

FinalExit:

push seh.PrevLink

pop fs:[0]

.if ValidPE==TRUE

invoke ShowTheFunctions, hDlg, edi

.else

invoke MessageBox,0, addr NotValidPE, addr AppName, MB_OK+MB_ICONERROR

.endif


.else


.endif


.else


.endif


.else


.endif

.endif

ret

ShowImportFunctions endp

AppendText proc hDlg:DWORD,pText:DWORD

invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,pText

invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,addr CRLF

invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETSEL,-1,0

ret

AppendText endp



RVAToOffset PROC uses edi esi edx ecx pFileMap:DWORD,RVA:DWORD

mov esi,pFileMap

assume esi:ptr IMAGE_DOS_HEADER

add esi,[esi].e_lfanew

assume esi:ptr IMAGE_NT_HEADERS

mov edi,RVA ; edi == RVA

mov edx,esi

add edx,sizeof IMAGE_NT_HEADERS

mov cx,[esi].FileHeader.NumberOfSections

movzx ecx,cx

assume edx:ptr IMAGE_SECTION_HEADER

.while ecx>0 ; check all sections

.if edi>=[edx].VirtualAddress

mov eax,[edx].VirtualAddress

add eax,[edx].SizeOfRawData

.if edi<eax ; The address is in this section


sub edi,[edx].VirtualAddress ; edi == offset of the RVA within this section

mov eax,[edx].PointerToRawData

add eax,edi ; eax == file offset

ret

.endif

.endif

add edx,sizeof IMAGE_SECTION_HEADER

dec ecx

.endw

assume edx:nothing

assume esi:nothing

mov eax,edi

ret

RVAToOffset endp

ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD

LOCAL temp[512]:BYTE




invoke SetDlgItemText,hDlg,IDC_EDIT,0

invoke AppendText,hDlg,addr buffer

mov edi,pNTHdr


mov edi, [edi].OptionalHeader.DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress

invoke RVAToOffset,pMapping,edi

mov edi,eax

add edi,pMapping

assume edi:ptr IMAGE_IMPORT_DESCRIPTOR

.while !([edi].OriginalFirstThunk==0 && [edi].TimeDateStamp==0 && [edi].ForwarderChain==0 && [edi].Name1==0 && [edi].FirstThunk==0)

invoke AppendText,hDlg,addr ImportDescriptor

invoke RVAToOffset,pMapping, [edi].Name1

mov edx,eax

add edx,pMapping

invoke wsprintf, addr temp, addr IDTemplate, [edi].OriginalFirstThunk,[edi].TimeDateStamp,[edi].ForwarderChain,edx,[edi].FirstThunk

invoke AppendText,hDlg,addr temp

.if [edi].OriginalFirstThunk==0

mov esi,[edi].FirstThunk

.else

mov esi,[edi].OriginalFirstThunk

.endif

invoke RVAToOffset,pMapping,esi

add eax,pMapping

mov esi,eax

invoke AppendText,hDlg,addr NameHeader

.while dword ptr [esi]!=0

test dword ptr [esi],IMAGE_ORDINAL_FLAG32

jnz ImportByOrdinal

invoke RVAToOffset,pMapping,dword ptr [esi]

mov edx,eax

add edx,pMapping

assume edx:ptr IMAGE_IMPORT_BY_NAME

mov cx, [edx].Hint



movzx ecx,cx

invoke wsprintf,addr temp,addr NameTemplate,ecx,addr [edx].Name1

jmp ShowTheText

ImportByOrdinal:

mov edx,dword ptr [esi]

and edx,0FFFFh

invoke wsprintf,addr temp,addr OrdinalTemplate,edx

ShowTheText:

invoke AppendText,hDlg,addr temp

add esi,4

.endw

add edi,sizeof IMAGE_IMPORT_DESCRIPTOR

.endw

ret

ShowTheFunctions endp

end start




Analysis:

The program shows an open file dialog box when the user clicks Open in the menu. It verifies that the file is a valid PE and then calls ShowTheFunctions.

ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD

LOCAL temp[512]:BYTE

Reserve 512 bytes of stack space for string operations.

invoke SetDlgItemText,hDlg,IDC_EDIT,0

Clear the text in the edit control.

invoke AppendText,hDlg,addr buffer

Insert the name of the PE file into the edit control. AppendText just sends EM_REPLACESEL messages to append the text to the edit control. Note that it sends EM_SETSEL with wParam=-1 and lParam=0 to the edit control to move the cursor to the end of the text.

mov edi,pNTHdr


mov edi, [edi].OptionalHeader.DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress

Obtain the RVA of the import symbols. edi at first points to the PE header. We use it to go to the 2nd member of the data directory array and obtain the value of the VirtualAddress member.



invoke RVAToOffset,pMapping,edi

mov edi,eax

add edi,pMapping

Here comes one of the pitfalls for newcomers to PE programming. Most of the addresses in the PE file are RVAs and RVAs are meaningful only when the PE file is loaded into memory by the PE loader. In our case, we do map the file into memory but not the way the PE loader does. Thus we cannot use those RVAs directly. Somehow we have to convert those RVAs into file offsets. I wrote the RVAToOffset function just for this purpose. I won't analyze it in detail here. Suffice it to say that it checks the submitted RVA against the starting and ending RVAs of all sections in the PE file and uses the value in the PointerToRawData field of the IMAGE_SECTION_HEADER structure to convert the RVA to a file offset.
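Since the function is not analyzed here, a compact restatement of the idea may help. This is a Python sketch, not a translation of the MASM listing line by line. The name rva_to_offset and the tuple layout of sections are assumptions of this example. Like the listing, it returns the RVA unchanged when no section contains it.

```python
def rva_to_offset(sections, rva):
    """Convert an RVA to a file offset.

    sections is a list of (VirtualAddress, SizeOfRawData, PointerToRawData)
    tuples, one per IMAGE_SECTION_HEADER.
    """
    for virtual_address, size_of_raw_data, pointer_to_raw_data in sections:
        # Is the RVA between the starting and ending RVA of this section?
        if virtual_address <= rva < virtual_address + size_of_raw_data:
            # Distance from the start of the section, rebased onto the file.
            return rva - virtual_address + pointer_to_raw_data
    # No section matched (e.g. the RVA lies in the headers): return it as is.
    return rva
```

For a section with VirtualAddress 2000h and PointerToRawData 600h, the RVA 2010h maps to file offset 610h: subtract the section's RVA, then add its file offset.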

To use this function, you pass it two parameters: the pointer to the memory mapped file and the RVA you want to convert. It returns the file offset in eax. In the above snippet, we must add the pointer to the memory mapped file to the file offset to convert it to a virtual address. Seems complicated, huh? :)

assume edi:ptr IMAGE_IMPORT_DESCRIPTOR

.while !([edi].OriginalFirstThunk==0 && [edi].TimeDateStamp==0 && [edi].ForwarderChain==0 && [edi].Name1==0 && [edi].FirstThunk==0)

edi now points to the first IMAGE_IMPORT_DESCRIPTOR structure. We will walk the array until we find the structure with zeroes in all members, which denotes the end of the array.

invoke AppendText,hDlg,addr ImportDescriptor

invoke RVAToOffset,pMapping, [edi].Name1

mov edx,eax

add edx,pMapping

We want to display the values of the current IMAGE_IMPORT_DESCRIPTOR structure in the edit control. Name1 is different from the other members since it contains the RVA of the name of the DLL. Thus we must convert it to a virtual address first.

invoke wsprintf, addr temp, addr IDTemplate, [edi].OriginalFirstThunk,[edi].TimeDateStamp,[edi].ForwarderChain,edx,[edi].FirstThunk

invoke AppendText,hDlg,addr temp

Display the values of the current IMAGE_IMPORT_DESCRIPTOR.




.if [edi].OriginalFirstThunk==0

mov esi,[edi].FirstThunk

.else

mov esi,[edi].OriginalFirstThunk

.endif

Next we prepare to walk the IMAGE_THUNK_DATA array. Normally we would choose to use the array pointed to by OriginalFirstThunk. However, some linkers erroneously put 0 in OriginalFirstThunk, so we must first check whether the value of OriginalFirstThunk is zero. If it is, we use the array pointed to by FirstThunk instead.

invoke RVAToOffset,pMapping,esi

add eax,pMapping

mov esi,eax

Again, the value in OriginalFirstThunk/FirstThunk is an RVA. We must convert it to a virtual address.

invoke AppendText,hDlg,addr NameHeader

.while dword ptr [esi]!=0

Now we are ready to walk the array of IMAGE_THUNK_DATAs to look for the names of the functions imported from this DLL. We will walk the array until we find an entry which contains 0.

test dword ptr [esi],IMAGE_ORDINAL_FLAG32

jnz ImportByOrdinal

The first thing we do with the IMAGE_THUNK_DATA is to test it against IMAGE_ORDINAL_FLAG32. This test checks if the most significant bit of the IMAGE_THUNK_DATA is 1. If it is, the function is exported by ordinal so we have no need to process it further. We can extract its ordinal from the low word of the IMAGE_THUNK_DATA and go on with the next IMAGE_THUNK_DATA dword.



invoke RVAToOffset,pMapping,dword ptr [esi]

mov edx,eax

add edx,pMapping

assume edx:ptr IMAGE_IMPORT_BY_NAME

If the MSB of the IMAGE_THUNK_DATA is 0, it contains the RVA of an IMAGE_IMPORT_BY_NAME structure. We need to convert it to a virtual address first.

mov cx, [edx].Hint

movzx ecx,cx

invoke wsprintf,addr temp,addr NameTemplate,ecx,addr [edx].Name1

jmp ShowTheText

Hint is a word-sized field. We must convert it to a dword-sized value before submitting it to wsprintf. And we print both the hint and the function name in the edit control.

ImportByOrdinal:

mov edx,dword ptr [esi]

and edx,0FFFFh

invoke wsprintf,addr temp,addr OrdinalTemplate,edx

In case the function is exported by ordinal only, we zero out the high word and display the ordinal.

ShowTheText:


add esi,4

After inserting the function name/ordinal into the edit control, we skip to the next IMAGE_THUNK_DATA.

.endw

add edi,sizeof IMAGE_IMPORT_DESCRIPTOR

When all IMAGE_THUNK_DATA dwords in the array are processed, we skip to the next IMAGE_IMPORT_DESCRIPTOR to process the import functions from other DLLs.




Appendix:

It would be incomplete if I didn't mention something about bound import. In order to explain what it is, I need to digress a bit. When the PE loader loads a PE file into memory, it examines the import table and loads the required DLLs into the process address space. Then it walks the IMAGE_THUNK_DATA array much like we did and replaces the IMAGE_THUNK_DATAs with the real addresses of the import functions. This step takes time. If somehow the programmer can predict the addresses of the functions correctly, the PE loader doesn't have to fix the IMAGE_THUNK_DATAs each time the PE file is run. Bound import is the product of that idea.

To put it in simple terms, there is a utility named bind.exe that comes with Microsoft compilers such as Visual Studio that examines the import table of a PE file and replaces the IMAGE_THUNK_DATA dwords with the addresses of the import functions. When the file is loaded, the PE loader must check if the addresses are valid. If the DLL versions do not match the ones in the PE files or if the DLLs need to be relocated, the PE loader knows that the precomputed addresses are not valid, thus it must walk the array pointed to by OriginalFirstThunk to calculate the new addresses of import functions.

Bound import doesn't have much significance in our example because we use OriginalFirstThunk by default. For more information about the bound import, I recommend LUEVELSMEYER's pe.txt.



Export Table7

Theory:

When the PE loader runs a program, it loads the associated DLLs into the process address space. It then extracts information about the import functions from the main program. It uses the information to search the DLLs for the addresses of the functions to be patched into the main program. The place in the DLLs where the PE loader looks for the addresses of the functions is the export table.

When a DLL/EXE exports a function to be used by other DLLs/EXEs, it can do so in two ways: it can export the function by name or by ordinal only. Say there is a function named "GetSysConfig" in a DLL. The DLL can choose to tell the other DLLs/EXEs that if they want to call the function, they must specify it by its name, i.e. GetSysConfig. The other way is to export by ordinal. What's an ordinal? An ordinal is a 16-bit number that uniquely identifies a function in a particular DLL. This number is unique only within the DLL it refers to. For example, in the above example, the DLL can choose to export the function by ordinal, say, 16. Then the other DLLs/EXEs which want to call this function must specify this number in GetProcAddress. This is called export by ordinal only.

Export by ordinal only is strongly discouraged because it can cause a maintenance problem for the DLL. If the DLL is upgraded/updated, the programmer of that DLL cannot alter the ordinals of the functions, or else other programs that depend on the DLL will break.

Now we can examine the export structure. As with the import table, you can find where the export table is by looking at the data directory. In this case, the export table is the first member of the data directory. The export structure is called IMAGE_EXPORT_DIRECTORY. There are 11 members in the structure but only some of them are really used.





Field Name
    Meaning

nName
    The actual name of the module. This field is necessary because the name of the file can be changed. If that is the case, the PE loader will use this internal name.

nBase
    The starting ordinal number. Subtract it from an ordinal to get the index into the address-of-function array.

NumberOfFunctions
    Total number of functions/symbols that are exported by this module.

NumberOfNames
    Number of functions/symbols that are exported by name. This value is not the number of ALL functions/symbols in the module. For that number, you need to check NumberOfFunctions. This value can be 0. In that case, the module may export by ordinal only. If there is no function/symbol to be exported at all, the RVA of the export table in the data directory will be 0.

AddressOfFunctions
    An RVA that points to an array of RVAs of the functions/symbols in the module. In short, RVAs to all functions in the module are kept in an array and this field points to the head of that array.

AddressOfNames
    An RVA that points to an array of RVAs of the names of functions in the module.

AddressOfNameOrdinals
    An RVA that points to a 16-bit array that contains the ordinals associated with the function names in the AddressOfNames array above.
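The table covers only the fields that are really used. As a rough sketch, the full 11-member layout can be unpacked with Python's struct module as below. The member order follows the winnt.h declaration; the names ExportDirectory, EXPORT_DIRECTORY_FORMAT and parse_export_directory are invented for this illustration, and the Name and Base members are spelled nName and nBase as in the MASM32 headers used in this chapter.

```python
import struct
from collections import namedtuple

# Hypothetical helper names; the member order is that of IMAGE_EXPORT_DIRECTORY.
ExportDirectory = namedtuple("ExportDirectory", [
    "Characteristics", "TimeDateStamp", "MajorVersion", "MinorVersion",
    "nName", "nBase", "NumberOfFunctions", "NumberOfNames",
    "AddressOfFunctions", "AddressOfNames", "AddressOfNameOrdinals",
])

# Little-endian: two dwords, two words, seven dwords (40 bytes in total).
EXPORT_DIRECTORY_FORMAT = "<IIHHIIIIIII"


def parse_export_directory(data, offset=0):
    """Unpack one IMAGE_EXPORT_DIRECTORY from a bytes object at `offset`."""
    fields = struct.unpack_from(EXPORT_DIRECTORY_FORMAT, data, offset)
    return ExportDirectory(*fields)
```

The offset passed in must already be a file offset, not an RVA; converting between the two is a separate step.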



Just reading the above table may not give you the real picture of the export table. The simplified explanation below will clarify the concept.

The export table exists for use by the PE loader. First of all, the module must keep the addresses of all exported functions somewhere so the PE loader can look them up. It keeps them in an array that is pointed to by the field AddressOfFunctions. The number of elements in the array is kept in NumberOfFunctions. Thus if the module exports 40 functions, it must have 40 members in the array pointed to by AddressOfFunctions, and NumberOfFunctions must contain the value 40.

Now if some functions are exported by name, the module must keep the names in the file. It keeps the RVAs of the names in an array so the PE loader can look them up. That array is pointed to by AddressOfNames and the number of names is kept in NumberOfNames.

Think about the job of the PE loader: it knows the names of the functions and must somehow obtain the addresses of those functions. Up to now, the module has two arrays, the names and the addresses, but there is no linkage between them. Thus we need something that relates the names of the functions to their addresses. The PE specification uses indexes into the address array as that essential linkage. If the PE loader finds the name it looks for in the name array, it can obtain the index into the address table for that name too. The indexes are kept in another array (the last one) pointed to by the field AddressOfNameOrdinals. Since this array exists as the linkage between the names and the addresses, it must have exactly the same number of elements as the name array, i.e. each name can have one and only one associated address. The reverse is not true: an address may have several names associated with it. Thus we can have "aliases" that refer to the same address. To make the linkage work, the name and index arrays must run in parallel, i.e. the first element in the index array must hold the index for the first name and so on.




An example or two is in order. If we have the name of an export function and we need to get its address in the module, we can proceed like this:

1. Go to the PE header.

2. Read the virtual address of the export table in the data directory.

3. Go to the export table and obtain the number of names (NumberOfNames).

4. Walk the arrays pointed to by AddressOfNames and AddressOfNameOrdinals in parallel, searching for the matching name. If the name is found in the AddressOfNames array, extract the value in the associated element of the AddressOfNameOrdinals array. For example, if you find the RVA of the matching name in the 77th element of the AddressOfNames array, extract the value stored in the 77th element of the AddressOfNameOrdinals array. If you have examined NumberOfNames elements without a match, you know that the name is not in this module.

5. Use the value from the AddressOfNameOrdinals array as the index into the AddressOfFunctions array. Say, if the value is 5, extract the element at index 5 of the AddressOfFunctions array. That value is the RVA of the function.

AddressOfNames                 AddressOfNameOrdinals
      |                                  |
RVA of Name 1        <-->        Index of Name 1
RVA of Name 2        <-->        Index of Name 2
RVA of Name 3        <-->        Index of Name 3
RVA of Name 4        <-->        Index of Name 4
...                  ...         ...
RVA of Name N        <-->        Index of Name N
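Steps 4 and 5 of the lookup can be sketched in a few lines of Python. This is only an illustration on plain lists, assuming the three arrays have already been read from the file and the name RVAs already resolved to strings; the function name find_export_by_name and the toy data are invented here.

```python
def find_export_by_name(wanted, names, name_ordinals, functions):
    """Return the RVA of the export called `wanted`, or None if no such name exists.

    names         -- strings reached through the AddressOfNames array
    name_ordinals -- values of the AddressOfNameOrdinals array (parallel to names)
    functions     -- RVAs taken from the AddressOfFunctions array
    """
    # Step 4: walk both arrays in parallel.
    for position, name in enumerate(names):
        if name == wanted:
            # The element at the same position holds the index ...
            index = name_ordinals[position]
            # Step 5: ... into the AddressOfFunctions array.
            return functions[index]
    # All NumberOfNames entries were examined without a match.
    return None


# Toy data: "Beta" and "BetaAlias" are two names for the same address.
names = ["Alpha", "Beta", "BetaAlias", "Gamma"]
name_ordinals = [0, 2, 2, 1]
functions = [0x1000, 0x1040, 0x1080]
```

With this data, "Beta" and "BetaAlias" both resolve to the third entry of functions, which shows how aliases come about.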



Now we can turn our attention to the nBase member of the IMAGE_EXPORT_DIRECTORY structure. You already know that the AddressOfFunctions array contains the addresses of all export symbols in a module, and that the PE loader uses indexes into this array to find the addresses of the functions. Let's imagine a scenario where we use the indexes into this array as the ordinals. Since programmers can specify the starting ordinal number in the .def file, like 200, there would have to be at least 200 elements in the AddressOfFunctions array. Furthermore, the first 200 elements would not be used, but they would have to exist so that the PE loader could use the indexes to find the correct addresses. This is not good at all. The nBase member exists to solve this problem. If the programmer specifies a starting ordinal of 200, the value in nBase is 200. When the PE loader reads the value in nBase, it knows that the first 200 elements do not exist and that it should subtract the value in nBase from the ordinal to obtain the true index into the AddressOfFunctions array. With the use of nBase, there is no need to provide 200 empty elements.

Note that nBase doesn't affect the values in the AddressOfNameOrdinals array. Despite the name "AddressOfNameOrdinals", this array contains the true indexes into the AddressOfFunctions array, not the ordinals.

With the discussion of nBase out of the way, we can continue to the next example.

Suppose that we have the ordinal of a function and we need to obtain the address of that function. We can do it like this:

1. Go to the PE header.

2. Obtain the RVA of the export table from the data directory.

3. Go to the export table and obtain the value of nBase.

4. Subtract the value in nBase from the ordinal and you have the index into the AddressOfFunctions array.

5. Compare the index with the value in NumberOfFunctions. If the index is greater than or equal to the value in NumberOfFunctions, the ordinal is invalid.

6. Use the index to obtain the RVA of the function in the AddressOfFunctions array.
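Steps 4 to 6 amount to one subtraction and one range check. A minimal sketch on a plain list, assuming the AddressOfFunctions array has already been read; the function name find_export_by_ordinal is hypothetical, and the check for ordinals below nBase is an addition made here for safety.

```python
def find_export_by_ordinal(ordinal, base, functions):
    """Return the RVA for `ordinal`, or None if the ordinal is invalid.

    base      -- the nBase member of the export directory
    functions -- RVAs from the AddressOfFunctions array
                 (its length plays the role of NumberOfFunctions)
    """
    index = ordinal - base                 # step 4: remove the nBase bias
    if index < 0 or index >= len(functions):
        return None                        # step 5: out of range, invalid ordinal
    return functions[index]                # step 6: direct array access
```

For a module with nBase 200 and three functions, only the ordinals 200, 201 and 202 are valid.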

Note that obtaining the address of a function from an ordinal is much easier and faster than using the name of the function. There is no need to walk the AddressOfNames and AddressOfNameOrdinals arrays. The performance gain, however, must be balanced against the difficulty of maintaining the module.




In conclusion, if you want to obtain the address of a function from its name, you need to walk both the AddressOfNames and AddressOfNameOrdinals arrays to obtain the index into the AddressOfFunctions array. If you have the ordinal of the function, you can go directly to the AddressOfFunctions array after the ordinal is biased by nBase.

If a function is exported by name, you can use either its name or its ordinal in GetProcAddress. But what if the function is exported by ordinal only? We come to that now.

"A function is exported by ordinal only" means the function has no entries in the AddressOfNames and AddressOfNameOrdinals arrays. Remember the two fields, NumberOfFunctions and NumberOfNames. The existence of these two fields is evidence that some functions may not have names. The number of functions must be at least equal to the number of names. The functions that don't have names are exported by their ordinals only. For example, if there are 70 functions but only 40 entries in the AddressOfNames array, there are 30 functions in the module that are exported by their ordinals only. Now how can we find out which functions are exported by ordinals only? It's not easy. You must find that out by exclusion, i.e. the entries in the AddressOfFunctions array that are not referenced by the AddressOfNameOrdinals array contain the RVAs of the functions that are exported by ordinals only.



Example:

This example is similar to the one in the previous tutorial. However, it displays the values of some members of the IMAGE_EXPORT_DIRECTORY structure and also lists the RVAs, ordinals, and names of the exported functions. Note that this example doesn't list the functions that are exported by ordinals only.

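.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\comdlg32.lib

IDD_MAINDLG equ 101
IDC_EDIT equ 1000
IDM_OPEN equ 40001
IDM_EXIT equ 40003

DlgProc proto :DWORD,:DWORD,:DWORD,:DWORD
ShowExportFunctions proto :DWORD
ShowTheFunctions proto :DWORD,:DWORD
AppendText proto :DWORD,:DWORD

SEH struct
    PrevLink dd ?
    CurrentHandler dd ?
    SafeOffset dd ?
    PrevEsp dd ?
    PrevEbp dd ?
SEH ends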

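.data
; The wording of AppName, the first filter entry and the three file-error
; messages is assumed; those lines are missing from this copy of the listing.
AppName db "PE tutorial no.7",0
ofn OPENFILENAME <>
FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0
             db "All Files",0,"*.*",0,0
FileOpenError db "Cannot open the file for reading",0
FileOpenMappingError db "Cannot open the file for memory mapping",0
FileMappingError db "Cannot map the file into memory",0
NotValidPE db "This file is not a valid PE",0
NoExportTable db "No export information in this file",0
CRLF db 0Dh,0Ah,0
ExportTable db 0Dh,0Ah,"======[ IMAGE_EXPORT_DIRECTORY ]======",0Dh,0Ah
            db "Name of the module: %s",0Dh,0Ah
            db "nBase: %lu",0Dh,0Ah
            db "NumberOfFunctions: %lu",0Dh,0Ah
            db "NumberOfNames: %lu",0Dh,0Ah
            db "AddressOfFunctions: %lX",0Dh,0Ah
            db "AddressOfNames: %lX",0Dh,0Ah
            db "AddressOfNameOrdinals: %lX",0Dh,0Ah,0
Header db "RVA Ord. Name",0Dh,0Ah
       db "----------------------------------------------",0
template db "%lX %u %s",0

.data?
buffer db 512 dup(?)
hFile dd ?
hMapping dd ?
pMapping dd ?
ValidPE dd ?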

.code
start:
    invoke GetModuleHandle,NULL
    invoke DialogBoxParam, eax, IDD_MAINDLG,NULL,addr DlgProc, 0
    invoke ExitProcess, 0

DlgProc proc hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD
    .if uMsg==WM_INITDIALOG
        invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETLIMITTEXT,0,0
    .elseif uMsg==WM_CLOSE
        invoke EndDialog,hDlg,0
    .elseif uMsg==WM_COMMAND
        .if lParam==0
            mov eax,wParam
            .if ax==IDM_OPEN
                invoke ShowExportFunctions,hDlg
            .else ; IDM_EXIT
                invoke SendMessage,hDlg,WM_CLOSE,0,0
            .endif
        .endif
    .else
        mov eax,FALSE
        ret
    .endif
    mov eax,TRUE
    ret
DlgProc endp

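SEHHandler proc C pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD
    mov edx,pFrame
    assume edx:ptr SEH
    mov eax,pContext
    assume eax:ptr CONTEXT
    push [edx].SafeOffset           ; resume execution at the safe offset
    pop [eax].regEip
    push [edx].PrevEsp              ; restore the saved stack pointer
    pop [eax].regEsp
    push [edx].PrevEbp              ; restore the saved frame pointer
    pop [eax].regEbp
    mov ValidPE, FALSE
    mov eax,ExceptionContinueExecution
    ret
SEHHandler endp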

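ShowExportFunctions proc uses edi hDlg:DWORD
    LOCAL seh:SEH
    mov ofn.lStructSize,SIZEOF ofn
    mov ofn.lpstrFilter, OFFSET FilterString
    mov ofn.lpstrFile, OFFSET buffer
    mov ofn.nMaxFile,512
    mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY
    invoke GetOpenFileName, ADDR ofn
    .if eax==TRUE
        invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
        .if eax!=INVALID_HANDLE_VALUE
            mov hFile, eax
            invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0
            .if eax!=NULL
                mov hMapping, eax
                invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0
                .if eax!=NULL
                    mov pMapping,eax
                    assume fs:nothing
                    push fs:[0]
                    pop seh.PrevLink
                    mov seh.CurrentHandler,offset SEHHandler
                    mov seh.SafeOffset,offset FinalExit
                    lea eax,seh
                    mov fs:[0], eax
                    mov seh.PrevEsp,esp
                    mov seh.PrevEbp,ebp
                    mov edi, pMapping
                    assume edi:ptr IMAGE_DOS_HEADER
                    .if [edi].e_magic==IMAGE_DOS_SIGNATURE
                        add edi, [edi].e_lfanew
                        assume edi:ptr IMAGE_NT_HEADERS
                        .if [edi].Signature==IMAGE_NT_SIGNATURE
                            mov ValidPE, TRUE
                        .else
                            mov ValidPE, FALSE
                        .endif
                    .else
                        mov ValidPE,FALSE
                    .endif
FinalExit:
                    push seh.PrevLink
                    pop fs:[0]
                    .if ValidPE==TRUE
                        invoke ShowTheFunctions, hDlg, edi
                    .else
                        invoke MessageBox,0, addr NotValidPE, addr AppName, MB_OK+MB_ICONERROR
                    .endif
                    invoke UnmapViewOfFile, pMapping
                .else
                    invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR
                .endif
                invoke CloseHandle,hMapping
            .else
                invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR
            .endif
            invoke CloseHandle, hFile
        .else
            invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR
        .endif
    .endif
    ret
ShowExportFunctions endp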

AppendText proc hDlg:DWORD,pText:DWORD
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,pText
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,addr CRLF
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETSEL,-1,0
    ret
AppendText endp

RVAToFileMap PROC uses edi esi edx ecx pFileMap:DWORD,RVA:DWORD
    mov esi,pFileMap
    assume esi:ptr IMAGE_DOS_HEADER
    add esi,[esi].e_lfanew
    assume esi:ptr IMAGE_NT_HEADERS
    mov edi,RVA ; edi == RVA
    mov edx,esi
    add edx,sizeof IMAGE_NT_HEADERS
    mov cx,[esi].FileHeader.NumberOfSections
    movzx ecx,cx
    assume edx:ptr IMAGE_SECTION_HEADER
    .while ecx>0
        .if edi>=[edx].VirtualAddress
            mov eax,[edx].VirtualAddress
            add eax,[edx].SizeOfRawData         ; eax = end of this section
            .if edi<eax                         ; the RVA lies inside this section
                mov eax,[edx].VirtualAddress
                sub edi,eax                     ; edi = offset within the section
                mov eax,[edx].PointerToRawData
                add eax,edi                     ; eax = file offset
                add eax,pFileMap                ; eax = address in the mapped file
                ret
            .endif
        .endif
        add edx,sizeof IMAGE_SECTION_HEADER
        dec ecx
    .endw
    assume edx:nothing
    assume esi:nothing
    mov eax,edi
    ret
RVAToFileMap endp

ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD
    LOCAL temp[512]:BYTE
    LOCAL NumberOfNames:DWORD
    LOCAL Base:DWORD
    mov edi,pNTHdr
    assume edi:ptr IMAGE_NT_HEADERS
    mov edi, [edi].OptionalHeader.DataDirectory.VirtualAddress
    .if edi==0
        invoke MessageBox,0, addr NoExportTable,addr AppName,MB_OK+MB_ICONERROR
        ret
    .endif
    invoke SetDlgItemText,hDlg,IDC_EDIT,0
    invoke AppendText,hDlg,addr buffer
    invoke RVAToFileMap,pMapping,edi
    mov edi,eax
    assume edi:ptr IMAGE_EXPORT_DIRECTORY
    mov eax,[edi].NumberOfFunctions
    invoke RVAToFileMap, pMapping,[edi].nName
    invoke wsprintf, addr temp,addr ExportTable, eax, [edi].nBase, [edi].NumberOfFunctions, [edi].NumberOfNames, [edi].AddressOfFunctions, [edi].AddressOfNames, [edi].AddressOfNameOrdinals
    invoke AppendText,hDlg,addr temp
    invoke AppendText,hDlg,addr Header
    push [edi].NumberOfNames
    pop NumberOfNames
    push [edi].nBase
    pop Base
    invoke RVAToFileMap,pMapping,[edi].AddressOfNames
    mov esi,eax                     ; esi -> array of name RVAs
    invoke RVAToFileMap,pMapping,[edi].AddressOfNameOrdinals
    mov ebx,eax                     ; ebx -> array of 16-bit indexes
    invoke RVAToFileMap,pMapping,[edi].AddressOfFunctions
    mov edi,eax                     ; edi -> array of function RVAs
    .while NumberOfNames>0
        invoke RVAToFileMap,pMapping,dword ptr [esi]
        mov dx,[ebx]
        movzx edx,dx
        mov ecx,edx
        shl edx,2
        add edx,edi
        add ecx,Base
        invoke wsprintf, addr temp,addr template,dword ptr [edx],ecx,eax
        invoke AppendText,hDlg,addr temp
        dec NumberOfNames
        add esi,4
        add ebx,2
    .endw
    ret
ShowTheFunctions endp
end start



Analysis:

    mov edi,pNTHdr
    assume edi:ptr IMAGE_NT_HEADERS
    mov edi, [edi].OptionalHeader.DataDirectory.VirtualAddress
    .if edi==0
        invoke MessageBox,0, addr NoExportTable,addr AppName,MB_OK+MB_ICONERROR
        ret
    .endif

After the program verifies that the file is a valid PE, it goes to the data directory and obtains the virtual address of the export table. If the virtual address is zero, the file doesn't have any exported symbol.

    mov eax,[edi].NumberOfFunctions
    invoke RVAToFileMap, pMapping,[edi].nName
    invoke wsprintf, addr temp,addr ExportTable, eax, [edi].nBase, [edi].NumberOfFunctions, [edi].NumberOfNames, [edi].AddressOfFunctions, [edi].AddressOfNames, [edi].AddressOfNameOrdinals
    invoke AppendText,hDlg,addr temp

We display the important information in the IMAGE_EXPORT_DIRECTORY structure in the edit control.

    push [edi].NumberOfNames
    pop NumberOfNames
    push [edi].nBase
    pop Base

Since we want to enumerate all function names, we need to know how many names there are in the export table. nBase is used when we want to convert the indexes into the AddressOfFunctions array into ordinals.

    invoke RVAToFileMap,pMapping,[edi].AddressOfNames
    mov esi,eax
    invoke RVAToFileMap,pMapping,[edi].AddressOfNameOrdinals
    mov ebx,eax
    invoke RVAToFileMap,pMapping,[edi].AddressOfFunctions
    mov edi,eax

The addresses of the three arrays are stored in esi, ebx, and edi, ready to be accessed.

    .while NumberOfNames>0

Continue until all names are processed.

    invoke RVAToFileMap,pMapping,dword ptr [esi]

Since esi points to an array of RVAs of the exported names, dereferencing it gives the RVA of the current name. We convert it to an address inside the mapped file, to be used in wsprintf later.

    mov dx,[ebx]
    movzx edx,dx
    mov ecx,edx
    add ecx,Base

ebx points to the array of ordinals. Its array elements are word-size, so we need to convert the value into a dword first. edx and ecx contain the index into the AddressOfFunctions array. We will use edx as the pointer into the AddressOfFunctions array. We add the value of nBase to ecx to obtain the ordinal number of the function.

    shl edx,2
    add edx,edi

We multiply the index by 4 (each element in the AddressOfFunctions array is 4 bytes in size) and then add the address of the AddressOfFunctions array to it. Thus edx points to the RVA of the function.

    invoke wsprintf, addr temp,addr template,dword ptr [edx],ecx,eax
    invoke AppendText,hDlg,addr temp

We display the RVA, ordinal, and the name of the function in the edit control.

    dec NumberOfNames
    add esi,4
    add ebx,2
    .endw

Update the counter and the addresses of the current elements in the AddressOfNames and AddressOfNameOrdinals arrays. Continue until all names are processed.
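For readers who want to check the logic without assembling anything, the loop can be mirrored on plain Python lists. The sketch below assumes the three arrays have already been read from the mapped file; both function names are invented for this illustration. The second function goes one step further than the example program and applies the exclusion rule for functions exported by ordinal only. Treating a zero RVA as an unused slot is an assumption made here.

```python
def list_named_exports(base, names, name_ordinals, functions):
    """Return (RVA, ordinal, name) rows, one per entry of the name array."""
    rows = []
    for name, index in zip(names, name_ordinals):
        # The word from AddressOfNameOrdinals is an index; adding nBase
        # turns it into the ordinal, as the 'add ecx,Base' line does.
        rows.append((functions[index], index + base, name))
    return rows


def list_ordinal_only_exports(base, functions, name_ordinals):
    """Return (RVA, ordinal) rows for entries that no name refers to."""
    named = set(name_ordinals)
    rows = []
    for index, rva in enumerate(functions):
        if index in named:
            continue          # reachable by name, so not ordinal-only
        if rva == 0:
            continue          # assumption: a zero RVA marks an unused slot
        rows.append((rva, index + base))
    return rows
```

Every index in the AddressOfFunctions array ends up in exactly one of the two lists or is an empty slot.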



The PE file format by Bernd Luevelsmeyer

Preface

The PE ("portable executable") file format is the format of executable binaries (DLLs and programs) for MS Windows NT, Windows 95 and Win32s; in Windows NT, the drivers are in this format, too. It can also be used for object files and libraries.

The format was designed by Microsoft and standardized by the TIS (Tool Interface Standard) Committee (Microsoft, Intel, Borland, Watcom, IBM and others) in 1993, apparently based on a good knowledge of COFF, the "common object file format" used for object files and executables on several UNIXes and on VMS.

The Win32 SDK includes a header file <winnt.h> containing #defines and typedefs for the PE format. I will mention the struct member names and #defines as we go.

You may also find the DLL "imagehlp.dll" to be helpful. It is part of Windows NT, but documentation is scarce. Some of its functions are described in the "Developer Network".




General Layout

At the start of a PE file we find an MS-DOS executable ("stub"); this makes any PE file a valid MS-DOS executable.

After the DOS-stub there is a 32-bit signature with the magic number 0x00004550 (IMAGE_NT_SIGNATURE).

Then there is a file header (in the COFF format) that tells on which machine the binary is supposed to run, how many sections are in it, the time it was linked, whether it is an executable or a DLL and so on. (The difference between executable and DLL in this context is: a DLL cannot be started but only be used by another binary, and a binary cannot link to an executable.)

After that, we have an optional header (it is always there but still called "optional" - COFF uses an "optional header" for libraries but not for objects, that's why it is called "optional"). This tells us more about how the binary should be loaded: the starting address, the amount of stack to reserve, the size of the data segment etc.

An interesting part of the optional header is the trailing array of 'data directories'; these directories contain pointers to data in the 'sections'. If, for example, the binary has an export directory, you will find a pointer to that directory in the array member IMAGE_DIRECTORY_ENTRY_EXPORT, and it will point into one of the sections.

Following the headers we find the 'sections', introduced by the 'section headers'. Essentially, the sections' contents are what you really need to execute a program, and all the header and directory stuff is just there to help you find it.

Each section has some flags about alignment, what kind of data it contains ("initialized data" and so on), whether it can be shared etc., and the data itself. Most, but not all, sections contain one or more directories referenced through the entries of the optional header's "data directory" array, like the directory of exported functions or the directory of base relocations. Directoryless types of contents are, for example, "executable code" or "initialized data".



+-------------------+
| DOS-stub          |
+-------------------+
| file-header       |
+-------------------+
| optional header   |
|- - - - - - - - - -|
|                   |
| data directories  |
|                   |
+-------------------+
|                   |
| section headers   |
|                   |
+-------------------+
|                   |
| section 1         |
|                   |
+-------------------+
|                   |
| section 2         |
|                   |
+-------------------+
|                   |
| ...               |
|                   |
+-------------------+
|                   |
| section n         |
|                   |
+-------------------+




DOS-stub and Signature

The concept of a DOS-stub is well-known from the 16-bit Windows executables (which were in the "NE" format). The stub is used for OS/2 executables, self-extracting archives and other applications, too.

For PE files, it is an MS-DOS 2.0 compatible executable that almost always consists of about 100 bytes that output an error message such as "this program needs windows NT". You recognize a DOS-stub by validating the DOS-header, being a struct IMAGE_DOS_HEADER. The first 2 bytes should be the sequence "MZ" (there is a #define IMAGE_DOS_SIGNATURE for this WORD). You distinguish a PE binary from other stubbed binaries by the trailing signature, which you find at the offset given by the header member 'e_lfanew' (which is 32 bits long, beginning at byte offset 60). For OS/2 and Windows binaries, the signature is a 16-bit word; for PE files, it is a 32-bit longword aligned at an 8-byte boundary and having the value IMAGE_NT_SIGNATURE, #defined to be 0x00004550.
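The signature check described above fits in a few lines. The following is a minimal sketch working on the raw bytes of a file; the function name find_file_header and the bounds checks are additions made for this illustration, while the offsets and magic values are the ones given in the text.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D       # the bytes "MZ" read as a little-endian WORD
IMAGE_NT_SIGNATURE = 0x00004550    # the bytes "PE\0\0" read as a little-endian DWORD


def find_file_header(data):
    """Return the file offset of IMAGE_FILE_HEADER, or None if `data` is not a PE file."""
    if len(data) < 64:                                   # too short for a DOS header
        return None
    (magic,) = struct.unpack_from("<H", data, 0)
    if magic != IMAGE_DOS_SIGNATURE:
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 60)     # 32 bits at byte offset 60
    if e_lfanew + 4 > len(data):                         # signature would lie outside the file
        return None
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    if signature != IMAGE_NT_SIGNATURE:                  # e.g. a 16-bit "NE" binary
        return None
    return e_lfanew + 4                                  # the file header follows the signature
```

A stubbed 16-bit binary passes the "MZ" test but fails the second one, which is exactly the distinction the text describes.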

File Header

To get to the IMAGE_FILE_HEADER, validate the "MZ" of the DOS-header (1st 2 bytes), then find the 'e_lfanew' member of the DOS-stub's header and skip that many bytes from the beginning of the file. Verify the signature you will find there. The file header, a struct IMAGE_FILE_HEADER, begins immediately after it; the members are described top to bottom.



The first member is the 'Machine', a 16-bit value indicating the system the binary is intended to run on. Known legal values are

    IMAGE_FILE_MACHINE_I386 (0x14c) for Intel 80386 processor or better

    0x014d for Intel 80486 processor or better

    0x014e for Intel Pentium processor or better

    0x0160 for R3000 (MIPS) processor, big endian

    IMAGE_FILE_MACHINE_R3000 (0x162) for R3000 (MIPS) processor, little endian

    IMAGE_FILE_MACHINE_ALPHA (0x184) for DEC Alpha AXP processor

    IMAGE_FILE_MACHINE_POWERPC (0x1F0) for IBM Power PC, little endian

Then we have the 'NumberOfSections', a 16-bit value. It is the number of sections that follow the headers. We will discuss the sections later.

Next is a timestamp 'TimeDateStamp' (32 bit), giving the time the file was created. You can distinguish several versions of the same file by this value, even if the "official" version number was not altered. (The format of the timestamp is not documented except that it should be somewhat unique among versions of the same file, but apparently it is 'seconds since January 1 1970 00:00:00' in UTC - the format used by most C compilers for the time_t.)

This timestamp is used for the binding of import directories, which will be discussed later. Warning: some linkers tend to set this timestamp to absurd values which are not the time of linking in time_t format as described.
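If the value really is a time_t, as the text supposes, it can be turned into a readable date as sketched below. The function name link_time is hypothetical, and the result is only meaningful for linkers that store a real time.

```python
from datetime import datetime, timezone


def link_time(time_date_stamp):
    """Interpret TimeDateStamp as seconds since January 1 1970 00:00:00 UTC."""
    return datetime.fromtimestamp(time_date_stamp, tz=timezone.utc)
```

An absurd value still converts without error, so a date far in the past or future is a hint that the linker did not store the link time.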

The members 'PointerToSymbolTable' and 'NumberOfSymbols' (both 32 bit) are used for debugging information. I don't know how to decipher them, and I've found the pointer to be always 0.

'SizeOfOptionalHeader' (16 bit) is simply sizeof(IMAGE_OPTIONAL_HEADER). You can use it to validate the correctness of the PE file's structure.
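The members described so far, together with the 'Characteristics' word that closes the structure, add up to 20 bytes. As a sketch, they can be unpacked in one call; the names FileHeader, FILE_HEADER_FORMAT and parse_file_header are invented for this illustration, and the offset is assumed to point just past the 32-bit signature.

```python
import struct
from collections import namedtuple

FileHeader = namedtuple("FileHeader", [
    "Machine", "NumberOfSections", "TimeDateStamp", "PointerToSymbolTable",
    "NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics",
])

# Little-endian: two words, three dwords, two words (20 bytes in total).
FILE_HEADER_FORMAT = "<HHIIIHH"


def parse_file_header(data, offset):
    """Unpack IMAGE_FILE_HEADER from `data`, starting at `offset`."""
    return FileHeader(*struct.unpack_from(FILE_HEADER_FORMAT, data, offset))
```

The optional header then starts at offset + 20, and SizeOfOptionalHeader tells how many bytes it occupies.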




'Characteristics' is 16 bits and consists of a collection of flags, most of them being valid only for object files and libraries:

Bit 0 (IMAGE_FILE_RELOCS_STRIPPED) is set if there is no relocation information in the file. This refers to relocation information per section in the sections themselves; it is not used for executables, which have relocation information in the 'base relocation' directory described below.

Bit 1 (IMAGE_FILE_EXECUTABLE_IMAGE) is set if the file is executable, i.e. it is not an object file or a library. This flag may also be set if the linker attempted to create an executable but failed for some reason, and keeps the image in order to do e.g. incremental linking the next time.

Bit 2 (IMAGE_FILE_LINE_NUMS_STRIPPED) is set if the line number information is stripped; this is not used for executable files.

Bit 3 (IMAGE_FILE_LOCAL_SYMS_STRIPPED) is set if there is no information about local symbols in the file (this is not used for executable files).

Bit 4 (IMAGE_FILE_AGGRESIVE_WS_TRIM) is set if the operating system is supposed to trim the working set of the running process (the amount of RAM the process uses) aggressively by paging it out. This should be set if it is a demon-like application that waits most of the time and only wakes up once a day, or the like.

Bits 7 (IMAGE_FILE_BYTES_REVERSED_LO) and 15 (IMAGE_FILE_BYTES_REVERSED_HI) are set if the endianness of the file is not what the machine would expect, so it must swap bytes before reading. This is unreliable for executable files (the OS expects executables to be correctly byte-ordered).

Bit 8 (IMAGE_FILE_32BIT_MACHINE) is set if the machine is expected to be a 32 bit machine. This is always set for current implementations; NT5 may work differently.

Bit 9 (IMAGE_FILE_DEBUG_STRIPPED) is set if there is no debugging information in the file. This is unused for executable files. According to other information ([6]), this bit is called "fixed" and is set if the image can only run if it is loaded at the preferred load address (i.e. it is not relocatable).

Bit 10 (IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP) is set if the application may not run from a removable medium such as a floppy or a CD-ROM. In this case, the operating system is advised to copy the file to the swapfile and execute it from there.

Bit 11 (IMAGE_FILE_NET_RUN_FROM_SWAP) is set if the application may not run from the network. In this case, the operating system is advised to copy the file to the swapfile and execute it from there.

Bit 12 (IMAGE_FILE_SYSTEM) is set if the file is a system file such as a driver. This is unused for executable files; it is also not used in all the NT drivers I inspected.

Bit 13 (IMAGE_FILE_DLL) is set if the file is a DLL.

Bit 14 (IMAGE_FILE_UP_SYSTEM_ONLY) is set if the file is not designed to run on multiprocessor systems (that is, it will crash there because it relies in some way on exactly one processor).
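A disassembler will usually want to show these flags by name. The sketch below decodes a 'Characteristics' word using exactly the bit numbers listed above; the table and function names are invented for this illustration, and bits that the list does not describe are ignored.

```python
# Bit number -> flag name, as listed in the text (bits 5 and 6 are not described there).
CHARACTERISTICS_BITS = {
    0: "IMAGE_FILE_RELOCS_STRIPPED",
    1: "IMAGE_FILE_EXECUTABLE_IMAGE",
    2: "IMAGE_FILE_LINE_NUMS_STRIPPED",
    3: "IMAGE_FILE_LOCAL_SYMS_STRIPPED",
    4: "IMAGE_FILE_AGGRESIVE_WS_TRIM",
    7: "IMAGE_FILE_BYTES_REVERSED_LO",
    8: "IMAGE_FILE_32BIT_MACHINE",
    9: "IMAGE_FILE_DEBUG_STRIPPED",
    10: "IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP",
    11: "IMAGE_FILE_NET_RUN_FROM_SWAP",
    12: "IMAGE_FILE_SYSTEM",
    13: "IMAGE_FILE_DLL",
    14: "IMAGE_FILE_UP_SYSTEM_ONLY",
    15: "IMAGE_FILE_BYTES_REVERSED_HI",
}


def decode_characteristics(value):
    """Return the names of all known flags set in `value`, lowest bit first."""
    return [name
            for bit, name in sorted(CHARACTERISTICS_BITS.items())
            if value & (1 << bit)]
```

For example, the value 0x2102 has bits 1, 8 and 13 set, which describes a 32-bit executable image that is a DLL.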




Relative Virtual Addresses

The PE format makes heavy use of so-called RVAs. An RVA, aka "relative virtual address", is used to describe a memory address if you don't know the base address. It is the value you need to add to the base address to get the linear address. The base address is the address the PE image is loaded to, and may vary from one invocation to the next.

Example: suppose an executable file is loaded to address 0x400000 and execution starts at RVA 0x1560. The effective execution start will then be at the address 0x401560. If the executable were loaded to 0x100000, the execution start would be 0x101560.

Things become complicated because the parts of the PE file (the sections) are not necessarily aligned the same way the loaded image is. For example, the sections of the file are often aligned to 512-byte borders, but the loaded image is perhaps aligned to 4096-byte borders. See 'SectionAlignment' and 'FileAlignment' below.

So to find a piece of information in a PE file for a specific RVA, you must calculate the offsets as if the file were loaded, but skip according to the file offsets. As an example, suppose you knew the execution starts at RVA 0x1560, and want to disassemble the code starting there. To find the address in the file, you will have to find out that sections in RAM are aligned to 4096 bytes and the ".code" section starts at RVA 0x1000 in RAM and is 16384 bytes long; then you know that RVA 0x1560 is at offset 0x560 in that section. Find out that the sections are aligned to 512-byte borders in the file and that ".code" begins at offset 0x800 in the file, and you know that the code execution start is at byte 0x800+0x560=0xd60 in the file.

Then you disassemble and find an access to a variable at the linear address 0x1051d0. The linear address will be relocated upon loading the binary and is given on the assumption that the preferred load address is used. You find out that the preferred load address is 0x100000, so we are dealing with RVA 0x51d0. This is in the data section, which starts at RVA 0x5000 and is 2048 bytes long. It begins at file offset 0x4800.

Hence, the variable can be found at file offset 0x4800+0x51d0-0x5000=0x49d0.
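Both worked examples can be checked with a short sketch. The function name rva_to_file_offset and the tuple layout (start RVA, length, file offset) are choices made for this illustration; a real tool would read these three values from the section headers.

```python
def rva_to_file_offset(rva, sections):
    """Translate an RVA into a file offset, or return None if no section contains it.

    sections -- iterable of (start_rva, length, file_offset) tuples
    """
    for start_rva, length, file_offset in sections:
        if start_rva <= rva < start_rva + length:
            # Offset within the section, rebased onto the section's place in the file.
            return file_offset + (rva - start_rva)
    return None


# The two sections of the example in the text.
sections = [
    (0x1000, 16384, 0x0800),   # ".code": RVA 0x1000, 16384 bytes, file offset 0x800
    (0x5000, 2048, 0x4800),    # data section: RVA 0x5000, 2048 bytes, file offset 0x4800
]

entry_offset = rva_to_file_offset(0x1560, sections)                  # execution start
variable_offset = rva_to_file_offset(0x1051D0 - 0x100000, sections)  # linear address minus load address
```

Here entry_offset comes out as 0xd60 and variable_offset as 0x49d0, the same values as computed by hand above.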



Optional Header

Immediately following the file header is the IMAGE_OPTIONAL_HEADER (which, in spite of the name, is always there). It contains information about how to treat the PE file exactly. We'll also have the members from top to bottom.

The first 16-bit word is 'Magic' and has, as far as I looked into PE files, always the value 0x010b.

The next 2 bytes are the version of the linker ('MajorLinkerVersion' and 'MinorLinkerVersion') that produced the file. These values, again, are unreliable and do not always reflect the linker version properly. (Several linkers simply don't set this field.)

And, coming to think about it, what good is the version if you have got no idea *which* linker was used?

The next 3 longwords (32 bit each) are intended to be the size of the executable code ('SizeOfCode'), the size of the initialized data ('SizeOfInitializedData', the so-called "data segment"), and the size of the uninitialized data ('SizeOfUninitializedData', the so-called "bss segment"). These values are, again, unreliable (e.g. the data segment may actually be split into several segments by the compiler or linker), and you get better sizes by inspecting the 'sections' that follow the optional header.

Next is a 32-bit value that is an RVA. This RVA is the offset to the code's entry point ('AddressOfEntryPoint'). Execution starts here; it is e.g. the address of a DLL's LibMain() or a program's startup code (which will in turn call main()) or a driver's DriverEntry(). If you dare to load the image "by hand", you call this address to start the process after you have done all the fixups and the relocations.

The next 2 32-bit values are the offsets to the executable code ('BaseOfCode') and the initialized data ('BaseOfData'), both of them RVAs again, and both of them being of little interest because you get more reliable information by inspecting the 'sections' that follow the headers.

There is no offset to the uninitialized data because, being uninitialized, there is little point in providing this data in the image.
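The members described up to this point occupy the first 28 bytes of the optional header. As a sketch, assuming the layout that goes with the 'Magic' value 0x010b, they can be unpacked as follows; the names OptionalHeaderStart, OPTIONAL_HEADER_START_FORMAT and parse_optional_header_start are invented for this illustration.

```python
import struct
from collections import namedtuple

OptionalHeaderStart = namedtuple("OptionalHeaderStart", [
    "Magic", "MajorLinkerVersion", "MinorLinkerVersion",
    "SizeOfCode", "SizeOfInitializedData", "SizeOfUninitializedData",
    "AddressOfEntryPoint", "BaseOfCode", "BaseOfData",
])

# Little-endian: one word, two bytes, six dwords (28 bytes in total).
OPTIONAL_HEADER_START_FORMAT = "<HBBIIIIII"


def parse_optional_header_start(data, offset):
    """Unpack the first nine members of IMAGE_OPTIONAL_HEADER at `offset`."""
    fields = struct.unpack_from(OPTIONAL_HEADER_START_FORMAT, data, offset)
    return OptionalHeaderStart(*fields)
```

AddressOfEntryPoint is an RVA, so it still has to be translated through the section headers before the bytes at the entry point can be read from the file.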




The next entry is a 32-bit value giving the preferred (linear) load address ('ImageBase') of the entire binary, including all headers. This is the address (always a multiple of 64 KB) the file has been relocated to by the linker; if the binary can in fact be loaded to that address, the loader doesn't need to relocate the file again, which is a win in loading time. The preferred load address cannot be used if another image has already been loaded to that address (an "address clash", which happens quite often if you load several DLLs that are all relocated to the linker's default), or the memory in question has been used for other purposes (stack, malloc(), uninitialized data, whatever). In these cases, the image must be loaded to some other address and it needs to be relocated (see 'relocation directory' below). This has further consequences if the image is a DLL, because then the "bound imports" are no longer valid, and fixups have to be made to the binary that uses the DLL - see 'import directory' below.

The next 2 32-bit values are the alignments of the PE file's sections in RAM ('SectionAlignment', when the image has been loaded) and in the file ('FileAlignment'). Usually both values are 32, or FileAlignment is 512 and SectionAlignment is 4096. Sections will be discussed later.

The next 2 16-bit words are the expected operating system version ('MajorOperatingSystemVersion' and 'MinorOperatingSystemVersion' [they _do_ like self-documenting names at MS]). This version information is intended to be the operating system's (e.g. NT or Win95) version, as opposed to the subsystem's version (e.g. Win32); it is often not supplied, or supplied incorrectly. The loader doesn't use it, apparently.

The next 2 16-bit words are the binary's version ('MajorImageVersion' and 'MinorImageVersion'). Many linkers don't set this information correctly and many programmers don't bother to supply it, so it is better to rely on the version resource if one exists.

The next 2 16-bit words are the expected subsystem version ('MajorSubsystemVersion' and 'MinorSubsystemVersion'). This should be the Win32 version or the POSIX version, because 16-bit programs or OS/2 programs won't be in PE format, obviously. This subsystem version should be supplied correctly, because it *is* checked and used:

If the application is a Win32 GUI application and runs on NT4, and the subsystem version is *not* 4.0, the dialogs won't be 3D-style and certain other features will also work "old-style" because the application expects to run on NT 3.51, which had the program manager instead of explorer and so on, and NT 4.0 will mimic that behaviour as faithfully as possible.



Then we have a 'Win32VersionValue' of 32 bits. I don't know what it is good for. It has been 0 in all the PE files that I inspected.

Next is a 32-bit-value giving the amount of memory the image will need, in bytes ('SizeOfImage'). It is the sum of all headers' and sections' lengths if aligned to 'SectionAlignment'. It is a hint to the loader how many pages it will need in order to load the image.

The next thing is a 32-bit-value giving the total length of all headers including the data directories and the section headers ('SizeOfHeaders'). It is at the same time the offset from the beginning of the file to the first section's raw data.

Then we have got a 32-bit-checksum ('CheckSum'). This checksum is, for current versions of NT, only checked if the image is a NT-driver (the driver will fail to load if the checksum isn't correct). For other binary types, the checksum need not be supplied and may be 0.

The algorithm to compute the checksum is property of Microsoft, and they won't tell you. However, several tools of the Win32 SDK will compute and/or patch a valid checksum, and the function CheckSumMappedFile() in imagehlp.dll will do so too.

The checksum is supposed to prevent loading of damaged binaries that would crash anyway - and a crashing driver would result in a BSOD, so it is better not to load it at all.




Then there is a 16-bit-word 'Subsystem' that tells in which of the NT-subsystems the image runs:

IMAGE_SUBSYSTEM_NATIVE (1)
    The binary doesn't need a subsystem. This is used for drivers.

IMAGE_SUBSYSTEM_WINDOWS_GUI (2)
    The image is a Win32 graphical binary. (It can still open a console with AllocConsole() but won't get one automatically at startup.)

IMAGE_SUBSYSTEM_WINDOWS_CUI (3)
    The binary is a Win32 console binary. (It will get a console per default at startup, or inherit the parent's console.)

IMAGE_SUBSYSTEM_OS2_CUI (5)
    The binary is an OS/2 console binary. (OS/2 binaries will be in OS/2 format, so this value will seldom be used in a PE file.)

IMAGE_SUBSYSTEM_POSIX_CUI (7)
    The binary uses the POSIX console subsystem.

Windows 95 binaries will always use the Win32 subsystem, so the only legal values for these binaries are 2 and 3; I don't know if "native" binaries on Windows 95 are possible.
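The subsystem values listed above lend themselves to a small lookup table. The following Python sketch is only an illustration; the names `SUBSYSTEM_NAMES` and `describe_subsystem` are made up for this example, and values not mentioned in the text are simply reported as unknown.

```python
# Subsystem values exactly as listed in the text above.
SUBSYSTEM_NAMES = {
    1: "IMAGE_SUBSYSTEM_NATIVE",
    2: "IMAGE_SUBSYSTEM_WINDOWS_GUI",
    3: "IMAGE_SUBSYSTEM_WINDOWS_CUI",
    5: "IMAGE_SUBSYSTEM_OS2_CUI",
    7: "IMAGE_SUBSYSTEM_POSIX_CUI",
}


def describe_subsystem(value):
    """Return the symbolic name for a 16-bit 'Subsystem' value."""
    return SUBSYSTEM_NAMES.get(value, "unknown (%d)" % value)
```

A disassembler front end could use such a table to label the header field in its listing.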

The next thing is a 16-bit-value that tells, if the image is a DLL, when to call the DLL's entry point ('DllCharacteristics'). This seems not to be used; apparently, the DLL is always notified about everything.

If bit 0 is set, the DLL is notified about process attachment (i.e. DLL load).

If bit 1 is set, the DLL is notified about thread detachments (i.e. thread terminations).

If bit 2 is set, the DLL is notified about thread attachments (i.e. thread creations).

If bit 3 is set, the DLL is notified about process detachment (i.e. DLL unload).



The next 4 32-bit-values are the size of reserved stack ('SizeOfStackReserve'), the size of initially committed stack ('SizeOfStackCommit'), the size of the reserved heap ('SizeOfHeapReserve') and the size of the committed heap ('SizeOfHeapCommit').

The 'reserved' amounts are address space (not real RAM) that is reserved for the specific purpose; at program startup, the 'committed' amount is actually allocated in RAM. The 'committed' value is also the amount by which the committed stack or heap grows if necessary. (Other sources claim that the stack will grow in pages, regardless of the 'SizeOfStackCommit' value. I didn't check this.)

So, as an example, if a program has a reserved heap of 1 MB and a committed heap of 64 KB, the heap will start out at 64 KB and is guaranteed to be enlargeable up to 1 MB. The heap will grow in 64-KB-chunks.

The 'heap' in this context is the primary (default) heap. A process can create more heaps if it so wishes.

The stack is the first thread's stack (the one that starts main()). The process can create more threads which will have their own stacks. DLLs don't have a stack or heap of their own, so the values are ignored for their images. I don't know if drivers have a heap or a stack of their own, but I don't think so.

After these stack- and heap-descriptions, we find 32 bits of 'LoaderFlags', for which I didn't find a useful description. I only found a vague note about setting bits that automatically invoke a breakpoint or a debugger after loading the image; however, this doesn't seem to work.

Then we find 32 bits of 'NumberOfRvaAndSizes', which is the number of valid entries in the directories that follow immediately. I've found this value to be unreliable; you might wish to use the constant IMAGE_NUMBEROF_DIRECTORY_ENTRIES instead, or the lesser of both.

After the 'NumberOfRvaAndSizes' there is an array of IMAGE_NUMBEROF_DIRECTORY_ENTRIES (16) IMAGE_DATA_DIRECTORYs.

Each of these directories describes the location (32 bits RVA called 'VirtualAddress') and size (also 32 bit, called 'Size') of a particular piece of information, which is located in one of the sections that follow the directory entries. For example, the security directory is found at the RVA and has the size that are given at index 4.




The directories that I know the structure of will be discussed later. Defined directory indexes are:

IMAGE_DIRECTORY_ENTRY_EXPORT (0)
    The directory of exported symbols; mostly used for DLLs. Described below.

IMAGE_DIRECTORY_ENTRY_IMPORT (1)
    The directory of imported symbols; see below.

IMAGE_DIRECTORY_ENTRY_RESOURCE (2)
    Directory of resources. Described below.

IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)
    Exception directory - structure and purpose unknown.

IMAGE_DIRECTORY_ENTRY_SECURITY (4)
    Security directory - structure and purpose unknown.

IMAGE_DIRECTORY_ENTRY_BASERELOC (5)
    Base relocation table - see below.

IMAGE_DIRECTORY_ENTRY_DEBUG (6)
    Debug directory - contents is compiler dependent. Moreover, many compilers stuff the debug information into the code section and don't create a separate section for it.

IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)
    Description string - some arbitrary copyright note or the like.

IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)
    Machine Value (MIPS GP) - structure and purpose unknown.

IMAGE_DIRECTORY_ENTRY_TLS (9)
    Thread local storage directory - structure unknown; contains variables that are declared "__declspec(thread)", i.e. per-thread global variables.

IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)
    Load configuration directory - structure and purpose unknown.

IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)
    Bound import directory - see description of import directory.

IMAGE_DIRECTORY_ENTRY_IAT (12)
    Import Address Table - see description of import directory.

As an example, if we find at index 7 the 2 longwords 0x12000 and 33, and the load address is 0x10000, we know that the copyright data is at address 0x10000+0x12000 (in whatever section there may be), and the copyright note is 33 bytes long. If a directory of a particular type is not used in a binary, the Size and VirtualAddress are both 0.
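To make the layout of the data directory array concrete, here is a minimal Python sketch that reads the (VirtualAddress, Size) pairs from a byte buffer and redoes the copyright example from the paragraph above. The function name `parse_data_directories` and the synthetic buffer are assumptions for illustration; a real tool would first locate the array inside the optional header.

```python
import struct

IMAGE_NUMBEROF_DIRECTORY_ENTRIES = 16
IMAGE_DIRECTORY_ENTRY_COPYRIGHT = 7


def parse_data_directories(buf, offset, number_of_rva_and_sizes):
    """Return a list of (virtual_address, size) pairs read from buf."""
    # Use the lesser of the stored count and the constant, as suggested above.
    count = min(number_of_rva_and_sizes, IMAGE_NUMBEROF_DIRECTORY_ENTRIES)
    entries = []
    for i in range(count):
        # Each IMAGE_DATA_DIRECTORY is two little-endian 32-bit values.
        rva, size = struct.unpack_from("<II", buf, offset + 8 * i)
        entries.append((rva, size))
    return entries


# Synthetic array in which only the copyright entry (index 7) is used.
raw = bytearray(8 * IMAGE_NUMBEROF_DIRECTORY_ENTRIES)
struct.pack_into("<II", raw, 8 * IMAGE_DIRECTORY_ENTRY_COPYRIGHT, 0x12000, 33)

dirs = parse_data_directories(bytes(raw), 0, 16)
image_base = 0x10000
copyright_rva, copyright_size = dirs[IMAGE_DIRECTORY_ENTRY_COPYRIGHT]
copyright_address = image_base + copyright_rva  # linear address when loaded
```

Unused directories come back as (0, 0), which is how the text says they are marked.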




Section directories

The sections consist of two major parts: first, a section description (of type IMAGE_SECTION_HEADER) and then the raw section data. So after the data directories we find an array of 'NumberOfSections' section headers, ordered by the sections' RVAs.

A section header contains:

An array of IMAGE_SIZEOF_SHORT_NAME (8) bytes that make up the name (ASCII) of the section. If all of the 8 bytes are used there is no 0-terminator for the string! The name is typically something like ".data" or ".text" or ".bss". There need not be a leading '.', the names may also be "CODE" or "IAT" or the like. Please note that the names are not at all related to the contents of the section. A section named ".code" may or may not contain the executable code; it may just as well contain the import address table; it may also contain the code *and* the address table *and* the initialized data. To find information in the sections, you will have to look it up via the data directories of the optional header. Do not rely on the names, and do not assume that the section's raw data starts at the beginning of a section.
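Because the name field may lack a 0-terminator, reading it with ordinary C string functions is risky. A minimal Python sketch of a safe decoder follows; the function name `section_name` is made up for this example.

```python
IMAGE_SIZEOF_SHORT_NAME = 8


def section_name(raw8):
    """Decode the 8-byte section name field.

    The field is padded with 0-bytes when the name is shorter than 8
    characters, but a name of exactly 8 characters has no terminator.
    """
    if len(raw8) != IMAGE_SIZEOF_SHORT_NAME:
        raise ValueError("section name field must be exactly 8 bytes")
    # Cut at the first 0-byte if there is one, otherwise keep all 8 bytes.
    return raw8.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```

Both the padded and the full-length case go through the same code path, so no special handling is needed for 8-character names.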

The next member of the IMAGE_SECTION_HEADER is a 32-bit-union of 'PhysicalAddress' and 'VirtualSize'. In an object file, this is the address the contents is relocated to; in an executable, it is the size of the contents. In fact, the field seems to be unused; there are linkers that enter the size, and there are linkers that enter the address, and I've also found a linker that enters a 0, and all the executables run like the gentle wind.

The next member is 'VirtualAddress', a 32-bit-value holding the RVA to the section's data when it is loaded in RAM.

Then we have got 32 bits of 'SizeOfRawData', which is the size of the section's data rounded up to the next multiple of 'FileAlignment'.

Next is 'PointerToRawData' (32 bits), which is incredibly useful because it is the offset from the file's beginning to the section's data. If it is 0, the section's data are not contained in the file and will be arbitrary at load time.



Then we have got 'PointerToRelocations' (32 bits) and 'PointerToLinenumbers' (also 32 bits), 'NumberOfRelocations' (16 bits) and 'NumberOfLinenumbers' (also 16 bits). All of these are information that's only used for object files. Executables have a special base relocation directory, and the line number information, if present at all, is usually contained in a special purpose debugging segment or elsewhere.

The last member of a section header is the 32 bits 'Characteristics', which is a bunch of flags describing how the section's memory should be treated:

If bit 5 (IMAGE_SCN_CNT_CODE) is set, the section contains executable code.

If bit 6 (IMAGE_SCN_CNT_INITIALIZED_DATA) is set, the section contains data that gets a defined value before execution starts. In other words: the section's data in the file is meaningful.

If bit 7 (IMAGE_SCN_CNT_UNINITIALIZED_DATA) is set, this section contains uninitialized data and will be initialized to all-0-bytes before execution starts. This is normally the BSS.

If bit 9 (IMAGE_SCN_LNK_INFO) is set, the section doesn't contain image data but comments, description or other documentation. This information is part of an object file and may be information for the linker, such as which libraries are needed.

If bit 11 (IMAGE_SCN_LNK_REMOVE) is set, the data is part of an object file's section that is supposed to be left out when the executable file is linked. Often combined with bit 9.

If bit 12 (IMAGE_SCN_LNK_COMDAT) is set, the section contains "common block data", which are packaged functions of some sort.

If bit 15 (IMAGE_SCN_MEM_FARDATA) is set, we have far data - whatever that means. This bit's meaning is unsure.

If bit 17 (IMAGE_SCN_MEM_PURGEABLE) is set, the section's data is purgeable - but I don't think that this is the same as "discardable", which has a bit of its own, see below. The same bit is apparently used to indicate 16-bit-information as there is also a define IMAGE_SCN_MEM_16BIT for it. This bit's meaning is unsure.

If bit 18 (IMAGE_SCN_MEM_LOCKED) is set, the section should not be moved in memory? Perhaps it indicates there is no relocation information? This bit's meaning is unsure.

If bit 19 (IMAGE_SCN_MEM_PRELOAD) is set, the section should be paged in before execution starts? This bit's meaning is unsure.

Bits 20 to 23 specify an alignment that I have no information about. There are #defines IMAGE_SCN_ALIGN_16BYTES and the like. The only value I've ever seen used is 0, for the default 16-byte-alignment. I suspect that this is the alignment of objects in a library file or the like.

If bit 24 (IMAGE_SCN_LNK_NRELOC_OVFL) is set, the section contains some extended relocations that I don't know about.




If bit 25 (IMAGE_SCN_MEM_DISCARDABLE) is set, the section's data is not needed after the process has started. This is the case, for example, with the relocation information. I've seen it also for startup routines of drivers and services that are only executed once, and for import directories.

If bit 26 (IMAGE_SCN_MEM_NOT_CACHED) is set, the section's data should not be cached. Don't ask me why not. Does this mean to switch off the 2nd-level-cache?

If bit 27 (IMAGE_SCN_MEM_NOT_PAGED) is set, the section's data should not be paged out. This is interesting for drivers.

If bit 28 (IMAGE_SCN_MEM_SHARED) is set, the section's data is shared among all running instances of the image. If it is e.g. the initialized data of a DLL, all running instances of the DLL will at any time have the same variable contents. Note that only the first instance's section is initialized. Sections containing code are always shared copy-on-write (i.e. the sharing doesn't work if relocations are necessary).

If bit 29 (IMAGE_SCN_MEM_EXECUTE) is set, the process gets 'execute'-access to the section's memory.

If bit 30 (IMAGE_SCN_MEM_READ) is set, the process gets 'read'-access to the section's memory.

If bit 31 (IMAGE_SCN_MEM_WRITE) is set, the process gets 'write'-access to the section's memory.
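The section header members described above add up to 40 bytes, and the flag bits can be tested one by one. The following Python sketch unpacks one header and names the flags that are set. It is an illustration only: the names `SectionHeader`, `parse_section_header` and `flag_names` are hypothetical, the flag table is a subset of the bits listed above, and the ".text" header is synthetic.

```python
import struct
from collections import namedtuple

# Field order follows the description above. The 32-bit union is called
# virtual_size here, although some linkers store an address or 0 in it.
SectionHeader = namedtuple(
    "SectionHeader",
    "name virtual_size virtual_address size_of_raw_data pointer_to_raw_data "
    "pointer_to_relocations pointer_to_linenumbers "
    "number_of_relocations number_of_linenumbers characteristics",
)
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")  # little-endian, 40 bytes

# Bit number -> name, for a subset of the flags listed in the text.
SECTION_FLAGS = {
    5: "IMAGE_SCN_CNT_CODE",
    6: "IMAGE_SCN_CNT_INITIALIZED_DATA",
    7: "IMAGE_SCN_CNT_UNINITIALIZED_DATA",
    25: "IMAGE_SCN_MEM_DISCARDABLE",
    28: "IMAGE_SCN_MEM_SHARED",
    29: "IMAGE_SCN_MEM_EXECUTE",
    30: "IMAGE_SCN_MEM_READ",
    31: "IMAGE_SCN_MEM_WRITE",
}


def parse_section_header(buf, offset=0):
    """Unpack one IMAGE_SECTION_HEADER starting at offset."""
    return SectionHeader(*SECTION_HEADER.unpack_from(buf, offset))


def flag_names(characteristics):
    """Return the names of the known flag bits set in 'Characteristics'."""
    return [name for bit, name in sorted(SECTION_FLAGS.items())
            if characteristics & (1 << bit)]


# Synthetic code section header: bits 5, 29 and 30 set.
raw = SECTION_HEADER.pack(b".text\x00\x00\x00", 0x1234, 0x1000, 0x1400,
                          0x400, 0, 0, 0, 0, 0x60000020)
text_section = parse_section_header(raw)
text_flags = flag_names(text_section.characteristics)
```

An array of 'NumberOfSections' headers can be read by calling `parse_section_header` with offsets that advance by `SECTION_HEADER.size`.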

After the section headers we find the sections themselves. They are, in the file, aligned to 'FileAlignment' bytes (that is, after the optional header and after each section's data there will be padding bytes) and ordered by their RVAs. When loaded (in RAM), the sections are aligned to 'SectionAlignment' bytes.

As an example, if the optional header ends at file offset 981 and 'FileAlignment' is 512, the first section will start at byte 1024. Note that you can find the sections via the 'PointerToRawData' or the 'VirtualAddress', so there is hardly any need to actually fuss around with the alignments.
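The rounding in this example is the usual "round up to a multiple" computation. A minimal Python sketch, with the helper name `align_up` chosen for this example:

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


# The example from the text: headers end at 981, 'FileAlignment' is 512.
first_section_offset = align_up(981, 512)
```

The same helper applies to 'SectionAlignment' when computing where a section ends in memory.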



I will try to make an image of it all:

    +-------------------+
    | DOS-stub          |
    +-------------------+
    | file-header       |
    +-------------------+
    | optional header   |
    |- - - - - - - - - -|
    |                   |----------------+
    | data directories  |                |
    |                   |                |
    |(RVAs to direc-    |-------------+  |
    |tories in sections)|             |  |
    |                   |---------+   |  |
    |                   |         |   |  |
    +-------------------+         |   |  |
    |                   |-----+   |   |  |
    | section headers   |     |   |   |  |
    | (RVAs to section  |--+  |   |   |  |
    |  borders)         |  |  |   |   |  |
    +-------------------+<-+  |   |   |  |
    |                   |     | <-+   |  |
    | section data 1    |     |       |  |
    |                   |     | <-----+  |
    +-------------------+<----+          |
    |                   |                |
    | section data 2    |                |
    |                   | <--------------+
    +-------------------+

There is one section header for each section, and each data directory will point to one of the sections (several data directories may point to the same section, and there may be sections without data directory pointing to them).




Sections' raw data

general

All sections are aligned to 'SectionAlignment' when loaded in RAM, and 'FileAlignment' in the file. The sections are described by entries in the section headers: You find the sections in the file via 'PointerToRawData' and in memory via 'VirtualAddress'; the length is in 'SizeOfRawData'.
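A tool that works on the file on disk (such as a disassembler) constantly has to turn an RVA into a file offset, and these three fields are enough to do it. The Python sketch below assumes the section headers have already been read into (VirtualAddress, SizeOfRawData, PointerToRawData) tuples; the function name and the tuple layout are choices made for this example.

```python
def rva_to_file_offset(rva, sections):
    """Translate an RVA to a file offset, or return None if not in the file.

    sections is an iterable of
    (virtual_address, size_of_raw_data, pointer_to_raw_data) tuples.
    """
    for virtual_address, size_of_raw_data, pointer_to_raw_data in sections:
        if pointer_to_raw_data == 0:
            continue  # no raw data in the file (e.g. uninitialized data)
        if virtual_address <= rva < virtual_address + size_of_raw_data:
            return pointer_to_raw_data + (rva - virtual_address)
    return None


# Three synthetic sections; the last one has no data in the file.
example_sections = [
    (0x1000, 0x400, 0x400),
    (0x2000, 0x200, 0x800),
    (0x3000, 0x1000, 0),
]
```

Returning None instead of raising keeps the caller free to decide whether an unmapped RVA is an error or simply points into zero-filled memory.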

There are several kinds of sections, depending on what's contained in them. In most cases (but not in all) there will be at least one data directory in a section, with a pointer to it in the optional header's data directory array.

code section

First, I will mention the code section. The section will have, at least, the bits 'IMAGE_SCN_CNT_CODE', 'IMAGE_SCN_MEM_EXECUTE' and 'IMAGE_SCN_MEM_READ' set, and 'AddressOfEntryPoint' will point somewhere into the section, to the start of the function that the developer wants to execute first.

'BaseOfCode' will normally point to the start of this section, but may point to somewhere later in the section if some non-code-bytes are placed before the code in the section. Normally, there will be nothing but executable code in this section, and there will be only one code section, but don't rely on this. Typical section names are ".text", ".code", "AUTO" and the like.



data section

The next thing we'll discuss is the initialized variables; this section contains initialized static variables (like "static int i = 5;"). It will have, at least, the bits 'IMAGE_SCN_CNT_INITIALIZED_DATA', 'IMAGE_SCN_MEM_READ' and 'IMAGE_SCN_MEM_WRITE' set. Some linkers may place constant data into a section of their own that doesn't have the writeable-bit. If part of the data is shareable, or there are other peculiarities, there may be more sections with the appropriate section-bits set.

The section, or sections, will be in the range 'BaseOfData' up to 'BaseOfData'+'SizeOfInitializedData'. Typical section names are '.data', '.idata', 'DATA' and so on.

bss section

Then there is the uninitialized data (for static variables like "static int k;"); this section is quite like the initialized data, but will have a file offset ('PointerToRawData') of 0 indicating its contents is not stored in the file, and 'IMAGE_SCN_CNT_UNINITIALIZED_DATA' is set instead of 'IMAGE_SCN_CNT_INITIALIZED_DATA' to indicate that the contents should be set to 0-bytes at load-time. This means, there is a section header but no section in the file; the section will be created by the loader and consist entirely of 0-bytes. The length will be 'SizeOfUninitializedData'. Typical names are '.bss', 'BSS' and the like.

These were the section data that are *not* pointed to by data directories. Their contents and structure is supplied by the compiler, not by the linker. (The stack-segment and heap-segment are not sections in the binary but created by the loader from the stacksize- and heapsize-entries in the optional header.)




copyright

To begin with a simple directory-section, let's look at the data directory 'IMAGE_DIRECTORY_ENTRY_COPYRIGHT'. The contents is a copyright- or description string in ASCII (not 0-terminated), like "Gonkulator control application, copyright (c) 1848 Hugendubel & Cie". This string is, normally, supplied to the linker with the command line or a description file. This string is not needed at runtime and may be discarded. It is not writeable; in fact, the application doesn't need access at all. So the linker will find out if there is a discardable non-writeable section already and if not, create one (named '.descr' or the like). It will then stuff the string into the section and let the copyright-directory-pointer point to the string. The 'IMAGE_SCN_CNT_INITIALIZED_DATA' bit should be set.

exported symbols

(Note that the description of the export directory was faulty in versions of this text before 1999-03-12. It didn't describe forwarders, exports by ordinal only, or exports with several names.)

The next-simplest thing is the export directory, 'IMAGE_DIRECTORY_ENTRY_EXPORT'. This is a directory typically found in DLLs; it contains the entry points of exported functions (and the addresses of exported objects etc.). Executables may of course also have exported symbols but usually they don't. The containing section should be "initialized data" and "readable". It should not be "discardable" because the process might call "GetProcAddress()" to find a function's entry point at runtime. The section is normally called '.edata' if it is a separate thing; often enough, it is merged into some other section like "initialized data".

The structure of the export table ('IMAGE_EXPORT_DIRECTORY') comprises a header and the export data, that is: the symbol names, their ordinals and the offsets to their entry points.



First, we have 32 bits of 'Characteristics' that are unused and normally 0. Then there is a 32-bit-'TimeDateStamp', which presumably should give the time the table was created in the time_t-format; alas, it is not always valid (some linkers set it to 0). Then we have 2 16-bit-words of version-info ('MajorVersion' and 'MinorVersion'), and these, too, are often enough set to 0.

The next thing is 32 bits of 'Name'; this is an RVA to the DLL name as a 0-terminated ASCII string. (The name is necessary in case the DLL file is renamed - see "binding" at the import directory.) Then, we have got a 32-bit-'Base'. We'll come to that in a moment.

The next 32-bit-value is the total number of exported items ('NumberOfFunctions'). In addition to their ordinal number, items may be exported by one or several names, and the next 32-bit-number is the total number of exported names ('NumberOfNames'). In most cases, each exported item will have exactly one corresponding name and it will be used by that name, but an item may have several associated names (it is then accessible by each of them), or it may have no name, in which case it is only accessible by its ordinal number. The use of unnamed exports (purely by ordinal) is discouraged, because all versions of the exporting DLL would have to use the same ordinal numbering, which is a maintenance problem.

The next 32-bit-value 'AddressOfFunctions' is a RVA to the list of exported items. It points to an array of 'NumberOfFunctions' 32-bit-values, each being a RVA to the exported function or variable.

There are 2 quirks about this list: First, such an exported RVA may be 0, in which case it is unused. Second, if the RVA points into the section containing the export directory, this is a forwarded export. A forwarded export is a pointer to an export in another binary; if it is used, the pointed-to export in the other binary is used instead. The RVA in this case points, as mentioned, into the export directory's section, to a zero-terminated string comprising the name of the pointed-to DLL and the export name separated by a dot, like "otherdll.exportname", or the DLL's name and the export ordinal, like "otherdll.#19".
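The two forms of forwarder string can be taken apart with a few lines of code. The Python sketch below is an illustration under assumptions: the function names are made up, the range test takes an explicit (RVA, size) region because the text only says "the section containing the export directory", and splitting at the last dot is a choice made here so that DLL names containing dots survive.

```python
def is_forwarder(export_rva, region_rva, region_size):
    """True if export_rva points into the region holding the export data."""
    return region_rva <= export_rva < region_rva + region_size


def parse_forwarder(text):
    """Split a forwarder string into (dll, name, ordinal).

    Exactly one of name and ordinal is None.
    """
    dll, dot, target = text.rpartition(".")
    if not dot or not dll or not target:
        raise ValueError("not a forwarder string: %r" % text)
    if target.startswith("#"):
        return dll, None, int(target[1:])  # forwarded by ordinal
    return dll, target, None               # forwarded by name
```

After parsing, the lookup simply continues in the export directory of the named DLL, which may itself forward again.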

Now is the time to explain the export ordinal. An export's ordinal is the index into the 'AddressOfFunctions'-array (the 0-based position in this array) plus the 'Base' mentioned above.

In most cases, the 'Base' is 1, which means the first export has an ordinal of 1, the second has an ordinal of 2 and so on.




After the 'AddressOfFunctions'-RVA we find a RVA to the array of 32-bit-RVAs to symbol names 'AddressOfNames', and a RVA to the array of 16-bit-ordinals 'AddressOfNameOrdinals'. Both arrays have 'NumberOfNames' elements. The symbol names may be missing entirely, in which case the 'AddressOfNames' is 0. Otherwise, the pointed-to arrays are running parallel, which means their elements at each index belong together. The 'AddressOfNames'-array consists of RVAs to 0-terminated export names; the names are held in a sorted list (i.e. the first array member is the RVA to the alphabetically smallest name; this allows efficient searching when looking up an exported symbol by name). According to the PE specification, the 'AddressOfNameOrdinals'-array has the ordinal corresponding to each name; however, I've found this array to contain the actual index into the 'AddressOfFunctions'-array instead.

I'll draw a picture about the three tables:

    AddressOfFunctions
        |
        |
        |
        v
    exported RVA with ordinal 'Base'
    exported RVA with ordinal 'Base'+1
    ...
    exported RVA with ordinal 'Base'+'NumberOfFunctions'-1


    AddressOfNames                      AddressOfNameOrdinals
        |                                   |
        |                                   |
        |                                   |
        v                                   v
    RVA to first name            <->    Index of export for first name
    RVA to second name           <->    Index of export for second name
    ...                                 ...
    RVA to name 'NumberOfNames'  <->    Index of export for name 'NumberOfNames'



Some examples are in order.

To find an exported symbol by ordinal, subtract the 'Base' to get the index, follow the 'AddressOfFunctions'-RVA to find the exports-array and use the index to find the exported RVA in the array. If it does not point into the export section, you are done. Otherwise, it points to a string describing the exporting DLL and the name or ordinal therein, and you have to look up the forwarded export there.

To find an exported symbol by name, follow the 'AddressOfNames'-RVA (if it is 0 there are no names) to find the array of RVAs to the export names. Search your name in the list. Use the name's index in the 'AddressOfNameOrdinals'-array and get the 16-bit-number corresponding to the found name. According to the PE spec, it is an ordinal and you need to subtract the 'Base' to get the export index; according to my experiences it is the export index and you don't subtract. Using the export index, you find the export RVA in the 'AddressOfFunctions'-array, being either the exported RVA itself or a RVA to a string describing a forwarded export.
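Both lookups can be sketched in a few lines once the three arrays have been read into memory. The Python class below is an illustration under assumptions: the class and method names are invented, the names are held as already-dereferenced byte strings rather than RVAs, the entries of 'AddressOfNameOrdinals' are treated as indexes (the author's observation, not the wording of the PE spec), and forwarder handling is left out.

```python
from bisect import bisect_left


class ExportTable:
    def __init__(self, base, functions, names, name_indexes):
        self.base = base                  # 'Base'
        self.functions = functions        # exported RVAs, 0 means unused
        self.names = names                # sorted export names (bytes)
        self.name_indexes = name_indexes  # parallel to names

    def _rva_at(self, index):
        if not 0 <= index < len(self.functions):
            return None
        return self.functions[index] or None  # an RVA of 0 is unused

    def by_ordinal(self, ordinal):
        """Ordinal minus 'Base' gives the index into the function array."""
        return self._rva_at(ordinal - self.base)

    def by_name(self, name):
        """Binary search in the sorted names, then go through the index."""
        pos = bisect_left(self.names, name)
        if pos == len(self.names) or self.names[pos] != name:
            return None
        return self._rva_at(self.name_indexes[pos])


# Three export slots, the middle one unused; two of them have names.
table = ExportTable(
    base=1,
    functions=[0x1000, 0, 0x1200],
    names=[b"Alpha", b"Gamma"],
    name_indexes=[0, 2],
)
```

With a 'Base' of 1, ordinal 3 and the name "Gamma" both lead to the same RVA, 0x1200.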

imported symbols

When the compiler finds a call to a function that is in a different executable (mostly in a DLL), it will, in the most simplistic case, not know anything about the circumstances and simply output a normal call-instruction to that symbol, the address of which the linker will have to fix, like it does for any external symbol. The linker uses an import library to look up from which DLL which symbol is imported, and produces stubs for all the imported symbols, each of which consists of a jump-instruction; the stubs are the actual call-targets. These jump-instructions will actually jump to an address that's fetched from the so-called import address table. In more sophisticated applications (when "__declspec(dllimport)" is used), the compiler knows the function is imported, and outputs a call to the address that's in the import address table, bypassing the jump.

Anyway, the address of the function in the DLL is always necessary and will be supplied by the loader from the exporting DLL's export directory when the application is loaded. The loader knows which symbols in what libraries have to be looked up and their addresses fixed by searching the import directory.




I had better give you an example. The calls with or without __declspec(dllimport) look like this:

source:

    int symbol(char *);
    __declspec(dllimport) int symbol2(char*);

    void foo(void)
    {
        int i=symbol("bar");
        int j=symbol2("baz");
    }

assembly:

    ...
    call _symbol                 ; without declspec(dllimport)
    ...
    call [__imp__symbol2]        ; with declspec(dllimport)
    ...

In the first case (without __declspec(dllimport)), the compiler didn't know that '_symbol' was in a DLL, so the linker has to provide the function '_symbol'. Since the function isn't there, it will supply a stub function for the imported symbol, being an indirect jump. The collection of all import-stubs is called the "transfer area" (also sometimes called a "trampoline", because you jump there in order to jump to somewhere else). Typically this transfer area is located in the code section (it is not part of the import directory). Each of the function stubs is a jump to the actual function in the target DLLs. The transfer area looks like this:

    _symbol:        jmp [__imp__symbol]
    _other_symbol:  jmp [__imp__other_symbol]
    ...



This means: if you use imported symbols without specifying "__declspec(dllimport)" then the linker will generate a transfer area for them, consisting of indirect jumps. If you do specify "__declspec(dllimport)", the compiler will do the indirection itself and a transfer area is not necessary. (It also means: if you import variables or other stuff you must specify "__declspec(dllimport)", because a stub with a jmp instruction is appropriate for functions only.)

In any case the address of symbol 'x' is stored at a location '__imp_x'. All these locations together comprise the so-called "import address table", which is provided to the linker by the import libraries of the various DLLs that are used. The import address table is a list of addresses like this:

    __imp__symbol:   0xdeadbeef
    __imp__symbol2:  0x40100
    __imp__symbol3:  0x300100
    ...

This import address table is a part of the import directory, and it is pointed to by the IMAGE_DIRECTORY_ENTRY_IAT directory pointer (although some linkers don't set this directory entry and it works nevertheless; apparently, the loader can resolve imports without using the directory IMAGE_DIRECTORY_ENTRY_IAT). The addresses in this table are unknown to the linker; the linker inserts dummies (RVAs to the function names; see below for more information) that are patched by the loader at load time using the export directory of the exporting DLL. The import address table, and how it is found by the loader, will be described in more detail later in this chapter.

Note that this description is C-specific; there are other application building environments that don't use import libraries. They all need to generate an import address table, though, which they use to let their programs access the imported objects and functions. C compilers tend to use import libraries because it is convenient for them - their linkers use libraries anyway. Other environments use e.g. a description file that lists the necessary DLL names and function names (like the "module definition file"), or a declaration-style list in the source.

This is how imports are used by the program's code; now we'll look at how an import directory is made up so the loader can use it.




The import directory should reside in a section that's "initialized data" and "readable". The import directory is an array of IMAGE_IMPORT_DESCRIPTORs, one for each used DLL. The list is terminated by an IMAGE_IMPORT_DESCRIPTOR that's entirely filled with 0-bytes.

An IMAGE_IMPORT_DESCRIPTOR is a struct with these members:

OriginalFirstThunk
    An RVA (32 bit) pointing to a 0-terminated array of RVAs to IMAGE_THUNK_DATAs, each describing one imported function. The array will never change.

TimeDateStamp
    A 32-bit-timestamp that has several purposes. Let's pretend that the timestamp is 0, and handle the advanced cases later.

ForwarderChain
    The 32-bit-index of the first forwarder in the list of imported functions. Forwarders are also advanced stuff; set to all-bits-1 for beginners.

Name
    A 32-bit-RVA to the name (a 0-terminated ASCII string) of the DLL.

FirstThunk
    An RVA (32 bit) to a 0-terminated array of RVAs to IMAGE_THUNK_DATAs, each describing one imported function. The array is part of the import address table and will change.

So each IMAGE_IMPORT_DESCRIPTOR in the array gives you the name of the exporting DLL and, apart from the forwarder and timestamp, it gives you 2 RVAs to arrays of IMAGE_THUNK_DATAs, using 32 bits. (The last member of each array is entirely filled with 0-bytes to mark the end.)
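Since the descriptor consists of five 32-bit members and the array ends with an all-zero entry, walking it is straightforward. The following Python sketch shows one way to do it; the names `ImportDescriptor` and `parse_import_descriptors` are made up, and the buffer is synthetic (a real tool would first translate the import directory's RVA into a file offset).

```python
import struct
from collections import namedtuple

ImportDescriptor = namedtuple(
    "ImportDescriptor",
    "original_first_thunk time_date_stamp forwarder_chain name first_thunk",
)
DESCRIPTOR = struct.Struct("<5I")  # five little-endian 32-bit members


def parse_import_descriptors(buf, offset):
    """Read descriptors until the entry that is entirely filled with 0-bytes."""
    descriptors = []
    while offset + DESCRIPTOR.size <= len(buf):
        fields = DESCRIPTOR.unpack_from(buf, offset)
        if not any(fields):
            break  # terminating all-zero descriptor
        descriptors.append(ImportDescriptor(*fields))
        offset += DESCRIPTOR.size
    return descriptors


# Two descriptors, the terminator, and some unrelated trailing bytes.
raw = (DESCRIPTOR.pack(0x3000, 0, 0xFFFFFFFF, 0x3100, 0x3200)
       + DESCRIPTOR.pack(0x3040, 0, 0xFFFFFFFF, 0x3110, 0x3240)
       + bytes(DESCRIPTOR.size)
       + b"\xAA" * DESCRIPTOR.size)
descriptors = parse_import_descriptors(raw, 0)
```

The length check in the loop also stops the walk on a truncated buffer that lacks the terminator.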



Each IMAGE_THUNK_DATA is, for now, an RVA to an IMAGE_IMPORT_BY_NAME which describes the imported function. The interesting point is now, the arrays run parallel, i.e.: they point to the same IMAGE_IMPORT_BY_NAMEs.

No need to be desperate, I will draw another picture. This is the essential contents of one IMAGE_IMPORT_DESCRIPTOR:

    OriginalFirstThunk      FirstThunk
            |                    |
            |                    |
            |                    |
            V                    V

            0-->    func1     <--0
            1-->    func2     <--1
            2-->    func3     <--2
            3-->    foo       <--3
            4-->    mumpitz   <--4
            5-->    knuff     <--5
            6-->0            0<--6      /* the last RVA is 0! */

where the names in the center are the yet to discuss IMAGE_IMPORT_BY_NAMEs. Each of them is a 16-bit-number (a hint) followed by an unspecified amount of bytes, being the 0-terminated ASCII name of the imported symbol.
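Reading one of these records takes two steps: unpack the 16-bit hint, then scan for the 0-terminator. A minimal Python sketch, with the function name `parse_import_by_name` chosen for this example and a synthetic buffer:

```python
import struct


def parse_import_by_name(buf, offset):
    """Return (hint, name) for the IMAGE_IMPORT_BY_NAME at offset."""
    (hint,) = struct.unpack_from("<H", buf, offset)
    # The name follows the hint and runs up to the first 0-byte.
    end = buf.index(b"\x00", offset + 2)
    return hint, buf[offset + 2:end].decode("ascii")


# One unrelated byte, then hint 3 and the name "foo", then trailing bytes.
record = b"\xff" + struct.pack("<H", 3) + b"foo\x00junk"
hint, name = parse_import_by_name(record, 1)
```

`bytes.index` raises ValueError when no terminator is found, which is a reasonable reaction to a damaged file.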

The hint is an index into the exporting DLL's name table (see export directory above). The name at that index is tried, and if it doesn't match then a binary search is done to find the name. (Some linkers don't bother to look up correct hints and simply specify 1 all the time, or some other arbitrary number. This does no harm, it just makes the first attempt to resolve the name always fail, enforcing a binary search for each name.)
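The "try the hint first, then fall back to a binary search" strategy can be sketched as follows. This is not the loader's actual code, only a Python illustration of the idea; the function name `find_name_index` is hypothetical and the names are assumed to be available as a sorted list of byte strings.

```python
from bisect import bisect_left


def find_name_index(sorted_names, name, hint):
    """Return the index of name in the export name table, or None."""
    # Fast path: the hint is correct.
    if 0 <= hint < len(sorted_names) and sorted_names[hint] == name:
        return hint
    # Slow path: binary search in the sorted name table.
    pos = bisect_left(sorted_names, name)
    if pos < len(sorted_names) and sorted_names[pos] == name:
        return pos
    return None


export_names = [b"foo", b"knuff", b"mumpitz"]
```

A wrong or out-of-range hint only costs the extra comparison; the result is the same.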

To summarize, if you want to look up information about the imported function "foo" from DLL "knurr", you first find the entry IMAGE_DIRECTORY_ENTRY_IMPORT in the data directories, get an RVA, find that address in the raw section data and now have an array of IMAGE_IMPORT_DESCRIPTORs. Get the member of this array that relates to the DLL "knurr" by inspecting the strings pointed to by the 'Name's.




When you have found the right IMAGE_IMPORT_DESCRIPTOR, follow its 'OriginalFirstThunk' and get hold of the pointed-to array of IMAGE_THUNK_DATAs; inspect the RVAs and find the function "foo".

Ok, now, why do we have *two* lists of pointers to the IMAGE_IMPORT_BY_NAMEs? Because at runtime the application doesn't need the imported functions' names but the addresses. This is where the import address table comes in again. The loader will look up each imported symbol in the export-directory of the DLL in question and replace the IMAGE_THUNK_DATA-element in the 'FirstThunk'-list (which until now also points to the IMAGE_IMPORT_BY_NAME) with the linear address of the DLL's entry point. Remember the list of addresses with labels like "__imp__symbol"; the import address table, pointed to by the data directory IMAGE_DIRECTORY_ENTRY_IAT, is exactly the list pointed to by 'FirstThunk'. (In case of imports from several DLLs, the import address table comprises the 'FirstThunk'-arrays of all the DLLs. The directory entry IMAGE_DIRECTORY_ENTRY_IAT may be missing, the imports will still work fine.) The 'OriginalFirstThunk'-array remains untouched, so you can always look up the original list of imported names via the 'OriginalFirstThunk'-list.

The import is now patched with the correct linear addresses and looks like this:

    OriginalFirstThunk      FirstThunk
            |                   |
            |                   |
            |                   |
            V                   V

        0--> func1          0--> exported func1

        3--> foo            3--> exported foo
        4--> mumpitz        4--> exported mumpitz
        5--> knuff          5--> exported knuff
        6-->0               0<--6



This was the basic structure, for simple cases. Now we'll learn about tweaks in the import directories.

First, the bit IMAGE_ORDINAL_FLAG (that is: the MSB) of the IMAGE_THUNK_DATA in the arrays can be set, in which case there is no symbol-name-information in the list and the symbol is imported purely by ordinal. You get the ordinal by inspecting the lower word of the IMAGE_THUNK_DATA. The import by ordinals is discouraged; it is much safer to import by name, because the export ordinals might change if the exporting DLL is not in the expected version.
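Decoding a single 32-bit IMAGE_THUNK_DATA along these lines could look as follows. The helper names and the macro `THUNK_ORDINAL_FLAG` are invented for this sketch; the macro merely stands in for the MSB described above.

```c
#include <stdint.h>

/* Stand-in for the MSB of a 32-bit thunk (assumption for this sketch). */
#define THUNK_ORDINAL_FLAG 0x80000000u

/* Nonzero if the symbol is imported purely by ordinal. */
int thunk_is_ordinal(uint32_t thunk)
{
    return (thunk & THUNK_ORDINAL_FLAG) != 0;
}

/* The ordinal lives in the lower word of the thunk. */
uint16_t thunk_ordinal(uint32_t thunk)
{
    return (uint16_t)(thunk & 0xffffu);
}

/* Without the flag, the whole value is the RVA of an IMAGE_IMPORT_BY_NAME. */
uint32_t thunk_name_rva(uint32_t thunk)
{
    return thunk;
}
```

A caller would test `thunk_is_ordinal()` first and only then pick one of the other two helpers.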

Second, there are the so-called "bound imports".

Think about the loader's task: when a binary that it wants to execute needs a function from a DLL, the loader loads the DLL, finds its export directory, looks up the function's RVA and calculates the function's entry point. Then it patches the so-found address into the 'FirstThunk'-list. Given that the programmer was clever and supplied unique preferred load addresses for the DLLs that don't clash, we can assume that the functions' entry points will always be the same. They can be computed and patched into the 'FirstThunk'-list at link-time, and that's what happens with the "bound imports". (The utility "bind" does this; it is part of the Win32 SDK.)

Of course, one must be cautious: The user's DLL may have a different version, or it may be necessary to relocate the DLL, thus invalidating the pre-patched 'FirstThunk'-list; in this case, the loader will still be able to walk the 'OriginalFirstThunk'-list, find the imported symbols and re-patch the 'FirstThunk'-list. The loader knows that this is necessary if a) the versions of the exporting DLL don't match or b) the exporting DLL had to be relocated.

To decide whether there were relocations is no problem for the loader, but how to find out if the versions differ? This is where the 'TimeDateStamp' of the IMAGE_IMPORT_DESCRIPTOR comes in. If it is 0, the import-list has not been bound, and the loader must always fix the entry points. Otherwise, the imports are bound, and 'TimeDateStamp' must match the 'TimeDateStamp' of the exporting DLL's 'FileHeader'; if it doesn't match, the loader assumes that the binary is bound to a "wrong" DLL and will re-patch the import list.
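The decision just described fits into one small function. This is a sketch of the rule as stated in the text, not of any real loader; the enum and the function name are made up, and it covers only this (old-style) scheme where the stamp is either 0 or a real timestamp.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum bind_state {
    BIND_NONE,   /* not bound: entry points must always be fixed */
    BIND_VALID,  /* bound and usable as-is                       */
    BIND_STALE   /* bound, but must be re-patched                */
};

/* descriptor_stamp: 'TimeDateStamp' of the IMAGE_IMPORT_DESCRIPTOR
   dll_stamp:        'TimeDateStamp' of the exporting DLL's file header
   dll_relocated:    nonzero if the DLL is not at its preferred address */
enum bind_state classify_binding(uint32_t descriptor_stamp,
                                 uint32_t dll_stamp,
                                 int dll_relocated)
{
    if (descriptor_stamp == 0)
        return BIND_NONE;
    if (dll_relocated || descriptor_stamp != dll_stamp)
        return BIND_STALE;
    return BIND_VALID;
}
```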

There is an additional quirk about "forwarders" in the import-list. A DLL can export a symbol that's not defined in the DLL but imported from another DLL; such a symbol is said to be forwarded (see the export directory description above).

Now, obviously you can't tell if the symbol's entry point is valid by looking into the timestamp of a DLL that doesn't actually contain the entry point. So the forwarded symbols' entry points must always be fixed up, for safety reasons. In the import list of a binary, imports of forwarded symbols need to be found so the loader can patch them.

This is done via the 'ForwarderChain'. It is an index into the thunk-lists; the import at the indexed position is a forwarded export, and the contents of the 'FirstThunk'-list at this position is the index of the *next* forwarded import, and so on, until the index is "-1" which indicates there are no more forwards. If there are no forwarders at all, 'ForwarderChain' is -1 itself.
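Walking such a chain is a plain linked-list traversal through the 'FirstThunk'-array. The following sketch assumes the array has already been read into memory as 32-bit values; the function name, the output buffer and the sanity checks against malformed chains are additions of this example.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_MORE_FORWARDERS 0xffffffffu  /* "-1" as a 32-bit value */

/* Collect the indices of all forwarded imports, following the chain
   that starts at `forwarder_chain`. Returns the number of indices
   stored, or -1 if the chain is malformed (index out of range,
   output buffer too small, or a cycle). */
long walk_forwarder_chain(const uint32_t *first_thunk, size_t thunk_count,
                          uint32_t forwarder_chain,
                          uint32_t *indices_out, size_t max_out)
{
    size_t found = 0;
    uint32_t index = forwarder_chain;

    while (index != NO_MORE_FORWARDERS) {
        if (index >= thunk_count || found >= max_out || found >= thunk_count)
            return -1;
        indices_out[found++] = index;
        index = first_thunk[index];  /* slot holds the next forwarded index */
    }
    return (long)found;
}
```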

This was the so-called "old-style" binding.

At this point, we should sum up what we have had so far :-)

Ok, I will assume you have found the IMAGE_DIRECTORY_ENTRY_IMPORT and you have followed it to find the import-directory, which will be in one of the sections. Now you're at the beginning of an array of IMAGE_IMPORT_DESCRIPTORs, the last of which will be entirely filled with 0-bytes.

To decipher one of the IMAGE_IMPORT_DESCRIPTORs, you first look into the 'Name'-field, follow the RVA and thus find the name of the exporting DLL. Next you decide whether the imports are bound or not; 'TimeDateStamp' will be non-zero if the imports are bound. If they are bound, now is a good time to check if the DLL version matches yours by comparing the 'TimeDateStamp's. Now you follow the 'OriginalFirstThunk'-RVA to go to the IMAGE_THUNK_DATA-array; walk down this array (it is 0-terminated), and each member will be the RVA of an IMAGE_IMPORT_BY_NAME (unless the hi-bit is set, in which case you don't have a name but are left with a mere ordinal). Follow the RVA, and skip 2 bytes (the hint), and now you have got a 0-terminated ASCII-string that's the name of the imported function.

To find the supplied entry point addresses in case it is a bound import, follow the 'FirstThunk' and walk it parallel to the 'OriginalFirstThunk'-array; the array-members are the linear addresses of the entry points (leaving aside the forwarders-topic for a moment).



There is one thing I didn't mention until now: Apparently there are linkers that exhibit a bug when they build the import directory (I've found this bug in a Borland C linker). These linkers set the 'OriginalFirstThunk' in the IMAGE_IMPORT_DESCRIPTOR to 0 and create only the 'FirstThunk'-array. Obviously, such import directories cannot be bound (else the information necessary to re-fix the imports would be lost - you couldn't find the function names). In this case, you will have to follow the 'FirstThunk'-array to get the imported symbol names, and you will never have pre-patched entry point addresses. I have found a TIS document ([6]) describing the import directory in a way that is compatible with this bug, so that paper may be the origin of the bug.

The TIS document specifies:

    IMPORT FLAGS
    TIME/DATE STAMP
    MAJOR VERSION - MINOR VERSION
    NAME RVA
    IMPORT LOOKUP TABLE RVA
    IMPORT ADDRESS TABLE RVA

as opposed to the structure used elsewhere:

    OriginalFirstThunk
    TimeDateStamp
    ForwarderChain
    Name
    FirstThunk

The last tweak about the import directories is the so-called "new style" binding (it is described in [3]), which can also be done with the "bind"-utility. When this is used, the 'TimeDateStamp' is set to all-bits-1 and there is no forwarderchain; all imported symbols get their address patched, whether they are forwarded or not. Still, you need to know the DLLs' version, and you need to distinguish forwarded symbols from ordinary ones. For this purpose, the IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT directory is created. This will, as far as I could find out, *not* be in a section but in the header, after the section headers and before the first section. (Hey, I didn't invent this, I'm only describing it!) This directory tells you, for each used DLL, from which other DLLs there are forwarded exports.

The structure is an IMAGE_BOUND_IMPORT_DESCRIPTOR, comprising (in this order):

    a 32-bit number, giving you the 'TimeDateStamp' of the DLL;
    a 16-bit number 'OffsetModuleName', being the offset from the beginning of the directory to the 0-terminated name of the DLL;
    a 16-bit number 'NumberOfModuleForwarderRefs', giving you the number of DLLs that this DLL uses for its forwarders.

Immediately following this struct you find 'NumberOfModuleForwarderRefs' structs that tell you the names and versions of the DLLs that this DLL forwards from. These structs are 'IMAGE_BOUND_FORWARDER_REF's: a 32-bit number 'TimeDateStamp'; a 16-bit number 'OffsetModuleName', being the offset from the beginning of the directory to the 0-terminated name of the forwarded-from DLL; 16 unused bits.

Following the 'IMAGE_BOUND_FORWARDER_REF's is the next 'IMAGE_BOUND_IMPORT_DESCRIPTOR' and so on; the list is terminated by an all-0-bits IMAGE_BOUND_IMPORT_DESCRIPTOR.

Sorry for the inconvenience, but that's what it looks like :-)

Now, if you have a new-bound import directory, you load all the DLLs, use the directory pointer IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT to find the IMAGE_BOUND_IMPORT_DESCRIPTOR, scan through it and check if the 'TimeDateStamp's of the loaded DLLs match the ones given in this directory. If not, fix them in the 'FirstThunk'-array of the import directory.
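Scanning the bound import directory means stepping over 8-byte records: each descriptor is followed by as many 8-byte forwarder refs as its 'NumberOfModuleForwarderRefs' says. The sketch below only counts the descriptors; it reads the little-endian fields by hand from a byte buffer, and the function and helper names are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count the IMAGE_BOUND_IMPORT_DESCRIPTORs in a bound import directory.
   Returns -1 if the all-0 terminator is not found inside the buffer. */
long count_bound_descriptors(const uint8_t *dir, size_t size)
{
    size_t pos = 0;
    long count = 0;

    while (pos + 8 <= size) {
        uint32_t stamp    = rd32(dir + pos);      /* TimeDateStamp               */
        uint16_t name_off = rd16(dir + pos + 4);  /* OffsetModuleName            */
        uint16_t refs     = rd16(dir + pos + 6);  /* NumberOfModuleForwarderRefs */

        if (stamp == 0 && name_off == 0 && refs == 0)
            return count;                         /* terminator reached */
        count++;
        pos += 8 + (size_t)refs * 8;              /* skip descriptor and its refs */
    }
    return -1;
}
```

A real scan would compare each stamp against the loaded DLL at the point where this sketch just increments the counter.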



resources

The resources, such as dialog boxes, menus, icons and so on, are stored in the data directory pointed to by IMAGE_DIRECTORY_ENTRY_RESOURCE. It is in a section that has, at least, the bits 'IMAGE_SCN_CNT_INITIALIZED_DATA' and 'IMAGE_SCN_MEM_READ' set.

A resource base is an 'IMAGE_RESOURCE_DIRECTORY'; it contains several 'IMAGE_RESOURCE_DIRECTORY_ENTRY's each of which in turn may point to an 'IMAGE_RESOURCE_DIRECTORY'. This way, you get a tree of 'IMAGE_RESOURCE_DIRECTORY's with 'IMAGE_RESOURCE_DIRECTORY_ENTRY's as leaves; these leaves point to the actual resource data.

In real life, the situation is somewhat relaxed. Normally you won't find convoluted trees you can't possibly sort out. The hierarchy is, normally, like this: one directory is the root. It points to directories, one for each resource type. These directories point to subdirectories, each of which will have a name or an ID and point to a directory of the languages provided for this resource; for each language you will find one resource entry, which will finally point to the data. (Note that multi-language-resources don't work on Win95, which always uses the same resource if it is available in several languages - I didn't check which one, but I guess it's the first it encounters. They do work on NT.)

The tree, without the pointer to the data, may look like this:

                           (root)
                              |
             +----------------+------------------+
             |                |                  |
            menu            dialog              icon
             |                |                  |
       +-----+-----+        +-+-------+        +-+----+------+
       |           |        |         |        |      |      |
    "main"      "popup"   0x10    "maindlg"  0x100  0x110  0x120
       |           |        |         |        |      |      |
   +---+-+         |        |         |        |      |      |
   |     |      default  english   default    def.   def.   def.
german english



An IMAGE_RESOURCE_DIRECTORY comprises:

    32 bits of unused flags called 'Characteristics';
    32 bits 'TimeDateStamp' (again in the common time_t representation), giving you the time the resource was created (if the entry is set);
    16 bits 'MajorVersion' and 16 bits 'MinorVersion', thus allowing you to maintain several versions of the resource;
    16 bits 'NumberOfNamedEntries' and another 16 bits 'NumberOfIdEntries'.

Immediately following such a structure are 'NumberOfNamedEntries'+'NumberOfIdEntries' structs which are of the format 'IMAGE_RESOURCE_DIRECTORY_ENTRY', those with the names coming first. They may point to further 'IMAGE_RESOURCE_DIRECTORY's or they point to the actual resource data.

An IMAGE_RESOURCE_DIRECTORY_ENTRY consists of:

    32 bits giving you the id of the resource or the directory it describes;
    32 bits offset to the data or offset to the next sub-directory.

The meaning of the id depends on the level in the tree; the id may be a number (if the hi-bit is clear) or a name (if the hi-bit is set). If it is a name, the lower 31 bits are the offset from the beginning of the resource section's raw data to the name (the name consists of 16 bits length and trailing wide characters, in unicode, not 0-terminated).



If you are in the root-directory, the id, if it is a number, is the resource-type:

1: cursor

2: bitmap

3: icon

4: menu

5: dialog

6: string table

7: font directory

8: font

9: accelerators

10: unformatted resource data

11: message table

12: group cursor

14: group icon

16: version information

Any other number is user-defined. Any resource-type with a type-name is always user-defined.

If you are one level deeper, the id is the resource-id (or resource-name).

If you are another level deeper, the id must be a number, and it is the language-id of the specific instance of the resource; for example, you can have the same dialog in Australian English, Canadian French and Swiss German localized forms, and they all share the same resource-id. The system will choose the dialog to load based on the thread's locale, which in turn will usually reflect the user's "regional setting". (If the resource cannot be found for the thread locale, the system will first try to find a resource for the locale using a neutral sublanguage, e.g. it will look for standard French instead of the user's Canadian French; if it still can't be found, the instance with the smallest language id will be used. As noted, all this works only on NT.) To decipher the language id, split it into the primary language id and the sublanguage id using the macros PRIMARYLANGID() and SUBLANGID(), giving you the bits 0 to 9 or 10 to 15, respectively. The values are defined in the file "winresrc.h". Language-resources are only supported for accelerators, dialogs, menus, rcdata or stringtables; other resource-types should be LANG_NEUTRAL/SUBLANG_NEUTRAL.
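If you don't have the Windows headers at hand, the bit split is easy to do yourself. The two functions below are stand-ins written for this example (they are not the SDK macros), following the bit ranges given above.

```c
#include <stdint.h>

/* Bits 0 to 9: primary language id. */
uint16_t primary_lang_id(uint16_t lang_id)
{
    return (uint16_t)(lang_id & 0x3ffu);
}

/* Bits 10 to 15: sublanguage id. */
uint16_t sub_lang_id(uint16_t lang_id)
{
    return (uint16_t)(lang_id >> 10);
}
```

For instance, the language id 0x0c0c splits into primary language 0x0c and sublanguage 0x03.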



To find out whether the next level below a resource directory is another directory, you inspect the hi-bit of the offset. If it is set, the remaining 31 bits are the offset from the beginning of the resource section's raw data to the next directory, again in the format IMAGE_RESOURCE_DIRECTORY with trailing IMAGE_RESOURCE_DIRECTORY_ENTRYs.

If the bit is clear, the offset is the distance from the beginning of the resource section's raw data to the resource's raw data description, an IMAGE_RESOURCE_DATA_ENTRY. It consists of 32 bits 'OffsetToData' (the offset to the raw data, counting from the beginning of the resource section's raw data), 32 bits of 'Size' of the data, 32 bits 'CodePage' and 32 unused bits. (The use of codepages is discouraged, you should use the 'language'-feature to support multiple locales.)
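Both 32-bit fields of an IMAGE_RESOURCE_DIRECTORY_ENTRY use the same trick, a flag in the hi-bit and a value in the lower 31 bits. A small decoder might look like this; the struct `res_entry_info` and the function name are invented for the sketch.

```c
#include <stdint.h>

#define RES_HIGH_BIT 0x80000000u

/* Hypothetical decoded form of one directory entry. */
struct res_entry_info {
    int      id_is_name;         /* 1: id field is an offset to a name */
    uint32_t id_or_name_offset;  /* numeric id, or offset to the name  */
    int      is_directory;       /* 1: target is another directory     */
    uint32_t target_offset;      /* offset from start of resource data */
};

struct res_entry_info decode_res_entry(uint32_t id_field,
                                       uint32_t offset_field)
{
    struct res_entry_info info;

    info.id_is_name        = (id_field & RES_HIGH_BIT) != 0;
    info.id_or_name_offset = id_field & ~RES_HIGH_BIT;
    info.is_directory      = (offset_field & RES_HIGH_BIT) != 0;
    info.target_offset     = offset_field & ~RES_HIGH_BIT;
    return info;
}
```

Walking the whole tree is then a recursion that calls this for each entry and descends whenever `is_directory` is set.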

The raw data format depends on the resource type; descriptions can be found in the MS SDK documentation. Note that any string in resources is always in UNICODE except for user defined resources, which are in the format the developer chooses, obviously.



relocations

The last data directory I will describe is the base relocation directory. It is pointed to by the IMAGE_DIRECTORY_ENTRY_BASERELOC entry in the data directories of the optional header. It is typically contained in a section of its own, with a name like ".reloc" and the bits IMAGE_SCN_CNT_INITIALIZED_DATA, IMAGE_SCN_MEM_DISCARDABLE and IMAGE_SCN_MEM_READ set.

The relocation data is needed by the loader if the image cannot be loaded to the preferred load address 'ImageBase' mentioned in the optional header. In this case, the fixed addresses supplied by the linker are no longer valid, and the loader has to apply fixups for absolute addresses used for locations of static variables, string literals and so on.

The relocation directory is a sequence of chunks. Each chunk contains the relocation information for 4 KB of the image. A chunk starts with an 'IMAGE_BASE_RELOCATION' struct. It consists of 32 bits 'VirtualAddress' and 32 bits 'SizeOfBlock'. It is followed by the chunk's actual relocation data, being 16 bits each.

The 'VirtualAddress' is the base RVA that the relocations of this chunk need to be applied to; the 'SizeOfBlock' is the size of the entire chunk in bytes.

The number of trailing relocations is ('SizeOfBlock'-sizeof(IMAGE_BASE_RELOCATION))/2. The relocation information ends when you encounter an IMAGE_BASE_RELOCATION struct with a 'VirtualAddress' of 0.

Each 16-bit-relocation information consists of the relocation position in the lower 12 bits and a relocation type in the high 4 bits. To get the relocation RVA, you need to add the IMAGE_BASE_RELOCATION's 'VirtualAddress' to the 12-bit-position. The type is one of:

IMAGE_REL_BASED_ABSOLUTE (0)

    This is a no-op; it is used to align the chunk to a 32-bits-border. The position should be 0.

IMAGE_REL_BASED_HIGH (1)

    The high 16 bits of the relocation must be applied to the 16 bits of the WORD pointed to by the offset, which is the high word of a 32-bit-DWORD.

IMAGE_REL_BASED_LOW (2)

    The low 16 bits of the relocation must be applied to the 16 bits of the WORD pointed to by the offset, which is the low word of a 32-bit-DWORD.

IMAGE_REL_BASED_HIGHLOW (3)

    The entire 32-bit-relocation must be applied to the entire 32 bits in question. This (and the no-op '0') is the only relocation type I've actually found in binaries.

IMAGE_REL_BASED_HIGHADJ (4)

    This is one for the tough. Read yourself (from [6]) and make sense out of it if you can:

    "Highadjust. This fixup requires a full 32-bit value. The high 16-bits is located at Offset, and the low 16-bits is located in the next Offset array element (this array element is included in the Size field). The two need to be combined into a signed variable. Add the 32-bit delta. Then add 0x8000 and store the high 16-bits of the signed variable to the 16-bit field at Offset."

IMAGE_REL_BASED_MIPS_JMPADDR (5)

    Unknown

IMAGE_REL_BASED_SECTION (6)

    Unknown

IMAGE_REL_BASED_REL32 (7)

    Unknown



As an example, if you find the relocation information to be

    0x00004000      (32 bits, starting RVA)
    0x00000010      (32 bits, size of chunk)
    0x3012          (16 bits reloc data)
    0x3080          (16 bits reloc data)
    0x30f6          (16 bits reloc data)
    0x0000          (16 bits reloc data)
    0x00000000      (next chunk's RVA)
    0xff341234

you know the first chunk describes relocations starting at RVA 0x4000 and is 16 bytes long. Because the header uses 8 bytes and one relocation uses 2 bytes, there are (16-8)/2=4 relocations in the chunk. The first relocation is to be applied to the DWORD at 0x4012, the next to the DWORD at 0x4080, and the third to the DWORD at 0x40f6. The last relocation is a no-op. The next chunk has an RVA of 0 and finishes the list.
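The arithmetic of this example can be checked with a few lines of C. The sketch decodes the 16-bit items of one chunk into type and RVA; the struct `reloc_item` and the function name are made up here, and the chunk header fields are passed in as plain numbers rather than read from a file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of one 16-bit relocation item. */
struct reloc_item {
    unsigned type;  /* high 4 bits                           */
    uint32_t rva;   /* chunk 'VirtualAddress' + low 12 bits  */
};

/* Decode the items of one chunk. Returns the number of items written
   to `out` (at most max_out). The 8 is the size of the chunk header. */
size_t decode_reloc_chunk(uint32_t virtual_address, uint32_t size_of_block,
                          const uint16_t *entries,
                          struct reloc_item *out, size_t max_out)
{
    size_t count, i;

    if (size_of_block < 8)
        return 0;
    count = (size_of_block - 8) / 2;
    if (count > max_out)
        count = max_out;

    for (i = 0; i < count; i++) {
        out[i].type = (unsigned)(entries[i] >> 12);
        out[i].rva  = virtual_address + (entries[i] & 0x0fffu);
    }
    return count;
}
```

Fed with the chunk from the example (0x4000, 0x10 and the four 16-bit items), it yields three type-3 relocations at 0x4012, 0x4080 and 0x40f6 plus one type-0 no-op.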

Now, how do you do a relocation? You know that the image *is* relocated to the preferred load address 'ImageBase' in the optional header; you also know the address you did load the image to. If they match, you don't need to do anything. If they don't match, you calculate the difference actual_base-preferred_base and add that value (signed, it may be negative) to the values stored at the relocation positions, which you will find with the method described above.
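For the common type IMAGE_REL_BASED_HIGHLOW, the fixup amounts to a 32-bit add on the DWORD at the relocation position. Here is a sketch under the assumption that the image sits in a byte buffer indexed by RVA; the function name is invented, and the DWORD is read and written byte by byte so the code does not depend on alignment or host byte order.

```c
#include <stdint.h>

/* Apply one HIGHLOW fixup to the DWORD at `rva` inside `image`. */
void apply_highlow(uint8_t *image, uint32_t rva,
                   uint32_t preferred_base, uint32_t actual_base)
{
    uint8_t *p = image + rva;
    /* Unsigned wrap-around gives the right result for negative deltas. */
    uint32_t delta = actual_base - preferred_base;
    uint32_t value = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                     ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    value += delta;

    p[0] = (uint8_t)(value & 0xff);
    p[1] = (uint8_t)((value >> 8) & 0xff);
    p[2] = (uint8_t)((value >> 16) & 0xff);
    p[3] = (uint8_t)((value >> 24) & 0xff);
}
```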




Acknowledgments

Thanks go to David Binette for his debugging and proof-reading. (The remaining errors are entirely mine.) Also thanks to wotsit.org for letting me put the file on their site.

Copyright

This text is copyright 1999 by B. Luevelsmeyer. It is freeware, and you may use it for any purpose but at your own risk. It contains errors and it is incomplete. You have been warned.

Bug reports

Send any bug reports (or other comments) to [email protected]



Literature

[1] "Peering Inside the PE: A Tour of the Win32 Portable Executable File Format" (M. Pietrek), in: Microsoft Systems Journal 3/1994

[2] "Why to Use _declspec(dllimport) & _declspec(dllexport) In Code", MS Knowledge Base Q132044

[3] "Windows Q&A" (M. Pietrek), in: Microsoft Systems Journal 8/1995

[4] "Writing Multiple-Language Resources", MS Knowledge Base Q89866

[5] "The Portable Executable File Format from Top to Bottom" (Randy Kath), in: Microsoft Developer Network

[6] Tool Interface Standard (TIS) Formats Specification for Windows Version 1.0 (Intel Order Number 241597, Intel Corporation 1993)




Appendix: hello world

In this appendix I will show how to make programs by hand. The example will use Intel-assembly, because I don't speak DEC Alpha.

The program will be the equivalent of

    #include <stdio.h>

    int main(void)
    {
        puts("hello, world");
        return 0;
    }

First, I translate it to use Win32 functions instead of the C runtime:

    #define STD_OUTPUT_HANDLE -11UL
    #define hello "hello, world\n"

    __declspec(dllimport) unsigned long __stdcall
    GetStdHandle(unsigned long hdl);

    __declspec(dllimport) unsigned long __stdcall
    WriteConsoleA(unsigned long hConsoleOutput,
                  const void *buffer,
                  unsigned long chrs,
                  unsigned long *written,
                  unsigned long unused);

    static unsigned long written;

    void startup(void)
    {
        WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), hello, sizeof(hello)-1, &written, 0);
        return;
    }



Now I will fumble out the assembly:

startup:

; parameters for WriteConsole(), backwards

6A 00 push 0x00000000

68 ?? ?? ?? ?? push offset _written

6A 0D push 0x0000000d

68 ?? ?? ?? ?? push offset hello

; parameter for GetStdHandle()

6A F5 push 0xfffffff5

2E FF 15 ?? ?? ?? ?? call dword ptr cs:__imp__GetStdHandle@4

; result is last parameter for WriteConsole()

50 push eax

2E FF 15 ?? ?? ?? ?? call dword ptr cs:__imp__WriteConsoleA@20

C3 ret

hello:

68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A "hello, world\n"

_written:

00 00 00 00

That was the compiler part. Anyone can do that. From now on we play linker, which is much more interesting :-)

I need to find the functions WriteConsoleA() and GetStdHandle(). They happen to be in "kernel32.dll". (That was the 'import library' part.)

Now I can start to make the executable. Question marks will take the place of yet-to-find-out values; they will be patched afterwards.



First the DOS-stub, starting at 0x0 and being 0x40 bytes long:

00 | 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00

10 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

20 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

30 | 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00

As you can see, this isn't really an MS-DOS program. It's just the header with the signature "MZ" at the beginning and the e_lfanew pointing immediately after the header, without any code. That's because it isn't intended to run on MS-DOS; it's just here because the specification requires it.

Then the PE signature, starting at 0x40 and being 0x4 bytes long:

50 45 00 00
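Going the other way, a reader of such a file finds the PE signature by checking for "MZ" and following e_lfanew, which is the 32-bit value at offset 0x3c of the DOS header. A minimal sketch, with a function name made up for this example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the file offset of the "PE\0\0" signature, or -1 if the
   buffer does not look like a PE file. */
long find_pe_signature(const uint8_t *file, size_t size)
{
    uint32_t e_lfanew;

    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return -1;

    /* e_lfanew is stored little-endian at offset 0x3c. */
    e_lfanew = (uint32_t)file[0x3c] | ((uint32_t)file[0x3d] << 8) |
               ((uint32_t)file[0x3e] << 16) | ((uint32_t)file[0x3f] << 24);

    if (e_lfanew > size || size - e_lfanew < 4)
        return -1;
    if (memcmp(file + e_lfanew, "PE\0\0", 4) != 0)
        return -1;
    return (long)e_lfanew;
}
```

For the stub above, e_lfanew is 0x40, so the signature is expected right behind the 0x40-byte header.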

Now the file-header, which will start at byte 0x44 and is 0x14 bytes long:

Machine 4c 01 ; i386

NumberOfSections 02 00 ; code and data

TimeDateStamp 00 00 00 00 ; who cares?

PointerToSymbolTable 00 00 00 00 ; unused

NumberOfSymbols 00 00 00 00 ; unused

SizeOfOptionalHeader e0 00 ; constant

Characteristics 02 01 ; executable on 32-bit-machine



And the optional header, which will start at byte 0x58 and is 0x60 bytes long:

Magic 0b 01 ; constant

MajorLinkerVersion 00 ; I'm version 0.0 :-)

MinorLinkerVersion 00 ;

SizeOfCode 20 00 00 00 ; 32 bytes of code

SizeOfInitializedData ?? ?? ?? ?? ; yet to find out

SizeOfUninitializedData 00 00 00 00 ; we don't have a BSS

AddressOfEntryPoint ?? ?? ?? ?? ; yet to find out

BaseOfCode ?? ?? ?? ?? ; yet to find out

BaseOfData ?? ?? ?? ?? ; yet to find out

ImageBase 00 00 10 00 ; 1 MB, chosen arbitrarily

SectionAlignment 20 00 00 00 ; 32-bytes-alignment

FileAlignment 20 00 00 00 ; 32-bytes-alignment

MajorOperatingSystemVersion 04 00 ; NT 4.0

MinorOperatingSystemVersion 00 00 ;

MajorImageVersion 00 00 ; version 0.0

MinorImageVersion 00 00 ;

MajorSubsystemVersion 04 00 ; Win32 4.0

MinorSubsystemVersion 00 00 ;

Win32VersionValue 00 00 00 00 ; unused?

SizeOfImage ?? ?? ?? ?? ; yet to find out

SizeOfHeaders ?? ?? ?? ?? ; yet to find out

CheckSum 00 00 00 00 ; not used for non-drivers

Subsystem 03 00 ; Win32 console

DllCharacteristics 00 00 ; unused (not a DLL)

SizeOfStackReserve 00 00 10 00 ; 1 MB stack

SizeOfStackCommit 00 10 00 00 ; 4 KB to start with

SizeOfHeapReserve 00 00 10 00 ; 1 MB heap

SizeOfHeapCommit 00 10 00 00 ; 4 KB to start with

LoaderFlags 00 00 00 00 ; unknown

NumberOfRvaAndSizes 10 00 00 00 ; constant



As you can see, I plan to have only 2 sections, one for code and one for all the rest (data, constants and import directory). There will be no relocations and no other stuff like resources. Also I won't have a BSS segment and stuff the variable 'written' into the initialized data. The section alignment is the same in the file and in RAM (32 bytes); this helps to keep the task easy, otherwise I'd have to calculate RVAs back and forth too much.

Now we set up the data directories, beginning at byte 0xb8 and being 0x80 bytes long:

Address Size

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXPORT (0)

?? ?? ?? ?? ?? ?? ?? ?? ; IMAGE_DIRECTORY_ENTRY_IMPORT (1)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_RESOURCE (2)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_SECURITY (4)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BASERELOC (5)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_DEBUG (6)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_TLS (9)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IAT (12)

00 00 00 00 00 00 00 00 ; 13

00 00 00 00 00 00 00 00 ; 14

00 00 00 00 00 00 00 00 ; 15

Only the import directory is in use.



Next are the section headers. First we make the code section, which will contain the above mentioned assembly. It is 32 bytes long, and so will be the code section. The header begins at 0x138 and is 0x28 bytes long:

Name 2e 63 6f 64 65 00 00 00 ; ".code"

VirtualSize 00 00 00 00 ; unused

VirtualAddress ?? ?? ?? ?? ; yet to find out

SizeOfRawData 20 00 00 00 ; size of code

PointerToRawData ?? ?? ?? ?? ; yet to find out

PointerToRelocations 00 00 00 00 ; unused

PointerToLinenumbers 00 00 00 00 ; unused

NumberOfRelocations 00 00 ; unused

NumberOfLinenumbers 00 00 ; unused

Characteristics 20 00 00 60 ; code, executable, readable

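The second section will contain the data. The header begins at 0x160 and is 0x28 bytes long:

Name 2e 64 61 74 61 00 00 00 ; ".data"

VirtualSize 00 00 00 00 ; unused

VirtualAddress ?? ?? ?? ?? ; yet to find out

SizeOfRawData ?? ?? ?? ?? ; yet to find out

PointerToRawData ?? ?? ?? ?? ; yet to find out

PointerToRelocations 00 00 00 00 ; unused

PointerToLinenumbers 00 00 00 00 ; unused

NumberOfRelocations 00 00 ; unused

NumberOfLinenumbers 00 00 ; unused

Characteristics 40 00 00 c0 ; initialized, readable, writeable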



The next byte is 0x188, but the sections need to be aligned to 32 bytes (because I chose so), so we need padding bytes up to 0x1a0:

00 00 00 00 00 00 ; padding

00 00 00 00 00 00

00 00 00 00 00 00

00 00 00 00 00 00
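The padding amounts used in this appendix all come from rounding an offset up to the next multiple of the alignment. As a sketch (the helper name `align_up` is made up, and it assumes the alignment is a power of two, which holds for the 32 bytes chosen here):

```c
#include <stdint.h>

/* Round `value` up to the next multiple of `alignment`.
   Assumes `alignment` is a power of two. */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}
```

With an alignment of 0x20, the offset 0x188 rounds up to 0x1a0, which gives the 24 padding bytes shown above.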

Now the first section, being the code section with the above mentioned assembly, *does* come. It begins at byte 0x1a0 and is 0x20 bytes long:

6A 00 ; push 0x00000000

68 ?? ?? ?? ?? ; push offset _written

6A 0D ; push 0x0000000d

68 ?? ?? ?? ?? ; push offset hello_string

6A F5 ; push 0xfffffff5

2E FF 15 ?? ?? ?? ?? ; call dword ptr cs:__imp__GetStdHandle@4

50 ; push eax

2E FF 15 ?? ?? ?? ?? ; call dword ptr cs:__imp__WriteConsoleA@20

C3 ; ret

Because of the previous section's length we don't need any padding before the next section (data), and here it comes, beginning at 0x1c0:

68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A ; "hello, world\n"

00 00 00 ; padding to align _written

00 00 00 00 ; _written



Now all that's left is the import directory. It will import 2 functions from "kernel32.dll", and it immediately follows the variables in the same section. First we will align it to 32 bytes:

00 00 00 00 00 00 00 00 00 00 00 00 ; padding

It begins at 0x1e0 with the IMAGE_IMPORT_DESCRIPTOR:

OriginalFirstThunk ?? ?? ?? ?? ; yet to find out

TimeDateStamp 00 00 00 00 ; unbound

ForwarderChain ff ff ff ff ; no forwarders

Name ?? ?? ?? ?? ; yet to find out

FirstThunk ?? ?? ?? ?? ; yet to find out

We need to terminate the import-directory with a 0-bytes-entry (we are at 0x1f4):

OriginalFirstThunk 00 00 00 00 ; terminator

TimeDateStamp 00 00 00 00 ;

ForwarderChain 00 00 00 00 ;

Name 00 00 00 00 ;

FirstThunk 00 00 00 00 ;

Now there's the DLL name left, and the 2 thunks, and the thunk-data, and the function names. But we will be finished real soon now!

The DLL name, 0-terminated, beginning at 0x208:

6b 65 72 6e 65 6c 33 32 2e 64 6c 6c 00 ; "kernel32.dll"

00 00 00 ; padding to 32-bit-boundary

The original first thunk, starting at 0x218:

AddressOfData ?? ?? ?? ?? ; RVA to function name "WriteConsoleA"

AddressOfData ?? ?? ?? ?? ; RVA to function name "GetStdHandle"

00 00 00 00 ; terminator



The first thunk is exactly the same list and starts at 0x224:

(__imp__WriteConsoleA@20, at 0x224)

AddressOfData ?? ?? ?? ?? ; RVA to function name "WriteConsoleA"

(__imp__GetStdHandle@4, at 0x228)

AddressOfData ?? ?? ?? ?? ; RVA to function name "GetStdHandle"

00 00 00 00 ; terminator

Now what's left is the two function names in the shape of an IMAGE_IMPORT_BY_NAME. We are at byte 0x230.

01 00 ; ordinal, need not be correct

57 72 69 74 65 43 6f 6e 73 6f 6c 65 41 00 ; "WriteConsoleA"

02 00 ; ordinal, need not be correct

47 65 74 53 74 64 48 61 6e 64 6c 65 00 ; "GetStdHandle"

Ok, that's about all. The next byte, which we don't really need, is 0x24f. We need to fill the section with padding up to 0x260:

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; padding

00



We are done. Now that we know all the byte-offsets, we can apply fixups to all those addresses and sizes that were indicated as "unknown" with '??'-marks. I won't force you to read that step-by-step (it's quite straightforward), and simply present the result:

DOS-header, starting at 0x0:

00 | 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00

10 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

20 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

30 | 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00

signature, starting at 0x40:

50 45 00 00

file-header, starting at 0x44:

Machine 4c 01 ; i386

NumberOfSections 02 00 ; code and data

TimeDateStamp 00 00 00 00 ; who cares?

PointerToSymbolTable 00 00 00 00 ; unused

NumberOfSymbols 00 00 00 00 ; unused

SizeOfOptionalHeader e0 00 ; constant

Characteristics 02 01 ; executable on 32-bit-machine

optional header, starting at 0x58:

Magic 0b 01 ; constant

MajorLinkerVersion 00 ; I'm version 0.0 :-)

MinorLinkerVersion 00 ;

SizeOfCode 20 00 00 00 ; 32 bytes of code

SizeOfInitializedData a0 00 00 00 ; data section size

SizeOfUninitializedData 00 00 00 00 ; we don't have a BSS

AddressOfEntryPoint a0 01 00 00 ; beginning of code section

BaseOfCode a0 01 00 00 ; RVA to code section

BaseOfData c0 01 00 00 ; RVA to data section

ImageBase 00 00 10 00 ; 1 MB, chosen arbitrarily

SectionAlignment 20 00 00 00 ; 32-bytes-alignment

FileAlignment 20 00 00 00 ; 32-bytes-alignment

MajorOperatingSystemVersion 04 00 ; NT 4.0




MinorOperatingSystemVersion 00 00 ;

MajorImageVersion 00 00 ; version 0.0

MinorImageVersion 00 00 ;

MajorSubsystemVersion 04 00 ; Win32 4.0

MinorSubsystemVersion 00 00 ;

Win32VersionValue 00 00 00 00 ; unused?

SizeOfImage c0 00 00 00 ; sum of all section sizes

SizeOfHeaders a0 01 00 00 ; offset to 1st section

CheckSum 00 00 00 00 ; not used for non-drivers

Subsystem 03 00 ; Win32 console

DllCharacteristics 00 00 ; unused (not a DLL)

SizeOfStackReserve 00 00 10 00 ; 1 MB stack

SizeOfStackCommit 00 10 00 00 ; 4 KB to start with

SizeOfHeapReserve 00 00 10 00 ; 1 MB heap

SizeOfHeapCommit 00 10 00 00 ; 4 KB to start with

LoaderFlags 00 00 00 00 ; unknown

NumberOfRvaAndSizes 10 00 00 00 ; constant

data directories, starting at 0xb8:

Address Size

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXPORT (0)

e0 01 00 00 6f 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IMPORT (1)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_RESOURCE (2)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_SECURITY (4)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BASERELOC (5)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_DEBUG (6)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_TLS (9)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)

00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IAT (12)

00 00 00 00 00 00 00 00 ; 13

00 00 00 00 00 00 00 00 ; 14



00 00 00 00 00 00 00 00 ; 15

section header (code), starting at 0x138:

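Name 2e 63 6f 64 65 00 00 00 ; ".code"

VirtualSize 00 00 00 00 ; unused

VirtualAddress a0 01 00 00 ; RVA to code section

SizeOfRawData 20 00 00 00 ; size of code

PointerToRawData a0 01 00 00 ; file offset to code section

PointerToRelocations 00 00 00 00 ; unused

PointerToLinenumbers 00 00 00 00 ; unused

NumberOfRelocations 00 00 ; unused

NumberOfLinenumbers 00 00 ; unused

Characteristics 20 00 00 60 ; code, executable, readable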

section header (data), starting at 0x160:

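Name 2e 64 61 74 61 00 00 00 ; ".data"

VirtualSize 00 00 00 00 ; unused

VirtualAddress c0 01 00 00 ; RVA to data section

SizeOfRawData a0 00 00 00 ; size of data section

PointerToRawData c0 01 00 00 ; file offset to data section

PointerToRelocations 00 00 00 00 ; unused

PointerToLinenumbers 00 00 00 00 ; unused

NumberOfRelocations 00 00 ; unused

NumberOfLinenumbers 00 00 ; unused

Characteristics 40 00 00 c0 ; initialized, readable, writeable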

(padding)

00 00 00 00 00 00 ; padding

00 00 00 00 00 00

00 00 00 00 00 00

00 00 00 00 00 00

code section, starting at 0x1a0:

6A 00 ; push 0x00000000

68 d0 01 10 00 ; push offset _written




6A 0D ; push 0x0000000d

68 c0 01 10 00 ; push offset hello_string

6A F5 ; push 0xfffffff5

2E FF 15 28 02 10 00 ; call dword ptr cs:__imp__GetStdHandle@4

50 ; push eax

2E FF 15 24 02 10 00 ; call dword ptr cs:__imp__WriteConsoleA@20

C3 ; ret

data section, beginning at 0x1c0:

68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A ; "hello, world\n"

00 00 00 ; padding to align _written

00 00 00 00 ; _written

padding:

00 00 00 00 00 00 00 00 00 00 00 00 ; padding

IMAGE_IMPORT_DESCRIPTOR, starting at 0x1e0:

OriginalFirstThunk 18 02 00 00 ; RVA to orig. 1st thunk

TimeDateStamp 00 00 00 00 ; unbound

ForwarderChain ff ff ff ff ; no forwarders

Name 08 02 00 00 ; RVA to DLL name

FirstThunk 24 02 00 00 ; RVA to 1st thunk

terminator (0x1f4):

OriginalFirstThunk 00 00 00 00 ; terminator

TimeDateStamp 00 00 00 00 ;

ForwarderChain 00 00 00 00 ;

Name 00 00 00 00 ;

FirstThunk 00 00 00 00 ;

The DLL name, at 0x208:

6b 65 72 6e 65 6c 33 32 2e 64 6c 6c 00 ; "kernel32.dll"

00 00 00 ; padding to 32-bit-boundary

original first thunk, starting at 0x218:

AddressOfData 30 02 00 00 ; RVA to function name "WriteConsoleA"

AddressOfData 40 02 00 00 ; RVA to function name "GetStdHandle"


first thunk, starting at 0x224:

AddressOfData 30 02 00 00 ; RVA to function name "WriteConsoleA"



AddressOfData 40 02 00 00 ; RVA to function name "GetStdHandle"


IMAGE_IMPORT_BY_NAME, at byte 0x230:


57 72 69 74 65 43 6f 6e 73 6f 6c 65 41 00 ; "WriteConsoleA"

IMAGE_IMPORT_BY_NAME, at byte 0x240:


47 65 74 53 74 64 48 61 6e 64 6c 65 00 ; "GetStdHandle"

(padding)

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; padding

00

First unused byte: 0x260
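To make the import part of the dump easier to follow, here is a minimal C++ sketch that decodes the 20 bytes listed at 0x1e0 into the five fields of the IMAGE_IMPORT_DESCRIPTOR. The struct and function names (ImportDescriptor, readU32LE, decodeImportDescriptor, isTerminator) are made up for this illustration and are not taken from any Windows header; the byte values are copied from the listing above. Note that in this particular file the section RVAs equal the file offsets, so the decoded RVAs can be looked up directly in the dump.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the five 32-bit fields shown in the dump.
struct ImportDescriptor {
    std::uint32_t originalFirstThunk;
    std::uint32_t timeDateStamp;
    std::uint32_t forwarderChain;
    std::uint32_t name;
    std::uint32_t firstThunk;
};

// Read a 32-bit little-endian value from raw file bytes.
std::uint32_t readU32LE(const std::uint8_t* p) {
    assert(p != nullptr);
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode one descriptor (20 bytes) starting at p.
ImportDescriptor decodeImportDescriptor(const std::uint8_t* p) {
    return ImportDescriptor{readU32LE(p), readU32LE(p + 4), readU32LE(p + 8),
                            readU32LE(p + 12), readU32LE(p + 16)};
}

// The descriptor list ends with an entry whose fields are all zero.
bool isTerminator(const ImportDescriptor& d) {
    return d.originalFirstThunk == 0 && d.timeDateStamp == 0 &&
           d.forwarderChain == 0 && d.name == 0 && d.firstThunk == 0;
}

// The 20 bytes found at file offset 0x1e0 in the dump above.
const std::array<std::uint8_t, 20> kDescriptorBytes = {{
    0x18, 0x02, 0x00, 0x00,   // OriginalFirstThunk
    0x00, 0x00, 0x00, 0x00,   // TimeDateStamp
    0xff, 0xff, 0xff, 0xff,   // ForwarderChain
    0x08, 0x02, 0x00, 0x00,   // Name
    0x24, 0x02, 0x00, 0x00    // FirstThunk
}};
```

Decoding kDescriptorBytes gives 0x218, 0x208 and 0x224 for the two thunk arrays and the DLL name, which are exactly the positions labelled in the listing.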

Alas, this works on NT but did not work on Windows 95. Windows 95 can't run applications with a section alignment of 32 bytes; it needs a section alignment of 4 KB and, apparently, a file alignment of 512 bytes. So for Windows 95 you'll have to insert a large number of 0-bytes (for padding) and adjust the RVAs. Thanks go to D. Binette for testing on Windows 95.
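Adjusting offsets and RVAs for a larger alignment comes down to rounding each value up to the next multiple of the alignment. The following is only a sketch of that arithmetic; the helper name alignUp is hypothetical, and the 512-byte and 4 KB values are the ones mentioned in the paragraph above.

```cpp
#include <cassert>
#include <cstdint>

// Round value up to the next multiple of alignment.
// Assumption: alignment is a power of two (true for 32, 512 and 4096).
std::uint32_t alignUp(std::uint32_t value, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

For example, the code section that starts at 0x1a0 in the dump would move to file offset alignUp(0x1a0, 0x200) = 0x200 and to RVA alignUp(0x1a0, 0x1000) = 0x1000; every pointer into the moved data has to be changed accordingly.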




Lesson 2 - What knowledge do I need to code a disassembler?

Well, this is not easy to say. You need good coding skills as well as an understanding of the theoretical concepts behind the important parts.

Here is a list of what you should really know:

- Assembly knowledge in the win32 environment

- OOP and how it works

- A good understanding of how parsers work

- Knowledge of the PE-file-structure

- How SEH works

- Opcodes and Mnemonics

- Linked lists

- How to use a debugger

- Maybe knowledge of trees and graphs if you want to add a polymorphic engine

- Basic understanding of how a disassembler should work


Lesson 3 - What problems do I have to expect during development?

The first main problem you will meet is time: coding a disassembler takes a lot of it. Many people claim that writing a disassembler is a waste of time. Well, let's waste time! The knowledge you will gain from doing so is immense!

Another problem is the opcodes and mnemonics that our disassembler has to identify. You can find one or another opcode/mnemonic list with your search engine, but a full list of all opcodes is approximately 64 MB in size. So when you load the list into memory, you should always keep this size in mind!

Just look at this: you have a hex value and want the corresponding mnemonic. So you have to search the opcode list (64 MB) in memory. You have to do this for every mnemonic, and also for all hex values you receive that do not correspond to a valid mnemonic (the parsing problem). Can you see now what speed problem you will get? Maybe we should look for a better way to do these checks.
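One possible "better way", sketched here under heavy simplification, is to avoid searching altogether and use the opcode byte itself as an index into a 256-entry table. The sketch only knows a handful of one-byte 32-bit x86 opcodes (the ones that appear in the hello-world code section earlier, plus nop); prefixes, two-byte opcodes and ModR/M decoding are left out. All names (OpcodeInfo, buildTable, lookup, instructionLength) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical table entry: mnemonic text plus the number of
// immediate bytes that follow the opcode byte.
struct OpcodeInfo {
    const char* mnemonic;   // nullptr means "not in this tiny table"
    int immediateBytes;
};

// Build a table with one slot per possible opcode byte.
std::array<OpcodeInfo, 256> buildTable() {
    std::array<OpcodeInfo, 256> t{};   // all slots start out empty
    t[0x50] = {"push eax", 0};
    t[0x68] = {"push imm32", 4};
    t[0x6A] = {"push imm8", 1};
    t[0x90] = {"nop", 0};
    t[0xC3] = {"ret", 0};
    return t;
}

// Constant-time lookup: no searching, just one array access.
const OpcodeInfo* lookup(const std::array<OpcodeInfo, 256>& table,
                         std::uint8_t opcode) {
    const OpcodeInfo& entry = table[opcode];
    return entry.mnemonic != nullptr ? &entry : nullptr;
}

// Total length of a known instruction: opcode byte plus immediate.
int instructionLength(const OpcodeInfo& info) {
    assert(info.mnemonic != nullptr);
    return 1 + info.immediateBytes;
}
```

With such a table the cost of identifying a byte no longer depends on the size of the opcode list, and an unknown byte is detected by a single comparison instead of a failed search.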

Next, the identification of API calls will take some more time. An instruction can correspond to an API call, so we need a kind of list to find them. This is the same problem as mentioned above.

When you finally want to add a polymorphic engine or a randomized garbage producer, you need to write values back to the file. You have to keep an eye on the file size and the PE structure. It is easy to see that the file will need some "corrections" after this manipulation!




Lesson 4 - Modularity and its importance [8]

This is about the physical structure of the program. Many languages, and C++ is no exception, allow the separate compilation of modules which are then linked together to form the executable code. The basic conventions of C still apply, but may be a little more formally expressed: Header (.h) files are effectively the interfaces of the different modules, and the .c or .cpp files contain the implementation. As to what should constitute a module, much depends on physical constraints. For a small, relatively simple project it may be quite satisfactory to include all classes and objects in one module, whereas in a large system life will be more difficult.

In many senses modularity almost comes for free. A general rule is to group logically related classes and objects in the same module, setting up interfaces only to those parts other modules must see. Remember that the goal is twofold: the simplifying of documentation under the divide and rule principle, and the elimination of unnecessary compilation. The two ideals are COHESION, that is, groups of logically related abstractions; and LOOSE COUPLING, that is, minimal dependencies between modules.

Once, then, the key abstractions have been identified, it is a relatively simple step to divide physically the implementation into coherent modules. Abstraction combined with encapsulation ensures that a given module will include all relevant functions and data. Encapsulation ensures that the modules are unaware of, and hence unaffected by, changes to the implementation details in other modules. This is a major advantage. Regardless of good intentions, in any large system programmers are often led to make coding decisions which depend on the internal implementation of another module; perhaps even relying on side-effects.
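As a small illustration of the interface/implementation split described above, here is a sketch of a tiny module. It is shown in one block so that it stays self-contained; the comments mark which part would normally go into a header and which into a .cpp file. The module name hexutil, the function byteToHex and the file names in the comments are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// ---- would live in a header (say hexutil.h): the interface ----
namespace hexutil {
    // Writes the two lowercase hex digits of value into out[0] and out[1].
    void byteToHex(std::uint8_t value, char out[2]);
}

// ---- would live in hexutil.cpp: the implementation ----
namespace {
    // Internal detail: the unnamed namespace gives this table internal
    // linkage, so no other module can come to depend on it.
    const char kDigits[] = "0123456789abcdef";
}

void hexutil::byteToHex(std::uint8_t value, char out[2]) {
    assert(out != nullptr);
    out[0] = kDigits[value >> 4];
    out[1] = kDigits[value & 0x0F];
}
```

Other modules see only the declaration of byteToHex. The digit table can be replaced (for instance by an uppercase one) without any of them needing a change, which is the loose coupling the text asks for.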

Modularity may of course be affected by other needs: separate processors in a multi-processor system; segment size limits; dynamic calling behaviour in virtual memory systems (consequent on the need for late binding); the building of libraries where reusability is the main objective; and work allocation in large project teams. In any event it is important to realise that modularity is purely about the physical design: it is the use of classes and objects that form the logical basis of the project.

8. Found at http://tutorials.freeskills.com/read/id/286


Lesson 5 - OOP - Magic possibilities or overbloated system?

OOP is one of the most discussed topics when it comes to high-level languages. But do we really need OOP here? Why is OOP an interesting option when we code in pure assembly?

Well, first of all OOP is a coding concept. It organizes data structures in memory, leading to an abstract model of our data. The positive side of OOP is that data is well organized and gains some very important features such as inheritance. Even in win32 assembly it is possible to code objects as data structures. We will see this later.

When we code our simple disassembler (later), we will not use any OOP concepts. We code it plain and simple. Later, when we get to the complex disassembler engine, we may need OOP to manipulate and reorganize our data in memory. It will also lead to better and clearer code (well, if you have understood the OOP concept).




Objects [9]

Objects are the central idea behind OOP. The idea is quite simple.

An object is a bundle of variables and related methods.

A method is similar to a procedure; we'll come back to these later.

The basic idea behind an object is that of simulation. Most programs are written with very little reference to the real world objects the program is designed to work with; in object oriented methodology, a program should be written to simulate the states and activities of real world objects. This means that apart from looking at data structures when modelling an object, we must also look at methods associated with that object, in other words, functions that modify the object's attributes.

A few examples should help explain this concept. First, we turn to one of my favourite pastimes...

9. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm



Drink!

Say we want to write a program about a pint of beer. If we were writing this program in Modula-2, we could write something like this:

TYPE BeerType = RECORD
    BeerName: STRING;
    VolumeInPints: REAL;
    Colour: ColourType;
    Proof: REAL;
    PintsNeededToGetYouFull: CARDINAL;
    ...
END;

Now let's say we want to initialise a pint of beer, and take a sip from it. In Modula-2, we might code this as:

VAR MyPint: BeerType;
BEGIN
    ...
    (* Initialise (i.e. buy) a pint: *)
    MyPint.BeerName := "Harp";
    MyPint.VolumeInPints := 1.00;
    ...
    (* Take a sip *)
    MyPint.VolumeInPints := MyPint.VolumeInPints - 0.1;
    ...

We have constructed this entire model based entirely on data types; that is, we defined BeerType as a record structure and gave that structure various fields, e.g. BeerName. This is the norm for procedural programming.




This is, however, not how we look at things when we want to program using objects. If you remember how we defined an object at the start of this section, you will remember that we must not only deal with data types, but we must also deal with methods.

A method is an operation which can modify an object's state. In other words, it is something that will change an object by manipulating its variables.

This means that when we take a real world object, in this case a pint of beer, and want to model it using computational objects, we not only look at the data structure that it consists of, but also at all possible operations that we might want to perform on that data. For our example, we should also define the following methods associated with the BeerType object:

InitialiseBeer - this should allow us to give our beer a name, a volume, etc.

GetVolume - to see how much beer we have left!

Take_A_Sip - for lunchtime pints...

Take_A_Gulp - for Lavery's pints...

Sink_Pint - for post exam pints...

There are loads more methods we could define - we might want a function GetBeerName to help us order another pint, for example. Now, some definitions. An object variable is a single variable from an object's data structure; for example, BeerName is one of BeerType's object variables. Now the important bit from this section:

Only an object's methods can modify its variables

There are a few exceptions, but we'll cover them much later. What this means in our example is that unlike the Modula code, we cannot directly modify BeerType's variables - we cannot set BeerName to "Tennents" directly. We must use the object's methods to do this. In practice, what this means is that we must think very carefully when we define methods. Say in the above example we discover when writing the main program that we need to be able to take a drink of arbitrary size; we cannot do this with the above definition, we can only take a sip, a gulp etc. We must go back and define a new method associated with BeerType, say Take_Drink, which will take a parameter representing the amount of beer we wish to drink.
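To make the beer object concrete, here is one way it might look as a C++ class. This is only a sketch: the class name Beer and the exact method names are adapted from the method list above, a constructor takes the place of InitialiseBeer, and the sip size of 0.1 pints is borrowed from the Modula-2 fragment. The point to notice is that the volume can only change through the methods.

```cpp
#include <cassert>
#include <string>
#include <utility>

class Beer {
public:
    // Plays the role of InitialiseBeer: give the beer a name and a volume.
    Beer(std::string name, double volumeInPints)
        : name_(std::move(name)), volumeInPints_(volumeInPints) {
        assert(volumeInPints >= 0.0);
    }

    const std::string& GetBeerName() const { return name_; }
    double GetVolume() const { return volumeInPints_; }

    // Take_Drink: drink an arbitrary amount, but never more than is left.
    void TakeDrink(double pints) {
        assert(pints >= 0.0);
        volumeInPints_ = (pints >= volumeInPints_) ? 0.0
                                                   : volumeInPints_ - pints;
    }

    // The fixed-size drinks are expressed through TakeDrink.
    void TakeASip() { TakeDrink(0.1); }
    void SinkPint() { TakeDrink(volumeInPints_); }

private:
    // Object variables: not reachable from outside the class.
    std::string name_;
    double volumeInPints_;
};
```

Because TakeDrink is the single place where the volume changes, the rule "never below an empty glass" only has to be written once.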

Another Example

We'll now deal with a real-life example which will help us understand some more object concepts. We will design an object to emulate a counter.

A counter is a variable in a program that is used to hold a value. If you don't know that then you shouldn't be reading this! To make things very simple, we'll assume that our counter has only three operations associated with it:

- Initialising the counter to a value

- Incrementing the counter by one

- Getting the current value of the counter

So, when we come to implement the above using objects we will define three methods that do the above.




You may be thinking that we could implement this very simply in Modula-2 using definition and implementation modules, obtaining the same results as if we used an object oriented language. Well, we nearly can:

DEFINITION MODULE Counter;
    PROCEDURE InitialiseCounter(InitialValue: INTEGER);
    PROCEDURE IncrementCounter;
    PROCEDURE GetCounterValue(): INTEGER;
END Counter.

IMPLEMENTATION MODULE Counter;
    VAR MyCounter: INTEGER;

    PROCEDURE InitialiseCounter(InitialValue: INTEGER);
    BEGIN
        MyCounter := InitialValue;
    END InitialiseCounter;

    PROCEDURE IncrementCounter;
    BEGIN
        INC(MyCounter);
    END IncrementCounter;

    PROCEDURE GetCounterValue(): INTEGER;
    BEGIN
        RETURN MyCounter;
    END GetCounterValue;

BEGIN
    MyCounter := 0;
END Counter.

Because Modula-2 is not object oriented, this will only satisfy one of the requirements for an object oriented language - encapsulation. This has been covered before; it simply means that we have implemented information hiding, i.e. we cannot directly access MyCounter from any module that imports Counter. But being object oriented means a lot more than just encapsulation, as we'll see next...



Classes [10]

Say we wanted to extend the counter example discussed previously. Perhaps in our Modula-2 program we need three counters. We could define an array of MyCounter and work through that. Or say we needed up to 1000 counters. Then we could also declare an array, but that would waste a lot of memory if we only used a few counters. Perhaps if we needed an unbounded number of counters we could put them in a linked list and allocate memory as required.

The point of all this is that we are now talking in terms of data structures; all of the above discussion has nothing to do with the behaviour of the counter itself. When programming with objects we can ignore anything not directly concerning the behaviour or state of an object; we instead turn our attention to classes.

A class is a blueprint for an object.

What this basically means is that we provide a blueprint, or an outline of an object. This blueprint is valid whether we have one or one thousand such objects. A class does not represent an object; it represents all the information a typical object should have as well as all the methods it should have. A class can be considered to be an extremely extended TYPE declaration, since not only are variables held but methods too.

10. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm




C++

As an example, let's give the C++ class definition for our counter object.

class Counter {
private:
    int MyCounter;
public:
    Counter() {
        MyCounter = 0;
    }
    void InitialiseCounter(int value) {
        MyCounter = value;
    }
    void IncrementCounter(void) {
        MyCounter++;
    }
    int GetCounterValue(void) {
        return (MyCounter);
    }
};



So, a lot to go through for this little example. You really need to understand the fundamentals of C before the example will make any sense.

- In the private section, all the object's variables should be placed. These define the state of the object. As the name suggests, the variables are going to be private, that is they cannot be accessed from outside the class declaration. This is encapsulation.

- The public section contains all the object's methods. These methods, as the name suggests, can be accessed outside the class declaration. The methods are the only means of communication with the object.

- The methods are implemented as C functions or procedures; the three methods should be easy to understand.

- A class definition normally also has a public method with the same name as the class itself, in this case Counter. This method is called the class constructor, and will be explained soon.

- Functions and procedures can also be placed in the private section; these will not be accessible to the outside world but only within the class declaration. This can be used to provide support routines to the public routines.




Instantiation

This is an awfully big word for a powerfully simple concept. All we have done so far is to create a class, i.e. a specification for our object; we have not created our object yet. To create an object that simulates a counter in C++ then, all we have to do is declare in our main program:

Counter i;

Although this seems just like an ordinary variable declaration, this is much more. The variable i now represents an instance of the Counter type; a counter object. We can now communicate with this object by calling its methods. For example, we can set the counter to the value 50 by calling

i.InitialiseCounter(50);

We can increment the counter

i.IncrementCounter();

and we can get the counter value

value = i.GetCounterValue();

When we first instantiate an object (i.e. when we first declare, or create it), the class constructor is called. The class constructor is a method with the same name as the class definition. This method should contain any start-up code for the object; any initialisation of object variables should appear within this method. In the counter example, whenever we create a new counter object, the first thing that happens to the object is that the variable MyCounter is initialised to zero.

Remember the question posed at the very start? The power of objects starts to kick in now. Say we require another counter within our program. All we have to do is declare a new object, say:

Counter j;



The new counter object will have nothing to do with the previous object. What this means is that i and j are two distinct objects, each with their own separate values. We can increment them independently, for example. Should we need 1000 counter objects we could declare an array of counter objects:

Counter loads[1000];

and then increment one of them using a call such as

loads[321].IncrementCounter();




Java

The equivalent Java class definition for the counter example follows. It is remarkably similar to the C++ definition, and differs only in syntax.

class Counter extends Object {
    private int MyCounter;

    Counter() {
        MyCounter = 0;
    }

    public void InitialiseCounter(int value) {
        MyCounter = value;
    }

    public void IncrementCounter() {
        MyCounter++;
    }

    public int GetCounterValue() {
        return (MyCounter);
    }
}

A few brief notes about the differences:

The class is defined with the extension extends Object. This names the superclass, which will be dealt with in the next section. (Java assumes Object when no extends clause is given.)

There are no public or private sections; instead all variables and methods are prefixed with the appropriate qualifier.

The class constructor definition remains the same.



Instantiating objects in Java is slightly different; the designers knew that the C++ method of declaring a new object was far too similar to how new variables are declared, so objects are declared differently:

Counter i;

i = new Counter();

Basically we define a variable to reference the object in the first line. Then we actually create an instance of the object by a call to new in the second line. Accessing object methods is done in the exact same way in Java as in C++.




So which...?

A quick diversion from OOP here! At this point you might think it doesn't matter whether you use C++ or Java, they both implement object oriented technology. Well, C++ can be used to design programs without implementing any objects; C++ can be used as an extended C. In Java, you must implement any non-trivial program using objects. This is because Java has no support for structures (record types) or pointers; all these must be replaced by object variables and methods. So, if you are using Java, you need to understand object methodology; with C++ this is optional.

Basically, both these languages have hundreds of other features that I don't have time even to begin to explain; as long as you have a basic understanding of object technologies and the C language, you should find both rather easy to learn.

Why Bother?

The process of designing and programming objects seems very cumbersome, so why bother? Well, it's difficult to see from such a small example, but for larger projects, OOP techniques allow a great deal of flexibility. Objects are used because:

- Encapsulation; in our example we cannot alter the value of the counter other than by incrementing it or setting it to an initial value. This reduces possible bugs.

- Modularity; different programmers or teams can work on different independent objects.

- Inheritance; this is covered in the next section.

Basically, objects provide a secure and easily upgradable path for program developers. Already, a considerable number of developers are moving from normal procedural design and embracing object oriented technology.

The next section should be easy to follow if you understood this one! By the way, the reason the next few examples are only in Java is because I don't know enough about C++ to program them!



Inheritance [11]

Another big word for a simple concept. To help explain this, we'll go back to our beer example. Say we want to define a new class to represent a pint of an imported French beer. This class would have all the variables and methods of the normal beer class, but it would have the following additional information:

A variable representing the price of the beer

Two methods to set and get the price of the beer

(We need this information because we are students; everyone knows the price of Harp, but we would like to know the price of this expensive beer before we order it!)

It would be rather tedious to define a new class, FrenchBeerType, which had all the variables and methods of BeerType plus a few more. Instead, we can define FrenchBeerType to be a subclass of BeerType.

A subclass is a class definition which takes functionality from a previous class definition.

What this means is that we only define the additional information that the FrenchBeerType class has.

Informally then, we would create a new class, FrenchBeerType, and tell our compiler that it is a subclass of BeerType. In the class definition, we would include only the following information:

A variable BeerPrice

A method SetBeerPrice

A method GetBeerPrice

We do not need to include any information about BeerName for example; all this is automatically inherited. This means that FrenchBeerType has all the attributes of BeerType plus a few additional ones. All this talk of beer is making me mad for a pint...
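For readers who think in C++, the informal description above could be sketched like this. The class names follow the text; the constructor, the member names with a trailing underscore and the use of double for the price are assumptions made for the example. The base class is cut down to the name only, to keep the sketch short.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Cut-down stand-in for the beer class: only the name is modelled here.
class BeerType {
public:
    explicit BeerType(std::string name) : name_(std::move(name)) {}
    const std::string& GetBeerName() const { return name_; }
private:
    std::string name_;
};

// The subclass lists only what is new: one variable and two methods.
class FrenchBeerType : public BeerType {
public:
    explicit FrenchBeerType(std::string name) : BeerType(std::move(name)) {}

    void SetBeerPrice(double price) {
        assert(price >= 0.0);
        price_ = price;
    }
    double GetBeerPrice() const { return price_; }

private:
    double price_ = 0.0;   // BeerPrice from the text
};
```

A FrenchBeerType object answers GetBeerName() although the subclass never mentions it; that method is inherited from BeerType.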

11. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm




Counters, Counters, Counters...

Back to the counter example then! The counter we had in the last section is fine for most counting purposes. But say in a program we require a counter that can not only be incremented, but can be decremented too. Since this new counter is so similar in behaviour to our previous counter, it would be mad to define a brand new class with everything that Counter has plus a new method. Instead, we'll define a new class ReverseCounter that is a subclass of Counter. We'll do this in Java.

class ReverseCounter extends Counter {
    public void DecrementCounter() {
        MyCounter--;
    }
}

The extends clause indicates the superclass of a class definition. A superclass is the "parent" of a subclass; in our beer analogy, BeerType is the superclass of FrenchBeerType, so if we were defining this in Java we would use class FrenchBeerType extends BeerType. Basically, we are just saying that we want ReverseCounter to be a subclass of Counter. When we define a brand new class that is not a subclass of anything (as we did when we defined Counter) we use the superclass Object to indicate we want the default superclass.

We have defined ReverseCounter to be a subclass of Counter. This means that if we instantiate a ReverseCounter object, we can use any method that the class Counter provided, as well as the new methods provided. For example, if i is an object of the ReverseCounter class, then we can both increment it and decrement it: i.IncrementCounter(); and i.DecrementCounter(); respectively.

Inheritance is a powerful tool. Unlike in our simple example, inheritance can be passed on from generation to generation; we could define a class SuperDuperReverseCounter, for example, that is a subclass of ReverseCounter and provides added variables or methods.



Bugs, bugs, bugs...

If you tried to compile the above example and found it wasn't compiling, don't worry! There is a semi-deliberate mistake left in the code, which I am very usefully going to use to stress a point.

When defining a class you must consider any possible subclass.

When we defined the Counter class we didn't even know what a subclass was, so we could be forgiven for breaking this rule then, but not from now on! If we go back to how the class was defined:

class Counter extends Object {
    private int MyCounter;
    ...
    ...
}

We can see that the variable MyCounter is defined to be private. In Java, this means that the variable becomes very, very private indeed; in fact, it is only accessible from inside the class in which it is defined. It is not available to any other class, including its subclasses. So when we reference MyCounter from inside ReverseCounter the Java compiler will kick up a fuss, since we are outside the scope of the variable.

So, we should have realised at the time of writing the Counter class that subclasses might need to get at this variable too. To fix this, all we have to do is change the qualifier of MyCounter to:

protected int MyCounter;

A variable with a protected qualifier can be accessed from within the class in which it is defined, as well as from all subclasses of this class (in Java, also from other classes in the same package). This seems appropriate for our purposes.
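The same fix can be sketched in C++, where protected likewise opens a member to subclasses (C++ has no packages, so there it means the class and its subclasses only). The class and method names are carried over from the counter examples; the in-class initialiser and the small demo function are additions for this illustration.

```cpp
#include <cassert>

class Counter {
public:
    void InitialiseCounter(int value) { MyCounter = value; }
    void IncrementCounter() { ++MyCounter; }
    int GetCounterValue() const { return MyCounter; }

protected:
    // protected instead of private: subclasses may touch it,
    // code outside the class hierarchy still may not.
    int MyCounter = 0;
};

class ReverseCounter : public Counter {
public:
    // Compiles only because MyCounter is protected in Counter.
    void DecrementCounter() { --MyCounter; }
};

// Small usage example.
void demoReverseCounter() {
    ReverseCounter c;
    c.InitialiseCounter(5);
    c.IncrementCounter();   // 6
    c.DecrementCounter();   // back to 5
    assert(c.GetCounterValue() == 5);
}
```

Changing protected back to private makes the compiler reject DecrementCounter, which is the same complaint the Java compiler raises for the original example.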




Lesson 6 - Linked lists - a powerful tool [12]

Although linked lists sound kind of scary, don't worry: they are really easy to use once you've got a little practice under your belt! When I first learned this odd way of storing data, I really thought that I wouldn't be using them again. I certainly learned differently! Linked lists form the foundation of many data storing schemes in my game!

They are really nice when you don't know how many of a data type you will need, and don't want to waste space. They are like having a dynamically allocated string that fluctuates in size as the program runs. Before I really confuse you, let's get into a better explanation!

12. This tutorial was taken from http://www.inversereality.org/tutorials/c++/linkedlists.html and was written by Justin Deltener



A linked list is a chain of structs or records called nodes. Each node has at least two members, one of which points to the next item or node in the list! Such lists are called Singly Linked Lists because each node only points to the next item, and not the previous. Lists whose nodes point to both the next and the previous item are called Doubly Linked Lists. Please note that these are distinct from Circular Linked Lists; I won't go into any depth on that because it doesn't concern this tutorial. According to this definition, we could have our record hold anything we wanted! The only drawback is that each record must be an instance of the same structure. This means that we couldn't have a record with a char pointing to another structure holding a short, a char array, and a long. Again, they have to be instances of the same structure for this to work. Another cool aspect is that each structure can be located anywhere in memory; the nodes don't have to be contiguous in memory!

struct List {
    long Data;
    List* Next;
    List() { Next = NULL; Data = 0; }   // a new node points nowhere yet
};
typedef List* ListPtr;

Notice that we define a default constructor for our structure that sets Next equal to NULL. This is because we need to know when we have reached the end of our linked list. Each Next item that is NOT equal to NULL means that it is pointing to another allocated instance. If it does equal NULL, then we have reached the end of our list.




Starting up

First off, we need to set our list pointers to some known location in memory. We will create a temp pointer, allocate an instance, then assign our pointers. Something like this (the constructor assigns the new instance to Head directly, which has the same effect as going through a temp pointer):

SLList::SLList() {
    Head = new List;      // allocate the first node
    Tail = Head;          // with a single node, head and tail coincide
    CurrentPtr = Head;
}

We can forget about doing anything with temp after this because Head will always be pointing to the memory allocated by it, until it is deleted. Just like we discussed, we created a temp pointer, allocated an instance of our structure, then assigned both Head and Tail to our new instance. This beginning point is very crucial. We must allocate at least one instance right away so our pointers are actually pointing to something relevant! Well, now we have the smallest possible linked list, where Head = Tail. Pointer usage in linked lists makes them a little hard to learn at first, but once you think of the uses of pointers it starts to come together. Now that we have our Head and Tail pointers actually pointing to something that is physically there, let's cover how to add additional nodes to our list. A linked list with one node is kind of boring :)



Adding a Node

We can actually add nodes in two possible places, the beginning or the end, although the standard seems to be the end. This makes our linked list act kind of like a queue, with the head node being the oldest object and the tail pointing to the newest. This brings up an interesting subject also. How will we use our linked list? This is what makes the linked list so powerful. We could use it as a priority list where the oldest objects get a higher precedence until deleted from the list. We could also use it as a master listing of items that need to be kept track of at one time, deleting objects when necessary, without using any precedence scheme. Here's some code that will add a node onto the end of the list, and then move the end pointer so that it really does point at the end.

void SLList::AddANode() {
    Tail->Next = new List;   // hang a fresh node behind the current tail
    Tail = Tail->Next;       // the fresh node is the new tail
}

Here we add a node onto the end of our list, then move the Tail pointer to point to the new instance! After this function we can always access our new node through Tail since we allocated a new instance, then made Tail point to it!
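The tutorial never shows the declaration of the SLList class itself, so here is one guess at what it could look like, pieced together from the member names used in the snippets (Head, Tail, CurrentPtr, AddANode). The destructor, the deleted copy operations and the helpers Length, GetHead, GetTail and GetCurrent are assumptions added so that the sketch is complete and can be tried out; they are not part of the original tutorial.

```cpp
#include <cassert>
#include <cstddef>

struct List {
    long Data = 0;
    List* Next = nullptr;   // nullptr marks the end of the list
};
typedef List* ListPtr;

class SLList {
public:
    // As in the tutorial: the list starts with one allocated node.
    SLList() : Head(new List), Tail(Head), CurrentPtr(Head) {}

    // Assumption: the list owns its nodes and frees them at the end.
    ~SLList() {
        while (Head != nullptr) {
            ListPtr next = Head->Next;
            delete Head;
            Head = next;
        }
    }
    SLList(const SLList&) = delete;
    SLList& operator=(const SLList&) = delete;

    // Append a fresh node and move Tail onto it.
    void AddANode() {
        assert(Tail != nullptr);
        Tail->Next = new List;
        Tail = Tail->Next;
    }

    // Hypothetical helpers for inspecting the list.
    std::size_t Length() const {
        std::size_t n = 0;
        for (ListPtr p = Head; p != nullptr; p = p->Next) ++n;
        return n;
    }
    ListPtr GetHead() const { return Head; }
    ListPtr GetTail() const { return Tail; }
    ListPtr GetCurrent() const { return CurrentPtr; }

private:
    ListPtr Head;
    ListPtr Tail;
    ListPtr CurrentPtr;
};
```

A freshly constructed list has length 1 with Head equal to Tail; each call to AddANode increases the length by one and leaves Tail on a node whose Next is null.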




Traversing the List

This is actually the most difficult part in dealing with Singly Linked Lists. This is because we can't immediately access the previous node should we need to, like when we want to delete a node and reconnect the node before to the node after the one being deleted. One easier way is to create a function that will traverse through the list a given number of nodes. This way, we can keep track of which one we are on should we need to delete it, then we could pass the node to the function and get a pointer to the previous node! Something like this:

ListPtr SLList::Previous(long index) {
    // walk index-1 links forward from the head
    ListPtr temp = Head;
    for (long count = 0; count < index - 1; count++) {
        temp = temp->Next;
    }
    return temp;
}

ListPtr SLList::Previous(ListPtr index) {
    ListPtr temp = Head;
    if (index == Head) {   // special case, index IS the head :)
        return Head;
    }
    while (temp->Next != index) {
        temp = temp->Next;
    }
    return temp;
}



If we know that we've gone into our list a certain number of nodes, we can pass that numberto our Previous function and get a pointer to the previous node. This works well, but is hard todebug should we be off in our counter etc. I created a second version which lets you pass thenode you are currently at as an argument, then we can be absolutely certain that we will getthe previous node! I use the second version a LOT more. Also notice that the second haserror checking. If we are currently at the Head node and try to go back one, it simply returnsHead instead of returning garbage.

While creating our neeto class, I decided to use a class node pointer which we declared atCurrentPtr. To that end, I created two functions that move our pointer forward and back onenode. If we are at the head and try to go back one node (into nothing), then the functiondoesn't move our pointer. Likewise if we are at the end of the list and try to advance to thenext node (nothing), it doesn't move our pointer.

void SLList::Advance()
{
    if (CurrentPtr->Next != NULL)    // stay put at the end of the list
    {
        CurrentPtr = CurrentPtr->Next;
    }
}

void SLList::Rewind()
{
    if (CurrentPtr != Head)          // stay put at the head of the list
    {
        CurrentPtr = Previous(CurrentPtr);
    }
}



Deleting a Node

When deleting nodes from a linked list, there are 3 different cases to decide between. The node to be deleted is the head node, it's a middle node (somewhere between the head and the tail, but not either), or it is the tail node. Each requires a small change to take into account when deleting the node. Let's go over each one in depth.

CurrentNode is actually corpse

void SLList::DeleteANode(ListPtr corpse)   //<-- i thought it was funny :)
{
    ListPtr temp;

    if (corpse == Head)          // case 1: corpse is the head
    {
        temp = Head;
        Head = Head->Next;
        delete temp;
    }
    else if (corpse == Tail)     // case 2: corpse is at the end
    {
        temp = Tail;
        Tail = Previous(Tail);
        Tail->Next = NULL;
        delete temp;
    }
    else                         // case 3: corpse is in the middle somewhere
    {
        temp = Previous(corpse);
        temp->Next = corpse->Next;
        delete corpse;
    }

    CurrentPtr = Head;           // reset the class temp pointer
}



Case 1: CurrentNode = Head Node

In this case, the node to be deleted is actually the Head node! This is a special case because there is no previous node to connect. We simply use our temp pointer to remember where Head is pointing, advance Head to the next position, then delete our saved location! Simple huh!

Case 2: CurrentNode = End node

In this case, the node to be deleted is actually the Tail node! This is a special case because we have a previous node, but no node afterwards to connect to. We save the old location of Tail using temp, set Tail equal to the previous node, set the Next pointer of Tail equal to NULL since it is at the end, then delete the node that temp points to!

Case 3: CurrentNode is somewhere in between

In this case, there is a node before and a node after our current node. All we need to do is connect the previous node to the node after our current node. We set temp equal to our previous node and set its Next pointer to the node after our current one (corpse). Once they are connected, we can simply delete our current node! That's all there is to deleting nodes!
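The three cases can be tried out in a small self-contained sketch. This is not the SLList class from the text: Node, SList, append, previous and remove are names assumed here for illustration, previous returns nullptr (rather than Head) for the head node, and the list is allowed to become empty, a situation the listing above does not deal with.

```cpp
#include <cassert>

struct Node { int value; Node* next; };

struct SList {
    Node* head = nullptr;
    Node* tail = nullptr;

    // Free whatever is left, as in the "Before Exit" section.
    ~SList() { while (head) { Node* t = head; head = head->next; delete t; } }

    // Add a node at the end and return it.
    Node* append(int v) {
        Node* n = new Node{v, nullptr};
        if (tail) tail->next = n; else head = n;
        tail = n;
        return n;
    }

    // Node in front of n, or nullptr when n is the head (or not in the list).
    Node* previous(const Node* n) const {
        if (n == head) return nullptr;
        Node* p = head;
        while (p && p->next != n) p = p->next;
        return p;
    }

    // Unlink and delete corpse, which must be a node of this list.
    void remove(Node* corpse) {
        assert(corpse != nullptr);
        Node* prev = previous(corpse);
        if (corpse == head) {
            head = corpse->next;              // case 1: no previous node
        } else {
            assert(prev != nullptr);          // corpse must be in the list
            prev->next = corpse->next;        // cases 2 and 3: bridge over corpse
        }
        if (corpse == tail) tail = prev;      // case 2: the tail moves back
        delete corpse;
    }
};
```

Folding cases 2 and 3 together works because the tail's next pointer is already null, so "bridging over" the tail automatically terminates the list.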



Before Exit

Before we can exit our program, we have to make certain that all of our dynamically allocated structures or nodes are deleted, otherwise we will have a memory leak! To fix this, we can build a routine to de-allocate any remaining nodes. Let's make it automatic and place it in the class destructor!

This traverses through the list de-allocating nodes as it moves along until it has reached the end! That's all there is to it!

SLList::~SLList()
{
    ListPtr temp = Head;
    CurrentPtr = Head;
    while (CurrentPtr != NULL)
    {
        CurrentPtr = CurrentPtr->Next;   // step forward first...
        delete temp;                     // ...then free the node we just left
        temp = CurrentPtr;
    }
}


Lesson 7 - Trees and Graphs

Just as using an array to store a sequence makes you pay for indexing even when you don't need it (suggesting a linked list if you need flexibility), using a sorted array is a clunky sort of bargain if you need to muck with the sequence on anything like a regular basis. There are sorting algorithms that are fast on an already-mostly-sorted array, but even then you'll wind up shifting huge pieces of array around to add or remove even a single element. Binary trees can cheaply store a sorted sequence, with searching (and even indexing if you need it), and let you add, remove, or muck with nodes at will.



Overview

Binary trees are the result of the same sort of relaxation that leads from an array to a linked list. We don't really need the indexing that sorted arrays impose on sorted data; if we throw it away, we're left with only the hierarchy of middle elements that binary_search() traverses as it executes.

There are also much more paranoid binary-tree implementations that constantly juggle the tree in bizarre ways, such that it is mathematically guaranteed that the tree will never become too badly unbalanced (for some formal definition of "too badly"); two popular flavors of this are splay trees and red-black trees. This approach involves quite a bit of overhead, though, and adds complexity; in practice, it's rarely worth worrying about this unless you can't avoid feeding an already-sorted list to your tree. It's exceedingly rare for a tree to become unbalanced enough to make a difference by accident.



Reconstruction of Binary Trees from Traversals [13]

A collection of the three traversals is unique for a binary tree. But are all three required? Hereafter we talk about binary trees whose keys are alphabetic characters, for convenience. First and foremost we are thinking in terms of saving space.

There is a comfortable space-saving representation called the linear or array representation of a binary tree. Here, if an array a[1...n] is used to represent a binary tree, a[k] has its left child value at a[2k] and its right child value at a[2k+1]. But there has to be a value which specifies that there is no such node, e.g., if a[k] does not have a left child, a[2k] = null value. The slot still has to exist, and hence space is wasted for all the non-existent nodes. For a balanced binary tree, e.g., an AVL tree, the array representation is very efficient, but not for random ones.
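A minimal sketch of the array representation, assuming 1-based slots and a '\0' sentinel for missing nodes (EMPTY, rightChain and wastedSlots are illustrative names, not part of the original article). It stores a degenerate tree in which every node has only a right child, to show how quickly the unused slots pile up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed sentinel meaning "no node in this slot".
const char EMPTY = '\0';

// Store the keys as a right-leaning chain: each key is the right child of
// the previous one. The children of slot k live in slots 2k and 2k+1.
std::vector<char> rightChain(const std::string& keys) {
    std::vector<char> a(1, EMPTY);            // slot 0 is unused
    std::size_t k = 1;                        // slot of the root
    for (char key : keys) {
        if (a.size() <= k) a.resize(k + 1, EMPTY);
        a[k] = key;
        k = 2 * k + 1;                        // move on to the right child
    }
    return a;
}

// Count the slots 1..n that hold no node.
std::size_t wastedSlots(const std::vector<char>& a) {
    assert(!a.empty());                       // slot 0 must exist
    std::size_t wasted = 0;
    for (std::size_t k = 1; k < a.size(); ++k)
        if (a[k] == EMPTY) ++wasted;
    return wasted;
}
```

For the four keys "abcd" the chain occupies slots 1, 3, 7 and 15, so the array needs 16 entries and 11 of the 15 usable slots stay empty.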

We shall base our discussion on the assumption that two traversals represent a binary tree uniquely. Any inconsistencies with this assumption (they do exist) shall be cited as and when they are dealt with.

The following is the structure (C style) used for representing a node in the ongoing discussion:

struct node
{
    char data;
    struct node *left, *right;
};

[13] Taken from http://www.geocities.com/acmearticles/treerec.htm

A pseudocode of the three traversals follows:

Inorder(x): Inorder(x.left), Visit(x), Inorder(x.right)

Preorder(x): Visit(x), Preorder(x.left), Preorder(x.right)

Postorder(x): Postorder(x.left), Postorder(x.right), Visit(x)

Hereafter we represent the traversals by the data in the nodes, given in the order they are visited. So a traversal consists of as many characters as there are nonempty nodes in the tree. The algorithms for tree reconstruction from two traversals are presented below. Some facts that are made use of in reconstruction are presented first.

1. The first data in the preorder traversal represents the root.

2. The last data in the postorder traversal represents the root.

3. The traversals can be split into three parts as

Preorder traversal= root(Preorder of root.left)(Preorder of root.right)

Inorder traversal= (inorder of root.left)root(inorder of root.right)

Postorder traversal= (postorder of root.left)(postorder of root.right)root
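The traversal pseudocode and facts 1 and 2 can be sketched in C++ roughly as follows. The node layout mirrors the struct used in this discussion; the choice to collect each traversal in a std::string and the helper checkRootFacts are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

struct node {
    char data;
    node *left, *right;
};

// Each function appends the visited keys to out, in visiting order.
void inorder(const node* x, std::string& out) {
    if (!x) return;
    inorder(x->left, out);
    out += x->data;
    inorder(x->right, out);
}

void preorder(const node* x, std::string& out) {
    if (!x) return;
    out += x->data;
    preorder(x->left, out);
    preorder(x->right, out);
}

void postorder(const node* x, std::string& out) {
    if (!x) return;
    postorder(x->left, out);
    postorder(x->right, out);
    out += x->data;
}

// Check facts 1 and 2 on a non-empty tree: the root is the first key of
// the preorder traversal and the last key of the postorder traversal.
void checkRootFacts(const node* root) {
    assert(root != nullptr);
    std::string pre, post;
    preorder(root, pre);
    postorder(root, post);
    assert(pre.front() == root->data);    // fact 1
    assert(post.back() == root->data);    // fact 2
    assert(pre.size() == post.size());    // one character per node
}
```

For a root a with left child b (children d and e) and right child c, the three functions produce "dbeac", "abdec" and "debca" respectively.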



Given Inorder and Preorder traversals

The first element of the preorder traversal represents the root. Let the position of that element in the inorder traversal be i. The string of characters from the first element to the element at (i-1) constitutes the inorder traversal of the left subtree, and the string of characters beyond i till the end represents the inorder traversal of the right subtree. Now, as there are as many characters in the preorder traversal as there are in the inorder traversal, the preorder traversal can easily be split into the root and the preorder traversals of the left and right subtrees respectively.

C code for the same is given below.

#include <stdlib.h>

struct node *buildtree(char *in, char *pre, int len)
{
    int i, lenright, lenleft;
    struct node *p;

    if (!len) return NULL;
    p = (struct node *)malloc(sizeof(struct node));
    p->data = pre[0];                 /* first preorder element is the root */
    p->left = NULL;
    p->right = NULL;
    if (len == 1) return p;
    for (i = 0; in[i] != pre[0]; i++)
        ;                             /* find the root in the inorder string */
    lenright = len - i - 1;
    lenleft = len - lenright - 1;     /* equals i */
    p->left = buildtree(in, pre + 1, lenleft);
    p->right = buildtree(in + lenleft + 1, pre + lenleft + 1, lenright);
    return p;
}

The above function returns a pointer to the root of the tree whose inorder and preorder traversals are given as the first and second parameters respectively, and their length as the third parameter. The length of the inorder and preorder traversals will be the same.

Given Inorder and Postorder traversals

The algorithm for this reconstruction is almost the same as the one above; the difference lies in the fact that the root occurs last in the postorder traversal. Finding the relative positions etc. is similar to the above algorithm.

C code is given below.

struct node *buildtree(char *in, char *post, int len)
{
    int i, lenright, lenleft;
    struct node *p;

    if (!len) return NULL;
    p = (struct node *)malloc(sizeof(struct node));
    p->data = post[len - 1];          /* last postorder element is the root */
    p->left = NULL;
    p->right = NULL;
    if (len == 1) return p;
    for (i = 0; in[i] != post[len - 1]; i++)
        ;                             /* find the root in the inorder string */
    lenright = len - i - 1;
    lenleft = len - lenright - 1;     /* equals i */
    p->left = buildtree(in, post, lenleft);
    p->right = buildtree(in + lenleft + 1, post + lenleft, lenright);
    return p;
}

The above function returns a pointer to the root of the tree whose inorder and postorder traversals and their length are given as the first, second and third parameters respectively.



Now what about the combination of the postorder and preorder traversals: does it represent a unique tree? It should, shouldn't it? But let us take this particular case. If the left subtree of the root is empty, then the preorder and postorder traversals can each be split into two parts as below.

Preorder traversal= root(Preorder of root.right)

Postorder traversal= (postorder of root.right)root

Consider another case where the right subtree of the root is empty.

Preorder traversal= root(Preorder of root.left)

Postorder traversal= (postorder of root.left)root

These two cases cannot be distinguished from one another. Suppose the preorder is abcd and the postorder is cdba. We can infer that a is the root, that b is the root of one of its subtrees, and that its other subtree is empty. But which one is empty? Left or right?

This is enough proof that a combination of preorder and postorder traversals does not represent a unique binary tree. But does that mean that the inorder traversal carries more information about the binary tree than the postorder or preorder traversals? Or is our assumption that a combination of the inorder traversal and one of the other two represents a unique tree wrong?
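The abcd/cdba example can be checked mechanically. The sketch below is only an illustration: walk and indistinguishableWithoutInorder are names assumed here, and the traversal kind is selected by an integer purely to keep the code short.

```cpp
#include <cassert>
#include <string>

struct node {
    char data;
    node *left, *right;
};

// order: 0 = preorder, 1 = inorder, 2 = postorder
std::string walk(const node* x, int order) {
    assert(order >= 0 && order <= 2);
    if (!x) return "";
    std::string l = walk(x->left, order);
    std::string r = walk(x->right, order);
    std::string v(1, x->data);
    if (order == 0) return v + l + r;
    if (order == 1) return l + v + r;
    return l + r + v;
}

// True when two trees share their preorder and postorder traversals
// but still differ, as shown by their inorder traversals.
bool indistinguishableWithoutInorder(const node* s, const node* t) {
    return walk(s, 0) == walk(t, 0)
        && walk(s, 2) == walk(t, 2)
        && walk(s, 1) != walk(t, 1);
}
```

Hanging the subtree b(c,d) once as the left child of a and once as the right child gives preorder "abcd" and postorder "cdba" both times, while the inorder traversals are "cbda" and "acbd".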



Non-Recursive Algorithm for Binary Tree Reconstruction from Inorder and Postorder Traversals

Step 1: select the first element in the Inorder traversal
Step 2: find its position in the Postorder traversal, say pos
Step 3: if same position then
        {
            make it the root and make the partially complete subtree its left subtree.
            select next element in Inorder.
        }
        else
        {
            all elements in Postorder traversal from present element to the present root form the right subtree of the root.
            Insert these elements just as inserting into a binary search tree.
            if (present element in Inorder = last element in Postorder) go to step 5
            next element selected in Inorder is the element at pos+1.
        }
Step 4: go to step 2
Step 5: end.



C code for the same is given below. The code assumes the following. Assume that the tree holds data values of type character. Then the inorder traversal of the tree is a string of characters. Number the characters in the inorder traversal in the order that they appear, e.g., if the inorder traversal is "abgf" then a->1, b->2, g->3, f->4. The inorder traversal is thus reduced to the sorted sequence of numbers 1 to n, where n is the number of nodes in the tree. Now the mapped values are substituted in the postorder traversal, e.g., if the postorder traversal is "bafg" then the modified postorder traversal is {2,1,4,3}. Arrays I and P denote these modified inorder and postorder traversals. Note that after this transformation the modified inorder traversal does not carry any data regarding the structure of the tree. The inorder and postorder traversals are stored in I and P from position 1 onwards, and P[0]=I[0]=0. The following functions are used in the code below:

findpos(int i,int *p) returns the position of i in the array p.

insert(int k,node *root) inserts element k into the binary tree rooted at root in the fashion of binary search tree insertion. root is a global variable that holds a pointer to the root of the binary tree to be constructed.

updateroot(int k,node *root) constructs a new tree with a root holding k; the present tree rooted at root is made its left subtree.

n holds the number of non-null elements in the tree.

At last (phew!), the code follows.

void Buildtree()
{
    int presentposition = 1, nextposition, position_in_postorder, previousposition, temp;

    insert(I[presentposition], root);
    previousposition = 0;   // initially NULL
    while (1)
    {
        nextposition = presentposition + 1;
        position_in_postorder = findpos(presentposition, P);
        temp = position_in_postorder;
        if (position_in_postorder == presentposition)
        {
            if (findpos(I[nextposition], P) > presentposition)
            {
                updateroot(I[nextposition], root);
                previousposition = presentposition;
                presentposition++;
            }
        }
        else
        {
            position_in_postorder--;
            while (P[position_in_postorder] != previousposition)
            {
                insert(P[position_in_postorder], root);
                position_in_postorder--;
            }
            if (temp >= n) return;   // ending condition
            else
            {
                previousposition = presentposition;
                presentposition = temp + 1;
                updateroot(presentposition, root);
            }
        }
    }
}

The above code was subject to a lot of testing and it worked.


Lesson 8 - Parsing or how to loop through bytes [14]

parse vt., vi. parsed, pars'ing [Now Rare] 1. to separate (a sentence) into its parts, explaining the grammatical form, function, and interrelation of each part 2. to describe the form, part of speech, and function of (a word in a sentence)

The word parse is computer science parlance for the act of separating computer input into meaningful parts for subsequent processing actions.

[14] This lesson is taken from http://www.kilowattsoftware.com/tutorial/rexx/parseTutorial.htm (Kilowatt Software's Classic Rexx Tutorial) and is adapted and modified for this course. Recommended Reading: http://www.cs.vu.nl/~dick/PTAPG.html (Parsing Techniques - A Practical Guide)



Preparing to parse

Let us learn about parsing by analyzing the following reduction of Descartes' famous quote:

    I think I am

Here is a program that parses the words in the phrase. When a value consists of words that are separated by only one space, and there are no leading or trailing spaces, the value is easy to parse into a known number of words as follows.

    parse value 'I think I am' with word1 word2 word3 word4
    say "'"word1"'"
    say "'"word2"'"
    say "'"word3"'"
    say "'"word4"'"

This shows:

    'I'
    'think'
    'I'
    'am'



Here is another program that parses the above phrase.

    phrase = 'I think I am'
    do while phrase <> ''
      parse var phrase word phrase
      say "'"word"'"
    end

This shows:

    'I'
    'think'
    'I'
    'am'

This simple program achieved the desired result. The program is a Rexx parsing idiom. In each loop iteration, the parse instruction extracts the first word in the phrase, and assigns the remaining words (after the first word) to the phrase variable. The loop concludes when all of the words in the phrase have been processed.

When there are more words in the value than there are variables in the template, the trailing words are assigned to the last variable in the template. Here is an example.

    parse value 'Sam likes peaches and cream' with subject verb object
    say 'subject:' subject
    say 'verb:' verb
    say 'object:' object

This shows:

    subject: Sam
    verb: likes
    object: peaches and cream

Now let's make Descartes' quote a little more challenging. Additional spaces in the original phrase, and punctuation characters, introduce various difficulties.

       I  think,  I am  .

Here is the same phrase with spaces represented as dots: · , so they can be seen!

    ···I··think,··I am··.··

The first parsing challenge is to extract the words within the quote. Let's try to do it with the words and word built-in functions.

    phrase = '···I··think,··I am··.··'
    do i=1 for words( phrase )
      say "'"word( phrase, i )"'"
    end

This shows:

    'I'
    'think,'
    'I'
    'am'
    '.'

This simple program worked well, although the second word includes a trailing comma. In addition, the period is considered a word.



The following is an initial attempt to parse the words in the phrase.

    parse value '···I··think,··I am··.··' with word1 word2 word3 word4 word5
    say "'"word1"'"
    say "'"word2"'"
    say "'"word3"'"
    say "'"word4"'"
    say "'"word5"'"

This shows:

    'I'
    'think,'
    'I'
    'am'
    '··.··'

Notice the spaces before and after the period.

The following program achieves a better result.

    phrase = '···I··think,··I am··.··'
    do while phrase <> ''
      parse var phrase word phrase
      say "'"word"'"
    end

This shows:

    'I'
    'think,'
    'I'
    'am'
    '.'

This was our second parsing program. It worked fairly well, although the second word includes a trailing comma. In addition, the period is considered a word. This time there are no spaces before and after the period.



Now let's successfully parse the phrase into words.

    phrase = '···I··think,··I am··.··'
    do while phrase <> ''
      parse var phrase word phrase
      word = strip( translate( word, , ',.;":?()' ) )
      if word <> '' then
        say "'"word"'"
    end

This shows:

    'I'
    'think'
    'I'
    'am'

The above program translated punctuation characters to spaces, and then stripped spaces. Any characters remaining after these operations were considered a word.
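For readers who do not have a Rexx interpreter at hand, the same translate-then-split idea can be sketched in C++. The function name words and the default punctuation set are assumptions for this example. Unlike the Rexx loop, the sketch replaces punctuation in the whole phrase before splitting, so a token such as "a,b" would yield two words here rather than one.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Replace every character found in punct with a space, then split the
// phrase on runs of whitespace. Empty pieces never reach the result.
std::vector<std::string> words(std::string phrase,
                               const std::string& punct = ",.;\":?()") {
    for (char& c : phrase)
        if (punct.find(c) != std::string::npos) c = ' ';

    std::vector<std::string> out;
    std::istringstream in(phrase);
    for (std::string w; in >> w; )       // operator>> skips leading blanks
        out.push_back(w);
    return out;
}
```

Called with the phrase "   I  think,  I am  .  " the function returns the four words I, think, I and am.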



How does parsing work?

The parse statement divides a source string into constituent parts and assigns these to variables, as directed by the parsing template.

The following picture introduces how parsing is performed, with multiple space dividers between the variables to assign.



While the template is processed from left to right, several current positions in the source string are maintained. The motion of these positions is guided by the division specifiers within the template. In the picture above, the positions are those that would be in effect after the template's verb term is processed. The object term will be processed next. The previous start position locates the 'l' in 'likes'. The current end position locates the space between 'likes' and 'peaches'. The next start position locates the 'p' in 'peaches'. With these positions established the value 'likes' is assigned to variable verb. When the object term is processed, it is the only term remaining. Consequently, the remainder of the source string is assigned to the object variable -- it receives the value: 'peaches and cream'.

If a relative position division specifier followed the verb term, the verb variable would receive that many characters after the previous start position and all positions would be advanced to that relative position. Study the following effect:

    parse value 'Sam likes peaches and cream' with subject verb +2 object
    say 'subject:' subject
    say 'verb:' verb
    say 'object:' object

This shows:

    subject: Sam
    verb: li
    object: kes peaches and cream

The following is another illustration that shows how parsing is performed, with a literal pattern divider between the variables to assign.

The literal pattern in this example is a quoted comma -- ',' . The previous start position locates the 't' in 'think'. The current end position locates the ','. The next start position locates the space between the comma and the 't' in 'therefore'. With these positions established the value 'I think' is assigned to variable precondition. When the consequence term is processed, it is the only term remaining. Consequently, the remainder of the source string is assigned to the consequence variable -- it receives the value: ' therefore I am'. This value contains a leading space.

If a relative position division specifier followed the ',' literal pattern, the next start position would be that many characters after the comma in the source string.

    parse value 'I think, therefore I am' with precondition ',' +1 consequence

This advances one character position after the comma. As a result, the consequence variable receives the value 'therefore I am' without a leading space.

Parsing Expressions by Recursive Descent [15]

Parsing expressions by recursive descent poses two classic problems:

1. how to get the abstract syntax tree to follow the precedence and associativity of operators and

2. how to do so efficiently when there are many levels of precedence.

The classic solution to the first problem does not solve the second. I will present the classic solution, a well known alternative known as the "Shunting Yard Algorithm", and a less well known one that I have called "Precedence Climbing".

[15] This lesson was taken from http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm and was written by Theodore Norvell



Precedence and associativity

Consider the following example grammar, G,

    E --> E "+" E
        | E "-" E
        | "-" E
        | E "*" E
        | E "/" E
        | E "^" E
        | "(" E ")"
        | v

in which v is a terminal representing identifiers and constants.

We want to build a parser that will

1. Produce an error message if its input is not in the language of this grammar.

2. Produce an "abstract syntax tree" (AST) reflecting the structure of the input, if the input is in the language of the grammar.

Each (correct) input will have a single AST based on the following precedence and associativity rules:

- Parentheses have precedence over all other operators.
- ^ (exponentiation) has precedence over /, *, -, and +.
- * and / have precedence over - and +.
- Unary - has precedence over binary - and +.
- ^ is right associative while all other operators are left associative.

For example the first three rules tell us that

    a ^ b * c ^ d + e ^ f / g ^ (h + i)

parses to the tree

    +( *( ^(a,b), ^(c,d) ), /( ^(e,f), ^(g,+(h,i)) ) )

while the last rule tells us that

    a - b - c

parses to -(-(a,b),c) rather than -(a,-(b,c)), whereas

    a ^ b ^ c

parses to ^(a, ^(b,c)) rather than ^(^(a,b), c).

Aside: I am assuming that the desired output of the parser is an abstract syntax tree (AST). The same considerations arise if the output is to be some other form such as reverse-polish notation (RPN), calls to an analyzer and code generator (for one-pass compilers), or a numerical result (as in a calculator). All the algorithms I present are easily modified for these forms of output.



Recursive-descent parsing

The idea of recursive-descent parsing is to transform each nonterminal of a grammar into a subroutine that will recognize exactly that nonterminal in the input.

Left recursive grammars, such as G, are unsuitable because a left-recursive production leads to an infinite recursion in the recursive-descent parser. While the parser may be partially correct, it may not terminate.

We can transform G to a non-left-recursive grammar G1 as follows:

    E --> P {B P}
    P --> v | "(" E ")" | U P
    B --> "+" | "-" | "*" | "/" | "^"
    U --> "-"

The braces "{" and "}" represent zero or more repetitions of what is inside of them. Thus you can think of E as having an infinity of alternatives:

    E --> P | P B P | P B P B P | ... ad infinitum

The language described by this grammar is the same as that of grammar G: L(G1) = L(G).

Not only is left recursion eliminated, but each choice can be made by looking at the next token in the input.

Let's look at a recursive descent recognizer based on this grammar. I call this algorithm a recognizer because all it does is recognize whether the input is in the language of the grammar or not. That is, it does not produce an abstract syntax tree, or any other form of output that represents the contents of the input.

I'll assume that the following subroutines exist:

- "next" returns the next token of input, or the special marker "end" to represent that there are no more input tokens. "next" does not alter the input stream.
- "consume" reads one token. When "next=end", consume is still allowed, but has no effect.
- "error" stops the parsing process and reports an error.

Using these, let's construct a subroutine "expect", which I will use throughout this essay:

    expect( tok ) is
        if next = tok
            consume
        else
            error

We will now write a subroutine called "Erecognizer". If it does not call "error", then the input was an expression according to the above grammar. If it does call "error", then the input contained a syntax error, e.g. unmatched parentheses, a missing operator or operand, etc.

    Erecognizer is
        E()
        expect( end )

    E is
        P
        while next is a binary operator
            consume
            P

    P is
        if next is a v
            consume
        else if next = "("
            consume
            E
            expect( ")" )
        else if next is a unary operator
            consume
            P
        else
            error



Notice how the structure of the recognition algorithm mirrors the structure of the grammar. This is the essence of recursive descent parsing.
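To make the recognizer concrete, here is one possible C++ rendering of the pseudocode. Several simplifications are assumptions of this sketch rather than part of the essay: every token is a single character, a letter or digit stands for v, blanks are dropped, the character '\0' plays the role of "end", and "error" is modelled by throwing std::runtime_error.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

class Recognizer {
public:
    explicit Recognizer(const std::string& s) {
        for (char c : s)                       // drop blanks up front
            if (!std::isspace(static_cast<unsigned char>(c))) input += c;
    }

    // Erecognizer: true when the whole input is an expression of G1.
    bool accepts() {
        pos = 0;
        try { E(); expect('\0'); return true; }
        catch (const std::runtime_error&) { return false; }
    }

private:
    std::string input;
    std::size_t pos = 0;

    char next() const { return pos < input.size() ? input[pos] : '\0'; }
    void consume() { if (pos < input.size()) ++pos; }
    [[noreturn]] void error() const { throw std::runtime_error("syntax error"); }
    void expect(char tok) { if (next() == tok) consume(); else error(); }

    static bool isBinary(char c) {
        return c == '+' || c == '-' || c == '*' || c == '/' || c == '^';
    }
    static bool isV(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    }

    void E() {                                 // E --> P {B P}
        P();
        while (isBinary(next())) { consume(); P(); }
    }
    void P() {                                 // P --> v | "(" E ")" | U P
        if (isV(next())) consume();
        else if (next() == '(') { consume(); E(); expect(')'); }
        else if (next() == '-') { consume(); P(); }
        else error();
    }
};
```

With these conventions Recognizer("a+b*(c-d)").accepts() is true, while inputs such as "a+" or "(a" are rejected.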

The difference between a recognizer and a parser is that a parser produces some kind of output that reflects the structure of the input. Next we will look at a way to modify the above recognition algorithm to be a parsing algorithm. It will build an AST, according to the precedence and associativity rules, using a method known as the "shunting yard" algorithm.

The shunting yard algorithm

The idea of the shunting yard algorithm is to keep operators on a stack until we are sure we have parsed both their operands. The operands are kept on a second stack. The shunting yard algorithm can be used to directly evaluate expressions as they are parsed (it is commonly used in electronic calculators for this task), to create a reverse Polish notation translation of an infix expression, or to create an abstract syntax tree. I'll create an abstract syntax tree, so my operand stacks will contain trees.

When parsing for example x*y+z, we push x on the operand stack, * on the operator stack, and y on the operand stack. When the + is read, we compare it to the top of the operator stack, which is *. Since the + has lower precedence than *, we know that both operands to the * have been read and, in fact, will be on top of the operand stack. The operands are popped, a new tree is built, *(x,y), and it is pushed on the operand stack. Then the + is pushed on the operator stack. At the end of an expression the remaining operators are put into trees with their operands and that is that.

In addition to "next", "consume", "end", "error", and "expect", which are explained in the previous section, I will assume that the following subroutines and constants exist:

- "binary" converts a token matched by B to an operator.
- "unary" converts a token matched by U to an operator. We require that functions "unary" and "binary" have disjoint ranges.
- "mkLeaf" converts a token matched by v to a tree.
- "mkNode" takes an operator and one or two trees and returns a tree.
- "push", "pop", "top": the usual stack operations.
- "empty": an empty stack.
- "sentinel" is a value that is not in the range of either unary or binary.

In the algorithm that follows I compare operators and the sentinel with a > sign. This comparison is defined as follows:

- binary(x) > binary(y), if x has higher precedence than y, or x is left associative and x and y have equal precedence.
- unary(x) > binary(y), if x has precedence higher or equal to y's.
- op > unary(y), never (where op is any unary or binary operator).
- sentinel > op, never (where op is any unary or binary operator).
- op > sentinel (where op is any unary or binary operator): This case doesn't arise.

Now we define the following subroutines:

Aside: I hope the pseudo-code notation is fairly clear. I'll just comment that I'm assuming that parameters are passed by reference, so only 2 stacks are created throughout the execution of EParser.

    Eparser is
        var operators : Stack of Operator <- empty
        var operands : Stack of Tree <- empty
        push( operators, sentinel )
        E( operators, operands )
        expect( end )
        return top( operands )

    E( operators, operands ) is
        P( operators, operands )
        while next is a binary operator
            pushOperator( binary(next), operators, operands )
            consume
            P( operators, operands )
        while top(operators) not= sentinel
            popOperator( operators, operands )

    P( operators, operands ) is
        if next is a v
            push( operands, mkLeaf( v ) )
            consume
        else if next = "("
            consume
            push( operators, sentinel )
            E( operators, operands )
            expect( ")" )
            pop( operators )
        else if next is a unary operator
            pushOperator( unary(next), operators, operands )
            consume
            P( operators, operands )
        else
            error

    popOperator( operators, operands ) is
        if top(operators) is binary
            const t1 <- pop( operands )
            const t0 <- pop( operands )
            push( operands, mkNode( pop(operators), t0, t1 ) )
        else
            push( operands, mkNode( pop(operators), pop(operands) ) )

    pushOperator( op, operators, operands ) is
        while top(operators) > op
            popOperator( operators, operands )
        push( operators, op )

The Shunting Yard Algorithm appears to have been invented by Edsger Dijkstra around 1960 in connection with one of the first Algol compilers.
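The following C++ sketch follows the shunting yard pseudocode for the operators of grammar G. It is only one way to fill in the details, and the choices are assumptions of this example: tokens are single characters, trees are kept as strings in prefix form such as "+(a,b)", the sentinel is an Op whose symbol is '\0', and unary minus is given the same precedence number as binary + and - so that the > relation from the text makes -a+b parse as +(-(a),b) and -a*b as -(*(a,b)).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Op { char sym; bool unary; };           // sym == '\0' is the sentinel

class ShuntingYard {
public:
    // Parse text and return the tree in prefix form.
    std::string parse(const std::string& text) {
        in.clear(); pos = 0; operators.clear(); operands.clear();
        for (char c : text)
            if (!std::isspace(static_cast<unsigned char>(c))) in += c;
        operators.push_back({'\0', false});
        E();
        expect('\0');
        assert(operands.size() == 1);
        return operands.back();
    }

private:
    std::string in;
    std::size_t pos = 0;
    std::vector<Op> operators;
    std::vector<std::string> operands;         // trees kept as strings

    char next() const { return pos < in.size() ? in[pos] : '\0'; }
    void consume() { if (pos < in.size()) ++pos; }
    [[noreturn]] static void error() { throw std::runtime_error("syntax error"); }
    void expect(char t) { if (next() == t) consume(); else error(); }

    static bool isBinary(char c) {
        return c == '+' || c == '-' || c == '*' || c == '/' || c == '^';
    }
    static int prec(const Op& o) {
        if (o.unary) return 1;                 // same level as binary + and -
        if (o.sym == '^') return 3;
        if (o.sym == '*' || o.sym == '/') return 2;
        return 1;
    }
    // The ">" relation: must top be reduced before op is pushed?
    static bool greater(const Op& top, const Op& op) {
        if (top.sym == '\0' || op.unary) return false;
        if (top.unary) return prec(top) >= prec(op);
        return prec(top) > prec(op) || (prec(top) == prec(op) && top.sym != '^');
    }
    void popOperator() {
        assert(!operands.empty());
        Op op = operators.back(); operators.pop_back();
        std::string t1 = operands.back(); operands.pop_back();
        if (op.unary) {
            operands.push_back(std::string(1, op.sym) + "(" + t1 + ")");
            return;
        }
        std::string t0 = operands.back(); operands.pop_back();
        operands.push_back(std::string(1, op.sym) + "(" + t0 + "," + t1 + ")");
    }
    void pushOperator(Op op) {
        while (greater(operators.back(), op)) popOperator();
        operators.push_back(op);
    }
    void E() {
        P();
        while (isBinary(next())) {
            pushOperator({next(), false});
            consume();
            P();
        }
        while (operators.back().sym != '\0') popOperator();
    }
    void P() {
        char c = next();
        if (std::isalnum(static_cast<unsigned char>(c))) {
            operands.push_back(std::string(1, c));
            consume();
        } else if (c == '(') {
            consume();
            operators.push_back({'\0', false});
            E();
            expect(')');
            operators.pop_back();              // discard the sentinel
        } else if (c == '-') {
            pushOperator({'-', true});
            consume();
            P();
        } else {
            error();
        }
    }
};
```

For x*y+z the sketch returns "+(*(x,y),z)", and a^b^c comes out as "^(a,^(b,c))", matching the associativity rules given earlier.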



The classic solution

The classic solution to recursive-descent parsing of expressions is to create a new nonterminal for each level of precedence as follows. G2:

    E --> T {( "+" | "-" ) T}
    T --> F {( "*" | "/" ) F}
    F --> P ["^" F]
    P --> v | "(" E ")" | "-" T

(The brackets [ and ] enclose an optional part of the production. As before, the braces { and } enclose parts of the productions that may be repeated 0 or more times. The unquoted parentheses ( and ) serve only to group elements in a production.)

Grammar G2 describes the same language as the previous two grammars: L(G2) = L(G1) = L(G)

The grammar is ambiguous; for example, -x*y has two parse trees. The ambiguity is resolved by staying in each loop (in the productions for E and T) as long as possible and by taking the option if possible (in the production for F). With that policy in place, all choices can be made by looking only at the next token of input.

Note that the left-associative and the right-associative operators are treated differently; left-associative operators are consumed in a loop, while right-associative operators are handled with right-recursive productions. This is to make the tree building a bit easier.

We can transform this grammar to a parser written in pseudo code.

    Eparser is
        var t : Tree
        t <- E
        expect( end )
        return t

    E is
        var t : Tree
        t <- T
        while next = "+" or next = "-"
            const op <- binary(next)
            consume
            const t1 <- T
            t <- mkNode( op, t, t1 )
        return t

    T is
        var t : Tree
        t <- F
        while next = "*" or next = "/"
            const op <- binary(next)
            consume
            const t1 <- F
            t <- mkNode( op, t, t1 )
        return t

    F is
        var t : Tree
        t <- P
        if next = "^"
            consume
            const t1 <- F
            return mkNode( binary("^"), t, t1)
        else
            return t

    P is
        var t : Tree
        if next = "("
            consume
            t <- E
            expect( ")" )
            return t
        else if next = "-"
            consume
            t <- T
            return mkNode( unary("-"), t)
        else if next is a v
            t <- mkLeaf( next )
            consume
            return t
        else
            error

It may be worthwhile to trace this algorithm on a few example inputs.

Although this is the classic solution, it has a few drawbacks:

- The size of the code is proportional to the number of precedence levels.
- The speed of the algorithm is proportional to the number of precedence levels.
- The number of precedence levels is built in.

When there are a large number of precedence levels, as in the C and C++ languages, the first two disadvantages become problematic. In Pascal the number of precedence levels was deliberately kept small because, I suspect, its designer, Niklaus Wirth, was aware of the shortcomings of this method when the number of precedence levels is large.

The size problem can be overcome by creating one subroutine that is parameterized by precedence level rather than writing a separate routine for each level. But the speed problem remains. Note that the number of calls to parse an expression consisting of a single identifier is proportional to the number of levels of precedence.

I'm not sure who invented what I am calling the classic algorithm.

Precedence climbing

A method that solves all the listed problems of the classic solution, while being simpler than the shunting-yard algorithm, is what I call "precedence climbing".

Consider the input sequence a ^ b * c + d + e

The E subroutine of the classic solution will deal with this by three calls to T, and by con-suming the 2 "+"s, building a tree

+(+(result of first call, result of second call), result of third call)

We say that this loop directly consumes the two "+" operators.

The precedence climbing algorithm has a similar loop, but it always directly consumes the first binary operator, then it consumes the next binary operator that is of lower precedence, then the next operator that is of lower precedence than that. When it consumes a left-associative operator, the same loop will also consume the next operator of equal precedence. Let me rewrite the example with operators written at different heights according to their precedence:

            +   +
        *
    ^
  a   b   c   d   e

One loop can consume all 4 operators, creating the tree

+(+(*(^(result of first call, result of second call), result of 3rd call), result of 4th call), result of 5th call)



Each operator is assigned a precedence number. To make things more interesting let's add a few more binary operators and use the following precedence tables:

Unary operators

Operator   Precedence
-          3

Binary operators

Operator   Precedence   Associativity
||         0            Left
&&         1            Left
=          2            Left
+, -       3            Left
*, /       4            Left
^          5            Right




We use the following grammar G3 in which nonterminal Exp is parameterized by a precedence level. The idea is that Exp(p) recognizes expressions which contain no binary operators (other than in parentheses) with precedence less than p.

E --> Exp(0)

Exp(p) --> P {B Exp(q)}

P --> U Exp(q) | "(" E ")" | v

B --> "+" | "-" | "*" | "/" | "^" | "||" | "&&" | "="

U --> "-"

The loop implied by the braces, { and }, in the production for Exp(p) presents a problem: when should the loop be exited? This choice is resolved as follows:

- If the next token is a binary operator and the precedence of that operator is greater than or equal to p, then the loop is (re)entered.

- Otherwise the loop is exited.

In the productions for Exp(p) and P, the recursive use of Exp is parameterized by a value q. So there is a second choice to resolve: how is q chosen? The value of q is chosen according to the previous operator:

- In the binary operator case:

  o if the binary operator is left associative, q = the precedence of the operator + 1,

  o if the binary operator is right associative, q = the precedence of the operator.

- After unary operators,

  o q = the precedence of the operator.

Consider what will happen in parsing the expression a * b - c * d - e * f = g * h - i * j - k * l. To make things clearer, I'll present this expression two-dimensionally to show the precedences of the operators:

2                         =
3         -       -               -       -
4     *       *       *       *       *       *
    a   b   c   d   e   f   g   h   i   j   k   l
      !   !       !       !

The call to Exp(0) will consume exactly the operators indicated by a "!". The sub-expressions a, b, c*d, e*f, and g*h-i*j-k*l will be parsed by calls to P, Exp(5), Exp(4), Exp(4) and Exp(3) respectively.

What about right-associative operators? Consider an expression

a^b^c

Because of the different way right-associative operators are treated, Exp(0) will only consume the first ^, as the second will be gobbled up by a recursive call to Exp(5).




A recursive-descent parser based on this method looks something like this:

Eparser is
   var t : Tree
   t <- Exp( 0 )
   expect( end )
   return t

Exp( p ) is
   var t : Tree
   t <- P
   while next is a binary operator and prec(binary(next)) >= p
      const op <- binary(next)
      consume
      const q <- case associativity(op)
                 of Right: prec( op )
                    Left:  1+prec( op )
      const t1 <- Exp( q )
      t <- mkNode( op, t, t1)
   return t

P is
   if next is a unary operator
      const op <- unary(next)
      consume
      const q <- prec( op )
      const t <- Exp( q )
      return mkNode( op, t )
   else if next = "("
      consume
      const t <- Exp( 0 )
      expect ")"
      return t
   else if next is a v
      const t <- mkLeaf( next )
      consume
      return t
   else
      error
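The pseudocode can be tried out directly. The following is only a minimal sketch in Python (not the language of this book), using the precedence tables given above. The tokenizer, the class name Parser and the representation of tree nodes as tuples are assumptions made for this illustration and are not part of the original article.

```python
import re

# Precedence tables from the text: operator -> (precedence, associativity).
BINARY = {
    "||": (0, "L"), "&&": (1, "L"), "=": (2, "L"),
    "+": (3, "L"), "-": (3, "L"),
    "*": (4, "L"), "/": (4, "L"),
    "^": (5, "R"),
}
UNARY = {"-": 3}

# Hypothetical token syntax: operators, parentheses, identifiers, integers.
TOKEN = re.compile(r"\s*(\|\||&&|[-+*/^=()]|[A-Za-z_]\w*|\d+)")


def tokenize(text):
    """Split the input into operator, parenthesis and operand tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def next(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self):
        self.pos += 1

    def expect(self, tok):
        if self.next() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.next()!r}")
        self.consume()

    def parse(self):                      # Eparser
        tree = self.exp(0)
        if self.next() is not None:       # expect( end )
            raise SyntaxError(f"unexpected token {self.next()!r}")
        return tree

    def exp(self, p):                     # Exp( p )
        t = self.primary()
        while self.next() in BINARY and BINARY[self.next()][0] >= p:
            op = self.next()
            self.consume()
            prec, assoc = BINARY[op]
            q = prec if assoc == "R" else prec + 1
            t = (op, t, self.exp(q))      # mkNode( op, t, t1 )
        return t

    def primary(self):                    # P
        tok = self.next()
        if tok in UNARY:
            self.consume()
            return (tok, self.exp(UNARY[tok]))
        if tok == "(":
            self.consume()
            t = self.exp(0)
            self.expect(")")
            return t
        if tok is not None and tok not in BINARY and tok != ")":
            self.consume()                # a leaf: variable or number
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")
```

Calling Parser("a ^ b * c + d + e").parse() builds the same nesting as the tree shown earlier, and Parser("a^b^c").parse() groups the second ^ under the first, which is the right-associative behaviour described above.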



I first saw this algorithm described by Keith Clarke in a posting to comp.compilers many years ago. Most recently I used it in a JavaCC parser for a subset of C++. I've also used it in a parser based on monadic parsing written in Haskell. I'd be happy to mail either grammar to anyone who is interested.




How to parse and scan through hex bytes

In general our parser has to be very simple:

We suppose that our file is stored in memory as a linked list. Each node of the linked list contains one hex value (one byte) of our file, e.g. 2F.

Our main problem is to "scan" this linked list and to shout "hello" when we find a series of hex values that forms a valid instruction, i.e. one we can translate into a mnemonic. HOW we check for a valid mnemonic is not important at this point; we discuss this later in Lesson 9 - Opcodes and Mnemonics (pages 108 ff).

Well, life could be easy. Imagine this:

Each hex value corresponds to exactly one mnemonic. Wow, how easy. So we just take each node in our linked list and translate it.

But reality looks different!

Realize this:

55 -> PUSH EBP

but

8B45 10 -> MOV EAX,DWORD PTR SS:[EBP+10]

Can you see our problem? The instructions have different sizes!

We will first discuss the theory and the pseudo-code for this problem. Please note that this short chapter will not cover theoretical topics like CF-grammars or similar.

Later I will give you assembly code for this in Lesson 6 - Parsing (pages 218 ff.). Linked lists in general can be found in Lesson 5 - Linked lists (pages 199 ff.).



So how could a pseudo-code look?

1. Initialise our linked list and set the pointer to the first node.

2. Add the hex value from the node to our stack.

3. Check whether the stack holds a working mnemonic.

4. If we found one: print this mnemonic, clear the stack, go to the next node and goto 2.

5. If not: go to the next node of the linked list and goto 2.

So this is easy. But will it work? Well, yes and no. In theory it looks good, but when you look at the pseudo-code you can see that there is no end condition! So in real life this code will crash when there are no more hex values.

Next, we have not checked whether our stack has grown beyond 15 values. If we cannot find a working mnemonic within these 15 values, something has gone wrong: either our parser does not work, or our opcode list is too short and contains no matching entry for these 15 nodes, or the file has some strange mnemonics in it!

Why do we have to check for these 15 values?

Because an instruction has a length between 1 and 15 bytes. So simple.

Let us suppose that we have a "good" file, where all hex values result in working code with no problems.




So an advanced pseudo-code would look like this:



This looks better. We have improved the parsing algorithm and added some important features:

- Checking if a single node is equal to a single opcode. We do this only if the stack is empty.

- Error-handling if we can not get a value from a node.

- Checking if our stack (which grows with each loop) contains a valid complex opcode.

- Checking if we reached the end of the linked list
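To make these points concrete, here is a small sketch of such a scanning loop in Python (used only for illustration; the real implementation in this book is written in MASM). It is a toy under stated assumptions: the input is a plain sequence of bytes instead of a linked list, and OPCODES is a hypothetical three-entry opcode list that maps complete byte sequences to text. A real opcode list is far larger and is not a simple lookup of whole instructions.

```python
MAX_INSTRUCTION_LENGTH = 15  # upper bound on an instruction, as stated in the text

# Hypothetical toy opcode list: complete byte sequence -> mnemonic text.
OPCODES = {
    bytes([0x55]): "PUSH EBP",
    bytes([0x8B, 0x45, 0x10]): "MOV EAX,DWORD PTR SS:[EBP+10]",
    bytes([0xC3]): "RETN",
}


def scan(data):
    """Walk the bytes once, collecting them on a stack until they match."""
    listing = []
    stack = bytearray()
    for value in data:                   # end condition: the input is exhausted
        stack.append(value)
        mnemonic = OPCODES.get(bytes(stack))
        if mnemonic is not None:         # the stack holds a known instruction
            listing.append(mnemonic)
            stack.clear()
        elif len(stack) >= MAX_INSTRUCTION_LENGTH:
            raise ValueError("no mnemonic found in 15 bytes: " + stack.hex())
    if stack:                            # leftover bytes that matched nothing
        raise ValueError("input ended inside an instruction: " + stack.hex())
    return listing
```

With this toy table, scan(bytes([0x55, 0x8B, 0x45, 0x10, 0xC3])) returns the three mnemonics in order, while a lone 8B byte raises an error because the input ends inside an instruction.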

Again we suppose that the linked list contains values which fit our opcode list. This means that all hex values (incl. combinations) can be translated somehow into a disassembled listing.

Can you feel our big problem?

If we miss an opcode, the parser may produce a wrong disassembly or stop in the middle of the disassembly!

This will be the main problem of one of the next chapters.

We will later give source code as a working main-frame parsing algorithm. Feel free to optimize this algorithm; ours is just a starting point, kept simple.

You now have some background knowledge about parsing and the problems it poses for us. There are whole books describing the parsing problem, but do not read them until you really need to know what CF-grammars are…

No, this pseudo-algorithm is certainly not the best, but it is simplified for easier understanding.

On the next page you will find an answer to a very common problem: how to parse the command line under MASM. This is where I leave you alone with your thoughts…




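A small algorithm which parses the commandline for a filename [16]

It can be used for example to open a txt file directly with your own editor, etc. It checks the following possibilities:

(1): AppName.exe

(2): "AppName.exe" CommandLine

(3): AppName.exe "CommandLine"

(4): "AppName.exe" "CommandLine"

Are there some more ways and/or errors?

code:

ProcessCommandLine PROC lpCmdLine:DWORD
    pushad
    mov     edi, lpCmdLine
    xor     ecx, ecx
    dec     ecx                     ; ecx = -1: no length limit for repnz scasb
    mov     al, [edi]               ; first character of the command line
    inc     edi
    .IF al == 22h                   ; application name is quoted
        repnz   scasb               ; skip to just past the closing quote
        inc     edi                 ; skip the separating space
        mov     al, [edi]
        .IF al == 22h               ; argument is quoted as well
            inc     edi             ; step over the opening quote
            push    edi             ; remember the start of the argument
            repnz   scasb           ; find the closing quote
            dec     edi
            mov     byte ptr [edi], 0   ; replace it with a terminator
            pop     edi
        .ENDIF
    .ELSE                           ; application name is not quoted
    @@: inc     edi
        mov     al, [edi]
        .IF !al                     ; end of string: no quoted argument
            popad
            xor     eax, eax        ; return 0
            ret
        .ENDIF
        cmp     al, 22h             ; look for the opening quote of the argument
        jnz     @B
        inc     edi                 ; step over the opening quote
        push    edi                 ; remember the start of the argument
        repnz   scasb               ; find the closing quote
        dec     edi
        mov     byte ptr [edi], 0   ; replace it with a terminator
        pop     edi
    .ENDIF
    mov     lpCmdLine, edi          ; keep the result across popad
    popad
    mov     eax, lpCmdLine          ; return a pointer to the argument
    ret
ProcessCommandLine ENDP

16. Source by Rennsemmel, http://board.win32asmcommunity.net/showthread.php?s=&threadid=7464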




A simple Hex-Dump algorithm

This is a small and easy algorithm to dump a file as hex values. It was coded by Hutch [17], so respect his work.

Original thread:

I prototyped this algo in PowerBASIC inline but as it was a simple port with only a few fiddles, I converted it to MASM notation as it may be useful to a few people.

The algo takes a file read into a buffer, its length and the buffer to write the hex dump to.

Important with this algo is to allocate the file length TIMES 4 as the destination buffer as the hex dump is longer than the original data.

The formatting imposed limitations on the efficiency of this algo: every second WORD size write is misaligned, which will reduce its speed, but it is a lot faster than the one I replaced and I could not see another way to maintain alignment without making the formatting unacceptable, so I kept it as it is.

Regards,

[email protected]

17. http://board.win32asmcommunity.net



code:

; #########################################################################

HexDump proc lpString:DWORD,lnString:DWORD,lpbuffer:DWORD

LOCAL lcnt:DWORD

push ebx

push esi

push edi

jmp over_table

align 16

hex_table:

db "00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F"

db "10","11","12","13","14","15","16","17","18","19","1A","1B","1C","1D","1E","1F"

db "20","21","22","23","24","25","26","27","28","29","2A","2B","2C","2D","2E","2F"

db "30","31","32","33","34","35","36","37","38","39","3A","3B","3C","3D","3E","3F"

db "40","41","42","43","44","45","46","47","48","49","4A","4B","4C","4D","4E","4F"

db "50","51","52","53","54","55","56","57","58","59","5A","5B","5C","5D","5E","5F"

db "60","61","62","63","64","65","66","67","68","69","6A","6B","6C","6D","6E","6F"

db "70","71","72","73","74","75","76","77","78","79","7A","7B","7C","7D","7E","7F"

db "80","81","82","83","84","85","86","87","88","89","8A","8B","8C","8D","8E","8F"

db "90","91","92","93","94","95","96","97","98","99","9A","9B","9C","9D","9E","9F"

db "A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","AA","AB","AC","AD","AE","AF"

db "B0","B1","B2","B3","B4","B5","B6","B7","B8","B9","BA","BB","BC","BD","BE","BF"

db "C0","C1","C2","C3","C4","C5","C6","C7","C8","C9","CA","CB","CC","CD","CE","CF"

db "D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","DA","DB","DC","DD","DE","DF"




db "E0","E1","E2","E3","E4","E5","E6","E7","E8","E9","EA","EB","EC","ED","EE","EF"

db "F0","F1","F2","F3","F4","F5","F6","F7","F8","F9","FA","FB","FC","FD","FE","FF"

over_table:

lea ebx, hex_table ; get base address of table

mov esi, lpString ; address of source string

mov edi, lpbuffer ; address of output buffer

mov eax, esi

add eax, lnString

mov ecx, eax ; exit condition for byte read

mov lcnt, 0

xor eax, eax ; prevent stall

; %%%%%%%%%%%%%%%%%%%%%%% loop code %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

hxlp:

mov al, [esi] ; get BYTE

inc esi

inc lcnt

mov dx, [ebx+eax*2] ; put WORD from table into DX

mov [edi], dx ; write 2 byte string to buffer

add edi, 2

mov BYTE PTR [edi], 32 ; add space

inc edi

cmp lcnt, 8 ; test for half to add "-"

jne @F

mov WORD PTR [edi], " -"

add edi, 2



@@:

cmp lcnt, 16 ; break line at 16 characters

jne @F

dec edi ; overwrite last space

mov WORD PTR [edi], 0A0Dh ; write CRLF to buffer

add edi, 2

mov lcnt, 0

@@:

cmp esi, ecx ; test exit condition

jb hxlp ; unsigned compare, since esi and ecx hold addresses

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mov BYTE PTR [edi], 0 ; append terminator at the current write position

pop edi

pop esi

pop ebx

ret

HexDump endp

; #########################################################################
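To see the output layout without assembling anything, the formatting rules of the routine (two hex digits and a space per byte, a "- " after the eighth byte, CRLF in place of the trailing space after the sixteenth) can be sketched in a few lines of Python. This is only an illustration of the intended layout as read from the listing above. The function name hex_dump is an assumption, and none of the table-lookup speed tricks of the MASM version are reproduced.

```python
def hex_dump(data):
    """Format bytes 16 per line, with a '- ' separator after the 8th byte."""
    out = []
    count = 0
    for value in data:
        out.append(f"{value:02X} ")          # two hex digits plus a space
        count += 1
        if count == 8:
            out.append("- ")                 # mark the middle of the line
        elif count == 16:
            out[-1] = out[-1][:-1] + "\r\n"  # overwrite the last space with CRLF
            count = 0
    return "".join(out)
```

A full line takes 51 characters for 16 input bytes, which is why a destination buffer of four times the file length, as recommended above, is large enough.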




Lesson 9 - Opcodes and Mnemonics [18]

This chapter discusses the low-level implementation of the 80x86 instruction set. It describes how the Intel engineers decided to encode the instructions in a numeric format (suitable for storage in memory) and it discusses the trade-offs they had to make when designing the CPU. This chapter also presents a historical background of the design effort so you can better understand the compromises they had to make.

18. This chapter is part of “Art Of Assembly” (http://webster.cs.ucr.edu/Page_AoAWin/HTML/ISA.html#1013164) - The best free assembly-book



The Importance of the Design of the Instruction Set

In this chapter we will be exploring one of the most interesting and important aspects of CPU design: the design of the CPU's instruction set. The instruction set architecture (or ISA) is one of the most important design issues that a CPU designer must get right from the start. Features like caches, pipelining, superscalar implementation, etc., can all be grafted on to a CPU design long after the original design is obsolete. However, it is very difficult to change the instructions a CPU executes once the CPU is in production and people are writing software that uses those instructions. Therefore, one must carefully choose the instructions for a CPU.

You might be tempted to take the "kitchen sink" approach to instruction set design and include as many instructions as you can dream up in your instruction set. This approach fails for several reasons we'll discuss in the following paragraphs. Instruction set design is the epitome of compromise management. Good CPU design is the process of selecting what to throw out rather than what to leave in. It's easy enough to say "let's include everything." The hard part is deciding what to leave out once you realize you can't put everything on the chip.

Nasty reality #1: Silicon real estate. The first problem with "putting it all on the chip" is that each feature requires some number of transistors on the CPU's silicon die. CPU designers work with a "silicon budget" and are given a finite number of transistors to work with. This means that there aren't enough transistors to support "putting all the features" on a CPU. The original 8086 processor, for example, had a transistor budget of less than 30,000 transistors. The Pentium III processor had a budget of over eight million transistors. These two budgets reflect the differences in semiconductor technology in 1978 vs. 1998.

Nasty reality #2: Cost. Although it is possible to use millions of transistors on a CPU today, the more transistors you use the more expensive the CPU. Pentium IV processors, for example, cost hundreds of dollars (circa 2002). A CPU with only 30,000 transistors (also circa 2002) would cost only a few dollars. For low-cost systems it may be more important to shave some features and use fewer transistors, thus lowering the CPU's cost.

Nasty reality #3: Expandability. One problem with the "kitchen sink" approach is that it's very difficult to anticipate all the features people will want. For example, Intel's MMX and SIMD instruction enhancements were added to make multimedia programming more practical on the Pentium processor. Back in 1978 very few people could have possibly anticipated the need for these instructions.




Nasty reality #4: Legacy Support. This is almost the opposite of expandability. Often it is the case that an instruction the CPU designer feels is important turns out to be less useful than anticipated. For example, the LOOP instruction on the 80x86 CPU sees very little use in modern high-performance programs. The 80x86 ENTER instruction is another good example. When designing a CPU using the "kitchen sink" approach, it is common to discover that programs almost never use some of the available instructions. Unfortunately, you cannot easily remove instructions in later versions of a processor because this will break some existing programs that use those instructions. Generally, once you add an instruction you have to support it forever in the instruction set. Unless very few programs use the instruction (and you're willing to let them break) or you can automatically simulate the instruction in software, removing instructions is a very difficult thing to do.

Nasty reality #5: Complexity. The popularity of a new processor is easily measured by how much software people write for that processor. Most CPU designs die a quick death because no one writes software specific to that CPU. Therefore, a CPU designer must consider the assembly programmers and compiler writers who will be using the chip upon introduction. While a "kitchen sink" approach might seem to appeal to such programmers, the truth is no one wants to learn an overly complex system. If your CPU does everything under the sun, this might appeal to someone who is already familiar with the CPU. However, pity the poor soul who doesn't know the chip and has to learn it all at once.

These problems with the "kitchen sink" approach all have a common solution: design a simple instruction set to begin with and leave room for later expansion. This is one of the main reasons the 80x86 has proven to be so popular and long-lived. Intel started with a relatively simple CPU and figured out how to extend the instruction set over the years to accommodate new features.



Basic Instruction Design Goals

In a typical Von Neumann architecture CPU, the computer encodes CPU instructions as numeric values and stores these numeric values in memory. The encoding of these instructions is one of the major tasks in instruction set design and requires very careful thought.

To encode an instruction we must pick a unique numeric opcode value for each instruction (clearly, two different instructions cannot share the same numeric value or the CPU will not be able to differentiate them when it attempts to decode the opcode value). With an n-bit number, there are 2^n different possible opcodes, so to encode m instructions you will need an opcode that is at least log2(m) bits long, rounded up to a whole number of bits.
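As a quick check of this arithmetic, the smallest opcode width for a given number of instructions can be computed as in the following Python sketch. The function name min_opcode_bits is made up for this illustration.

```python
def min_opcode_bits(instruction_count):
    """Smallest n with 2**n >= instruction_count, i.e. log2 rounded up."""
    if instruction_count < 1:
        raise ValueError("need at least one instruction")
    # (m - 1).bit_length() is the number of bits needed to count 0 .. m-1.
    return (instruction_count - 1).bit_length()
```

For example, 128 instructions fit exactly into seven bits, while a 129th instruction forces an eighth bit.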

Encoding opcodes is a little more involved than assigning a unique numeric value to each instruction. Remember, we have to use actual hardware (i.e., decoder circuits) to figure out what each instruction does and command the rest of the hardware to do the specified task. Suppose you have a seven-bit opcode. With an opcode of this size we could encode 128 different instructions. To decode each instruction individually requires a seven-line to 128-line decoder - an expensive piece of circuitry. Assuming our instructions contain certain patterns, we can reduce the hardware by replacing this large decoder with three smaller decoders.

If you have 128 truly unique instructions, there's little you can do other than to decode each instruction individually. However, in most architectures the instructions are not completely independent of one another. For example, on the 80x86 CPUs the opcodes for "mov( eax, ebx );" and "mov( ecx, edx );" are different (because these are different instructions) but these instructions are not unrelated. They both move data from one register to another. In fact, the only difference between them is the source and destination operands. This suggests that we could encode instructions like MOV with a sub-opcode and encode the operands using other strings of bits within the opcode.




For example, if we really have only eight instructions, each instruction has two operands, and each operand can be one of four different values, then we can encode the opcode as three packed fields containing three, two, and two bits (see Figure 5.1). This encoding only requires the use of three simple decoders to completely determine what instruction the CPU should execute. While this is a bit of a trivial case, it does demonstrate one very important facet of instruction set design - it is important to make opcodes easy to decode and the easiest way to do this is to break up the opcode into several different bit fields, each field contributing part of the information necessary to execute the full instruction. The smaller these bit fields, the easier it will be for the hardware to decode and execute them.

Figure 5.1 Separating an Opcode into Separate Fields to Ease Decoding
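The packed-field idea can be tried out in software. The following Python sketch packs a three-bit instruction number and two two-bit operand numbers into one seven-bit opcode and splits it again. The order of the fields (instruction in the high bits, then the first and the second operand) is an assumption made for this illustration; the text only fixes the field widths.

```python
def encode(instruction, operand1, operand2):
    """Pack the three fields as iii aa bb (7 bits in total)."""
    if not (0 <= instruction < 8 and 0 <= operand1 < 4 and 0 <= operand2 < 4):
        raise ValueError("field out of range")
    return (instruction << 4) | (operand1 << 2) | operand2


def decode(opcode):
    """Split a 7-bit opcode into (instruction, operand1, operand2)."""
    return (opcode >> 4) & 0b111, (opcode >> 2) & 0b11, opcode & 0b11
```

Each field is extracted with its own shift and mask, independently of the other two, which is the software counterpart of using three small decoders instead of one large one.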



Although Intel probably went a little overboard with the design of the original 8086 instruction set, an important design goal is to keep instruction sizes within a reasonable range. CPUs with unnecessarily long instructions consume extra memory for their programs. This tends to create more cache misses and, therefore, hurts the overall performance of the CPU. Therefore, we would like our instructions to be as compact as possible so our programs' code uses as little memory as possible.

It would seem that if we are encoding 2^n different instructions using n bits, there would be very little leeway in choosing the size of the instruction. It's going to take n bits to encode those 2^n instructions, you can't do it with any fewer. You may, of course, use more than n bits; and believe it or not, that's the secret to reducing the size of a typical program on the CPU.

Before discussing how to use longer instructions to generate shorter programs, a short digression is necessary. The first thing to note is that we generally cannot choose an arbitrary number of bits for our opcode length. Assuming that our CPU is capable of reading bytes from memory, the opcode will probably have to be some even multiple of eight bits long. If the CPU is not capable of reading bytes from memory (e.g., most RISC CPUs only read memory in 32 or 64 bit chunks) then the opcode is going to be the same size as the smallest object the CPU can read from memory at one time (e.g., 32 bits on a typical RISC chip). Any attempt to shrink the opcode size below this data bus enforced lower limit is futile. Since we're discussing the 80x86 architecture in this text, we'll work with opcodes that must be an even multiple of eight bits long.

Another point to consider here is the size of an instruction's operands. Some CPU designers (specifically, RISC designers) include all operands in their opcode. Other CPU designers (typically CISC designers) do not count operands like immediate constants or address displacements as part of the opcode (though they do usually count register operand encodings as part of the opcode). We will take the CISC approach here and not count immediate constant or address displacement values as part of the actual opcode.




With an eight-bit opcode you can only encode 256 different instructions. Even if we don't count the instruction's operands as part of the opcode, having only 256 different instructions is somewhat limiting. It's not that you can't build a CPU with an eight-bit opcode (most of the eight-bit processors predating the 8086 had eight-bit opcodes); it's just that modern processors tend to have far more than 256 different instructions. The next step up is a two-byte opcode. With a two-byte opcode we can have up to 65,536 different instructions (which is probably enough) but our instructions have doubled in size (not counting the operands, of course).

If reducing the instruction size is an important design goal we can employ some techniques from data compression theory to reduce the average size of our instructions. The basic idea is this: first we analyze programs written for our CPU (or a CPU similar to ours if no one has written any programs for our CPU) and count the number of occurrences of each opcode in a large number of typical applications. We then create a sorted list of these opcodes from most-frequently-used to least-frequently-used. Then we attempt to design our instruction set using one-byte opcodes for the most-frequently-used instructions, two-byte opcodes for the next set of most-frequently-used instructions, and three (or more) byte opcodes for the rarely used instructions. Although our maximum instruction size is now three or more bytes, most of the actual instructions appearing in a program will use one or two byte opcodes, so the average opcode length will be somewhere between one and two bytes (let's call it 1.5 bytes) and a typical program will be shorter than had we chosen a two byte opcode for all instructions (see Figure 5.2).



Figure 5.2 Encoding Instructions Using a Variable-Length Opcode
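The figure itself is not reproduced here, so the following Python sketch rests on an assumed prefix scheme that is merely consistent with the opcode counts quoted in this chapter (blocks of 64 one-byte opcodes, 8,192 two-byte and 2,097,152 three-byte opcodes): the opcode is one byte long if the two high-order bits of the first byte are not both zero, two bytes long if the three high-order bits are 001, and three bytes long if they are 000. The usage shares passed to average_length are invented numbers, chosen only to show how an average of 1.5 bytes can come about.

```python
def opcode_length(first_byte):
    """Opcode size in bytes under the assumed prefix scheme."""
    if first_byte >> 6 != 0:        # top two bits are not both zero
        return 1
    if first_byte >> 5 == 0b001:    # top three bits are 001
        return 2
    return 3                        # top three bits are 000


def average_length(shares):
    """Weighted mean opcode length; shares maps length in bytes -> usage share."""
    total = sum(shares.values())
    return sum(length * share for length, share in shares.items()) / total
```

With hypothetical shares of 60% one-byte, 30% two-byte and 10% three-byte opcodes, average_length({1: 60, 2: 30, 3: 10}) gives 1.5 bytes, compared with 2 bytes for a fixed two-byte opcode.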




Although using variable-length instructions allows us to create smaller programs, it comes at a price. First of all, decoding the instructions is a bit more complicated. Before decoding an instruction field, the CPU must first decode the instruction's size. This extra step consumes time and may affect the overall performance of the CPU (by introducing delays in the decoding step and, thereby, limiting the maximum clock frequency of the CPU). Another problem with variable length instructions is that they make decoding multiple instructions in a pipeline quite difficult (since we cannot trivially determine the instruction boundaries in the prefetch queue). These reasons, along with some others, are why most popular RISC architectures avoid variable-sized instructions. However, for our purpose, we'll go with a variable length approach since saving memory is an admirable goal.

Before actually choosing the instructions you want to implement in your CPU, now would be a good time to plan for the future. Undoubtedly, you will discover the need for new instructions at some point in the future, so reserving some opcodes specifically for that purpose is a really good idea. If you were using the instruction encoding appearing in Figure 5.2 for your opcode format, it might not be a bad idea to reserve one block of 64 one-byte opcodes, half (4,096) of the two-byte instructions, and half (1,048,576) of the three-byte opcodes for future use. In particular, giving up 64 of the very valuable one-byte opcodes may seem extravagant, but history suggests that such foresight is rewarded.

The next step is to choose the instructions you want to implement. Note that although we've reserved nearly half the instructions for future expansion, we don't actually have to implement instructions for all the remaining opcodes. We can choose to leave a good number of these instructions unimplemented (and effectively reserve them for the future as well). The right approach is not to see how quickly we can use up all the opcodes, but rather to ensure that we have a consistent and complete instruction set given the compromises we have to live with (e.g., silicon limitations). The main point to keep in mind here is that it's much easier to add an instruction later than it is to remove an instruction later. So for the first go-around, it's generally better to go with a simpler design rather than a more complex design.



The first step is to choose some generic instruction types. For a first attempt, you should limit the instructions to some well-known and common instructions. The best place to look for help in choosing these instructions is the instruction sets of other processors. For example, most processors you find will have instructions like the following:

- Data movement instructions (e.g., MOV)

- Arithmetic and logical instructions (e.g., ADD, SUB, AND, OR, NOT)

- Comparison instructions

- A set of conditional jump instructions (generally used after the compare instructions)

- Input/Output instructions

- Other miscellaneous instructions

Your goal as the designer of the CPU's initial instruction set is to choose a reasonable set of instructions that will allow programmers to efficiently write programs (using as few instructions as possible) without adding so many instructions you exceed your silicon budget or violate other system compromises. This is a very strategic decision, one that CPU designers should base on careful research, experimentation, and simulation. The job of the CPU designer is not to create the best instruction set, but to create an instruction set that is optimal given all the constraints.

Once you've decided which instructions you want to include in your (initial) instruction set, the next step is to assign opcodes for them. The first step is to group your instructions into sets by common characteristics of those instructions. For example, an ADD instruction is probably going to support the exact same set of operands as the SUB instruction. So it makes sense to put these two instructions into the same group. On the other hand, the NOT instruction generally requires only a single operand, as does a NEG instruction. So you'd probably put these two instructions in the same group but a different group than ADD and SUB.




Once you've grouped all your instructions, the next step is to encode them. A typical encoding will use some bits to select the group the instruction falls into, it will use some bits to select a particular instruction from that group, and it will use some bits to determine the types of operands the instruction allows (e.g., registers, memory locations, and constants). The number of bits needed to encode all this information may have a direct impact on the instruction's size, regardless of the frequency of the instruction. For example, if you need two bits to select a group, four bits to select an instruction within that group, and six bits to specify the instruction's operand types, you're not going to fit this instruction into an eight-bit opcode. On the other hand, if all you really want to do is push one of eight different registers onto the stack, you can use four bits to select the PUSH instruction and three bits to select the register (assuming the encoding in Figure 5.2, the eighth and H.O. bit would have to contain zero).

Encoding operands is always a problem because many instructions allow a large number of operands. For example, the generic 80x86 MOV instruction requires a two-byte opcode. However, Intel noticed that the "mov( disp, eax );" and "mov( eax, disp );" instructions occurred very frequently. So they created a special one byte version of this instruction to reduce its size and, therefore, the size of those programs that use this instruction frequently. Note that Intel did not remove the two-byte versions of these instructions. They have two different instructions that will store EAX into memory or load EAX from memory. A compiler or assembler would always emit the shorter of the two instructions when given an option of two or more instructions that wind up doing exactly the same thing.

Notice an important trade-off Intel made with the MOV instruction. They gave up an extra opcode in order to provide a shorter version of one of the MOV instructions. Actually, Intel used this trick all over the place to create shorter and easier to decode instructions. Back in 1978 this was a good compromise (reducing the total number of possible instructions while also reducing the program size). Today, a CPU designer would probably want to use those redundant opcodes for a different purpose; however, Intel's decision was reasonable at the time (given the high cost of memory in 1978).

To further this discussion, we need to work with an example. So the next section will go through the process of designing a very simple instruction set as a means of demonstrating this process.



The Y86 Hypothetical Processor Because of enhancements made to the 80x86 processor family over the years, Intel's designgoals in 1978, and advances in computer architecture occurring over the years, the encodingof 80x86 instructions is very complex and somewhat illogical. Therefore, the 80x86 is not agood candidate for an example architecture when discussing how to design and encode aninstruction set. However, since this is a text about 80x86 assembly language programming,attempting to present the encoding for some simpler real-world processor doesn't makesense. Therefore, we will discuss instruction set design in two stages: first, we will develop asimple (trivial) instruction set for a hypothetical processor that is a small subset of the 80x86,then we will expand our discussion to the full 80x86 instruction set. Our hypothetical proces-sor is not a true 80x86 CPU, so we will call it the Y86 processor to avoid any accidental asso-ciation with the Intel x86 family.

The Y86 processor is a very stripped down version of the x86 CPUs. First of all, the Y86 onlysupports one operand size - 16 bits. This simplification frees us from having to encode thesize of the operand as part of the opcode (thereby reducing the total number of opcodes wewill need). Another simplification is that the Y86 processor only supports four 16-bit registers:AX, BX, CX, and DX. This lets us encode register operands with only two bits (versus thethree bits the 80x86 family requires to encode eight registers). Finally, the Y86 processorsonly support a 16-bit address bus with a maximum of 65,536 bytes of addressable memory.These simplifications, plus a very limited instruction set will allow us to encode all Y86instructions using a single byte opcode and a two-byte displacement/offset (if needed).

The Y86 CPU provides 20 instructions. Seven of these instructions have two operands, eight of these instructions have a single operand, and five instructions have no operands at all. The instructions are MOV (two forms), ADD, SUB, CMP, AND, OR, NOT, JE, JNE, JB, JBE, JA, JAE, JMP, BRK, IRET, HALT, GET, and PUT. The following paragraphs describe how each of these work.




The MOV instruction is actually two instruction classes merged into the same instruction. The two forms of the MOV instruction are:

mov( reg/memory/constant, reg );

mov( reg, memory );

where reg is any of AX, BX, CX, or DX; constant is a numeric constant (using hexadecimal notation), and memory is an operand specifying a memory location. The next section describes the possible forms the memory operand can take. The "reg/memory/constant" operand tells you that this particular operand may be a register, memory location, or a constant.

The arithmetic and logical instructions take the following forms:

add( reg/memory/constant, reg );

sub( reg/memory/constant, reg );

cmp( reg/memory/constant, reg );

and( reg/memory/constant, reg );

or( reg/memory/constant, reg );

not( reg/memory );

Note: the NOT instruction appears separately because it is in a different class than the other arithmetic instructions (since it supports only a single operand).

The ADD instruction adds the value of the first operand to the second (register) operand, leaving the sum in the second (register) operand. The SUB instruction subtracts the value of the first operand from the second, leaving the difference in the second operand. The CMP instruction compares the first operand against the second and saves the result of this comparison for use with one of the conditional jump instructions (described in a moment). The AND and OR instructions compute the corresponding bitwise logical operation on the two operands and store the result into the second (register) operand. The NOT instruction inverts the bits in the single memory or register operand.



The control transfer instructions interrupt the sequential execution of instructions in memory and transfer control to some other point in memory either unconditionally, or after testing the result of the previous CMP instruction. These instructions include the following:

ja dest; -- Jump if above (i.e., greater than)

jae dest; -- Jump if above or equal (i.e., greater than or equal)

jb dest; -- Jump if below (i.e., less than)

jbe dest; -- Jump if below or equal (i.e., less than or equal)

je dest; -- Jump if equal

jne dest; -- Jump if not equal

jmp dest; -- Unconditional jump

iret; -- Return from an interrupt

The first six instructions let you check the result of the previous CMP instruction for greater than, greater or equal, less than, less or equal, equality, or inequality [6]. For example, if you compare the AX and BX registers with a "cmp( ax, bx );" instruction and execute the JA instruction, the Y86 CPU will jump to the specified destination location if AX was greater than BX. If AX was not greater than BX, control will fall through to the next instruction in the program.

The JMP instruction unconditionally transfers control to the instruction at the destination address. The IRET instruction returns control from an interrupt service routine, which we will discuss later.
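To make the relationship between CMP and the jumps concrete, here is a minimal sketch in Python (not Y86 code). It assumes the comparison is unsigned, which is what "above" and "below" normally mean, and it models the saved result as two flags; the names cmp16, CONDITIONS, and jump_taken are made up for this illustration.

```python
# Illustrative model of CMP and the Y86 jump conditions (hypothetical names).

def cmp16(left, right):
    """Model cmp( left, right ): record 'less than' and 'equal' flags."""
    left &= 0xFFFF    # Y86 operands are 16 bits wide
    right &= 0xFFFF
    return {"lt": left < right, "eq": left == right}

# Each mnemonic maps to a test on the flags saved by the last CMP.
CONDITIONS = {
    "ja":  lambda f: not f["lt"] and not f["eq"],
    "jae": lambda f: not f["lt"],
    "jb":  lambda f: f["lt"],
    "jbe": lambda f: f["lt"] or f["eq"],
    "je":  lambda f: f["eq"],
    "jne": lambda f: not f["eq"],
    "jmp": lambda f: True,          # unconditional: ignores the flags
}

def jump_taken(mnemonic, flags):
    """Return True if the jump transfers control, False if it falls through."""
    return CONDITIONS[mnemonic](flags)

flags = cmp16(0x0005, 0x0003)       # cmp( ax, bx ) with AX=5 and BX=3
print(jump_taken("ja", flags))      # AX is above BX, so JA is taken
```

With the operands swapped, the same JA test reports a fall-through instead.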

The GET and PUT instructions let you read and write integer values. GET will stop and prompt the user for a hexadecimal value and then store that value into the AX register. PUT displays (in hexadecimal) the value of the AX register.

The remaining instructions, HALT and BRK, do not require any operands. HALT terminates program execution and BRK stops the program in a state from which it can be restarted.

The Y86 processors require a unique opcode for every different instruction, not just the instruction classes. Although "mov( bx, ax );" and "mov( cx, ax );" are both in the same class, they must have different opcodes if the CPU is to differentiate them. However, before looking at all the possible opcodes, perhaps it would be a good idea to learn about all the possible operands for these instructions.




Addressing Modes on the Y86

The Y86 instructions use five different operand types: registers, constants, and three memory addressing schemes. Each form is called an addressing mode. The Y86 processor supports the register addressing mode [7], the immediate addressing mode, the indirect addressing mode, the indexed addressing mode, and the direct addressing mode. The following paragraphs explain each of these modes.

Register operands are the easiest to understand. Consider the following forms of the MOV instruction:

mov( ax, ax );

mov( bx, ax );

mov( cx, ax );

mov( dx, ax );

The first instruction accomplishes absolutely nothing. It copies the value from the AX register back into the AX register. The remaining three instructions copy the values of BX, CX and DX into AX. Note that these instructions leave BX, CX, and DX unchanged. The second operand (the destination) is not limited to AX; you can move values to any of these registers.

Constants are also pretty easy to deal with. Consider the following instructions:

mov( 25, ax );

mov( 195, bx );

mov( 2056, cx );

mov( 1000, dx );

These instructions are all pretty straightforward; they load their respective registers with the specified hexadecimal constant [8].



There are three addressing modes which deal with accessing data in memory. The following instructions demonstrate the use of these addressing modes:

mov( [1000], ax );

mov( [bx], ax );

mov( [1000+bx], ax );

The first instruction above uses the direct addressing mode to load AX with the 16-bit value stored in memory starting at location $1000.

The "mov( [bx], ax );" instruction loads AX from the memory location specified by the contents of the BX register. This is an indirect addressing mode. Rather than using the value in BX, this instruction accesses the memory location whose address appears in BX. Note that the following two instructions:

mov( 1000, bx );

mov( [bx], ax );

are equivalent to the single instruction:

mov( [1000], ax );

Of course, the single instruction is preferable. However, there are many cases where the use of indirection is faster, shorter, and better. We'll see some examples of this a little later.

The last memory addressing mode is the indexed addressing mode. An example of this memory addressing mode is

mov( [1000+bx], ax );

This instruction adds the contents of BX to $1000 to produce the address of the memory value to fetch. This instruction is useful for accessing elements of arrays, records, and other data structures.
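The three memory addressing modes differ only in how the CPU arrives at the address. The following Python sketch shows one way to model that; the mode names, the regs dictionary, and the 64K bytearray are illustrative assumptions, and the wrap-around of addresses at 16 bits is our guess at what a 16-bit address bus would do.

```python
# Sketch of how the three Y86 memory operand forms resolve to an address.
# All names here are hypothetical helpers, not part of any real tool.

def effective_address(mode, regs, disp=0):
    """Return the 16-bit address that a memory operand refers to."""
    if mode == "direct":        # [disp], e.g. [1000]
        addr = disp
    elif mode == "indirect":    # [bx]
        addr = regs["bx"]
    elif mode == "indexed":     # [disp+bx], e.g. [1000+bx]
        addr = disp + regs["bx"]
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return addr & 0xFFFF        # assumed: addresses wrap within 64K

def load16(memory, addr):
    """Fetch a 16-bit value stored L.O. byte first, starting at addr."""
    return memory[addr] | (memory[(addr + 1) & 0xFFFF] << 8)

memory = bytearray(0x10000)     # 65,536 bytes of addressable memory
memory[0x1000] = 0x34           # L.O. byte
memory[0x1001] = 0x12           # H.O. byte
regs = {"bx": 0x1000}

# mov( [1000], ax ) and mov( [bx], ax ) fetch the same word when BX=$1000.
ax_direct = load16(memory, effective_address("direct", regs, 0x1000))
ax_indirect = load16(memory, effective_address("indirect", regs))
```

Both ax_direct and ax_indirect end up holding $1234, which mirrors the equivalence between the two-instruction sequence and the single direct-mode instruction described above.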




Encoding Y86 Instructions

Although we could arbitrarily assign opcodes to each of the Y86 instructions, keep in mind that a real CPU uses logic circuitry to decode the opcodes and act appropriately on them. A typical CPU opcode uses a certain number of bits in the opcode to denote the instruction class (e.g., MOV, ADD, SUB), and a certain number of bits to encode each of the operands.

A typical Y86 instruction takes the form shown in Figure 5.3. The basic instruction is either one or three bytes long. The instruction opcode consists of a single byte that contains three fields. The first field, the H.O. three bits, defines the instruction. This provides eight combinations. As you may recall, there are 20 different instructions; we cannot encode 20 instructions with three bits, so we'll have to pull some tricks to handle the other instructions. As you can see in Figure 5.3, the basic opcode encodes the MOV instructions (two instructions, one where the rr field specifies the destination, one where the mmm field specifies the destination), and the ADD, SUB, CMP, AND, and OR instructions. There is one additional instruction field: special. The special instruction class provides a mechanism that allows us to expand the number of available instruction classes; we will return to this expansion opcode shortly.

Figure 5.3 Basic Y86 Instruction Encoding



To determine a particular instruction's opcode, you need only select the appropriate bits for the iii, rr, and mmm fields. The rr field contains the destination register (except for the MOV instruction whose iii field is %111) and the mmm field encodes the source operand. For example, to encode the "mov( bx, ax );" instruction you would select iii=110 ("mov( reg, reg );"), rr=00 (AX), and mmm=001 (BX). This produces the one-byte instruction %11000001 or $C1.

Some Y86 instructions require more than one byte. For example, the instruction "mov( [1000], ax );" loads the AX register from memory location $1000. The encoding for the opcode is %11000110 or $C6. However, the encoding for the "mov( [2000], ax );" instruction's opcode is also $C6. Clearly these two instructions do different things: one loads the AX register from memory location $1000 while the other loads the AX register from memory location $2000. To encode an address for the [xxxx] or [xxxx+bx] addressing modes, or to encode the constant for the immediate addressing mode, you must follow the opcode with the 16-bit address or constant, with the L.O. byte immediately following the opcode in memory and the H.O. byte after that. So the three-byte encoding for "mov( [1000], ax );" would be $C6, $00, $10 and the three-byte encoding for "mov( [2000], ax );" would be $C6, $00, $20.
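The packing rule just described (three bits of iii, two bits of rr, three bits of mmm, then an optional 16-bit value with the L.O. byte first) can be sketched in a few lines of Python. The function and constant names are ours, chosen for illustration; the field values are the ones this chapter's worked examples use.

```python
# Minimal sketch of Y86 opcode packing (hypothetical helper names).

REG = {"ax": 0b00, "bx": 0b01, "cx": 0b10, "dx": 0b11}          # rr field
MMM_REG = {"ax": 0b000, "bx": 0b001, "cx": 0b010, "dx": 0b011}  # mmm: register
MMM_INDIRECT = 0b100   # [bx]
MMM_INDEXED = 0b101    # [xxxx+bx]
MMM_DIRECT = 0b110     # [xxxx]
MMM_CONSTANT = 0b111   # immediate constant

def encode(iii, rr, mmm, operand=None):
    """Pack the three fields into an opcode byte, then append the
    16-bit address or constant (L.O. byte first) when mmm calls for one."""
    opcode = (iii << 5) | (rr << 3) | mmm
    needs_operand = mmm in (MMM_INDEXED, MMM_DIRECT, MMM_CONSTANT)
    if needs_operand != (operand is not None):
        raise ValueError("operand does not match the mmm field")
    encoded = bytes([opcode])
    if needs_operand:
        encoded += (operand & 0xFFFF).to_bytes(2, "little")
    return encoded

MOV_TO_REG = 0b110     # iii for mov( reg/memory/constant, reg )

print(encode(MOV_TO_REG, REG["ax"], MMM_REG["bx"]).hex())       # mov( bx, ax )
print(encode(MOV_TO_REG, REG["ax"], MMM_DIRECT, 0x1000).hex())  # mov( [1000], ax )
```

The two calls produce the byte sequences discussed above: a single byte $C1 for the register form and $C6, $00, $10 for the direct memory form.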

The special opcode allows the Y86 CPU to expand the set of available instructions. This opcode handles several zero and one-operand instructions as shown in Figure 5.4 and Figure 5.5.

Figure 5.4 Single Operand Instruction Encodings




Figure 5.5 Zero Operand Instruction Encodings

There are four one-operand instruction classes. The first encoding (00) further expands the instruction set with a set of zero-operand instructions (see Figure 5.5). The second opcode is also an expansion opcode that provides all the Y86 jump instructions (see Figure 5.6). The third opcode is the NOT instruction. This is the bitwise logical not operation that inverts all the bits in the destination register or memory operand. The fourth single-operand opcode is currently unassigned. Any attempt to execute this opcode will halt the processor with an illegal instruction error. CPU designers often reserve unassigned opcodes like this one to extend the instruction set at a future date (as Intel did when moving from the 80286 processor to the 80386).



Figure 5.6 Jump Instruction Encodings

There are seven jump instructions in the Y86 instruction set. They all take the following form:

jxx address;

The JMP instruction copies the 16-bit value (address) following the opcode into the IP register. Therefore, the CPU will fetch the next instruction from this target address; effectively, the program "jumps" from the point of the JMP instruction to the instruction at the target address.

The JMP instruction is an example of an unconditional jump instruction. It always transfers control to the target address. The remaining six instructions are conditional jump instructions. They test some condition and jump if the condition is true; they fall through to the next instruction if the condition is false. These six instructions, JA, JAE, JB, JBE, JE, and JNE, let you test for greater than, greater than or equal, less than, less than or equal, equality, and inequality. You would normally execute these instructions immediately after a CMP instruction since it sets the less than and equality flags that the conditional jump instructions test. Note that there are eight possible jump opcodes, but the Y86 uses only seven of them. The eighth opcode is another illegal opcode.




The last group of instructions, the zero operand instructions, appear in Figure 5.5. Three of these instructions are illegal instruction opcodes. The BRK (break) instruction pauses the CPU until the user manually restarts it. This is useful for pausing a program during execution to observe results. The IRET (interrupt return) instruction returns control from an interrupt service routine. We will discuss interrupt service routines later. The HALT instruction terminates program execution. The GET instruction reads a hexadecimal value from the user and returns this value in the AX register; the PUT instruction outputs the value in the AX register.



Hand Encoding Instructions

Keep in mind that the Y86 processor fetches instructions as bit patterns from memory. It decodes and executes those bit patterns. The processor does not execute instructions of the form "mov( ax, bx );" (that is, a string of characters that are readable by humans). Instead, it executes the bit pattern $C8 from memory. Instructions like "mov( ax, bx );" and "add( 5, cx );" are human-readable representations of these instructions that we must first convert into machine code (that is, the binary representation of the instruction that the machine actually executes). In this section we will explore how to manually accomplish this task.

The first step is to choose an instruction to convert into machine code. We'll start with a very simple example, the "add( cx, dx );" instruction. Once you've chosen the instruction, you look up the instruction in one of the figures of the previous section. The ADD instruction is in the first group (see Figure 5.3) and has an iii field of %101. The source operand is CX, so the mmm field is %010, and the destination operand is DX, so the rr field is %11. Merging these bits produces the opcode %10111010 or $BA.

Figure 5.7 Encoding ADD( cx, dx );




Now consider the "add( 5, ax );" instruction. Since this instruction has an immediate source operand, the mmm field will be %111. The destination register operand is AX (%00), so the full opcode becomes %10100111 or $A7. Note, however, that this does not complete the encoding of the instruction. We also have to include the 16-bit constant $0005 as part of the instruction. The binary encoding of the constant must immediately follow the opcode in memory, so the sequence of bytes in memory (from lowest address to highest address) is $A7, $05, $00. Note that the L.O. byte of the constant follows the opcode and the H.O. byte of the constant follows the L.O. byte. This sequence appears backwards because the bytes are arranged in order of increasing memory address and the H.O. byte of a constant always appears in the highest memory address.

Figure 5.8 Encoding ADD( 5, ax );

The "add( [2ff+bx], cx );" instruction also contains a 16-bit constant associated with the instruction's encoding - the displacement portion of the indexed addressing mode. To encode this instruction we use the following field values: iii=%101, rr=%10, and mmm=%101. This produces the opcode byte %10110101 or $B5. The complete instruction also requires the constant $2FF, so the full instruction is the three-byte sequence $B5, $FF, $02.



Figure 5.9 Encoding ADD( [$2ff+bx], cx );

Now consider the "add( [1000], ax );" instruction. This instruction adds the 16-bit contents of memory locations $1000 and $1001 to the value in the AX register. Once again, iii=%101 for the ADD instruction. The destination register is AX, so rr=%00. Finally, the addressing mode is the displacement-only addressing mode, so mmm=%110. This forms the opcode %10100110 or $A6. The instruction is three bytes long since it must encode the displacement (address) of the memory location in the two bytes following the opcode. Therefore, the complete three-byte sequence is $A6, $00, $10.




Figure 5.10 Encoding ADD( [1000], ax );

The last addressing mode to consider is the register indirect addressing mode, [bx]. The "add( [bx], bx );" instruction uses the following encoded values: iii=%101, rr=%01 (bx), and mmm=%100 ([bx]), giving the opcode %10101100 or $AC. Since the value in the BX register completely specifies the memory address, there is no need for a displacement field. Hence, this instruction is only one byte long.

Figure 5.11 Encoding the ADD( [bx], bx ); Instruction

You use a similar approach to encode the SUB, CMP, AND, and OR instructions as you do the ADD instruction. The only difference is that you use different values for the iii field in the opcode.



The MOV instruction is special because there are two forms of the MOV instruction. You encode the first form (iii=%110) exactly as you do the ADD instruction. This form copies a constant or data from memory or a register (the mmm field) into a destination register (the rr field).

The second form of the MOV instruction (iii=%111) copies data from a source register (rr) to a destination memory location (that the mmm field specifies). In this form of the MOV instruction, the source/destination meanings of the rr and mmm fields are reversed so that rr is the source field and mmm is the destination field. Another difference is that the mmm field may only contain the values %100 ([bx]), %101 ([disp+bx]), and %110 ([disp]). The destination values cannot be %000..%011 (registers) or %111 (constant). These latter five encodings are illegal (the register destination instructions are handled by the other MOV instruction and storing data into a constant doesn't make any sense).
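Going the other way, from an opcode byte back to its fields, is the first step of any disassembler. The Python sketch below splits a Y86 opcode byte and applies the legality rule for the second MOV form; split_opcode and is_legal_store_mov are hypothetical names used only for this illustration.

```python
# Sketch: pull the iii, rr, and mmm fields back out of a Y86 opcode byte.

def split_opcode(opcode):
    """Return (iii, rr, mmm) for a one-byte Y86 opcode."""
    iii = (opcode >> 5) & 0b111   # H.O. three bits: instruction
    rr = (opcode >> 3) & 0b11     # middle two bits: register
    mmm = opcode & 0b111          # L.O. three bits: other operand
    return iii, rr, mmm

def is_legal_store_mov(opcode):
    """For mov( reg, memory ) (iii=%111), only the three memory
    addressing modes are valid destinations."""
    iii, _rr, mmm = split_opcode(opcode)
    if iii != 0b111:
        raise ValueError("not the second form of MOV")
    return mmm in (0b100, 0b101, 0b110)

print(split_opcode(0xBA))         # fields of add( cx, dx )
print(is_legal_store_mov(0xFF))   # constant destination: illegal
```

Here split_opcode(0xBA) recovers iii=%101, rr=%11, and mmm=%010, the same values chosen by hand for "add( cx, dx );" earlier in this section.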

The Y86 processor supports a single instruction with a single memory/register operand - the NOT instruction. The NOT instruction has the syntax: "not( reg );" or "not( mem );" where mem represents one of the memory addressing modes ([bx], [disp+bx], or [disp]). Note that you may not specify a constant as the operand of the NOT instruction.

Since the NOT instruction has only a single operand, it only uses the mmm field to encode this operand. The rr field, combined with the iii field, selects the NOT instruction (iii=%000 and rr=%10). Whenever the iii field contains zero this tells the CPU that special decoding is necessary for the instruction. In this case, the rr field specifies whether we have the NOT instruction or one of the other specially decoded instructions.

To encode an instruction like "not( ax );" you would simply specify %000 for the iii field and %10 for the rr field. Then you would encode the mmm field the same way you would encode this field for the ADD instruction. Since mmm=%000 for AX, the encoding of "not( ax );" would be %00010000 or $10.




Figure 5.12 Encoding the NOT( ax ); Instruction

The NOT instruction does not allow an immediate (constant) operand, hence the opcode %00010111 ($17) is an illegal opcode.

The Y86 jump instructions also use a special encoding. These instructions are always three bytes long. The first byte (the opcode) specifies which jump instruction to execute and the next two bytes specify where the CPU transfers control if the condition is met. There are seven different Y86 jump instructions, six conditional jumps and one unconditional jump. These instructions set iii=%000, rr=%01, and use the mmm field to select one of the seven possible jumps; the eighth possible opcode is an illegal opcode (see Figure 5.6). Encoding these instructions is relatively straightforward. Once you pick the instruction you want to encode, you've determined the opcode (since there is a single opcode for each instruction). The opcode values fall in the range $08..$0E ($0F is the illegal opcode).
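As a rough sketch of this encoding in Python: the opcode is the fixed pattern %00001mmm and the target follows L.O. byte first. Which mmm value belongs to which mnemonic is defined by Figure 5.6, so the helper below deliberately takes a raw selector number rather than a mnemonic; encode_jump and JUMP_BASE are names invented for this example.

```python
# Sketch of the three-byte jump encoding (hypothetical helper names).

JUMP_BASE = 0b00001000    # iii=%000, rr=%01, mmm=%000  ->  $08

def encode_jump(selector, target):
    """Build opcode byte $08..$0E followed by the 16-bit target address."""
    if not 0 <= selector <= 6:
        raise ValueError("selector 7 would produce $0F, the illegal opcode")
    if not 0 <= target <= 0xFFFF:
        raise ValueError("target must fit in 16 bits")
    return bytes([JUMP_BASE | selector]) + target.to_bytes(2, "little")

print(encode_jump(0, 0x1234).hex())   # first jump opcode, target $1234
```

The target bytes come out in the same L.O.-then-H.O. order as every other 16-bit operand in the Y86 encoding.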

The only field that requires some thought is the 16-bit operand that follows the opcode. This field holds the address of the target instruction to which the (un)conditional jump transfers if the condition is true (e.g., JE transfers control to this address if the previous CMP instruction found that its two operands were equal). To properly encode this field you must know the address of the opcode byte of the target instruction. If you've already converted the instruction to binary form and stored it into memory, this isn't a problem; just specify the address of that instruction as the operand of the conditional jump. On the other hand, if you haven't yet written, converted, and placed that instruction into memory, knowing its address would seem to require a bit of divination. Fortunately, you can figure out the target address by computing the lengths of all the instructions between the current jump instruction you're encoding and the target instruction. Unfortunately, this is an arduous task.



The best solution is to write all your instructions down on paper, compute their lengths (which is easy, all instructions are one or three bytes long depending on the presence of a 16-bit operand), and then assign an appropriate address to each instruction. Once you've done this (and, assuming you haven't made any mistakes) you'll know the starting address for each instruction and you can fill in target address operands in your (un)conditional jump instructions as you encode them. Fortunately, there is a better way to do this, as you'll see in the next section.
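The paper-and-pencil procedure above is mechanical enough to sketch in Python. The length rule below follows from the encodings in this section: zero-operand instructions are one byte, jumps are three, and everything else is three bytes only when mmm selects [disp+bx], [disp], or a constant. The function names are ours, and the result for illegal opcodes is not meaningful.

```python
# Sketch: compute Y86 instruction lengths and starting addresses.

def instruction_length(opcode):
    """Return 1 or 3, the length in bytes of the instruction
    that starts with this opcode byte."""
    iii = (opcode >> 5) & 0b111
    rr = (opcode >> 3) & 0b11
    mmm = opcode & 0b111
    if iii == 0b000 and rr == 0b00:
        return 1                  # zero-operand group
    if iii == 0b000 and rr == 0b01:
        return 3                  # jump group: opcode + 16-bit target
    # Remaining instructions carry an mmm operand field.
    return 3 if mmm in (0b101, 0b110, 0b111) else 1

def assign_addresses(opcodes, origin=0):
    """Give each instruction its starting address, in program order."""
    addresses = []
    address = origin
    for opcode in opcodes:
        addresses.append(address)
        address += instruction_length(opcode)
    return addresses

# mov( [1000], ax ); add( cx, dx ); a jump; not( ax );
print(assign_addresses([0xC6, 0xBA, 0x08, 0x10]))
```

For this four-instruction program the starting addresses are 0, 3, 4, and 7, which are exactly the values you would write into any jump that targets one of these instructions.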

The last group of instructions, the zero operand instructions, are the easiest to encode. Since they have no operands they are always one byte long and the instruction uniquely specifies the opcode for the instruction. These instructions always have iii=%000, rr=%00, and mmm specifies the particular instruction opcode (see Figure 5.5). Note that the Y86 CPU leaves three of these instructions undefined (so we can use these opcodes for future expansion).




Using an Assembler to Encode Instructions

Of course, hand coding machine language programs as demonstrated in the previous section is impractical for all but the smallest programs. Certainly you haven't had to do anything like this when writing HLA programs. The HLA compiler lets you create a text file containing human readable forms of the instructions. You might wonder why we can write such code for the 80x86 but not for the Y86. The answer is to use an assembler or compiler for the Y86. The job of an assembler/compiler is to read a text file containing human readable text and translate that text into the binary encoded representation for the corresponding machine language program.

An assembler or compiler is nothing special. It's just another program that executes on your computer system. The only thing special about an assembler or compiler is that it translates programs from one form (source code) to another (machine code). A typical Y86 assembler, for example, would read lines of text with each line containing a Y86 instruction; it would parse [9] each statement and then write the binary equivalent of each instruction to memory or to a file for later execution.

Assemblers have two big advantages over coding in machine code. First, they automatically translate strings like "ADD( ax, bx );" and "MOV( ax, [1000] );" to their corresponding binary form. Second, and probably even more important, assemblers let you attach labels to statements and refer to those labels within jump instructions; this means that you don't have to know the target address of an instruction in order to specify that instruction as the target of a jump or conditional jump instruction. Windows users have access to a very simple Y86 assembler [10] that lets you specify up to 26 labels in a program (using the symbols 'A'..'Z'). To attach a label to a statement, you simply preface the instruction with the label and a colon, e.g.,

L: mov( 0, ax );

To transfer control to a statement with a label attached to it, you simply specify the label name as the operand of the jump instruction, e.g.,

jmp L;



The assembler will compute the address of the label and fill in the address for you whenever you specify the label as the operand of a jump or conditional jump instruction. The assembler can do this even if it hasn't yet encountered the label in the program's source file (i.e., the label is attached to a later instruction in the source file). Most assemblers accomplish this magic by making two passes over the source file. During the first pass the assembler determines the starting address of each symbol and stores this information in a simple database called the symbol table. The assembler does not emit any machine code during this first pass. Then the assembler makes a second pass over the source file and actually emits the machine code. During this second pass it looks up all label references in the symbol table and uses the information it retrieves from this database to fill in the operand fields of the instructions that refer to some symbol.
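The two-pass idea fits in a short Python sketch. To keep it small we skip the parsing step entirely and assume each statement has already been reduced to a (label, opcode, operand) tuple, where the operand is None, a 16-bit number, or the name of a label; this input format and the name assemble are assumptions made for the example, not how any particular Y86 assembler works.

```python
# Sketch of two-pass label resolution over pre-parsed statements.

def assemble(program, origin=0):
    """program: list of (label, opcode, operand) tuples."""
    # Pass 1: record the address of every label; emit no code.
    symbols = {}
    address = origin
    for label, _opcode, operand in program:
        if label is not None:
            symbols[label] = address
        address += 1 if operand is None else 3

    # Pass 2: emit the bytes, looking label operands up in the symbol table.
    code = bytearray()
    for _label, opcode, operand in program:
        code.append(opcode)
        if operand is not None:
            value = symbols[operand] if isinstance(operand, str) else operand
            code += (value & 0xFFFF).to_bytes(2, "little")
    return bytes(code), symbols

program = [
    (None, 0x08, "L"),    # a jump-group opcode with a forward reference to L
    (None, 0xC1, None),   # mov( bx, ax );
    ("L",  0xBA, None),   # add( cx, dx );
]
code, symbols = assemble(program)
print(code.hex(), symbols)
```

Because the first pass has already placed L at address 4, the second pass can fill in the jump's operand even though the label appears after the jump in the source.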




Extending the Y86 Instruction Set

The Y86 CPU is a trivial CPU, suitable only for demonstrating how to encode machine instructions. However, like any good CPU the Y86 design does provide the capability for expansion. So if you wanted to improve the CPU by adding new instructions, the ability to accomplish this exists in the instruction set.

There are two standard ways to increase the number of instructions in a CPU's instruction set. Both mechanisms require the presence of undefined (or illegal) opcodes on the CPU. Since the Y86 CPU has several of these, we can expand the instruction set.

The first method is to directly use the undefined opcodes to define new instructions. This works best when there are undefined bit patterns within an opcode group and the new instruction you want to add falls into that same group. For example, the opcode %00011mmm falls into the same group as the NOT instruction. If you decided that you really needed a NEG (negate, take the two's complement) instruction, using this particular opcode for this purpose makes a lot of sense because you'd probably expect the NEG instruction to use the same syntax (and, therefore, decoding) as the NOT instruction.

Likewise, if you want to add a zero-operand instruction to the instruction set, there are three undefined zero-operand instructions that you could use for this purpose. You'd just appropriate one of these opcodes and assign your instruction to it.

Unfortunately, the Y86 CPU doesn't have that many illegal opcodes open. For example, if you wanted to add the SHL, SHR, ROL, and ROR instructions (shift and rotate left and right) as single-operand instructions, there is insufficient space in the single operand instruction opcodes to add these instructions (there is currently only one open opcode you could use). Likewise, there are no two-operand opcodes open, so if you wanted to add an XOR instruction or some other two-operand instruction, you'd be out of luck.

A common way to handle this dilemma (one the Intel designers have employed) is to use a prefix opcode byte. This opcode expansion scheme uses one of the undefined opcodes as an opcode prefix byte. Whenever the CPU encounters a prefix byte in memory, it reads and decodes the next byte in memory as the actual opcode. However, it does not treat this second byte as it would any other opcode. Instead, this second opcode byte uses a completely different encoding scheme and, therefore, lets you specify as many new instructions as you can encode in that byte (or bytes, if you prefer).



For example, the opcode $FF is illegal (it corresponds to a "mov( dx, const );" instruction) so we can use this byte as a special prefix byte to further expand the instruction set [11].

Figure 5.13 Using a Prefix Byte to Extend the Instruction Set
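A decoder that honors such a prefix only needs one extra check when it fetches an opcode. The Python sketch below assumes $FF is the chosen prefix, as in the example above; the "basic" and "extended" labels and the function name fetch_opcode are illustrative, and what the extended opcodes would mean is left open.

```python
# Sketch: fetching an opcode when $FF acts as an expansion prefix.

PREFIX = 0xFF   # assumed prefix byte, taken from the example in the text

def fetch_opcode(code, pc):
    """Return (opcode_map, opcode, next_pc) for the instruction at pc."""
    first = code[pc]
    if first == PREFIX:
        # The byte after the prefix is decoded with a separate scheme.
        return ("extended", code[pc + 1], pc + 2)
    return ("basic", first, pc + 1)

print(fetch_opcode(bytes([0xFF, 0x01]), 0))
print(fetch_opcode(bytes([0xC1]), 0))
```

The cost of the scheme is visible in next_pc: every instruction in the extended map is one byte longer than it would be in the basic map.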




Encoding 80x86 Instructions

The Y86 processor is simple to understand, easy to hand encode instructions for, and a great vehicle for learning how to assign opcodes. It's also a purely hypothetical device intended only as a teaching tool. Therefore, you can now forget all about the Y86; it's served its purpose. Now it's time to take a look at the actual machine instruction format for the 80x86 CPU family.

They don't call the 80x86 CPU a Complex Instruction Set Computer for nothing. Although more complex instruction encodings do exist, no one is going to challenge the assertion that the 80x86 has a complex instruction encoding. The generic 80x86 instruction takes the form shown in Figure 5.14. Although this diagram seems to imply that instructions can be up to 16 bytes long, in actuality the 80x86 will not allow instructions greater than 15 bytes in length.

Figure 5.14 80x86 Instruction Encoding



The prefix bytes are not the "opcode expansion prefix" that the previous sections in this chapter discussed. Instead, these are special bytes to modify the behavior of existing instructions (rather than define new instructions). We'll take a look at a couple of these prefix bytes in a little bit; others we'll leave for discussion in later chapters. The 80x86 certainly supports more than four prefix values; however, an instruction may have a maximum of four prefix bytes attached to it. Also note that many prefix bytes are mutually exclusive, and the results are undefined if you put a pair of mutually exclusive prefix bytes in front of an instruction.

The 80x86 supports two basic opcode sizes: a standard one-byte opcode and a two-byte opcode consisting of a $0F opcode expansion prefix byte and a second byte specifying the actual instruction. One way to view these opcode bytes is as an eight-bit extension of the iii field in the Y86 encoding. This provides for up to 512 different instruction classes (although the 80x86 does not yet use them all). In reality, various instruction classes use certain bits in this opcode for decidedly non-instruction-class purposes. For example, consider the ADD instruction opcode. It takes the form shown in Figure 5.15.

Note that bit number zero specifies the size of the operands the ADD instruction operates upon. If this field contains zero then the operands are eight-bit registers and memory locations. If this bit contains one then the operands are either 16 bits or 32 bits. Under 32-bit operating systems the default is 32-bit operands if this field contains a one. To specify a 16-bit operand (under Windows or Linux) you must insert a special "operand-size prefix byte" in front of the instruction.

Bit number one specifies the direction of the transfer. If this bit is zero, then the destination operand is a memory location (e.g., "add( al, [ebx] );"). If this bit is one, then the destination operand is a register (e.g., "add( [ebx], al );"). You'll soon see that this direction bit creates a problem that results in one instruction having two different possible opcodes.




Figure 5.15 80x86 ADD Opcode
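As a quick illustration, the Python sketch below pulls the two low-order bits out of the basic ADD opcodes. It covers only the four register/memory forms of ADD, whose one-byte opcodes are $00 through $03 (pattern %000000ds); the function name and the "d" and "size" keys are our own labels for the direction and size bits described above.

```python
# Sketch: read the direction and size bits of the basic 80x86 ADD opcodes.

def add_opcode_fields(opcode):
    """Split an ADD opcode of the form %000000ds into its two flag bits."""
    if opcode & 0b11111100 != 0:
        raise ValueError("not one of the ADD opcodes $00..$03")
    return {
        "d": (opcode >> 1) & 1,   # 0: r/m operand is destination, 1: REG is
        "size": opcode & 1,       # 0: eight-bit operands, 1: 16/32-bit
    }

for opcode in range(4):
    print(hex(opcode), add_opcode_fields(opcode))
```

Under a 32-bit operating system, opcode $03 therefore denotes a 32-bit ADD whose destination is a register, unless an operand-size prefix byte precedes it.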



Encoding Instruction Operands

The "mod-reg-r/m" byte (in Figure 5.14) specifies a basic addressing mode. This byte contains the following fields:

Figure 5.16 MOD-REG-R/M Byte

The REG field specifies an 80x86 register. Depending on the instruction, this can be either the source or the destination operand. Many instructions have the "d" (direction) field in their opcode to choose whether this operand is the source (d=0) or the destination (d=1) operand. This field is encoded using the bit patterns found in the following table:

REG Value   Register if data     Register if data   Register if data
            size is eight bits   size is 16 bits    size is 32 bits
%000        al                   ax                 eax
%001        cl                   cx                 ecx
%010        dl                   dx                 edx
%011        bl                   bx                 ebx
%100        ah                   sp                 esp
%101        ch                   bp                 ebp
%110        dh                   si                 esi
%111        bh                   di                 edi

For certain (single operand) instructions, the REG field may contain an opcode extension rather than a register value (the R/M field will specify the operand in this case).

The MOD and R/M fields combine to specify the other operand in a two-operand instruction (or the only operand in a single-operand instruction like NOT or NEG). Remember, the "d" bit in the opcode determines which operand is the source and which is the destination. The MOD and R/M fields together specify the following addressing modes:



MOD   Meaning
%00   Register indirect addressing mode, or SIB with no displacement (when R/M=%100), or displacement-only addressing mode (when R/M=%101).
%01   One-byte signed displacement follows addressing mode byte(s).
%10   Four-byte signed displacement follows addressing mode byte(s).
%11   Register addressing mode.

MOD   R/M    Addressing Mode
%00   %000   [eax]
%01   %000   [eax+disp8]
%10   %000   [eax+disp32]
%11   %000   register (al/ax/eax) [1]
%00   %001   [ecx]
%01   %001   [ecx+disp8]
%10   %001   [ecx+disp32]
%11   %001   register (cl/cx/ecx)
%00   %010   [edx]
%01   %010   [edx+disp8]
%10   %010   [edx+disp32]
%11   %010   register (dl/dx/edx)
%00   %011   [ebx]
%01   %011   [ebx+disp8]
%10   %011   [ebx+disp32]
%11   %011   register (bl/bx/ebx)
%00   %100   SIB Mode
%01   %100   SIB + disp8 Mode
%10   %100   SIB + disp32 Mode
%11   %100   register (ah/sp/esp)
%00   %101   Displacement Only Mode (32-bit displacement)
%01   %101   [ebp+disp8]
%10   %101   [ebp+disp32]
%11   %101   register (ch/bp/ebp)
%00   %110   [esi]
%01   %110   [esi+disp8]
%10   %110   [esi+disp32]
%11   %110   register (dh/si/esi)
%00   %111   [edi]
%01   %111   [edi+disp8]
%10   %111   [edi+disp32]
%11   %111   register (bh/di/edi)

1. The size bit in the opcode specifies eight or 32-bit register size. To select a 16-bit register requires a prefix byte.
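The MOD and R/M table lends itself to a small lookup routine. The following Python is a hedged sketch of one way to turn the two fields into a readable operand description, including the two special cases (R/M=%100 selects an SIB byte, and MOD=%00 with R/M=%101 selects displacement-only). The function name describe_rm and its output format are assumptions made for this example.

```python
REGISTERS = {
    8:  ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"],
    16: ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"],
    32: ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"],
}


def describe_rm(mod, rm, size=32):
    """Describe the operand selected by the MOD and R/M fields (32-bit addressing)."""
    if mod == 0b11:
        # Register addressing mode: R/M names a register of the operand size.
        return REGISTERS[size][rm]
    disp = ["", "+disp8", "+disp32"][mod]
    if rm == 0b100:
        # The [esp] slots are taken by the SIB modes.
        return "SIB" + disp
    if mod == 0b00 and rm == 0b101:
        # The [ebp] slot is taken by the displacement-only mode.
        return "[disp32]"
    return "[" + REGISTERS[32][rm] + disp + "]"


if __name__ == "__main__":
    for mod in range(4):
        print([describe_rm(mod, rm) for rm in range(8)])
```

Checking the special cases before the general case keeps the routine short; a table-driven version with 32 entries would work equally well.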

There are a couple of interesting things to note about this table. First of all, note that there are two forms of the [reg+disp] addressing modes: one form with an eight-bit displacement and one form with a 32-bit displacement. Addressing modes whose displacement falls in the range -128..+127 require only a single byte displacement after the opcode; hence these instructions will be shorter (and sometimes faster) than instructions whose displacement value is outside this range. It turns out that many offsets are within this range, so the assembler/compiler can generate shorter instructions for a large percentage of the instructions.

The second thing to note is that there is no [ebp] addressing mode. If you look in the table above where this addressing mode logically belongs, you'll find that its slot is occupied by the 32-bit displacement only addressing mode. The basic encoding scheme for addressing modes didn't allow for a displacement only addressing mode, so Intel "stole" the encoding for [ebp] and used that for the displacement only mode. Fortunately, anything you can do with the [ebp] addressing mode you can do with the [ebp+disp8] addressing mode by setting the eight-bit displacement to zero. True, the instruction is a little bit longer, but the capabilities are still there. Intel (wisely) chose to replace this addressing mode because they anticipated that programmers would use this addressing mode less often than the other register indirect addressing modes (for reasons you'll discover in a later chapter).
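Both observations can be illustrated from the encoder's side. The sketch below, in Python, picks the shortest [base+disp] form: no displacement when it is zero, one byte when it fits in -128..+127, four bytes otherwise, and it falls back to [ebp+disp8] with a zero byte when the base is EBP. The helper name encode_base_disp is hypothetical, and ESP as a base is rejected because it needs an SIB byte.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_base_disp(reg_field, base, disp=0):
    """Return the MOD-REG-R/M byte plus displacement bytes for [base+disp]."""
    rm = REG32.index(base)
    if rm == 0b100:
        raise ValueError("[esp...] needs an SIB byte; not handled in this sketch")
    if disp == 0 and rm != 0b101:
        mod, tail = 0b00, b""                      # plain [reg]
    elif -128 <= disp <= 127:
        mod, tail = 0b01, struct.pack("<b", disp)  # [reg+disp8], also used for [ebp]
    else:
        mod, tail = 0b10, struct.pack("<i", disp)  # [reg+disp32]
    return bytes([(mod << 6) | (reg_field << 3) | rm]) + tail


if __name__ == "__main__":
    print(encode_base_disp(0, "ebx").hex())          # [ebx]
    print(encode_base_disp(0, "ebp").hex())          # [ebp] encoded as [ebp+0]
    print(encode_base_disp(0, "ebp", 0x1000).hex())  # [ebp+disp32]
```

A disassembler sees the reverse situation: it should be prepared to meet [ebp+disp8] with a zero displacement and may choose to print it simply as [ebp].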



Another thing you'll notice missing from this table are addressing modes of the form [ebx+edx*4], the so-called scaled indexed addressing modes. You'll also notice that the table is missing addressing modes of the form [esp], [esp+disp8], and [esp+disp32]. In the slots where you would normally expect these addressing modes you'll find the SIB (scaled index byte) modes. If these values appear in the MOD and R/M fields then the addressing mode is a scaled indexed addressing mode with a second byte (the SIB byte) following the MOD-REG-R/M byte that specifies the registers to use (note that the MOD field still specifies the displacement size of zero, one, or four bytes). The following diagram shows the layout of this SIB byte and the following tables explain the values for each field.

Figure 5.17 SIB (Scaled Index Byte) Layout

Scale Value   Index*Scale Value
%00           Index*1
%01           Index*2
%10           Index*4
%11           Index*8

Index   Register
%000    EAX
%001    ECX
%010    EDX
%011    EBX
%100    Illegal
%101    EBP
%110    ESI
%111    EDI



The MOD-REG-R/M and SIB bytes are complex and convoluted, no question about that. The reason these addressing mode bytes are so convoluted is because Intel reused their 16-bit addressing circuitry in the 32-bit mode rather than simply abandoning the 16-bit format in the 32-bit mode. There are good hardware reasons for this, but the end result is a complex scheme for specifying addressing modes.

Part of the reason the addressing scheme is so convoluted is because of the special cases for the SIB and displacement-only modes. You will note that the intuitive encoding of the MOD-REG-R/M byte does not allow for a displacement-only mode. Intel added a quick kludge to the addressing scheme replacing the [EBP] addressing mode with the displacement-only mode. Programmers who actually want to use the [EBP] addressing mode have to use [EBP+0] instead. Semantically, this mode produces the same result but the instruction is one byte longer since it requires a displacement byte containing zero.

Base   Register
%000   EAX
%001   ECX
%010   EDX
%011   EBX
%100   ESP
%101   Displacement-only if MOD = %00, EBP if MOD = %01 or %10
%110   ESI
%111   EDI




You will also note that if the R/M field of the MOD-REG-R/M byte contains %100 and MOD does not contain %11 then the addressing mode is an "SIB" mode rather than the expected [ESP], [ESP+disp8], or [ESP+disp32] mode. The SIB mode is used when an addressing mode uses one of the scaled indexed registers, i.e., one of the following addressing modes (note: n = 1, 2, 4, or 8):

MOD = %00
[reg32+eax*n]
[reg32+ebx*n]
[reg32+ecx*n]
[reg32+edx*n]
[reg32+ebp*n]
[reg32+esi*n]
[reg32+edi*n]

MOD = %01
[disp8+reg32+eax*n]
[disp8+reg32+ebx*n]
[disp8+reg32+ecx*n]
[disp8+reg32+edx*n]
[disp8+reg32+ebp*n]
[disp8+reg32+esi*n]
[disp8+reg32+edi*n]

MOD = %10
[disp32+reg32+eax*n]
[disp32+reg32+ebx*n]
[disp32+reg32+ecx*n]
[disp32+reg32+edx*n]
[disp32+reg32+ebp*n]
[disp32+reg32+esi*n]
[disp32+reg32+edi*n]

MOD = %00 and BASE field contains %101
[disp32+eax*n]
[disp32+ebx*n]
[disp32+ecx*n]
[disp32+edx*n]
[disp32+ebp*n]
[disp32+esi*n]
[disp32+edi*n]



In each of these addressing modes, the MOD field of the MOD-REG-R/M byte specifies the size of the displacement (zero, one, or four bytes). This is indicated via the modes "SIB Mode," "SIB + disp8 Mode," and "SIB + disp32 Mode." The Base and Index fields of the SIB byte select the base and index registers, respectively. Note that this addressing mode does not allow the use of the ESP register as an index register. Presumably, Intel left this particular mode undefined to provide the ability to extend the addressing modes in a future version of the CPU (although extending the addressing mode sequence to three bytes seems a bit extreme).

Like the MOD-REG-R/M encoding, the SIB format redefines the [EBP+index*scale] mode as a displacement plus index mode. Once again, if you really need this addressing mode, you will have to use a single byte displacement value containing zero to achieve the same result.
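To make the SIB rules concrete, here is a hedged Python sketch that decodes an SIB byte given the MOD field of the preceding MOD-REG-R/M byte. The function name describe_sib and the output format are assumptions for this example, and the layout assumed is scale in bits 7..6, index in bits 5..3 and base in bits 2..0. One point goes beyond the text above: on real 80x86 CPUs an index value of %100 means "no index register", which is how [esp] itself is encoded, so the sketch omits the index term in that case instead of reporting an error.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def describe_sib(mod, sib):
    """Describe the memory operand selected by MOD (0..2) and an SIB byte."""
    if mod not in (0b00, 0b01, 0b10):
        raise ValueError("MOD=%11 is register mode; no SIB byte follows")
    scale = 1 << ((sib >> 6) & 0b11)   # %00..%11 -> 1, 2, 4, 8
    index = (sib >> 3) & 0b111
    base = sib & 0b111

    terms = []
    if base == 0b101 and mod == 0b00:
        disp = "disp32"                # the [ebp+index] slot: displacement plus index
    else:
        terms.append(REG32[base])
        disp = [None, "disp8", "disp32"][mod]
    if index != 0b100:                 # %100: no index register (ESP cannot be an index)
        terms.append(f"{REG32[index]}*{scale}")
    if disp:
        terms.append(disp)
    return "[" + "+".join(terms) + "]"


if __name__ == "__main__":
    print(describe_sib(0b00, 0xBB))
    print(describe_sib(0b00, 0x05))
    print(describe_sib(0b00, 0x24))
```

A full disassembler would then read the displacement bytes that follow the SIB byte and substitute the actual value for disp8 or disp32.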




Encoding the ADD Instruction: Some Examples

To figure out how to encode an instruction using this complex scheme, some examples are warranted. So let's take a look at how to encode the 80x86 ADD instruction using various addressing modes. The ADD opcode is $00, $01, $02, or $03, depending on the direction and size bits in the opcode (see Figure 5.15). The following figures each describe how to encode various forms of the ADD instruction using different addressing modes.

Figure 5.18 Encoding the ADD( al, cl ); Instruction

There is an interesting side effect of the operation of the direction bit and the MOD-REG-R/M organization: some instructions have two different opcodes (and both are legal). For example, we could encode the "add( al, cl );" instruction from Figure 5.18 as $02, $C8 by reversing the AL and CL registers in the REG and R/M fields and then setting the d bit in the opcode (bit #1). This issue applies to instructions with two register operands.



Figure 5.19 Encoding the ADD( eax, ecx ); instruction

Note that we can also encode "add( eax, ecx );" using the bytes $03, $C8.
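A disassembler has to accept both encodings and print the same instruction for each. The following Python sketch decodes a two-byte register-to-register ADD and shows that $00 $C1 and $02 $C8 describe the same operation, as do $01 $C1 and $03 $C8. The function name decode_add_rr is invented for this example; it assumes 32-bit mode with no prefix bytes and returns (destination, source).

```python
REGISTERS = {
    8:  ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"],
    32: ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"],
}


def decode_add_rr(code):
    """Decode a register-to-register ADD (opcodes $00..$03); return (dest, src)."""
    opcode, modrm = code[0], code[1]
    if opcode & 0b11111100 != 0 or (modrm >> 6) != 0b11:
        raise ValueError("not a register-to-register ADD")
    size = 32 if opcode & 0b01 else 8        # s bit: 0 = eight bits, 1 = 32 bits
    d = (opcode >> 1) & 1                    # d bit: 1 = REG field is the destination
    reg = REGISTERS[size][(modrm >> 3) & 0b111]
    rm = REGISTERS[size][modrm & 0b111]
    return (reg, rm) if d else (rm, reg)


if __name__ == "__main__":
    for code in (b"\x00\xC1", b"\x02\xC8", b"\x01\xC1", b"\x03\xC8"):
        dest, src = decode_add_rr(code)
        print(code.hex(), "-> add(", src, ",", dest, ")")
```

Because the two byte sequences are equivalent, an assembler is free to emit either one, and a disassembler cannot tell from its output alone which one was in the original file unless it also shows the raw bytes.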




Figure 5.20 Encoding the ADD( disp, edx ); Instruction



Figure 5.21 Encoding the ADD( [ebx], edi ); Instruction




Figure 5.22 Encoding the ADD( [esi+disp8], eax ); Instruction



Figure 5.23 Encoding the ADD ( [ebp+disp32], ebx); Instruction




Figure 5.24 Encoding the ADD( [disp32 +eax*1], ebp ); Instruction



Figure 5.25 Encoding the ADD( [ebx + edi * 4], ecx ); Instruction




Encoding Immediate Operands

You may have noticed that the MOD-REG-R/M and SIB bytes don't contain any bit combinations you can use to specify an immediate operand. The 80x86 uses a completely different opcode to specify an immediate operand. Figure 5.26 shows the basic encoding for an ADD immediate instruction.

Figure 5.26 Encoding an ADD Immediate Instruction

There are three major differences between the encoding of the ADD immediate and the standard ADD instruction. First, and most important, the opcode has a one in the H.O. bit position. This tells the CPU that the instruction has an immediate constant. This individual change, however, does not tell the CPU that it must execute an ADD instruction, as you'll see momentarily.

The second difference is that there is no direction bit in the opcode. This makes sense because you cannot specify a constant as a destination operand. Therefore, the destination operand is always the location the MOD and R/M bits specify in the MOD-REG-R/M field.

In place of the direction bit, the opcode has a sign extension (x) bit. For eight-bit operands, the CPU ignores this bit. For 16-bit and 32-bit operands, this bit specifies the size of the constant following the ADD instruction. If this bit contains zero then the constant is the same size as the operand (i.e., 16 or 32 bits). If this bit contains one then the constant is a signed eight-bit value and the CPU sign extends this value to the appropriate size before adding it to the operand. This little trick often makes programs quite a bit shorter because one commonly adds small valued constants to 16 or 32 bit operands.

The third difference between the ADD immediate and the standard ADD instruction is the meaning of the REG field in the MOD-REG-R/M byte. Since the instruction implies that the source operand is a constant and the MOD-R/M fields specify the destination operand, the instruction does not need to use the REG field to specify an operand. Instead, the 80x86 CPU uses these three bits as an opcode extension. For the ADD immediate instruction these three bits must contain zero (other bit patterns would correspond to a different instruction).

Note that when adding a constant to a memory location, the displacement (if any) associated with the memory location immediately precedes the immediate (constant) data in the opcode sequence.
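The three differences can be seen in a short encoder. This Python sketch builds the ADD immediate instruction for a 32-bit register destination: the opcode has its high bit set, the x bit selects a sign-extended eight-bit constant when the value fits, and the REG field holds the opcode extension %000. The name encode_add_imm is an assumption for this example, and memory destinations are left out.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_add_imm(dest, imm):
    """Encode add(imm, dest) for a 32-bit register destination."""
    rm = REG32.index(dest)
    modrm = (0b11 << 6) | (0b000 << 3) | rm   # MOD=%11, REG=%000 (ADD), R/M=dest
    if -128 <= imm <= 127:
        # x=1, s=1: opcode $83, one-byte constant that the CPU sign extends.
        return bytes([0x83, modrm, imm & 0xFF])
    # x=0, s=1: opcode $81, full 32-bit constant in little-endian order.
    return bytes([0x81, modrm]) + (imm & 0xFFFFFFFF).to_bytes(4, "little")


if __name__ == "__main__":
    print(encode_add_imm("ecx", 5).hex())
    print(encode_add_imm("ecx", 0x1000).hex())
```

The short form saves three bytes per instruction, which is the size benefit the sign extension bit was introduced for.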




Encoding Eight, Sixteen, and Thirty-Two Bit Operands

When Intel designed the 8086 they used one bit (s) in the opcode to select between eight and sixteen bit integer operand sizes. Later, when they extended the 80x86 architecture to 32 bits with the introduction of the 80386, they had a problem: with this single bit they could only encode two sizes, but they needed to encode three (8, 16, and 32 bits). To solve this problem, they used an operand size prefix byte.

Intel studied their instruction set and came to the conclusion that in a 32-bit environment, programs were likely to use eight-bit and 32-bit operands far more often than 16-bit operands. So Intel decided to let the size bit (s) in the opcode select between eight and thirty-two bit operands, as the previous sections describe. Although modern 32-bit programs don't use 16-bit operands that often, they do need them now and then. To allow for 16-bit operands, Intel lets you prefix a 32-bit instruction with the operand size prefix byte, whose value is $66. This prefix byte tells the CPU to operate on 16-bit data rather than 32-bit data.

You do not have to explicitly put an operand size prefix byte in front of your 16-bit instructions; the assembler will take care of this automatically for you whenever you use a 16-bit operand in an instruction. However, do keep in mind that whenever you use a 16-bit operand in a 32-bit program, the instruction is longer (by one byte) because of the prefix value. Therefore, you should be careful about using 16-bit instructions if size (and to a lesser extent, speed) are important because these instructions are longer (and may be slower because of their effect on the cache).
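The effect of the prefix is easy to show with a small encoder. In this Python sketch, the same opcode and MOD-REG-R/M byte are used for the 16-bit and the 32-bit register-to-register ADD; the only difference is the leading $66 byte. The helper name encode_add_rr is hypothetical, and the sketch assumes code running in a 32-bit segment and always uses the d=0 form of the opcode.

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_add_rr(dest, src):
    """Encode add(src, dest) for two 16-bit or two 32-bit registers."""
    if dest in REG32 and src in REG32:
        table, prefix = REG32, b""
    elif dest in REG16 and src in REG16:
        table, prefix = REG16, b"\x66"       # operand size prefix selects 16 bits
    else:
        raise ValueError("operands must both be 16-bit or both be 32-bit registers")
    # MOD=%11, REG=source, R/M=destination; opcode $01 has d=0 and s=1.
    modrm = 0xC0 | (table.index(src) << 3) | table.index(dest)
    return prefix + bytes([0x01, modrm])


if __name__ == "__main__":
    print(encode_add_rr("ecx", "eax").hex())   # add( eax, ecx );
    print(encode_add_rr("cx", "ax").hex())     # add( ax, cx );
```

For a disassembler this means the prefix has to be recorded before the opcode is decoded, because it changes which register table the REG and R/M fields index.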



Alternate Encodings for Instructions

As noted earlier in this chapter, one of Intel's primary design goals for the 80x86 was to create an instruction set to allow programmers to write very short programs in order to save precious (at the time) memory. One way they did this was to create alternate encodings of some very commonly used instructions. These alternate instructions were shorter than the standard counterparts and Intel hoped that programmers would make extensive use of these instructions, thus creating shorter programs.

A good example of these alternate instructions are the "add( constant, accumulator );" instructions (the accumulator is AL, AX, or EAX). The 80x86 provides a single byte opcode for "add( constant, al );" and "add( constant, eax );" (the opcodes are $04 and $05, respectively). With a one-byte opcode and no MOD-REG-R/M byte, these instructions are one byte shorter than their standard ADD immediate counterparts. Note that the "add( constant, ax );" instruction requires an operand size prefix (as does the standard "add( constant, ax );" instruction), so its opcode is effectively two bytes if you count the prefix byte. This, however, is still one byte shorter than the corresponding standard ADD immediate.

You do not have to specify anything special to use these instructions. Any decent assembler will automatically choose the shortest possible instruction it can use when translating your source code into machine code. However, you should note that Intel only provides alternate encodings for the accumulator registers. Therefore, if you have a choice of several instructions to use and the accumulator registers are among these choices, the AL/AX/EAX registers almost always make the best bet. This is a good reason why you should take some time to scan through the encodings of the 80x86 instructions. By familiarizing yourself with the instruction encodings, you'll know which instructions have special (and, therefore, shorter) encodings.
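The one-byte saving can be checked by building both encodings side by side. The following Python sketch returns the accumulator form and the standard ADD immediate form for AL, AX and EAX, using a constant of the same size as the register in both cases. The function name add_imm_to_accumulator is an assumption for this illustration.

```python
def add_imm_to_accumulator(acc, imm):
    """Return (short_form, standard_form) encodings of add(imm, acc).

    Both forms carry a constant of the same size as the accumulator.
    """
    size = {"al": 1, "ax": 2, "eax": 4}[acc]
    data = (imm & ((1 << (8 * size)) - 1)).to_bytes(size, "little")
    prefix = b"\x66" if acc == "ax" else b""   # operand size prefix for 16 bits
    s = 0 if acc == "al" else 1                # size bit in the opcode
    short = prefix + bytes([0x04 | s]) + data            # $04 / $05, no MOD-REG-R/M
    standard = prefix + bytes([0x80 | s, 0xC0]) + data   # $80 / $81 plus MOD-REG-R/M
    return short, standard


if __name__ == "__main__":
    for acc, imm in (("al", 5), ("ax", 0x1234), ("eax", 0x12345678)):
        short, standard = add_imm_to_accumulator(acc, imm)
        print(acc, short.hex(), standard.hex())
```

One caveat for small constants: the sign-extended form with opcode $83 needs only three bytes for a 32-bit register, so for a value such as 5 it is shorter than the five-byte $05 form, and an assembler that picks the shortest encoding has to compare all of the candidates.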




Putting It All Together

Designing an instruction set that can stand the test of time is a true intellectual challenge. An engineer must balance several compromises when choosing an instruction set and assigning opcodes for the instructions. The Intel 80x86 instruction set is a classic example of a kludge that people are currently using for purposes the original designers never intended. However, the 80x86 is also a marvelous testament to the ingenuity of Intel's engineers who were faced with the difficult task of extending the CPU in ways it was never intended. The end result, though functional, is extremely complex. Clearly, no one designing a CPU (from scratch) today would choose the encoding that Intel's engineers are using. Nevertheless, the 80x86 CPU does demonstrate that careful planning (or just plain luck) does give the designer the ability to extend the CPU far beyond its original design.

Historically, an important fact we've learned from the 80x86 family is that it's very poor planning to assume that your CPU will last only a short time period and that users will replace the chip and their software when something better comes along. Software developers usually don't have a problem adapting to a new architecture when they write new software (assuming financial incentive to do so), but they are very resistant to moving existing software from one platform to another. This is the primary reason the Intel 80x86 platform remains popular to this day.

Choosing which instructions you want to incorporate into the initial design of a new CPU is a difficult task. You must balance the desire to provide lots of useful instructions with the silicon budget and you must also be careful not to include lots of irrelevant instructions that programmers wind up ignoring for one reason or another. Remember, all future versions of the CPU will probably have to support all the instructions in the initial instruction set, so it's better to err on the side of supplying too few instructions rather than too many. Remember, you can always expand the instruction set in a later version of the chip.

Hand in hand with selecting the optimal instruction set is allowing for easy future expansion of the chip. You must leave some undefined opcodes available so you can easily expand the instruction set later on. However, you must balance the number of undefined opcodes with the number of initial instructions and the size of your opcodes. For efficiency reasons, we want the opcodes to be as short as possible. We also need a reasonable set of instructions in the initial instruction set. A reasonable instruction set may consume most of the legal bit patterns in a small opcode. So a hard decision has to be made: reduce the number of instructions in the initial instruction set, increase the size of the opcode, or rely on an opcode prefix byte (which makes the newer instructions you add later longer). There is no easy answer to this problem; as the CPU designer, you must carefully weigh these choices during the initial CPU design. Unfortunately, you can't easily change your mind later on.

Most CPUs (Von Neumann architecture) use a binary encoding of instructions and fetch these instructions from memory. This chapter introduces the concept of binary instruction encoding via the hypothetical "Y86" processor. This is a trivial (and not very practical) CPU design that makes it easy to demonstrate how to choose opcodes for a simple instruction set, encode operands, and leave room for future expansion. Some of the more interesting features the Y86 demonstrates include the fact that an opcode often contains subfields and that we usually group instructions by the number and types of operands they support. The Y86 encoding also demonstrates how to use special opcodes to differentiate one group of instructions from another and to provide undefined (illegal) opcodes that we can use for future expansion.

The Y86 CPU is purely hypothetical and useful only as an educational tool. After exploring the design of a simple instruction set with the Y86, this chapter began to discuss the encoding of instructions on the 80x86 platform. While the full 80x86 instruction set is far too complex to discuss this early in this text (i.e., there are lots of instructions we still have to discuss later in this text), this chapter was able to discuss basic instruction encoding using the ADD instruction as an example. Note that this chapter only touches on the 80x86 instruction encoding scheme. For a full discussion of 80x86 encoding, see the appendices in this text and the Intel 80x86 documentation.




1. As in "Everything, including the kitchen sink."
2. Not to mention faster and less expensive.
3. To many CPU designers it is not; however, since this was a design goal for the 8086 we'll follow this path.
4. Assuming this operation treats its single operand as both a source and destination operand, a common way of handling this instruction.
5. Actually, Intel claims it's a one byte opcode plus a one-byte "mod-reg-r/m" byte. For our purposes, we'll treat the mod-reg-r/m byte as part of the opcode.
6. The Y86 processor only performs unsigned comparisons.
7. Technically, registers do not have an address, but we apply the term addressing mode to registers nonetheless.
8. All numeric constants in Y86 assembly language are given in hexadecimal. The "$" prefix is not necessary.
9. "Parse" means to figure out the meaning of the statement.
10. This program is written with Borland's Delphi and was not ported to Linux by the time this was written.
11. We could also have used values $F7, $EF, and $E7 since they also correspond to an attempt to store a register into a constant. However, $FF is easier to decode. On the other hand, if you need even more prefix bytes for instruction expansion, you can use these three values as well.


Lesson 10 - Structured Exception Handling (SEH) [19]

Everybody is talking about SEH. It seems to be an advanced topic that only hard-core experts use and understand. If you come from a high-level language like Delphi, Java or C++, you already know the concept. Handling exceptions as they appear is an important way to reduce the "crashability" of your application.

I can give you one example:

You code an application which tries to open a file, but the file is not there. Under Delphi GUI coding your application will crash, and you have no way to receive a true or false flag for the operation. But you can simply use exception handling to avoid the crash. With SEH you first "try" to do the operation. If something goes wrong, it causes an exception. When this exception is "thrown" you can have your application execute different code, and the application does not crash. That's all there is to it.

19. As with the Iczelion tutorials, we have included this article in full. It is a great document, so please respect the work of the author as we do.




Win32 Exception handling for assembler programmers by Jeremy Gordon - Background [20]

We're going to examine how to make an application more robust by handling its own exceptions, rather than permitting the system to do so. An "exception" is an offence committed by the program, which would otherwise result in the embarrassing appearance of the dreaded closure message box:-

or its more elaborate counterpart in Windows NT.

20. This lesson is the full article by Jeremy Gordon (Copyright © Jeremy Gordon 1996-2002). There was no need to write our own crappy lesson. This article still rulez. You can find the original article with your favourite search engine.



What exception handling does ...

The idea of exception handling (often called "Structured Exception Handling") is that your application installs one or more callback routines called "exception handlers" at run-time and then, if an exception occurs, the system will call the routine to let the application deal with the exception. The hope would be that the exception handler may be able to repair the exception and continue running either from the same area of code where the exception occurred, or from a "safe place" in the code as if nothing had happened. No closure message box would then be displayed and the user would be none the wiser. As part of this repair it may be necessary to close handles, close temporary files, free device contexts, free memory areas, inform other threads, then unwind the stack or close down the offending thread. During this process the exception handler may make a record of what it is doing and save this to a file for later analysis.

If a repair cannot be achieved, exception handling allows your application to close gracefully, having done as much clearing up, saving of data, and apologising as it can.




Planned exceptions

The Windows SDK suggests another use for exception handling. It is suggested as a way to keep track of memory usage. The idea is that an exception will occur if you need to commit more memory: you intercept it and carry out the memory allocation. This can be done by intercepting a memory access violation [exception number 0C0000005h], which would occur if your code tries to read from, or write to, memory which had not been committed.

Another way suggested to keep track of memory usage is to set the guard page flag in a call to VirtualAlloc when committing the memory, or later using VirtualProtect. This causes a guard page exception [080000001h] if an attempt is made to read from, or write to, a guarded area of memory, after which the guard page flag is released. The exception handler would therefore be kept informed of the memory requirements and could reset the flag if required.

These methods are widely used throughout the system; for example, as more stack is required by a thread, it is automatically enlarged.

An application, however, usually knows what it hopes to do next, so it is much simpler and quicker to keep track of memory requirements by keeping the top of the memory area as a data variable, and to check before the start of each series of memory read/write operations whether the memory area needs to be enlarged or diminished.

This works even if more than one thread uses the same area of memory, since the same data variable can be used by each thread. In that case, handling the 0C0000005h exception might only be a backup in case your code went wrong.



And what exception handling cannot do ...

Apart from divide by zero [exception code 0C0000094h], which can easily be avoided by protective coding, the most common type of exception is an attempt to read from, or write to, an illegal memory address [0C0000005h]. There are several ways that the second (illegal address) can arise. For example:-

- wrong index register values when addressing memory

- unexpected continuous loops involving memory access

- mismatch of PUSHes and POPs so execution continues from the wrong place after return from a CALL

- unforeseen corruption in input data files

It can be seen from this list that exceptions may occur in unexpected circumstances for a variety of reasons. And it will be precisely this type of exception which may terminate your program despite the best efforts of your exception handler. In these circumstances at the very least, the exception handler should try to save important data which would otherwise be lost, and then retire gracefully, with suitable apologies.

Other program failures

Your program may fail for other reasons which will not result in an exception at all.

The usual causes of this are:-

- insufficient system resources

- continuous loops in your program which do not involve memory access

The result is that your program will not be able to respond to system messages; it will appear to the user simply to have stopped. Luckily, however, because it runs in its own virtual address space, other programs will not be affected, although the whole system may appear to run a little more slowly.




Utterly fatal exceptions

Some errors are so bad that the system cannot even manage to call your exception handler. Then only if the user is lucky will the system's closure message box appear, or the devastating bright blue error screen will appear, showing that a "fatal" error has occurred. Almost inevitably this is a result of a total crash of the system and a reboot is the only remedy. Fortunately in Win32 you have to try quite hard to produce such errors, but they can still occur.

... and where exception handling really scores

Having spent some time on what exception handling cannot do, let's review the instances where it is invaluable:-

- During program development, to catch and report on errors as an alternative to debug control.

- When using code written by others which may not be fully trusted.

- When reading from, or writing to, memory areas which may be moved without notice. For example, while spelunking around system memory areas (which would be under system control) or memory areas which could possibly be closed by other processes or threads.

- Using pointers from files which may be corrupted or of the wrong format. Here exception handling would be much quicker than using the IsBadReadPtr or IsBadWritePtr APIs to check each pointer immediately prior to its use.

- As a general catch-all for all unforeseen bugs.



Exception handling in practice

The Windows sequence

In order to understand what your code can or should do when handling exceptions, you need to know in some more detail what the system does when an exception occurs. If you are new to the subject, the following may not yet be clear. However it is necessary to know these steps to understand the subject. The steps are as follows:-

1. Windows decides first whether it is an exception which it is willing to send to the program's exception handler. If so, if the program is being debugged, Windows will notify the debugger of the exception by suspending the program and sending EXCEPTION_DEBUG_EVENT (value 1h) to the debugger.

2. If the program is not being debugged or if the exception is not dealt with by the debugger, the system sends the exception to your per-thread exception handler if you have installed one. A per-thread handler is installed at run-time and is pointed to by the first dword in the Thread Information Block whose address is at FS:[0].

3. The per-thread exception handler can try to deal with the exception, or it may not do so, leaving it for handlers further up the chain, if there are any more handlers installed.

4. Eventually if none of the per-thread handlers deal with the exception, if the program is being debugged the system will again suspend the program and notify the debugger.

5. If the program is not being debugged or if the exception is still not dealt with by the debugger, the system will call your final handler if one is installed. This will be a final handler installed at run-time by the application using the API SetUnhandledExceptionFilter.

6. If your final handler does not deal with the exception after it returns, the system final handler will be called. Optionally it will show the system's closure message box. Depending on the registry settings, this box may give the user a chance to attach a debugger to the program. If no debugger can be attached or if the debugger is powerless to assist, the program is doomed and the system will call ExitProcess to terminate the program.

7. Before finally terminating the program, though, the system will cause a "final unwind" of the stack for the thread in which the exception occurred.
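The order of the steps above can be summarised in a toy model. The Python below is only a sketch of the sequence as this article describes it, not of any Windows API: the debugger, the per-thread handlers and the final handler are plain Python callables invented for the illustration, and each returns True when it has dealt with the exception.

```python
def dispatch_exception(debugger=None, thread_handlers=(), final_handler=None):
    """Return a label naming who dealt with the exception (toy model).

    debugger        -- optional callable(chance) -> bool, chance is 1 or 2
    thread_handlers -- callables() -> bool, most recently installed first
    final_handler   -- optional callable() -> bool
    """
    # Step 1: a debugger, if present, is notified first.
    if debugger is not None and debugger(1):
        return "debugger (first notification)"
    # Steps 2 and 3: walk the per-thread handlers, most recent first.
    for position, handler in enumerate(thread_handlers):
        if handler():
            return f"per-thread handler {position}"
    # Step 4: the debugger gets a second notification.
    if debugger is not None and debugger(2):
        return "debugger (second notification)"
    # Step 5: the application's final handler.
    if final_handler is not None and final_handler():
        return "final handler"
    # Steps 6 and 7: system final handler, final unwind, program terminated.
    return "system final handler"


if __name__ == "__main__":
    print(dispatch_exception(thread_handlers=[lambda: False, lambda: True]))
    print(dispatch_exception(final_handler=lambda: False))
```

The model leaves out everything a handler actually receives (exception record, context) and only shows who is asked and in what order.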




Advantages of using assembler for exception handling

Win32 provides only the framework for exception handling, using a handful of APIs. So most of the code required for exception handling has to be coded by hand.

"C" programmers will use various shortcuts provided by their compilers by including in their source code statements such as _try, _except, _finally, _catch and _throw.

One real disadvantage in relying on the compiler's code is that it can enlarge the final exe file enormously.

Also most C programmers would have no idea what code is produced by the compiler when exception handling is used, and this is a real disadvantage because to handle exceptions properly you need flexibility, understanding and control. This is because exceptions can be intercepted and handled in various ways and at various different levels in your code. Using assembler you can produce tight, reliable and flexible code which you can tailor closely to your own application.

Multi-threaded applications need particularly careful treatment and assembler provides a simple and versatile way to add exception handling to such programs.

Information about exception handling at a low level is hard to get hold of, and the samples in the Win32 Software Development Kit (SDK) concentrate on how to use the "C" compiler statements rather than how to hard-wire a program to use the Win32 framework itself.

The information in this article was obtained using a test program and a debugger, and by disassembling code produced by "C" compilers. The accompanying programs, Except1.exe and Except2.exe, demonstrate the techniques described here.



Setting up simple exception handlers

I hope you will be pleasantly surprised to see in practice how easy it is in assembler to add exception handling to your programs.

The two types of exception handlers

As you have seen above, there are two types of exception handlers.

Type 1 - the "final" exception handler

The "final" exception handler is called by the system if your program is doomed to close. Because this handler is process-specific it is called irrespective of which thread caused the exception.

Establishing a final exception handler

Typically, this is established in the main thread as soon as possible after the program entry point by calling the API SetUnhandledExceptionFilter. It therefore covers the whole program from that point until termination. There is no need to remove the handler on termination - this is done automatically by Windows.




Example

    START:                              ;program entry point
    PUSH ADDR FINAL_HANDLER
    CALL SetUnhandledExceptionFilter
    ; ...                               ;
    ; ...                               ;code covered by final handler
    ; ...                               ;
    CALL ExitProcess
    ;************************************
    FINAL_HANDLER:
    ; ...                               ;
    ; ...                               ;code to provide a polite exit
    ; ...                               ;
    ;(eax=-1 reload context and continue)
    MOV EAX,1                           ;eax=1 stops display of closure box
    RET                                 ;eax=0 enables display of the box

No chaining of final exception handlers

There can only be one application-defined final exception handler in the process at any one time. If SetUnhandledExceptionFilter is called a second time in your code the address of the final exception handler is simply changed to the new value, and the previous one is discarded.

Type 2 - the "per-thread" exception handler

This type of handler is typically used to guard certain areas of code and is established by altering the value held by the system at FS:[0]. Each thread in your program has a different value for the segment register FS, so this exception handler will be thread specific. It will be called if an exception occurs during the execution of code protected by the handler.



The value in FS is a 16-bit selector which points to the "Thread Information Block", a structure which contains important information about each thread. The very first dword in the Thread Information Block points to a structure which we are going to call an "ERR" structure.

The "ERR" structure is at least 2 dwords as follows:-

    1st dword   +0   Pointer to next ERR structure
    2nd dword   +4   Pointer to own exception handler

Establishing a "per-thread" exception handler

So now we can see how easy it is to establish this type of exception handler:-

Example

    PUSH ADDR HANDLER       ;
    FS PUSH [0]             ;address of next ERR structure
    FS MOV [0],ESP          ;give FS:[0] the ERR address just made
    ...                     ;
    ...                     ;the code protected by the handler goes here
    ...                     ;
    FS POP [0]              ;restore next ERR structure to FS:[0]
    ADD ESP,4h              ;throw away rest of ERR structure
    RET                     ;
    ;***********************
    HANDLER:                ;
    ...                     ;
    ...                     ;exception handler code goes here
    ...                     ;
    MOV EAX,1               ;eax=1 go to next handler
                            ;eax=0 reload context & continue execution
    RET



Chaining of per-thread exception handlers

In the above code we can see that the 2nd dword of the ERR structure, which is the address of your handler, is put on the stack first, then the 1st dword of the next ERR structure is put on the stack by the instruction FS PUSH [0]. Suppose the code which was then protected by this handler called other functions which needed their own individual protection. Then you may create another ERR structure and handler to protect that code in exactly the same way. This is called chaining. In practice this means that when an exception occurs the system will walk the handler chain by first calling the exception handler most recently established before the code where the exception occurred. If that handler does not deal with the exception (returning EAX=1), then the system calls the next handler up the chain. Since each ERR structure contains the address of the next handler up the chain, any number of such handlers can be established in this way. Each handler might guard against or deal with particular types of exceptions depending on what is foreseeable in your code. The stack is used to keep the ERR structure, to avoid write-overs. However there is nothing to stop you using other parts of memory for the ERR structures if you prefer.
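The walk of the handler chain described above can be modelled in a few lines of Python. This is only a toy sketch, not Win32 code: the `Err` class, the `dispatch` function and the three handlers are hypothetical names invented for the illustration, and the end-of-chain marker of -1 is the value the system is said to keep in the last ERR structure.

```python
# Toy model of the per-thread handler chain; this is not the Win32 API.
END_OF_CHAIN = -1   # the system keeps -1 in the last ERR structure


class Err:
    """Stands in for an ERR structure: next pointer plus handler address."""

    def __init__(self, next_err, handler):
        self.next_err = next_err
        self.handler = handler


def dispatch(first_err, exception_code):
    """Walk the chain from the newest ERR; report who was called."""
    called = []
    err = first_err
    while err != END_OF_CHAIN:
        called.append(err.handler.__name__)
        if err.handler(exception_code) == 0:    # eax=0: dealt with it
            return called, True
        err = err.next_err                      # eax=1: go to next handler
    return called, False                        # nobody dealt with it


def handler_1(code):
    return 0 if code == 0xC0000094 else 1       # only deals with divide by zero


def handler_2(code):
    return 1                                    # never deals with anything


def handler_3(code):
    return 0 if code == 0xC0000005 else 1       # only deals with memory violations


# Established in the order 1, 2, 3, so handler_3 is the one found at FS:[0].
err1 = Err(END_OF_CHAIN, handler_1)
err2 = Err(err1, handler_2)
err3 = Err(err2, handler_3)

print(dispatch(err3, 0xC0000094))
```

In this model a divide by zero is passed from the newest handler up the chain until the oldest one accepts it, which is the order of calls the text describes.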




Stack unwinds

We're going to look at stack unwinds at this point because they shouldn't keep their mystery any longer! A "stack unwind" sounds very dramatic, but in practice it's simply all about calling the exception handlers whose local data is held further down the stack and then (probably) continuing execution from another stack frame. In other words the program gets ready to ignore the stack contents between these two positions.

Suppose you have a chain of per-thread handlers established in an arrangement where Function A calls Function B which calls Function C.

Then the stack will look something like this:-

    Stack +ve (addresses increase downwards)

    Use of stack by Function C
    Handler 3                     )
    Local Data Function C         ) 3rd stack frame
    Return address Function C     )
    Use of stack by Function B
    Handler 2                     )
    Local Data Function B         ) 2nd stack frame
    Return address Function B     )
    Use of stack by Function A
    Handler 1                     )
    Local Data Function A         ) 1st stack frame
    Return address Function A     )

Here as each function is called things are PUSHed onto the stack: firstly the return address, then local data, and then the exception handler (this is the "ERR" structure referred to earlier).




Then suppose that an exception occurs in Function C. As we have seen, the system will cause a walk of the handler chain. Handler 3 will be called first. Suppose Handler 3 does not deal with the exception (returning EAX=1), then Handler 2 will be called. Suppose Handler 2 also returns EAX=1 so that Handler 1 is called. If Handler 1 deals with the exception, it may need to cause a clear-up using local data in the stack frames created by Functions B and C.

It can do so by causing an Unwind.

This simply repeats the walk of the handler chain again, causing first Handler 3 then Handler 2, then Handler 1 to be called in turn.

The differences between this type of handler chain walk and the walk initiated by the system when the exception first occurred are as follows:-

1. This handler walk is initiated by your handler rather than by the system.

2. The exception flag in the EXCEPTION_RECORD should be set to 2h (EH_UNWINDING). This indicates to the per-thread handler that it is being called by another handler higher in the chain to clear-up using local data. It should not attempt to do any more than that and it must return EAX=1.

3. The handler walk stops at the handler immediately before the caller. For example in the diagram, if Handler 1 initiates the unwind, the last Handler to be called during the unwind is Handler 2. There is no need for Handler 1 to be called from within itself because it has access to its own local data to clear-up.

You can see below ("Providing access to local data") how the handler is able to find local data during the handler walk.
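The three differences listed above can be seen in a small Python model of an unwind. It is a sketch only: `make_handler`, `unwind` and the dictionary standing in for the EXCEPTION_RECORD are hypothetical names made up for the illustration and have nothing to do with the real RtlUnwind.

```python
# Toy model of an unwind; the names are illustrative, not Win32 API calls.
EH_UNWINDING = 0x2      # exception flag value used while unwinding

cleared_up = []         # frames whose clear-up code has run


def make_handler(name):
    """Build a per-thread handler that clears up only when unwinding."""
    def handler(record):
        if record["ExceptionFlag"] & EH_UNWINDING:
            cleared_up.append(name)    # clear-up using the frame's local data
        return 1                       # eax=1, as required during an unwind
    return handler


def unwind(chain, initiator, record):
    """Call every handler established after the initiator, newest first."""
    record["ExceptionFlag"] |= EH_UNWINDING
    for handler in chain:
        if handler is initiator:       # the walk stops before the caller
            break
        handler(record)


handler_1 = make_handler("Function A")
handler_2 = make_handler("Function B")
handler_3 = make_handler("Function C")
chain = [handler_3, handler_2, handler_1]    # newest handler first

record = {"ExceptionCode": 0xC0000005, "ExceptionFlag": 0}
unwind(chain, handler_1, record)             # Handler 1 initiates the unwind
print(cleared_up)
```

Handler 1 starts the walk itself, the flag is set to 2h before any handler is called, and the walk ends at Handler 2 without Handler 1 being called from within itself.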



How the unwind is done

The handler can initiate an unwind using the API RtlUnwind or, as we shall see, it can also easily be done using your own code. This API can be called as follows:-

    PUSH Return value
    PUSH pExceptionRecord
    PUSH ADDR CodeLabel
    PUSH LastStackFrame
    CALL RtlUnwind

Where:-

Return value is said to give a return value after the unwind (you would probably not use this)

pExceptionRecord is a pointer to the exception record, which is one of the structures sent to the handler when an exception occurs

CodeLabel is a place from which execution should continue after the unwind and is typically the code address immediately after the call to RtlUnwind. If this is not specified the API appears to return in the normal way, however the SDK suggests that it should be used and it is better to play safe with this type of API

LastStackFrame is the stack frame at which the unwind should stop. Typically this will be the stack address of the ERR structure which contains the address of the handler which is initiating the unwind

Unlike other APIs you cannot rely on RtlUnwind saving the EBX, ESI or EDI registers - if you are using these in your code you should ensure that they are saved prior to PUSHing the first parameter and restored after the CodeLabel




Own-code Unwind

The following code simulates the unwind (where ebx holds the address of the EXCEPTION_RECORD structure sent to the handler):-

    MOV D[EBX+4],2h         ;make the exception flag EH_UNWINDING
    FS MOV EDI,[0]          ;get 1st per-thread handler address
    L2:                     ;
    CMP D[EDI],-1           ;see if it's the last one
    JZ >L3                  ;yes, so finish
    PUSH EDI,EBX            ;push ERR structure, EXCEPTION_RECORD
    CALL [EDI+4]            ;call handler to run clear-up code
    ADD ESP,8h              ;remove the two parameters pushed
    MOV EDI,[EDI]           ;get pointer to next ERR structure
    JMP L2                  ;and do next if not at end
    L3:                     ;code label when finished

Here each handler is called in turn with the ExceptionFlag set to 2h until the last handler is reached (the system has a value of -1 in the last ERR structure).

The above code does not check for corruption of the values at [EDI] and at [EDI+4]. The first is a stack address and could be checked by ensuring that it is above the lower limit of the thread's stack given by FS:[8] and below the top of the thread's stack given by FS:[4]. The second is a code address and so you could check that it lies within two code labels, one at the start of your code and one at the end of it. Alternatively you could check that [EDI] and [EDI+4] could be read by calling the API IsBadReadPtr.



Unwind by final handler then continue

It is not just a per-thread handler which can initiate a stack unwind. It can also be done in your final handler by calling either RtlUnwind or an own-code unwind and then returning EAX= -1. (See "Continuing execution after final handler called").

Final unwind then terminate

If a final handler is installed and it returns either EAX=0 or EAX=1, the system will cause the process to terminate. However, before final termination something interesting happens. The system does a final unwind by going back to the very first handler in the chain (that is to say, the handler guarding the code in which the exception occurred). This is the very last opportunity for your handler to execute the clear-up code necessary within each stack frame. You can see this final unwind clearly occurring if you set the accompanying demo program Except2.exe to allow the exception to go to the final handler and press either F3 or F5 when there. It also happens in the simpler Except1.exe program.


The information sent to the handlers

Clearly sufficient information must be sent to the handlers for them to be able to try to repair the exception, make error logs, or report to the user. As we shall see, this information is sent by the system itself on the stack, when the handlers are called. In addition to this you can send your own information to the handlers by enlarging the ERR structure so that it contains more information.

The information sent to the final handler

The final handler is documented in the Windows Software Development Kit ("SDK") as the API "UnhandledExceptionFilter". It receives one parameter only, a pointer to the structure EXCEPTION_POINTERS. This structure is as follows:-

    EXCEPTION_POINTERS
    +0    Pointer to structure:- EXCEPTION_RECORD
    +4    Pointer to structure:- CONTEXT record

The structure EXCEPTION_RECORD has these fields:-

    EXCEPTION_RECORD
    +0    ExceptionCode
    +4    ExceptionFlag
    +8    NestedExceptionRecord
    +C    ExceptionAddress
    +10   NumberParameters
    +14   AdditionalData




Where ExceptionCode gives the type of exception which has occurred. There are a number of these listed in the SDK and header files, but in practice, the types which you may come across are:-

C0000005h - Read or write memory violation

C0000094h - Divide by zero

C0000095h - Divide overflow

C00000FDh - The stack went beyond the maximum available size

80000001h - Violation of a guard page in memory set up using VirtualAlloc

The following only occur whilst dealing with exceptions:-

C0000025h - A non-continuable exception - the handler should not try to deal with it

C0000026h - Exception code used by the system during exception handling. This code might be used if the system encounters an unexpected return from a handler. It is also used if no Exception Record is supplied when calling RtlUnwind.

The following are used in debugging:-

80000003h - Breakpoint occurred because there was an INT3 in the code

80000004h - Single step during debugging

Own user code - this would be sent by your own application by calling the API RaiseException. This is a quick way to exit code directly into your handler if required.

The exception codes follow these rules:

    Bits 31-30    0=success, 1=information, 2=warning, 3=error
    Bit 29        0=Microsoft, 1=Application
    Bit 28        Reserved, must be zero
    Bits 27-0     For exception code

A typical own exception code sent by RaiseException might therefore be E0000100h (error, application, code=100h).
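The bit layout of the exception codes is easy to check by pulling a code apart. The following Python sketch does so; `decode_exception_code` is a hypothetical helper written for this illustration and is not part of any SDK.

```python
# Split an exception code into the bit fields described in the text.
SEVERITY = {0: "success", 1: "information", 2: "warning", 3: "error"}


def decode_exception_code(code):
    """Return the fields of a 32-bit exception code as a dictionary."""
    return {
        "severity": SEVERITY[(code >> 30) & 0x3],   # bits 31-30
        "application": bool((code >> 29) & 0x1),    # bit 29
        "reserved": (code >> 28) & 0x1,             # bit 28, must be zero
        "code": code & 0x0FFFFFFF,                  # bits 27-0
    }


for value in (0xE0000100, 0xC0000005, 0x80000003):
    print(hex(value), decode_exception_code(value))
```

Decoding E0000100h gives error, application and code 100h, while the system codes C0000005h and 80000003h come out as a Microsoft error and a Microsoft warning respectively.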

Exception flag which gives instructions to the handler. The values can be:-

0 - a continuable exception (can be repaired)

1 - a non-continuable exception (cannot be repaired)

2 - the stack is unwinding - do not try to repair

Nested exception record pointing to another EXCEPTION_RECORD structure if the handler itself has caused another exception

Exception address - the address in code where the exception occurred

NumberParameters - number of dwords to follow in Additional information

Additional information - array of dwords with further information

This can either be information sent by the application itself when calling RaiseException, or, if the exception code is C0000005h it will be as follows:-

1st dword - 0=a read violation, 1=a write violation.

2nd dword - address of access violation

The second part of the EXCEPTION_POINTERS structure which is sent to the final handler points to the CONTEXT record structure which contains the processor-specific values of all the registers at the time of the exception. WINNT.H contains the CONTEXT structures for various processors. Your program can find out what sort of processor is being used by calling GetSystemInfo. CONTEXT is as follows for IA32 (Intel 386 and upwards):-




+0 context flags (used when calling GetThreadContext)

DEBUG REGISTERS

+4 debug register #0

+8 debug register #1

+C debug register #2

+10 debug register #3

+14 debug register #6

+18 debug register #7




FLOATING POINT / MMX registers

+1C ControlWord

+20 StatusWord

+24 TagWord

+28 ErrorOffset

+2C ErrorSelector

+30 DataOffset

+34 DataSelector

+38 FP registers x 8 (10 bytes each)

+88 Cr0NpxState

SEGMENT REGISTERS

+8C gs register

+90 fs register

+94 es register

+98 ds register

ORDINARY REGISTERS

+9C edi register

+A0 esi register

+A4 ebx register

+A8 edx register

+AC ecx register

+B0 eax register

CONTROL REGISTERS

+B4 ebp register

+B8 eip register

+BC cs register

+C0 eflags register

+C4 esp register

+C8 ss register
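The offsets in this table can be cross-checked by describing the same layout with Python's ctypes module and asking it where each field lands. This is a sketch for checking arithmetic only: the field names follow those used in WINNT.H, the structure stops at the ss register as the table does, and nothing here talks to Windows.

```python
import ctypes

DWORD = ctypes.c_uint32


class FLOATING_SAVE_AREA(ctypes.Structure):
    _fields_ = [
        ("ControlWord", DWORD), ("StatusWord", DWORD), ("TagWord", DWORD),
        ("ErrorOffset", DWORD), ("ErrorSelector", DWORD),
        ("DataOffset", DWORD), ("DataSelector", DWORD),
        ("RegisterArea", ctypes.c_ubyte * 80),   # 8 FP registers x 10 bytes
        ("Cr0NpxState", DWORD),
    ]


class CONTEXT(ctypes.Structure):
    _fields_ = [
        ("ContextFlags", DWORD),
        ("Dr0", DWORD), ("Dr1", DWORD), ("Dr2", DWORD),
        ("Dr3", DWORD), ("Dr6", DWORD), ("Dr7", DWORD),
        ("FloatSave", FLOATING_SAVE_AREA),
        ("SegGs", DWORD), ("SegFs", DWORD), ("SegEs", DWORD), ("SegDs", DWORD),
        ("Edi", DWORD), ("Esi", DWORD), ("Ebx", DWORD),
        ("Edx", DWORD), ("Ecx", DWORD), ("Eax", DWORD),
        ("Ebp", DWORD), ("Eip", DWORD), ("SegCs", DWORD),
        ("EFlags", DWORD), ("Esp", DWORD), ("SegSs", DWORD),
    ]


# Print the offsets the handlers in this article rely on.
for name in ("FloatSave", "SegGs", "Edi", "Ebp", "Eip", "EFlags", "Esp", "SegSs"):
    print(name, hex(getattr(CONTEXT, name).offset))
```

The values printed for Ebp, Eip, EFlags and Esp are the ones a handler uses when it alters the context before returning to the system.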



The information sent to the per-thread handlers

At the time of the call to the per-thread handler, ESP points to three structures as follows:-

ESP+4 Pointer to structure:- EXCEPTION_RECORD

ESP+8 Pointer to own ERR structure

ESP+C Pointer to structure:- CONTEXT record

Unlike usual CALLBACKs in Windows, when the per-thread handler is called, the C calling convention is used (caller to remove the arguments from the stack) not the PASCAL convention (function to do so). This can be seen from the actual Kernel32 code used to make the call:-

    PUSH Param, CONTEXT record, ERR, EXCEPTION_RECORD
    CALL HANDLER
    ADD ESP,10h

In practice the first argument, Param, was not found to contain meaningful information.




The EXCEPTION_RECORD and CONTEXT record structures have already been described above.

The ERR structure is the structure you created on the stack when the handler was established and it must contain the pointer to the next ERR structure and the code address of the handler now being installed (see "Setting up simple exception handlers", above). The pointer to the ERR structure passed to the per-thread handler is to the top of this structure. It is possible, therefore, to enlarge the ERR structure so that the handler can receive additional information.

In a typical arrangement the ERR structure might look like this, where [ESP+8h] points to the top of this structure when the handler is called:-

    ERR
    +0    Pointer to next ERR structure
    +4    Pointer to own exception handler
    +8    Code address of "safe-place" for handler
    +C    Information for handler
    +10   Area for flags
    +14   Value of EBP at safe-place

As we shall see below ("Continuing execution from a safe-place"), the fields at +8 and +14 may be used by the handler to recover from the exception.



Providing access to local data

Let's now consider the best position of the ERR structure on the stack relative to the stack frame, which may well hold local data variables. This is important because the handler may well need access to this local data in order to clear-up properly. Here is some typical code which may be used to establish a per-thread handler where there is local data:-

    MYFUNCTION:             ;procedure entry point
    PUSH EBP                ;save ebp (used to address stack frame)
    MOV EBP,ESP             ;use EBP as stack frame pointer
    SUB ESP,40h             ;make 16 dwords on stack for local data
    ;******** local data now at [EBP-4] to [EBP-40h]
    ;********** install handler and its ERR structure
    PUSH EBP                ;ERR+14h save ebp (being ebp at safe-place)
    PUSH 0                  ;ERR+10h area for flags
    PUSH 0                  ;ERR+0Ch information for handler
    PUSH ADDR SAFE_PLACE    ;ERR+8h new eip at safe-place
    PUSH ADDR HANDLER       ;ERR+4h address of handler
    FS PUSH [0]             ;ERR+0h keep next ERR up the chain
    FS MOV [0],ESP          ;point to ERR just made on the stack
    ...                     ;
    ...                     ;code which is protected goes here
    ...                     ;
    JMP >L10                ;normal end if there is no exception
    SAFE_PLACE:             ;handler sets eip/esp/ebp for here
    L10:                    ;
    FS POP [0]              ;restore next ERR up the chain
    MOV ESP,EBP
    POP EBP
    RET
    ;*****************
    HANDLER:
    RET




Using this code, when the handler is called, the following is on the stack, and with [ESP+8h] pointing to the top of the ERR structure (ie. ERR+0):-

    Stack +ve (addresses increase downwards)

    ERR +0    Pointer to next ERR structure
    ERR +4    Pointer to own exception handler
    ERR +8    Code address of "safe-place" for handler
    ERR +C    Information for handler
    ERR +10   Area for flags
    ERR +14   Value of EBP at safe-place
        +18   Local Data
        +1C   Local Data
        +20   Local Data
              more local data

You can see from this that since the handler is given a pointer to the ERR structure it can also find the address of local data on the stack. This is because the handler knows the size of the ERR structure and also the position of the local data on the stack. If the EBP field is used at ERR+14h as in the above example, that could also be used as a pointer to the local data.
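The arithmetic behind this layout can be worked through in Python. The sketch below follows the PUSH sequence in MYFUNCTION; the value used for EBP is an arbitrary, hypothetical stack address chosen only so that there are numbers to print.

```python
# Work out where MYFUNCTION's ERR structure and local data sit on the stack.
LOCAL_BYTES = 0x40   # SUB ESP,40h reserves 16 dwords of local data
ERR_SIZE = 0x18      # six dwords are PUSHed to build the ERR structure


def layout(ebp):
    """Addresses seen from inside MYFUNCTION, given its frame pointer."""
    err = ebp - LOCAL_BYTES - ERR_SIZE      # ESP after FS PUSH [0]
    return {
        "err": err,                         # ERR+0, the value put in FS:[0]
        "saved_ebp_field": err + 0x14,      # ERR+14h
        "first_local": ebp - LOCAL_BYTES,   # [EBP-40h]
        "last_local": ebp - 4,              # [EBP-4]
    }


def locals_from_err(err):
    """What a handler can work out from the ERR pointer alone."""
    first = err + ERR_SIZE                  # local data starts at ERR+18h
    last = first + LOCAL_BYTES - 4
    return first, last


frame = layout(0x0012FF80)                  # hypothetical EBP value
print({name: hex(address) for name, address in frame.items()})
print([hex(address) for address in locals_from_err(frame["err"])])
```

Both routes arrive at the same addresses for the local data, which is why the handler needs nothing more than the ERR pointer it is given.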



Recovering from and Repairing an exception

Continuing execution from a safe-place

Choosing the safe-place

You need to continue execution from a place in the code which will not cause further problems. The main thing you must bear in mind is that since your program is designed to work within the Windows framework, your aim is to return to the system as soon as possible in a controlled manner, so that you can wait for the next system event. If the exception has occurred during the call by the system to a window procedure, then often a good safe-place will be near the exit point of the window procedure so that control passes back to the system cleanly. In this case it will simply appear to the system that your application has returned from the window procedure in the usual way.

If the exception has occurred, however, in code where there is no window procedure, then you may need to exercise more control. For example, a thread established to do certain tasks will probably need to be terminated, reporting to the main thread that it could not complete the task.

Another major consideration is how easy it is to get the correct EIP, ESP and EBP values at the safe-place. As we can see below, this may not be at all difficult.

There are so many possible permutations here it is probably pointless to postulate them. The precise safe-place will depend on the nature of your code and the use you are making of exception handling.




Example of how to get to safe-place

As an example, though, look again at the code example above in MYFUNCTION. You can see the code label "SAFE_PLACE". This is a code address from which execution could continue safely, the handler having done all necessary clearing up.

In the code example, in order to continue execution successfully, it must be borne in mind that although SAFE_PLACE is within the same stack frame as the exception occurred, the values of ESP and EBP need carefully to be set by the handler before execution continues from EIP.

These 3 registers therefore need to be set and for the following reasons:-

- ESP - to enable the FS POP [0] instruction to work and to POP other values if necessary

- EBP - to ensure that local data can be addressed within the handler and to restore the correct ESP value to return from MYFUNCTION

- EIP - to cause execution to continue from SAFE_PLACE

Now you can see that each of these values is readily available from within the handler function. The correct ESP value is, in fact, exactly the same as the top of the ERR structure itself (given by [ESP+8h] when the handler is called). The correct EBP value is available from ERR+14h, because this was PUSHed onto the stack when the ERR structure was made. And the correct code address of SAFE_PLACE to give to EIP is at ERR+8h.

Now we are ready to see how the handler can ensure that execution continues from a safe-place, instead of allowing the process to close, should an exception occur.



    HANDLER:
    PUSH EBP
    MOV EBP,ESP
    ;** now [EBP+8]=pointer to EXCEPTION_RECORD
    ;** [EBP+0Ch]=pointer to ERR structure
    ;** [EBP+10h]=pointer to CONTEXT record
    PUSH EBX,EDI,ESI        ;save registers as required by windows
    MOV EBX,[EBP+8]         ;get exception record in ebx
    TEST D[EBX+4],1h        ;see if its a non-continuable exception
    JNZ >L5                 ;yes, so must not deal with it
    TEST D[EBX+4],2h        ;see if its EH_UNWINDING (from Unwind)
    JZ >L2                  ;no
    ...                     ;
    ...                     ;clear-up code when unwinding
    ...                     ;
    JMP >L5                 ;must return 1 to go to next handler
    L2:
    PUSH 0                  ;return value (not used)
    PUSH [EBP+8h]           ;pointer to this exception record
    PUSH ADDR UN23          ;code address for RtlUnwind to return
    PUSH [EBP+0Ch]          ;pointer to this ERR structure
    CALL RtlUnwind
    UN23:
    MOV ESI,[EBP+10h]       ;get context record in esi
    MOV EDX,[EBP+0Ch]       ;get pointer to ERR structure
    MOV [ESI+0C4h],EDX      ;use it as new esp
    MOV EAX,[EDX+8]         ;get safe place given in ERR structure
    MOV [ESI+0B8h],EAX      ;insert new eip
    MOV EAX,[EDX+14h]       ;get ebp at safe place given in ERR
    MOV [ESI+0B4h],EAX      ;insert new ebp
    XOR EAX,EAX             ;reload context & return to system eax=0
    JMP >L6
    L5:
    MOV EAX,1               ;go to next handler - return eax=1
    L6:
    POP ESI,EDI,EBX
    MOV ESP,EBP
    POP EBP
    RET                     ;ordinary return (no actual arguments)
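The part of the handler which alters the context comes down to three copies, and these can be followed in a small Python model. In this sketch the stack is a dictionary of dwords keyed by address, the context is a dictionary keyed by offset, and all the addresses are hypothetical values chosen for the illustration.

```python
# Model of the three context writes made by the handler before returning 0.
CONTEXT_EBP = 0xB4
CONTEXT_EIP = 0xB8
CONTEXT_ESP = 0xC4


def continue_from_safe_place(memory, err, context):
    """Patch the context so execution resumes at the safe-place."""
    context[CONTEXT_ESP] = err                  # new esp is the ERR address
    context[CONTEXT_EIP] = memory[err + 0x8]    # safe-place held at ERR+8h
    context[CONTEXT_EBP] = memory[err + 0x14]   # ebp held at ERR+14h
    return 0                                    # eax=0, reload the context


ERR = 0x0012FF28                                # hypothetical ERR address
memory = {
    ERR + 0x8: 0x00401050,                      # hypothetical SAFE_PLACE address
    ERR + 0x14: 0x0012FF80,                     # ebp saved when ERR was built
}
# Register values at the time of a made-up exception.
context = {CONTEXT_EBP: 0x0012FE00, CONTEXT_EIP: 0x00401234, CONTEXT_ESP: 0x0012FD00}

result = continue_from_safe_place(memory, ERR, context)
print(result, {hex(offset): hex(value) for offset, value in context.items()})
```

After the call the context holds the safe-place address in eip, the ERR address in esp and the saved frame pointer in ebp, so that FS POP [0] at the safe-place finds the right value on top of the stack.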




Repairing the exception

In the above example you saw the context being loaded with the new eip, ebp and esp to cause execution to continue from a safe-place. It may be possible using the same method of replacing the values for some of the registers in the context, to "repair" the exception, permitting execution to continue from near the offending code, so that the current task can be continued.

An obvious example would be a divide by zero, which can be repaired by the handler by substituting the value 1 for the divisor, and then a return with EAX=0 (if a "per-thread" handler) causing the system to reload the context and continue execution.

In the case of memory violations, you can make use of the fact that the address of the memory violation is passed as the second dword in the additional information field of the exception record. The handler can use this very same value to pass to VirtualAlloc to commit more memory starting at that place. If this is successful, the handler can then reload the context (unchanged) and return EAX=0 to continue execution (in the case of a "per-thread" handler).




Continuing execution after final handler called

If you wish you can deal with exceptions in the final handler. You recall that at the beginning of this article I said that the final handler is called by the system when the process is about to be terminated.

This is true.

The returns in EAX from the final handler are not the same as those from the per-thread handler. If the return is EAX=1 the process terminates without showing the system's closure message box, and if EAX=0 the box is shown.

However, there is also a third return code, EAX= -1 which is properly described in the SDK as "EXCEPTION_CONTINUE_EXECUTION". This return has the same effect as returning EAX=0 from a per-thread handler, that is, it reloads the context record into the processor and continues execution from the eip given in the context. Of course, the final handler may change the context record before returning to the system, in the same way as a per-thread handler might do so. In this way the final handler can recover from the exception by continuing execution from a suitable safe-place or it may try to repair the exception.
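Because the same number means different things to the two types of handler, it may help to see the return values side by side. The Python sketch below simply tabulates what the text says; the names for 1 and 0 are taken from the SDK headers rather than from this article, and the two functions are hypothetical helpers for the illustration.

```python
# Return values in EAX and what the system does with them, per handler type.
EXCEPTION_EXECUTE_HANDLER = 1
EXCEPTION_CONTINUE_SEARCH = 0
EXCEPTION_CONTINUE_EXECUTION = -1


def after_final_handler(eax):
    """Outcome of a return from the final handler."""
    outcomes = {
        EXCEPTION_EXECUTE_HANDLER: "terminate, closure box not shown",
        EXCEPTION_CONTINUE_SEARCH: "terminate, closure box shown",
        EXCEPTION_CONTINUE_EXECUTION: "reload context and continue",
    }
    return outcomes[eax]


def after_per_thread_handler(eax):
    """Outcome of a return from a per-thread handler."""
    outcomes = {
        0: "reload context and continue",
        1: "go to next handler",
    }
    return outcomes[eax]


for value in (1, 0, -1):
    print("final handler, eax =", value, "->", after_final_handler(value))
for value in (0, 1):
    print("per-thread handler, eax =", value, "->", after_per_thread_handler(value))
```

The point to notice is that continuing execution needs EAX=-1 from the final handler but EAX=0 from a per-thread handler.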

If you use the final handler to deal with all exceptions instead of using per-thread handlers you do lose some flexibility, though.



Firstly, you cannot nest final handlers. You can only have one working final handler established by SetUnhandledExceptionFilter in your code at any one time. You could, if you wished, change the address of the final handler as different parts of your code are being processed. SetUnhandledExceptionFilter returns the address of the final handler being replaced so you could make use of this as follows:-

    PUSH ADDR FINAL_HANDLER
    CALL SetUnhandledExceptionFilter
    PUSH EAX                            ;keep address of previous handler
    ...                                 ;
    ...                                 ;this is the code
    ...                                 ;being guarded
    ...                                 ;
    CALL SetUnhandledExceptionFilter    ;restore previous handler

Note here that at the time of the second call to SetUnhandledExceptionFilter the address of the previous handler is already on the stack because of the earlier PUSH EAX instruction.

Another difficulty with using the final handler is that the information sent to it is limited to the exception record and the context record. Therefore you will need to keep the code address of the safe-place, and the values of ESP and EBP at that safe-place, in static memory. This can be done easily at run time. For example, when dealing with the WM_COMMAND message within a window procedure,

    PROCESS_COMMAND:            ;called on uMsg=111h (WM_COMMAND)
    MOV EBPSAFE_PLACE,EBP       ;keep ebp at safe-place
    MOV ESPSAFE_PLACE,ESP       ;keep esp at safe-place
    ...                         ;
    ...                         ;protected code here
    ...                         ;
    SAFE_PLACE:                 ;code-label for safe-place
    XOR EAX,EAX                 ;return eax=0=message processed
    RET




In the above example, in order to repair the exception by continuing execution from the safe-place, the handler would insert the values of EBPSAFE_PLACE at CONTEXT+0B4h (ebp), ESPSAFE_PLACE at CONTEXT+0C4h (esp), and ADDR SAFE_PLACE into CONTEXT+0B8h (eip) and then return -1.

Note that in a stack unwind forced by the system because of a fatal exit, only the "per-thread" handlers (if any) and not the final handler are called. If there are no "per-thread" handlers, the final handler would have to deal with all clearing-up itself before returning to the system.



Single-stepping by setting the trap flag within the handler

You can make a simple single-step tester for your program while it is under development by using the handler's ability to set the trap flag in the register context before returning to the system. You can arrange for the handler to display the results on the screen, or to dump them to a file. This may be useful if you suspect that results are being altered under debugger control, or if you need to see quickly how a particular piece of code responds to various inputs. Insert the following code fragment where you want single-stepping to begin:-

    MOV D[SSCOUNT],5
    INT 3

SSCOUNT is a data symbol and is set to the number of steps the handler should do before returning to normal operation. The INT 3 causes a 80000003h exception, so your handler is called.




The code in your development program should be protected by a per-thread handler using code like this:-

    SS_HANDLER:
    PUSH EBP
    MOV EBP,ESP
    PUSH EBX,EDI,ESI        ;save registers as required by Windows
    MOV EBX,[EBP+8]         ;get exception record in ebx
    TEST D[EBX+4],01h       ;see if its a non-continuable exception
    JNZ >L14                ;yes
    TEST D[EBX+4],02h       ;see if EH_UNWINDING
    JNZ >L14                ;yes
    MOV ESI,[EBP+10h]       ;get context record in esi
    MOV EAX,[EBX]           ;get ExceptionCode
    CMP EAX,80000004h       ;see if here because trap flag set
    JZ >L10                 ;yes
    CMP EAX,80000003h       ;see if its own INT 3 inserted to single-step
    JNZ >L14                ;no
    L10:
    DEC D[SSCOUNT]          ;stop when correct number done
    JZ >L12
    OR D[ESI+0C0h],100h     ;set trap flag in context
    L12:
    ...                     ;
    ...                     ;code here to display results to screen
    ...                     ;
    XOR EAX,EAX             ;eax=0 reload context and return to system
    JMP >L17
    L14:
    MOV EAX,1               ;eax=1 system to go to next handler
    L17:
    POP ESI,EDI,EBX
    MOV ESP,EBP
    POP EBP
    RET

Here the first call to the handler is caused by the INT 3 (the system objected strongly to the use of INT 1 when I tried it). On receipt of this exception, which could only come from the code fragment inserted in the code-to-test, the handler sets the trap flag in the context before returning. This causes a 80000004h exception to come back to the handler upon the next instruction. Note that with these exceptions, eip is already at the next instruction ie. one past the INT 3, or past the instruction executed with the trap flag set. Accordingly all you have to do in the handler to continue single-stepping is to set the trap flag again and return to the system.

* Thanks to G.W.Wilhelm, Jr of IBM for this idea
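The counting done with SSCOUNT can be checked with a short Python model of the handler. This is a sketch under one stated assumption: the context is taken to arrive at the handler with the trap flag clear each time, which is what the need to set it again implies. `Stepper` and `run` are hypothetical names for the illustration.

```python
# Model of the single-step handler's use of SSCOUNT and the trap flag.
TRAP_FLAG = 0x100
BREAKPOINT = 0x80000003      # raised by the INT 3
SINGLE_STEP = 0x80000004     # raised after an instruction run with TF set


class Stepper:
    def __init__(self, sscount):
        self.sscount = sscount
        self.log = []                        # codes for which results were shown

    def handler(self, code, context):
        if code not in (BREAKPOINT, SINGLE_STEP):
            return 1                         # not ours, go to next handler
        self.sscount -= 1                    # DEC D[SSCOUNT]
        if self.sscount != 0:
            context["eflags"] |= TRAP_FLAG   # ask for one more step
        self.log.append(code)                # display results at L12
        return 0                             # reload context and continue


def run(sscount):
    """Deliver exceptions to the handler until it stops asking for steps."""
    stepper = Stepper(sscount)
    context = {"eflags": 0}
    code = BREAKPOINT
    while True:
        context["eflags"] &= ~TRAP_FLAG      # assumption: TF arrives clear
        stepper.handler(code, context)
        if not context["eflags"] & TRAP_FLAG:
            return stepper.log               # normal operation resumes
        code = SINGLE_STEP


print([hex(code) for code in run(5)])
```

In this model, setting SSCOUNT to 5 produces five calls to the handler in all: the one caused by the INT 3 followed by four single steps.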



Exception handling in multi-threaded applications

When it comes to exception handling in multi-threaded applications there is little or no help from the system. You will need to plan for likely faults and organise your threads accordingly.

The rules applying to the exception handling provided by the system (in the context of a multi-threaded application) are:-

1. Only one type 1 (final handler) can be in existence at any one time for each process. If a new thread calls SetUnhandledExceptionFilter, this will simply replace the final handler - there is no chain of final handlers as there is for the type 2 (per-thread) handlers. Therefore the simplest way of using the final handler is still probably the best way in a multi-threaded application - establish it in the main thread as soon as possible after the program start point.

2. The final handler will be called by the system if the process will be terminating, regardless of which thread caused the exception.

3. However, there will only be a final unwind (immediately prior to termination) in the per-thread handlers established for the thread which caused the exception. Even if any other (innocent) threads have a window and a message loop, the system will not warn them that the process is about to terminate (no special message will be sent to them other than usual messages arising from the loss of focus of other windows).

4. Therefore the other (innocent) threads cannot expect a final unwind if the process is to terminate. And they will remain ignorant of the imminent termination.

5. If, as is likely, these other innocent threads will also need to clear-up on such termination you will need to inform them from the final handler. The final handler will need to wait until these other threads have completed clearing up before returning to the system.




6. The way in which the innocent threads are informed of the expected termination of the program depends on the precise make-up of your code. If the innocent thread has a window and message loop, then the final handler can use SendMessage to that window to send an application defined message (must be 400h or above), to inform that thread to terminate gracefully.

If there is no window and message loop, the final handler could set a public variable flag, polled from time to time by the other thread. Alternatively you could use SetThreadContext to force the thread to execute certain termination code, by setting the value of eip to point to that code. This method would not work if the thread is in an API, for example, waiting for the return from GetMessage. In that case you would need to send a message as well, to make sure the thread returned from the API, so that the new context is set.

7. RaiseException only works on the calling thread, so this cannot be used as a means of communication between threads to make an innocent thread execute its own exception handler code.

8. How does the final handler know when it may proceed after informing the other threads that the program is about to terminate? SendMessage will not return until the recipient has returned from its window procedure and the final handler could wait for that return. Alternatively it could poll a flag waiting for a response from the other thread that it has finished clearing up (note you must call the API Sleep in the polling loop to avoid over-using the system). Or better still, the final handler could wait until the other thread has terminated (this can be done using the API WaitForSingleObject or WaitForMultipleObjects if there is more than one thread). Alternatively use could be made of the Event or Semaphore APIs.

9. For an example of how these procedures could work in practice, suppose a secondary thread has the job of re-organising a database and then writing it to disk. It may be in the middle of this task when the main thread causes an exception which enters your final handler. Here you could either cause the secondary thread to abort its job, by causing it to unwind and terminate gracefully, leaving the original data on disk or alternatively you could permit it to complete the task, and then inform the handler that it had finished so that the handler could then return to the system. You would need to stop the secondary thread starting any further such jobs if your handler had been called. This could be achieved by the handler setting a flag tested by the secondary thread before it started any job, or by using the Event APIs.



31

10. If communication between threads is difficult, there is another way for one thread to access the stack of another thread, and thereby cause an unwind. This makes use of the fact that whereas each thread has its own stack, the memory reserved for that stack is within the address space for the process itself. You can check this yourself if you watch a multi-threaded application using a debugger. As you move between threads the values of ESP and EBP will change, but they are all kept within the address space of the process itself. The value of FS will also be different between threads and will point to the Thread Information Block for each thread. So if you take the following steps one thread can access the stack and cause an unwind of another:-

a. As each thread is created record in a static variable the value of its FS register.

b. As each thread closes, it resets its own static variable to zero.

c. The handler which needs to unwind other threads should take all the static variables in turn and for those which have a non-zero value (i.e. thread was running at the time of the exception) the handlers should be called with the exception flag of 2 (EH_UNWINDING) and a user flag of, say, 400h to show that the per-thread handler is being called by your final handler. You cannot call a per-thread handler in a different thread using RtlUnwind (which is thread-specific) but it can be done using the following code (where ebx holds the address of the EXCEPTION_RECORD):-



MOV D[EBX+4],402h ;make the exception flag EH_UNWINDING + 400h
L1:
PUSH ES
MOV AX,[FS_VALUE] ;get FS value of thread to unwind
MOV ES,AX
ES MOV EDI,[0] ;get 1st per-thread handler address
POP ES
L2:
CMP D[EDI],-1 ;see if it's the last one
JZ >L3 ;yes, so finish
PUSH EDI,EBX ;push ERR structure, EXCEPTION_RECORD
CALL [EDI+4] ;call handler to run clear-up code
ADD ESP,8h ;remove the two parameters pushed
MOV EDI,[EDI] ;get pointer to next ERR structure
JMP L2 ;and do next if not at end
L3: ;code label when finished
;now loop back to L1 with a new FS_VALUE until all threads done

Here you see that the Thread Information Block of each innocent thread is read using the ES register, which is temporarily given the value of the thread's FS register.

The reason why it is important (using the flag 400h) to inform the handler being called that it is being called by another thread (the final handler) is that the thread being called is still running because the exception occurred in a different thread. The handler may well need to suspend the thread in these circumstances, so that the clear-up job can be achieved by the calling thread. The innocent thread would then be given a safe-place to go to before calling ResumeThread. All this must be done before the final handler is allowed to return to the system because on return the system will simply terminate all threads by brute force.

Instead of using FS to find the Thread Information Block you could use the following code to get a 32-bit linear address for it. In this code LDT_ENTRY is a structure of 2 dwords, ax holds the 16-bit selector value (FS_VALUE) to be converted and hThread is any valid thread handle:-

AND EAX,0FFFFh
PUSH ADDR LDT_ENTRY,EAX,[hThread]
CALL GetThreadSelectorEntry
OR EAX,EAX ;see if failed
JZ >L300 ;yes so return zero
MOV EAX,ADDR LDT_ENTRY
MOV DH,[EAX+7] ;get base high
MOV DL,[EAX+4] ;get base mid
SHL EDX,16D ;shift to top of edx
MOV DX,[EAX+2] ;and get base low
OR EDX,EDX ;(edx now=linear 32 bit address)
L300: ;return nz on success
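As a cross-check of the byte offsets used in the selector code above, the same base-address assembly can be modelled outside assembler. The following Python sketch is not part of the original article: the function name `ldt_base` and the sample descriptor bytes are hypothetical. It assumes the standard x86 segment descriptor layout, in which the base is split into a low word at offset 2, a middle byte at offset 4 and a high byte at offset 7.

```python
import struct

def ldt_base(entry: bytes) -> int:
    """Return the 32-bit linear base held in an 8-byte x86 segment descriptor."""
    if len(entry) != 8:
        raise ValueError("a descriptor is exactly 2 dwords (8 bytes)")
    base_low = struct.unpack_from("<H", entry, 2)[0]   # mirrors MOV DX,[EAX+2]
    base_mid = entry[4]                                # mirrors MOV DL,[EAX+4]
    base_high = entry[7]                               # mirrors MOV DH,[EAX+7]
    return (base_high << 24) | (base_mid << 16) | base_low

# Hypothetical descriptor with base 7FFDE000h; the limit and flag bytes are made up.
sample = bytes([0xFF, 0x0F, 0x00, 0xE0, 0xFD, 0xF3, 0x40, 0x7F])
print(hex(ldt_base(sample)))   # prints 0x7ffde000
```

The sketch only covers the arithmetic. Obtaining the descriptor itself still requires the call to GetThreadSelectorEntry shown in the assembler version.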



Except1

This program provides a simple example of how exception handling can be used in practice in Windows programs written in assembler. The source code is contained in Except1.asm. This is written in GoAsm syntax. Although the program is a Windows GDI program, it only relies on message boxes, which is why there is no message loop.

The program has two exception handlers, a final exception handler and a per-thread exception handler. The final exception handler is created first, then a procedure is called which is in code protected by the per-thread exception handler. An exception occurs within that procedure and the per-thread handler is called. Within the handler, the user is asked whether the handler should swallow the exception or not. If the user decides to swallow the exception, the program would be able to continue to run, but actually in this case it terminates normally. If the user decides that the exception should not be swallowed by the handler, then the final exception handler is called (on the way to program closure). In real life, this handler would be responsible for completing logs and records, closing file handles, releasing memory etc. But before the program finally finishes, something interesting happens. The system calls the per-thread exception handler in case there is more clearing up to do in that particular stack frame using local data. This is the system unwind. All these events are followed from the various message boxes which appear on the screen.
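The sequence of events just described can be modelled in a few lines. This Python sketch is not the Windows implementation and does not appear in the original article: the names `dispatch`, `per_thread_handler` and `final_handler` are hypothetical, and the handler chain is reduced to one per-thread handler followed by the final handler. It assumes the conventions used in this chapter, namely that a per-thread handler returns 0 to swallow the exception or 1 to pass it on, and that it is called a second time with the unwinding flag (2) before the process closes.

```python
EH_NONCONTINUABLE = 1   # flag bit tested with TEST AL,1h in the listing
EH_UNWINDING = 2        # flag bit tested with TEST AL,2h in the listing

def per_thread_handler(flags, swallow, log):
    """Model of HANDLER: return 0 to swallow the exception, 1 to pass it on."""
    if flags & EH_NONCONTINUABLE:
        return 1                      # not allowed to touch it
    if flags & EH_UNWINDING:
        log.append("per-thread handler: clearing up (unwind)")
        return 1
    if swallow:
        log.append("per-thread handler: swallowed, resuming at safe place")
        return 0
    log.append("per-thread handler: not dealing with it")
    return 1

def final_handler(log):
    """Model of FINAL_HANDLER: clear up, then return 1 to end the process."""
    log.append("final handler: clearing up")
    return 1

def dispatch(swallow):
    """Return the ordered list of events for one continuable exception."""
    log = []
    if per_thread_handler(0, swallow, log) == 0:
        log.append("program continues and ends normally")
        return log
    final_handler(log)
    per_thread_handler(EH_UNWINDING, swallow, log)   # the system unwind
    log.append("process terminated by the system")
    return log

for choice in (True, False):
    print(choice, dispatch(choice))
```

Each log entry corresponds to one of the message boxes in Except1, so the two runs show the order in which the boxes would appear for a "yes" and a "no" answer.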




Source 1: Except1 - Exception Handling

;////////////////////////////////////////////////////////////////////////////

;// //

;// EXCEPT1.ASM - source for Except1.Exe //

;// Simple Demo of Win32 structured exception handling //

;// for assembler programmers //

;// See Except2 for a more complex demo dealing with voluntary //

;// stack unwinds and multiple handler levels //

;// COPYRIGHT NOTE - this file is Copyright Jeremy Gordon 2002 //

;// [McDuck Software] //

;// - e-mail: [email protected] //

;// - www.GoDevTool.com //

;// LEGAL NOTICE - The author accepts no responsibility for losses //

;// of any type arising from this file or anything wholly or in part //

;// created from it //

;// //

;////////////////////////////////////////////////////////////////////////////

;

;This program only uses Windows message boxes, which is why there is no

;message loop.

;The program has two exception handlers. The final exception handler

;is created first, then a procedure is called which has its own

;per-thread exception handler, capable of swallowing an exception.

;This it does at the option of the user.

;If the user decides to swallow the exception, the program would be able

;to continue to run, but actually in this case it terminates normally.

;If the user decides that the exception should not be swallowed by the

;handler, then the final exception handler is called (on the way to

;program closure). In real life, this handler would be responsible for

;completing logs and records, closing file handles, releasing memory etc.

;But before the program finally finishes, the system calls the per-thread

;exception handler in case there is more clearing up to do in that

;particular stack frame using local data. This is the system unwind.

;



;Written for GoAsm (Jeremy Gordon). Assemble using:-

;GoAsm except1.asm

;Link using:-

;ALINK -oPE except1.obj -entry START kernel32.lib user32.lib gdi32.lib

;(where the lib files are made using ALIB)

;*******************************************************************

;

DATA SECTION

;

;*******************************************************************

FATALMESS DB "I thoroughly enjoyed it and I have already tidied everything up - "

DB "you know, completed records, closed filehandles, "

DB "released memory, that sort of thing .."

DB "Glad this was by design - bye, bye ..",0Dh,0Ah

DB ".. but first, I expect the system will do an unwind ..",0

;******************************

;

CODE SECTION

;

CLEAR_UP: ;all clearing up would be done here

RET

;

FINAL_HANDLER: ;system passes EXCEPTION_POINTERS

PUSH EBX,EDI,ESI ;save registers as required by Windows

CALL CLEAR_UP

PUSH 40h ;exclamation sign + ok button only

PUSH "Except1 - well it's all over for now."

PUSH ADDR FATALMESS,0

CALL MessageBoxA ;wait till ok pressed

MOV EAX,1 ;terminate process without showing system message box

POP ESI,EDI,EBX

RET

;

;********************************* PROGRAM START

START:




;******** first lets make our final handler which would do all clearing up if

;******** the program has to close

PUSH ADDR FINAL_HANDLER

CALL SetUnhandledExceptionFilter

CALL PROTECTED_AREA

CALL CLEAR_UP ;here the program clears up normally

PUSH 40h ;exclamation sign + ok button only

PUSH "Except1","This is a very happy ending",0

CALL MessageBoxA ;wait till ok pressed

PUSH 0 ;code meaning a successful conclusion

CALL ExitProcess ;and finish with aplomb!

;********************************* PROGRAM END

;

PROTECTED_AREA:

PUSH EBP,0,0 ; )create the

PUSH OFFSET SAFE_PLACE ; )ERR structure

PUSH OFFSET HANDLER ; )on the

FS PUSH [0] ; )stack

FS MOV [0],ESP ;point to structure just established on the stack

;

;*********************** and now lets cause the exception ..

XOR ECX,ECX ;set ecx to zero

DIV ECX ;divide by zero, causing exception

;*********************** because of the exception the code never gets to here

;

SAFE_PLACE: ;but the handler will jump to here ..

FS POP [0] ;restore original exception handler from stack

ADD ESP,14h ;throw away remainder of ERR structure made earlier

RET

;

;This simple handler is called by the system when the divide by zero

;occurs. In this handler the user is given a choice of swallowing the

;exception by jumping to the safe-place, or not dealing with it at all,

;in which case the system will send the exception to the FINAL_HANDLER

;



HANDLER:
PUSH EBP
MOV EBP,ESP ;now [EBP+8]=EXCEPTION_RECORD, [EBP+0Ch]=ERR, [EBP+10h]=context
PUSH EBX,EDI,ESI ;save registers as required by Windows
MOV EBX,[EBP+8] ;get exception record in ebx
MOV EAX,[EBX+4] ;get flag sent by the system
TEST AL,1h ;see if its a non-continuable exception
JNZ >.nodeal ;yes, so not allowed by system to touch it
TEST AL,2h ;see if its the system unwinding
JNZ >.unwind ;yes
PUSH 24h ;question mark + YES/NO buttons
PUSH 'Except1','There was an exception - do you want me to swallow it?',0
CALL MessageBoxA ;wait till button pressed
CMP EAX,6 ;see if yes clicked
JNZ >.nodeal ;no
;***************************** go to SAFE_PLACE
MOV ESI,[EBP+10h] ;get register context record in esi
MOV EDI,[EBP+0Ch] ;get pointer to ERR structure in edi
MOV [ESI+0C4h],EDI ;insert new esp (happens to be pointer to ERR)
MOV EAX,[EDI+8] ;get address of SAFE_PLACE given in ERR structure
MOV [ESI+0B8h],EAX ;insert that as new eip in register context
MOV EAX,[EDI+14h] ;get ebp at safe place given in ERR structure
MOV [ESI+0B4h],EAX ;insert that as new ebp in register context
XOR EAX,EAX ;eax=0 reload context and return to system
JMP >.fin
.unwind:
PUSH 40h ;exclamation sign + ok button only
PUSH "Except1"
PUSH "The system calling the handler again for more clearing up (unwinding)"
PUSH 0
CALL MessageBoxA ;wait till ok pressed, then return eax=1
.nodeal:
MOV EAX,1 ;eax=1 system to go to next handler
.fin:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
RET

;




Except2

This is a more complex program which is intended to demonstrate in more detail the contents of this article.

The source code for Except2.Exe (Except2.asm and Except2.RC) is also provided and again it is in GoAsm syntax.

The main window is actually a modal dialog. A final handler is set up very early in the process. When the "Cause Exception" button is clicked, first the dialog procedure is called with the command, then 2 further routines are called, the third routine causing an exception of the type chosen by the radiobuttons. As execution passes through this code, 3 per-thread exception handlers are created.



The exception is either repaired in situ if possible, or the program recovers in the chosen handler from a safe-place. If the exception is allowed to go to the final handler you can either exit by pressing F3 or F5, or if you press F7 the final handler will try to recover from the exception.

You can follow events as they occur because each handler displays various messages in the listbox. There is a slight delay between each message so that you can follow more easily what is happening, or you can scroll the messages to get them back into view.

When the program is about to terminate, something interesting happens. The system causes a final unwind with the exception flag set to 2h. The messages sent to the listbox are slowed down even further because the program will be terminating soon!

You will see that the same type of unwind occurs if you specify that execution should continue from a "safe-place" or if F7 is pressed from the final handler. This unwind is initiated by the handler itself.
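The exception flag mentioned above is a set of bits rather than a single value, which is why the handlers test it with TEST rather than CMP. As a rough illustration (not taken from Except2 itself), the Python sketch below decodes a flag value into the kind of description Except2 writes to its listbox. The function `describe_flags` and the constant name `USER_FINAL_HANDLER` are hypothetical. The values 1 and 2 are those tested in the listings, and 400h is the user flag suggested earlier in this article.

```python
EH_NONCONTINUABLE = 0x1      # exception may not be continued
EH_UNWINDING = 0x2           # handler is being called to clear up during an unwind
USER_FINAL_HANDLER = 0x400   # user-chosen bit: called by the program's own final handler

def describe_flags(flags: int) -> list:
    """Return a readable description of each bit set in an exception flag."""
    parts = ["non-continuable" if flags & EH_NONCONTINUABLE else "continuable"]
    if flags & EH_UNWINDING:
        parts.append("unwinding")
    if flags & USER_FINAL_HANDLER:
        parts.append("called by own final handler")
    return parts

for value in (0x0, 0x1, 0x2, 0x402):
    print(hex(value), describe_flags(value))
```

Testing each bit separately means a value such as 402h is still recognised as an unwind, which a direct comparison with 2 would miss.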




Source 2: Except2 - Complex Exception Handling

;////////////////////////////////////////////////////////////////////////////

;// //

;// EXCEPT2.ASM - source for Except2.Exe //

;// Complex Demo of Win32 structured exception handling //

;// for assembler programmers //

;// See Except1.asm for a simple demo! //

;// COPYRIGHT NOTE - this file is Copyright Jeremy Gordon 1996-2002 //

;// [McDuck Software] //

;// - e-mail: [email protected] //

;// - www.GoDevTool.com //

;// LEGAL NOTICE - The author accepts no responsibility for losses //

;// of any type arising from this file or anything wholly or in part //

;// created from it //

;// //

;////////////////////////////////////////////////////////////////////////////

;

;The program uses a modal dialog box as its main window, which is why

;there is no message loop (this is dealt with by the system itself)

;A dialog box is created and the user has the choice of exceptions to choose

;from. The exception can be dealt with in handlers 1, 2 or 3; if it would

;normally cause program exit, it goes to the final handler.

;if it is repaired, this can be done either by returning to the place

;of exception or to a safe-place.

;As a final luxury the final handler may also try to recover from the

;exception, unwinding the stack first of course.

;If you decide to let the system deal with the exception, the system then

;unwinds the stack in exactly the same way as the handler does if the

;program is to try to continue running.

;

;Written for GoAsm (Jeremy Gordon). Assemble using:-

;GoAsm except2.asm

;Resources (dialogs, version and bitmap) compiled using GoRC (Jeremy Gordon) use:-



;GoRC except2.rc

;Link using:-

;ALINK -oPE except2.obj except2.res -entry START kernel32.lib user32.lib gdi32.lib

;(where the lib files are made using ALIB)

;*******************************************************************

;

DATA SECTION

;

;*******************************************************************

MSG DD 7 DUP 0 ;hWnd, +4=message, +8=wParam, +C=lParam, +10h=time, +14h/18h=pt

RECT DD 4 DUP 0 ;rectangle - left, +4 top, +8 right, +0Ch bottom

;****************************** some dwords

lpArguments DD 2 DUP 0 ;holds data when RaiseException called

flOldProtect DD 0 ;holds previous code section access protection

hHeap DD 0 ;handle to temporary memory areas

hList DD 0 ;handle to listbox

hDC DD 0 ;handle to device context of listbox

hCombo DD 0 ;handle to combo box

hInst DD 0 ;handle to main process

CINDEX DD 0 ;index of combobox selection

COUNT DD 0 ;used in getting a random number

MESSDELAY DD 100h ;length of time to keep message on the screen

EBPSAFE_PLACE3 DD 0 ;these are kept solely for

ESPSAFE_PLACE3 DD 0 ;repair by final handler

;******************************* non-doublewords follow

EXC_TYPE DB 0 ;radio button exception type chosen

HANDLER DB 0 ;the handler to repair the exception

CONTINUE DB 0 ;1=continue from handler safe-place

HANDLERFLAG DB 0 ;1=read/write message is new

;2=final handler unwind

;********************************* and some strings

BYETEXT DB 'Have an exceptional day!',0

;********************** combo box messages

COMBO_STRING1 DB 'Deal with the exception in handler ',0

COMBO_STRING3 DB 'Allow exception to go to final handler',0




;********************** exception messages

EXC_MESS0 DB 'Reading from h ... ',0 ;spaces at end to get rub-out

EXC_MESS1 DB 'Writing to h ... ',0 ;spaces at end to get rub-out

EXC_MESS2 DB 'ExceptionCode h now in handler :',0

EXC_MESS3 DB 'Attempting local repair (no unwind)',0

EXC_MESS4 DB 'Repair appears successful',0

EXC_MESS5 DB ' Flag= h (continuable exception)',0

EXC_MESS5A DB ' Flag= h (non-continuable exception)',0

EXC_MESS5B DB ' Flag= h (unwinding)',0

EXC_MESS5C DB ' Local data= h',0

EXC_MESS6 DB 'Handler cannot repair this exception',0

EXC_MESS7 DB 'Memory write error at h',0

EXC_MESS8 DB 'Memory read error at h',0

EXC_MESS9 DB 'Attempt to corrupt code at h',0

EXC_MESS10 DB 'ExceptionCode h in final handler',0

EXC_MESS11 DB 'Handler clear-up code',0

EXC_MESS11A DB 'Handler clear-up code - byebye ........',0

EXC_MESS12 DB 'Ready to do voluntary stack unwind',0

EXC_MESS13 DB ' Exception at eip= h',0

EXC_MESS14 DB 'Hello from safe-place #2!',0



EXC_MESS17 DB 'Key F3=polite end; F5=nasty end; F7=recover',0

EXC_MESS18 DB 'Closing memory heap and dc',0

EXC_MESS19 DB 'There will be an exception in 3rd routine',0

EXC_MESS20 DB ' (protected by handler 3)',0

EXC_MESS21 DB 'Now system will unwind and call ExitProcess ...',0

EXC_MESS22 DB 'Code at h caused an exception',0

EXC_MESS23 DB 'Now for own unwind then get to safe-place ...',0

EXC_MESS24 DB 'Hello from final handler in safe-place #3!',0

;

;*********************** for HEXWRITE

sHEXb DB '0123456789ABCDEF'

;

;*******************************************************************



;* CODE

;*******************************************************************

CODE SECTION

;

CODESTART: ;label for code corruption test

;

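HEXWRITE: ;write hex number from eax into [esi]
PUSH EAX,EBX,EDX
MOV EBX,ADDR sHEXb
ROL EAX,4 ;get high order nibble into al
MOV DL,AL
AND EDX,0Fh ;use only least sig nibble
MOV DL,[EBX+EDX]
MOV [ESI],DL ;write the nibble
INC ESI ;ready for next
ROL EAX,4 ;2nd nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;3rd nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;4th nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;5th nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;6th nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;7th nibble
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
ROL EAX,4 ;8th nibble (eax is now back to its original value)
MOV DL,AL
AND EDX,0Fh
MOV DL,[EBX+EDX]
MOV [ESI],DL
INC ESI
POP EDX,EBX,EAX
RET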

;

ADD_LISTBOXSTRING: ;add a string to listbox, scrolling if required

PUSH EDX,0,180h,[hList] ;LB_ADDSTRING (address in edx)

CALL SendMessageA

PUSH EAX ;keep item index

DEC EAX ;index now one smaller

PUSH 0,EAX ;string to ensure visible

PUSH 197h,[hList] ;LB_SETTOPINDEX

CALL SendMessageA ;scroll listbox now to show string just inserted

PUSH [hList]

CALL UpdateWindow



POP EAX ;restore item index

RET

;

WRITE_LISTBOXLINE: ;write the string in edx to listbox

PUSH EAX

;**************************

CALL ADD_LISTBOXSTRING ;write to listbox

PUSH [MESSDELAY] ;256 milliseconds at start

CALL Sleep ;delay for a while

;**************************

POP EAX

RET

;

WRITE_MEM_ERROR:

PUSH EBX

MOV EDX,ADDR EXC_MESS7 ;correct message if write error

CMP D[EBX+14h],1 ;see if write error flag from 1st part of array

JZ >0 ;yes (write=1, read=0)

MOV EDX,ADDR EXC_MESS8 ;correct message if read error

0:

MOV EAX,[EBX+18h] ;get 2nd part of array (inaccessible address)

MOV ESI,EDX

ADD ESI,22D

CALL HEXWRITE ;write address into message

CALL WRITE_LISTBOXLINE ;write the string in edx to listbox

OR B[HANDLERFLAG],1 ;ensure that read/write message is written into listbox

POP EBX

RET

;

WCE23: ;write memory read/write number into message

PUSH ESI

MOV ESI,EBX

CALL HEXWRITE ;write memory read/write number into message at esi

POP ESI

RET




;

WRITE_CURRENT_EDI: ;correct message in esi

PUSH ECX,EDI

MOV EDX,ADDR EXC_MESS0 ;read message

MOV EBX,13D

CMP B[EXC_TYPE],104D ;see if read test

JZ >1 ;yes

SUB EBX,2

MOV EDX,ADDR EXC_MESS1 ;write message

1:

MOV ESI,EDX ;keep correct message in esi

ADD EBX,EDX ;and correct write-place in ebx

TEST B[HANDLERFLAG],1 ;see if first read/write message

JZ >2 ;no

;************ drawtext is used because it is much quicker than lb_insertstring

;************ insert eventual item in listbox but write over it for now

MOV EAX,EDI ;this message will be displayed at end of test

ADD EAX,1000h ;so ensure it shows correct place of exception occurrence

CALL WCE23 ;write memory read/write number into message

MOV EDX,ESI

CALL ADD_LISTBOXSTRING ;write item to listbox, returning index in eax

PUSH ADDR RECT,EAX ;index of last string written (wParam)

PUSH 198h,[hList] ;LB_GETITEMRECT

CALL SendMessageA ;get client co-ordinates in RECT for string just written

ADD D[RECT],2 ;allow for lhs border

AND B[HANDLERFLAG],0FEh ;don't come here again

2:

MOV EAX,EDI

CALL WCE23 ;write memory read/write number into message

;*********************

PUSH 100h,ADDR RECT ;no clipping

PUSH -1,ESI,[hDC] ;-1=system to count length

CALL DrawTextA

;*********************

POP EDI,ECX



RET

;

WRITE_WHICHADDRESS: ;eax=code address

MOV ESI,ADDR EXC_MESS22

MOV EDX,ESI

ADD ESI,8

CALL HEXWRITE ;write code address into message


RET

;

WRITE_HANDLERDATA: ;eax=exception no., ebx=record, dl=handler no.

PUSH EAX,ESI,EDX


CMP DL,4 ;see if final handler

PUSHFD ;keep flag

JZ >3 ;yes


ADD DL,48D ;convert handler number to ascii char

MOV [ESI+39D],DL ;write the handler number

3:

MOV EDX,ESI ;keep correct message

ADD ESI,14D

CALL HEXWRITE ;write exception number into message


MOV EAX,[EBX+4] ;get exception flag

MOV ESI,ADDR EXC_MESS5 ;continuable

CMP EAX,1

JB >4

MOV ESI,ADDR EXC_MESS5A ;non-continuable

JZ >4

MOV ESI,ADDR EXC_MESS5B ;unwind

4:

MOV EDX,ESI ;keep for WRITE_LISTBOXLINE later

ADD ESI,13D

CALL HEXWRITE ;write exception flag into message





POPFD ;restore flag

JZ >5 ;final handler so don't show local data address

MOV ESI,ADDR EXC_MESS5C

MOV EDX,ESI ;keep for WRITE_LISTBOXLINE later

ADD ESI,19D

MOV EAX,[EBP+0Ch] ;get pointer to ERR structure

CALL HEXWRITE ;write as address of local data


5:

POP EDX,ESI,EAX

RET

;

CLEARUPCODE_MESS: ;handler in edx


CMP DL,1 ;see if handler 1

JNZ >6

TEST B[HANDLERFLAG],2 ;see if final handler doing unwind, though

JNZ >6 ;yes, so do ordinary message

MOV D[MESSDELAY],3000D ;3 seconds

MOV ESI,ADDR EXC_MESS11A

6:

ADD DL,48D ;convert handler number to ascii char

MOV [ESI+8D],DL ;write the handler number into message

MOV EDX,ESI ;keep correct message


RET

;

ADD_STRING:

PUSH ESI,0,143h,[hCombo] ;CB_ADDSTRING (uMsg), handle to combobox

CALL SendMessageA

RET

;

INITIALISE_CONTROLS:

MOV ECX,[EBP+14h] ;get dialog id sent to DialogBoxIndirectParam (lParam)



JCXZ >1 ;it's main dialog

RET ;it must be "about" dialog

1:

;************************* initialise the radio buttons

PUSH 108D ;button to select

PUSH 109D,104D ;last,first in group

PUSH [EBP+8] ;hdlg

CALL CheckRadioButton

;************************* now initialise 2nd lot of radio buttons

PUSH 1 ;indicate check

PUSH 111D ;identifier

PUSH [EBP+8] ;hdlg

CALL CheckDlgButton

;************************* now initialise the list and combo box

PUSH 113D,[EBP+8] ;list box identifier

CALL GetDlgItem ;get list box handle

MOV [hList],EAX ;keep it

PUSH 110D,[EBP+8] ;combo box identifier

CALL GetDlgItem ;get combo box handle

MOV [hCombo],EAX ;keep it

MOV BL,'1' ;handler number to add to message

MOV ESI,ADDR COMBO_STRING1

2:

MOV [ESI+35D],BL ;insert number into message

CALL ADD_STRING

INC BL

CMP BL,'4' ;see if at last message

JNZ 2

MOV [CINDEX],EAX ;keep the selection for later use

PUSH 0,EAX,14Eh,[hCombo] ;CB_SETCURSEL, handle to combobox

CALL SendMessageA

MOV ESI,ADDR COMBO_STRING3

CALL ADD_STRING ;no repair message

RET

;




GET_EXC_TYPE: ;get the chosen exception type

MOV EBX,104D

MOV ESI,6 ;number to do

3:

PUSH EBX,[EBP+8] ;button identifier, hdlg

CALL IsDlgButtonChecked

CMP AL,1 ;see if button is checked

JZ >4 ;yes

INC EBX

DEC ESI

JNZ 3

4:

MOV [EXC_TYPE],BL ;keep type for later tests

RET

;

;***************************************************** PROGRAM START

START:

PUSH 0

CALL GetModuleHandleA

MOV [hInst],EAX

;**************************** establish a handler for the final exit

PUSH ADDR FINAL_HANDLER

CALL SetUnhandledExceptionFilter

;****************************** now create the dialog box

PUSH 0,ADDR DlgProc ;pointer to dialog procedure (param=0=main dialog)

PUSH 0 ;this dialog is the main window (no parent)

PUSH 'MainDialog' ;name of dialog in resource file

PUSH [hInst]

CALL DialogBoxParamA ;this does not return until dialog closed

PUSH 0 ;exit code zero=success if finishes this way

CALL ExitProcess

;****************************************************** PROGRAM END

;

PROCESS_COMMAND: ;called if WM_COMMAND (eax holds wParam)

CMP EAX,99D ;see if "about" clicked



JNZ >0 ;no

PUSH 1,ADDR DlgProc,[EBP+8h] ;param=1

PUSH 'About'

PUSH [hInst]

CALL DialogBoxParamA ;create about dialog, borrowing main dlgproc

RET

0:

CMP EAX,101D ;see if it was "cause exception" button

JZ >1 ;yes

RET

;************************************************* CAUSE EXCEPTION WAS CLICKED

1:

CALL GET_EXC_TYPE ;get the chosen exception type

;************************* next see if check button is checked

PUSH 112D,[EBP+8] ;identifier of safe-place radiobutton

CALL IsDlgButtonChecked

MOV [CONTINUE],AL ;keep this 1=continue from safe-place

;************************* now get the combo box selection

PUSH 0,0,147h ;CB_GETCURSEL (uMsg)

PUSH [hCombo] ;handle to combobox

CALL SendMessageA ;get current selection

INC AL ;handler 1 now = 1

MOV [HANDLER],AL

;***************** clear the listbox

PUSH 0,0,184h ;LB_RESETCONTENT

PUSH [hList] ;handle to listbox

CALL SendMessageA

CALL SECOND_ROUTINE ;run until exception and repair

RET

;

;******************************************************* DIALOG PROCEDURE

;******* The about dialog also comes here, but no static data is re-used

;******* apart from COUNT

DlgProc:

;




PUSH EBP

MOV EBP,ESP

;now [EBP+8]=hDlg, [EBP+0Ch]=uMsg, [EBP+10h]=wParam, [EBP+14h]=lParam

;************************************** create area for local data

SUB ESP,40h ;make space of 16 dwords on stack for local data

;now addressable as [EBP-4] to [EBP-40h]

;************************************** save registers as required by Windows

PUSH EBX,EDI,ESI

;************************************** install handler_1 and its ERR structure

PUSH EBP ;ERR+14h save ebp (being ebp at safe-place1)

PUSH 0 ;ERR+10h area for flags

PUSH ADDR EXC_MESS16 ;ERR+0Ch safe place 1 message

PUSH ADDR SAFE_PLACE1 ;ERR+8h place for new eip

PUSH ADDR HANDLER_1 ;ERR+4h address of handler routine

FS PUSH [0] ;ERR+0h keep next handler up the chain


;**************************************

INC D[COUNT] ;used in getting a random number

MOV EAX,[EBP+0Ch] ;get uMsg

CMP EAX,136h ;see if WM_CTLCOLORDLG

JZ >3 ;yes

CMP EAX,135h ;see if WM_CTLCOLORBTN

JZ >2 ;yes

CMP EAX,138h ;see if WM_CTLCOLORSTATIC

JNZ >4 ;no

PUSH 120D,[EBP+8]

CALL GetDlgItem ;get control 120 handle

CMP EAX,[EBP+14h] ;see if its the static control for bitmap frame

JZ LONG >8 ;must be kept white

2:

PUSH 1,[EBP+10h] ;1=transparent, wParam

CALL SetBkMode

3:

PUSH 00808040h ;blue colour from default palette

CALL CreateSolidBrush ;create brush as an object with handle in EAX



JMP LONG >9 ;return with the brush handle (deleted on program exit)

4: ;this is needed because dialog=main window (no IDCANCEL)

CMP EAX,110h ;see if WM_INITDIALOG

JNZ >5 ;no

CALL INITIALISE_CONTROLS

JMP >.nonzero ;return non-zero

5:

CMP EAX,10h ;see if WM_CLOSE (sent if sysmenu clicked)

JZ >6 ;yes, so say goodbye and finish

CMP EAX,111h ;see if WM_COMMAND

JNZ >8 ;no

TEST B[HANDLERFLAG],2 ;see if in final handler

JNZ >8 ;yes so ignore command messages

MOV EAX,[EBP+10h] ;wParam

CMP EAX,102D ;see if it was quit button

JZ >6 ;yes, so say goodbye and finish

CMP EAX,100D ;see if "about" OK button

JZ >7 ;yes so remove about dialog

CALL PROCESS_COMMAND

JMP >.nonzero

6:

TEST B[HANDLERFLAG],2 ;see if in final handler

JNZ >8 ;yes so ignore quit/close messages

MOV D[MESSDELAY],1000D ;one second delay

MOV EDX,ADDR BYETEXT ;write "Have an exceptional day!"


7:

PUSH 0,[EBP+8]

CALL EndDialog ;end dialog

.nonzero

MOV EAX,1 ;return non-zero (TRUE=message processed)

JMP >9

;****************************************************** HANDLER SAFE-PLACE 1

SAFE_PLACE1: ;esp/ebp already set to correct values by handler

CALL WRITE_LISTBOXLINE ;write the string in edx to listbox tell user reached here




8:

XOR EAX,EAX ;return zero (FALSE=message not processed)

9:



POP ESI,EDI,EBX

MOV ESP,EBP

POP EBP

RET 10h ;automatically does epilogue code to close stack frame

;

ATTEMPT_CORRUPTION: ;attempt code corruption in random place

MOV ESI,ADDR CODESTART

MOV EDI,ADDR CODEEND

SUB EDI,ESI ;get how many bytes in the routine

;*****************************

;Note that it is possible the code section has a write attribute from its

;own PE file, so first ensure that this is removed ..

PUSH ADDR flOldProtect

PUSH 20h ;PAGE_EXECUTE_READ

PUSH EDI,ESI ;size, start

CALL VirtualProtect

OR EAX,EAX ;check for success

JZ >.fin ;no, so too dangerous to do the test

;***************************** get a random number no higher than edi

XOR EBX,EBX

7:

STC

RCL EBX,1

CMP EDI,EBX ;find how many bits may be looked at

JNB 7

8:

CALL GetTickCount ;get count since Windows started now

MOV EDX,EAX ;keep whole tick count

SUB EAX,[COUNT] ;add another random element

MOV ECX,200D



9:

AND EAX,EBX ;only look at correct number of bits

CMP EDI,EAX ;see if number is now too high

JNB >10 ;no

ROR EDX,5 ;rotate edx 5 times

ADD EAX,EDX ;add extra random element

LOOP 9 ;try again 200 times

JMP 8 ;try again with another tick count

10:

;*********** number now in eax

ADD ESI,EAX ;get to address to corrupt

PUSH ESI

MOV EAX,ESI ;get number to write in eax


ADD ESI,27D

CALL HEXWRITE ;write exception flags into message

MOV EDX,ADDR EXC_MESS9 ;write "Attempt to corrupt code at h"


POP ESI

MOV B[ESI],90h ;attempt to corrupt code (causes exception)

.fin

RET

;

MEM_TEST: ;its a memory read/write exception

OR B[HANDLERFLAG],1 ;ensure read/write message is written to listbox

;******** get device context and set up correct font and colour

PUSH [hList]

CALL GetDC

MOV [hDC],EAX ;keep handle of device context of listbox

PUSH 0,0,31h,[hList] ;WM_GETFONT

CALL SendMessageA ;get listbox font

PUSH EAX,[hDC]

CALL SelectObject ;use this font in the dc

PUSH 0FF0000h,[hDC] ;nice blue colour

CALL SetTextColor




;**************************************************************

OR BL,BL ;see if write test

JZ >22 ;yes

;******************************** now for the read test

PUSH 0,1000h,0 ;make "growable" memory, 4K for immediate use

CALL HeapCreate

MOV EDI,EAX

MOV [hHeap],EAX ;keep heap address

MOV ECX,2001h ;ready to read from 8K +1

20:

MOV AL,[EDI] ;read into al

CMP ECX,1 ;unless the last (handler returns to here for last one)

JZ >21 ;listbox message already written

CALL WRITE_CURRENT_EDI ;show user current position

21:

INC EDI

LOOP 20 ;continue so as to cause exception

PUSH [hHeap]

CALL HeapDestroy

JMP >25

;******************************** now for the write test

22:

PUSH 4h ;read & write access

PUSH 2000h ;MEM_RESERVE

PUSH 10000h ;64K

PUSH 0 ;system to decide address

CALL VirtualAlloc

MOV [hHeap],EAX

PUSH 4h ;read & write access

PUSH 1000h ;MEM_COMMIT

PUSH 1000h ;4K

PUSH [hHeap]

CALL VirtualAlloc

MOV EDI,EAX ;base address of allocated 4K

MOV ECX,2001h ;ready to write 8K + 1 byte



23:

MOV B[EDI],'X'

CMP ECX,1 ;unless the last (handler returns to here for last one)

JZ >24 ;listbox message already written

CALL WRITE_CURRENT_EDI ;show user current position

24:

INC EDI

LOOP 23 ;continue so as to cause exception

PUSH 4000h,0,[hHeap] ;MEM_DECOMMIT

CALL VirtualFree ;decommit memory used

PUSH 8000h,0,[hHeap] ;MEM_RELEASE

CALL VirtualFree ;free memory used

25:

;**************************** release the device context

PUSH [hDC],[hList]

CALL ReleaseDC

RET

;

ERROR_ROUTINE: ;the exception will occur in this routine

XOR EBX,EBX

MOV BL,[EXC_TYPE] ;get exception type again

SUB EBX,105D ;see if memory read/write test

JA >30 ;no

CALL MEM_TEST

RET

30:

;*********************** own software exception

DEC EBX ;see if should do own (continuable) software exception

JZ >31 ;yes

CMP EBX,1 ;see if should do own (non-continuable) software exception

JNZ >32 ;no

31: ;0=continuable exception, 1=non-continuable exception

MOV EAX,ADDR AVOID ;get place to restart from

MOV [lpArguments],EAX ;keep in array in memory

MOV [lpArguments+4],ESP ;keep esp too




PUSH ADDR lpArguments ;give array to function

PUSH 2 ;number of arguments in array

PUSH EBX ;continuable or non-continuable exception flag

PUSH 0E0000100h ;exception code

CALL RaiseException

AVOID:

RET

32:

DEC EBX,EBX ;see if divide by zero

JNZ >33 ;no

;*********************** divide by zero exception

XOR ECX,ECX

MOV EAX,66D

DIV CL ;divide by zero to create exception

RET

33: ;must be attempt to corrupt code test

CALL ATTEMPT_CORRUPTION ;attempt code corruption in random place in code

RET

;

THIRD_ROUTINE:









;**************************************

MOV [EBPSAFE_PLACE3],EBP ;these are kept solely for

MOV [ESPSAFE_PLACE3],ESP ;repair by final handler

;**************************************

MOV EDX,ADDR EXC_MESS19 ;"exception will occur in level 3 code"


MOV EDX,ADDR EXC_MESS20 ;"(protected by exception handler 3)"




CALL ERROR_ROUTINE ;exception will be caused by this routine

JMP >4

;************************************** here is the safe place & code



4:


ADD ESP,14h ;throw away handler_3

RET

;

SECOND_ROUTINE:









;**************************************

CALL THIRD_ROUTINE

JMP >5

;************************************** here is the safe place & code



5:



RET

;

;************ here is the routine to "unwind" the stack and go to safe-place

TRYFOR_SAFEPLACE: ;EAX=exception

CMP EAX,0C0000005h ;see if memory read/write exception

JNZ >6 ;no




CALL WRITE_MEM_ERROR ;write type and place of error

6:

MOV EDX,ADDR EXC_MESS12

CALL WRITE_LISTBOXLINE ;write "Ready to do voluntary stack unwind"

;*** now carry out own unwind for other handlers to clear-up using local data

;*** here is the call to the only recently documented API function RtlUnwind

PUSH 0 ;return value (not needed)

PUSH [EBP+8] ;send exception_record to per-thread handlers

PUSH ADDR UN23 ;return address

PUSH [EBP+0Ch] ;pointer to this ERR structure

CALL RtlUnwind

UN23:

;***************************** now change context to suit safe place

;***************************** current context has values as at the exception

MOV ESI,[EBP+10h] ;get context record in esi

MOV EDX,[EBP+0Ch] ;get pointer to ERR structure

MOV [ESI+0C4h],EDX ;insert new esp (happens to be pointer to ERR)

MOV EAX,[EDX+8] ;get safe place given in ERR structure

MOV [ESI+0B8h],EAX ;insert new eip

MOV EAX,[EDX+0Ch] ;get message address in eax

MOV [ESI+0A8h],EAX ;insert new edx

MOV EAX,[EDX+14h] ;get ebp at safe place given in ERR structure

MOV [ESI+0B4h],EAX ;insert new ebp

RET

;***************** here is the routine to try repair an exception

ATTEMPT_LOCAL_REPAIR: ;EAX=exception, EBX=exception record


CALL WRITE_LISTBOXLINE ;write "Attempting local repair (no unwind)" (saves eax)

CMP EAX,0E0000100h ;see if own software exception

JZ >11 ;yes

CMP EAX,0C0000094h ;see if divide by zero exception

JZ >9 ;yes

CMP EAX,0C0000005h ;see if memory read/write exception

JNZ >10 ;no

CMP B[EXC_TYPE],104D ;see if memory test



JZ >7 ;yes


JNZ >10 ;no

7:

CALL WRITE_MEM_ERROR ;write type and place of error



;************** read from memory error - the following will work

PUSH 1000h ;allocate another 4K

PUSH 4 ;HEAP_GENERATE_EXCEPTIONS on error=another exception

PUSH [hHeap] ;normally get this from handler structure

CALL HeapAlloc ;allocate another 4K

OR EAX,EAX ;see if error

JZ >10 ;yes

JMP >12

;******** the above did not work for write error because memory has already

;been written to during exception and is therefore "corrupt". You get a

;C0000005h access violation. The way round this is to use the virtual alloc

;function which will permit you to specify the starting place for the new

;memory allocation (which is the same as inaccessible address):-

8:

PUSH 4 ;read and write access

PUSH 1000h ;commit more memory

PUSH 1000h ;another 4K required

PUSH [EBX+18h] ;inaccessible address sent as 2nd part of array

CALL VirtualAlloc ;add another 4K using inaccessible address as base

OR EAX,EAX ;see if error

JZ >10 ;yes

JMP >12

;********************************

9: ;its divide by zero exception


MOV D[ESI+0ACh],1D ;replace ecx with 1 to ensure div by 1 next time

JMP >12

10: ;error or unexpected exception return





CALL WRITE_LISTBOXLINE ;write "Handler cannot repair this exception"

STC

RET

11: ;its an own software exception


MOV EDX,[EBP+0Ch] ;get pointer to ERR structure

MOV EAX,[EDX+14h] ;get ebp at safe place given in ERR structure

MOV [ESI+0B4h],EAX ;insert new ebp in context

MOV EAX,[EBX+14h] ;get from exception record the address to jump to

MOV [ESI+0B8h],EAX ;change eip in context

MOV EAX,[EBX+18h] ;get from exception record the 2nd part of array

MOV [ESI+0C4h],EAX ;which is the ESP at repair place

12:


CALL WRITE_LISTBOXLINE ;write "repair appears successful"

CLC

RET ;return nc on success, c on failure

;

HEAP_CLOSE:


JZ >20 ;yes


JNZ >23 ;no

20:


CALL WRITE_LISTBOXLINE ;write "Closing memory heap and dc"



PUSH [hHeap]

CALL HeapDestroy

JMP >22

21:

PUSH 4000h,0,[hHeap] ;MEM_DECOMMIT

CALL VirtualFree ;decommit memory used



PUSH 8000h,0,[hHeap] ;MEM_RELEASE

CALL VirtualFree

22:

PUSH [hDC],[hList]

CALL ReleaseDC

23:

RET

;

HANDLER_3: ;handler 3

PUSH EBP

MOV EBP,ESP



TEST D[EBX+4],02h ;see if its EH_UNWINDING (from Unwind)

JNZ >30 ;yes, so exception address is not useful here

MOV EAX,[EBX+0Ch] ;get ExceptionAddress

CALL WRITE_WHICHADDRESS

30:

MOV EAX,[EBX] ;get ExceptionCode

MOV DL,3 ;indicate 3rd handler

CALL WRITE_HANDLERDATA ;saves edx

TEST D[EBX+4],01h ;see if its a non-continuable exception

JNZ >34 ;yes


JZ >31 ;no

CALL CLEARUPCODE_MESS

CALL HEAP_CLOSE ;close the memory heap and dc if memory test

JMP >34 ;must return 1 to go to next handler

31:

CMP [HANDLER],DL ;see if this handler allowed to deal

JNZ >34 ;no

CMP B[CONTINUE],1 ;see if 1=continue from safe-place

JNZ >32 ;no so deal with exception locally

CALL TRYFOR_SAFEPLACE

JMP >33




32:

CALL ATTEMPT_LOCAL_REPAIR

JNC >33 ;success


33:

XOR EAX,EAX ;reload context and return to system

JMP >35

34:

MOV EAX,1 ;this handler will not deal with this exception

35:

POP ESI,EDI,EBX

MOV ESP,EBP

POP EBP

RET ;ordinary return because was a "C" type call not PASCAL

;

HANDLER_2: ;second handler

PUSH EBP

MOV EBP,ESP




MOV DL,2 ;indicate 2nd handler



JNZ >43 ;yes


JZ >40 ;no



40:


JNZ >43 ;no






JMP >42

41:


JNC >42 ;success


42:

XOR EAX,EAX ;exception was repaired - reload context and try again

JMP >44

43:

MOV EAX,1 ;this handler will not deal with this exception

44:

POP ESI,EDI,EBX

MOV ESP,EBP

POP EBP


;

HANDLER_1:

PUSH EBP

MOV EBP,ESP




MOV DL,1 ;indicate 1st handler



JNZ >53 ;yes


JZ >50 ;no



50:


JNZ >53 ;no







JMP >52

51:


JNC >52 ;success


52:

XOR EAX,EAX ;reload context and return to system

JMP >54

53:

MOV EAX,1 ;go to next handler

54:

POP ESI,EDI,EBX

MOV ESP,EBP

POP EBP


;

FINAL_HANDLER_RECOVERY: ;ebx=exception record, esi=context

MOV EDX,ADDR EXC_MESS23 ;will now do voluntary unwind and safe-place


;

;-- DO NOT REMOVE ---------------- the following unwind systems are alternative

;************* the final handler does not know the last ERR structure

;************* so find it

;FS MOV EAX,[0] ;get pointer to very first ERR structure

;L880:

;CMP D[EAX],-1 ;see if the last one

;JZ >L881 ;yes, so finish

;MOV EAX,[EAX] ;get pointer to next ERR structure

;JMP L880

;L881:

;PUSH ESI ;cannot rely on RtlUnwind to keep this (context)

;;**********************

;PUSH 0 ;return value (not used)

;PUSH EBX ;send exception_record to per-thread handlers



;PUSH ADDR UN25 ;return address

;PUSH EAX ;pointer to last unwind frame

;CALL RtlUnwind

;UN25:

;;**********************

;POP ESI

;JMP >61

;-- DO NOT REMOVE --------------------------------------------------------

;

;********************************** trying own unwind in final handler

MOV D[EBX+4],02h ;indicate eh_unwinding flag for termination code

FS MOV EDI,[0] ;get pointer to very first ERR structure

60:

CMP D[EDI],-1 ;see if the last one

JZ >61 ;yes, so finish

PUSH EDI,EBX ;push ERR structure,exception record

CALL [EDI+4] ;call the associated handler to run clear-up code

ADD ESP,8h ;remove parameters put on the stack

MOV EDI,[EDI] ;get pointer to next ERR structure

JMP 60

61:

;*******************************************************************

MOV EAX,[EBPSAFE_PLACE3] ;kept earlier in third_routine

MOV [ESI+0B4h],EAX ;insert new ebp

MOV EAX,[ESPSAFE_PLACE3] ;in case of this repair

MOV [ESI+0C4h],EAX ;insert new esp

MOV EAX,ADDR SAFE_PLACE3

MOV [ESI+0B8h],EAX ;insert new eip

MOV EAX,ADDR EXC_MESS24 ;hello from safe-place 3 message

MOV [ESI+0A8h],EAX ;insert new edx

RET

;

;*********************** now if exception reached this point it is serious

FINAL_HANDLER: ;this time the system passes only the pointer

MOV EDX,[ESP+4] ;to EXCEPTION_POINTERS - get it in edx





OR B[HANDLERFLAG],2 ;flag that in final handler

;************************** see EXCEPTION_POINTERS structure

MOV ESI,[EDX+4] ;get context record in esi

MOV EBX,[EDX] ;get pointer to Exception Record

MOV EAX,[EBX] ;get exception code

MOV DL,4 ;indicate final handler

CALL WRITE_HANDLERDATA ;saves esi, ebx

MOV EAX,[ESI+0B8h] ;get eip from context

PUSH ESI ;keep context

MOV ESI,ADDR EXC_MESS13 ;Exception at eip= h

MOV EDX,ESI

ADD ESI,25D

CALL HEXWRITE

CALL ADD_LISTBOXSTRING ;write the string in edx to listbox

MOV EDX,ADDR EXC_MESS17 ;"Press F3=polite end, F5=nasty end, F7=recover!"

CALL ADD_LISTBOXSTRING ;write the string in edx to listbox

POP ESI ;restore context

;*************************************** flush any key messages in message queue

0:

CALL GetActiveWindow ;get handle to dialog

PUSH 1 ;PM_REMOVE remove message if there

PUSH 108h,100h,EAX,ADDR MSG ;WM_KEYLAST,WM_KEYFIRST key press filter

CALL PeekMessageA

OR EAX,EAX ;see if there was a key message there

JNZ 0 ;yes, so ignore it

;**************** now wait for correct keypress but let mouse messages through

1: ;note that command messages are sent direct to dlgproc

CALL GetActiveWindow ;get handle to dialog

PUSH 0,0,EAX,ADDR MSG ;get all messages

CALL GetMessageA

MOV EAX,[MSG+4] ;get message

CMP EAX,100h ;see if below WM_KEYFIRST

JB >2 ;yes, so send to dlgproc

CMP EAX,108h ;see if above WM_KEYLAST



JA >2 ;yes, so send to dlgproc

MOV EAX,[MSG+8] ;get virtual key

CMP EAX,76h ;see if F7 pressed

JZ >3 ;yes


JZ >5 ;yes


JZ >4 ;yes

JMP 1 ;no so ignore and wait for other messages

2:

PUSH ADDR MSG

CALL DispatchMessageA ;send mouse message to DlgProc

JMP 1

3:

CALL FINAL_HANDLER_RECOVERY

MOV EAX,-1 ;reload context and continue execution

JMP >7

;*****************************************************************************

4:

PUSH 0 ;ok button only

PUSH 'This is the polite end'

PUSH 'We sincerely offer our grovelling apologies (sic)!'

PUSH [hInst]



CALL WRITE_LISTBOXLINE ;back to the system for unwind and termination

MOV EAX,1 ;terminate process without showing message box

JMP >6

5:


CALL WRITE_LISTBOXLINE ;back to the system for unwind and termination

MOV EAX,0 ;terminate process showing message box

6:

MOV D[MESSDELAY],1000D ;greater delay for final messages from the system

7:




;*********************************************************************

AND B[HANDLERFLAG],0FDh ;clear flag that in final handler

POP ESI,EDI,EBX

RET 4h ;(for what it's worth) remove parameter from the stack

;

CODEEND: ;label for attempted code corruption

;


Lesson 11 - How is a disassembler working? [21]

What is this document about?

This document describes the design and implementation of a tool that takes a 32-bit Windows executable file, disassembles its raw machine code into a human-readable representation such as assembly language, and displays the result to the user.

What is the purpose of this document?

Besides serving as my personal notes on what I studied, this document is mainly written for those of you who are interested in learning how to write a disassembler. I also make all the source files available for download. The source is extensively commented, but some parts of the project may still be difficult to follow without an understanding of the overall design, so this document fills that gap.

It is, unfortunately, not possible for me (or anybody) to fully describe every detail of how to write a disassembler from A to Z. Moreover, I do not claim that my design and implementation are "the best". In fact, this project was more for educating myself than for showing to others. My original intent was to write just a framework, then publish it so that other people could extend it.

"Open ended implementation"

The subtitle says "open ended implementation". What I mean by that is, as you will learn later in this document, that my implementation is basically incomplete, and you are more than welcome to take part in it by completing what I left off. To start working on that part, all you have to do is copy a couple of DLLs (and the associated header file and lib file) and start writing your own "decoder". See the document for details.

I will also complete the project eventually...

21. This article was found via Google and was written by Tsuyoshi Watanabe. We respect the work of this author, and you should do the same.




NOTE: I make no guarantee that either my design or my implementation is the most efficient or correct one. My design only reflects how I solved the problem, and it may well differ from yours.

I make certain assumptions:

- Microsoft Visual C++ is used as the compiler.

- The executable file to be disassembled was built with Microsoft tools (you can change this easily).

- Only 32-bit executables are supported.



Introduction

Questions

Disassembling machine code into human-readable assembly code sounds complicated. When you look at the Intel instruction manual, you understand that it is. However, it is not necessarily difficult to write a disassembler, provided that you decompose the task into smaller subordinate tasks.

Several questions pop up in your mind when you think about writing a disassembler.

- What does raw machine code look like?

- How are machine code and assembly code related?

- How do I find the beginning of the machine code? Where does it come from?

- What kind of documentation and specification do I need?

- etc.




Dumpbin

The easiest way to get answers to those questions is to play with the "dumpbin.exe" utility provided with the Microsoft Visual C++ tools. This utility comes with every version of Visual C++ from 2.0 to 6.0, as far as I remember. Note that Visual C++ 1.52, a 16-bit edition, does NOT come with dumpbin.exe. Instead it came with exehdr.exe or something similar, and that doesn't work for 32-bit PE format executable files.

Dumpbin.exe is a powerful PE format executable file dumper that can dump all kinds of information from any PE file. Here, we study the output of dumpbin.exe with the /DISASM switch. It literally "disassembles" the content of the "code" section of a given file. (Basically, we don't have to write a disassembler at all since we already have one!)



The following is a sample dumpbin output for "NOTEPAD.EXE".

One instruction appears per line (except when it is too long and wraps onto the following line). In the leftmost column you see the address of each instruction. The first instruction, "cmp", is located at address 01B41000.

The middle column shows the variable-length "raw machine code" of each instruction. For example, the first instruction is: 83 3D E8 8E B4 01 00




Finally, the human-readable assembly language instruction appears. It is: cmp dword ptr ds:[01B48EE8h], 0

You don't need to understand what this really means until a much later part of this document, but it roughly means "compare the 4-byte value located at address 01B48EE8 in the DS segment against the literal value 0".

Notice that some instructions are 7 bytes long, like the first one, while others may be 2 bytes long, 5 bytes long, or even just 1 byte long. The point is that Intel x86 processors (from the 8086 up to the current Pentium II) use "variable-length instructions" as opposed to "fixed-length" instructions. This is one of the differences from typical RISC processors, whose instructions are all the same length. Also contrast this with Java byte code. Although Java byte code is not a native "machine" code (well, it sometimes is... I think Sun has hardware that directly interprets Java byte code), it is a similar "encoding", and its opcodes are all one byte long (operand bytes may follow).
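To make the 7-byte example concrete, here is a minimal C++ sketch that takes the bytes 83 3D E8 8E B4 01 00 apart. It relies on the standard IA-32 encoding rules (opcode, then a ModR/M byte, then displacement, then immediate); the helper names splitModRM, readLE32 and printCmpBreakdown are made up for this illustration and are not part of the author's project.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// The three bit fields of an IA-32 ModR/M byte.
struct ModRM {
    unsigned mod; // bits 7..6: addressing mode
    unsigned reg; // bits 5..3: register number or opcode extension
    unsigned rm;  // bits 2..0: register or memory operand selector
};

// Hypothetical helper: split one ModR/M byte into its fields.
ModRM splitModRM(std::uint8_t b) {
    ModRM m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return m;
}

// Hypothetical helper: read a 32-bit little-endian value starting at p.
std::uint32_t readLE32(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// The seven bytes of the first NOTEPAD.EXE instruction quoted above.
const std::uint8_t kCmpInstruction[7] = { 0x83, 0x3D, 0xE8, 0x8E, 0xB4, 0x01, 0x00 };

// Print the pieces: opcode, ModR/M fields, 32-bit address, 8-bit immediate.
void printCmpBreakdown() {
    const ModRM m = splitModRM(kCmpInstruction[1]);
    std::printf("opcode=%02X mod=%u reg=%u rm=%u disp32=%08X imm8=%02X\n",
                static_cast<unsigned>(kCmpInstruction[0]), m.mod, m.reg, m.rm,
                static_cast<unsigned>(readLE32(&kCmpInstruction[2])),
                static_cast<unsigned>(kCmpInstruction[6]));
}
```

Opcode 83 with reg = 7 selects CMP with an 8-bit immediate, and mod = 0 with rm = 5 means "a bare 32-bit address follows", which is where 01B48EE8 comes from. The last byte, 00, is the literal value being compared.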



Intel x86? Which processor are we going to work for?

One of the reasons for Intel's success in the processor business is "backward compatibility" with legacy code. The following is a brief history of Intel's x86 series processors.

1978 8086 (the 8088 followed in 1979)

1982 80286

1985 i386

1989 i486

1993 Pentium

1995 Pentium Pro

1997 Pentium MMX & Pentium II

1998 Celeron

Each generation of processor became better and better by improving things like:

expanding data and address bus to increase addressing space

introducing protected mode for more reliable operating environment

increasing the size of cache

integrating with FPU

adding more instructions like MMX that EVERYBODY uses

adding superscalar pipelining

increasing clock rate

and many other things that I have no idea about




From our disassembler's point of view, we don't have to worry about processor-specific things. It is all hidden, and the instruction map has never been "modified", although new instructions were added over two decades.

Also, for this project, I intended to ignore 16-bit code completely. However, as I discovered, it was easier to include the logic that applies only to 16-bit code, since the processor architecture couples 16-bit and 32-bit modes rather strongly. In other words, separating the 16-bit material from the 32-bit material is more work than simply taking both into the project.

To answer the question of which processors our disassembler works for: it goes back only as far as the i386. The reason is that the 80286 offers only a 16-bit protected mode, while 32-bit Windows runs in the 32-bit protected mode introduced with the i386.

Which Microsoft Windows?

Our disassembler is going to work primarily with Portable Executable format files (a.k.a. PE files). The PE format is the standard executable file format for 32-bit Windows. 16-bit Windows executables use a format called NE (New Executable), which is not compatible with the PE format. Types of program that are in PE format are:

- User-mode executable file (EXE, DLL, and others) for Windows 95/98.

- User-mode executable file (EXE, DLL, and others) for Windows NT.

- Kernel-mode executable file (SYS) for Windows NT.

Kernel-mode executables for Windows 95 (and most of 98), normally called VxDs, are in LE format (a format that is somewhat more compatible with OS/2), which is not compatible with the PE file format.

However, as you will see, I designed the project in such a way that the piece of software that parses a stream of machine code bytes is completely ignorant of where the stream comes from. In other words, it could come from either a PE file's code section or a VxD's code segment. So it is possible to extend it to work with non-PE format files.

Still, the vast majority of executables we deal with every day are in PE file format, so we will only work with PE files.



Any reference needed?

The only documentation that is going to be required is an Intel processor manual. It is officially called the "Intel Architecture Software Developer's Manual" (ISBN-1555122744). There are three volumes, and volume 2 contains most of the information we need. The problem is that this document is not sold in most book stores. However, it is available for free from Intel's download site.

Intel's official manual is not the only reference that we could use. In fact, there are other books that also contain the information needed to write a disassembler. I find it helpful to have several references so that when one book is not clear about something, I can check the others. I used "The Intel Microprocessors 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium, and Pentium Pro Processor" by Barry B. Brey (ISBN-0132606704).



Overall architecture

Phases of data representation

Our data is a byte stream of machine code. A disassembler is nothing more than a piece of software that converts an input byte stream into something else. This conversion task can be broken down into smaller pieces. To find out how many pieces we can break it into, we need to see how many "phases" our data will go through. The following figure shows the three basic phases of the data.

The first phase is the start of the processing. There is just a raw byte stream which, supposedly, means something to the hardware processor. Note that there are no meaningful boundaries in the stream of bytes.

We would like to transform this stream of bytes into a list of much smaller groups of bytes, still in raw format, which I call "raw instructions". Each raw instruction should correspond to a single Intel x86 instruction. If the data at this phase is rendered to users, they will only see a bunch of variable-length hex numbers.

In the final phase, we hope that every raw instruction is converted into a line of words and numbers that we understand as "assembly language". If the data at this phase is rendered, users see "disassembled instructions".



Two processing tasks

By looking at the figure for the phases that our data will go through, we understand that there need to be two distinct "processing" tasks.

First, we need to bridge from "Phase I" to "Phase II". I decided to call the processing that transforms our data from "machine code byte stream" format into "raw instructions" format "Parsing". There may be a better technical term than "Parsing"; it could also be called something like "tokenization".

The second "processing", which transforms "raw instructions" into "assembly instructions", is named "Decoding" because what it really does is "interpret" what each byte in a raw instruction means and put it into another, human-readable form.



These two processing tasks could have been put together in one big "disassembler" processor, but I thought it was better to separate them into two completely independent tasks because:

The task of parsing involves deciding where the current instruction ends. In other words, it is primarily concerned with "how many bytes" it should process (read) for a single instruction. "Decoding", on the other hand, is a different kind of task that is not really concerned with (or doesn't want to be concerned with) how many bytes are in an instruction, but rather with what the data bytes mean, so that it can convert them into a group of keywords and numbers which we understand as "assembly code".

It could be argued that by separating them into two, I am producing some redundancy: basically there could be almost two "paths" for every byte in the input data. However, it seems reasonable to say that the cost of these "duplicates" is far less than the time you will spend debugging a module that performs two logically different tasks simultaneously.
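The split between the two tasks can be sketched in a few lines of C++. This is only an illustration under strong assumptions: it handles a deliberately tiny subset of real one-byte opcodes (NOP, RET, PUSH imm8, MOV r32, imm32), and the function names instructionLength, decodeInstruction and disassemble are invented here, not taken from the author's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// "Parsing": how many bytes does the instruction starting at p occupy?
// Returns 0 for opcodes outside this tiny illustrative subset.
std::size_t instructionLength(const std::uint8_t* p) {
    const std::uint8_t op = p[0];
    if (op == 0x90 || op == 0xC3) return 1;  // NOP, RET
    if (op == 0x6A) return 2;                // PUSH imm8
    if (op >= 0xB8 && op <= 0xBF) return 5;  // MOV r32, imm32
    return 0;
}

// "Decoding": what do the bytes of one raw instruction mean?
std::string decodeInstruction(const std::uint8_t* p, std::size_t len) {
    static const char* const kReg32[8] =
        { "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi" };
    char buf[32];
    if (len == 1 && p[0] == 0x90) return "nop";
    if (len == 1 && p[0] == 0xC3) return "ret";
    if (len == 2 && p[0] == 0x6A) {
        std::snprintf(buf, sizeof buf, "push %02Xh", static_cast<unsigned>(p[1]));
        return buf;
    }
    if (len == 5 && p[0] >= 0xB8 && p[0] <= 0xBF) {
        // The immediate is stored little-endian after the opcode byte.
        const unsigned imm = static_cast<unsigned>(p[1])
                           | (static_cast<unsigned>(p[2]) << 8)
                           | (static_cast<unsigned>(p[3]) << 16)
                           | (static_cast<unsigned>(p[4]) << 24);
        std::snprintf(buf, sizeof buf, "mov %s, %08Xh", kReg32[p[0] - 0xB8], imm);
        return buf;
    }
    return "???";
}

// Run both stages over a stream: one line of text per raw instruction.
std::string disassemble(const std::uint8_t* code, std::size_t size) {
    std::string out;
    std::size_t pos = 0;
    while (pos < size) {
        const std::size_t len = instructionLength(code + pos);
        if (len == 0 || pos + len > size) { out += "???\n"; break; }
        out += decodeInstruction(code + pos, len) + "\n";
        pos += len;
    }
    return out;
}
```

Note how the opcode range B8..BF is inspected twice, once for its length and once for its meaning. That is the redundancy discussed above, and in exchange each function stays small enough to test on its own.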



Mapping of "processing" tasks to objects.

In a pure object-oriented design, this mapping is probably a "big NO NO". I am mapping "processing" to "objects", which doesn't make sense in the OO design world. However, I argue that "parsing" is done by a "parser" and "decoding" is performed by a "decoder", so I can map objects to these "XXXers". The following figure shows our "parser" and "decoder".

The green rectangles are objects. As you can see, the Parser takes a "machine code byte stream" as its input and produces "raw instructions". In turn, the Decoder takes a "raw instruction" and converts it into an "assembly instruction".

This is our overall design of the "engine" part of the disassembler.



Other utility objects we need.

You might have noticed that the objects we have so far, Parser and Decoder, have no interaction with the user. The Parser could ask the user for a "machine code byte stream" directly, but I don't know how many users can actually hand-craft a machine code byte stream and give it to the Parser. Meanwhile, when the Decoder does its job of decoding raw instructions into lines of assembly language code, how is it going to show its work to the user? Should it show each line in a separate message box? We might automatically think that the output of any disassembler should be scrolling text on the standard output console, but it doesn't have to be. I never mentioned how it is "rendered". Who will be doing this extra work?

The figure below shows a couple of objects that do the "data providing (fetching)" and "rendering service" parts.

"Data stream provider" is someone or a piece of software that somehow "produces" an input data stream. Our Parser happens to be a consumer of that product. The data stream may come from a PE file, an LE file, the memory of a running program, or anywhere else. There could be several different flavors of "Data stream provider", including the Clipboard, to which a user may have copied data from somewhere. The point is that it is a bad idea to make assumptions about the source here.

For this project, I arbitrarily decided to use a kind of PE dump utility to provide us with the "data stream providing service".

"Rendering provider" (rendering service provider) is the UI guy. The rendering technique may be simple console output, dialog-based list box output, or something more sophisticated (complicated) like a graph of caller-callee relationships, but it is up to the designer of the "Rendering provider" to decide how to "render" the assembly language lines produced by the Decoder. In this project, I have a Decoder called SimpleDecoder, and it uses "std::cout" as the rendering provider. Since "std::cout" is a "service" rather than an independent entity, the SimpleDecoder implementation basically lacks a "rendering provider" piece. More on this later.



Where is user?

Now we put all the pieces together. Certainly, the most important piece is the "user". As I described in the previous section, the UI layer is what interacts with the "user".

The user provides the "executable file" to be disassembled. In turn, our "disassembler system" returns a disassembled file.

So, this completes the section for "architecture".
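One way to picture the utility objects of this architecture is as two small abstract classes that the engine talks to. The following is a sketch only: IDataStreamProvider, IRenderer, MemoryProvider, CollectingRenderer and dumpBytes are all hypothetical names chosen for this illustration, and the "engine" here merely prints each byte as hex rather than disassembling anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "data stream provider": anything that can hand out code bytes.
class IDataStreamProvider {
public:
    virtual ~IDataStreamProvider() {}
    virtual const std::uint8_t* data() const = 0;
    virtual std::size_t size() const = 0;
};

// Hypothetical "rendering provider": anything that can show one line of output.
class IRenderer {
public:
    virtual ~IRenderer() {}
    virtual void renderLine(const std::string& line) = 0;
};

// One possible provider: bytes already in memory (e.g. pasted from the clipboard).
class MemoryProvider : public IDataStreamProvider {
public:
    explicit MemoryProvider(std::vector<std::uint8_t> bytes) : bytes_(std::move(bytes)) {}
    const std::uint8_t* data() const override { return bytes_.data(); }
    std::size_t size() const override { return bytes_.size(); }
private:
    std::vector<std::uint8_t> bytes_;
};

// One possible renderer: collect the lines instead of printing them.
class CollectingRenderer : public IRenderer {
public:
    void renderLine(const std::string& line) override { lines.push_back(line); }
    std::vector<std::string> lines;
};

// Stand-in for the engine. It sees only the two interfaces, so it does not
// care where the bytes come from or how the text is shown.
void dumpBytes(const IDataStreamProvider& in, IRenderer& out) {
    static const char kHex[] = "0123456789ABCDEF";
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint8_t b = in.data()[i];
        out.renderLine(std::string{ kHex[b >> 4], kHex[b & 0x0F] });
    }
}
```

Swapping MemoryProvider for a PE-file-backed provider, or CollectingRenderer for a list box, would leave dumpBytes untouched, which is the point of keeping these pieces separate.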



Getting machine code byte stream

PE file wrapper object

In the previous chapter, we decided to have a utility object that provides the "data stream providing" service. I also decided that for this project we would use some kind of PE dumper. Luckily, there are many sample PE dumpers (I used Matt Pietrek's PEDUMP as the starting point, thanks Matt!).

The following figure shows you which part of the system we will work on in this chapter.



According to the requirements, what we need is an object that is capable of taking a specified executable file from the user, extracting the "code" part of that file, and making it available to others such as our Parser (or anybody else who needs a "machine code byte stream").

Requirements of this object are:

- It takes a file name of a target PE executable as an input

- It understands the PE file format

- It provides service functions so that client can obtain "machine code byte stream".

This is not a terribly involved set of requirements. They can easily be fulfilled by extending a typical PE file dumper.

Although the topic of PE file dumpers is interesting and important, I decided not to dwell too long on this subject. Besides, this object is rather an "extra" helper object. We are more interested in the "Parser" and "Decoder", since they provide the "guts" of a disassembler.

For this reason, I will just show my implementation of a "Data stream provider", called PEFileWrap.



PEFileWrap

Basically, PEFileWrap is a "wrapper" around a PE file which provides a couple of methods, among others, that give information about the location and size of the "code section" within a PE file.



PE file has different sections like these:

- Standard header

- Optional header

- Section table

- Code section

- Initialized data section

- Uninitialized data section

- Import table

- Export table

- Thread local storage

- etc.

However, you don't really care about anything except the "Code section" of a PE file. This section is where the linker emits all the object code. When the OS loader starts a program, execution begins at the entry point recorded in the optional header, which normally lies inside this section. This section is the "machine code byte stream" that we are going to disassemble.
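As a rough illustration of how a dumper can locate that section, the sketch below walks the section table of a PE image that has already been read into memory. The field offsets follow the published PE/COFF layout (e_lfanew at 0x3C, a 20-byte file header after the "PE\0\0" signature, 40-byte section headers); the names CodeSection and findCodeSection are invented for this example and have nothing to do with the author's PEFileWrap code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Result of the search (hypothetical type for this sketch).
struct CodeSection {
    bool found;
    std::uint32_t fileOffset; // PointerToRawData of the section
    std::uint32_t rawSize;    // SizeOfRawData of the section
};

static std::uint16_t le16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
static std::uint32_t le32(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Return the first section flagged IMAGE_SCN_CNT_CODE (0x00000020).
CodeSection findCodeSection(const std::vector<std::uint8_t>& file) {
    const CodeSection none = { false, 0, 0 };
    if (file.size() < 0x40 || file[0] != 'M' || file[1] != 'Z') return none;
    const std::size_t pe = le32(&file[0x3C]);              // e_lfanew
    if (pe + 24 > file.size()) return none;
    if (std::memcmp(&file[pe], "PE\0\0", 4) != 0) return none;
    const std::size_t sections = le16(&file[pe + 6]);      // NumberOfSections
    const std::size_t optSize  = le16(&file[pe + 20]);     // SizeOfOptionalHeader
    std::size_t entry = pe + 24 + optSize;                 // first section header
    for (std::size_t i = 0; i < sections; ++i, entry += 40) {
        if (entry + 40 > file.size()) return none;
        if (le32(&file[entry + 36]) & 0x20u) {             // Characteristics
            const CodeSection cs = { true, le32(&file[entry + 20]), le32(&file[entry + 16]) };
            return cs;
        }
    }
    return none;
}
```

A real tool would also check the machine type and the optional header magic; this sketch skips those checks to stay short.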



The interface (an abstract base class) is as follows. The ones we are going to use are the first three.

class IPEFileWrap
{
public:
    virtual DWORD GetBase() = 0;
    virtual DWORD GetCodeSectionOffset() = 0;
    virtual UINT GetCodeSectionSize() = 0;
    virtual DWORD GetInitializedDataSectionOffset() = 0;
    virtual UINT GetInitializedDataSectionSize() = 0;
    virtual DWORD GetUninitializedDataSectionOffset() = 0;
    virtual UINT GetUninitializedDataSectionSize() = 0;
    virtual DWORD GetImportDataSectionOffset() = 0;
    virtual UINT GetImportDataSectionSize() = 0;
    virtual DWORD GetExportDataSectionOffset() = 0;
    virtual UINT GetExportDataSectionSize() = 0;
    virtual DWORD GetResourceSectionOffset() = 0;
    virtual UINT GetResourceSectionSize() = 0;
};

This interface is defined in the header file "PEUtility.h", which is in the project's shared Include directory as well as under the PEUtility project source directory. Our disassembler needs to include this file so that we can use the service.



How to create and use IPEFileWrap object

I decided to package this object in a DLL for two reasons:

1. It makes the project simpler.

2. I can update the DLL, since this PE dump stuff could potentially be another fun project to extend. (You are more than welcome to enhance it into the next generation of PE file content dumper.)

At any rate, this object is "hosted" by a server DLL called PEUtility.DLL. The DLL is under the project's top-level Debug/Release directory. The DLL exports a function that you should call to obtain a pointer to an IPEFileWrap object. It is called CreatePEFileWrap().

extern "C" PEUTILITY_API int CreatePEFileWrap(char* filename, IPEFileWrap** ppx);

enum PEUTILITY_ERROR_CODE
{
    PEUTILITY_SUCCESS = 1,
    PEUTILITY_FAILURE = 0,
    OBJECT_ALREADY_CREATED = -1
};



When you call CreatePEFileWrap, it returns a PEUTILITY_ERROR_CODE. Use this return code to find out whether there was any problem. If you get PEUTILITY_SUCCESS, then everything went well.
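A call sequence might look like the sketch below. Since PEUtility.DLL is not available here, the block supplies its own stand-in: FakePEFileWrap and the body of CreatePEFileWrap are invented for illustration (including the made-up offset and size values and the guess that a second call yields OBJECT_ALREADY_CREATED), the interface is trimmed to three methods, and DWORD and UINT are plain typedefs instead of the Windows ones. Only the signature and the enum come from the text; locateCode is a hypothetical client helper.

```cpp
#include <cassert>

typedef unsigned long DWORD; // stand-ins for the Windows typedefs
typedef unsigned int  UINT;

// Trimmed copy of the interface from the text (three methods only).
class IPEFileWrap {
public:
    virtual DWORD GetBase() = 0;
    virtual DWORD GetCodeSectionOffset() = 0;
    virtual UINT  GetCodeSectionSize() = 0;
};

enum PEUTILITY_ERROR_CODE {
    PEUTILITY_SUCCESS = 1,
    PEUTILITY_FAILURE = 0,
    OBJECT_ALREADY_CREATED = -1
};

// Invented stand-in for the object inside PEUtility.DLL; values are made up.
class FakePEFileWrap : public IPEFileWrap {
public:
    DWORD GetBase() { return 0x01B40000; }
    DWORD GetCodeSectionOffset() { return 0x400; }
    UINT  GetCodeSectionSize() { return 0x1000; }
};

// Invented stand-in for the exported function: one wrapper per process.
int CreatePEFileWrap(char* filename, IPEFileWrap** ppx) {
    static FakePEFileWrap theOnlyWrapper;
    static bool created = false;
    if (filename == 0 || ppx == 0) return PEUTILITY_FAILURE;
    if (created) return OBJECT_ALREADY_CREATED;
    created = true;
    *ppx = &theOnlyWrapper;
    return PEUTILITY_SUCCESS;
}

// Client side: check the return code before touching the object.
bool locateCode(char* filename, DWORD* offset, UINT* size) {
    IPEFileWrap* pe = 0;
    if (CreatePEFileWrap(filename, &pe) != PEUTILITY_SUCCESS) return false;
    *offset = pe->GetCodeSectionOffset();
    *size   = pe->GetCodeSectionSize();
    return true;
}
```

The client treats anything other than PEUTILITY_SUCCESS as a failure, so a repeated call is rejected rather than silently reusing an object that describes a different file.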

NOTE: this implementation is "asking" for a COM implementation. I intentionally made this a non-COM object because introducing COM here might make the project more complex and harder to understand. Needless to say, it would be far better to have it as a COM object.

This PEFileWrap object has a huge drawback. It is not multi-thread ready, and you can't open more than one PE file at a time with it. In fact, with the current implementation, all you can do is create a PE file wrapper once, and the object continues to exist until the DLL is unloaded (which basically means until the application terminates). This problem can be solved by making the object COM-compliant.

After all, if you don't like this implementation, you can use your own PE file dumper utility. As mentioned before, the only requirement for it in this project is that it can provide the location of the "machine code byte stream" and the size of that stream!

You might want to look at the classes in the PEUtility project. I have the following:

- PEFile - represents a single PE file

- PEFileHeader - represents the "header" part of a PE file

- PEOptionalHeader - represents the "optional header" part of a PE file

- PESectionTable - represents the "section table" part of a PE file

The code is largely based on Matt Pietrek's PEDUMP. However, the code has been decomposed into these classes from a straight "C" implementation. Each class could be extended to provide more sophisticated capabilities. For this project, however, my PEFileWrap fulfills the requirements, so there is no point in equipping it with other capabilities.



Understanding 32-bit Intel Processor Architecture (IA32) for parsing

In the previous chapter, we learned how to get a "machine code byte stream" by using the service provided by a PE wrapper class called PEFileWrap, which is hosted in PEUtility.DLL.

Our next task is to parse the raw machine code byte stream into a list of smaller groups of bytes, where each element in the list represents a single instruction.

A design issue that we need to agree on

Before we go any further, we have a "design issue". The issue is this:




When Parser processes the input byte stream, will it produce an array of raw instructions? Or will it find the end of a single instruction and give control back to the client? To understand this issue, compare the following possible designs of the Parser:

1. Parser processes the byte stream, and as it finds each instruction, it adds an entry to an array of pointers to bytes. At the end, there will be a dynamic array whose number of elements equals the number of instructions in the input byte stream.

2. Parser processes the byte stream, and as it finds each instruction, it copies all of the instruction's bytes into a buffer. This buffer could be an element of a dynamically growable array.

3. Parser processes the byte stream, and as it finds each instruction, it returns a pointer to the beginning of the current instruction within the input byte stream. It also returns the number of bytes in the current instruction. When the client says "go ahead", Parser resumes processing immediately beyond the last byte of the previous instruction.

To make the story short, I used design #3. The reason is efficiency in space as well as time. There will be no copying of data into another location, which would probably need to be dynamically allocated (actually, I do perform physical copying for caching purposes). There will be no dynamically growable array of pointers. A pointer takes up 4 bytes, so 1000 instructions would need about 4 KB of memory, which is the size of an entire page.

The only requirement with this design is that the client, in our case Decoder or whoever owns Decoder, must perform the decoding task on the fly. It can't go back to a previous instruction once it proceeds to the next one. This makes the "decoding task" more linear.
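To make design #3 concrete, here is a minimal C++ sketch. It is not the code from the project: the names RawInstruction, LengthFn, InstructionCursor and toy_length are all made up for illustration, and the length logic is passed in as a callback because working it out is the subject of the rest of this chapter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view of one instruction inside the input stream: no bytes are copied.
struct RawInstruction {
    const std::uint8_t* start;  // first byte of the instruction
    std::size_t length;         // number of bytes it occupies
};

// Hypothetical signature: returns the length of the instruction at `p`,
// or 0 if it cannot be determined from the `avail` bytes that remain.
using LengthFn = std::size_t (*)(const std::uint8_t* p, std::size_t avail);

class InstructionCursor {
public:
    InstructionCursor(const std::uint8_t* stream, std::size_t size, LengthFn fn)
        : pos_(stream), end_(stream + size), length_of_(fn) {}

    // Design #3: hand back one instruction, then wait for the next call.
    // Returns false at the end of the stream or on a parse failure.
    bool next(RawInstruction& out) {
        if (pos_ >= end_) return false;
        const std::size_t avail = static_cast<std::size_t>(end_ - pos_);
        const std::size_t n = length_of_(pos_, avail);
        if (n == 0 || n > avail) return false;
        out.start = pos_;
        out.length = n;
        pos_ += n;  // resume immediately beyond the last byte
        return true;
    }

private:
    const std::uint8_t* pos_;
    const std::uint8_t* end_;
    LengthFn length_of_;
};

// Toy stand-in for the real length logic, used only to exercise the cursor:
// it knows NOP (90, 1 byte) and PUSH imm8 (6A ib, 2 bytes) and nothing else.
inline std::size_t toy_length(const std::uint8_t* p, std::size_t avail) {
    if (avail >= 1 && p[0] == 0x90) return 1;
    if (avail >= 2 && p[0] == 0x6A) return 2;
    return 0;
}
```

The client calls next() in a loop and decodes each instruction before asking for the following one. The cursor itself stores only a position, so memory use does not grow with the number of instructions.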

Anyway, that's how my Parser is going to work. Now, let's get to the topic of "how to parse an Intel machine code byte stream" so that it can find the boundary of the current instruction.



Understanding Intel Architecture

The subtitle of this chapter is "Understanding Intel Architecture (IA32) for parsing". The next chapter talks about the design of Parser. Why do we need an entire chapter for understanding Intel Architecture? Because our task of "parsing" requires us to understand it. Of course, we only need to understand a very small segment of the Intel Architecture to do our job.

Let's dive into it. Ready?

This is the format of Intel x86 instructions: optional prefixes first, followed by Opcode, ModR/M, SIB, Displacement, and Immediate. Yes, you probably don't understand things like "ModR/M" and "SIB". You might have some idea of "Opcode" and "Displacement". They mean lots of things, but we have to remember this while we study the Intel x86 instruction format:

At this point, we don't care what they mean from an assembly language point of view. All we want to know is how many bytes each instruction we parse is going to be.

To put it another way, Parser wants to know where the current instruction ends. How could it be sure that an instruction ends at a particular location? That's what we are going to find out in this chapter.




Prefix

Let's start with those guys that sit before Opcode.

The ones highlighted with a yellow background are called "prefixes" as a whole. As the name implies, they may appear before Opcode. Their job is to "override" the default attributes of the processor mode. For instance, when the processor is running in 32-bit mode, the default address and data (operand) sizes are 32-bit. Say you wanted to move just 16-bit data into a register; then, only for that instruction, the default operand-size attribute must be overridden.

Address size and operand size are similar. Their presence flips the attribute between 16-bit and 32-bit. Let's not go too far on this. Our goal is to understand "how many bytes we need to read for an instruction". If you are interested, check the Intel processor manual. (I don't mean to escape; we just don't have to know about it until we write Decoder.)

The most important things that we should know about prefixes are:

- Each prefix is exactly one byte.

- Every prefix is optional - it may be present or absent.

- The order of appearance is not fixed (this I am not entirely sure about, but the Intel processor manual says so).

What do they look like? Here you go.

Instruction Prefix: F0 F2 F3

Address-size override prefix: 67

Operand-size override prefix: 66

Segment override prefix: 2E 36 3E 26 64 65

They are all in hex, and Instruction Prefix and Segment Prefix have more than one value. Each byte means something, but we don't care what they mean now. (Again, go ahead and find out what they mean.)

So, what does this all mean to our parser? It means that:

Any instruction may start with at most 4 prefix bytes, which may appear in any order, so we need to keep reading until all of the prefixes (or none) are consumed. In addition, the Address-size prefix and Operand-size prefix are going to influence subsequent parsing, so we had better remember that we saw them, if they exist. That's it.
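As a rough illustration of the prefix rules, the following C++ sketch consumes up to four leading prefix bytes and remembers whether 66 or 67 was among them. The names PrefixInfo, is_prefix and scan_prefixes are my own for this example and are not taken from the project source.

```cpp
#include <cstddef>
#include <cstdint>

// Result of scanning the optional prefix bytes (names are illustrative).
struct PrefixInfo {
    std::size_t count = 0;               // number of prefix bytes consumed
    bool operand_size_override = false;  // saw 66
    bool address_size_override = false;  // saw 67
};

// True if `b` is one of the prefix bytes listed above.
inline bool is_prefix(std::uint8_t b) {
    switch (b) {
        case 0xF0: case 0xF2: case 0xF3:   // instruction prefixes
        case 0x66:                         // operand-size override
        case 0x67:                         // address-size override
        case 0x2E: case 0x36: case 0x3E:
        case 0x26: case 0x64: case 0x65:   // segment overrides
            return true;
        default:
            return false;
    }
}

// Consume up to four leading prefix bytes, in any order.
inline PrefixInfo scan_prefixes(const std::uint8_t* p, std::size_t avail) {
    PrefixInfo info;
    while (info.count < 4 && info.count < avail && is_prefix(p[info.count])) {
        if (p[info.count] == 0x66) info.operand_size_override = true;
        if (p[info.count] == 0x67) info.address_size_override = true;
        ++info.count;
    }
    return info;
}
```

After the call, the opcode starts at p + info.count, and the two flags are kept around for the size decisions that come later.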

Opcode

Probably, Opcode is the easiest one to understand (although we don't really care what it means). It decides which of the operations provided by the processor a particular instruction wants.

For parsing purposes, the first thing we need to understand is that Opcode itself can take up either one or two bytes. There is no case where Opcode is absent. If Opcode is not found, parsing must have gone wrong somewhere.

Anyway, our Parser must be able to read either one byte or two bytes depending on whether this Opcode is a "One Byte" opcode or a "Two Byte" opcode. How can you tell? Easy. If you see 0F (hex), it is an escape byte announcing yet another opcode byte.
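The 0F rule fits in a few lines. This is only a sketch with a made-up function name (opcode_size), and it assumes the pointer has already been moved past any prefix bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Returns 1 or 2 (the opcode size in bytes), or 0 if the stream ends too early.
// `p` must point just past any prefix bytes.
inline std::size_t opcode_size(const std::uint8_t* p, std::size_t avail) {
    if (avail == 0) return 0;     // no opcode at all: parsing went wrong earlier
    if (p[0] != 0x0F) return 1;   // ordinary one-byte opcode
    return avail >= 2 ? 2 : 0;    // 0F escapes to a second opcode byte
}
```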



Besides the size of an opcode itself, we must find out what kind of operand fields a particular opcode is going to take. Some opcodes don't take any operand, others take just a ModR/M byte, some take an Immediate, etc.

The red arrows in the above figure mean that the presence of the fields pointed to is dictated by the field where the arrow originates. Therefore, whether a ModR/M byte follows or not depends on the opcode. The same applies to Displacement and Immediate.

So, it is getting a little complicated here. What do we do?

This is the part that took most of my time in this project so far (aside from Decoder, which will take a lot more). It would be nice if the Intel processor manual had tables that say, "this and this and that instruction take ModR/M; that and these opcodes take an immediate", etc. Unfortunately, it doesn't.

The Intel processor manual describes the operand requirements for every opcode, but there aren't any nicely formatted tables. I basically had to create tables of requirements by hand. The tables I made are:

- Table of "One Byte" opcodes which take ModR/M field.

- Table of "Two Byte" opcodes which take ModR/M field.

- Table of "One Byte" opcodes which take 1-byte Displacement.

- Table of "One Byte" opcodes which take 2/4-byte Displacement.

- Table of "Two Byte" opcodes which take 1-byte Displacement.

- Table of "Two Byte" opcodes which take 2/4-byte Displacement.

- Table of "One Byte" opcodes which take 1-byte Immediate.

- Table of "One Byte" opcodes which take 2/4-byte Immediate.

- Table of "Two Byte" opcodes which take 1-byte Immediate.

- Table of "Two Byte" opcodes which take 2/4-byte Immediate.

If you are lost, that's natural.

The "2/4-byte" part means that the size is either 2 bytes (16-bit) or 4 bytes (32-bit). How are we going to know which size the operand is? This is where the "default attribute" and the possible presence of an Operand-size prefix come into play. Parser must "recall" any prefix that it might have already parsed.

Construction of these tables took a while, and since I did it by hand, there may be errors. On top of that, due to my lack of complete understanding of assembly language (did I mention that I never really programmed in assembly language before? I am a C/C++ programmer!), I might have made some mistakes. So far, my test results say that I got it right, but I will not be surprised if a bug or two pops up due to a bad table.

I am not going to show the contents of these tables here. It wouldn't be exciting anyway. You can see the tables in the source file "IA32OpcodePart.cpp". They are in a ".cpp" file because these tables are static members of a class.

From Parser's point of view, it has to check the current opcode against these operand field requirements, and if any match is found, it has to remember to parse the operand fields. The size of a field might have to be determined from the operand size (and a possible override made by a prefix). For instance, if an Immediate is required by a particular opcode, then after we read (or skip) the ModR/M, SIB, and Displacement fields, we must remember to parse the right number of bytes for the Immediate operand.

Don't worry about remembering all these details. After all, this part is already implemented, so you never have to implement it (unless you become sick of my spaghetti and decide to write your own).
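To give a feel for what such a requirement lookup can look like, here is a small C++ sketch. It does not reproduce the tables in IA32OpcodePart.cpp. The flag names and both functions are invented for this example, only a handful of well-known one-byte opcodes are covered, and counting relative jump and call offsets as "displacement" is my assumption about how the tables are organized.

```cpp
#include <cstddef>
#include <cstdint>

// Operand fields an opcode may require (bit flags; names are illustrative).
enum OperandFlags : std::uint8_t {
    NONE     = 0,
    MODRM    = 1 << 0,  // a ModR/M byte follows the opcode
    DISP8    = 1 << 1,  // 1-byte displacement
    DISP_VAR = 1 << 2,  // 2/4-byte displacement
    IMM8     = 1 << 3,  // 1-byte immediate
    IMM_VAR  = 1 << 4   // 2/4-byte immediate
};

// Requirements for a few one-byte opcodes only. A real table must cover
// all 256 entries; opcodes not listed here simply report NONE.
inline std::uint8_t one_byte_requirements(std::uint8_t opcode) {
    switch (opcode) {
        case 0x90: return NONE;             // NOP
        case 0xC3: return NONE;             // RET
        case 0x89: return MODRM;            // MOV r/m, r
        case 0x6A: return IMM8;             // PUSH imm8
        case 0x68: return IMM_VAR;          // PUSH imm16/32
        case 0x83: return MODRM | IMM8;     // group 1: op r/m, imm8
        case 0x81: return MODRM | IMM_VAR;  // group 1: op r/m, imm16/32
        case 0xEB: return DISP8;            // JMP rel8
        case 0xE8: return DISP_VAR;         // CALL rel16/32
        default:   return NONE;
    }
}

// Bytes to skip for the Immediate field, given the flags, the default
// operand size, and whether a 66 prefix was seen earlier.
inline std::size_t immediate_size(std::uint8_t flags, bool default_is_32bit,
                                  bool operand_override_seen) {
    if (flags & IMM8) return 1;
    if (flags & IMM_VAR)  // the prefix flips the default size
        return (default_is_32bit != operand_override_seen) ? 4 : 2;
    return 0;
}
```

For example, in 32-bit code 68 is followed by a 4-byte immediate (5 bytes in total), while 66 68 is followed by a 2-byte immediate (4 bytes in total).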




ModR/M

This field is scary looking. What the &*^# is "ModR/M"? In short, this field possibly encodes one or two operands, one of which could be a memory operand. It may also encode a sort of "sub-opcode", where certain opcodes define a "group" and the actual operation is determined by looking at a part of this ModR/M byte. Even worse, ModR/M may require a SIB byte, which follows ModR/M. Again, the details of the "meaning" are not so important here.

As you can see from the couple of red arrows, ModR/M may say that it needs the SIB and/or Displacement field to completely describe the operand(s).

Checking for these conditions was not difficult. For a given ModR/M byte (bit 0 being the least significant bit, patterns written most significant bit first):

- If bits 0,1,2 are equal to "100" (e.g. 00000100), bits 6,7 are not "11", and Address-size is 32-bit, then a SIB byte will follow.

- If bits 6,7 are equal to "01", then there will be a 1-byte displacement.

- If bits 6,7 are equal to "10" and Address-size is 32-bit, then there will be a 4-byte displacement.

- If bits 6,7 are equal to "10" and Address-size is 16-bit, then there will be a 2-byte displacement.

- If bits 6,7 are equal to "00", there is normally no displacement. The exceptions are bits 0,1,2 equal to "101" with 32-bit Address-size (4-byte displacement) and "110" with 16-bit Address-size (2-byte displacement).

From the parser's point of view, it needs to look for these signatures at the bit locations mentioned above, and remember the field requirements if they occur.

The Intel processor manual completely describes the meaning of every possible pattern of the ModR/M byte, so it shouldn't be confusing when implementing a decoder.
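The ModR/M checks can be sketched in C++ as below. The code follows Intel's documented field layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0). The struct and function names are hypothetical and are not the ones used in the project.

```cpp
#include <cstddef>
#include <cstdint>

// What a ModR/M byte says about the fields that follow it (illustrative names).
struct ModRMFollowers {
    bool has_sib;           // a SIB byte follows
    std::size_t disp_size;  // displacement size in bytes: 0, 1, 2, or 4
};

// Inspect a ModR/M byte: mod = bits 7-6, reg = bits 5-3, r/m = bits 2-0.
inline ModRMFollowers inspect_modrm(std::uint8_t modrm, bool address_is_32bit) {
    const unsigned mod = modrm >> 6;
    const unsigned rm  = modrm & 0x07;
    ModRMFollowers f = {false, 0};
    if (mod == 3) return f;                 // register operand: nothing follows
    if (address_is_32bit) {
        f.has_sib = (rm == 4);              // r/m = 100 selects a SIB byte
        if (mod == 1) f.disp_size = 1;
        else if (mod == 2) f.disp_size = 4;
        else if (rm == 5) f.disp_size = 4;  // mod 00, r/m 101: disp32 only
    } else {                                // 16-bit addressing has no SIB
        if (mod == 1) f.disp_size = 1;
        else if (mod == 2) f.disp_size = 2;
        else if (rm == 6) f.disp_size = 2;  // mod 00, r/m 110: disp16 only
    }
    return f;
}
```

One case cannot be decided from ModR/M alone: with mod 00 and a SIB byte whose low three bits are 101, a 4-byte displacement follows. This sketch leaves that check to the caller, which has to read the SIB byte anyway.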



SIB

This byte is very passive, and you hardly have to do anything as far as parsing is concerned. Just go right past the SIB byte to the next field (or to the end of the current instruction if no other fields follow). The one exception: if bits 6,7 of ModR/M are "00" and bits 0,1,2 of the SIB byte are "101", a 4-byte displacement follows the SIB byte.

The meaning of every possible SIB byte is completely described in the Intel processor manual.




Displacement

The presence of this field has already been determined by either Opcode or ModR/M, including the size of the displacement. Parser needs to advance its location past the displacement field, if it exists.

The meaning of this field and how it relates to other fields are not our concern at this moment. It will probably be used for effective address calculation when fetching data from memory.



Immediate

Finally, we see the end of the tunnel. Just like Displacement, if this field is required, we would have known by now. Just skip over the number of bytes in this field.

When the pointer (or location counter) has advanced past this field, we must be looking at the beginning of the next instruction. This is where Parser would say, "Done, here is the current instruction!".




Now what?

After going through each part of the instruction format, we have a general idea of what kind of operations must be implemented for our parser.

If we can translate this instruction format into software objects, it saves us time, because our understanding of the instruction format is immediately reflected in the design of the software objects.

The following figure shows the instruction format from our software's point of view. I made some arbitrary regrouping of the parts (fields) of an instruction that is reasonable for our software.

For instance, you see that all the prefix fields are merged into a single "prefix part". This makes sense because of the strong relationship among the four prefixes. Another merger occurred between ModR/M and SIB. Since SIB is a passive field (it doesn't designate other fields), it became part of the logic that takes care of ModR/M. The notion of "size" is less strict in this view.

In the next chapter, we map these parts to C++ objects, and refine the relationships and interactions among the fields.




Decoding raw instructions

Simple implementation - SimpleDecoder

This chapter is about the Decoder part of a disassembler.

The task of a "decoder" in this context is to take the raw instruction bytes (possibly just one byte) that represent a single instruction and convert them into a human-readable format such as a line of assembly.

Decoder does not have to worry about figuring out how many bytes the current instruction is made of, since the Parser object will tell it.

The most primitive type of Decoder is a decoder which does not "decode" at all.



What does SimpleDecoder do?

The following figure shows what my SimpleDecoder does:

As you can see, it entirely skips the most interesting task of translating (decoding) a raw instruction into an assembly instruction. Instead, it simply converts the raw byte data into a hexadecimal representation in ASCII characters.

Obviously, SimpleDecoder is so simple that it adds almost no value at all, but it should be a good example for anybody who wants to play around with InstructionParser.
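A decoder of this kind boils down to a hex dump of the bytes of one instruction. The function below is only a sketch of that idea; the name to_hex_text and the space-separated uppercase output format are my own choices, not necessarily what SimpleDecoder prints.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert the raw bytes of one instruction to uppercase hex text,
// e.g. {0x6A, 0x00} becomes "6A 00". No real decoding takes place.
inline std::string to_hex_text(const std::uint8_t* bytes, std::size_t length) {
    static const char digits[] = "0123456789ABCDEF";
    std::string text;
    for (std::size_t i = 0; i < length; ++i) {
        if (i != 0) text += ' ';
        text += digits[bytes[i] >> 4];    // high nibble
        text += digits[bytes[i] & 0x0F];  // low nibble
    }
    return text;
}
```

The pointer and length are exactly what the Parser hands over for each instruction, so the result can be written to std::cout one line per instruction.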




Rendering provider

SimpleDecoder's job was to "decode" raw instructions, but it is not responsible for "rendering decoded information". This task is performed by an implementation of a "rendering service provider".

A rendering provider could be another fun project. It can range from simple console output to GUI rendering using icons, list views, or whatever.

For SimpleDecoder, I used std::cout as the rendering service provider. In other words, I just dumped the output into a DOS box.

Again, this is the simplest rendering service I could ever ask for.

Believe it or not, this is the end of this chapter.



Final words

More sophisticated implementation - Disassembler

This is where you can come in!

I haven't written any decoder that does more than SimpleDecoder does. Eventually, I would like to write one and share my experience here, but for now, this section is "under construction"!

Of course, I will be more than happy to work together, or just exchange ideas here!




CHAPTER 2 Let's build a compiler...

This sixteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction [1] and is Copyright (C) 1988 Jack W. Crenshaw.

You may ask: “This book should be about writing disassemblers, not compilers. What the heck are you doing here?”

The answer is:

Do you know what a compiler is? How it works? So let me give you a short introduction, and then you will see WHY I have included chapter 2.

If you code an application, you start by typing your code in your favorite IDE. Then you usually push a “compile” button, and after some seconds and some more magic you have a working application.

1. The original URL is: http://compilers.iecc.com/crenshaw/



But how does this magic work?

Well, first your source code will be checked for errors. This is called “Lexical Scanning” and “Parsing”. One part of the compiler (for languages like JavaScript you call this magician an “Interpreter”) scans our source for typos and for the correct “grammar”. If anything goes wrong, you will receive an error like “parsing error in line 546” or “if without end in line 276”.

If everything is OK, the compiler will translate your source code (human-readable) to assembly code (for example mov eax,0).

When this is finished, the new code (assembly code) will be translated to opcodes. The correct opcode on a 32-bit x86 machine for PUSH 0 is 6A 00, and for PUSH DWORD PTR DS:[402048] it is FF35 48204000.

As you can see, the assembly code is translated to hex values which correspond to our commands. The hex values are called “Opcodes” and the corresponding commands “Mnemonics”.

If you take a hex editor and open a file, this is exactly what you get! The application is finished with its compilation. Reading these hex values and running the program is not our problem. The computer does this with some magic we do not need now.

Maybe you can see now WHY compilers are related to disassemblers... No?

Ok, here we go:

We want to disassemble a file. Let's assume we do this manually. We open the file with a hex editor. Then we take the first hex value, look in our opcode/mnemonic table, and if we find it we write it down (like mov eax,0).

If we do not find the value in our table, we take the second hex value as well. Then we look up the combined value (the first and second bytes) in our opcode/mnemonic table. If it is not found, we add the third. So FF35 48204000 may be PUSH DWORD PTR DS:[402048]. Of course, the result depends on the processor and the opcode/mnemonic table we use. Remember: after 15 bytes we should have a result. If not, something is wrong, because the maximum length of an x86 instruction is 15 bytes!
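The manual procedure can be written down almost literally. The C++ sketch below uses a toy table that contains only the two encodings quoted above; the names Bytes, toy_table and lookup are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// A toy opcode/mnemonic table holding only the two encodings quoted in the text.
inline std::map<Bytes, std::string> toy_table() {
    std::map<Bytes, std::string> t;
    t[Bytes{0x6A, 0x00}] = "PUSH 0";
    t[Bytes{0xFF, 0x35, 0x48, 0x20, 0x40, 0x00}] = "PUSH DWORD PTR DS:[402048]";
    return t;
}

// Try the first byte, then the first two, and so on, up to 15 bytes.
// Returns the number of bytes matched (0 if nothing matched) and sets `mnemonic`.
inline std::size_t lookup(const std::map<Bytes, std::string>& table,
                          const Bytes& stream, std::size_t offset,
                          std::string& mnemonic) {
    const std::size_t kMaxInstructionLength = 15;
    Bytes candidate;
    for (std::size_t n = 1;
         n <= kMaxInstructionLength && offset + n <= stream.size(); ++n) {
        candidate.push_back(stream[offset + n - 1]);
        std::map<Bytes, std::string>::const_iterator it = table.find(candidate);
        if (it != table.end()) {
            mnemonic = it->second;
            return n;
        }
    }
    return 0;
}
```

A table keyed on complete byte sequences would be far too large in practice, since every address and immediate value would need its own entry. A real disassembler looks up only the opcode bytes and decodes the operand fields separately.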


Now you can see that a compiler and a disassembler are exactly the same!

Well, not really... Only in parts.

Imagine this: we add more functionality to our disassembler. After getting the correct opcodes and mnemonics, we add another magic function: translating the mnemonics to “source” code of any language.

Then we would have a reversed compiler. This is what we call a decompiler, and as you can see, the disassembler is one part of it.

Yep. Now you see: if you know how a compiler works, it is easy to understand a decompiler and a disassembler.

If you are not interested in diving further into compiler construction, you can skip this chapter, but I really recommend reading some of it.

So let's go; this will be a long and hard but interesting part of this book... See you after this chapter, with some more grey hairs...




Part 1 - Introduction

This series of articles is a tutorial on the theory and practice of developing language parsers and compilers. Before we are finished, we will have covered every aspect of compiler construction, designed a new programming language, and built a working compiler.

Though I am not a computer scientist by education (my Ph.D. is in a different field, Physics), I have been interested in compilers for many years. I have bought and tried to digest the contents of virtually every book on the subject ever written. I don't mind telling you that it was slow going. Compiler texts are written for Computer Science majors, and are tough sledding for the rest of us. But over the years a bit of it began to seep in. What really caused it to jell was when I began to branch off on my own and begin to try things on my own computer. Now I plan to share with you what I have learned. At the end of this series you will by no means be a computer scientist, nor will you know all the esoterics of compiler theory. I intend to completely ignore the more theoretical aspects of the subject. What you _WILL_ know is all the practical aspects that one needs to know to build a working system.

This is a "learn-by-doing" series. In the course of the series I will be performing experiments on a computer. You will be expected to follow along, repeating the experiments that I do, and performing some on your own. I will be using Turbo Pascal 4.0 on a PC clone. I will periodically insert examples written in TP. These will be executable code, which you will be expected to copy into your own computer and run. If you don't have a copy of Turbo, you will be severely limited in how well you will be able to follow what's going on. If you don't have a copy, I urge you to get one. After all, it's an excellent product, good for many other uses!

Some articles on compilers show you examples, or show you (as in the case of Small-C) a finished product, which you can then copy and use without a whole lot of understanding of how it works. I hope to do much more than that. I hope to teach you HOW the things get done, so that you can go off on your own and not only reproduce what I have done, but improve on it.



This is admittedly an ambitious undertaking, and it won't be done in one page. I expect to do it in the course of a number of articles. Each article will cover a single aspect of compiler theory, and will pretty much stand alone. If all you're interested in at a given time is one aspect, then you need to look only at that one article. Each article will be uploaded as it is complete, so you will have to wait for the last one before you can consider yourself finished. Please be patient.

The average text on compiler theory covers a lot of ground that we won't be covering here. The typical sequence is:

o An introductory chapter describing what a compiler is.

o A chapter or two on syntax equations, using Backus-Naur Form (BNF).

o A chapter or two on lexical scanning, with emphasis on deterministic and non-deterministic finite automata.

o Several chapters on parsing theory, beginning with top-down recursive descent, and ending with LALR parsers.

o A chapter on intermediate languages, with emphasis on P-code and similar reverse polish representations.

o Many chapters on alternative ways to handle subroutines and parameter passing, type declarations, and such.

o A chapter toward the end on code generation, usually for some imaginary CPU with a simple instruction set. Most readers (and in fact, most college classes) never make it this far.

o A final chapter or two on optimization. This chapter often goes unread, too.




I'll be taking a much different approach in this series. To begin with, I won't dwell long on options. I'll be giving you _A_ way that works. If you want to explore options, well and good ... I encourage you to do so ... but I'll be sticking to what I know. I also will skip over most of the theory that puts people to sleep. Don't get me wrong: I don't belittle the theory, and it's vitally important when it comes to dealing with the more tricky parts of a given language. But I believe in putting first things first. Here we'll be dealing with the 95% of compiler techniques that don't need a lot of theory to handle.

I also will discuss only one approach to parsing: top-down, recursive descent parsing, which is the _ONLY_ technique that's at all amenable to hand-crafting a compiler. The other approaches are only useful if you have a tool like YACC, and also don't care how much memory space the final product uses.

I also take a page from the work of Ron Cain, the author of the original Small C. Whereas almost all other compiler authors have historically used an intermediate language like P-code and divided the compiler into two parts (a front end that produces P-code, and a back end that processes P-code to produce executable object code), Ron showed us that it is a straightforward matter to make a compiler directly produce executable object code, in the form of assembler language statements. The code will _NOT_ be the world's tightest code ... producing optimized code is a much more difficult job. But it will work, and work reasonably well. Just so that I don't leave you with the impression that our end product will be worthless, I _DO_ intend to show you how to "soup up" the compiler with some optimization.

Finally, I'll be using some tricks that I've found to be most helpful in letting me understand what's going on without wading through a lot of boiler plate. Chief among these is the use of single-character tokens, with no embedded spaces, for the early design work. I figure that if I can get a parser to recognize and deal with I-T-L, I can get it to do the same with IF-THEN-ELSE. And I can. In the second "lesson," I'll show you just how easy it is to extend a simple parser to handle tokens of arbitrary length. As another trick, I completely ignore file I/O, figuring that if I can read source from the keyboard and output object to the screen, I can also do it from/to disk files. Experience has proven that once a translator is working correctly, it's a straightforward matter to redirect the I/O to files. The last trick is that I make no attempt to do error correction/recovery. The programs we'll be building will RECOGNIZE errors, and will not CRASH, but they will simply stop on the first error ... just like good ol' Turbo does. There will be other tricks that you'll see as you go. Most of them can't be found in any compiler textbook, but they work.



A word about style and efficiency. As you will see, I tend to write programs in _VERY_ small, easily understood pieces. None of the procedures we'll be working with will be more than about 15-20 lines long. I'm a fervent devotee of the KISS (Keep It Simple, Sidney) school of software development. I try to never do something tricky or complex, when something simple will do. Inefficient? Perhaps, but you'll like the results. As Brian Kernighan has said, FIRST make it run, THEN make it run fast. If, later on, you want to go back and tighten up the code in one of our products, you'll be able to do so, since the code will be quite understandable. If you do so, however, I urge you to wait until the program is doing everything you want it to.

I also have a tendency to delay building a module until I discover that I need it. Trying to anticipate every possible future contingency can drive you crazy, and you'll generally guess wrong anyway. In this modern day of screen editors and fast compilers, I don't hesitate to change a module when I feel I need a more powerful one. Until then, I'll write only what I need.

One final caveat: One of the principles we'll be sticking to here is that we don't fool around with P-code or imaginary CPUs, but that we will start out on day one producing working, executable object code, at least in the form of assembler language source. However, you may not like my choice of assembler language ... it's 68000 code, which is what works on my system (under SK*DOS). I think you'll find, though, that the translation to any other CPU such as the 80x86 will be quite obvious, though, so I don't see a problem here. In fact, I hope someone out there who knows the '86 language better than I do will offer us the equivalent object code fragments as we need them.




THE CRADLE

Every program needs some boiler plate ... I/O routines, error message routines, etc. The programs we develop here will be no exceptions. I've tried to hold this stuff to an absolute minimum, however, so that we can concentrate on the important stuff without losing it among the trees. The code given below represents about the minimum that we need to get anything done. It consists of some I/O routines, an error-handling routine and a skeleton, null main program. I call it our cradle. As we develop other routines, we'll add them to the cradle, and add the calls to them as we need to. Make a copy of the cradle and save it, because we'll be using it more than once.

There are many different ways to organize the scanning activities of a parser. In Unix systems, authors tend to use getc and ungetc. I've had very good luck with the approach shown here, which is to use a single, global, lookahead character. Part of the initialization procedure (the only part, so far!) serves to "prime the pump" by reading the first character from the input stream. No other special techniques are required with Turbo 4.0 ... each successive call to GetChar will read the next character in the stream.



{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
end.
{--------------------------------------------------------------}

That's it for this introduction. Copy the code above into TP and compile it. Make sure that it compiles and runs correctly. Then proceed to the first lesson, which is on expression parsing.




Part 2 - Expression Parsing

GETTING STARTED

If you've read the introduction document to this series, you will already know what we're about. You will also have copied the cradle software into your Turbo Pascal system, and have compiled it. So you should be ready to go.

The purpose of this article is for us to learn how to parse and translate mathematical expressions. What we would like to see as output is a series of assembler-language statements that perform the desired actions. For purposes of definition, an expression is the right-hand side of an equation, as in

x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_ small steps. That's so that the beginners among you won't get totally lost. There are also some very good lessons to be learned early on, that will serve us well later. For the more experienced readers: bear with me. We'll get rolling soon enough.



SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?), let's start with the absolutely most simple case we can think of. That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have a baseline copy of the "cradle" that I gave last time. We'll be using it again for other experiments. Then add this code:

{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}

And add the line "Expression;" to the main program so that it reads:

{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}




Now run the program. Try any single-digit number as input. You should get a single line of assembler-language output. Now try any other character as input, and you'll see that the parser properly reports an error.

CONGRATULATIONS! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush it off too lightly. This little "compiler" does, on a very limited scale, exactly what any larger compiler does: it correctly recognizes legal statements in the input "language" that we have defined for it, and it produces correct, executable assembler code, suitable for assembling into object format. Just as importantly, it correctly recognizes statements that are NOT legal, and gives a meaningful error message. Who could ask for more? As we expand our parser, we'd better make sure those two characteristics always hold true.

There are some other features of this tiny program worth mentioning. First, you can see that we don't separate code generation from parsing ... as soon as the parser knows what we want done, it generates the object code directly. In a real compiler, of course, the reads in GetChar would be from a disk file, and the writes to another disk file, but this way is much easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere. I've chosen the 68000 register D0. I could have made some other choices, but this one makes sense.
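For readers who want to check the idea outside Turbo Pascal, here is a minimal Python sketch of the same one-digit translator. The function name `expression` and the choice to return the output lines as a list (rather than write them to the screen) are assumptions made for illustration; they are not part of the cradle.

```python
def expression(source):
    """Translate a one-digit expression into a single 68000 instruction."""
    look = source[:1]  # lookahead character, playing the role of Look
    if not ("0" <= look <= "9"):
        # Mirrors what Expected('Integer') reports in the Pascal version
        raise SyntaxError("Integer Expected")
    return ["MOVE #" + look + ",D0"]


print(expression("7"))
```

Any input that does not start with a digit raises `SyntaxError`, which stands in for the cradle's error routine.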



BINARY EXPRESSIONS

Now that we have that under our belt, let's branch out a bit. Admittedly, an "expression" consisting of only one character is not going to meet our needs for long, so let's see what we can do to extend it. Suppose we want to handle expressions of the form:

1+2

or 4-3

or, in general, <term> +/- <term>

To do this we need a procedure that recognizes a term and leaves its result somewhere, and another that recognizes and distinguishes between a '+' and a '-' and generates the appropriate code. But if Expression is going to leave its result in D0, where should Term leave its result? Answer: the same place. We're going to have to save the first result of Term somewhere before we get the next one.




OK, basically what we want to do is have procedure Term do what Expression was doing before. So just RENAME procedure Expression as Term, and enter the following new version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   EmitLn('MOVE D0,D1');
   case Look of
    '+': Add;
    '-': Subtract;
   else Expected('Addop');
   end;
end;
{--------------------------------------------------------------}



Next, just above Expression enter these two procedures:

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}




When you're finished with that, the order of the routines should be:

o Term (The OLD Expression)

o Add

o Subtract

o Expression

Now run the program. Try any combination you can think of, of two single digits separated by a '+' or a '-'. You should get a series of four assembler-language instructions out of each run. Now try some expressions with deliberate errors in them. Does the parser catch the errors?

Take a look at the object code generated. There are two observations we can make. First, the code generated is NOT what we would write ourselves. The sequence

MOVE #n,D0

MOVE D0,D1

is inefficient. If we were writing this code by hand, we would probably just load the data directly to D1.

There is a message here: code generated by our parser is less efficient than the code we would write by hand. Get used to it. That's going to be true throughout this series. It's true of all compilers to some extent. Computer scientists have devoted whole lifetimes to the issue of code optimization, and there are indeed things that can be done to improve the quality of code output. Some compilers do quite well, but there is a heavy price to pay in complexity, and it's a losing battle anyway ... there will probably never come a time when a good assembler-language programmer can't out-program a compiler. Before this session is over, I'll briefly mention some ways that we can do a little optimization, just to show you that we can indeed improve things without too much trouble. But remember, we're here to learn, not to see how tight we can make the object code. For now, and really throughout this series of articles, we'll studiously ignore optimization and concentrate on getting out code that works.



Speaking of which: ours DOESN'T! The code is _WRONG_! As things are working now, the subtraction process subtracts D1 (which has the FIRST argument in it) from D0 (which has the second). That's the wrong way, so we end up with the wrong sign for the result. So let's fix up procedure Subtract with a sign-changer, so that it reads

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}

Now our code is even less efficient, but at least it gives the right answer! Unfortunately, the rules that give the meaning of math expressions require that the terms in an expression come out in an inconvenient order for us. Again, this is just one of those facts of life you learn to live with. This one will come back to haunt us when we get to division.
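To see the sign problem in numbers, the register traffic for an input such as 5-3 can be mimicked in a few lines of Python. This is only a hand simulation under the assumption that SUB D1,D0 computes D0 := D0 - D1; the function name `subtract` and its `with_neg` flag are made up for this illustration.

```python
def subtract(first, second, with_neg):
    """Mimic the instructions emitted for 'first - second'."""
    d0 = first        # MOVE #first,D0
    d1 = d0           # MOVE D0,D1
    d0 = second       # MOVE #second,D0
    d0 = d0 - d1      # SUB D1,D0 : destination D0 := D0 - D1
    if with_neg:
        d0 = -d0      # NEG D0
    return d0


print(subtract(5, 3, with_neg=False))  # second - first: wrong sign
print(subtract(5, 3, with_neg=True))   # first - second: what we wanted
```

Without the NEG the result is second minus first; with it, the sign comes out right.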

OK, at this point we have a parser that can recognize the sum or difference of two digits. Earlier, we could only recognize a single digit. But real expressions can have either form (or an infinity of others). For kicks, go back and run the program with the single input line '1'.

Didn't work, did it? And why should it? We just finished telling our parser that the only kinds of expressions that are legal are those with two terms. We must rewrite procedure Expression to be a lot more broadminded, and this is where things start to take the shape of a real parser.




GENERAL EXPRESSIONS

In the REAL world, an expression can consist of one or more terms, separated by "addops" ('+' or '-'). In BNF, this is written

<expression> ::= <term> [<addop> <term>]*

We can accommodate this definition of an expression with the addition of a simple loop to procedure Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}



NOW we're getting somewhere! This version handles any number of terms, and it only cost us two extra lines of code. As we go on, you'll discover that this is characteristic of top-down parsers ... it only takes a few lines of code to accommodate extensions to the language. That's what makes our incremental approach possible. Notice, too, how well the code of procedure Expression matches the BNF definition. That, too, is characteristic of the method. As you get proficient in the approach, you'll find that you can turn BNF into parser code just about as fast as you can type!

OK, compile the new version of our parser, and give it a try. As usual, verify that the "compiler" can handle any legal expression, and will give a meaningful error message for an illegal one. Neat, eh? You might note that in our test version, any error message comes out sort of buried in whatever code had already been generated. But remember, that's just because we are using the CRT as our "output file" for this series of experiments. In a production version, the two outputs would be separated ... one to the output file, and one to the screen.
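The loop is small enough to restate in Python as a cross-check. The sketch below is an assumption-laden stand-in, not the Pascal program: the class `Parser`, its method names, and the helper `translate` are invented here, and output is collected in a list instead of being written to the CRT.

```python
class Parser:
    """Python stand-in for the cradle plus Term and Expression."""

    def __init__(self, source):
        self.src, self.pos, self.out = source, 0, []

    @property
    def look(self):
        # Current lookahead character, or "" at end of input
        return self.src[self.pos:self.pos + 1]

    def get_num(self):
        if not ("0" <= self.look <= "9"):
            raise SyntaxError("Integer Expected")
        digit = self.look
        self.pos += 1
        return digit

    def term(self):
        self.out.append("MOVE #" + self.get_num() + ",D0")

    def expression(self):
        # <expression> ::= <term> [<addop> <term>]*
        self.term()
        while self.look in ("+", "-"):
            op = self.look
            self.out.append("MOVE D0,D1")
            self.pos += 1                  # consume the addop (Match in Pascal)
            self.term()
            if op == "+":
                self.out.append("ADD D1,D0")
            else:
                self.out.extend(["SUB D1,D0", "NEG D0"])


def translate(source):
    parser = Parser(source)
    parser.expression()
    return parser.out


for line in translate("1+2-3"):
    print(line)
```

A single digit now yields a single MOVE, and each further addop adds one more pass through the loop.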




USING THE STACK

At this point I'm going to violate my rule that we don't introduce any complexity until it's absolutely necessary, long enough to point out a problem with the code we're generating. As things stand now, the parser uses D0 for the "primary" register, and D1 as a place to store the partial sum. That works fine for now, because as long as we deal with only the "addops" '+' and '-', any new term can be added in as soon as it is found. But in general that isn't true. Consider, for example, the expression

1+(2-(3+(4-5)))

If we put the '1' in D1, where do we put the '2'? Since a general expression can have any degree of complexity, we're going to run out of registers fast!

Fortunately, there's a simple solution. Like every modern microprocessor, the 68000 has a stack, which is the perfect place to save a variable number of items. So instead of moving the term in D0 to D1, let's just push it onto the stack. For the benefit of those unfamiliar with 68000 assembler language, a push is written

-(SP)

and a pop, (SP)+.

So let's change the EmitLn in Expression to read:

EmitLn('MOVE D0,-(SP)');

and the two lines in Add and Subtract to

EmitLn('ADD (SP)+,D0')

and EmitLn('SUB (SP)+,D0'),

respectively. Now try the parser again and make sure we haven't broken it. Once again, the generated code is less efficient than before, but it's a necessary step, as you'll see.
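If you have no 68000 at hand, a toy interpreter is enough to convince yourself that the stack version still computes the right values. The Python sketch below is hypothetical: `run` understands only the five instruction spellings used so far, and it ignores word size and condition codes entirely.

```python
def run(program):
    """Interpret the handful of 68000 instructions the parser emits."""
    d0, stack = 0, []
    for line in program:
        if line.startswith("MOVE #"):
            d0 = int(line[len("MOVE #"):].split(",")[0])  # MOVE #n,D0
        elif line == "MOVE D0,-(SP)":
            stack.append(d0)                # push D0
        elif line == "ADD (SP)+,D0":
            d0 = d0 + stack.pop()           # pop and add
        elif line == "SUB (SP)+,D0":
            d0 = d0 - stack.pop()           # D0 := D0 - popped value
        elif line == "NEG D0":
            d0 = -d0
        else:
            raise ValueError("unsupported instruction: " + line)
    return d0


# Code the stack-based parser produces for the input 5-3
program = [
    "MOVE #5,D0",
    "MOVE D0,-(SP)",
    "MOVE #3,D0",
    "SUB (SP)+,D0",
    "NEG D0",
]
print(run(program))
```

Feeding it the output for 5-3 gives 2, so the push/pop pair really does the job that D1 did before.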



MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business. As you all know, there are other math operators than "addops" ... expressions can also have multiply and divide operations. You also know that there is an implied operator PRECEDENCE, or hierarchy, associated with expressions, so that in an expression like

2 + 3 * 4,

we know that we're supposed to multiply FIRST, then add. (See why we needed the stack?)

In the early days of compiler technology, people used some rather complex techniques to ensure that the operator precedence rules were obeyed. It turns out, though, that none of this is necessary ... the rules can be accommodated quite nicely by our top-down parsing technique. Up till now, the only form that we've considered for a term is that of a single decimal digit.

More generally, we can define a term as a PRODUCT of FACTORS; i.e.,

<term> ::= <factor> [ <mulop> <factor> ]*

What is a factor? For now, it's what a term used to be ... a single digit.

Notice the symmetry: a term has the same form as an expression. As a matter of fact, we can add to our parser with a little judicious copying and renaming. But to avoid confusion, the listing below is the complete set of parsing routines. (Note the way we handle the reversal of operands in Divide.)




{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Factor;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;



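{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');   { pop the first factor (the dividend) }
   EmitLn('EXT.L D1');        { DIVS wants a 32-bit dividend }
   EmitLn('DIVS D0,D1');      { D1 := first factor / second factor }
   EmitLn('MOVE D1,D0');      { leave the quotient in D0 }
end;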




{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      else Expected('Mulop');
      end;
   end;
end;



{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;




{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}

Hot dog! A NEARLY functional parser/translator, in only 55 lines of Pascal! The output is starting to look really useful, if you continue to overlook the inefficiency, which I hope you will. Remember, we're not trying to produce tight code here.
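To watch precedence fall out of the grammar, here is a cut-down Python sketch of the Factor/Term/Expression layering. It is an illustration under assumptions: only '+' and '*' are handled (subtraction and division are left out to keep it short), and the names `Parser` and `translate` are invented for this example.

```python
class Parser:
    """Two-level recursive descent: Expression calls Term, Term calls Factor."""

    def __init__(self, source):
        self.src, self.pos, self.out = source, 0, []

    @property
    def look(self):
        return self.src[self.pos:self.pos + 1]   # "" at end of input

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError("'" + ch + "' Expected")
        self.pos += 1

    def factor(self):
        if not ("0" <= self.look <= "9"):
            raise SyntaxError("Integer Expected")
        self.out.append("MOVE #" + self.look + ",D0")
        self.pos += 1

    def term(self):
        # <term> ::= <factor> [ '*' <factor> ]*
        self.factor()
        while self.look == "*":
            self.out.append("MOVE D0,-(SP)")
            self.match("*")
            self.factor()
            self.out.append("MULS (SP)+,D0")

    def expression(self):
        # <expression> ::= <term> [ '+' <term> ]*
        self.term()
        while self.look == "+":
            self.out.append("MOVE D0,-(SP)")
            self.match("+")
            self.term()
            self.out.append("ADD (SP)+,D0")


def translate(source):
    parser = Parser(source)
    parser.expression()
    return parser.out


for line in translate("2+3*4"):
    print(line)
```

In the output for 2+3*4 the MULS appears before the ADD: the '2' waits on the stack while the whole term 3*4 is computed, which is exactly the precedence rule.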



PARENTHESES

We can wrap up this part of the parser with the addition of parentheses with math expressions. As you know, parentheses are a mechanism to force a desired operator precedence. So, for example, in the expression

2*(3+4) ,

the parentheses force the addition before the multiply. Much more importantly, though, parentheses give us a mechanism for defining expressions of any degree of complexity, as in

(1+2)/((3+4)+(5-6))

The key to incorporating parentheses into our parser is to realize that no matter how complicated an expression enclosed by parentheses may be, to the rest of the world it looks like a simple factor. That is, one of the forms for a factor is:

<factor> ::= (<expression>)

This is where the recursion comes in. An expression can contain a factor which contains another expression which contains a factor, etc., ad infinitum.




Complicated or not, we can take care of this by adding just a few lines of Pascal to procedure Factor:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}

Note again how easily we can extend the parser, and how well the Pascal code matches the BNF syntax.

As usual, compile the new version and make sure that it correctly parses legal sentences, and flags illegal ones with an error message.
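The recursion is easy to try out in Python as well. The sketch below is not the Pascal program: it handles only '+', '*' and parentheses, and the names `Parser` and `translate` are assumptions made for this example. The point is the `factor` method, which calls back into `expression` when it sees a left paren.

```python
class Parser:
    """Illustrative stand-in for Factor, Term and Expression."""

    def __init__(self, source):
        self.src, self.pos, self.out = source, 0, []

    @property
    def look(self):
        return self.src[self.pos:self.pos + 1]   # "" at end of input

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError("'" + ch + "' Expected")
        self.pos += 1

    def factor(self):
        # <factor> ::= <digit> | (<expression>)
        if self.look == "(":
            self.match("(")
            self.expression()     # a whole expression acts as one factor
            self.match(")")
        elif "0" <= self.look <= "9":
            self.out.append("MOVE #" + self.look + ",D0")
            self.pos += 1
        else:
            raise SyntaxError("Integer Expected")

    def term(self):
        self.factor()
        while self.look == "*":
            self.out.append("MOVE D0,-(SP)")
            self.match("*")
            self.factor()
            self.out.append("MULS (SP)+,D0")

    def expression(self):
        self.term()
        while self.look == "+":
            self.out.append("MOVE D0,-(SP)")
            self.match("+")
            self.term()
            self.out.append("ADD (SP)+,D0")


def translate(source):
    parser = Parser(source)
    parser.expression()
    return parser.out


for line in translate("2*(3+4)"):
    print(line)
```

For 2*(3+4) the ADD is emitted before the MULS, and an unbalanced input such as "(1" is rejected by the final `match(")")`.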



UNARY MINUS

At this point, we have a parser that can handle just about any expression, right? OK, try this input sentence:

-1

WOOPS! It doesn't work, does it? Procedure Expression expects everything to start with an integer, so it coughs up the leading minus sign. You'll find that +3 won't work either, nor will something like

-(3-2) .




There are a couple of ways to fix the problem. The easiest (although not necessarily the best) way is to stick an imaginary leading zero in front of expressions of this type, so that -3 becomes 0-3. We can easily patch this into our existing version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}



I TOLD you that making changes was easy! This time it cost us only three new lines of Pascal. Note the new reference to function IsAddop. Since the test for an addop appeared twice, I chose to embed it in the new function. The form of IsAddop should be apparent from that for IsAlpha. Here it is:

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}

OK, make these changes to the program and recompile. You should also include IsAddop in your baseline copy of the cradle. We'll be needing it again later. Now try the input -1 again. Wow! The efficiency of the code is pretty poor ... five lines of code just for loading a simple constant ... but at least it's correct. Remember, we're not trying to replace Turbo Pascal here.
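The leading-zero trick can be sketched in Python to see exactly what comes out for -1. This is a simplified illustration, assuming single-digit terms joined only by '+' and '-'; the function name `translate` and its inner helpers are made up for the example.

```python
def translate(source):
    """Single-digit terms joined by '+' or '-', with an optional leading sign."""
    out, pos = [], 0

    def look():
        return source[pos:pos + 1]       # "" at end of input

    def term():
        nonlocal pos
        if not ("0" <= look() <= "9"):
            raise SyntaxError("Integer Expected")
        out.append("MOVE #" + look() + ",D0")
        pos += 1

    if look() in ("+", "-"):
        out.append("CLR D0")             # the imaginary leading zero
    else:
        term()
    while look() in ("+", "-"):
        op = look()
        out.append("MOVE D0,-(SP)")
        pos += 1                         # consume the addop
        term()
        out.append("ADD (SP)+,D0" if op == "+" else "SUB (SP)+,D0")
        if op == "-":
            out.append("NEG D0")
    return out


for line in translate("-1"):
    print(line)
```

The first branch is the whole fix: when the input opens with an addop, a zero is loaded in place of the missing first term and the ordinary loop does the rest.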

At this point we're just about finished with the structure of our expression parser. This version of the program should correctly parse and compile just about any expression you care to throw at it. It's still limited in that we can only handle factors involving single decimal digits. But I hope that by now you're starting to get the message that we can accommodate further extensions with just some minor changes to the parser. You probably won't be surprised to hear that a variable or even a function call is just another kind of a factor.

In the next session, I'll show you just how easy it is to extend our parser to take care of these things too, and I'll also show you just how easily we can accommodate multicharacter numbers and variable names. So you see, we're not far at all from a truly useful parser.




A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints as to how we can improve the quality of the generated code. As I said, the production of tight code is not the main purpose of this series of articles. But you need to at least know that we aren't just wasting our time here ... that we can indeed modify the parser further to make it produce better code, without throwing away everything we've done to date. As usual, it turns out that SOME optimization is not that difficult to do ... it simply takes some extra code in the parser.

There are two basic approaches we can take:

o Try to fix up the code after it's generated

This is the concept of "peephole" optimization. The general idea is that we know what combinations of instructions the compiler is going to generate, and we also know which ones are pretty bad (such as the code for -1, above). So all we do is to scan the produced code, looking for those combinations, and replacing them by better ones. It's sort of a macro expansion, in reverse, and a fairly straightforward exercise in pattern-matching. The only complication, really, is that there may be a LOT of such combinations to look for. It's called peephole optimization simply because it only looks at a small group of instructions at a time. Peephole optimization can have a dramatic effect on the quality of the code, with little change to the structure of the compiler itself. There is a price to pay, though, in the speed, size, and complexity of the compiler. Looking for all those combinations calls for a lot of IF tests, each one of which is a source of error. And, of course, it takes time. In the classical implementation of a peephole optimizer, it's done as a second pass to the compiler. The output code is written to disk, and then the optimizer reads and processes the disk file again. As a matter of fact, you can see that the optimizer could even be a separate PROGRAM from the compiler proper. Since the optimizer only looks at the code through a small "window" of instructions (hence the name), a better implementation would be to simply buffer up a few lines of output, and scan the buffer after each EmitLn.



o Try to generate better code in the first place

This approach calls for us to look for special cases BEFORE we Emit them. As a trivial example, we should be able to identify a constant zero, and Emit a CLR instead of a load, or even do nothing at all, as in an add of zero, for example. Closer to home, if we had chosen to recognize the unary minus in Factor instead of in Expression, we could treat constants like -1 as ordinary constants, rather than generating them from positive ones. None of these things are difficult to deal with ... they only add extra tests in the code, which is why I haven't included them in our program. The way I see it, once we get to the point that we have a working compiler, generating useful code that executes, we can always go back and tweak the thing to tighten up the code produced. That's why there are Release 2.0's in the world.
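To make the first approach concrete, here is a minimal Python sketch of a peephole pass with a single rule: the five-instruction sequence generated for a negated constant is collapsed into one load. The function name `peephole` and the rule itself are assumptions chosen for illustration; a real optimizer would carry many such patterns, and this one ignores condition codes.

```python
import re


def peephole(code):
    """Scan emitted lines through a five-instruction window."""
    out, i = [], 0
    while i < len(code):
        window = code[i:i + 5]
        load = None
        if len(window) == 5:
            load = re.fullmatch(r"MOVE #(\d+),D0", window[2])
        if (load
                and window[0] == "CLR D0"
                and window[1] == "MOVE D0,-(SP)"
                and window[3] == "SUB (SP)+,D0"
                and window[4] == "NEG D0"):
            # 0 - n computed the long way: replace by a direct load of -n
            out.append("MOVE #-" + load.group(1) + ",D0")
            i += 5
        else:
            out.append(code[i])
            i += 1
    return out


code = ["CLR D0", "MOVE D0,-(SP)", "MOVE #1,D0", "SUB (SP)+,D0", "NEG D0"]
print(peephole(code))
```

Anything that does not match the pattern is copied through unchanged, so the pass can be bolted onto the output without touching the parser.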

There IS one more type of optimization worth mentioning, that seems to promise pretty tight code without too much hassle. It's my "invention" in the sense that I haven't seen it suggested in print anywhere, though I have no illusions that it's original with me.

This is to avoid such a heavy use of the stack, by making better use of the CPU registers. Remember back when we were doing only addition and subtraction, that we used registers D0 and D1, rather than the stack? It worked, because with only those two operations, the "stack" never needs more than two entries.

Well, the 68000 has eight data registers. Why not use them as a privately managed stack? The key is to recognize that, at any point in its processing, the parser KNOWS how many items are on the stack, so it can indeed manage it properly. We can define a private "stack pointer" that keeps track of which stack level we're at, and addresses the corresponding register. Procedure Factor, for example, would not cause data to be loaded into register D0, but into whatever the current "top-of-stack" register happened to be.

What we're doing in effect is to replace the CPU's RAM stack with a locally managed stack made up of registers. For most expressions, the stack level will never exceed eight, so we'll get pretty good code out. Of course, we also have to deal with those odd cases where the stack level DOES exceed eight, but that's no problem either. We simply let the stack spill over into the CPU stack. For levels beyond eight, the code is no worse than what we're generating now, and for levels less than eight, it's considerably better.




For the record, I have implemented this concept, just to make sure it works before I mentioned it to you. It does. In practice, it turns out that you can't really use all eight levels ... you need at least one register free to reverse the operand order for division (sure wish the 68000 had an XTHL, like the 8080!). For expressions that include function calls, we would also need a register reserved for them. Still, there is a nice improvement in code size for most expressions.
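The following Python sketch shows one way the register-stack idea could look. It is emphatically not the implementation mentioned above, just a guess at the technique under simplifying assumptions: only '+', '*' and parentheses are handled, spilling to the CPU stack is left out (the sketch simply raises an error), and the names `RegParser`, `depth` and `translate` are invented here.

```python
class RegParser:
    """Keeps operands in D0..D7, using 'depth' as a private stack pointer."""

    def __init__(self, source, registers=8):
        self.src, self.pos, self.out = source, 0, []
        self.depth = 0                    # number of registers in use
        self.registers = registers

    @property
    def look(self):
        return self.src[self.pos:self.pos + 1]   # "" at end of input

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError("'" + ch + "' Expected")
        self.pos += 1

    def factor(self):
        if self.look == "(":
            self.match("(")
            self.expression()
            self.match(")")
            return
        if not ("0" <= self.look <= "9"):
            raise SyntaxError("Integer Expected")
        if self.depth == self.registers:
            raise OverflowError("register stack full; spilling not sketched")
        # Load into the current top-of-stack register, not always D0
        self.out.append("MOVE #%s,D%d" % (self.look, self.depth))
        self.depth += 1
        self.pos += 1

    def binary(self, op_char, mnemonic, operand):
        self.match(op_char)
        operand()
        # Combine the top two levels; the result stays in the lower register
        self.depth -= 1
        self.out.append("%s D%d,D%d" % (mnemonic, self.depth, self.depth - 1))

    def term(self):
        self.factor()
        while self.look == "*":
            self.binary("*", "MULS", self.factor)

    def expression(self):
        self.term()
        while self.look == "+":
            self.binary("+", "ADD", self.term)


def translate(source):
    parser = RegParser(source)
    parser.expression()
    return parser.out


for line in translate("2+3*4"):
    print(line)
```

For 2+3*4 this yields five instructions and no stack traffic at all, with the final result ending up in D0.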

So, you see, getting better code isn't that difficult, but it does add complexity to our translator ... complexity we can do without at this point. For that reason, I STRONGLY suggest that we continue to ignore efficiency issues for the rest of this series, secure in the knowledge that we can indeed improve the code quality without throwing away what we've done.

Next lesson, I'll show you how to deal with variable factors and function calls. I'll also show you just how easy it is to handle multicharacter tokens and embedded white space.


Part 3 - More Expressions


INTRODUCTION

In the last installment, we examined the techniques used to parse and translate a general math expression. We ended up with a simple parser that could handle arbitrarily complex expressions, with two restrictions:

o No variables were allowed, only numeric factors

o The numeric factors were limited to single digits

In this installment, we'll get rid of those restrictions. We'll also extend what we've done to include assignment statements and function calls. Remember, though, that the second restriction was mainly self-imposed ... a choice of convenience on our part, to make life easier and to let us concentrate on the fundamental concepts. As you'll see in a bit, it's an easy restriction to get rid of, so don't get too hung up about it. We'll use the trick when it serves us to do so, confident that we can discard it when we're ready to.




VARIABLES

Most expressions that we see in practice involve variables, such as

b * b + 4 * a * c

No parser is much good without being able to deal with them. Fortunately, it's also quite easy to do.

Remember that in our parser as it currently stands, there are two kinds of factors allowed: integer constants and expressions within parentheses. In BNF notation,

<factor> ::= <number> | (<expression>)

The '|' stands for "or", meaning of course that either form is a legal form for a factor. Remember, too, that we had no trouble knowing which was which ... the lookahead character is a left paren '(' in one case, and a digit in the other.

It probably won't come as too much of a surprise that a variable is just another kind of factor. So we extend the BNF above to read:

<factor> ::= <number> | (<expression>) | <variable>

Again, there is no ambiguity: if the lookahead character is a letter, we have a variable; if a digit, we have a number. Back when we translated the number, we just issued code to load the number, as immediate data, into D0. Now we do the same, only we load a variable.

A minor complication in the code generation arises from the fact that most 68000 operating systems, including the SK*DOS that I'm using, require the code to be written in "position-independent" form, which basically means that everything is PC-relative. The format for a load in this language is

MOVE X(PC),D0

where X is, of course, the variable name. Armed with that, let's modify the current version of Factor to read:



{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      EmitLn('MOVE ' + GetName + '(PC),D0')
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}

I've remarked before how easy it is to add extensions to the parser, because of the way it's structured. You can see that this still holds true here. This time it cost us all of two extra lines of code. Notice, too, how the if-else-else structure exactly parallels the BNF syntax equation.

OK, compile and test this new version of the parser. That didn't hurt too badly, did it?




FUNCTIONS

There is only one other common kind of factor supported by most languages: the function call. It's really too early for us to deal with functions well, because we haven't yet addressed the issue of parameter passing. What's more, a "real" language would include a mechanism to support more than one type, one of which should be a function type. We haven't gotten there yet, either. But I'd still like to deal with functions now for a couple of reasons. First, it lets us finally wrap up the parser in something very close to its final form, and second, it brings up a new issue which is very much worth talking about.

Up till now, we've been able to write what is called a "predictive parser." That means that at any point, we can know by looking at the current lookahead character exactly what to do next. That isn't the case when we add functions. Every language has some naming rules for what constitutes a legal identifier. For the present, ours is simply that it is one of the letters 'a'..'z'. The problem is that a variable name and a function name obey the same rules. So how can we tell which is which? One way is to require that they each be declared before they are used. Pascal takes that approach. The other is that we might require a function to be followed by a (possibly empty) parameter list. That's the rule used in C.

Since we don't yet have a mechanism for declaring types, let's use the C rule for now. Since we also don't have a mechanism to deal with parameters, we can only handle empty lists, so our function calls will have the form

x() .

Since we're not dealing with parameter lists yet, there is nothing to do but to call the function, so we need only to issue a BSR (call) instead of a MOVE.



Now that there are two possibilities for the "If IsAlpha" branch of the test in Factor, let's treat them in a separate procedure. Modify Factor to read:

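{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}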




and insert before it the new procedure

{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0')
end;
{---------------------------------------------------------------}



OK, compile and test this version. Does it parse all legal expressions? Does it correctly flag badly formed ones?

The important thing to notice is that even though we no longer have a predictive parser, there is little or no complication added with the recursive descent approach that we're using. At the point where Factor finds an identifier (letter), it doesn't know whether it's a variable name or a function name, nor does it really care. It simply passes it on to Ident and leaves it up to that procedure to figure it out. Ident, in turn, simply tucks away the identifier and then reads one more character to decide which kind of identifier it's dealing with.

Keep this approach in mind. It's a very powerful concept, and it should be used whenever you encounter an ambiguous situation requiring further lookahead. Even if you had to look several tokens ahead, the principle would still work.
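The "tuck the name away, then peek" idea fits in a dozen lines of Python. The sketch below is only an illustration of the decision Ident makes; the function name `ident` and the convention of returning the emitted line as a string are assumptions, not part of the Pascal program.

```python
def ident(source):
    """Decide between a variable load and a function call."""
    name = source[:1]
    if not (len(name) == 1 and name.isascii() and name.isalpha()):
        raise SyntaxError("Name Expected")
    name = name.upper()                  # GetName upcases the identifier
    if source[1:2] == "(":               # one more character of lookahead
        if source[2:3] != ")":
            raise SyntaxError("')' Expected")
        return "BSR " + name             # function call with empty list
    return "MOVE " + name + "(PC),D0"    # plain variable load


print(ident("x"))
print(ident("x()"))
```

The identifier is read before the parser knows what it is; only the character that follows settles the question.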




MORE ON ERROR HANDLING

As long as we're talking philosophy, there's another important issue to point out: error handling. Notice that although the parser correctly rejects (almost) every malformed expression we can throw at it, with a meaningful error message, we haven't really had to do much work to make that happen. In fact, in the whole parser per se (from Ident through Expression) there are only two calls to the error routine, Expected. Even those aren't necessary ... if you'll look again in Term and Expression, you'll see that those statements can't be reached. I put them in early on as a bit of insurance, but they're no longer needed. Why don't you delete them now?

So how did we get this nice error handling virtually for free? It's simply that I've carefully avoided reading a character directly using GetChar. Instead, I've relied on the error handling in GetName, GetNum, and Match to do all the error checking for me. Astute readers will notice that some of the calls to Match (for example, the ones in Add and Subtract) are also unnecessary ... we already know what the character is by the time we get there ... but it maintains a certain symmetry to leave them in, and the general rule to always use Match instead of GetChar is a good one.

I mentioned an "almost" above. There is a case where our error handling leaves a bit to be desired. So far we haven't told our parser what an end-of-line looks like, or what to do with embedded white space. So a space character (or any other character not part of the recognized character set) simply causes the parser to terminate, ignoring the unrecognized characters.

It could be argued that this is reasonable behavior at this point. In a "real" compiler, there is usually another statement following the one we're working on, so any characters not treated as part of our expression will either be used for or rejected as part of the next one.

But it's also a very easy thing to fix up, even if it's only temporary. All we have to do is assert that the expression should end with an end-of-line, i.e., a carriage return.

To see what I'm talking about, try the input line

1+2 <space> 3+4



See how the space was treated as a terminator? Now, to make the compiler properly flag this, add the line

if Look <> CR then Expected('Newline');

in the main program, just after the call to Expression. That catches anything left over in the input stream. Don't forget to define CR in the const statement:

CR = ^M;

As usual, recompile the program and verify that it does what it's supposed to.




ASSIGNMENT STATEMENTS

OK, at this point we have a parser that works very nicely. I'd like to point out that we got it using only 88 lines of executable code, not counting what was in the cradle. The compiled object file is a whopping 4752 bytes. Not bad, considering we weren't trying very hard to save either source code or object size. We just stuck to the KISS principle.

Of course, parsing an expression is not much good without having something to do with it afterwards. Expressions USUALLY (but not always) appear in assignment statements, in the form

<Ident> = <Expression>

We're only a breath away from being able to parse an assignment statement, so let's take that last step. Just after procedure Expression, add the following new procedure:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{--------------------------------------------------------------}



Note again that the code exactly parallels the BNF. And notice further that the error checking was painless, handled by GetName and Match.

The reason for the two lines of assembler has to do with a peculiarity in the 68000, which requires this kind of construct for PC-relative code.

Now change the call to Expression, in the main program, to one to Assignment. That's all there is to it.

Son of a gun! We are actually compiling assignment statements. If those were the only kind of statements in a language, all we'd have to do is put this in a loop and we'd have a full-fledged compiler!

Well, of course they're not the only kind. There are also little items like control statements (IFs and loops), procedures, declarations, etc. But cheer up. The arithmetic expressions that we've been dealing with are among the most challenging in a language. Compared to what we've already done, control statements will be easy. I'll be covering them in the fifth installment. And the other statements will all fall in line, as long as we remember to KISS.




MULTI-CHARACTER TOKENS

Throughout this series, I've been carefully restricting everything we do to single-character tokens, all the while assuring you that it wouldn't be difficult to extend to multi-character ones. I don't know if you believed me or not ... I wouldn't really blame you if you were a bit skeptical. I'll continue to use that approach in the sessions which follow, because it helps keep complexity away. But I'd like to back up those assurances, and wrap up this portion of the parser, by showing you just how easy that extension really is. In the process, we'll also provide for embedded white space. Before you make the next few changes, though, save the current version of the parser away under another name. I have some more uses for it in the next installment, and we'll be working with the single-character version.

Most compilers separate out the handling of the input stream into a separate module called the lexical scanner. The idea is that the scanner deals with all the character-by-character input, and returns the separate units (tokens) of the stream. There may come a time when we'll want to do something like that, too, but for now there is no need. We can handle the multi-character tokens that we need by very slight and very local modifications to GetName and GetNum.
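To make the scanner idea concrete, here is a minimal sketch of what such a separate module might look like. It is not part of the cradle and not what we are about to build: the program name ScanDemo, the variables Src and Posn, and the function NextToken are all hypothetical, and the input comes from a string rather than the keyboard so that the example stands alone.

```pascal
program ScanDemo;
{ Hypothetical sketch of a stand-alone scanner. Src, Posn and
  NextToken are illustrative names, not part of the cradle. }

var Src: string;      { the line being scanned }
    Posn: integer;    { index of the next unread character }
    Tok: string;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{ Return the next token, or an empty string at end of input }
function NextToken: string;
var T: string;
begin
   T := '';
   { skip blanks between tokens }
   while (Posn <= Length(Src)) and (Src[Posn] = ' ') do
      Inc(Posn);
   if Posn <= Length(Src) then begin
      if IsAlpha(Src[Posn]) then begin
         { identifier: a letter followed by letters or digits }
         while (Posn <= Length(Src)) and
               (IsAlpha(Src[Posn]) or IsDigit(Src[Posn])) do begin
            T := T + UpCase(Src[Posn]);
            Inc(Posn);
         end;
      end
      else if IsDigit(Src[Posn]) then begin
         { number: a run of digits }
         while (Posn <= Length(Src)) and IsDigit(Src[Posn]) do begin
            T := T + Src[Posn];
            Inc(Posn);
         end;
      end
      else begin
         { anything else is treated as a one-character operator }
         T := Src[Posn];
         Inc(Posn);
      end;
   end;
   NextToken := T;
end;

begin
   Src := 'total = count*12 + x1';
   Posn := 1;
   Tok := NextToken;
   while Tok <> '' do begin
      WriteLn('<', Tok, '>');
      Tok := NextToken;
   end;
end.
```

The parser would then ask for whole tokens instead of single characters. For the sample line the loop prints seven tokens, one per line: TOTAL, =, COUNT, *, 12, + and X1, each in angle brackets.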

The usual definition of an identifier is that the first character must be a letter, but the rest can be alphanumeric (letters or numbers). To deal with this, we need one other recognizer function:

{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}



Add this function to your parser. I put mine just after IsDigit. While you're at it, might as well include it as a permanent member of Cradle, too.

Now, we need to modify function GetName to return a string instead of a character:

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
end;
{--------------------------------------------------------------}




Similarly, modify GetNum to read:

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}

Amazingly enough, that is virtually all the changes required to the parser! The local variable Name in procedures Ident and Assignment was originally declared as "char", and must now be declared string[8]. (Clearly, we could make the string length longer if we chose, but most assemblers limit the length anyhow.) Make this change, and then recompile and test. _NOW_ do you believe that it's a simple change?



WHITE SPACE

Before we leave this parser for awhile, let's address the issue of white space. As it stands now, the parser will barf (or simply terminate) on a single space character embedded anywhere in the input stream. That's pretty unfriendly behavior. So let's "productionize" the thing a bit by eliminating this last restriction.

The key to easy handling of white space is to come up with a simple rule for how the parser should treat the input stream, and to enforce that rule everywhere. Up till now, because white space wasn't permitted, we've been able to assume that after each parsing action, the lookahead character Look contains the next meaningful character, so we could test it immediately. Our design was based upon this principle.

It still sounds like a good rule to me, so that's the one we'll use. This means that every routine that advances the input stream must skip over white space, and leave the next non-white character in Look. Fortunately, because we've been careful to use GetName, GetNum, and Match for most of our input processing, it is only those three routines (plus Init) that we need to modify.

Not surprisingly, we start with yet another new recognizer routine:

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}




We also need a routine that will eat white-space characters, until it finds a non-white one:

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}

Now, add calls to SkipWhite to Match, GetName, and GetNum as shown below:

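{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;
{--------------------------------------------------------------}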

(Note that I rearranged Match a bit, without changing the functionality.)



Finally, we need to skip over leading blanks where we "prime the pump" in Init:

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}

Make these changes and recompile the program. You will find that you will have to move Match below SkipWhite, to avoid an error message from the Pascal compiler. Test the program as always to make sure it works properly.
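If you would rather leave Match where it is, a forward declaration is an alternative to reordering. The sketch below is a cut-down stand-in for the cradle, written only to show the declaration order: Src and Posn are hypothetical replacements for keyboard input, and the error handling is inlined rather than routed through Expected.

```pascal
program ForwardDemo;
{ Sketch only: shows SkipWhite declared forward so that Match can
  call it before its body appears. Src and Posn are hypothetical. }

const TAB = ^I;

var Look: char;
    Src: string;
    Posn: integer;

procedure GetChar;
begin
   if Posn <= Length(Src) then begin
      Look := Src[Posn];
      Inc(Posn);
   end
   else
      Look := ^M;     { pretend the line ends with a carriage return }
end;

{ Announce SkipWhite before its body appears }
procedure SkipWhite; forward;

procedure Match(x: char);
begin
   if Look <> x then begin
      WriteLn('''', x, ''' Expected');
      Halt;
   end
   else begin
      GetChar;
      SkipWhite;      { legal: the forward declaration is in scope }
   end;
end;

{ The body follows later; the parameter list is not repeated }
procedure SkipWhite;
begin
   while Look in [' ', TAB] do
      GetChar;
end;

begin
   Src := '(   )';
   Posn := 1;
   GetChar;
   Match('(');
   Match(')');
   WriteLn('Matched both parentheses');
end.
```

Either approach satisfies the compiler. Reordering keeps the source free of extra declarations, while the forward declaration lets you keep the routines in whatever order reads best.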

Since we've made quite a few changes during this session, I'm reproducing the entire parser below:

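{--------------------------------------------------------------}
program parse;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Identifier }

procedure Ident;
var Name: string[8];
begin
   Name:= GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string[8];
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Assignment;
   If Look <> CR then Expected('NewLine');
end.
{--------------------------------------------------------------}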

Now the parser is complete. It's got every feature we can put in a one-line "compiler." Tuck it away in a safe place. Next time we'll move on to a new subject, but we'll still be talking about expressions for quite awhile. Next installment, I plan to talk a bit about interpreters as opposed to compilers, and show you how the structure of the parser changes a bit as we change what sort of action has to be taken. The information we pick up there will serve us in good stead later on, even if you have no interest in interpreters. See you next time.


Part 4 - Interpreters


INTRODUCTION

In the first three installments of this series, we've looked at parsing and compiling math expressions, and worked our way gradually and methodically from dealing with very simple one-term, one-character "expressions" up through more general ones, finally arriving at a very complete parser that could parse and translate complete assignment statements, with multi-character tokens, embedded white space, and function calls. This time, I'm going to walk you through the process one more time, only with the goal of interpreting rather than compiling object code.

Since this is a series on compilers, why should we bother with interpreters? Simply because I want you to see how the nature of the parser changes as we change the goals. I also want to unify the concepts of the two types of translators, so that you can see not only the differences, but also the similarities.

Consider the assignment statement

x = 2 * y + 3

In a compiler, we want the target CPU to execute this assignment at EXECUTION time. The translator itself doesn't do any arithmetic ... it only issues the object code that will cause the CPU to do it when the code is executed. For the example above, the compiler would issue code to compute the expression and store the results in variable x.

For an interpreter, on the other hand, no object code is generated. Instead, the arithmetic is computed immediately, as the parsing is going on. For the example, by the time parsing of the statement is complete, x will have a new value.
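The contrast can be sketched in a few lines. The program below is only an illustration under simplifying assumptions: it handles single digits joined by '+' and '-', it emits immediate-operand instructions rather than the stack-based code our compiler produces, it does no error checking, and Src, Posn, Start, CompileExpr and InterpretExpr are all hypothetical names. The point is that both routines recognize the same input in the same way and differ only in what they do about it.

```pascal
program TwoActions;
{ Sketch: one grammar (digit {('+'|'-') digit}), two kinds of action.
  Src, Posn and Start feed the parser from a string; all three are
  hypothetical helpers, not part of the cradle. }

var Look: char;
    Src: string;
    Posn: integer;

procedure GetChar;
begin
   if Posn <= Length(Src) then begin
      Look := Src[Posn];
      Inc(Posn);
   end
   else
      Look := ^M;     { end of input looks like a carriage return }
end;

procedure Start(s: string);
begin
   Src := s;
   Posn := 1;
   GetChar;
end;

{ Compiler flavor: every action emits code to be executed later }
procedure CompileExpr;
var Op: char;
begin
   WriteLn(^I, 'MOVE #', Look, ',D0');
   GetChar;
   while Look in ['+', '-'] do begin
      Op := Look;
      GetChar;
      if Op = '+' then
         WriteLn(^I, 'ADD #', Look, ',D0')
      else
         WriteLn(^I, 'SUB #', Look, ',D0');
      GetChar;
   end;
end;

{ Interpreter flavor: every action does the arithmetic right away }
function InterpretExpr: integer;
var Value: integer;
    Op: char;
begin
   Value := Ord(Look) - Ord('0');
   GetChar;
   while Look in ['+', '-'] do begin
      Op := Look;
      GetChar;
      if Op = '+' then
         Value := Value + (Ord(Look) - Ord('0'))
      else
         Value := Value - (Ord(Look) - Ord('0'));
      GetChar;
   end;
   InterpretExpr := Value;
end;

begin
   Start('2+3-1');
   CompileExpr;              { writes three instructions }
   Start('2+3-1');
   WriteLn(InterpretExpr);   { writes the value 4 }
end.
```

Notice that the interpreting version had to become a function, because its result has to go somewhere, while the compiling version could stay a procedure.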




The approach we've been taking in this whole series is called "syntax-driven translation." As you are aware by now, the structure of the parser is very closely tied to the syntax of the productions we parse. We have built Pascal procedures that recognize every language construct. Associated with each of these constructs (and procedures) is a corresponding "action," which does whatever makes sense to do once a construct has been recognized. In our compiler so far, every action involves emitting object code, to be executed later at execution time. In an interpreter, every action involves something to be done immediately.

What I'd like you to see here is that the layout ... the structure ... of the parser doesn't change. It's only the actions that change. So if you can write an interpreter for a given language, you can also write a compiler, and vice versa. Yet, as you will see, there ARE differences, and significant ones. Because the actions are different, the procedures that do the recognizing end up being written differently. Specifically, in the interpreter the recognizing procedures end up being coded as FUNCTIONS that return numeric values to their callers. None of the parsing routines for our compiler did that.

Our compiler, in fact, is what we might call a "pure" compiler. Each time a construct is recognized, the object code is emitted IMMEDIATELY. (That's one reason the code is not very efficient.) The interpreter we'll be building here is a pure interpreter, in the sense that there is no translation, such as "tokenizing," performed on the source code. These represent the two extremes of translation. In the real world, translators are rarely so pure, but tend to have bits of each technique.

I can think of several examples. I've already mentioned one: most interpreters, such as Microsoft BASIC, translate the source code (tokenize it) into an intermediate form so that it'll be easier to parse in real time.

Another example is an assembler. The purpose of an assembler, of course, is to produce object code, and it normally does that on a one-to-one basis: one object instruction per line of source code. But almost every assembler also permits expressions as arguments. In this case, the expressions are always constant expressions, and so the assembler isn't supposed to issue object code for them. Rather, it "interprets" the expressions and computes the corresponding constant result, which is what it actually emits as object code.



As a matter of fact, we could use a bit of that ourselves. The translator we built in the previous installment will dutifully spit out object code for complicated expressions, even though every term in the expression is a constant. In that case it would be far better if the translator behaved a bit more like an interpreter, and just computed the equivalent constant result.
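As a rough illustration of that idea, the sketch below evaluates an all-constant expression while parsing it and then emits a single instruction that loads the result. It is an assumption-heavy toy and not something we will be adding to the compiler: it accepts only unsigned integers with '+', '-', '*' and '/', has no error checking, and the names Src, Posn, ConstTerm and ConstExpr are hypothetical.

```pascal
program FoldDemo;
{ Sketch of folding an all-constant expression at translation time.
  Src and Posn are hypothetical; error checking is omitted. }

var Look: char;
    Src: string;
    Posn: integer;

procedure GetChar;
begin
   if Posn <= Length(Src) then begin
      Look := Src[Posn];
      Inc(Posn);
   end
   else
      Look := ^M;     { end of input looks like a carriage return }
end;

{ Read an unsigned multi-digit integer }
function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   while Look in ['0'..'9'] do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;

{ term ::= number {('*'|'/') number}, computed on the spot }
function ConstTerm: integer;
var Value: integer;
begin
   Value := GetNum;
   while Look in ['*', '/'] do
      if Look = '*' then begin
         GetChar;
         Value := Value * GetNum;
      end
      else begin
         GetChar;
         Value := Value div GetNum;
      end;
   ConstTerm := Value;
end;

{ expression ::= term {('+'|'-') term}, computed on the spot }
function ConstExpr: integer;
var Value: integer;
begin
   Value := ConstTerm;
   while Look in ['+', '-'] do
      if Look = '+' then begin
         GetChar;
         Value := Value + ConstTerm;
      end
      else begin
         GetChar;
         Value := Value - ConstTerm;
      end;
   ConstExpr := Value;
end;

begin
   Src := '2*3+4*5';
   Posn := 1;
   GetChar;
   { one instruction instead of a whole sequence of them }
   WriteLn(^I, 'MOVE #', ConstExpr, ',D0');
end.
```

For the sample input the program writes one line, MOVE #26,D0, where the pure compiler would have emitted a load, push and arithmetic instruction for every operator.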

There is a concept in compiler theory called "lazy" translation. The idea is that you typically don't just emit code at every action. In fact, at the extreme you don't emit anything at all, until you absolutely have to. To accomplish this, the actions associated with the parsing routines typically don't just emit code. Sometimes they do, but often they simply return information back to the caller. Armed with such information, the caller can then make a better choice of what to do.

For example, given the statement

x = x + 3 - 2 - (5 - 4) ,

our compiler will dutifully spit out a stream of 18 instructions to load each parameter into registers, perform the arithmetic, and store the result. A lazier evaluation would recognize that the arithmetic involving constants can be evaluated at compile time, and would reduce the expression to

x = x + 0 .

An even lazier evaluation would then be smart enough to figure out that this is equivalent to

x = x ,

which calls for no action at all. We could reduce 18 instructions to zero!

Note that there is no chance of optimizing this way in our translator as it stands, because every action takes place immediately.

Lazy expression evaluation can produce significantly better object code than we have been able to produce so far. I warn you, though: it complicates the parser code considerably, because each routine now has to make decisions as to whether to emit object code or not. Lazy evaluation is certainly not named that because it's easier on the compiler writer!




Since we're operating mainly on the KISS principle here, I won't go into much more depth on this subject. I just want you to be aware that you can get some code optimization by combining the techniques of compiling and interpreting. In particular, you should know that the parsing routines in a smarter translator will generally return things to their caller, and sometimes expect things as well. That's the main reason for going over interpretation in this installment.



THE INTERPRETER

OK, now that you know WHY we're going into all this, let's do it. Just to give you practice, we're going to start over with a bare cradle and build up the translator all over again. This time, of course, we can go a bit faster.

Since we're now going to do arithmetic, the first thing we need to do is to change function GetNum, which up till now has always returned a character (or string). Now, it's better for it to return an integer. MAKE A COPY of the cradle (for goodness's sake, don't change the version in Cradle itself!!) and modify GetNum as follows:

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;
{--------------------------------------------------------------}




Now, write the following version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
begin
   Expression := GetNum;
end;
{--------------------------------------------------------------}

Finally, insert the statement

Writeln(Expression);

at the end of the main program. Now compile and test.

All this program does is to "parse" and translate a single integer "expression." As always, you should make sure that it does that with the digits 0..9, and gives an error message for anything else. Shouldn't take you very long!



OK, now let's extend this to include addops. Change Expression to read:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: integer;
var Value: integer;
begin
   if IsAddop(Look) then
      Value := 0
   else
      Value := GetNum;
   while IsAddop(Look) do begin
      case Look of
       '+': begin
               Match('+');
               Value := Value + GetNum;
            end;
       '-': begin
               Match('-');
               Value := Value - GetNum;
            end;
      end;
   end;
   Expression := Value;
end;
{--------------------------------------------------------------}




The structure of Expression, of course, parallels what we did before, so we shouldn't have too much trouble debugging it. There's been a SIGNIFICANT development, though, hasn't there? Procedures Add and Subtract went away! The reason is that the action to be taken requires BOTH arguments of the operation. I could have chosen to retain the procedures and pass into them the value of the expression to date, which is Value. But it seemed cleaner to me to keep Value as strictly a local variable, which meant that the code for Add and Subtract had to be moved in line. This result suggests that, while the structure we had developed was nice and clean for our simple-minded translation scheme, it probably wouldn't do for use with lazy evaluation. That's a little tidbit we'll probably want to keep in mind for later.



OK, did the translator work? Then let's take the next step. It's not hard to figure out what procedure Term should now look like. Change every call to GetNum in function Expression to a call to Term, and then enter the following form for Term:

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

function Term: integer;
var Value: integer;
begin
   Value := GetNum;
   while Look in ['*', '/'] do begin
      case Look of
       '*': begin
               Match('*');
               Value := Value * GetNum;
            end;
       '/': begin
               Match('/');
               Value := Value div GetNum;
            end;
      end;
   end;
   Term := Value;
end;
{--------------------------------------------------------------}




Now, try it out. Don't forget two things: first, we're dealing with integer division, so, for example, 1/3 should come out zero. Second, even though we can output multi-digit results, our input is still restricted to single digits.
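If the integer-division behavior is unfamiliar, this tiny stand-alone program (the name DivDemo is just for illustration) shows what Pascal's div operator does, with real division alongside for contrast:

```pascal
program DivDemo;
{ Illustration only: div discards the fractional part of a quotient }
begin
   WriteLn(1 div 3);     { integer division: prints 0 }
   WriteLn(7 div 2);     { prints 3 }
   WriteLn(7 / 2:0:1);   { real division, for contrast: prints 3.5 }
end.
```

Because Term uses div, an input such as 7/2*2 evaluates to 6 in our interpreter, not 7.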

That seems like a silly restriction at this point, since we have already seen how easily function GetNum can be extended. So let's go ahead and fix it right now. The new version is

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}



If you've compiled and tested this version of the interpreter, the next step is to install function Factor, complete with parenthesized expressions. We'll hold off a bit longer on the variable names. First, change the references to GetNum, in function Term, so that they call Factor instead. Now code the following version of Factor:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}

That was pretty easy, huh? We're rapidly closing in on a useful interpreter.




A LITTLE PHILOSOPHY

Before going any further, there's something I'd like to call to your attention. It's a concept that we've been making use of in all these sessions, but I haven't explicitly mentioned it up till now. I think it's time, because it's a concept so useful, and so powerful, that it makes all the difference between a parser that's trivially easy, and one that's too complex to deal with.

In the early days of compiler technology, people had a terrible time figuring out how to deal with things like operator precedence ... the way that multiply and divide operators take precedence over add and subtract, etc. I remember a colleague of some thirty years ago, and how excited he was to find out how to do it. The technique used involved building two stacks, upon which you pushed each operator or operand. Associated with each operator was a precedence level, and the rules required that you only actually performed an operation ("reducing" the stack) if the precedence level showing on top of the stack was correct. To make life more interesting, an operator like ')' had different precedence levels, depending upon whether or not it was already on the stack. You had to give it one value before you put it on the stack, and another to decide when to take it off. Just for the experience, I worked all of this out for myself a few years ago, and I can tell you that it's very tricky.
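For the curious, here is one way that explicit-stack style can be sketched. It is my own simplified rendering of the general idea, not a reconstruction of any particular historical compiler: every name in it is hypothetical, it accepts only single digits, the four operators and parentheses, it handles parentheses with a special case instead of the dual precedence levels described above, and it does no error checking.

```pascal
program TwoStacks;
{ Sketch of explicit-stack expression evaluation. All names are
  hypothetical; input is limited to single digits, + - * / and
  parentheses, and malformed input is not detected. }

const MaxDepth = 32;

var Vals: array[1..MaxDepth] of integer;   { operand stack }
    Ops: array[1..MaxDepth] of char;       { operator stack }
    NVals, NOps: integer;

{ Precedence level of an operator. A '(' on the stack ranks lowest,
  so nothing reduces past it. }
function Prec(op: char): integer;
begin
   case op of
      '+', '-': Prec := 1;
      '*', '/': Prec := 2;
   else
      Prec := 0;
   end;
end;

{ Pop one operator and two operands, push the result }
procedure Reduce;
var a, b: integer;
    op: char;
begin
   op := Ops[NOps];
   Dec(NOps);
   b := Vals[NVals];
   Dec(NVals);
   a := Vals[NVals];
   case op of
      '+': Vals[NVals] := a + b;
      '-': Vals[NVals] := a - b;
      '*': Vals[NVals] := a * b;
      '/': Vals[NVals] := a div b;
   end;
end;

function Evaluate(s: string): integer;
var i: integer;
    c: char;
begin
   NVals := 0;
   NOps := 0;
   for i := 1 to Length(s) do begin
      c := s[i];
      case c of
         '0'..'9': begin
            Inc(NVals);
            Vals[NVals] := Ord(c) - Ord('0');
         end;
         '(': begin
            Inc(NOps);
            Ops[NOps] := c;
         end;
         ')': begin
            while Ops[NOps] <> '(' do
               Reduce;
            Dec(NOps);          { discard the matching '(' }
         end;
         '+', '-', '*', '/': begin
            { reduce while the stacked operator binds at least as
              tightly as the incoming one }
            while (NOps > 0) and (Prec(Ops[NOps]) >= Prec(c)) do
               Reduce;
            Inc(NOps);
            Ops[NOps] := c;
         end;
      end;
   end;
   while NOps > 0 do
      Reduce;
   Evaluate := Vals[1];
end;

begin
   WriteLn(Evaluate('1+2*3'));     { prints 7 }
   WriteLn(Evaluate('(1+2)*3'));   { prints 9 }
   WriteLn(Evaluate('8-4-2'));     { prints 2 }
end.
```

Compare the bookkeeping here, with its two arrays, two counters and a precedence table, against the handful of lines in our recursive Expression, Term and Factor.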

We haven't had to do anything like that. In fact, by now the parsing of an arithmetic statement should seem like child's play. How did we get so lucky? And where did the precedence stacks go?

A similar thing is going on in our interpreter above. You just KNOW that in order for it to do the computation of arithmetic statements (as opposed to the parsing of them), there have to be numbers pushed onto a stack somewhere. But where is the stack?

Finally, in compiler textbooks, there are a number of places where stacks and other structures are discussed. In the other leading parsing method (LR), an explicit stack is used. In fact, the technique is very much like the old way of doing arithmetic expressions. Another concept is that of a parse tree. Authors like to draw diagrams of the tokens in a statement, connected into a tree with operators at the internal nodes. Again, where are the trees and stacks in our technique? We haven't seen any. The answer in all cases is that the structures are implicit, not explicit. In any computer language, there is a stack involved every time you call a subroutine. Whenever a subroutine is called, the return address is pushed onto the CPU stack. At the end of the subroutine, the address is popped back off and control is transferred there. In a recursive language such as Pascal, there can also be local data pushed onto the stack, and it, too, returns when it's needed.

For example, function Expression contains a local variable called Value, which it fills by a call to Term. Suppose, in its next call to Term for the second argument, that Term calls Factor, which recursively calls Expression again. That "instance" of Expression gets another value for its copy of Value. What happens to the first Value? Answer: it's still on the stack, and will be there again when we return from our call sequence.

In other words, the reason things look so simple is that we've been making maximum use of the resources of the language. The hierarchy levels and the parse trees are there, all right, but they're hidden within the structure of the parser, and they're taken care of by the order with which the various procedures are called. Now that you've seen how we do it, it's probably hard to imagine doing it any other way. But I can tell you that it took a lot of years for compiler writers to get that smart. The early compilers were too complex to imagine. Funny how things get easier with a little practice.

The reason I've brought all this up is as both a lesson and a warning. The lesson: things can be easy when you do them right. The warning: take a look at what you're doing. If, as you branch out on your own, you begin to find a real need for a separate stack or tree structure, it may be time to ask yourself if you're looking at things the right way. Maybe you just aren't using the facilities of the language as well as you could be.

The next step is to add variable names. Now, though, we have a slight problem. For the compiler, we had no problem in dealing with variable names ... we just issued the names to the assembler and let the rest of the program take care of allocating storage for them. Here, on the other hand, we need to be able to fetch the values of the variables and return them as the return values of Factor. We need a storage mechanism for these variables.

Back in the early days of personal computing, Tiny BASIC lived. It had a grand total of 26 possible variables: one for each letter of the alphabet. This fits nicely with our concept of single-character tokens, so we'll try the same trick. In the beginning of your interpreter, just after the declaration of variable Look, insert the line:

Table: Array['A'..'Z'] of integer;




We also need to initialize the array, so add this procedure:

{---------------------------------------------------------------}
{ Initialize the Variable Area }

procedure InitTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      Table[i] := 0;
end;
{---------------------------------------------------------------}

You must also insert a call to InitTable, in procedure Init. DON'T FORGET to do that, or the results may surprise you!



Now that we have an array of variables, we can modify Factor to use it. Since we don't have a way (so far) to set the variables, Factor will always return zero values for them, but let's go ahead and extend it anyway. Here's the new version:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Table[GetName]
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}




As always, compile and test this version of the program. Even though all the variables are now zeros, at least we can correctly parse the complete expressions, as well as catch any badly formed expressions.

I suppose you realize the next step: we need to do an assignment statement so we can put something INTO the variables. For now, let's stick to one-liners, though we will soon be handling multiple statements.

The assignment statement parallels what we did before:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Table[Name] := Expression;
end;
{--------------------------------------------------------------}

To test this, I added a temporary write statement in the main program, to print out the value of A. Then I tested it with various assignments to it.



Of course, an interpretive language that can only accept a single line of program is not of much value. So we're going to want to handle multiple statements. This merely means putting a loop around the call to Assignment. So let's do that now. But what should be the loop exit criterion? Glad you asked, because it brings up a point we've been able to ignore up till now.

One of the most tricky things to handle in any translator is to determine when to bail out of a given construct and go look for something else. This hasn't been a problem for us so far because we've only allowed for a single kind of construct ... either an expression or an assignment statement. When we start adding loops and different kinds of statements, you'll find that we have to be very careful that things terminate properly. If we put our interpreter in a loop, we need a way to quit. Terminating on a newline is no good, because that's what sends us back for another line. We could always let an unrecognized character take us out, but that would cause every run to end in an error message, which certainly seems uncool.

What we need is a termination character. I vote for Pascal's ending period ('.'). A minor complication is that Turbo ends every normal line with TWO characters, the carriage return (CR) and line feed (LF). At the end of each line, we need to eat these characters before processing the next one. A natural way to do this would be with procedure Match, except that Match's error message prints the character, which of course for the CR and/or LF won't look so great. What we need is a special procedure for this, which we'll no doubt be using over and over. (It refers to LF, so define LF = ^J alongside CR in the const statement.) Here it is:

{--------------------------------------------------------------}

{ Recognize and Skip Over a Newline }

procedure NewLine;

begin

if Look = CR then begin GetChar; if Look = LF then GetChar; end;

end;

{--------------------------------------------------------------}




Insert this procedure at any convenient spot ... I put mine just after Match. Now, rewrite the main program to look like this:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Assignment;

NewLine;

until Look = '.';

end.

{--------------------------------------------------------------}

Note that the test for a CR is now gone, and that there are also no error tests within NewLine itself. That's OK, though ... whatever is left over in terms of bogus characters will be caught at the beginning of the next assignment statement.

Well, we now have a functioning interpreter. It doesn't do us a lot of good, however, since we have no way to read data in or write it out. Sure would help to have some I/O!



Let's wrap this session up, then, by adding the I/O routines. Since we're sticking to single-character tokens, I'll use '?' to stand for a read statement, and '!' for a write, with the character immediately following them to be used as a one-token "parameter list." Here are the routines:

{--------------------------------------------------------------}

{ Input Routine }

procedure Input;

begin

Match('?');

Read(Table[GetName]);

end;

{--------------------------------------------------------------}

{ Output Routine }

procedure Output;

begin

Match('!');

WriteLn(Table[GetName]);

end;

{--------------------------------------------------------------}




They aren't very fancy, I admit ... no prompt character on input, for example ... but they get the job done. The corresponding changes in the main program are shown below. Note that we use the usual trick of a case statement based upon the current lookahead character, to decide what to do.

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

case Look of

'?': Input;

'!': Output;

else Assignment;

end;

NewLine;

until Look = '.';

end.

{--------------------------------------------------------------}



You have now completed a real, working interpreter. It's pretty sparse, but it works just like the "big boys." It includes three kinds of program statements (and can tell the difference!), 26 variables, and I/O statements. The only things that it lacks, really, are control statements, subroutines, and some kind of program editing function. The program editing part, I'm going to pass on. After all, we're not here to build a product, but to learn things. The control statements, we'll cover in the next installment, and the subroutines soon after. I'm anxious to get on with that, so we'll leave the interpreter as it stands.

I hope that by now you're convinced that the limitation of single-character names and the processing of white space are easily taken care of, as we did in the last session. This time, if you'd like to play around with these extensions, be my guest ... they're "left as an exercise for the student." See you next time.
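For readers who want to poke at the idea outside Turbo Pascal, here is a minimal Python sketch of the same interpreter loop: statements separated by newlines, '?' for input, '!' for output, and '.' as the terminator. It is an illustration under assumptions, not the tutorial's code: the name run, the inputs argument and the returned list are made up for the example, unset variables are assumed to read as 0, and the expression evaluator is a stand-in that only knows '+' and '-'.

```python
def run(program, inputs=()):
    """Interpret newline-separated statements up to the terminating '.'.

    Returns the list of values written by '!' statements.
    """
    table = {}                 # variable name -> value (unset names read as 0)
    pending = list(inputs)     # values handed to '?' statements, in order
    written = []
    pos = 0

    def look():
        return program[pos] if pos < len(program) else ""

    def get_char():
        nonlocal pos
        pos += 1

    def match(ch):
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at offset {pos}")
        get_char()

    def get_name():
        name = look()
        if not name.isalpha():
            raise SyntaxError(f"name expected at offset {pos}")
        get_char()
        return name.upper()

    def term():
        if look().isalpha():
            return table.get(get_name(), 0)
        digits = ""
        while look().isdigit():
            digits += look()
            get_char()
        if not digits:
            raise SyntaxError(f"integer expected at offset {pos}")
        return int(digits)

    def expression():          # stand-in: only '+' and '-' are supported
        value = term()
        while look() in ("+", "-"):
            op = look()
            get_char()
            value = value + term() if op == "+" else value - term()
        return value

    def new_line():            # skip an optional CR followed by an optional LF
        if look() == "\r":
            get_char()
        if look() == "\n":
            get_char()

    while True:                # repeat ... until Look = '.'
        if look() == "?":
            match("?")
            table[get_name()] = pending.pop(0)
        elif look() == "!":
            match("!")
            written.append(table.get(get_name(), 0))
        else:
            name = get_name()
            match("=")
            table[name] = expression()
        new_line()
        if look() == ".":
            return written
```

Calling run("a=5\nb=a+3\n!b\n.") walks two assignments and one write before it meets the period. As in the Pascal version, leftover junk on a line is only noticed when the next statement starts.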




Part 5 - Control Constructs

INTRODUCTION

In the first four installments of this series, we've been concentrating on the parsing of math expressions and assignment statements.

In this installment, we'll take off on a new and exciting tangent: that of parsing and translating control constructs such as IF statements. This subject is dear to my heart, because it represents a turning point for me. I had been playing with the parsing of expressions, just as we have done in this series, but I still felt that I was a LONG way from being able to handle a complete language. After all, REAL languages have branches and loops and subroutines and all that. Perhaps you've shared some of the same thoughts. Awhile back, though, I had to produce control constructs for a structured assembler preprocessor I was writing. Imagine my surprise to discover that it was far easier than the expression parsing I had already been through. I remember thinking, "Hey! This is EASY!" After we've finished this session, I'll bet you'll be thinking so, too.



THE PLAN

In what follows, we'll be starting over again with a bare cradle, and as we've done twice before now, we'll build things up one at a time. We'll also be retaining the concept of single-character tokens that has served us so well to date. This means that the "code" will look a little funny, with 'i' for IF, 'w' for WHILE, etc. But it helps us get the concepts down pat without fussing over lexical scanning. Fear not ... eventually we'll see something looking like "real" code.

I also don't want to have us get bogged down in dealing with statements other than branches, such as the assignment statements we've been working on. We've already demonstrated that we can handle them, so there's no point carrying them around as excess baggage during this exercise. So what I'll do instead is to use an anonymous statement, "other", to take the place of the non-control statements and serve as a place-holder for them. We have to generate some kind of object code for them (we're back into compiling, not interpretation), so for want of anything else I'll just echo the character input.

OK, then, starting with yet another copy of the cradle, let's define the procedure:

{--------------------------------------------------------------}

{ Recognize and Translate an "Other" }

procedure Other;

begin

EmitLn(GetName);

end;

{--------------------------------------------------------------}




Now include a call to it in the main program, thus:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

Other;

end.

{--------------------------------------------------------------}

Run the program and see what you get. Not very exciting, is it? But hang in there, it's a start, and things will get better.

The first thing we need is the ability to deal with more than one statement, since a single-line branch is pretty limited. We did that in the last session on interpreting, but this time let's get a little more formal. Consider the following BNF:

<program> ::= <block> END

<block> ::= [ <statement> ]*

This says that, for our purposes here, a program is defined as a block, followed by an END statement. A block, in turn, consists of zero or more statements. We only have one kind of statement, so far.

What signals the end of a block? It's simply any construct that isn't an "other" statement. For now, that means only the END statement.



Armed with these ideas, we can proceed to build up our parser. The code for a program (we have to call it DoProgram, or Pascal will complain) is:

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block;

if Look <> 'e' then Expected('End');

EmitLn('END')

end;

{--------------------------------------------------------------}

Notice that I've arranged to emit an "END" command to the assembler, which sort of punctuates the output code, and makes sense considering that we're parsing a complete program here.




The code for Block is:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e']) do begin

Other;

end;

end;

{--------------------------------------------------------------}

(From the form of the procedure, you just KNOW we're going to be adding to it in a bit!)

OK, enter these routines into your program. Replace the call to Other in the main program with a call to DoProgram. Now try it and see how it works. Well, it's still not much, but we're getting closer.



SOME GROUNDWORK

Before we begin to define the various control constructs, we need to lay a bit more groundwork. First, a word of warning: I won't be using the same syntax for these constructs as you're familiar with from Pascal or C. For example, the Pascal syntax for an IF is:

IF <condition> THEN <statement>

(where the statement, of course, may be compound).

The C version is similar:

IF ( <condition> ) <statement>

Instead, I'll be using something that looks more like Ada:

IF <condition> <block> ENDIF

In other words, the IF construct has a specific termination symbol. This avoids the dangling-else of Pascal and C and also precludes the need for the brackets {} or begin-end. The syntax I'm showing you here, in fact, is that of the language KISS that I'll be detailing in later installments. The other constructs will also be slightly different. That shouldn't be a real problem for you. Once you see how it's done, you'll realize that it really doesn't matter so much which specific syntax is involved. Once the syntax is defined, turning it into code is straightforward.




Now, all of the constructs we'll be dealing with here involve transfer of control, which at the assembler-language level means conditional and/or unconditional branches. For example, the simple IF statement

IF <condition> A ENDIF B ....

must get translated into

Branch if NOT condition to L

A

L: B

...

It's clear, then, that we're going to need some more procedures to help us deal with these branches. I've defined two of them below. Procedure NewLabel generates unique labels. This is done via the simple expedient of calling every label 'Lnn', where nn is a label number starting from zero. Procedure PostLabel just outputs the labels at the proper place.



Here are the two routines:

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;

var S: string;

begin

Str(LCount, S);

NewLabel := 'L' + S;

Inc(LCount);

end;

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);

begin

WriteLn(L, ':');

end;

{--------------------------------------------------------------}




Notice that we've added a new global variable, LCount, so you need to change the VAR declarations at the top of the program to look like this:

var Look : char; { Lookahead Character }

Lcount: integer; { Label Counter }

Also, add the following extra initialization to Init:

LCount := 0;

(DON'T forget that, or your labels can look really strange!)
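The label bookkeeping is small enough to sketch in a few lines of Python. This is only an illustration of the counter idea: the class name LabelMaker and the function post_label are invented here, and post_label returns the text rather than writing it, so that the result is easy to inspect.

```python
class LabelMaker:
    """Hands out labels L0, L1, L2, ... in order, like NewLabel with LCount."""

    def __init__(self):
        self.count = 0               # plays the role of the global LCount

    def new_label(self):
        label = f"L{self.count}"
        self.count += 1              # the next call gets a fresh number
        return label


def post_label(label):
    """Format a label definition the way PostLabel prints it."""
    return f"{label}:"


labels = LabelMaker()
first = labels.new_label()
second = labels.new_label()
print(first, second, post_label(first))
```

Starting the counter at zero in one place is the point of the warning above: the counter has to be initialized once, before the first label is requested.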

At this point I'd also like to show you a new kind of notation. If you compare the form of the IF statement above with the assembler code that must be produced, you can see that there are certain actions associated with each of the keywords in the statement:

IF: First, get the condition and issue the code for it.

Then, create a unique label and emit a branch if false.

ENDIF: Emit the label.

These actions can be shown very concisely if we write the syntax this way:

IF

<condition> { Condition;

L = NewLabel;

Emit(Branch False to L); }

<block>

ENDIF { PostLabel(L) }



This is an example of syntax-directed translation. We've been doing it all along ... we've just never written it down this way before. The stuff in curly brackets represents the ACTIONS to be taken. The nice part about this representation is that it not only shows what we have to recognize, but also the actions we have to perform, and in which order. Once we have this syntax, the code almost writes itself.

About the only thing left to do is to be a bit more specific about what we mean by "Branch if false."

I'm assuming that there will be code executed for <condition> that will perform Boolean algebra and compute some result. It should also set the condition flags corresponding to that result. Now, the usual convention for a Boolean variable is to let 0000 represent "false," and anything else (some use FFFF, some 0001) represent "true."

On the 68000 the condition flags are set whenever any data is moved or calculated. If the data is a 0000 (corresponding to a false condition, remember), the zero flag will be set. The code for "Branch on zero" is BEQ. So for our purposes here,

BEQ <=> Branch if false

BNE <=> Branch if true

It's the nature of the beast that most of the branches we see will be BEQ's ... we'll be branching AROUND the code that's supposed to be executed when the condition is true.




THE IF STATEMENT

With that bit of explanation out of the way, we're finally ready to begin coding the IF-statement parser. In fact, we've almost already done it! As usual, I'll be using our single-character approach, with the character 'i' for IF, and 'e' for ENDIF (as well as END ... that dual nature causes no confusion). I'll also, for now, skip completely the character for the branch condition, which we still have to define.

The code for DoIf is:

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;

var L: string;

begin

Match('i');

L := NewLabel;

Condition;

EmitLn('BEQ ' + L);

Block;

Match('e');

PostLabel(L);

end;

{--------------------------------------------------------------}



Add this routine to your program, and change Block to reference it as follows:

{--------------------------------------------------------------}


{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not(Look in ['e']) do begin
      case Look of
       'i': DoIf;
       else Other;
      end;
   end;
end;

{--------------------------------------------------------------}




Notice the reference to procedure Condition. Eventually, we'll write a routine that can parse and translate any Boolean condition we care to give it. But that's a whole installment by itself (the next one, in fact). For now, let's just make it a dummy that emits some text. Write the following routine:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

{ This version is a dummy }

Procedure Condition;

begin

EmitLn('<condition>');

end;

{--------------------------------------------------------------}

Insert this procedure in your program just before DoIf. Now run the program. Try a string like

aibece

As you can see, the parser seems to recognize the construct and inserts the object code at the right places. Now try a set of nested IF's, like

aibicedefe

It's starting to look real, eh?



Now that we have the general idea (and the tools such as the notation and the procedures NewLabel and PostLabel), it's a piece of cake to extend the parser to include other constructs. The first (and also one of the trickiest) is to add the ELSE clause to IF. The BNF is

IF <condition> <block> [ ELSE <block>] ENDIF

The tricky part arises simply because there is an optional part, which doesn't occur in the other constructs.

The corresponding output code should be

<condition>

BEQ L1

<block>

BRA L2

L1: <block>

L2: ...

This leads us to the following syntax-directed translation:

IF

<condition> { L1 = NewLabel;

L2 = NewLabel;

Emit(BEQ L1) }

<block>

ELSE { Emit(BRA L2);

PostLabel(L1) }

<block>

ENDIF { PostLabel(L2) }




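Comparing this with the case for an ELSE-less IF gives us a clue as to how to handle both situations. The code below does it. (Note that I use an 'l' for the ELSE, since 'e' is otherwise occupied. For the ELSE to be recognized, Block must also stop at an 'l', so its while-test becomes not(Look in ['e', 'l']).)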

{--------------------------------------------------------------}


{ Recognize and Translate an IF Construct }

procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   Match('e');
   PostLabel(L2);
end;

{--------------------------------------------------------------}



There you have it. A complete IF parser/translator, in 19 lines of code. Give it a try now.

Try something like

aiblcede

Did it work? Now, just to be sure we haven't broken the ELSE-less case, try

aibece

Now try some nested IF's. Try anything you like, including some badly formed statements. Just remember that 'e' is not a legal "other" statement.
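If you'd like to check your results against something, the following Python sketch mimics the IF/ELSE translator. It is a stand-in written for illustration, not a port of the Pascal: translate_if and its helper names are invented, the output is returned as a list of lines instead of being printed, and Condition is the same dummy as above. The block loop stops on both 'e' and 'l'.

```python
def translate_if(source):
    """Translate 'other' statements and IF/ELSE/ENDIF into pseudo-68000 text."""
    out = []
    pos = 0
    label_count = 0

    def look():
        return source[pos] if pos < len(source) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected")
        pos += 1

    def emit(text):
        out.append("\t" + text)          # EmitLn: a tab, then the text

    def new_label():
        nonlocal label_count
        label = f"L{label_count}"
        label_count += 1
        return label

    def other():
        nonlocal pos
        if not look().isalpha():
            raise SyntaxError("name expected")
        emit(look().upper())             # echo the statement character
        pos += 1

    def do_if():
        match("i")
        emit("<condition>")              # dummy Condition
        l1 = new_label()
        l2 = l1                          # no ELSE yet: one label serves both roles
        emit("BEQ " + l1)
        block()
        if look() == "l":
            match("l")
            l2 = new_label()             # ELSE present: exit label differs
            emit("BRA " + l2)
            out.append(l1 + ":")
            block()
        match("e")
        out.append(l2 + ":")

    def block():
        while look() not in ("e", "l"):  # the "follow" characters
            if look() == "i":
                do_if()
            else:
                other()

    block()
    if look() != "e":
        raise SyntaxError("End expected")
    emit("END")
    return out


for line in translate_if("aiblcede"):
    print(line)
```

The assignment l2 = l1 is the whole trick: when no ELSE shows up, the final label posted is simply the branch target of the BEQ, and when one does, only l2 is replaced.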




THE WHILE STATEMENT

The next type of statement should be easy, since we already have the process down pat. The syntax I've chosen for the WHILE statement is

WHILE <condition> <block> ENDWHILE

I know, I know, we don't REALLY need separate kinds of terminators for each construct ... you can see that by the fact that in our one-character version, 'e' is used for all of them. But I also remember MANY debugging sessions in Pascal, trying to track down a wayward END that the compiler obviously thought I meant to put somewhere else. It's been my experience that specific and unique keywords, although they add to the vocabulary of the language, give a bit of error-checking that is worth the extra work for the compiler writer.

Now, consider what the WHILE should be translated into. It should be:

L1: <condition>

BEQ L2

<block>

BRA L1

L2:

As before, comparing the two representations gives us the actions needed at each point.

WHILE { L1 = NewLabel; PostLabel(L1) }

<condition> { Emit(BEQ L2) }

<block>

ENDWHILE { Emit(BRA L1);

PostLabel(L2) }



The code follows immediately from the syntax:

{--------------------------------------------------------------}

{ Parse and Translate a WHILE Statement }

procedure DoWhile;

var L1, L2: string;

begin

Match('w');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Condition;

EmitLn('BEQ ' + L2);

Block;

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}




Since we've got a new statement, we have to add a call to it within procedure Block:

{--------------------------------------------------------------}


procedure Block;

begin

while not(Look in ['e', 'l']) do begin

case Look of

'i': DoIf;

'w': DoWhile;

else Other;

end;

end;

end;

{--------------------------------------------------------------}

No other changes are necessary.

OK, try the new program. Note that this time, the <condition> code is INSIDE the upper label, which is just where we wanted it. Try some nested loops. Try some loops within IF's, and some IF's within loops. If you get a bit confused as to what you should type, don't be discouraged: you write bugs in other languages, too, don't you? It'll look a lot more meaningful when we get full keywords.



I hope by now that you're beginning to get the idea that this really IS easy. All we have to do to accommodate a new construct is to work out the syntax-directed translation of it. The code almost falls out from there, and it doesn't affect any of the other routines. Once you've gotten the feel of the thing, you'll see that you can add new constructs about as fast as you can dream them up.




THE LOOP STATEMENT

We could stop right here, and have a language that works. It's been shown many times that a high-order language with only two constructs, the IF and the WHILE, is sufficient to write structured code. But we're on a roll now, so let's richen up the repertoire a bit.

This construct is even easier, since it has no condition test at all ... it's an infinite loop. What's the point of such a loop? Not much, by itself, but later on we're going to add a BREAK command, that will give us a way out. This makes the language considerably richer than Pascal, which has no break, and also avoids the funny WHILE(1) or WHILE TRUE of C and Pascal.

The syntax is simply

LOOP <block> ENDLOOP

and the syntax-directed translation is:

LOOP { L = NewLabel;

PostLabel(L) }

<block>

ENDLOOP { Emit(BRA L) }



The corresponding code is shown below. Since I've already used 'l' for the ELSE, I've used the last letter, 'p', as the "keyword" this time.

{--------------------------------------------------------------}

{ Parse and Translate a LOOP Statement }

procedure DoLoop;

var L: string;

begin

Match('p');

L := NewLabel;

PostLabel(L);

Block;

Match('e');

EmitLn('BRA ' + L);

end;

{--------------------------------------------------------------}

When you insert this routine, don't forget to add a line in Block to call it.




REPEAT-UNTIL

Here's one construct that I lifted right from Pascal. The syntax is

REPEAT <block> UNTIL <condition> ,

and the syntax-directed translation is:

REPEAT { L = NewLabel;

PostLabel(L) }

<block>

UNTIL

<condition> { Emit(BEQ L) }



As usual, the code falls out pretty easily:

{--------------------------------------------------------------}

{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;

var L: string;

begin

Match('r');

L := NewLabel;

PostLabel(L);

Block;

Match('u');

Condition;

EmitLn('BEQ ' + L);

end;

{--------------------------------------------------------------}

As before, we have to add the call to DoRepeat within Block. This time, there's a difference, though. I decided to use 'r' for REPEAT (naturally), but I also decided to use 'u' for UNTIL. This means that the 'u' must be added to the set of characters in the while-test. These are the characters that signal an exit from the current block ... the "follow" characters, in compiler jargon.




{--------------------------------------------------------------}


procedure Block;

begin

while not(Look in ['e', 'l', 'u']) do begin

case Look of

'i': DoIf;

'w': DoWhile;

'p': DoLoop;

'r': DoRepeat;

else Other;

end;

end;

end;

{--------------------------------------------------------------}
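The three loop forms so far can be tried side by side in the Python sketch below. It is an illustration only: translate_loops and the handler table are invented for the example, IF is left out to keep it short (which is why 'l' is missing from the follow set here), and Condition is again a dummy.

```python
def translate_loops(source):
    """Translate WHILE ('w'), LOOP ('p') and REPEAT ('r' ... 'u') constructs."""
    FOLLOW = {"e", "u"}                  # characters that end the current block
    out = []
    state = {"pos": 0, "labels": 0}

    def look():
        pos = state["pos"]
        return source[pos] if pos < len(source) else ""

    def match(ch):
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected")
        state["pos"] += 1

    def emit(text):
        out.append("\t" + text)

    def post_label(label):
        out.append(label + ":")

    def new_label():
        label = f"L{state['labels']}"
        state["labels"] += 1
        return label

    def other():
        if not look().isalpha():
            raise SyntaxError("name expected")
        emit(look().upper())
        state["pos"] += 1

    def do_while():
        match("w")
        top, done = new_label(), new_label()
        post_label(top)
        emit("<condition>")
        emit("BEQ " + done)              # leave the loop when the test is false
        block()
        match("e")
        emit("BRA " + top)
        post_label(done)

    def do_loop():
        match("p")
        top = new_label()
        post_label(top)
        block()
        match("e")
        emit("BRA " + top)               # no test at all: loop forever

    def do_repeat():
        match("r")
        top = new_label()
        post_label(top)
        block()
        match("u")
        emit("<condition>")
        emit("BEQ " + top)               # go around again while the test is false

    handlers = {"w": do_while, "p": do_loop, "r": do_repeat}

    def block():
        while look() not in FOLLOW:
            handlers.get(look(), other)()

    block()
    if look() != "e":
        raise SyntaxError("End expected")
    emit("END")
    return out


for line in translate_loops("awbece"):
    print(line)
```

Adding a construct means adding one handler and, if it brings a new terminator, one character in FOLLOW. Nothing else in the translator changes.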



THE FOR LOOP

The FOR loop is a very handy one to have around, but it's a bear to translate. That's not so much because the construct itself is hard ... it's only a loop after all ... but simply because it's hard to implement in assembler language. Once the code is figured out, the translation is straightforward enough.

C fans love the FOR-loop of that language (and, in fact, it's easier to code), but I've chosen instead a syntax very much like the one from good ol' BASIC:

FOR <ident> = <expr1> TO <expr2> <block> ENDFOR

The translation of a FOR loop can be just about as difficult as you choose to make it, depending upon the way you decide to define the rules as to how to handle the limits. Does expr2 get evaluated every time through the loop, for example, or is it treated as a constant limit? Do you always go through the loop at least once, as in FORTRAN, or not? It gets simpler if you adopt the point of view that the construct is equivalent to:

<ident> = <expr1>

TEMP = <expr2>

WHILE <ident> <= TEMP

<block>

ENDWHILE

Notice that with this definition of the loop, <block> will not be executed at all if <expr1> is initially larger than <expr2>.




The 68000 code needed to do this is trickier than anything we've done so far. I had a couple of tries at it, putting both the counter and the upper limit on the stack, both in registers, etc. I finally arrived at a hybrid arrangement, in which the loop counter is in memory (so that it can be accessed within the loop), and the upper limit is on the stack. The translated code came out like this:

<ident> get name of loop counter

<expr1> get initial value

LEA <ident>(PC),A0 address the loop counter

SUBQ #1,D0 predecrement it

MOVE D0,(A0) save it

<expr2> get upper limit

MOVE D0,-(SP) save it on stack

L1: LEA <ident>(PC),A0 address loop counter

MOVE (A0),D0 fetch it to D0

ADDQ #1,D0 bump the counter

MOVE D0,(A0) save new value

CMP (SP),D0 check for range

BGT L2 skip out if D0 > (SP)

<block>

BRA L1 loop for next pass

L2: ADDQ #2,SP clean up the stack



Wow! That seems like a lot of code ... the line containing <block> seems to almost get lost. But that's the best I could do with it. I guess it helps to keep in mind that it's really only sixteen words, after all. If anyone else can optimize this better, please let me know.
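To convince yourself that the sequence does what the WHILE-based definition promises, it can help to step through it by hand. The Python sketch below does that, one statement per 68000 instruction. It is a simulation under assumptions: run_for is an invented name, the counter is assumed to be called I, Python integers stand in for 16-bit words (so overflow is ignored), and the exit test is taken to be "leave when D0 > (SP)", which is what BGT after CMP (SP),D0 gives and what DoFor emits.

```python
def run_for(first, limit, body):
    """Step through the FOR code sequence; return the counter's final value."""
    memory = {"I": 0}            # the loop counter lives in memory
    stack = []                   # the upper limit lives on the stack

    d0 = first                   # <expr1>
    d0 -= 1                      # SUBQ #1,D0     predecrement
    memory["I"] = d0             # MOVE D0,(A0)
    d0 = limit                   # <expr2>, evaluated once
    stack.append(d0)             # MOVE D0,-(SP)

    while True:                  # L1:
        d0 = memory["I"]         # MOVE (A0),D0
        d0 += 1                  # ADDQ #1,D0
        memory["I"] = d0         # MOVE D0,(A0)
        if d0 > stack[-1]:       # CMP (SP),D0 / BGT L2
            break
        body(memory["I"])        # <block>, then BRA L1

    stack.pop()                  # L2: ADDQ #2,SP
    return memory["I"]


seen = []
final = run_for(1, 3, seen.append)
print(seen, final)
```

Two things fall out of the trace: the body is skipped entirely when the initial value already exceeds the limit, and the counter ends up one past the limit after a normal exit.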

Still, the parser routine is pretty easy now that we have the code:

{--------------------------------------------------------------}

{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;

{--------------------------------------------------------------}



Since we don't have expressions in this parser, I used the same trick as for Condition, and wrote the routine

{--------------------------------------------------------------}



Procedure Expression;

begin

EmitLn('<expr>');

end;

{--------------------------------------------------------------}

Give it a try. Once again, don't forget to add the call in Block. Since we don't have any input for the dummy version of Expression, a typical input line would look something like

afi=bece

Well, it DOES generate a lot of code, doesn't it? But at least it's the RIGHT code.




THE DO STATEMENT

All this made me wish for a simpler version of the FOR loop. The reason for all the code above is the need to have the loop counter accessible as a variable within the loop. If all we need is a counting loop to make us go through something a specified number of times, but don't need access to the counter itself, there is a much easier solution. The 68000 has a "decrement and branch nonzero" instruction built in which is ideal for counting. For good measure, let's add this construct, too. This will be the last of our loop structures.

The syntax and its translation is:

DO

<expr> { Emit(SUBQ #1,D0);

L = NewLabel;

PostLabel(L);

Emit(MOVE D0,-(SP)) }

<block>

ENDDO { Emit(MOVE (SP)+,D0);

Emit(DBRA D0,L) }



That's quite a bit simpler! The loop will execute <expr> times. Here's the code:

{--------------------------------------------------------------}

{ Parse and Translate a DO Statement }

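procedure Dodo;
var L: string;
begin
   Match('d');
   L := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L);
   EmitLn('MOVE D0,-(SP)');
   Block;
   Match('e');                   { consume the ENDDO }
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L);
end;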

{--------------------------------------------------------------}

I think you'll have to agree, that's a whole lot simpler than the classical FOR. Still, each construct has its place.
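The claim that the loop runs <expr> times is easy to check with a small simulation. The Python sketch below assumes the usual DBRA behavior, namely that the low word of the register is decremented and the branch is taken unless the result is -1 (FFFF). The name do_loop_passes is invented, and the 16-bit masking is my modeling choice.

```python
def do_loop_passes(count):
    """Count how many times <block> runs for DO <count> ... ENDDO."""
    d0 = (count - 1) & 0xFFFF        # SUBQ #1,D0, kept to 16 bits
    stack = []
    passes = 0
    while True:                      # L:
        stack.append(d0)             # MOVE D0,-(SP)
        passes += 1                  # <block>
        d0 = stack.pop()             # MOVE (SP)+,D0
        d0 = (d0 - 1) & 0xFFFF       # DBRA: decrement the low word ...
        if d0 == 0xFFFF:             # ... and fall through once it hits -1
            break
    return passes


print(do_loop_passes(3))
```

One consequence worth knowing, under the same assumptions: a count of zero does not skip the loop. The predecrement wraps around to FFFF, and the body runs 65536 times before DBRA finally falls through.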




THE BREAK STATEMENT

Earlier I promised you a BREAK statement to accompany LOOP. This is one I'm sort of proud of. On the face of it a BREAK seems really tricky. My first approach was to just use it as an extra terminator to Block, and split all the loops into two parts, just as I did with the ELSE half of an IF. That turns out not to work, though, because the BREAK statement is almost certainly not going to show up at the same level as the loop itself. The most likely place for a BREAK is right after an IF, which would cause it to exit to the IF construct, not the enclosing loop. WRONG. The BREAK has to exit the inner LOOP, even if it's nested down into several levels of IFs.

My next thought was that I would just store away, in some global variable, the ending label of the innermost loop. That doesn't work either, because there may be a break from an inner loop followed by a break from an outer one. Storing the label for the inner loop would clobber the label for the outer one. So the global variable turned into a stack. Things were starting to get messy.

Then I decided to take my own advice. Remember in the last session when I pointed out how well the implicit stack of a recursive descent parser was serving our needs? I said that if you begin to see the need for an external stack you might be doing something wrong. Well, I was. It is indeed possible to let the recursion built into our parser take care of everything, and the solution is so simple that it's surprising.

The secret is to note that every BREAK statement has to occur within a block ... there's no place else for it to be. So all we have to do is to pass into Block the exit address of the innermost loop. Then it can pass the address to the routine that translates the break instruction. Since an IF statement doesn't change the loop level, procedure DoIf doesn't need to do anything except pass the label into ITS blocks (both of them). Since loops DO change the level, each loop construct simply ignores whatever label is above it and passes its own exit label along.



All this is easier to show you than it is to describe. I'll demonstrate with the easiest loop, which is LOOP:

{--------------------------------------------------------------}


procedure DoLoop;

var L1, L2: string;

begin

Match('p');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Block(L2);

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}

Notice that DoLoop now has TWO labels, not just one. The second is to give the BREAK instruction a target to jump to. If there is no BREAK within the loop, we've wasted a label and cluttered up things a bit, but there's no harm done.




Note also that Block now has a parameter, which for loops will always be the exit address. The new version of Block is:

{--------------------------------------------------------------}


{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;

{--------------------------------------------------------------}



Again, notice that all Block does with the label is to pass it into DoIf and DoBreak. The loop constructs don't need it, because they are going to pass their own label anyway.

The new version of DoIf is:

{--------------------------------------------------------------}


{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;

procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;

{--------------------------------------------------------------}




Here, the only thing that changes is the addition of the parameter to procedure Block. An IF statement doesn't change the loop nesting level, so DoIf just passes the label along. No matter how many levels of IF nesting we have, the same label will be used.

Now, remember that DoProgram also calls Block, so it now needs to pass it a label. An attempt to exit the outermost block is an error, so DoProgram passes a null label which is caught by DoBreak:

{--------------------------------------------------------------}

{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);

begin

Match('b');

if L <> '' then

EmitLn('BRA ' + L)

else Abort('No loop to break from');

end;

{--------------------------------------------------------------}



{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;

{--------------------------------------------------------------}



That ALMOST takes care of everything. Give it a try, see if you can "break" it <pun>. Careful, though. By this time we've used so many letters, it's hard to think of characters that aren't now representing reserved words. Remember: before you try the program, you're going to have to edit every occurrence of Block in the other loop constructs to include the new parameter. Do it just like I did for LOOP.

I said ALMOST above. There is one slight problem: if you take a hard look at the code generated for DO, you'll see that if you break out of this loop, the value of the loop counter is still left on the stack. We're going to have to fix that! A shame ... that was one of our smaller routines, but it can't be helped. Here's a version that doesn't have the problem:

{--------------------------------------------------------------}


{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   Match('e');                   { consume the ENDDO }
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;

{--------------------------------------------------------------}

The two extra instructions, the SUBQ and ADDQ, take care of leaving the stack in the right shape.
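To see the label-passing idea in isolation, here is a Python sketch limited to LOOP, IF (without ELSE) and BREAK. It is a stand-in for illustration, not the tutorial's parser: translate_break and the helper names are invented, an empty string plays the role of the null label, and output is collected in a list.

```python
def translate_break(source):
    """Translate LOOP ('p'), IF ('i'), BREAK ('b') and 'other' statements."""
    out = []
    pos = 0
    label_count = 0

    def look():
        return source[pos] if pos < len(source) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected")
        pos += 1

    def emit(text):
        out.append("\t" + text)

    def post_label(label):
        out.append(label + ":")

    def new_label():
        nonlocal label_count
        label = f"L{label_count}"
        label_count += 1
        return label

    def other():
        nonlocal pos
        if not look().isalpha():
            raise SyntaxError("name expected")
        emit(look().upper())
        pos += 1

    def do_break(exit_label):
        match("b")
        if not exit_label:               # null label: we are not inside a loop
            raise SyntaxError("No loop to break from")
        emit("BRA " + exit_label)

    def do_if(exit_label):
        match("i")
        emit("<condition>")
        skip = new_label()
        emit("BEQ " + skip)
        block(exit_label)                # same loop level: hand the label on
        match("e")
        post_label(skip)

    def do_loop():
        match("p")
        top, done = new_label(), new_label()
        post_label(top)
        block(done)                      # new loop level: pass our own exit
        match("e")
        emit("BRA " + top)
        post_label(done)

    def block(exit_label):
        while look() != "e":
            if look() == "i":
                do_if(exit_label)
            elif look() == "p":
                do_loop()
            elif look() == "b":
                do_break(exit_label)
            else:
                other()

    block("")                            # outermost block: nothing to break from
    emit("END")                          # block only returns when it sees 'e'
    return out


for line in translate_break("paibeee"):
    print(line)
```

In the sample input the BREAK sits inside an IF inside a LOOP, yet the branch it emits targets the loop's exit label, because the IF merely forwarded the label it was given. No explicit stack is needed; the recursion keeps one label per nesting level.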




CONCLUSION

At this point we have created a number of control constructs ... a richer set, really, than that provided by almost any other programming language. And, except for the FOR loop, it was pretty easy to do. Even that one was tricky only because it's tricky in assembler language.

I'll conclude this session here. To wrap the thing up with a red ribbon, we really should have a go at having real keywords instead of these mickey-mouse single-character things. You've already seen that the extension to multi-character words is not difficult, but in this case it will make a big difference in the appearance of our input code. I'll save that little bit for the next installment. In that installment we'll also address Boolean expressions, so we can get rid of the dummy version of Condition that we've used here. See you then.

For reference purposes, here is the completed parser for this session:

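{--------------------------------------------------------------}

program Branch;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;

{--------------------------------------------------------------}

{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    LCount: integer;           { Label Counter }

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;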

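{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;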

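{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;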

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

procedure Condition;
begin
   EmitLn('<condition>');
end;

{--------------------------------------------------------------}

{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('<expr>');
end;

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;

procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;

{--------------------------------------------------------------}

{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}

{ Parse and Translate a LOOP Statement }

procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}

{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;
var L1, L2: string;
begin
   Match('r');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('u');
   Condition;
   EmitLn('BEQ ' + L1);
   PostLabel(L2);
end;

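{--------------------------------------------------------------}

{ Parse and Translate a FOR Statement }

procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;

{--------------------------------------------------------------}

{ Parse and Translate a DO Statement }

procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;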

{--------------------------------------------------------------}

{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);
begin
   Match('b');
   if L <> '' then
      EmitLn('BRA ' + L)
   else Abort('No loop to break from');
end;

{--------------------------------------------------------------}

{ Recognize and Translate an "Other" }

procedure Other;
begin
   EmitLn(GetName);
end;

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
      else Other;
      end;
   end;
end;

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;

{--------------------------------------------------------------}

{ Main Program }

begin
   Init;
   DoProgram;
end.

{--------------------------------------------------------------}

Part 6 - Boolean Expressions

INTRODUCTION

In Part V of this series, we took a look at control constructs, and developed parsing routines to translate them into object code. We ended up with a nice, relatively rich set of constructs.

As we left the parser, though, there was one big hole in our capabilities: we did not address the issue of the branch condition. To fill the void, I introduced to you a dummy parse routine called Condition, which only served as a place-keeper for the real thing.

One of the things we'll do in this session is to plug that hole by expanding Condition into a true parser/translator.

THE PLAN

We're going to approach this installment a bit differently than any of the others. In those other installments, we started out immediately with experiments using the Pascal compiler, building up the parsers from very rudimentary beginnings to their final forms, without spending much time in planning beforehand. That's called coding without specs, and it's usually frowned upon. We could get away with it before because the rules of arithmetic are pretty well established ... we know what a '+' sign is supposed to mean without having to discuss it at length. The same is true for branches and loops. But the ways in which programming languages implement logic vary quite a bit from language to language. So before we begin serious coding, we'd better first make up our minds what it is we want. And the way to do that is at the level of the BNF syntax rules (the GRAMMAR).



THE GRAMMAR

For some time now, we've been implementing BNF syntax equations for arithmetic expressions, without ever actually writing them down all in one place. It's time that we did so. They are:

<expression> ::= <unary op> <term> [<addop> <term>]*
<term>       ::= <factor> [<mulop> <factor>]*
<factor>     ::= <integer> | <variable> | ( <expression> )

(Remember, the nice thing about this grammar is that it enforces the operator precedence hierarchy that we normally expect for algebra.)
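To see how the shape of the grammar produces that precedence, here is a minimal sketch. It is not part of the tutorial's parser: it evaluates single-digit operands directly instead of emitting assembler, handles only '+' and '*', and the names PrecedenceDemo, Src, Cursor and Peek are made up for the illustration.

```pascal
program PrecedenceDemo;

{ Hypothetical sketch, not the tutorial's code: it evaluates    }
{ single-digit operands directly instead of emitting assembler. }

const Src = '2+3*4';

var Cursor: integer;

{ Return the current character, or #0 at the end of the input. }
function Peek: char;
begin
   if Cursor <= Length(Src) then Peek := Src[Cursor]
   else Peek := #0;
end;

{ A factor is a single digit in this sketch. }
function Factor: integer;
begin
   Factor := Ord(Peek) - Ord('0');
   Inc(Cursor);
end;

{ A term collects factors joined by '*'. }
function Term: integer;
var v: integer;
begin
   v := Factor;
   while Peek = '*' do begin
      Inc(Cursor);
      v := v * Factor;
   end;
   Term := v;
end;

{ An expression collects terms joined by '+'. }
function Expression: integer;
var v: integer;
begin
   v := Term;
   while Peek = '+' do begin
      Inc(Cursor);
      v := v + Term;
   end;
   Expression := v;
end;

begin
   Cursor := 1;
   WriteLn(Expression);   { prints 14, not 20 }
end.
```

Because Expression only ever sees whole terms, the multiplication has already been carried out by the time the addition happens. Nothing else in the code knows about precedence.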

Actually, while we're on the subject, I'd like to amend this grammar a bit right now. The way we've handled the unary minus is a bit awkward. I've found that it's better to write the grammar this way:

<expression>    ::= <term> [<addop> <term>]*
<term>          ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor>        ::= <integer> | <variable> | (<expression>)

This puts the job of handling the unary minus onto Factor, which is where it really belongs.

This doesn't mean that you have to go back and recode the programs you've already written, although you're free to do so if you like. But I will be using the new syntax from now on.

Now, it probably won't come as a shock to you to learn that we can define an analogous grammar for Boolean algebra. A typical set of rules is:

<b-expression> ::= <b-term> [<orop> <b-term>]*
<b-term>       ::= <not-factor> [AND <not-factor>]*
<not-factor>   ::= [NOT] <b-factor>
<b-factor>     ::= <b-literal> | <b-variable> | (<b-expression>)

Notice that in this grammar, the operator AND is analogous to '*', and OR (and exclusive OR) to '+'. The NOT operator is analogous to a unary minus. This hierarchy is not absolutely standard ... some languages, notably Ada, treat all logical operators as having the same precedence level ... but it seems natural.

Notice also the slight difference between the way the NOT and the unary minus are handled. In algebra, the unary minus is considered to go with the whole term, and so never appears but once in a given term. So an expression like

a * -b

or worse yet,

a - -b

is not allowed. In Boolean algebra, though, the expression

a AND NOT b

makes perfect sense, and the syntax shown allows for that.



RELOPS

OK, assuming that you're willing to accept the grammar I've shown here, we now have syntax rules for both arithmetic and Boolean algebra. The sticky part comes in when we have to combine the two. Why do we have to do that? Well, the whole subject came up because of the need to process the "predicates" (conditions) associated with control statements such as the IF. The predicate is required to have a Boolean value; that is, it must evaluate to either TRUE or FALSE. The branch is then taken or not taken, depending on that value. What we expect to see going on in procedure Condition, then, is the evaluation of a Boolean expression.

But there's more to it than that. A pure Boolean expression can indeed be the predicate of a control statement ... things like

IF a AND NOT b THEN ....

But more often, we see Boolean algebra show up in such things as

IF (x >= 0) and (x <= 100) THEN ...

Here, the two terms in parens are Boolean expressions, but the individual terms being compared: x, 0, and 100, are NUMERIC in nature. The RELATIONAL OPERATORS >= and <= are the catalysts by which the Boolean and the arithmetic ingredients get merged together.

Now, in the example above, the terms being compared are just that: terms. However, in general each side can be a math expression. So we can define a RELATION to be:

<relation> ::= <expression> <relop> <expression> ,

where the expressions we're talking about here are the old numeric type, and the relops are any of the usual symbols

=, <> (or !=), <, >, <=, and >=

If you think about it a bit, you'll agree that, since this kind of predicate has a single Boolean value, TRUE or FALSE, as its result, it is really just another kind of factor. So we can expand the definition of a Boolean factor above to read:

<b-factor> ::= <b-literal>
             | <b-variable>
             | (<b-expression>)
             | <relation>

THAT's the connection! The relops and the relation they define serve to wed the two kinds of algebra. It is worth noting that this implies a hierarchy where the arithmetic expression has a HIGHER precedence than a Boolean factor, and therefore than all the Boolean operators. If you write out the precedence levels for all the operators, you arrive at the following list:

Level   Syntax Element   Operator

0       factor           literal, variable
1       signed factor    unary minus
2       term             *, /
3       expression       +, -
4       b-factor         literal, variable, relop
5       not-factor       NOT
6       b-term           AND
7       b-expression     OR, XOR



If we're willing to accept that many precedence levels, this grammar seems reasonable. Unfortunately, it won't work! The grammar may be great in theory, but it's no good at all in the practice of a top-down parser. To see the problem, consider the code fragment:

IF ((((((A + B + C) < 0 ) AND ....

When the parser is parsing this code, it knows after it sees the IF token that a Boolean expression is supposed to be next. So it can set up to begin evaluating such an expression. But the first expression in the example is an ARITHMETIC expression, A + B + C. What's worse, at the point that the parser has read this much of the input line:

IF ((((((A ,

it still has no way of knowing which kind of expression it's dealing with. That won't do, because we must have different recognizers for the two cases. The situation can be handled without changing any of our definitions, but only if we're willing to accept an arbitrary amount of backtracking to work our way out of bad guesses. No compiler writer in his right mind would agree to that.

What's going on here is that the beauty and elegance of BNF grammar has met face to face with the realities of compiler technology.

To deal with this situation, compiler writers have had to make compromises so that a single parser can handle the grammar without backtracking.

FIXING THE GRAMMAR

The problem that we've encountered comes up because our definitions of both arithmetic and Boolean factors permit the use of parenthesized expressions. Since the definitions are recursive, we can end up with any number of levels of parentheses, and the parser can't know which kind of expression it's dealing with.

The solution is simple, although it ends up causing profound changes to our grammar. We can only allow parentheses in one kind of factor. The way to do that varies considerably from language to language. This is one place where there is NO agreement or convention to help us.

When Niklaus Wirth designed Pascal, the desire was to limit the number of levels of precedence (fewer parse routines, after all). So the OR and exclusive OR operators are treated just like an Addop and processed at the level of a math expression. Similarly, the AND is treated like a Mulop and processed with Term. The precedence levels are

Level   Syntax Element   Operator

0       factor           literal, variable
1       signed factor    unary minus, NOT
2       term             *, /, AND
3       expression       +, -, OR



Notice that there is only ONE set of syntax rules, applying to both kinds of operators. According to this grammar, then, expressions like

x + (y AND NOT z) DIV 3

are perfectly legal. And, in fact, they ARE ... as far as the parser is concerned. Pascal doesn't allow the mixing of arithmetic and Boolean variables, and things like this are caught at the SEMANTIC level, when it comes time to generate code for them, rather than at the syntax level.

The authors of C took a diametrically opposite approach: they treat the operators as different, and have something much more akin to our seven levels of precedence. In fact, in C there are no fewer than 17 levels! That's because C also has the operators '=', '+=' and its kin, '<<', '>>', '++', '--', etc. Ironically, although in C the arithmetic and Boolean operators are treated separately, the variables are NOT ... there are no Boolean or logical variables in C, so a Boolean test can be made on any integer value.

We'll do something that's sort of in-between. I'm tempted to stick mostly with the Pascal approach, since that seems the simplest from an implementation point of view, but it results in some funnies that I never liked very much, such as the fact that, in the expression

IF (c >= 'A') and (c <= 'Z') then ...

the parens above are REQUIRED. I never understood why before, and neither my compiler nor any human ever explained it very well, either. But now, we can all see that the 'and' operator, having the precedence of a multiply, has a higher one than the relational operators, so without the parens the expression is equivalent to

IF c >= ('A' and c) <= 'Z' then

which doesn't make sense.
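For readers who want to see the rule in action, here is a small sketch in ordinary Pascal. The program name and the test character are arbitrary choices for the illustration, and it assumes a compiler that follows the standard Pascal precedence rules described above.

```pascal
program AndPrecedence;

{ Illustrative only: the name and the test value are arbitrary. }

var c: char;

begin
   c := 'Q';
   { The parentheses force each comparison to be evaluated first, }
   { so 'and' combines two Boolean values.                        }
   if (c >= 'A') and (c <= 'Z') then
      WriteLn(c, ' is an upper-case letter')
   else
      WriteLn(c, ' is not an upper-case letter');
   { Without the parentheses the 'and' would bind to 'A' and c,   }
   { which is the grouping shown in the text and is rejected.     }
end.
```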



In any case, I've elected to separate the operators into different levels, although not as many as in C.

<b-expression>  ::= <b-term> [<orop> <b-term>]*
<b-term>        ::= <not-factor> [AND <not-factor>]*
<not-factor>    ::= [NOT] <b-factor>
<b-factor>      ::= <b-literal> | <b-variable> | <relation>
<relation>      ::= <expression> [<relop> <expression>]
<expression>    ::= <term> [<addop> <term>]*
<term>          ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor>        ::= <integer> | <variable> | (<b-expression>)

This grammar results in the same set of seven levels that I showed earlier. Really, it's almost the same grammar ... I just removed the option of parenthesized b-expressions as a possible b-factor, and added the relation as a legal form of b-factor.

There is one subtle but crucial difference, which is what makes the whole thing work. Notice the square brackets in the definition of a relation. This means that the relop and the second expression are OPTIONAL.

A strange consequence of this grammar (and one shared by C) is that EVERY expression is potentially a Boolean expression. The parser will always be looking for a Boolean expression, but will "settle" for an arithmetic one. To be honest, that's going to slow down the parser, because it has to wade through more layers of procedure calls. That's one reason why Pascal compilers tend to compile faster than C compilers. If it's raw speed you want, stick with the Pascal syntax.
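The effect of the optional relop can be sketched in a few lines. This is an illustration under simplifying assumptions, not the parser we are about to build: operands are single digits, the result is computed rather than compiled, and the names OptionalRelop, Src, Cursor, Peek, Operand and Eval are invented here.

```pascal
program OptionalRelop;

{ Hypothetical sketch: operands are single digits and the result }
{ is computed directly rather than compiled.                     }

var Src: string;
    Cursor: integer;

{ Return the current character, or #0 at the end of the input. }
function Peek: char;
begin
   if Cursor <= Length(Src) then Peek := Src[Cursor]
   else Peek := #0;
end;

{ Stand-in for a full arithmetic expression: one digit. }
function Operand: integer;
begin
   Operand := Ord(Peek) - Ord('0');
   Inc(Cursor);
end;

{ A relation is an operand, optionally followed by relop + operand. }
function Relation: integer;
var left, right: integer;
    op: char;
    holds: boolean;
begin
   left := Operand;
   if Peek in ['=', '#', '<', '>'] then begin
      op := Peek;
      Inc(Cursor);
      right := Operand;
      case op of
       '=': holds := left = right;
       '#': holds := left <> right;
       '<': holds := left < right;
      else  holds := left > right;
      end;
      if holds then Relation := -1 else Relation := 0;
   end
   else
      Relation := left;        { no relop: settle for the number }
end;

function Eval(s: string): integer;
begin
   Src := s;
   Cursor := 1;
   Eval := Relation;
end;

begin
   WriteLn(Eval('3<5'));   { -1 }
   WriteLn(Eval('5<3'));   { 0 }
   WriteLn(Eval('7'));     { 7 }
end.
```

The last call shows the "settling": with no relop in sight, the relation simply passes the arithmetic value through.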



THE PARSER

Now that we've gotten through the decision-making process, we can press on with development of a parser. You've done this with me several times now, so you know the drill: we begin with a fresh copy of the cradle, and begin adding procedures one by one. So let's do it.

We begin, as we did in the arithmetic case, by dealing only with Boolean literals rather than variables. This gives us a new kind of input token, so we're also going to need a new recognizer, and a new procedure to read instances of that token type. Let's start by defining the two new procedures:

{--------------------------------------------------------------}

{ Recognize a Boolean Literal }

function IsBoolean(c: char): Boolean;

begin

IsBoolean := UpCase(c) in ['T', 'F'];

end;

{--------------------------------------------------------------}

{ Get a Boolean Literal }

function GetBoolean: Boolean;
var c: char;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   GetBoolean := UpCase(Look) = 'T';
   GetChar;
end;

{--------------------------------------------------------------}

Type these routines into your program. You can test them by adding into the main program the print statement

WriteLn(GetBoolean);

OK, compile the program and test it. As usual, it's not very impressive so far, but it soon will be.

Now, when we were dealing with numeric data we had to arrange to generate code to load the values into D0. We need to do the same for Boolean data. The usual way to encode Boolean variables is to let 0 stand for FALSE, and some other value for TRUE. Many languages, such as C, use an integer 1 to represent it. But I prefer FFFF hex (or -1), because a bitwise NOT also becomes a Boolean NOT. So now we need to emit the right assembler code to load those values. The first cut at the Boolean expression parser (BoolExpression, of course) is:

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Expression }

procedure BoolExpression;

begin

if not IsBoolean(Look) then Expected('Boolean Literal');

if GetBoolean then

EmitLn('MOVE #-1,D0')

else

EmitLn('CLR D0');

end;

{---------------------------------------------------------------}
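The reason for loading -1 rather than 1 for TRUE can be checked with a short experiment. This is a sketch only; the constant names TrueC and TrueAll are invented for the illustration, and it assumes a Pascal in which "not" applied to an integer is a bitwise complement, as it is in Turbo Pascal and Free Pascal.

```pascal
program TruthValues;

{ Illustrative sketch: TrueC and TrueAll are made-up names. }

const
   TrueC   = 1;      { TRUE as C represents it }
   TrueAll = -1;     { TRUE with every bit set }

var w: integer;

begin
   w := TrueAll;
   WriteLn(not w);   { 0: the bitwise NOT of TRUE is FALSE }
   w := TrueC;
   WriteLn(not w);   { -2: still non-zero, so still "true" }
end.
```

With every bit set, one complement instruction flips TRUE to FALSE and back. With 1 as TRUE, the complement is another non-zero value and a separate logical NOT would be needed.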



Add this procedure to your parser, and call it from the main program (replacing the print statement you had just put there). As you can see, we still don't have much of a parser, but the output code is starting to look more realistic.

Next, of course, we have to expand the definition of a Boolean expression. We already have the BNF rule:

<b-expression> ::= <b-term> [<orop> <b-term>]*

I prefer the Pascal versions of the "orops", OR and XOR. But since we are keeping to single-character tokens here, I'll encode those with '|' and '~'. The next version of BoolExpression is almost a direct copy of the arithmetic procedure Expression:

{--------------------------------------------------------------}

{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
   Match('|');
   BoolTerm;
   EmitLn('OR (SP)+,D0');
end;

{--------------------------------------------------------------}

{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
   Match('~');
   BoolTerm;
   EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;

{---------------------------------------------------------------}



Note the new recognizer IsOrOp, which is also a copy, this time of IsAddOp:

{--------------------------------------------------------------}

{ Recognize a Boolean Orop }

function IsOrop(c: char): Boolean;

begin

IsOrop := c in ['|', '~'];

end;

{--------------------------------------------------------------}

OK, rename the old version of BoolExpression to BoolTerm, then enter the code above. Compile and test this version. At this point, the output code is starting to look pretty good. Of course, it doesn't make much sense to do a lot of Boolean algebra on constant values, but we'll soon be expanding the types of Booleans we deal with.

You've probably already guessed what the next step is: The Boolean version of Term.

Rename the current procedure BoolTerm to NotFactor, and enter the following new version of BoolTerm. Note that it is much simpler than the numeric version, since there is no equivalent of division.

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      EmitLn('MOVE D0,-(SP)');
      Match('&');
      NotFactor;
      EmitLn('AND (SP)+,D0');
   end;
end;

{--------------------------------------------------------------}



Now, we're almost home. We are translating complex Boolean expressions, although only for constant values. The next step is to allow for the NOT. Write the following procedure:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Factor with NOT }

procedure NotFactor;

begin

if Look = '!' then begin

Match('!');

BoolFactor;

EmitLn('EOR #-1,D0');

end

else

BoolFactor;

end;

{--------------------------------------------------------------}

And rename the earlier procedure to BoolFactor. Now try that. At this point the parser should be able to handle any Boolean expression you care to throw at it. Does it? Does it trap badly formed expressions?
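If you'd like an independent check on what a given input ought to mean, the same four-level structure can be turned into a tiny evaluator. This is a hedged sketch, not the tutorial's code: it computes a truth value instead of emitting 68000 instructions, reads from a constant string rather than the keyboard, does no error checking, and the names BoolEval, Src, Cursor and Peek are invented here.

```pascal
program BoolEval;

{ Hypothetical sketch: evaluates T/F literals combined with }
{ '|', '~', '&' and '!' instead of generating code.         }

const Src = 'T&!F|F';

var Cursor: integer;

{ Return the current character, or #0 at the end of the input. }
function Peek: char;
begin
   if Cursor <= Length(Src) then Peek := Src[Cursor]
   else Peek := #0;
end;

{ A Boolean factor is a literal: T is true, anything else false. }
function BoolFactor: boolean;
begin
   BoolFactor := UpCase(Peek) = 'T';
   Inc(Cursor);
end;

function NotFactor: boolean;
begin
   if Peek = '!' then begin
      Inc(Cursor);
      NotFactor := not BoolFactor;
   end
   else
      NotFactor := BoolFactor;
end;

function BoolTerm: boolean;
var v, r: boolean;
begin
   v := NotFactor;
   while Peek = '&' do begin
      Inc(Cursor);
      r := NotFactor;      { read the operand before combining }
      v := v and r;
   end;
   BoolTerm := v;
end;

function BoolExpression: boolean;
var v, r: boolean;
begin
   v := BoolTerm;
   while Peek in ['|', '~'] do
      if Peek = '|' then begin
         Inc(Cursor);
         r := BoolTerm;
         v := v or r;
      end
      else begin
         Inc(Cursor);
         r := BoolTerm;
         v := v xor r;
      end;
   BoolExpression := v;
end;

begin
   Cursor := 1;
   WriteLn(BoolExpression);   { TRUE }
end.
```

Each operand is read into r before it is combined. Writing "v and NotFactor" directly could let a short-circuiting compiler skip the call, and with it the input that the call is supposed to consume.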



If you've been following what we did in the parser for math expressions, you know that what we did next was to expand the definition of a factor to include variables and parens. We don't have to do that for the Boolean factor, because those little items get taken care of by the next step. It takes just a one-line addition to BoolFactor to take care of relations:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Factor }

procedure BoolFactor;

begin

if IsBoolean(Look) then

if GetBoolean then

EmitLn('MOVE #-1,D0')

else

EmitLn('CLR D0')

else Relation;

end;

{--------------------------------------------------------------}

You might be wondering when I'm going to provide for Boolean variables and parenthesized Boolean expressions. The answer is, I'm NOT! Remember, we took those out of the grammar earlier. Right now all I'm doing is encoding the grammar we've already agreed upon. The compiler itself can't tell the difference between a Boolean variable or expression and an arithmetic one ... all of those will be handled by Relation, either way.

Of course, it would help to have some code for Relation. I don't feel comfortable, though, adding any more code without first checking out what we already have. So for now let's just write a dummy version of Relation that does nothing except eat the current character, and write a little message:

{---------------------------------------------------------------}

{ Parse and Translate a Relation }

procedure Relation;

begin

WriteLn('<Relation>');

GetChar;

end;

{--------------------------------------------------------------}

OK, key in this code and give it a try. All the old things should still work ... you should be able to generate the code for ANDs, ORs, and NOTs. In addition, if you type any alphabetic character you should get a little <Relation> place-holder, where a Boolean factor should be. Did you get that? Fine, then let's move on to the full-blown version of Relation.

To get that, though, there is a bit of groundwork that we must lay first. Recall that a relation has the form

<relation> ::= <expression> [<relop> <expression>]

Since we have a new kind of operator, we're also going to need a new Boolean function to recognize it. That function is shown below. Because of the single-character limitation, I'm sticking to the four operators that can be encoded with such a character (the "not equals" is encoded by '#').

{--------------------------------------------------------------}

{ Recognize a Relop }

function IsRelop(c: char): Boolean;

begin

IsRelop := c in ['=', '#', '<', '>'];

end;

{--------------------------------------------------------------}

Now, recall that we're using a zero or a -1 in register D0 to represent a Boolean value, and also that the loop constructs expect the flags to be set to correspond. In implementing all this on the 68000, things get a little bit tricky.

Since the loop constructs operate only on the flags, it would be nice (and also quite efficient) just to set up those flags, and not load anything into D0 at all. This would be fine for the loops and branches, but remember that the relation can be used ANYWHERE a Boolean factor could be used. We may be storing its result to a Boolean variable. Since we can't know at this point how the result is going to be used, we must allow for BOTH cases.

Comparing numeric data is easy enough ... the 68000 has an operation for that ... but it sets the flags, not a value. What's more, the flags will always be set the same (zero if equal, etc.), while we need the zero flag set differently for each of the different relops.

The solution is found in the 68000 instruction Scc, which sets a byte value to all zeros or all ones (00 or FF hex; funny how that works!) depending upon the result of the specified condition. If we make the destination byte the low byte of D0, we get the Boolean value needed.

Unfortunately, there's one final complication: unlike almost every other instruction in the 68000 set, Scc does NOT reset the condition flags to match the data being stored. So we have to do one last step, which is to test D0 and set the flags to match it. It must seem to be a trip around the moon to get what we want: we first perform the test, then test the flags to set data into D0, then test D0 to set the flags again. It is sort of roundabout, but it's the most straightforward way to get the flags right, and after all it's only a couple of instructions.
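The three-step round trip can be modeled in a few lines of Pascal. This is a deliberately simplified model under stated assumptions, not an emulation of the 68000: zeroFlag stands in for the Z condition flag, d0 for the register, and the operand values are arbitrary.

```pascal
program SccRoundTrip;

{ Hypothetical model of the CMP / SEQ / TST sequence for '='. }
{ zeroFlag and d0 are stand-ins, not real hardware state.     }

var left, right: integer;
    zeroFlag: boolean;
    d0: integer;

begin
   left := 5;
   right := 5;
   { Step 1, CMP: the flags reflect the comparison. }
   zeroFlag := (right - left) = 0;
   { Step 2, SEQ: the flag becomes data, all ones or all zeros. }
   if zeroFlag then d0 := -1 else d0 := 0;
   { Step 3, TST: the flags are set again, now from the data. }
   zeroFlag := d0 = 0;
   WriteLn(d0, ' ', zeroFlag);   { -1 FALSE }
end.
```

After the third step the zero flag means "the relation is false", which is the sense a following BEQ in the control constructs relies on to skip the block.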

I might mention here that this area is, in my opinion, the one that represents the biggest difference between the efficiency of hand-coded assembler language and compiler-generated code. We have seen already that we lose efficiency in arithmetic operations, although later I plan to show you how to improve that a bit. We've also seen that the control constructs themselves can be done quite efficiently ... it's usually very difficult to improve on the code generated for an IF or a WHILE. But virtually every compiler I've ever seen generates terrible code, compared to assembler, for the computation of a Boolean function, and particularly for relations. The reason is just what I've hinted at above. When I'm writing code in assembler, I go ahead and perform the test the most convenient way I can, and then set up the branch so that it goes the way it should. In effect, I "tailor" every branch to the situation. The compiler can't do that (practically), and it also can't know that we don't want to store the result of the test as a Boolean variable. So it must generate the code in a very strict order, and it often ends up loading the result as a Boolean that never gets used for anything.

In any case, we're now ready to look at the code for Relation. It's shown below with its companion procedures:

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Equals" }

procedure Equals;
begin
   Match('=');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SEQ D0');
end;

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;
begin
   Match('#');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SNE D0');
end;

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
   Match('<');
   Expression;
   EmitLn('CMP (SP)+,D0');       { flags reflect right - left }
   EmitLn('SGT D0');             { right > left, so left < right }
end;

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
   Match('>');
   Expression;
   EmitLn('CMP (SP)+,D0');       { flags reflect right - left }
   EmitLn('SLT D0');             { right < left, so left > right }
end;

{---------------------------------------------------------------}

{ Parse and Translate a Relation }

procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
      EmitLn('TST D0');
   end;
end;

{---------------------------------------------------------------}



Now, that call to Expression looks familiar! Here is where the editor of your system comes in handy. We have already generated code for Expression and its buddies in previous sessions. You can copy them into your file now. Remember to use the single-character versions. Just to be certain, I've duplicated the arithmetic procedures below. If you're observant, you'll also see that I've changed them a little to make them correspond to the latest version of the syntax. This change is NOT necessary, so you may prefer to hold off on that until you're sure everything is working.

{---------------------------------------------------------------}

{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;

{---------------------------------------------------------------}

{ Parse and Translate the First Math Factor }

procedure SignedFactor;
begin
   if Look = '+' then
      GetChar;
   if Look = '-' then begin
      GetChar;
      if IsDigit(Look) then
         EmitLn('MOVE #-' + GetNum + ',D0')
      else begin
         Factor;
         EmitLn('NEG D0');
      end;
   end
   else Factor;
end;

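{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}

{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXT.L D0');
   EmitLn('DIVS D1,D0');
end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

procedure Term;
begin
   SignedFactor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;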

{---------------------------------------------------------------}


procedure Add;

begin

Match('+');

Term;


end;

{---------------------------------------------------------------}


procedure Subtract;

begin

Match('-');

Term;


EmitLn('NEG D0');

end;



{---------------------------------------------------------------}



begin

Term;



case Look of

'+': Add;

'-': Subtract;

end;

end;

end;

{---------------------------------------------------------------}

There you have it ... a parser that can handle both arithmetic AND Boolean algebra, andthings that combine the two through the use of relops. I suggest you file away a copy of thisparser in a safe place for future reference, because in our next step we're going to be chop-ping it up.




MERGING WITH CONTROL CONSTRUCTS

At this point, let's go back to the file we had previously built that parses control constructs. Remember those little dummy procedures called Condition and Expression? Now you know what goes in their places!

I warn you, you're going to have to do some creative editing here, so take your time and get it right. What you need to do is to copy all of the procedures from the logic parser, from Ident through BoolExpression, into the parser for control constructs. Insert them at the current location of Condition. Then delete that procedure, as well as the dummy Expression. Next, change every call to Condition to refer to BoolExpression instead. Finally, copy the procedures IsMulop, IsOrOp, IsRelop, IsBoolean, and GetBoolean into place. That should do it.

Compile the resulting program and give it a try. Since we haven't used this program in a while, don't forget that we used single-character tokens for IF, WHILE, etc. Also don't forget that any letter not a keyword just gets echoed as a block.

Try

ia=bxlye

which stands for "IF a=b X ELSE Y ENDIF".

What do you think? Did it work? Try some others.



ADDING ASSIGNMENTS

As long as we're this far, and we already have the routines for expressions in place, we might as well replace the "blocks" with real assignment statements. We've already done that before, so it won't be too hard. Before taking that step, though, we need to fix something else.

We're soon going to find that the one-line "programs" that we're having to write here will really cramp our style. At the moment we have no cure for that, because our parser doesn't recognize the end-of-line characters, the carriage return (CR) and the line feed (LF). So before going any further let's plug that hole.

There are a couple of ways to deal with the CR/LFs. One (the C/Unix approach) is just to treat them as additional white space characters and ignore them. That's actually not such a bad approach, but it does sort of produce funny results for our parser as it stands now. If it were reading its input from a source file as any self-respecting REAL compiler does, there would be no problem. But we're reading input from the keyboard, and we're sort of conditioned to expect something to happen when we hit the return key. It won't, if we just skip over the CR and LF (try it). So I'm going to use a different method here, which is NOT necessarily the best approach in the long run. Consider it a temporary kludge until we're further along.

Instead of skipping the CR/LF, we'll let the parser go ahead and catch them, then introduce a special procedure, analogous to SkipWhite, that skips them only in specified "legal" spots.




Here's the procedure:

{--------------------------------------------------------------}

{ Skip a CRLF }

procedure Fin;

begin

if Look = CR then GetChar;

if Look = LF then GetChar;

end;

{--------------------------------------------------------------}



Now, add two calls to Fin in procedure Block, like this:

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      Fin;
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
      else Other;
      end;
      Fin;
   end;
end;

{--------------------------------------------------------------}




Now, you'll find that you can use multiple-line "programs." The only restriction is that you can't separate an IF or WHILE token from its predicate.

Now we're ready to include the assignment statements. Simply change that call to Other in procedure Block to a call to Assignment, and add the following procedure, copied from one of our earlier programs. Note that Assignment now calls BoolExpression, so that we can assign Boolean variables.

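{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   BoolExpression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;

{--------------------------------------------------------------}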

With that change, you should now be able to write reasonably realistic-looking programs, subject only to our limitation on single-character tokens. My original intention was to get rid of that limitation for you, too. However, that's going to require a fairly major change to what we've done so far. We need a true lexical scanner, and that requires some structural changes. They are not BIG changes that require us to throw away all of what we've done so far ... with care, it can be done with very minimal changes, in fact. But it does require that care.

This installment has already gotten pretty long, and it contains some pretty heavy stuff, so I've decided to leave that step until next time, when you've had a little more time to digest what we've done and are ready to start fresh.

In the next installment, then, we'll build a lexical scanner and eliminate the single-character barrier once and for all. We'll also write our first complete compiler, based on what we've done in this session. See you then.




Part 7 - Lexical Scanning

INTRODUCTION

In the last installment, I left you with a compiler that would ALMOST work, except that we were still limited to single-character tokens. The purpose of this session is to get rid of that restriction, once and for all. This means that we must deal with the concept of the lexical scanner.

Maybe I should mention why we need a lexical scanner at all ... after all, we've been able to manage all right without one, up till now, even when we provided for multi-character tokens.

The ONLY reason, really, has to do with keywords. It's a fact of computer life that the syntax for a keyword has the same form as that for any other identifier. We can't tell until we get the complete word whether or not it IS a keyword. For example, the variable IFILE and the keyword IF look just alike, until you get to the third character. In the examples to date, we were always able to make a decision based upon the first character of the token, but that's no longer possible when keywords are present. We need to know that a given string is a keyword BEFORE we begin to process it. And that's why we need a scanner.

In the last session, I also promised that we would be able to provide for normal tokens without making wholesale changes to what we have already done. I didn't lie ... we can, as you will see later. But every time I set out to install these elements of the software into the parser we have already built, I had bad feelings about it. The whole thing felt entirely too much like a band-aid. I finally figured out what was causing the problem: I was installing lexical scanning software without first explaining to you what scanning is all about, and what the alternatives are. Up till now, I have studiously avoided giving you a lot of theory, and certainly not alternatives. I generally don't respond well to the textbooks that give you twenty-five different ways to do something, but no clue as to which way best fits your needs. I've tried to avoid that pitfall by just showing you ONE method, that WORKS.

But this is an important area. While the lexical scanner is hardly the most exciting part of a compiler, it often has the most profound effect on the general "look & feel" of the language, since after all it's the part closest to the user. I have a particular structure in mind for the scanner to be used with KISS. It fits the look & feel that I want for that language. But it may not work at all for the language YOU'RE cooking up, so in this one case I feel that it's important for you to know your options.

So I'm going to depart, again, from my usual format. In this session we'll be getting much deeper than usual into the basic theory of languages and grammars. I'll also be talking about areas OTHER than compilers in which lexical scanning plays an important role. Finally, I will show you some alternatives for the structure of the lexical scanner. Then, and only then, will we get back to our parser from the last installment. Bear with me ... I think you'll find it's worth the wait. In fact, since scanners have many applications outside of compilers, you may well find this to be the most useful session for you.




LEXICAL SCANNING

Lexical scanning is the process of scanning the stream of input characters and separating it into strings called tokens. Most compiler texts start here, and devote several chapters to discussing various ways to build scanners. This approach has its place, but as you have already seen, there is a lot you can do without ever even addressing the issue, and in fact the scanner we'll end up with here won't look much like what the texts describe. The reason? Compiler theory and, consequently, the programs resulting from it, must deal with the most general kind of parsing rules. We don't. In the real world, it is possible to specify the language syntax in such a way that a pretty simple scanner will suffice. And as always, KISS is our motto.

Typically, lexical scanning is done in a separate part of the compiler, so that the parser per se sees only a stream of input tokens. Now, theoretically it is not necessary to separate this function from the rest of the parser. There is only one set of syntax equations that define the whole language, so in theory we could write the whole parser in one module.

Why the separation? The answer has both practical and theoretical bases.

In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of grammars. They are:

o Type 0: Unrestricted (e.g., English)

o Type 1: Context-Sensitive

o Type 2: Context-Free

o Type 3: Regular

A few features of the typical programming language (particularly the older ones, such as FORTRAN) are Type 1, but for the most part all modern languages can be described using only the last two types, and those are all we'll be dealing with here.

The neat part about these two types is that there are very specific ways to parse them. It has been shown that any regular grammar can be parsed using a particular form of abstract machine called the state machine (finite automaton). We have already implemented state machines in some of our recognizers.

Similarly, Type 2 (context-free) grammars can always be parsed using a push-down automaton (a state machine augmented by a stack). We have also implemented these machines. Instead of implementing a literal stack, we have relied on the built-in stack associated with recursive coding to do the job, and that in fact is the preferred approach for top-down parsing.

Now, it happens that in real, practical grammars, the parts that qualify as regular expressions tend to be the lower-level parts, such as the definition of an identifier:

<ident> ::= <letter> [ <letter> | <digit> ]*

Since it takes a different kind of abstract machine to parse the two types of grammars, it makes sense to separate these lower-level functions into a separate module, the lexical scanner, which is built around the idea of a state machine. The idea is to use the simplest parsing technique needed for the job.

There is another, more practical reason for separating scanner from parser. We like to think of the input source file as a stream of characters, which we process left to right without backtracking. In practice that isn't possible. Almost every language has certain keywords such as IF, WHILE, and END. As I mentioned earlier, we can't really know whether a given character string is a keyword, until we've reached the end of it, as defined by a space or other delimiter. So in that sense, we MUST save the string long enough to find out whether we have a keyword or not. That's a limited form of backtracking.

So the structure of a conventional compiler involves splitting up the functions of the lower-level and higher-level parsing. The lexical scanner deals with things at the character level, collecting characters into strings, etc., and passing them along to the parser proper as indivisible tokens. It's also considered normal to let the scanner have the job of identifying keywords.




STATE MACHINES AND ALTERNATIVES

I mentioned that the regular expressions can be parsed using a state machine. In most compiler texts, and indeed in most compilers as well, you will find this taken literally. There is typically a real implementation of the state machine, with integers used to define the current state, and a table of actions to take for each combination of current state and input character. If you write a compiler front end using the popular Unix tools LEX and YACC, that's what you'll get. The output of LEX is a state machine implemented in C, plus a table of actions corresponding to the input grammar given to LEX. The YACC output is similar ... a canned table-driven parser, plus the table corresponding to the language syntax.

That is not the only choice, though. In our previous installments, you have seen over and over that it is possible to implement parsers without dealing specifically with tables, stacks, or state variables. In fact, in Installment V I warned you that if you find yourself needing these things you might be doing something wrong, and not taking advantage of the power of Pascal. There are basically two ways to define a state machine's state: explicitly, with a state number or code, and implicitly, simply by virtue of the fact that I'm at a certain place in the code (if it's Tuesday, this must be Belgium). We've relied heavily on the implicit approaches before, and I think you'll find that they work well here, too.
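For contrast, here is roughly what the explicit style looks like for the identifier rule. This is only a sketch, not code from this series: the state names, MatchIdent, and the use of a NUL character as an end-of-input marker are all assumptions made for the illustration, and it reads from a string rather than the keyboard so that it stands alone.

```pascal
program ExplicitState;
{ Explicit-state recognizer for <ident> ::= <letter> [ <letter> | <digit> ]* }
{ All names below are illustrative; none of them come from the cradle.       }

type
  State = (StStart, StInIdent, StFinish, StError);

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

{ Return the identifier at the start of s, or '' if there is none }
function MatchIdent(const s: string): string;
var
  st: State;
  i: integer;
  c: char;
  x: string;
begin
  st := StStart;
  i := 1;
  x := '';
  while (st <> StFinish) and (st <> StError) do begin
    { A NUL character stands in for "end of input" }
    if i <= Length(s) then c := s[i] else c := #0;
    { Next state depends only on the current state and the current character }
    case st of
      StStart:
        if IsAlpha(c) then st := StInIdent else st := StError;
      StInIdent:
        if not (IsAlpha(c) or IsDigit(c)) then st := StFinish;
    end;
    { Consume the character only while we are inside the identifier }
    if st = StInIdent then begin
      x := x + c;
      Inc(i);
    end;
  end;
  if st = StError then x := '';
  MatchIdent := x;
end;

begin
  WriteLn(MatchIdent('count1 = 5'));
  WriteLn(MatchIdent('9lives') = '');
end.
```

The state variable and the case statement do the job that, in the implicit style, is done simply by where the program counter happens to be.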

In practice, it may not even be necessary to HAVE a well-defined lexical scanner. This isn't our first experience at dealing with multi-character tokens. In Installment III, we extended our parser to provide for them, and we didn't even NEED a lexical scanner. That was because in that narrow context, we could always tell, just by looking at the single lookahead character, whether we were dealing with a number, a variable, or an operator. In effect, we built a distributed lexical scanner, using procedures GetName and GetNum.

With keywords present, we can't know anymore what we're dealing with, until the entire token is read. This leads us to a more localized scanner; although, as you will see, the idea of a distributed scanner still has its merits.



SOME EXPERIMENTS IN SCANNING

Before getting back to our compiler, it will be useful to experiment a bit with the general concepts.

Let's begin with the two definitions most often seen in real programming languages:

<ident> ::= <letter> [ <letter> | <digit> ]*

<number> ::= [<digit>]+

(Remember, the '*' indicates zero or more occurrences of the terms in brackets, and the '+', one or more.)

We have already dealt with similar items in Installment III. Let's begin (as usual) with a bare cradle. Not surprisingly, we are going to need a new recognizer:

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}




Using this, let's write the following two routines, which are very similar to those we've used before:

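{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var x: string[8];
begin
   x := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      x := x + UpCase(Look);
      GetChar;
   end;
   GetName := x;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var x: string[16];
begin
   x := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetNum := x;
end;

{--------------------------------------------------------------}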

(Notice that this version of GetNum returns a string, not an integer as before.)

You can easily verify that these routines work by calling them from the main program, as in

WriteLn(GetName);

This program will print any legal name typed in (maximum eight characters, since that's what we told GetName). It will reject anything else.

Test the other routine similarly.




WHITE SPACE

We also have dealt with embedded white space before, using the two routines IsWhite and SkipWhite. Make sure that these routines are in your current version of the cradle, and add the line

SkipWhite;

at the end of both GetName and GetNum.

Now, let's define the new procedure:

{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;

{--------------------------------------------------------------}



We can call this from the new main program:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Token := Scan;

writeln(Token);

until Token = CR;

end.

{--------------------------------------------------------------}

(You will have to add the declaration of the string Token at the beginning of the program. Make it any convenient length, say 16 characters.) Now, run the program. Note how the input string is, indeed, separated into distinct tokens.
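If you'd rather see the whole experiment in one piece without typing at the keyboard, something along these lines should do. It is a sketch under stated assumptions: Source, SrcPos, Tokenize, the one-argument Init, and the NUL end-of-input marker are stand-ins invented for the demo (the real cradle reads from the console), and the Expected checks are left out for brevity.

```pascal
program ScanDemo;
{ The scanner from the text, fed from a string instead of the keyboard.   }
{ Source, SrcPos, Tokenize and the #0 end marker are demo-only assumptions.}

const
  TAB = #9;

var
  Source: string;     { input text, standing in for the console }
  SrcPos: integer;    { index of the next unread character }
  Look: char;         { lookahead character }

procedure GetChar;
begin
  if SrcPos <= Length(Source) then begin
    Look := Source[SrcPos];
    Inc(SrcPos);
  end
  else
    Look := #0;       { end of input }
end;

function IsAlpha(c: char): boolean;
begin IsAlpha := UpCase(c) in ['A'..'Z']; end;

function IsDigit(c: char): boolean;
begin IsDigit := c in ['0'..'9']; end;

function IsAlNum(c: char): boolean;
begin IsAlNum := IsAlpha(c) or IsDigit(c); end;

procedure SkipWhite;
begin
  while Look in [' ', TAB] do GetChar;
end;

function GetName: string;
var x: string;
begin
  x := '';
  while IsAlNum(Look) do begin
    x := x + UpCase(Look);
    GetChar;
  end;
  SkipWhite;
  GetName := x;
end;

function GetNum: string;
var x: string;
begin
  x := '';
  while IsDigit(Look) do begin
    x := x + Look;
    GetChar;
  end;
  SkipWhite;
  GetNum := x;
end;

function Scan: string;
begin
  if IsAlpha(Look) then
    Scan := GetName
  else if IsDigit(Look) then
    Scan := GetNum
  else begin
    Scan := Look;     { any other character is a one-character token }
    GetChar;
  end;
  SkipWhite;
end;

procedure Init(const s: string);
begin
  Source := s;
  SrcPos := 1;
  GetChar;
  SkipWhite;
end;

{ Scan all of s and return the tokens joined with '|' }
function Tokenize(const s: string): string;
var r: string;
begin
  Init(s);
  r := '';
  while Look <> #0 do begin
    if r <> '' then r := r + '|';
    r := r + Scan;
  end;
  Tokenize := r;
end;

begin
  WriteLn(Tokenize('total = rate2*60'));   { TOTAL|=|RATE2|*|60 }
end.
```

Note that the name comes back in upper case, because GetName applies UpCase to every character it collects.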




STATE MACHINES

For the record, a parse routine like GetName does indeed implement a state machine. The state is implicit in the current position in the code. A very useful trick for visualizing what's going on is the syntax diagram, or "railroad-track" diagram. It's a little difficult to draw one in this medium, so I'll use them very sparingly, but the figure below should give you the idea:

           |-----> Other---------------------------> Error
           |
   Start -------> Letter ---------------> Other -----> Finish
           ^                        V
           |                        |
           |<----- Letter <---------|
           |                        |
           |<----- Digit  <----------

As you can see, this diagram shows how the logic flows as characters are read. Things begin, of course, in the start state, and end when a character other than an alphanumeric is found. If the first character is not alpha, an error occurs. Otherwise the machine will continue looping until the terminating delimiter is found.

Note that at any point in the flow, our position is entirely dependent on the past history of the input characters. At that point, the action to be taken depends only on the current state, plus the current input character. That's what makes this a state machine.

Because of the difficulty of drawing railroad-track diagrams in this medium, I'll continue to stick to syntax equations from now on. But I highly recommend the diagrams to you for anything you do that involves parsing. After a little practice you can begin to see how to write a parser directly from the diagrams. Parallel paths get coded into guarded actions (guarded by IF's or CASE statements), serial paths into sequential calls. It's almost like working from a schematic.

We didn't even discuss SkipWhite, which was introduced earlier, but it also is a simple state machine, as is GetNum. So is their parent procedure, Scan. Little machines make big machines.

The neat thing that I'd like you to note is how painlessly this implicit approach creates these state machines. I personally prefer it a lot over the table-driven approach. It also results in a small, tight, and fast scanner.




NEWLINES

Moving right along, let's modify our scanner to handle more than one line. As I mentioned last time, the most straightforward way to do this is to simply treat the newline characters, carriage return and line feed, as white space. This is, in fact, the way the C standard library routine, isspace, works. We didn't actually try this before. I'd like to do it now, so you can get a feel for the results.

To do this, simply modify the single executable line of IsWhite to read:

IsWhite := c in [' ', TAB, CR, LF];

We need to give the main program a new stop condition, since it will never see a CR. Let's just use:

until Token = '.';

OK, compile this program and run it. Try a couple of lines, terminated by the period. I used:

now is the time for all good men.

Hey, what happened? When I tried it, I didn't get the last token, the period. The program didn't halt. What's more, when I pressed the 'enter' key a few times, I still didn't get the period.

If you're still stuck in your program, you'll find that typing a period on a new line will terminate it.

What's going on here? The answer is that we're hanging up in SkipWhite. A quick look at that routine will show that as long as we're typing null lines, we're going to just continue to loop. After SkipWhite encounters an LF, it tries to execute a GetChar. But since the input buffer is now empty, GetChar's read statement insists on having another line. Procedure Scan gets the terminating period, all right, but it calls SkipWhite to clean up, and SkipWhite won't return until it gets a non-null line.

This kind of behavior is not quite as bad as it seems. In a real compiler, we'd be reading from an input file instead of the console, and as long as we have some procedure for dealing with end-of-files, everything will come out OK. But for reading data from the console, the behavior is just too bizarre. The fact of the matter is that the C/Unix convention is just not compatible with the structure of our parser, which calls for a lookahead character. The code that the Bell wizards have implemented doesn't use that convention, which is why they need 'ungetc'.

OK, let's fix the problem. To do that, we need to go back to the old definition of IsWhite (delete the CR and LF characters) and make use of the procedure Fin that I introduced last time. If it's not in your current version of the cradle, put it there now.

Also, modify the main program to read:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Token := Scan;

writeln(Token);

if Token = CR then Fin;

until Token = '.';

end.

{--------------------------------------------------------------}




Note the "guard" test preceding the call to Fin. That's what makes the whole thing work, and ensures that we don't try to read a line ahead.

Try the code now. I think you'll like it better.

If you refer to the code we did in the last installment, you'll find that I quietly sprinkled calls to Fin throughout the code, wherever a line break was appropriate. This is one of those areas that really affects the look & feel that I mentioned. At this point I would urge you to experiment with different arrangements and see how you like them. If you want your language to be truly free-field, then newlines should be transparent. In this case, the best approach is to put the following lines at the BEGINNING of Scan:

while Look = CR do

Fin;

If, on the other hand, you want a line-oriented language like Assembler, BASIC, or FORTRAN (or even Ada... note that it has comments terminated by newlines), then you'll need for Scan to return CR's as tokens. It must also eat the trailing LF. The best way to do that is to use this line, again at the beginning of Scan:

if Look = LF then Fin;

For other conventions, you'll have to use other arrangements. In my example of the last session, I allowed newlines only at specific places, so I was somewhere in the middle ground. In the rest of these sessions, I'll be picking ways to handle newlines that I happen to like, but I want you to know how to choose other ways for yourselves.



OPERATORS

We could stop now and have a pretty useful scanner for our purposes. In the fragments of KISS that we've built so far, the only tokens that have multiple characters are the identifiers and numbers. All operators were single characters. The only exception I can think of is the relops <=, >=, and <>, but they could be dealt with as special cases.

Still, other languages have multi-character operators, such as the ':=' of Pascal or the '++' and '>>' of C. So while we may not need multi-character operators, it's nice to know how to get them if necessary.

Needless to say, we can handle operators very much the same way as the other tokens. Let's start with a recognizer:

{--------------------------------------------------------------}

{ Recognize Any Operator }

function IsOp(c: char): boolean;

begin

IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];

end;

{--------------------------------------------------------------}

It's important to note that we DON'T have to include every possible operator in this list. For example, the parentheses aren't included, nor is the terminating period. The current version of Scan handles single-character operators just fine as it is. The list above includes only those characters that can appear in multi-character operators. (For specific languages, of course, the list can always be edited.)




Now, let's modify Scan to read:

{--------------------------------------------------------------}
{ Lexical Scanner }

Function Scan: string;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else if IsOp(Look) then
      Scan := GetOp
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;

{--------------------------------------------------------------}

Try the program now. You will find that any code fragments you care to throw at it will be neatly broken up into individual tokens.
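Scan calls a function GetOp that is not listed in this section. A plausible version, assuming it simply mirrors GetNum with IsOp in place of IsDigit, might look like the sketch below. The string-fed cradle (Source, SrcPos, the NUL end marker) and the helper FirstOp exist only to make the example stand alone; in the real cradle you would presumably also want the usual guard, if not IsOp(Look) then Expected('Operator').

```pascal
program GetOpDemo;
{ Sketch of a GetOp to go with IsOp and Scan, modelled on GetNum.        }
{ Source, SrcPos, FirstOp and the #0 end marker are demo-only assumptions.}

var
  Source: string;
  SrcPos: integer;
  Look: char;

procedure GetChar;
begin
  if SrcPos <= Length(Source) then begin
    Look := Source[SrcPos];
    Inc(SrcPos);
  end
  else
    Look := #0;
end;

{ Recognize Any Operator (as in the text) }
function IsOp(c: char): boolean;
begin
  IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;

{ Collect a run of operator characters into a single token }
function GetOp: string;
var x: string;
begin
  x := '';
  while IsOp(Look) do begin
    x := x + Look;
    GetChar;
  end;
  GetOp := x;
end;

{ Load s and return the operator it starts with }
function FirstOp(const s: string): string;
begin
  Source := s;
  SrcPos := 1;
  GetChar;
  FirstOp := GetOp;
end;

begin
  WriteLn(FirstOp('<=b'));   { <= }
  WriteLn(FirstOp(':=1'));   { := }
  WriteLn(FirstOp('+-x'));   { +- }
end.
```

The last line shows the price of this simple approach: any run of operator characters is glued into one token, so an input like a=-b would yield the "operator" =-. Whether that matters depends on the language you are defining.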



LISTS, COMMAS AND COMMAND LINES

Before getting back to the main thrust of our study, I'd like to get on my soapbox for a moment.

How many times have you worked with a program or operating system that had rigid rules about how you must separate items in a list? (Try, the last time you used MSDOS!) Some programs require spaces as delimiters, and some require commas. Worst of all, some require both, in different places. Most are pretty unforgiving about violations of their rules.

I think this is inexcusable. It's too easy to write a parser that will handle both spaces and commas in a flexible way. Consider the following procedure:

{--------------------------------------------------------------}

{ Skip Over a Comma }

procedure SkipComma;

begin

SkipWhite;

if Look = ',' then begin

GetChar;

SkipWhite;

end;

end;

{--------------------------------------------------------------}

This eight-line procedure will skip over a delimiter consisting of any number (including zero) of spaces, with zero or one comma embedded in the string.




TEMPORARILY, change the call to SkipWhite in Scan to a call to SkipComma, and try inputting some lists. Works nicely, eh? Don't you wish more software authors knew about SkipComma?

For the record, I found that adding the equivalent of SkipComma to my Z80 assembler-language programs took all of 6 (six) extra bytes of code. Even in a 64K machine, that's not a very high price to pay for user-friendliness!
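To see SkipComma at work outside the scanner, here is a small list splitter built around it. Treat it as a sketch: SkipComma and SkipWhite follow the text, but GetItem, SplitList, the string-fed input (Source, SrcPos) and the NUL end marker are assumptions made up for this demonstration.

```pascal
program CommaDemo;
{ Splitting a list whose items are separated by blanks, commas, or both.  }
{ GetItem, SplitList, Source, SrcPos and the #0 marker are demo-only.     }

const
  TAB = #9;

var
  Source: string;
  SrcPos: integer;
  Look: char;

procedure GetChar;
begin
  if SrcPos <= Length(Source) then begin
    Look := Source[SrcPos];
    Inc(SrcPos);
  end
  else
    Look := #0;
end;

procedure SkipWhite;
begin
  while Look in [' ', TAB] do
    GetChar;
end;

{ Skip Over a Comma (as in the text) }
procedure SkipComma;
begin
  SkipWhite;
  if Look = ',' then begin
    GetChar;
    SkipWhite;
  end;
end;

{ Read one item: everything up to a blank, a comma, or end of input }
function GetItem: string;
var x: string;
begin
  x := '';
  while not (Look in [' ', TAB, ',', #0]) do begin
    x := x + Look;
    GetChar;
  end;
  GetItem := x;
end;

{ Split a list and return the items joined with '|' }
function SplitList(const s: string): string;
var r, item: string;
begin
  Source := s;
  SrcPos := 1;
  GetChar;
  SkipWhite;
  r := '';
  while Look <> #0 do begin
    item := GetItem;
    if item = '' then
      GetChar               { stray delimiter, e.g. a doubled comma }
    else begin
      if r <> '' then r := r + '|';
      r := r + item;
    end;
    SkipComma;
  end;
  SplitList := r;
end;

begin
  WriteLn(SplitList('a, b c ,d,e'));   { a|b|c|d|e }
end.
```

All four delimiter styles in the test string (comma-space, space, space-comma, bare comma) come out the same way, which is the whole point.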

I think you can see where I'm going here. Even if you never write a line of compiler code in your life, there are places in every program where you can use the concepts of parsing. Any program that processes a command line needs them. In fact, if you think about it for a bit, you'll have to conclude that any time you write a program that processes user inputs, you're defining a language. People communicate with languages, and the syntax implicit in your program defines that language. The real question is: are you going to define it deliberately and explicitly, or just let it turn out to be whatever the program ends up parsing?

I claim that you'll have a better, more user-friendly program if you'll take the time to define the syntax explicitly. Write down the syntax equations or draw the railroad-track diagrams, and code the parser using the techniques I've shown you here. You'll end up with a better program, and it will be easier to write, to boot.



GETTING FANCY

OK, at this point we have a pretty nice lexical scanner that will break an input stream up into tokens. We could use it as it stands and have a serviceable compiler. But there are some other aspects of lexical scanning that we need to cover.

The main consideration is <shudder> efficiency. Remember when we were dealing with single-character tokens, every test was a comparison of a single character, Look, with a byte constant. We also used the Case statement heavily.

With the multi-character tokens being returned by Scan, all those tests now become string comparisons. Much slower. And not only slower, but more awkward, since there is no string equivalent of the Case statement in Pascal. It seems especially wasteful to test for what used to be single characters ... the '=', '+', and other operators ... using string comparisons.

Using string comparison is not impossible ... Ron Cain used just that approach in writing Small C. Since we're sticking to the KISS principle here, we would be truly justified in settling for this approach. But then I would have failed to tell you about one of the key approaches used in "real" compilers.

You have to remember: the lexical scanner is going to be called a _LOT_! Once for every token in the whole source program, in fact. Experiments have indicated that the average compiler spends anywhere from 20% to 40% of its time in the scanner routines. If there were ever a place where efficiency deserves real consideration, this is it.

For this reason, most compiler writers ask the lexical scanner to do a little more work, by "tokenizing" the input stream. The idea is to match every token against a list of acceptable keywords and operators, and return unique codes for each one recognized. In the case of ordinary variable names or numbers, we just return a code that says what kind of token they are, and save the actual string somewhere else.

One of the first things we're going to need is a way to identify keywords. We can always do it with successive IF tests, but it surely would be nice if we had a general-purpose routine that could compare a given string with a table of keywords. (By the way, we're also going to need such a routine later, for dealing with symbol tables.) This usually presents a problem in Pascal, because standard Pascal doesn't allow for arrays of variable lengths. It's a real bother to have to declare a different search routine for every table. Standard Pascal also doesn't allow for initializing arrays, so you tend to see code like

Table[1] := 'IF';

Table[2] := 'ELSE';

.

.

Table[n] := 'END';

which can get pretty old if there are many keywords.

Fortunately, Turbo Pascal 4.0 has extensions that eliminate both of these problems. Constant arrays can be declared using TP's "typed constant" facility, and the variable dimensions can be handled with its C-like extensions for pointers.

First, modify your declarations like this:

{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}



(The dimension used in SymTab is not real ... no storage is allocated by the declaration itself, and the number need only be "big enough.")

Now, just beneath those declarations, add the following:

{--------------------------------------------------------------}

{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =

('IF', 'ELSE', 'ENDIF', 'END');

{--------------------------------------------------------------}




Next, insert the following new function:

{--------------------------------------------------------------}

{ Table Lookup }

{ If the input string matches a table entry, return the entry index. If not, return a zero. }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;

found: boolean;

begin

found := false;

i := n;

while (i > 0) and not found do

if s = T^[i] then

found := true

else

dec(i);

Lookup := i;

end;

{--------------------------------------------------------------}



To test it, you can temporarily change the main program as follows:

{--------------------------------------------------------------}

{ Main Program }

begin

ReadLn(Token);

WriteLn(Lookup(Addr(KWList), Token, 4));

end.

{--------------------------------------------------------------}

Notice how Lookup is called: The Addr function sets up a pointer to KWList, which gets passed to Lookup.

OK, give this a try. Since we're bypassing Scan here, you'll have to type the keywords in upper case to get any matches.
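If your compiler is newer than the Turbo Pascal 4.0 used here, there may be a way to get the same effect without the pointer trick. The sketch below assumes a dialect with open array parameters (later Turbo Pascal releases, Delphi and Free Pascal have them); it is an alternative for comparison, not the routine the rest of this series uses. It also searches forward rather than backward, which makes no difference as long as the table entries are unique.

```pascal
program LookupDemo;
{ Table lookup using an open array parameter instead of a TabPtr.   }
{ Assumes a dialect that supports "array of" parameters and High(). }

type
  Symbol = string[8];

const
  KWlist: array [1..4] of Symbol = ('IF', 'ELSE', 'ENDIF', 'END');

{ If the input string matches a table entry, return the entry index }
{ (counting from 1). If not, return a zero.                         }
function Lookup(const T: array of Symbol; const s: string): integer;
var i: integer;
begin
  Lookup := 0;
  { Inside the routine an open array is always indexed from 0 to High(T) }
  for i := 0 to High(T) do
    if T[i] = s then begin
      Lookup := i + 1;
      Exit;
    end;
end;

begin
  WriteLn(Lookup(KWlist, 'ENDIF'));   { 3 }
  WriteLn(Lookup(KWlist, 'WHILE'));   { 0 }
end.
```

The table length no longer has to be passed by hand, so the call cannot get out of step with the declaration of KWlist.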

Now that we can recognize keywords, the next thing is to arrange to return codes for them.

So what kind of code should we return? There are really only two reasonable choices. This seems like an ideal application for the Pascal enumerated type. For example, you can define something like

SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number, Operator);

and arrange to return a variable of this type. Let's give it a try. Insert the line above into your type definitions.




Now, add the two variable declarations:

Token: Symtype; { Current Token }

Value: String[16]; { String Token of Look }

Modify the scanner to read:

{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then begin
      Value := GetName;
      k := Lookup(Addr(KWlist), Value, 4);
      if k = 0 then
         Token := Ident
      else
         Token := SymType(k - 1);
      end
   else if IsDigit(Look) then begin
      Value := GetNum;
      Token := Number;
      end
   else if IsOp(Look) then begin
      Value := GetOp;
      Token := Operator;
      end
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

(Notice that Scan is now a procedure, not a function.)




Finally, modify the main program to read:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Scan;
      case Token of
        Ident: write('Ident ');
        Number: Write('Number ');
        Operator: Write('Operator ');
        IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
      end;
      Writeln(Value);
   until Token = EndSym;
end.
{--------------------------------------------------------------}

What we've done here is to replace the string Token used earlier with an enumerated type. Scan returns the type in variable Token, and returns the string itself in the new variable Value.

OK, compile this and give it a whirl. If everything goes right, you should see that we are now recognizing keywords.

What we have now is working right, and it was easy to generate from what we had earlier. However, it still seems a little "busy" to me. We can simplify things a bit by letting GetName, GetNum, GetOp, and Scan be procedures working with the global variables Token and Value, thereby eliminating the local copies. It also seems a little cleaner to move the table lookup into GetName. The new form for the four procedures is, then:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
var k: integer;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   k := Lookup(Addr(KWlist), Value, 4);
   if k = 0 then
      Token := Ident
   else
      Token := SymType(k - 1);
end;




{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Number;
end;



{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Operator;
end;




{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}



RETURNING A CHARACTER

Essentially every scanner I've ever seen that was written in Pascal used the mechanism of an enumerated type that I've just described. It is certainly a workable mechanism, but it doesn't seem the simplest approach to me.

For one thing, the list of possible symbol types can get pretty long. Here, I've used just one symbol, "Operator," to stand for all of the operators, but I've seen other designs that actually return different codes for each one.

There is, of course, another simple type that can be returned as a code: the character. Instead of returning the enumeration value 'Operator' for a '+' sign, what's wrong with just returning the character itself? A character is just as good a variable for encoding the different token types, it can be used in case statements easily, and it's sure a lot easier to type. What could be simpler?

Besides, we've already had experience with the idea of encoding keywords as single characters. Our previous programs are already written that way, so using this approach will minimize the changes to what we've already done.

Some of you may feel that this idea of returning character codes is too mickey-mouse. I must admit it gets a little awkward for multi-character operators like '<='. If you choose to stay with the enumerated type, fine. For the rest, I'd like to show you how to change what we've done above to support that approach.

First, you can delete the SymType declaration now ... we won't be needing that. And you can change the type of Token to char.

Next, to replace SymType, add the following constant string:

const KWcode: string[5] = 'xilee';

(I'll be encoding all idents with the single character 'x'.)
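The indexing into KWcode is easy to misread, so here is a minimal stand-alone sketch of it. It reuses the KWlist and Lookup definitions shown earlier; the Encode helper and the sample words are hypothetical, added only for illustration. Lookup returns 0 for a non-keyword and 1..4 for a keyword, so adding 1 picks the matching character out of 'xilee'.

```pascal
program KWDemo;

type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');
      KWcode: string[5] = 'xilee';

{ Table lookup, as shown earlier: entry index, or zero if absent }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;

{ Hypothetical helper: map a word to its one-character code }
function Encode(s: string): char;
begin
   Encode := KWcode[Lookup(Addr(KWlist), s, 4) + 1];
end;

begin
   WriteLn('IF    -> ', Encode('IF'));
   WriteLn('ELSE  -> ', Encode('ELSE'));
   WriteLn('ENDIF -> ', Encode('ENDIF'));
   WriteLn('END   -> ', Encode('END'));
   WriteLn('COUNT -> ', Encode('COUNT'));
end.
```

Note that ENDIF and END share the code 'e', while anything not in the table, such as COUNT, falls through to 'x'.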




Lastly, modify Scan and its relatives as follows:

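{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;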



{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := '#';
end;




{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   if Length(Value) = 1 then
      Token := Value[1]
   else
      Token := '?';
end;



{--------------------------------------------------------------}
{ Lexical Scanner }

procedure Scan;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := '?';
      GetChar;
   end;
   SkipWhite;
end;




{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   repeat
      Scan;
      case Token of
        'x': write('Ident ');
        '#': Write('Number ');
        'i', 'l', 'e': Write('Keyword ');
        else Write('Operator ');
      end;
      Writeln(Value);
   until Value = 'END';
end.
{--------------------------------------------------------------}

This program should work the same as the previous version. A minor difference in structure, maybe, but it seems more straightforward to me.



DISTRIBUTED vs CENTRALIZED SCANNERS

The structure for the lexical scanner that I've just shown you is very conventional, and about 99% of all compilers use something very close to it. This is not, however, the only possible structure, or even always the best one.

The problem with the conventional approach is that the scanner has no knowledge of context. For example, it can't distinguish between the assignment operator '=' and the relational operator '=' (perhaps that's why both C and Pascal use different strings for the two). All the scanner can do is to pass the operator along to the parser, which can hopefully tell from the context which operator is meant. Similarly, a keyword like 'IF' has no place in the middle of a math expression, but if one happens to appear there, the scanner will see no problem with it, and will return it to the parser, properly encoded as an 'IF'.

With this kind of approach, we are not really using all the information at our disposal. In the middle of an expression, for example, the parser "knows" that there is no need to look for keywords, but it has no way of telling the scanner that. So the scanner continues to do so. This, of course, slows down the compilation.

In real-world compilers, the designers often arrange for more information to be passed between parser and scanner, just to avoid this kind of problem. But that can get awkward, and certainly destroys a lot of the modularity of the structure.

The alternative is to seek some way to use the contextual information that comes from knowing where we are in the parser. This leads us back to the notion of a distributed scanner, in which various portions of the scanner are called depending upon the context.

In KISS, as in most languages, keywords ONLY appear at the beginning of a statement. In places like expressions, they are not allowed. Also, with one minor exception (the multi-character relops) that is easily handled, all operators are single characters, which means that we don't need GetOp at all.

So it turns out that even with multi-character tokens, we can still always tell from the current lookahead character exactly what kind of token is coming, except at the very beginning of a statement.




Even at that point, the ONLY kind of token we can accept is an identifier. We need only to determine if that identifier is a keyword or the target of an assignment statement.

We end up, then, still needing only GetName and GetNum, which are used very much as we've used them in earlier installments.

It may seem at first to you that this is a step backwards, and a rather primitive approach. In fact, it is an improvement over the classical scanner, since we're using the scanning routines only where they're really needed. In places where keywords are not allowed, we don't slow things down by looking for them.
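To make the saving concrete, here is a rough sketch of the idea, not the KISS code itself. It reads one statement from a fixed string, checks for a keyword only on the first identifier, and counts how often the keyword table is consulted. The names Src, Idx, Lookups, and IsKeyword are assumptions made up for this illustration, and the input is assumed to be upper case with single-character operators and no blanks.

```pascal
program DistDemo;

const Src: string = 'TOTAL=ALPHA+BETA';
      KWlist: array [1..4] of string[8] =
              ('IF', 'ELSE', 'ENDIF', 'END');

var Look   : char;
    Idx    : integer;
    Value  : string[16];
    Lookups: integer;      { hypothetical counter, for illustration }

{ Read the next character of the sample string; #0 marks the end }
procedure GetChar;
begin
   if Idx <= Length(Src) then begin
      Look := Src[Idx];
      Inc(Idx);
      end
   else
      Look := #0;
end;

{ Collect an upper-case name into Value }
procedure GetName;
begin
   Value := '';
   while Look in ['A'..'Z'] do begin
      Value := Value + Look;
      GetChar;
   end;
end;

{ Search the keyword table and count the search }
function IsKeyword: boolean;
var i: integer;
begin
   Inc(Lookups);
   IsKeyword := false;
   for i := 1 to 4 do
      if Value = KWlist[i] then IsKeyword := true;
end;

begin
   Idx := 1;
   Lookups := 0;
   GetChar;
   GetName;                  { statement start: a keyword is possible }
   if IsKeyword then
      WriteLn('keyword ', Value)
   else
      WriteLn('assignment to ', Value);
   while Look <> #0 do begin
      GetChar;               { skip the one-character operator }
      GetName;               { inside the expression: no lookup }
      WriteLn('operand ', Value);
   end;
   WriteLn('keyword lookups: ', Lookups);
end.
```

A centralized scanner would have searched the table for all three names; here only the first one costs a search.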



MERGING SCANNER AND PARSER

Now that we've covered all of the theory and general aspects of lexical scanning that we'll be needing, I'm FINALLY ready to back up my claim that we can accommodate multi-character tokens with minimal change to our previous work. To keep things short and simple I will restrict myself here to a subset of what we've done before; I'm allowing only one control construct (the IF) and no Boolean expressions. That's enough to demonstrate the parsing of both keywords and expressions. The extension to the full set of constructs should be pretty apparent from what we've already done.

All the elements of the program to parse this subset, using single-character tokens, exist already in our previous programs. I built it by judicious copying of these files, but I wouldn't dare try to lead you through that process. Instead, to avoid any confusion, the whole program is shown below:




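{--------------------------------------------------------------}
program KISS;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Lcount: integer;           { Label Counter       }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;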




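{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;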




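{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;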




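{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;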




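{--------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;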




{--------------------------------------------------------------}
{ Parse and Translate the First Math Factor }

procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXT.L D0');
   EmitLn('DIVS D1,D0');
end;

{--------------------------------------------------------------}
{ Completion of Term Processing (called by Term and FirstTerm) }

procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
        '*': Multiply;
        '/': Divide;
      end;
   end;
end;




{--------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;
begin
   SignedFactor;
   Term1;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{--------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
        '+': Add;
        '-': Subtract;
      end;
   end;
end;




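{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

procedure Condition;
begin
   EmitLn('Condition');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   Match('e');
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;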




{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   while not (Look in ['e', 'l']) do begin
      case Look of
        'i': DoIf;
        CR: while Look = CR do
               Fin;
        else Assignment;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   if Look <> 'e' then Expected('END');
   EmitLn('END')
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}




A couple of comments:

(1) The form for the expression parser, using FirstTerm, etc., is a little different from what you've seen before. It's yet another variation on the same theme. Don't let it throw you ... the change is not required for what follows.

(2) Note that, as usual, I had to add calls to Fin at strategic spots to allow for multiple lines.

Before we proceed to adding the scanner, first copy this file and verify that it does indeed parse things correctly. Don't forget the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF.

If the program works, then let's press on. In adding the scanner modules to the program, it helps to have a systematic plan. In all the parsers we've written to date, we've stuck to a convention that the current lookahead character should always be a non-blank character. We preload the lookahead character in Init, and keep the "pump primed" after that. To keep the thing working right at newlines, we had to modify this a bit and treat the newline as a legal token.

In the multi-character version, the rule is similar: The current lookahead character should always be left at the BEGINNING of the next token, or at a newline.
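Here is a small sketch of that rule in action. It is not part of the KISS program: it reads from a fixed string (Src and Idx are names assumed for this demo) instead of the input stream, leaves out the error checks, and uses a hypothetical Show procedure to print where Look ends up after each token.

```pascal
program LookDemo;

const Src: string = 'a1 =  12+ count *2';

var Look : char;
    Idx  : integer;
    Value: string[16];

{ Read the next character of the sample string; #0 marks the end }
procedure GetChar;
begin
   if Idx <= Length(Src) then begin
      Look := Src[Idx];
      Inc(Idx);
      end
   else
      Look := #0;
end;

procedure SkipWhite;
begin
   while Look = ' ' do
      GetChar;
end;

{ Multi-character name; leaves Look at the start of the next token }
procedure GetName;
begin
   Value := '';
   while UpCase(Look) in ['A'..'Z', '0'..'9'] do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;

{ Multi-digit number; same rule for Look }
procedure GetNum;
begin
   Value := '';
   while Look in ['0'..'9'] do begin
      Value := Value + Look;
      GetChar;
   end;
   SkipWhite;
end;

{ Hypothetical helper that reports the state after each token }
procedure Show(kind: string);
begin
   WriteLn(kind, ' "', Value, '", Look is now "', Look, '"');
end;

begin
   Idx := 1;
   GetChar;
   SkipWhite;
   GetName;  Show('name');
   GetChar;  SkipWhite;          { consume '=' the way Match would }
   GetNum;   Show('number');
   GetChar;  SkipWhite;          { consume '+' }
   GetName;  Show('name');
end.
```

After each call, Look holds the first character of whatever comes next ('=', then '+', then '*'), no matter how many blanks separated the tokens.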



The multi-character version is shown next. To get it, I've made the following changes:

o Added the variables Token and Value, and the type definitions needed by Lookup.

o Added the definitions of KWList and KWcode.

o Added Lookup.

o Replaced GetName and GetNum by their multi-character versions. (Note that the call to Lookup has been moved out of GetName, so that it will not be executed for calls within an expression.)

o Created a new, vestigial Scan that calls GetName, then scans for keywords.

o Created a new procedure, MatchString, that looks for a specific keyword. Note that, unlike Match, MatchString does NOT read the next keyword.

o Modified Block to call Scan.

o Changed the calls to Fin a bit. Fin is now called within GetName.




Here is the program in its entirety:

{--------------------------------------------------------------}
program KISS;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

     SymTab = array[1..1000] of Symbol;

     TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look  : char;              { Lookahead Character }
    Token : char;              { Encoded Token       }
    Value : string[16];        { Unencoded Token     }
    Lcount: integer;           { Label Counter       }

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');

const KWcode: string[5] = 'xilee';

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;




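{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;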




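{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;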

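{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;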




{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   if not IsDigit(Look) then Expected('Integer');
   Value := '';
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := '#';
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;

{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;




{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
begin
   GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Value);
      end
   else
      EmitLn('MOVE ' + Value + '(PC),D0');
end;




{--------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else begin
      GetNum;
      EmitLn('MOVE #' + Value + ',D0');
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate the First Math Factor }

procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;




{--------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXT.L D0');
   EmitLn('DIVS D1,D0');
end;

{--------------------------------------------------------------}
{ Completion of Term Processing (called by Term and FirstTerm) }

procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
        '*': Multiply;
        '/': Divide;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;
begin
   SignedFactor;
   Term1;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;




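{--------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{--------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
        '+': Add;
        '-': Subtract;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }

procedure Condition;
begin
   EmitLn('Condition');
end;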




{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
   Name := Value;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }

procedure Block;
begin
   Scan;
   while not (Token in ['e', 'l']) do begin
      case Token of
        'i': DoIf;
        else Assignment;
      end;
      Scan;
   end;
end;




{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure DoProgram;
begin
   Block;
   MatchString('END');
   EmitLn('END')
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   LCount := 0;
   GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   DoProgram;
end.
{--------------------------------------------------------------}

Compare this program with its single-character counterpart. I think you will agree that the differences are minor.



CONCLUSION

At this point, you have learned how to parse and generate code for expressions, Boolean expressions, and control structures. You have now learned how to develop lexical scanners, and how to incorporate their elements into a translator. You have still not seen ALL the elements combined into one program, but on the basis of what we've done before you should find it a straightforward matter to extend our earlier programs to include scanners.

We are very close to having all the elements that we need to build a real, functional compiler. There are still a few things missing, notably procedure calls and type definitions. We will deal with those in the next few sessions. Before doing so, however, I thought it would be fun to turn the translator above into a true compiler. That's what we'll be doing in the next installment.

Up till now, we've taken a rather bottom-up approach to parsing, beginning with low-level constructs and working our way up. In the next installment, I'll also be taking a look from the top down, and we'll discuss how the structure of the translator is altered by changes in the language definition.

See you then.




Part 8 - A Little Philosophy

INTRODUCTION

This is going to be a different kind of session than the others in our series on parsing and compiler construction. For this session, there won't be any experiments to do or code to write. This once, I'd like to just talk with you for a while. Mercifully, it will be a short session, and then we can take up where we left off, hopefully with renewed vigor.

When I was in college, I found that I could always follow a prof's lecture a lot better if I knew where he was going with it. I'll bet you were the same.

So I thought maybe it's about time I told you where we're going with this series: what's coming up in future installments, and in general what all this is about. I'll also share some general thoughts concerning the usefulness of what we've been doing.



THE ROAD HOME

So far, we've covered the parsing and translation of arithmetic expressions, Boolean expressions, and combinations connected by relational operators. We've also done the same for control constructs. In all of this we've leaned heavily on the use of top-down, recursive descent parsing, BNF definitions of the syntax, and direct generation of assembly-language code. We also learned the value of such tricks as single-character tokens to help us see the forest through the trees. In the last installment we dealt with lexical scanning, and I showed you simple but powerful ways to remove the single-character barriers.

Throughout the whole study, I've emphasized the KISS philosophy ... Keep It Simple, Sidney ... and I hope by now you've realized just how simple this stuff can really be. While there are for sure areas of compiler theory that are truly intimidating, the ultimate message of this series is that in practice you can just politely sidestep many of these areas. If the language definition cooperates or, as in this series, if you can define the language as you go, it's possible to write down the language definition in BNF with reasonable ease. And, as we've seen, you can crank out parse procedures from the BNF just about as fast as you can type.
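As a reminder of how directly BNF turns into code, here is a minimal sketch under some stated assumptions: the three-rule grammar in the comment is made up for this illustration, the input comes from a fixed string (Src and Idx are assumed names), operands are single digits, and the routines compute a value instead of emitting 68000 code. There is no error checking at all. Each BNF rule becomes one function of the same name.

```pascal
program BnfDemo;

{ Assumed grammar, for illustration only:
    <expression> ::= <term> [ <addop> <term> ]*
    <term>       ::= <factor> [ <mulop> <factor> ]*
    <factor>     ::= <digit> | ( <expression> )      }

const Src: string = '2+3*(4-1)';

var Look: char;
    Idx : integer;

{ Read the next character of the sample string; #0 marks the end }
procedure GetChar;
begin
   if Idx <= Length(Src) then begin
      Look := Src[Idx];
      Inc(Idx);
      end
   else
      Look := #0;
end;

function Expression: integer; forward;

function Factor: integer;
begin
   if Look = '(' then begin
      GetChar;
      Factor := Expression;
      GetChar;                  { skip the closing parenthesis }
      end
   else begin
      Factor := Ord(Look) - Ord('0');
      GetChar;
   end;
end;

function Term: integer;
var v: integer;
begin
   v := Factor;
   while Look in ['*', '/'] do
      if Look = '*' then begin
         GetChar;
         v := v * Factor;
         end
      else begin
         GetChar;
         v := v div Factor;
      end;
   Term := v;
end;

function Expression: integer;
var v: integer;
begin
   v := Term;
   while Look in ['+', '-'] do
      if Look = '+' then begin
         GetChar;
         v := v + Term;
         end
      else begin
         GetChar;
         v := v - Term;
      end;
   Expression := v;
end;

begin
   Idx := 1;
   GetChar;
   WriteLn(Src, ' = ', Expression);
end.
```

For the sample string this should print 2+3*(4-1) = 11. The shape of the three functions is the same as the Expression, Term, and Factor procedures of the translator; only the action taken at each step differs.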

As our compiler has taken form, it's gotten more parts, but each part is quite small and simple, and very much like all the others.

At this point, we have many of the makings of a real, practical compiler. As a matter of fact, we already have all we need to build a toy compiler for a language as powerful as, say, Tiny BASIC. In the next couple of installments, we'll go ahead and define that language.




To round out the series, we still have a few items to cover. These include:

o Procedure calls, with and without parameters

o Local and global variables

o Basic types, such as character and integer types

o Arrays

o Strings

o User-defined types and structures

o Tree-structured parsers and intermediate languages

o Optimization

These will all be covered in future installments. When we're finished, you'll have all the tools you need to design and build your own languages, and the compilers to translate them.

I can't design those languages for you, but I can make some comments and recommendations. I've already sprinkled some throughout past installments. You've seen, for example, the control constructs I prefer.

These constructs are going to be part of the languages I build. I have three languages in mind at this point, two of which you will see in installments to come:

TINY - A minimal, but usable language on the order of Tiny BASIC or Tiny C. It won't be very practical, but it will have enough power to let you write and run real programs that do something worthwhile.

KISS - The language I'm building for my own use. KISS is intended to be a systems programming language. It won't have strong typing or fancy data structures, but it will support most of the things I want to do with a higher-order language (HOL), except perhaps writing compilers.



I've also been toying for years with the idea of a HOL-like assembler, with structured control constructs and HOL-like assignment statements. That, in fact, was the impetus behind my original foray into the jungles of compiler theory. This one may never be built, simply because I've learned that it's actually easier to implement a language like KISS, that only uses a subset of the CPU instructions. As you know, assembly language can be bizarre and irregular in the extreme, and a language that maps one-for-one onto it can be a real challenge. Still, I've always felt that the syntax used in conventional assemblers is dumb ... why is

MOVE.L A,B

better, or easier to translate, than

B=A ?

I think it would be an interesting exercise to develop a "compiler" that would give the programmer complete access to and control over the full complement of the CPU instruction set, and would allow you to generate programs as efficient as assembly language, without the pain of learning a set of mnemonics. Can it be done? I don't know. The real question may be, "Will the resulting language be any easier to write than assembly"? If not, there's no point in it. I think that it can be done, but I'm not completely sure yet how the syntax should look.
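For the simplest case the mapping is almost trivial, as this toy sketch suggests. It is only an illustration of the B=A example above, not a design for the language: Translate is a hypothetical helper that handles nothing but a plain 'dst=src' statement with no blanks and always assumes a long-word move.

```pascal
program HolAsm;

{ Hypothetical helper: rewrite 'dst=src' as a 68000 MOVE.L }
function Translate(stmt: string): string;
var p: integer;
begin
   p := Pos('=', stmt);
   if p = 0 then
      Translate := '; not an assignment: ' + stmt
   else
      Translate := 'MOVE.L ' + Copy(stmt, p + 1, Length(stmt) - p)
                   + ',' + Copy(stmt, 1, p - 1);
end;

begin
   WriteLn(Translate('B=A'));
end.
```

This should print MOVE.L A,B. The hard part, of course, is everything this sketch ignores: operand sizes, addressing modes, and the irregular corners of the instruction set.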

Perhaps you have some comments or suggestions on this one. I'd love to hear them.

You probably won't be surprised to learn that I've already worked ahead in most of the areas that we will cover. I have some good news: Things never get much harder than they've been so far. It's possible to build a complete, working compiler for a real language, using nothing but the same kinds of techniques you've learned so far. And THAT brings up some interesting questions.




WHY IS IT SO SIMPLE? Before embarking on this series, I always thought that compilers were just naturally com-plex computer programs ... the ultimate challenge. Yet the things we have done here haveusually turned out to be quite simple, sometimes even trivial.

For awhile, I thought is was simply because I hadn't yet gotten into the meat of the sub-ject. I had only covered the simple parts. I will freely admit to you that, even when I beganthe series, I wasn't sure how far we would be able to go before things got too complex todeal with in the ways we have so far. But at this point I've already been down the road farenough to see the end of it. Guess what?

THERE ARE NO HARD PARTS!

Then, I thought maybe it was because we were not generating very good object code. Those of you who have been following the series and trying sample compiles know that, while the code works and is rather foolproof, its efficiency is pretty awful. I figured that if we were concentrating on turning out tight code, we would soon find all that missing complexity.

To some extent, that one is true. In particular, my first few efforts at trying to improve efficiency introduced complexity at an alarming rate. But since then I've been tinkering around with some simple optimizations and I've found some that result in very respectable code quality, WITHOUT adding a lot of complexity.

Finally, I thought that perhaps the saving grace was the "toy compiler" nature of the study. I have made no pretense that we were ever going to be able to build a compiler to compete with Borland and Microsoft. And yet, again, as I get deeper into this thing the differences are starting to fade away.

Just to make sure you get the message here, let me state it flat out:

USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.



Since the series began I've received some comments from you. Most of them echo my own thoughts: "This is easy! Why do the textbooks make it seem so hard?" Good question.

Recently, I've gone back and looked at some of those texts again, and even bought and read some new ones. Each time, I come away with the same feeling: These guys have made it seem too hard.

What's going on here? Why does the whole thing seem difficult in the texts, but easy to us? Are we that much smarter than Aho, Ullman, Brinch Hansen, and all the rest?

Hardly. But we are doing some things differently, and more and more I'm starting to appreciate the value of our approach, and the way that it simplifies things. Aside from the obvious shortcuts that I outlined in Part I, like single-character tokens and console I/O, we have made some implicit assumptions and done some things differently from those who have designed compilers in the past. As it turns out, our approach makes life a lot easier.

So why didn't all those other guys use it?

You have to remember the context of some of the earlier compiler development. These people were working with very small computers of limited capacity. Memory was very limited, the CPU instruction set was minimal, and programs ran in batch mode rather than interactively. As it turns out, these caused some key design decisions that have really complicated the designs. Until recently, I hadn't realized how much of classical compiler design was driven by the available hardware.

Even in cases where these limitations no longer apply, people have tended to structure their programs in the same way, since that is the way they were taught to do it.

In our case, we have started with a blank sheet of paper. There is a danger there, of course, that you will end up falling into traps that other people have long since learned to avoid. But it also has allowed us to take different approaches that, partly by design and partly by pure dumb luck, have allowed us to gain simplicity.




Here are the areas that I think have led to complexity in the past:

o Limited RAM Forcing Multiple Passes

I just read "Brinch Hansen on Pascal Compilers" (an excellent book, BTW). He developed a Pascal compiler for a PC, but he started the effort in 1981 with a 64K system, and so almost every design decision he made was aimed at making the compiler fit into RAM. To do this, his compiler has three passes, one of which is the lexical scanner. There is no way he could, for example, use the distributed scanner I introduced in the last installment, because the program structure wouldn't allow it. He also required not one but two intermediate languages, to provide the communication between phases.

All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design. Lee's book, "The Anatomy of a Compiler," mentions a FORTRAN compiler developed for an IBM 1401. It had no fewer than 63 separate passes! Needless to say, in a compiler like this the separation into phases would dominate the design.

Even in situations where RAM is plentiful, people have tended to use the same techniques because that is what they're familiar with. It wasn't until Turbo Pascal came along that we found how simple a compiler could be if you started with different assumptions.



o Batch Processing

In the early days, batch processing was the only choice ... there was no interactive computing. Even today, compilers run in essentially batch mode.

In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible.

All of that harks back to the days of the early mainframes, where turnaround time was measured in hours or days, and it was important to squeeze every last ounce of information out of each run.

In this series, I've been very careful to avoid the issue of error recovery, and instead our compiler simply halts with an error message on the first error. I will frankly admit that it was mostly because I wanted to take the easy way out and keep things simple. But this approach, pioneered by Borland in Turbo Pascal, also has a lot going for it anyway. Aside from keeping the compiler simple, it also fits very well with the idea of an interactive system. When compilation is fast, and especially when you have an editor such as Borland's that will take you right to the point of the error, then it makes a lot of sense to stop there, and just restart the compilation after the error is fixed.




o Large Programs

Early compilers were designed to handle large programs ... essentially infinite ones. In those days there was little choice; the ideas of subroutine libraries and separate compilation were still in the future. Again, this assumption led to multi-pass designs and intermediate files to hold the results of partial processing.

Brinch Hansen's stated goal was that the compiler should be able to compile itself. Again, because of his limited RAM, this drove him to a multi-pass design. He needed as little resident compiler code as possible, so that the necessary tables and other data structures would fit into RAM.

I haven't stated this one yet, because there hasn't been a need ... we've always just read and written the data as streams, anyway. But for the record, my plan has always been that, in a production compiler, the source and object data should all coexist in RAM with the compiler, a la the early Turbo Pascals. That's why I've been careful to keep routines like GetChar and Emit as separate routines, in spite of their small size. It will be easy to change them to read from and write to memory.

o Emphasis on Efficiency

John Backus has stated that, when he and his colleagues developed the original FORTRAN compiler, they KNEW that they had to make it produce tight code. In those days, there was a strong sentiment against HOLs and in favor of assembly language, and efficiency was the reason. If FORTRAN didn't produce very good code by assembly standards, the users would simply refuse to use it. For the record, that FORTRAN compiler turned out to be one of the most efficient ever built, in terms of code quality. But it WAS complex!

Today, we have CPU power and RAM size to spare, so code efficiency is not so much of an issue. By studiously ignoring this issue, we have indeed been able to Keep It Simple. Ironically, though, as I have said, I have found some optimizations that we can add to the basic compiler structure, without having to add a lot of complexity. So in this case we get to have our cake and eat it too: we will end up with reasonable code quality, anyway.



o Limited Instruction Sets

The early computers had primitive instruction sets. Things that we take for granted, such as stack operations and indirect addressing, came only with great difficulty.

Example: In most compiler designs, there is a data structure called the literal pool. The compiler typically identifies all literals used in the program, and collects them into a single data structure. All references to the literals are done indirectly to this pool. At the end of the compilation, the compiler issues commands to set aside storage and initialize the literal pool.

We haven't had to address that issue at all. When we want to load a literal, we just do it, in line, as in

   MOVE #3,D0

There is something to be said for the use of a literal pool, particularly on a machine like the 8086 where data and code can be separated. Still, the whole thing adds a fairly large amount of complexity with little in return.

Of course, without the stack we would be lost. In a micro, both subroutine calls and temporary storage depend heavily on the stack, and we have used it even more than necessary to ease expression parsing.




o Desire for Generality

Much of the content of the typical compiler text is taken up with issues we haven't addressed here at all ... things like automated translation of grammars, or generation of LALR parse tables. This is not simply because the authors want to impress you. There are good, practical reasons why the subjects are there.

We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead. I haven't made much of this limitation, but the fact is that this represents a small subset of possible grammars. In fact, there is an infinite number of grammars that we can't parse using our techniques. The LR technique is a more powerful one, and can deal with grammars that we can't.

In compiler theory, it's important to know how to deal with these other grammars, and how to transform them into grammars that are easier to deal with. For example, many (but not all) ambiguous grammars can be transformed into unambiguous ones. The way to do this is not always obvious, though, and so many people have devoted years to developing ways to transform them automatically.

In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars!

In our case, of course, we have sneakily let the language evolve as we go, so we haven't painted ourselves into any corners here. You may not always have that luxury. Still, with a little care you should be able to keep the parser simple without having to resort to automatic translation of the grammar.



We have taken a vastly different approach in this series. We started with a clean sheet of paper, and developed techniques that work in the context that we are in; that is, a single-user PC with rather ample CPU power and RAM space. We have limited ourselves to reasonable grammars that are easy to parse, we have used the instruction set of the CPU to advantage, and we have not concerned ourselves with efficiency. THAT's why it's been easy.
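To make the literal-pool point from the "Limited Instruction Sets" item concrete, here is a minimal sketch contrasting the two strategies. It is written in Python rather than the tutorial's Pascal so that it runs standalone; the function names, the LITn labels, and the exact assembly text are assumptions made for illustration, not output of the tutorial's compiler.

```python
# Hypothetical sketch: two ways a code generator might load literals.
# The 68000-style text is illustrative only.

def emit_inline(literals):
    """Load each literal with an immediate operand, as the tutorial does."""
    return [f"\tMOVE #{value},D0" for value in literals]


def emit_with_pool(literals):
    """Collect literals in a pool, reference them by label, dump the pool last."""
    pool = {}   # literal value -> label; dicts preserve insertion order
    code = []
    for value in literals:
        if value not in pool:
            pool[value] = f"LIT{len(pool)}"
        code.append(f"\tMOVE {pool[value]},D0")
    # Storage is set aside once, after all the code has been emitted.
    for value, label in pool.items():
        code.append(f"{label}\tDC {value}")
    return code


if __name__ == "__main__":
    print("\n".join(emit_inline([3, 5, 3])))
    print("\n".join(emit_with_pool([3, 5, 3])))
```

The pooled version needs an extra table and a final pass over it, which is the bookkeeping the inline form avoids.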

Does this mean that we are forever doomed to be able to build only toy compilers? No, I don't think so. As I've said, we can add certain optimizations without changing the compiler structure. If we want to process large files, we can always add file buffering to do that. These things do not affect the overall program design.

And I think that's a key factor. By starting with small and limited cases, we have been able to concentrate on a structure for the compiler that is natural for the job. Since the structure naturally fits the job, it is almost bound to be simple and transparent. Adding capability doesn't have to change that basic structure. We can simply expand things like the file structure or add an optimization layer. I guess my feeling is that, back when resources were tight, the structures people ended up with were artificially warped to make them work under those conditions, and weren't optimum structures for the problem at hand.




CONCLUSION

Anyway, that's my arm-waving guess as to how we've been able to keep things simple. We started with something simple and let it evolve naturally, without trying to force it into some traditional mold.

We're going to press on with this. I've given you a list of the areas we'll be covering in future installments. With those installments, you should be able to build complete, working compilers for just about any occasion, and build them simply. If you REALLY want to build production-quality compilers, you'll be able to do that, too.

For those of you who are chafing at the bit for more parser code, I apologize for this digression. I just thought you'd like to have things put into perspective a bit. Next time, we'll get back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers, and while we have many of the makings of a complete language, we haven't talked about how to put it all together. That will be the subject of our next two installments. Then we'll press on into the new subjects I listed at the beginning of this installment.

See you then.


Part 9 - A Top View


INTRODUCTION

In the previous installments, we have learned many of the techniques required to build a full-blown compiler. We've done assignment statements (with Boolean and arithmetic expressions), relational operators, and control constructs. We still haven't addressed procedure or function calls, but even so we could conceivably construct a mini-language without them. I've always thought it would be fun to see just how small a language one could build that would still be useful. We're ALMOST in a position to do that now. The problem is: though we know how to parse and translate the constructs, we still don't know quite how to put them all together into a language.

In those earlier installments, the development of our programs had a decidedly bottom-up flavor. In the case of expression parsing, for example, we began with the very lowest level constructs, the individual constants and variables, and worked our way up to more complex expressions.

Most people regard the top-down design approach as being better than the bottom-up one. I do too, but the way we did it certainly seemed natural enough for the kinds of things we were parsing.

You mustn't get the idea, though, that the incremental approach that we've been using in all these tutorials is inherently bottom-up. In this installment I'd like to show you that the approach can work just as well when applied from the top down ... maybe better. We'll consider languages such as C and Pascal, and see how complete compilers can be built starting from the top.

In the next installment, we'll apply the same technique to build a complete translator for a subset of the KISS language, which I'll be calling TINY. But one of my goals for this series is that you will not only be able to see how a compiler for TINY or KISS works, but that you will also be able to design and build compilers for your own languages. The C and Pascal examples will help. One thing I'd like you to see is that the natural structure of the compiler depends very much on the language being translated, so the simplicity and ease of construction of the compiler depends very much on letting the language set the program structure.




It's a bit much to produce a full C or Pascal compiler here, and we won't try. But we can flesh out the top levels far enough so that you can see how it goes.

Let's get started.



THE TOP LEVEL

One of the biggest mistakes people make in a top-down design is failing to start at the true top. They think they know what the overall structure of the design should be, so they go ahead and write it down.

Whenever I start a new design, I always like to do it at the absolute beginning. In program design language (PDL), this top level looks something like:

begin

solve the problem

end

OK, I grant you that this doesn't give much of a hint as to what the next level is, but I like to write it down anyway, just to give me that warm feeling that I am indeed starting at the top.

For our problem, the overall function of a compiler is to compile a complete program. Any definition of the language, written in BNF, begins here. What does the top level BNF look like? Well, that depends quite a bit on the language to be translated. Let's take a look at Pascal.




THE STRUCTURE OF PASCAL

Most texts for Pascal include a BNF or "railroad-track" definition of the language. Here are the first few lines of one:

<program> ::= <program-header> <block> '.'

<program-header> ::= PROGRAM <ident>

<block> ::= <declarations> <statements>

We can write recognizers to deal with each of these elements, just as we've done before. For each one, we'll use our familiar single-character tokens to represent the input, then flesh things out a little at a time. Let's begin with the first recognizer: the program itself.

To translate this, we'll start with a fresh copy of the Cradle. Since we're back to single-character names, we'll just use a 'p' to stand for 'PROGRAM.'



To a fresh copy of the cradle, add the following code, and insert a call to it from the main program:

{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var Name: char;
begin
   Match('p');            { Handles program header part }
   Name := GetName;
   Prolog;                { Prolog takes no parameters }
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}

The procedures Prolog and Epilog perform whatever is required to let the program interface with the operating system, so that it can execute as a program. Needless to say, this part will be VERY OS-dependent. Remember, I've been emitting code for a 68000 running under the OS I use, which is SK*DOS. I realize most of you are using PC's and would rather see something else, but I'm in this thing too deep to change now!




Anyhow, SK*DOS is a particularly easy OS to interface to. Here is the code for Prolog and Epilog:

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   EmitLn('WARMST EQU $A01E');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog(Name: char);
begin
   EmitLn('DC WARMST');
   EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}

As usual, add this code and try out the "compiler." At this point, there is only one legal input:

   px.   (where x is any single letter, the program name)

Well, as usual our first effort is rather unimpressive, but by now I'm sure you know that things will get more interesting. There is one important thing to note: THE OUTPUT IS A WORKING, COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).
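For readers without the Cradle at hand, the behavior at this stage can be sketched as a standalone Python function. This is only an approximation under stated assumptions: compile_program and its helpers are hypothetical names, errors are raised as SyntaxError instead of halting, and the emitted lines are the three that Prolog and Epilog write above (EmitLn is assumed to prefix a tab, and GetName to return the upper-cased letter, as in the Cradle).

```python
# Hypothetical Python rendering of Prog, Prolog and Epilog for input "px.".

def compile_program(source):
    """Translate 'p<letter>.' into a list of assembly lines."""
    output = []
    pos = 0

    def look():
        # Current lookahead character, or '' once the input is exhausted.
        return source[pos] if pos < len(source) else ""

    def match(expected):
        nonlocal pos
        if look() != expected:
            raise SyntaxError(f"'{expected}' expected at position {pos}")
        pos += 1

    def get_name():
        nonlocal pos
        if not look().isalpha():
            raise SyntaxError(f"name expected at position {pos}")
        name = look().upper()
        pos += 1
        return name

    def emit_ln(text):
        output.append("\t" + text)

    match("p")                    # program header
    name = get_name()
    emit_ln("WARMST EQU $A01E")   # Prolog
    match(".")
    emit_ln("DC WARMST")          # Epilog
    emit_ln("END " + name)
    return output


if __name__ == "__main__":
    print("\n".join(compile_program("px.")))
```

Any input other than p, a letter, and a period is rejected, which matches the "only one legal input" remark.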



This is very important. The nice feature of the top-down approach is that at any stage you can compile a subset of the complete language and get a program that will run on the target machine. From here on, then, we need only add features by fleshing out the language constructs. It's all very similar to what we've been doing all along, except that we're approaching it from the other end.




FLESHING IT OUT

To flesh out the compiler, we only have to deal with language features one by one. I like to start with a stub procedure that does nothing, then add detail in incremental fashion. Let's begin by processing a block, in accordance with its BNF definition above. We can do this in two stages. First, add the null procedure:

{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}

and modify Prog to read:

{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
var Name: char;
begin
   Match('p');
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}



That certainly shouldn't change the behavior of the program, and it doesn't. But now the definition of Prog is complete, and we can proceed to flesh out DoBlock. That's done right from its BNF definition:

{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);
begin
   Declarations;
   PostLabel(Name);
   Statements;
end;
{--------------------------------------------------------------}

The procedure PostLabel was defined in the installment on branches. Copy it into your cradle.

I probably need to explain the reason for inserting the label where I have. It has to do with the operation of SK*DOS. Unlike some OS's, SK*DOS allows the entry point to the main program to be anywhere in the program. All you have to do is to give that point a name. The call to PostLabel puts that name just before the first executable statement in the main program. How does SK*DOS know which of the many labels is the entry point, you ask? It's the one that matches the END statement at the end of the program.

OK, now we need stubs for the procedures Declarations and Statements. Make them null procedures as we did before.

Does the program still run the same? Then we can move on to the next stage.




DECLARATIONS

The BNF for Pascal declarations is:

<declarations> ::= ( <label list>    |
                     <constant list> |
                     <type list>     |
                     <variable list> |
                     <procedure>     |
                     <function>      )*

(Note that I'm using the more liberal definition used by Turbo Pascal. In the standard Pascal definition, each of these parts must be in a specific order relative to the rest.)



As usual, let's let a single character represent each of these declaration types. The new form of Declarations is:

{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }

procedure Declarations;
begin
   while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
      case Look of
       'l': Labels;
       'c': Constants;
       't': Types;
       'v': Variables;
       'p': DoProcedure;
       'f': DoFunction;
      end;
end;
{--------------------------------------------------------------}

Of course, we need stub procedures for each of these declaration types. This time, they can't quite be null procedures, since otherwise we'll end up with an infinite While loop. At the very least, each recognizer must eat the character that invokes it.




Insert the following procedures:

{--------------------------------------------------------------}
{ Process Label Statement }

procedure Labels;
begin
   Match('l');
end;

{--------------------------------------------------------------}
{ Process Const Statement }

procedure Constants;
begin
   Match('c');
end;

{--------------------------------------------------------------}
{ Process Type Statement }

procedure Types;
begin
   Match('t');
end;

{--------------------------------------------------------------}
{ Process Var Statement }

procedure Variables;
begin
   Match('v');
end;

{--------------------------------------------------------------}
{ Process Procedure Definition }

procedure DoProcedure;
begin
   Match('p');
end;

{--------------------------------------------------------------}
{ Process Function Definition }

procedure DoFunction;
begin
   Match('f');
end;
{--------------------------------------------------------------}




Now try out the compiler with a few representative inputs. You can mix the declarations any way you like, as long as the last character in the program is '.' to indicate the end of the program. Of course, none of the declarations actually declare anything, so you don't need (and can't use) any characters other than those standing for the keywords.
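The declaration loop is easy to experiment with in isolation. The following is a minimal Python sketch, not the tutorial's code: the declarations function, the DECLARATION_KINDS table, and the returned names are assumptions chosen for illustration. It shows the one property the stubs must have, namely that each one consumes its keyword character so the loop makes progress.

```python
# Hypothetical sketch of the Declarations dispatch loop with stub recognizers.

DECLARATION_KINDS = {
    "l": "label",
    "c": "const",
    "t": "type",
    "v": "var",
    "p": "procedure",
    "f": "function",
}


def declarations(source, pos=0):
    """Consume declaration keywords starting at pos; return (kinds, new_pos)."""
    kinds = []
    while pos < len(source) and source[pos] in DECLARATION_KINDS:
        kinds.append(DECLARATION_KINDS[source[pos]])
        pos += 1   # the stub "eats" its keyword; without this the loop never ends
    return kinds, pos


if __name__ == "__main__":
    print(declarations("vcltb"))
```

The loop stops at the first character that does not start a declaration, leaving it for whatever recognizer comes next.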

We can flesh out the statement part in a similar way. The BNF for it is:

<statements> ::= <compound statement>

<compound statement> ::= BEGIN <statement> (';' <statement>)* END

Note that statements can begin with any identifier except END.

So the first stub form of procedure Statements is:

{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }

procedure Statements;
begin
   Match('b');
   while Look <> 'e' do
      GetChar;
   Match('e');
end;
{--------------------------------------------------------------}



At this point the compiler will accept any number of declarations, followed by the BEGIN block of the main program. This block itself can contain any characters at all (except an END), but it must be present.

The simplest form of input is now

   'pxbe.'

Try it. Also try some combinations of this. Make some deliberate errors and see what happens.
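The stub form of Statements can be tried out on its own with a small Python sketch. The names here are hypothetical, and there is one deliberate difference from the Pascal stub that should be read as my assumption rather than the tutorial's behavior: the skipping loop also stops at end of input, so that a missing 'e' is reported as an error.

```python
# Hypothetical sketch of the Statements stub: 'b', anything but 'e', then 'e'.

def statements(source, pos=0):
    """Recognize a stub BEGIN block starting at pos; return the new position."""

    def match(expected, at):
        if at >= len(source) or source[at] != expected:
            raise SyntaxError(f"'{expected}' expected at position {at}")
        return at + 1

    pos = match("b", pos)
    # Stub body: skip everything up to the closing 'e'.
    while pos < len(source) and source[pos] != "e":
        pos += 1
    return match("e", pos)


if __name__ == "__main__":
    print(statements("be"))       # position just past the block
    print(statements("bxyze."))   # the trailing '.' is left for the caller
```

Feeding it deliberate errors, such as a block with no closing 'e', raises SyntaxError at the position where the expected character is missing.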

At this point you should be beginning to see the drill. We begin with a stub translator to process a program, then we flesh out each procedure in turn, based upon its BNF definition. Just as the lower-level BNF definitions add detail and elaborate upon the higher-level ones, the lower-level recognizers will parse more detail of the input program. When the last stub has been expanded, the compiler will be complete. That's top-down design/implementation in its purest form.

You might note that even though we've been adding procedures, the output of the program hasn't changed. That's as it should be. At these top levels there is no emitted code required. The recognizers are functioning as just that: recognizers. They are accepting input sentences, catching bad ones, and channeling good input to the right places, so they are doing their job. If we were to pursue this a bit longer, code would start to appear.




The next step in our expansion should probably be procedure Statements. The Pascal definition is:

<statement> ::= <simple statement> | <structured statement>

<simple statement> ::= <assignment> | <procedure call> | null

<structured statement> ::= <compound statement> |
                           <if statement>       |
                           <case statement>     |
                           <while statement>    |
                           <repeat statement>   |
                           <for statement>      |
                           <with statement>

These are starting to look familiar. As a matter of fact, you have already gone through the process of parsing and generating code for both assignment statements and control structures. This is where the top level meets our bottom-up approach of previous sessions. The constructs will be a little different from those we've been using for KISS, but the differences are nothing you can't handle.

I think you can get the picture now as to the procedure. We begin with a complete BNF description of the language. Starting at the top level, we code up the recognizer for that BNF statement, using stubs for the next-level recognizers. Then we flesh those lower-level statements out one by one.

As it happens, the definition of Pascal is very compatible with the use of BNF, and BNF descriptions of the language abound. Armed with such a description, you will find it fairly straightforward to continue the process we've begun.
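One reason the statement BNF above maps so directly onto a recognizer is that the first token of a statement selects the alternative. A minimal Python sketch of that dispatch follows. The function name, the returned strings, and the set of tokens treated as following a null statement are assumptions for illustration; it works on whole keywords rather than the single-character tokens used elsewhere in this installment.

```python
# Hypothetical classifier: maps the first token of a Pascal statement to the
# BNF alternative it selects.

STRUCTURED = {
    "begin": "compound statement",
    "if": "if statement",
    "case": "case statement",
    "while": "while statement",
    "repeat": "repeat statement",
    "for": "for statement",
    "with": "with statement",
}

# Tokens assumed to follow an empty statement inside a statement sequence.
FOLLOWS_NULL = {";", "end", "until", "else"}


def classify_statement(first_token):
    token = first_token.lower()          # Pascal keywords are case-insensitive
    if token in STRUCTURED:
        return STRUCTURED[token]
    if token in FOLLOWS_NULL:
        return "null statement"
    # An identifier: telling assignment from procedure call needs the next token.
    return "assignment or procedure call"


if __name__ == "__main__":
    for tok in ("WHILE", "count", "end"):
        print(tok, "->", classify_statement(tok))
```

In a real recognizer each table entry would name a procedure to call instead of a description string, with stubs standing in until each one is fleshed out.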



You might have a go at fleshing a few of these constructs out, just to get a feel for it. I don't expect you to be able to complete a Pascal compiler here ... there are too many things such as procedures and types that we haven't addressed yet ... but it might be helpful to try some of the more familiar ones. It will do you good to see executable programs coming out the other end.

If I'm going to address those issues that we haven't covered yet, I'd rather do it in the context of KISS. We're not trying to build a complete Pascal compiler just yet, so I'm going to stop the expansion of Pascal here. Let's take a look at a very different language.




THE STRUCTURE OF C

The C language is quite another matter, as you'll see. Texts on C rarely include a BNF definition of the language. Probably that's because the language is quite hard to write BNF for.

One reason I'm showing you these structures now is so that I can impress upon you these two facts:

(1) The definition of the language drives the structure of the compiler. What works for one language may be a disaster for another. It's a very bad idea to try to force a given structure upon the compiler. Rather, you should let the BNF drive the structure, as we have done here.

(2) A language that is hard to write BNF for will probably be hard to write a compiler for, as well. C is a popular language, and it has a reputation for letting you do virtually anything that is possible to do. Despite the success of Small C, C is _NOT_ an easy language to parse.



A C program has less structure than its Pascal counterpart. At the top level, everything in C is a static declaration, either of data or of a function. We can capture this thought like this:

<program> ::= ( <global declaration> )*

<global declaration> ::= <data declaration> |

<function>

In Small C, functions can only have the default type int, which is not declared. This makes the input easy to parse: the first token is either "int," "char," or the name of a function. In Small C, the preprocessor commands are also processed by the compiler proper, so the syntax becomes:

<global declaration> ::= '#' <preprocessor command> |
                         'int' <data list>          |
                         'char' <data list>         |
                         <ident> <function body>

Although we're really more interested in full C here, I'll show you the code corresponding to this top-level structure for Small C.




{--------------------------------------------------------------}
{ Parse and Translate A Program }

procedure Prog;
begin
   while Look <> ^Z do begin
      case Look of
       '#': PreProc;
       'i': IntDecl;
       'c': CharDecl;
      else DoFunction(Int);
      end;
   end;
end;
{--------------------------------------------------------------}

Note that I've had to use a ^Z to indicate the end of the source. C has no keyword such as END or the '.' to otherwise indicate the end.
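The shape of this Small C top level, including the default branch and the ^Z terminator, can be sketched in Python as follows. The function name and the descriptive strings are hypothetical, and since the stubs PreProc, IntDecl, CharDecl and DoFunction are not shown above, the sketch assumes each one consumes exactly one character.

```python
# Hypothetical sketch of the Small C top-level loop with one-character stubs.

END_OF_SOURCE = "\x1a"   # Ctrl-Z, the end marker used by the Pascal version


def top_level_kinds(source):
    """Classify each single-character 'declaration' up to the end marker."""
    kinds = []
    pos = 0
    while pos < len(source) and source[pos] != END_OF_SOURCE:
        ch = source[pos]
        if ch == "#":
            kinds.append("preprocessor command")
        elif ch == "i":
            kinds.append("int data")
        elif ch == "c":
            kinds.append("char data")
        else:
            kinds.append("function")   # anything else is taken as a function name
        pos += 1                       # stand-in for the stub eating its input
    return kinds


if __name__ == "__main__":
    print(top_level_kinds("#icf" + END_OF_SOURCE))
```

Note how the else branch accepts anything at all as a function, which is only workable because Small C functions carry no declared type.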

With full C, things aren't even this easy. The problem comes about because in full C, functions can also have types. So when the compiler sees a keyword like "int," it still doesn't know whether to expect a data declaration or a function definition. Things get more complicated since the next token may not be a name ... it may start with an '*' or '(', or combinations of the two.



More specifically, the BNF for full C begins with:

<program> ::= ( <top-level decl> )*

<top-level decl> ::= <function def> | <data decl>

<data decl> ::= [<class>] <type> <decl-list>

<function def> ::= [<class>] [<type>] <function decl>

You can now see the problem: The first two parts of the declarations for data and functions can be the same. Because of the ambiguity in the grammar as written above, it's not a suitable grammar for a recursive-descent parser. Can we transform it into one that is suitable? Yes, with a little work. Suppose we write it this way:

<top-level decl> ::= [<class>] <decl>

<decl> ::= <type> <typed decl> | <function decl>

<typed decl> ::= <data list> | <function decl>

We can build a parsing routine for the class and type definitions, and have them store away their findings and go on, without their ever having to "know" whether a function or a data declaration is being processed. To begin, key in the following version of the main program:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   while Look <> ^Z do begin
      GetClass;
      GetType;
      TopDecl;
   end;
end.
{--------------------------------------------------------------}



For the first round, just make the three procedures stubs that do nothing _BUT_ call GetChar.

Does this program work? Well, it would be hard put NOT to, since we're not really asking it to do anything. It's been said that a C compiler will accept virtually any input without choking. It's certainly true of THIS compiler, since in effect all it does is to eat input characters until it finds a ^Z.

Next, let's make GetClass do something worthwhile. Declare the global variable

var Class: char;

and change GetClass to do the following:

{--------------------------------------------------------------}

{ Get a Storage Class Specifier }

Procedure GetClass;

begin

if Look in ['a', 'x', 's'] then begin

Class := Look;

GetChar;

end

else Class := 'a';

end;

{--------------------------------------------------------------}

Here, I've used three single characters to represent the three storage classes "auto," "extern," and "static." These are not the only three possible classes ... there are also "register" and "typedef," but this should give you the picture. Note that the default class is "auto."

We can do a similar thing for types. Enter the following procedure next:

{--------------------------------------------------------------}

{ Get a Type Specifier }

procedure GetType;

begin

Typ := ' ';

if Look = 'u' then begin

Sign := 'u';

Typ := 'i';

GetChar;

end

else Sign := 's';

if Look in ['i', 'l', 'c'] then begin

Typ := Look;

GetChar;

end;

end;

{--------------------------------------------------------------}

Note that you must add two more global variables, Sign and Typ.

With these two procedures in place, the compiler will process the class and type definitions and store away their findings. We can now process the rest of the declaration.

We are by no means out of the woods yet, because there are still many complexities just in the definition of the type, before we even get to the actual data or function names. Let's pretend for the moment that we have passed all those gates, and that the next thing in the input stream is a name. If the name is followed by a left paren, we have a function declaration. If not, we have at least one data item, and possibly a list, each element of which can have an initializer.

Insert the following version of TopDecl:

{--------------------------------------------------------------}

{ Process a Top-Level Declaration }

procedure TopDecl;

var Name: char;

begin

Name := Getname;

if Look = '(' then

DoFunc(Name)

else

DoData(Name);

end;

{--------------------------------------------------------------}

(Note that, since we have already read the name, we must pass it along to the appropriate routine.)

Finally, add the two procedures DoFunc and DoData:

{--------------------------------------------------------------}

{ Process a Function Definition }

procedure DoFunc(n: char);

begin

Match('(');

Match(')');

Match('{');

Match('}');

if Typ = ' ' then Typ := 'i';

Writeln(Class, Sign, Typ, ' function ', n);

end;




{--------------------------------------------------------------}

{ Process a Data Declaration }

procedure DoData(n: char);

begin

if Typ = ' ' then Expected('Type declaration');

Writeln(Class, Sign, Typ, ' data ', n);

while Look = ',' do begin

Match(',');

n := GetName;

WriteLn(Class, Sign, Typ, ' data ', n);

end;

Match(';');

end;

{--------------------------------------------------------------}

Since we're still a long way from producing executable code, I decided to just have these two routines tell us what they found.

OK, give this program a try. For data declarations, it's OK to give a list separated by commas. We can't process initializers as yet. We also can't process argument lists for the functions, but the "(){}" characters should be there.
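To get a feel for the output to expect, here is a small self-contained sketch of the routines above, driven from a fixed string instead of the keyboard. The string-based GetChar, the index Idx, the variable Cls (standing in for Class, which is a reserved word in some modern Pascal dialects), the simplified error routine, and the sample input are all assumptions made for this demo; they are not part of the cradle.

```pascal
program TopDeclDemo;
{ Sketch only: classify C-like top-level declarations read from a string. }

var
  Src: string;
  Idx: integer;
  Look, Cls, Sign, Typ: char;

{ Read the next character; #26 (^Z) marks the end of the input }
procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Idx := Idx + 1;
  end
  else
    Look := #26;
end;

procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

procedure Match(c: char);
begin
  if Look <> c then Expected('''' + c + '''');
  GetChar;
end;

function GetName: char;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
  GetName := UpCase(Look);
  GetChar;
end;

procedure GetClass;
begin
  if Look in ['a', 'x', 's'] then begin
    Cls := Look;
    GetChar;
  end
  else Cls := 'a';
end;

procedure GetType;
begin
  Typ := ' ';
  if Look = 'u' then begin
    Sign := 'u';
    Typ := 'i';
    GetChar;
  end
  else Sign := 's';
  if Look in ['i', 'l', 'c'] then begin
    Typ := Look;
    GetChar;
  end;
end;

procedure DoFunc(n: char);
begin
  Match('('); Match(')'); Match('{'); Match('}');
  if Typ = ' ' then Typ := 'i';
  WriteLn(Cls, Sign, Typ, ' function ', n);
end;

procedure DoData(n: char);
begin
  if Typ = ' ' then Expected('Type declaration');
  WriteLn(Cls, Sign, Typ, ' data ', n);
  while Look = ',' do begin
    Match(',');
    n := GetName;
    WriteLn(Cls, Sign, Typ, ' data ', n);
  end;
  Match(';');
end;

procedure TopDecl;
var Name: char;
begin
  Name := GetName;
  if Look = '(' then DoFunc(Name) else DoData(Name);
end;

begin
  { extern unsigned int p, q;   f(){}   static char g(){} }
  Src := 'xuip,q;f(){}scg(){}';
  Idx := 1;
  GetChar;
  while Look <> #26 do begin
    GetClass;
    GetType;
    TopDecl;
  end;
end.
```

With that input the sketch should report "xui data P", "xui data Q", "asi function F" and "ssc function G", one per line. Note how the untyped function f picks up the default class 'a' and the default type 'i'.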

We're still a _VERY_ long way from having a C compiler, but what we have is starting to process the right kinds of inputs, and is recognizing both good and bad inputs. In the process, the natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts more like a compiler? Of course we can. Should we? That's another matter. I don't know about you, but I'm beginning to get dizzy, and we've still got a long way to go to even get past the data declarations.

At this point, I think you can see how the structure of the compiler evolves from the language definition. The structures we've seen for our two examples, Pascal and C, are as different as night and day. Pascal was designed at least partly to be easy to parse, and that's reflected in the compiler. In general, in Pascal there is more structure and we have a better idea of what kinds of constructs to expect at any point. In C, on the other hand, the program is essentially a list of declarations, terminated only by the end of file.

We could pursue both of these structures much farther, but remember that our purpose here is not to build a Pascal or a C compiler, but rather to study compilers in general. For those of you who DO want to deal with Pascal or C, I hope I've given you enough of a start so that you can take it from here (although you'll soon need some of the stuff we still haven't covered yet, such as typing and procedure calls). For the rest of you, stay with me through the next installment. There, I'll be leading you through the development of a complete compiler for TINY, a subset of KISS.

See you then.




Part 10 - Introducing “Tiny”

INTRODUCTION

In the last installment, I showed you the general idea for the top-down development of a compiler. I gave you the first few steps of the process for compilers for Pascal and C, but I stopped far short of pushing it through to completion. The reason was simple: if we're going to produce a real, functional compiler for any language, I'd rather do it for KISS, the language that I've been defining in this tutorial series.

In this installment, we're going to do just that, for a subset of KISS which I've chosen to call TINY.

The process will be essentially that outlined in Installment IX, except for one notable difference. In that installment, I suggested that you begin with a full BNF description of the language. That's fine for something like Pascal or C, for which the language definition is firm. In the case of TINY, however, we don't yet have a full description ... we seem to be defining the language as we go. That's OK. In fact, it's preferable, since we can tailor the language slightly as we go, to keep the parsing easy.

So in the development that follows, we'll actually be doing a top-down development of BOTH the language and its compiler. The BNF description will grow along with the compiler.

In this process, there will be a number of decisions to be made, each of which will influence the BNF and therefore the nature of the language. At each decision point I'll try to remember to explain the decision and the rationale behind my choice. That way, if you happen to hold a different opinion and would prefer a different option, you can choose it instead. You now have the background to do that. I guess the important thing to note is that nothing we do here is cast in concrete. When YOU'RE designing YOUR language, you should feel free to do it YOUR way.



Many of you may be asking at this point: Why bother starting over from scratch? We had a working subset of KISS as the outcome of Installment VII (lexical scanning). Why not just extend it as needed? The answer is threefold. First of all, I have been making a number of changes to further simplify the program ... changes like encapsulating the code generation procedures, so that we can convert to a different target machine more easily. Second, I want you to see how the development can indeed be done from the top down as outlined in the last installment. Finally, we both need the practice. Each time I go through this exercise, I get a little better at it, and you will, also.



GETTING STARTED

Many years ago there were languages called Tiny BASIC, Tiny Pascal, and Tiny C, each of which was a subset of its parent full language. Tiny BASIC, for example, had only single-character variable names and global variables. It supported only a single data type. Sound familiar? At this point we have almost all the tools we need to build a compiler like that.

Yet a language called Tiny-anything still carries some baggage inherited from its parent language. I've often wondered if this is a good idea. Granted, a language based upon some parent language will have the advantage of familiarity, but there may also be some peculiar syntax carried over from the parent that may tend to add unnecessary complexity to the compiler. (Nowhere is this more true than in Small C.)

I've wondered just how small and simple a compiler could be made and still be useful, if it were designed from the outset to be both easy to use and to parse. Let's find out. This language will just be called "TINY," period. It's a subset of KISS, which I also haven't fully defined, so that at least makes us consistent (!). I suppose you could call it TINY KISS. But that opens up a whole can of worms involving cuter and cuter (and perhaps more risque) names, so let's just stick with TINY.

The main limitations of TINY will be because of the things we haven't yet covered, such as data types. Like its cousins Tiny C and Tiny BASIC, TINY will have only one data type, the 16-bit integer. The first version we develop will also have no procedure calls and will use single-character variable names, although as you will see we can remove these restrictions without much effort.

The language I have in mind will share some of the good features of Pascal, C, and Ada. Taking a lesson from the comparison of the Pascal and C compilers in the previous installment, though, TINY will have a decided Pascal flavor. Wherever feasible, a language structure will be bracketed by keywords or symbols, so that the parser will know where it's going without having to guess.

One other ground rule: As we go, I'd like to keep the compiler producing real, executable code. Even though it may not DO much at the beginning, it will at least do it correctly.



Finally, I'll use a couple of Pascal restrictions that make sense: All data and procedures must be declared before they are used. That makes good sense, even though for now the only data type we'll use is a word. This rule in turn means that the only reasonable place to put the executable code for the main program is at the end of the listing.

The top-level definition will be similar to Pascal:

<program> ::= PROGRAM <top-level decl> <main> '.'

Already, we've reached a decision point. My first thought was to make the main block optional. It doesn't seem to make sense to write a "program" with no main program, but it does make sense if we're allowing for multiple modules, linked together. As a matter of fact, I intend to allow for this in KISS. But then we begin to open up a can of worms that I'd rather leave closed for now. For example, the term "PROGRAM" really becomes a misnomer. The MODULE of Modula-2 or the Unit of Turbo Pascal would be more appropriate. Second, what about scope rules? We'd need a convention for dealing with name visibility across modules. Better for now to just keep it simple and ignore the idea altogether.

There's also a decision in choosing to require the main program to be last. I toyed with the idea of making its position optional, as in C. The nature of SK*DOS, the OS I'm compiling for, makes this very easy to do. But this doesn't really make much sense in view of the Pascal-like requirement that all data and procedures be declared before they're referenced. Since the main program can only call procedures that have already been declared, the only position that makes sense is at the end, a la Pascal.

Given the BNF above, let's write a parser that just recognizes the brackets:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;

begin

Match('p');

Header;

Prolog;

Match('.');

Epilog;

end;

{--------------------------------------------------------------}

The procedure Header just emits the startup code required by the assembler:

{--------------------------------------------------------------}

{ Write Header Info }

procedure Header;

begin

WriteLn('WARMST', TAB, 'EQU $A01E');

end;

{--------------------------------------------------------------}



The procedures Prolog and Epilog emit the code for identifying the main program, and for returning to the OS:

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   PostLabel('MAIN');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;
{--------------------------------------------------------------}




The main program just calls Prog, and then looks for a clean ending:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

Prog;

if Look <> CR then Abort('Unexpected data after ''.''');

end.

{--------------------------------------------------------------}

At this point, TINY will accept only one input "program," the null program:

PROGRAM . (or 'p.' in our shorthand.)

Note, though, that the compiler DOES generate correct code for this program. It will run, and do what you'd expect the null program to do, that is, nothing but return gracefully to the OS.

As a matter of interest, one of my favorite compiler benchmarks is to compile, link, and execute the null program in whatever language is involved. You can learn a lot about the implementation by measuring the overhead in time required to compile what should be a trivial case. It's also interesting to measure the amount of code produced. In many compilers, the code can be fairly large, because they always include the whole run-time library whether they need it or not. Early versions of Turbo Pascal produced a 12K object file for this case. VAX C generates 50K!

The smallest null programs I've seen are those produced by Modula-2 compilers, and they run about 200-800 bytes.

In the case of TINY, we HAVE no run-time library as yet, so the object code is indeed tiny: two bytes. That's got to be a record, and it's likely to remain one since it is the minimum size required by the OS.

The next step is to process the code for the main program. I'll use the Pascal BEGIN-block:

<main> ::= BEGIN <block> END

Here, again, we have made a decision. We could have chosen to require a "PROCEDURE MAIN" sort of declaration, similar to C. I must admit that this is not a bad idea at all ... I don't particularly like the Pascal approach since I tend to have trouble locating the main program in a Pascal listing. But the alternative is a little awkward, too, since you have to deal with the error condition where the user omits the main program or misspells its name. Here I'm taking the easy way out.

Another solution to the "where is the main program" problem might be to require a name for the program, and then bracket the main by

BEGIN <name>

END <name>

similar to the convention of Modula 2. This adds a bit of "syntactic sugar" to the language. Things like this are easy to add or change to your liking, if the language is your own design.

To parse this definition of a main block, change procedure Prog to read:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;

begin

Match('p');

Header;

Main;

Match('.');

end;

{--------------------------------------------------------------}



and add the new procedure:

{--------------------------------------------------------------}

{ Parse and Translate a Main Program }

procedure Main;

begin

Match('b');

Prolog;

Match('e');

Epilog;

end;

{--------------------------------------------------------------}

Now, the only legal program is:

PROGRAM BEGIN END . (or 'pbe.')

Aren't we making progress??? Well, as usual it gets better. You might try some deliberate errors here, like omitting the 'b' or the 'e', and see what happens. As always, the compiler should flag all illegal inputs.

DECLARATIONS

The obvious next step is to decide what we mean by a declaration. My intent here is to have two kinds of declarations: variables and procedures/functions. At the top level, only global declarations are allowed, just as in C.

For now, there can only be variable declarations, identified by the keyword VAR (abbreviated 'v'):

<top-level decls> ::= ( <data declaration> )*

<data declaration> ::= VAR <var-list>

Note that since there is only one variable type, there is no need to declare the type. Later on, for full KISS, we can easily add a type description.

The procedure Prog becomes:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;

begin

Match('p');

Header;

TopDecls;

Main;

Match('.');

end;

{--------------------------------------------------------------}



Now, add the two new procedures:

{--------------------------------------------------------------}
{ Process a Data Declaration }

procedure Decl;
begin
   Match('v');
   GetChar;
end;

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   while Look <> 'b' do
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ''' + Look + '''');
      end;
end;
{--------------------------------------------------------------}

Note that at this point, Decl is just a stub. It generates no code, and it doesn't process a list ... every variable must occur in a separate VAR statement.

OK, now we can have any number of data declarations, each starting with a 'v' for VAR, before the BEGIN-block. Try a few cases and see what happens.



DECLARATIONS AND SYMBOLS

That looks pretty good, but we're still only generating the null program for output. A real compiler would issue assembler directives to allocate storage for the variables. It's about time we actually produced some code.

With a little extra code, that's an easy thing to do from procedure Decl. Modify it as follows:

{--------------------------------------------------------------}

{ Parse and Translate a Data Declaration }

procedure Decl;

var Name: char;

begin

Match('v');

Alloc(GetName);

end;

{--------------------------------------------------------------}

The procedure Alloc just issues a command to the assembler to allocate storage:

{--------------------------------------------------------------}

{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   WriteLn(N, ':', TAB, 'DC 0');
end;

{--------------------------------------------------------------}



Give this one a whirl. Try an input that declares some variables, such as:

pvxvyvzbe.

See how the storage is allocated? Simple, huh? Note also that the entry point, "MAIN," comes out in the right place.

For the record, a "real" compiler would also have a symbol table to record the variables being used. Normally, the symbol table is necessary to record the type of each variable. But since in this case all variables have the same type, we don't need a symbol table for that reason. As it turns out, we're going to find a symbol table necessary even without different types, but let's postpone that need until it arises.

Of course, we haven't really parsed the correct syntax for a data declaration, since it involves a variable list. Our version only permits a single variable. That's easy to fix, too.

The BNF for <var-list> is

<var-list> ::= <ident> (, <ident>)*

Adding this syntax to Decl gives this new version:

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
   while Look = ',' do begin
      GetChar;
      Alloc(GetName);
   end;
end;

{--------------------------------------------------------------}

OK, now compile this code and give it a try. Try a number of lines of VAR declarations, try a list of several variables on one line, and try combinations of the two. Does it work?

INITIALIZERS

As long as we're dealing with data declarations, one thing that's always bothered me about Pascal is that it doesn't allow initializing data items in the declaration. That feature is admittedly sort of a frill, and it may be out of place in a language that purports to be a minimal language. But it's also SO easy to add that it seems a shame not to do so. The BNF becomes:

<var-list> ::= <var> ( , <var> )*

<var> ::= <ident> [ = <integer> ]

Change Alloc as follows:

{--------------------------------------------------------------}

{ Allocate Storage for a Variable }

procedure Alloc(N: char);

begin

Write(N, ':', TAB, 'DC ');

if Look = '=' then begin

Match('=');

WriteLn(GetNum);

end

else

WriteLn('0');

end;

{--------------------------------------------------------------}



There you are: an initializer with six added lines of Pascal.

OK, try this version of TINY and verify that you can, indeed, give the variables initial values.

By golly, this thing is starting to look real! Of course, it still doesn't DO anything, but it looks good, doesn't it?

Before leaving this section, I should point out that we've used two versions of function GetNum. One, the earlier one, returns a character value, a single digit. The other accepts a multi-digit integer and returns an integer value. Either one will work here, since WriteLn will handle either type. But there's no reason to limit ourselves to single-digit values here, so the correct version to use is the one that returns an integer. Here it is:

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;
var Val: integer;
begin
   Val := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
end;
{--------------------------------------------------------------}



As a matter of fact, strictly speaking we should allow for expressions in the data field of the initializer, or at the very least for negative values. For now, let's just allow for negative values by changing the code for Alloc as follows:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   ST[N] := 'v';
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      if Look = '-' then begin
         Write(Look);
         Match('-');
      end;
      WriteLn(GetNum);
   end
   else
      WriteLn('0');
end;
{--------------------------------------------------------------}

Now you should be able to initialize variables with negative and/or multi-digit values.
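As a quick check of the declaration syntax so far, the following self-contained sketch runs Decl and Alloc over a fixed string. It leaves out the symbol-table check, and the string-driven GetChar, the index Idx, the simplified Abort, and the sample input are assumptions made for the demo rather than parts of the real compiler.

```pascal
program InitDemo;
{ Sketch only: VAR lists with optional, possibly negative, initializers. }

const
  TAB = #9;

var
  Src: string;
  Idx: integer;
  Look: char;

{ Read the next character; #26 (^Z) marks the end of the input }
procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Idx := Idx + 1;
  end
  else
    Look := #26;
end;

procedure Abort(s: string);
begin
  WriteLn('Error: ', s, '.');
  Halt(1);
end;

procedure Match(c: char);
begin
  if Look <> c then Abort('''' + c + ''' Expected');
  GetChar;
end;

function GetName: char;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Abort('Name Expected');
  GetName := UpCase(Look);
  GetChar;
end;

function GetNum: integer;
var Val: integer;
begin
  Val := 0;
  if not (Look in ['0'..'9']) then Abort('Integer Expected');
  while Look in ['0'..'9'] do begin
    Val := 10 * Val + Ord(Look) - Ord('0');
    GetChar;
  end;
  GetNum := Val;
end;

{ Emit a storage directive, copying any '-' straight to the output }
procedure Alloc(N: char);
begin
  Write(N, ':', TAB, 'DC ');
  if Look = '=' then begin
    Match('=');
    if Look = '-' then begin
      Write(Look);
      Match('-');
    end;
    WriteLn(GetNum);
  end
  else
    WriteLn('0');
end;

procedure Decl;
begin
  Match('v');
  Alloc(GetName);
  while Look = ',' do begin
    Match(',');
    Alloc(GetName);
  end;
end;

begin
  { VAR a=5, b=-12   VAR c   BEGIN }
  Src := 'va=5,b=-12vcb';
  Idx := 1;
  GetChar;
  while Look = 'v' do Decl;
end.
```

The three lines printed should be "A:", "B:" and "C:", each followed by a tab and then "DC 5", "DC -12" and "DC 0" respectively. The minus sign is never turned into a number; it is simply echoed ahead of the digits, which is all the assembler needs.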



THE SYMBOL TABLE

There's one problem with the compiler as it stands so far: it doesn't do anything to record a variable when we declare it. So the compiler is perfectly content to allocate storage for several variables with the same name. You can easily verify this with an input like

pvavavabe.

Here we've declared the variable A three times. As you can see, the compiler will cheerfully accept that, and generate three identical labels. Not good.

Later on, when we start referencing variables, the compiler will also let us reference variables that don't exist. The assembler will catch both of these error conditions, but it doesn't seem friendly at all to pass such errors along to the assembler. The compiler should catch such things at the source language level.

So even though we don't need a symbol table to record data types, we ought to install one just to check for these two conditions. Since at this point we are still restricted to single-character variable names, the symbol table can be trivial. To provide for it, first add the following declaration at the beginning of your program:

var ST: array['A'..'Z'] of char;

and insert the following function:

{--------------------------------------------------------------}

{ Look for Symbol in Table }

function InTable(n: char): Boolean;

begin

InTable := ST[n] <> ' ';

end;

{--------------------------------------------------------------}



We also need to initialize the table to all blanks. The following lines in Init will do the job:

var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   ...

Finally, insert the following two lines at the beginning of Alloc:


if InTable(N) then Abort('Duplicate Variable Name ' + N);

ST[N] := 'v';

That should do it. The compiler will now catch duplicate declarations. Later, we can also use InTable when generating references to the variables.
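Here is a minimal stand-alone sketch of the table on its own, using the same declaration, the same InTable function, and the same initialization loop. The helper Declare is hypothetical: it reports a duplicate and carries on, where the real Alloc would call Abort and stop.

```pascal
program SymTabDemo;
{ Sketch only: the single-character symbol table in isolation. }

var
  ST: array['A'..'Z'] of char;
  i: char;

{ Look for Symbol in Table }
function InTable(n: char): Boolean;
begin
  InTable := ST[n] <> ' ';
end;

{ Hypothetical helper: record a name, or report that it already exists }
procedure Declare(n: char);
begin
  if InTable(n) then
    WriteLn('Duplicate Variable Name ', n)
  else begin
    ST[n] := 'v';
    WriteLn('Declared ', n);
  end;
end;

begin
  { Blank entries mean "not declared" }
  for i := 'A' to 'Z' do
    ST[i] := ' ';
  Declare('A');
  Declare('B');
  Declare('A');
end.
```

This should print "Declared A", "Declared B" and then "Duplicate Variable Name A". Because each entry holds a character rather than a Boolean, the same array has room to record what kind of symbol a name stands for, not just whether it exists.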



EXECUTABLE STATEMENTS

At this point, we can generate a null program that has some data variables declared and possibly initialized. But so far we haven't arranged to generate the first line of executable code.

Believe it or not, though, we almost have a usable language! What's missing is the executable code that must go into the main program. But that code is just assignment statements and control statements ... all stuff we have done before. So it shouldn't take us long to provide for them, as well.

The BNF definition given earlier for the main program included a statement block, which we have so far ignored:

<main> ::= BEGIN <block> END

For now, we can just consider a block to be a series of assignment statements:

<block> ::= ( <assignment> )*



Let's start things off by adding a parser for the block. We'll begin with a stub for the assignment statement:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
begin
   GetChar;
end;

{--------------------------------------------------------------}

{ Parse and Translate a Block of Statements }

procedure Block;

begin

while Look <> 'e' do

Assignment;

end;

{--------------------------------------------------------------}



Modify procedure Main to call Block as shown below:

{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;

begin

Match('b');

Prolog;

Block;

Match('e');

Epilog;

end;

{--------------------------------------------------------------}

This version still won't generate any code for the "assignment statements" ... all it does is to eat characters until it sees the 'e' for 'END.' But it sets the stage for what is to follow.

The next step, of course, is to flesh out the code for an assignment statement. This is something we've done many times before, so I won't belabor it. This time, though, I'd like to deal with the code generation a little differently. Up till now, we've always just inserted the Emits that generate output code in line with the parsing routines. A little unstructured, perhaps, but it seemed the most straightforward approach, and made it easy to see what kind of code would be emitted for each construct.

However, I realize that most of you are using an 80x86 computer, so the 68000 code generated is of little use to you. Several of you have asked me if the CPU-dependent code couldn't be collected into one spot where it would be easier to retarget to another CPU. The answer, of course, is yes.

To accomplish this, insert the following "code generation" routines:

{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
   EmitLn('CLR D0');
end;

{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: integer);
begin
   Emit('MOVE #');
   WriteLn(n, ',D0');
end;

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;

{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;

{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{---------------------------------------------------------------}

The nice part of this approach, of course, is that we can retarget the compiler to a new CPU simply by rewriting these "code generator" procedures. In addition, we will find later that we can improve the code quality by tweaking these routines a bit, without having to modify the compiler proper.
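To make the retargeting idea concrete, here is a rough sketch of what the same set of routines might look like for a 16-bit 8086 target. None of this comes from the tutorial itself: the choice of AX as the primary register and BX as scratch, the bracketed operand syntax, and the omission of the symbol-table checks are all assumptions made to keep the example short. The main program simply issues the calls the parser would make for an assignment like X = A - 2.

```pascal
program RetargetDemo;
{ Sketch only: the code-generator interface emitting 8086 instructions.
  AX plays the role of D0; BX is an assumed scratch register. }

const
  TAB = #9;

procedure EmitLn(s: string);
begin
  WriteLn(TAB, s);
end;

procedure Clear;
begin
  EmitLn('XOR AX,AX');
end;

procedure Negate;
begin
  EmitLn('NEG AX');
end;

procedure LoadConst(n: integer);
begin
  Write(TAB, 'MOV AX,');
  WriteLn(n);
end;

procedure LoadVar(Name: char);
begin
  EmitLn('MOV AX,[' + Name + ']');
end;

procedure Push;
begin
  EmitLn('PUSH AX');
end;

procedure PopAdd;
begin
  EmitLn('POP BX');
  EmitLn('ADD AX,BX');
end;

{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
  EmitLn('POP BX');
  EmitLn('SUB AX,BX');    { AX = primary - top of stack }
  EmitLn('NEG AX');       { AX = top of stack - primary }
end;

procedure PopMul;
begin
  EmitLn('POP BX');
  EmitLn('IMUL BX');      { signed multiply, low word of result in AX }
end;

{ Divide Top of Stack by Primary }
procedure PopDiv;
begin
  EmitLn('MOV BX,AX');    { divisor }
  EmitLn('POP AX');       { dividend }
  EmitLn('CWD');          { sign-extend AX into DX:AX }
  EmitLn('IDIV BX');      { quotient left in AX }
end;

procedure Store(Name: char);
begin
  EmitLn('MOV [' + Name + '],AX');
end;

begin
  { The calls the parser would make for  X = A - 2 }
  LoadVar('A');
  Push;
  LoadConst(2);
  PopSub;
  Store('X');
end.
```

Run as is, this should emit MOV AX,[A], PUSH AX, MOV AX,2, POP BX, SUB AX,BX, NEG AX and MOV [X],AX, each on its own tab-indented line. Nothing in the parser has to know that the register names changed, which is the whole point of collecting these routines in one place.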



Note that both LoadVar and Store check the symbol table to make sure that the variable is defined. The error handler Undefined simply calls Abort:

{--------------------------------------------------------------}

{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin

Abort('Undefined Identifier ' + n);

end;

{--------------------------------------------------------------}

OK, we are now finally ready to begin processing executable code. We'll do that by replacing the stub version of procedure Assignment.

We've been down this road many times before, so this should all be familiar to you. In fact, except for the changes associated with the code generation, we could just copy the procedures from Part VII. Since we are making some changes, I won't just copy them, but we will go a little faster than usual.

The BNF for the assignment statement is:

<assignment> ::= <ident> = <expression>

<expression> ::= <first term> ( <addop> <term> )*

<first term> ::= <first factor> <rest>

<term> ::= <factor> <rest>

<rest> ::= ( <mulop> <factor> )*

<first factor> ::= [ <addop> ] <factor>

<factor> ::= <var> | <number> | ( <expression> )



This version of the BNF is also a bit different than we've used before ... yet another "variation on the theme of an expression." This particular version has what I consider to be the best treatment of the unary minus. As you'll see later, it lets us handle negative constant values efficiently. It's worth mentioning here that we have often seen the advantages of "tweaking" the BNF as we go, to help make the language easy to parse. What you're looking at here is a bit different: we've tweaked the BNF to make the CODE GENERATION more efficient! That's a first for this series.
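The payoff of treating the leading minus separately can be seen in a stripped-down sketch that handles only a first factor. It assumes a string-driven GetChar, leaves out parentheses and the symbol table, and uses a hypothetical driver called Translate; the instruction strings follow the 68000 conventions used in the earlier installments.

```pascal
program NegFactorDemo;
{ Sketch only: how a leading minus is folded into a constant. }

var
  Src: string;
  Idx: integer;
  Look: char;

{ Read the next character; #26 (^Z) marks the end of the input }
procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Idx := Idx + 1;
  end
  else
    Look := #26;
end;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

function GetNum: integer;
var Val: integer;
begin
  if not (Look in ['0'..'9']) then begin
    WriteLn('Error: Integer Expected');
    Halt(1);
  end;
  Val := 0;
  while Look in ['0'..'9'] do begin
    Val := 10 * Val + Ord(Look) - Ord('0');
    GetChar;
  end;
  GetNum := Val;
end;

procedure LoadConst(n: integer);
begin
  WriteLn(#9, 'MOVE #', n, ',D0');
end;

{ Variables and constants only; no parenthesized expressions here }
procedure Factor;
begin
  if UpCase(Look) in ['A'..'Z'] then begin
    EmitLn('MOVE ' + UpCase(Look) + '(PC),D0');
    GetChar;
  end
  else
    LoadConst(GetNum);
end;

procedure NegFactor;
begin
  GetChar;                      { skip the '-' }
  if Look in ['0'..'9'] then
    LoadConst(-GetNum)          { fold the sign into the constant }
  else begin
    Factor;
    EmitLn('NEG D0');           { anything else is negated at run time }
  end;
end;

{ Hypothetical driver: translate one first factor and label the output }
procedure Translate(s: string);
begin
  Src := s;
  Idx := 1;
  GetChar;
  WriteLn('* ', s);
  if Look = '-' then NegFactor else Factor;
end;

begin
  Translate('-3');
  Translate('-x');
end.
```

For '-3' the sketch should emit the single instruction MOVE #-3,D0, while '-x' needs MOVE X(PC),D0 followed by NEG D0. The constant case costs nothing at run time because the sign is applied by the compiler.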

Anyhow, the following code implements the BNF:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
   end
   else if IsAlpha(Look) then
      LoadVar(GetName)
   else
      LoadConst(GetNum);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }

procedure NegFactor;
begin
   Match('-');
   if IsDigit(Look) then
      LoadConst(-GetNum)
   else begin
      Factor;
      Negate;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }

procedure FirstFactor;
begin
   case Look of
     '+': begin
             Match('+');
             Factor;
          end;
     '-': NegFactor;
   else Factor;
   end;
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   PopMul;
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   PopDiv;
end;

{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }

procedure Term1;
begin
   while IsMulop(Look) do begin
      Push;
      case Look of
        '*': Multiply;
        '/': Divide;
      end;
   end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   Term1;
end;

{---------------------------------------------------------------}

{ Parse and Translate a Leading Term }


begin

FirstFactor;

Term1;

end;

{--------------------------------------------------------------}


procedure Add;

begin

Match('+');

Term;

PopAdd;

end;



{-------------------------------------------------------------}


procedure Subtract;

begin

Match('-');

Term;

PopSub;

end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

FirstTerm;

while IsAddop(Look) do begin

Push;

case Look of

'+': Add;

'-': Subtract;

end;

end;

end;



{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

Expression;

Store(Name);

end;

{--------------------------------------------------------------}

OK, if you've got all this code inserted, then compile it and check it out. You should be seeing reasonable-looking code, representing a complete program that will assemble and execute. We have a compiler!



BOOLEANS

The next step should also be familiar to you. We must add Boolean expressions and relational operations. Again, since we've already dealt with them more than once, I won't elaborate much on them, except where they are different from what we've done before. Again, we won't just copy from other files because I've changed a few things just a bit. Most of the changes just involve encapsulating the machine-dependent parts as we did for the arithmetic operations. I've also modified procedure NotFactor somewhat, to parallel the structure of FirstFactor. Finally, I corrected an error in the object code for the relational operators: The Scc instruction I used only sets the low 8 bits of D0. We want all 16 bits set for a logical true, so I've added an instruction to sign-extend the low byte.
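To see why the sign extension matters, here is a minimal sketch that models the low 16 bits of D0 as a Pascal word. The helpers SccByte and ExtByteToWord are hypothetical stand-ins for what the Scc and EXT instructions do to the register; they are not part of the compiler, and the starting value $1234 is just an arbitrary leftover.

```pascal
program SccDemo;
{ Hypothetical model of the Scc / EXT pair; not part of TINY itself. }

{ Scc writes $FF or $00 into the low byte only; the upper byte is untouched. }
function SccByte(D0: word; Cond: boolean): word;
begin
   if Cond then
      SccByte := (D0 and $FF00) or $00FF
   else
      SccByte := D0 and $FF00;
end;

{ EXT copies bit 7 of the low byte into bits 8..15. }
function ExtByteToWord(D0: word): word;
begin
   if (D0 and $0080) <> 0 then
      ExtByteToWord := D0 or $FF00
   else
      ExtByteToWord := D0 and $00FF;
end;

var D0: word;

begin
   D0 := $1234;                       { stale value left in the register }
   D0 := SccByte(D0, false);          { comparison false: low byte cleared }
   WriteLn('After Scc, D0 = 0? ', D0 = 0);        { upper byte is still $12 }
   D0 := ExtByteToWord(D0);
   WriteLn('After EXT, D0 = 0? ', D0 = 0);        { now all 16 bits are clear }
   D0 := ExtByteToWord(SccByte($1234, true));
   WriteLn('True result = $FFFF? ', D0 = $FFFF);  { all 16 bits set }
end.
```

Without the EXT step, a false comparison can leave a nonzero word behind, which a later TST D0 would treat as true.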

To begin, we're going to need some more recognizers:

{--------------------------------------------------------------}


function IsOrop(c: char): boolean;

begin

IsOrop := c in ['|', '~'];

end;

{--------------------------------------------------------------}


function IsRelop(c: char): boolean;

begin

IsRelop := c in ['=', '#', '<', '>'];

end;

{--------------------------------------------------------------}




Also, we're going to need some more code generation routines:

{---------------------------------------------------------------}

{ Complement the Primary Register }

procedure NotIt;

begin

EmitLn('NOT D0');

end;

{---------------------------------------------------------------}

{---------------------------------------------------------------}

{ AND Top of Stack with Primary }

procedure PopAnd;

begin

EmitLn('AND (SP)+,D0');

end;

{---------------------------------------------------------------}

{ OR Top of Stack with Primary }

procedure PopOr;

begin

EmitLn('OR (SP)+,D0');

end;



{---------------------------------------------------------------}

{ XOR Top of Stack with Primary }

procedure PopXor;

begin

EmitLn('EOR (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Compare Top of Stack with Primary }

procedure PopCompare;

begin

EmitLn('CMP (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Set D0 If Compare was = }

procedure SetEqual;

begin

EmitLn('SEQ D0');

EmitLn('EXT D0');

end;




{---------------------------------------------------------------}

{ Set D0 If Compare was != }

procedure SetNEqual;

begin

EmitLn('SNE D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}

{ Set D0 If Compare was > }

procedure SetGreater;

begin

EmitLn('SLT D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}

{ Set D0 If Compare was < }

procedure SetLess;

begin

EmitLn('SGT D0');

EmitLn('EXT D0');

end;



All of this gives us the tools we need. The BNF for the Boolean expressions is:

<bool-expr> ::= <bool-term> ( <orop> <bool-term> )*

<bool-term> ::= <not-factor> ( <andop> <not-factor> )*

<not-factor> ::= [ '!' ] <relation>

<relation> ::= <expression> [ <relop> <expression> ]

Sharp-eyed readers might note that this syntax does not include the non-terminal "bool-factor" used in earlier versions. It was needed then because I also allowed for the Boolean constants TRUE and FALSE. But remember that in TINY there is no distinction made between Boolean and arithmetic types ... they can be freely intermixed. So there is really no need for these predefined values ... we can just use -1 and 0, respectively.

In C terminology, we could always use the defines:

#define TRUE -1

#define FALSE 0

(That is, if TINY had a preprocessor.) Later on, when we allow for declarations of constants, these two values will be predefined by the language.

The reason that I'm harping on this is that I've already tried the alternative, which is to include TRUE and FALSE as keywords. The problem with that approach is that it then requires lexical scanning for EVERY variable name in every expression. If you'll recall, I pointed out in Installment VII that this slows the compiler down considerably. As long as keywords can't be in expressions, we need to do the scanning only at the beginning of every new statement ... quite an improvement. So using the syntax above not only simplifies the parsing, but speeds up the scanning as well.
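A short sketch may help show why -1 and 0 work so well as the truth values. With all bits set for true and all bits clear for false, the bitwise NOT, AND, OR and EOR instructions that the code generator emits behave exactly like logical operators. The constant names TrueVal and FalseVal are made up for this illustration.

```pascal
program BoolDemo;
{ Sketch: with TRUE = -1 and FALSE = 0, bitwise operators double as logical ones. }
const
   TrueVal  = -1;   { all bits set }
   FalseVal = 0;    { all bits clear }

var t, f, one: integer;

begin
   t := TrueVal;
   f := FalseVal;
   one := 1;
   WriteLn('not TRUE       = ', not t);     { 0 }
   WriteLn('not FALSE      = ', not f);     { -1 }
   WriteLn('TRUE and FALSE = ', t and f);   { 0 }
   WriteLn('TRUE or FALSE  = ', t or f);    { -1 }
   WriteLn('TRUE xor TRUE  = ', t xor t);   { 0 }
   { Had we picked 1 for true, a bitwise NOT would not give false: }
   WriteLn('not 1          = ', not one);   { -2, which is still nonzero }
end.
```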



OK, given that we're all satisfied with the syntax above, the corresponding code is shown below:

{---------------------------------------------------------------}


procedure Equals;

begin

Match('=');

Expression;

PopCompare;

SetEqual;

end;

{---------------------------------------------------------------}


procedure NotEquals;

begin

Match('#');

Expression;

PopCompare;

SetNEqual;

end;



{---------------------------------------------------------------}


procedure Less;

begin

Match('<');

Expression;

PopCompare;

SetLess;

end;

{---------------------------------------------------------------}


procedure Greater;

begin

Match('>');

Expression;

PopCompare;

SetGreater;

end;




{---------------------------------------------------------------}


procedure Relation;

begin

Expression;

if IsRelop(Look) then begin

Push;

case Look of

'=': Equals;

'#': NotEquals;

'<': Less;

'>': Greater;

end;

end;

end;



{---------------------------------------------------------------}

{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;

begin

if Look = '!' then begin

Match('!');

Relation;

NotIt;

end

else

Relation;

end;




{---------------------------------------------------------------}


procedure BoolTerm;

begin

NotFactor;

while Look = '&' do begin

Push;

Match('&');

NotFactor;

PopAnd;

end;

end;

{--------------------------------------------------------------}


procedure BoolOr;

begin

Match('|');

BoolTerm;

PopOr;

end;



{--------------------------------------------------------------}


procedure BoolXor;

begin

Match('~');

BoolTerm;

PopXor;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Expression }

procedure BoolExpression;

begin

BoolTerm;

while IsOrop(Look) do begin

Push;

case Look of

'|': BoolOr;

'~': BoolXor;

end;

end;

end;



To tie it all together, don't forget to change the references to Expression in procedures Factor and Assignment so that they call BoolExpression instead. OK, if you've got all that typed in, compile it and give it a whirl. First, make sure you can still parse an ordinary arithmetic expression. Then, try a Boolean one. Finally, make sure that you can assign the results of relations. Try, for example:

pvx,y,zbx=z>ye.

which stands for:

PROGRAM

VAR X,Y,Z

BEGIN

X = Z > Y

END.

See how this assigns a Boolean value to X?



CONTROL STRUCTURES

We're almost home. With Boolean expressions in place, it's a simple matter to add control structures. For TINY, we'll only allow two kinds of them, the IF and the WHILE:

<if> ::= IF <bool-expression> <block> [ ELSE <block>] ENDIF

<while> ::= WHILE <bool-expression> <block> ENDWHILE

Once again, let me spell out the decisions implicit in this syntax, which departs strongly from that of C or Pascal. In both of those languages, the "body" of an IF or WHILE is regarded as a single statement. If you intend to use a block of more than one statement, you have to build a compound statement using BEGIN-END (in Pascal) or '{}' (in C). In TINY (and KISS) there is no such thing as a compound statement ... single or multiple they're all just blocks to these languages.

In KISS, all the control structures will have explicit and unique keywords bracketing the statement block, so there can be no confusion as to where things begin and end. This is the modern approach, used in such respected languages as Ada and Modula 2, and it completely eliminates the problem of the "dangling else."

Note that I could have chosen to use the same keyword END to end all the constructs, as is done in Pascal. (The closing '}' in C serves the same purpose.) But this has always led to confusion, which is why Pascal programmers tend to write things like

end { loop }

or end { if }

As I explained in Part V, using unique terminal keywords does increase the size of the keyword list and therefore slows down the scanning, but in this case it seems a small price to pay for the added insurance. Better to find the errors at compile time rather than run time.

One last thought: The two constructs above each have the non-terminals

<bool-expression> and <block>

juxtaposed with no separating keyword. In Pascal we would expect the keywords THEN and DO in these locations.

I have no problem with leaving out these keywords, and the parser has no trouble either, ON CONDITION that we make no errors in the bool-expression part. On the other hand, if we were to include these extra keywords we would get yet one more level of insurance at very little cost, and I have no problem with that, either. Use your best judgment as to which way to go.

OK, with that bit of explanation let's proceed. As usual, we're going to need some new code generation routines. These generate the code for conditional and unconditional branches:

{---------------------------------------------------------------}

{ Branch Unconditional }

procedure Branch(L: string);

begin

EmitLn('BRA ' + L);

end;

{---------------------------------------------------------------}

{ Branch False }

procedure BranchFalse(L: string);

begin

EmitLn('TST D0');

EmitLn('BEQ ' + L);

end;

{--------------------------------------------------------------}



Except for the encapsulation of the code generation, the code to parse the control constructs is the same as you've seen before:

{---------------------------------------------------------------}



procedure DoIf;

var L1, L2: string;

begin

Match('i');

BoolExpression;

L1 := NewLabel;

L2 := L1;

BranchFalse(L1);

Block;

if Look = 'l' then begin

Match('l');

L2 := NewLabel;

Branch(L2);

PostLabel(L1);

Block;

end;

PostLabel(L2);

Match('e');

end;


{--------------------------------------------------------------}


procedure DoWhile;

var L1, L2: string;

begin

Match('w');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

BoolExpression;

BranchFalse(L2);

Block;

Match('e');

Branch(L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}
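The label bookkeeping in DoWhile is easier to follow when you can see the shape of the output. The sketch below is a stand-alone program, not the compiler: the bodies of NewLabel, PostLabel and EmitLn are my assumptions about how those routines behave (they are not shown in this section), and the two bracketed lines are placeholders for whatever BoolExpression and Block would really emit.

```pascal
program WhileLabels;
{ Sketch of the label scheme used by DoWhile. NewLabel, PostLabel and EmitLn
  are assumed implementations written for this demo only. }

var LCount: integer;
    L1, L2: string;

{ Generate a unique label: L0, L1, L2, ... }
function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{ Post a label to the output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{ Output a tab-indented instruction line }
procedure EmitLn(S: string);
begin
   WriteLn(#9, S);
end;

begin
   LCount := 0;
   L1 := NewLabel;                      { top of the loop }
   L2 := NewLabel;                      { exit point }
   PostLabel(L1);
   EmitLn('<code for bool-expression>');  { placeholder, result in D0 }
   EmitLn('TST D0');
   EmitLn('BEQ ' + L2);                 { leave the loop when false }
   EmitLn('<code for block>');          { placeholder }
   EmitLn('BRA ' + L1);                 { go back and test again }
   PostLabel(L2);
end.
```

The IF construct follows the same pattern, except that the second label is only allocated when an ELSE clause is actually present.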



To tie everything together, we need only modify procedure Block to recognize the "keywords" for the IF and WHILE. As usual, we expand the definition of a block like so:

<block> ::= ( <statement> )*

where

<statement> ::= <if> | <while> | <assignment>

The corresponding code is:

{--------------------------------------------------------------}


procedure Block;

begin

while not(Look in ['e', 'l']) do begin

case Look of

'i': DoIf;

'w': DoWhile;

else Assignment;

end;

end;

end;

{--------------------------------------------------------------}



OK, add the routines I've given, compile and test them. You should be able to parse the single-character versions of any of the control constructs. It's looking pretty good!

As a matter of fact, except for the single-character limitation we've got a virtually complete version of TINY. I call it, with tongue planted firmly in cheek, TINY Version 0.1.



LEXICAL SCANNING

Of course, you know what's next: We have to convert the program so that it can deal with multi-character keywords, newlines, and whitespace. We have just gone through all that in Part VII. We'll use the distributed scanner technique that I showed you in that installment. The actual implementation is a little different because the way I'm handling newlines is different.

To begin with, let's simply allow for whitespace. This involves only adding calls to SkipWhite at the end of the three routines, GetName, GetNum, and Match. A call to SkipWhite in Init primes the pump in case there are leading spaces.

Next, we need to deal with newlines. This is really a two-step process, since the treatment of the newlines with single-character tokens is different from that for multi-character ones. We can eliminate some work by doing both steps at once, but I feel safer taking things one step at a time.

Insert the new procedure:

{--------------------------------------------------------------}

{ Skip Over an End-of-Line }

procedure NewLine;

begin

while Look = CR do begin

GetChar;

if Look = LF then GetChar;

SkipWhite;

end;

end;

{--------------------------------------------------------------}
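As a quick check of how NewLine copes with blank lines, here is a small sketch that feeds it from an in-memory string rather than standard input. The Src and SrcPos variables and the #0 end-of-input marker are inventions for this demo; in the compiler, GetChar simply reads the next character from the input stream.

```pascal
program NewLineDemo;
{ Sketch only: GetChar reads from a string here so the demo needs no input. }
const
   TAB = #9;
   CR  = #13;
   LF  = #10;

var Src   : string;
    SrcPos: integer;
    Look  : char;

procedure GetChar;
begin
   if SrcPos <= Length(Src) then begin
      Look := Src[SrcPos];
      Inc(SrcPos);
   end
   else
      Look := #0;                 { hypothetical end-of-input marker }
end;

procedure SkipWhite;
begin
   while (Look = ' ') or (Look = TAB) do
      GetChar;
end;

{ Skip any number of line ends, including lines holding only white space }
procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;

begin
   { an empty line, a line of three blanks, then a tab before "x=1" }
   Src := #13#10'   '#13#10#9'x=1';
   SrcPos := 1;
   GetChar;
   NewLine;
   WriteLn('Next significant character: ', Look);
end.
```

The loop ends with Look sitting on the 'x', which is why the callers can go straight on to test for a name, a number, or a specific character.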



Note that we have seen this procedure before in the form of Procedure Fin. I've changed the name since this new one seems more descriptive of the actual function. I've also changed the code to allow for multiple newlines and lines with nothing but white space.

The next step is to insert calls to NewLine wherever we decide a newline is permissible. As I've pointed out before, this can be very different in different languages. In TINY, I've decided to allow them virtually anywhere. This means that we need calls to NewLine at the BEGINNING (not the end, as with SkipWhite) of the procedures GetName, GetNum, and Match.

For procedures that have while loops, such as TopDecl, we need a call to NewLine at the beginning of the procedure AND at the bottom of each loop. That way, we can be assured that NewLine has just been called at the beginning of each pass through the loop.

If you've got all this done, try the program out and verify that it will indeed handle white space and newlines.

If it does, then we're ready to deal with multi-character tokens and keywords. To begin, add the additional declarations (copied almost verbatim from Part VII):

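{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}

{ Variable Declarations }

var Look : char; { Lookahead Character }

Token: char; { Encoded Token }

Value: string[16]; { Unencoded Token }

ST: Array['A'..'Z'] of char;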

{--------------------------------------------------------------}


const NKW = 9;

NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',

'VAR', 'BEGIN', 'END', 'PROGRAM');

const KWcode: string[NKW1] = 'xilewevbep';

{--------------------------------------------------------------}




Next, add the three procedures, also from Part VII:

{--------------------------------------------------------------}

{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;

found: Boolean;

begin

found := false;

i := n;

while (i > 0) and not found do

if s = T^[i] then

found := true

else

dec(i);

Lookup := i;

end;

{--------------------------------------------------------------}

.

.



{--------------------------------------------------------------}


procedure Scan;

begin

GetName;

Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];

end;

{--------------------------------------------------------------}

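{--------------------------------------------------------------}

{ Match a Specific Input String }

procedure MatchString(x: string);

begin

if Value <> x then Expected('''' + x + '''');

end;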

{--------------------------------------------------------------}
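To make the encoding step in Scan concrete, here is a small stand-alone sketch. It uses the same KWlist and KWcode values as above, but the Lookup here is simplified to search the keyword table directly instead of going through a TabPtr, and Encode is a hypothetical helper that does what the last line of Scan does.

```pascal
program ScanDemo;
{ Sketch: how a keyword string turns into a one-character token code.
  Lookup is simplified for the demo; Encode is a made-up helper. }
const NKW = 9;

type Symbol = string[8];

const
   KWlist: array[1..NKW] of Symbol =
      ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
       'VAR', 'BEGIN', 'END', 'PROGRAM');
   KWcode: string[NKW + 1] = 'xilewevbep';

{ Return the index of s in KWlist, or zero if it is not a keyword }
function Lookup(s: string): integer;
var i: integer;
begin
   i := NKW;
   while (i > 0) and (s <> KWlist[i]) do
      Dec(i);
   Lookup := i;
end;

{ Index 0 (not found) maps to the first code, 'x', meaning an identifier }
function Encode(s: string): char;
begin
   Encode := KWcode[Lookup(s) + 1];
end;

begin
   WriteLn('WHILE    -> ', Encode('WHILE'));      { w }
   WriteLn('ENDWHILE -> ', Encode('ENDWHILE'));   { e }
   WriteLn('ENDIF    -> ', Encode('ENDIF'));      { e }
   WriteLn('COUNT    -> ', Encode('COUNT'));      { x }
end.
```

Notice that ENDIF, ENDWHILE and END all share the code 'e'. That is enough for Block to know it has reached the end of a block, and it is why the callers use MatchString to confirm which terminator they actually got.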



Now, we have to make a fairly large number of subtle changes to the remaining procedures. First, we must change the function GetName to a procedure, again as we did in Part VII:

{--------------------------------------------------------------}


procedure GetName;

begin

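NewLine;

if not IsAlpha(Look) then Expected('Name');

Value := '';

while IsAlNum(Look) do begin

Value := Value + UpCase(Look);

GetChar;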

end;

SkipWhite;

end;

{--------------------------------------------------------------}

Note that this procedure leaves its result in the global string Value.



Next, we have to change every reference to GetName to reflect its new form. These occur in Factor, Assignment, and Decl:

{---------------------------------------------------------------}


procedure BoolExpression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

BoolExpression;

Match(')');

end

else if IsAlpha(Look) then begin

GetName;

LoadVar(Value[1]);

end

else

LoadConst(GetNum);

end;

{--------------------------------------------------------------}

.



{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := Value[1];

Match('=');

BoolExpression;

Store(Name);

end;

{---------------------------------------------------------------}

{ Parse and Translate a Data Declaration }

procedure Decl;

begin

GetName;

Alloc(Value[1]);

while Look = ',' do begin

Match(',');

GetName;

Alloc(Value[1]);

end;

end;

{--------------------------------------------------------------}



(Note that we're still only allowing single-character variable names, so we take the easy way out here and simply use the first character of the string.)

Finally, we must make the changes to use Token instead of Look as the test character and to call Scan at the appropriate places. Mostly, this involves deleting calls to Match, occasionally replacing calls to Match by calls to MatchString, and replacing calls to NewLine by calls to Scan. Here are the affected routines:

{---------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;

var L1, L2: string;

begin

BoolExpression;

L1 := NewLabel;

L2 := L1;

BranchFalse(L1);

Block;

if Token = 'l' then begin

L2 := NewLabel;

Branch(L2);

PostLabel(L1);

Block;

end;

PostLabel(L2);

MatchString('ENDIF');

end;


{--------------------------------------------------------------}


procedure DoWhile;

var L1, L2: string;

begin

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

BoolExpression;

BranchFalse(L2);

Block;

MatchString('ENDWHILE');

Branch(L1);

PostLabel(L2);

end;



{--------------------------------------------------------------}


procedure Block;

begin

Scan;

while not(Token in ['e', 'l']) do begin

case Token of

'i': DoIf;

'w': DoWhile;

else Assignment;

end;

Scan;

end;

end;




{--------------------------------------------------------------}

{ Parse and Translate Global Declarations }

procedure TopDecls;

begin

Scan;

while Token <> 'b' do begin

case Token of

'v': Decl;

else Abort('Unrecognized Keyword ' + Value);

end;

Scan;

end;

end;



{--------------------------------------------------------------}


procedure Main;

begin

MatchString('BEGIN');

Prolog;

Block;

MatchString('END');

Epilog;

end;

{--------------------------------------------------------------}


procedure Prog;

begin

MatchString('PROGRAM');

Header;

TopDecls;

Main;

Match('.');

end;




{--------------------------------------------------------------}

{ Initialize }

procedure Init;

var i: char;

begin

for i := 'A' to 'Z' do

ST[i] := ' ';

GetChar;

Scan;

end;

{--------------------------------------------------------------}

That should do it. If all the changes got in correctly, you should now be parsing programs that look like programs. (If you didn't make it through all the changes, don't despair. A complete listing of the final form is given later.)

Did it work? If so, then we're just about home. In fact, with a few minor exceptions we've already got a compiler that's usable. There are still a few areas that need improvement.



MULTI-CHARACTER VARIABLE NAMES

One of those is the restriction that we still have, requiring single-character variable names. Now that we can handle multi-character keywords, this one begins to look very much like an arbitrary and unnecessary limitation. And indeed it is. Basically, its only virtue is that it permits a trivially simple implementation of the symbol table. But that's just a convenience to the compiler writers, and needs to be eliminated.

We've done this step before. This time, as usual, I'm doing it a little differently. I think the approach used here keeps things just about as simple as possible.

The natural way to implement a symbol table in Pascal is by declaring a record type, and making the symbol table an array of such records. Here, though, we don't really need a type field yet (there is only one kind of entry allowed so far), so we only need an array of symbols. This has the advantage that we can use the existing procedure Lookup to search the symbol table as well as the keyword list. As it turns out, even when we need more fields we can still use the same approach, simply by storing the other fields in separate arrays.

OK, here are the changes that need to be made. First, add the new typed constant:

NEntry: integer = 0;

Then change the definition of the symbol table as follows:

const MaxEntry = 100;

var ST : array[1..MaxEntry] of Symbol;

(Note that ST is _NOT_ declared as a SymTab. That declaration is a phony one to get Lookup to work. A SymTab would take up too much RAM space, and so one is never actually allocated.)




Next, we need to replace InTable:

{--------------------------------------------------------------}


function InTable(n: Symbol): Boolean;

begin

InTable := Lookup(@ST, n, MaxEntry) <> 0;

end;

{--------------------------------------------------------------}

We also need a new procedure, AddEntry, that adds a new entry to the table:

{--------------------------------------------------------------}

{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);

begin

if InTable(N) then Abort('Duplicate Identifier ' + N);

if NEntry = MaxEntry then Abort('Symbol Table Full');

Inc(NEntry);

ST[NEntry] := N;

SType[NEntry] := T;

end;

{--------------------------------------------------------------}
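Here is a small stand-alone sketch of the parallel-array idea, so you can try the table on its own. It deviates from the text in two labeled ways: Locate is a plain linear search over the entries in use (rather than the pointer-based Lookup over all MaxEntry slots), and errors are reported with a direct WriteLn and Halt instead of Abort.

```pascal
program SymTabDemo;
{ Sketch of a symbol table kept as parallel arrays.
  Locate and the error handling are simplified for this demo. }
const MaxEntry = 100;

type Symbol = string[8];

var ST    : array[1..MaxEntry] of Symbol;   { the names }
    SType : array[1..MaxEntry] of char;     { one type code per name }
    NEntry: integer;

{ Return the index of N, or zero if it is not present }
function Locate(N: Symbol): integer;
var i: integer;
begin
   i := NEntry;
   while (i > 0) and (ST[i] <> N) do
      Dec(i);
   Locate := i;
end;

function InTable(N: Symbol): boolean;
begin
   InTable := Locate(N) <> 0;
end;

procedure AddEntry(N: Symbol; T: char);
begin
   if InTable(N) then begin
      WriteLn('Duplicate Identifier ', N);
      Halt(1);
   end;
   if NEntry = MaxEntry then begin
      WriteLn('Symbol Table Full');
      Halt(1);
   end;
   Inc(NEntry);
   ST[NEntry] := N;       { same index in both arrays }
   SType[NEntry] := T;
end;

begin
   NEntry := 0;
   AddEntry('COUNT', 'v');
   AddEntry('TOTAL', 'v');
   WriteLn('COUNT is at index ', Locate('COUNT'));
   WriteLn('LIMIT in table? ', InTable('LIMIT'));
end.
```

Adding another attribute later only means declaring one more array indexed the same way, which is the point of keeping the fields separate.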



This procedure is called by Alloc:

{--------------------------------------------------------------}


procedure Alloc(N: Symbol);

begin


AddEntry(N, 'v');

{--------------------------------------------------------------}

Finally, we must change all the routines that currently treat the variable name as a single character. These include LoadVar and Store (just change the type from char to string), and Factor, Assignment, and Decl (just change Value[1] to Value).

One last thing: change procedure Init to clear the array as shown:

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

var i: integer;

begin

for i := 1 to MaxEntry do begin

ST[i] := '';

SType[i] := ' ';

end;

GetChar;

Scan;

end;

{--------------------------------------------------------------}

That should do it. Try it out and verify that you can, indeed, use multi-character variable names.



MORE RELOPS

We still have one remaining single-character restriction: the one on relops. Some of the relops are indeed single characters, but others require two. These are '<=' and '>='. I also prefer the Pascal '<>' for "not equals," instead of '#'.

If you'll recall, in Part VII I pointed out that the conventional way to deal with relops is to include them in the list of keywords, and let the lexical scanner find them. But, again, this requires scanning throughout the expression parsing process, whereas so far we've been able to limit the use of the scanner to the beginning of a statement.

I mentioned then that we can still get away with this, since the multi-character relops are so few and so limited in their usage. It's easy to just treat them as special cases and handle them in an ad hoc manner.

The changes required affect only the code generation routines and procedures Relation and friends. First, we're going to need two more code generation routines:

{---------------------------------------------------------------}

{ Set D0 If Compare was <= }

procedure SetLessOrEqual;

begin

EmitLn('SGE D0');

EmitLn('EXT D0');

end;




{---------------------------------------------------------------}

{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;

begin

EmitLn('SLE D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}

Then, modify the relation parsing routines as shown below:

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;

begin

Match('=');

Expression;

PopCompare;

SetLessOrEqual;

end;



{---------------------------------------------------------------}


procedure NotEqual;

begin

Match('>');

Expression;

PopCompare;

SetNEqual;

end;




{---------------------------------------------------------------}


procedure Less;

begin

Match('<');

case Look of

'=': LessOrEqual;

'>': NotEqual;

else begin

Expression;

PopCompare;

SetLess;

end;

end;

end;



{---------------------------------------------------------------}


procedure Greater;

begin

Match('>');

if Look = '=' then begin

Match('=');

Expression;

PopCompare;

SetGreaterOrEqual;

end

else begin

Expression;

PopCompare;

SetGreater;

end;

end;

{---------------------------------------------------------------}

That's all it takes. Now you can process all the relops. Try it.
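If you'd like to convince yourself that one character of lookahead is enough, the following sketch classifies an operator string the same way Less and Greater do. RelopName is a hypothetical helper written for this illustration; it returns a description instead of generating code.

```pascal
program RelopDemo;
{ Sketch only: RelopName is a made-up helper, not part of TINY. }

function RelopName(s: string): string;
var c2: char;
begin
   RelopName := 'not a relop';
   if Length(s) = 0 then Exit;
   { c2 plays the role of Look after the first character has been matched }
   if Length(s) > 1 then c2 := s[2] else c2 := ' ';
   case s[1] of
      '=': RelopName := 'equal';
      '#': RelopName := 'not equal';
      '<': begin
              if c2 = '=' then RelopName := 'less or equal'
              else if c2 = '>' then RelopName := 'not equal'
              else RelopName := 'less';
           end;
      '>': begin
              if c2 = '=' then RelopName := 'greater or equal'
              else RelopName := 'greater';
           end;
   end;
end;

begin
   WriteLn('<  : ', RelopName('<'));
   WriteLn('<= : ', RelopName('<='));
   WriteLn('<> : ', RelopName('<>'));
   WriteLn('>  : ', RelopName('>'));
   WriteLn('>= : ', RelopName('>='));
end.
```

Because the second character is only inspected after a '<' or '>' has already been seen, the rest of the parser never needs to know that two-character operators exist.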



INPUT/OUTPUT

We now have a complete, working language, except for one minor embarrassment: we have no way to get data in or out. We need some I/O.

Now, the convention these days, established in C and continued in Ada and Modula 2, is to leave I/O statements out of the language itself, and just include them in the subroutine library. That would be fine, except that so far we have no provision for subroutines. Anyhow, with this approach you run into the problem of variable-length argument lists. In Pascal, the I/O statements are built into the language because they are the only ones for which the argument list can have a variable number of entries. In C, we settle for kludges like scanf and printf, and must pass the argument count to the called procedure. In Ada and Modula 2 we must use the awkward (and SLOW!) approach of a separate call for each argument.

So I think I prefer the Pascal approach of building the I/O in, even though we don't need to.

As usual, for this we need some more code generation routines. These turn out to be the easiest of all, because all we do is to call library procedures to do the work:

{---------------------------------------------------------------}

{ Read Variable to Primary Register }

procedure ReadVar;

begin

EmitLn('BSR READ');

Store(Value);

end;

{---------------------------------------------------------------}

{ Write Variable from Primary Register }

procedure WriteVar;

begin

EmitLn('BSR WRITE');

end;

{--------------------------------------------------------------}

The idea is that READ loads the value from input to D0, and WRITE outputs it from there.

These two procedures represent our first encounter with a need for library procedures ... the components of a Run Time Library (RTL). Of course, someone (namely us) has to write these routines, but they're not part of the compiler itself. I won't even bother showing the routines here, since these are obviously very much OS-dependent. I _WILL_ simply say that for SK*DOS, they are particularly simple ... almost trivial. One reason I won't show them here is that you can add all kinds of fanciness to the things, for example by prompting in READ for the inputs, and by giving the user a chance to reenter a bad input.

But that is really separate from compiler design, so for now I'll just assume that a library called TINYLIB.LIB exists. Since we now need it loaded, we need to add a statement to include it in procedure Header:

{--------------------------------------------------------------}


procedure Header;

begin


EmitLn('LIB TINYLIB');

end;

{--------------------------------------------------------------}

That takes care of that part. Now, we also need to recognize the read and write commands. We can do this by adding two more keywords to our list:

{--------------------------------------------------------------}


const NKW = 11;

NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',

'READ', 'WRITE', 'VAR', 'BEGIN', 'END',

'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';

{--------------------------------------------------------------}



(Note how I'm using upper case codes here to avoid conflict with the 'w' of WHILE.)

Next, we need procedures for processing the read/write statement and its argument list:

{--------------------------------------------------------------}

{ Process a Read Statement }

procedure DoRead;

begin

Match('(');

GetName;

ReadVar;

while Look = ',' do begin

Match(',');

GetName;

ReadVar;

end;

Match(')');

end;




{--------------------------------------------------------------}

{ Process a Write Statement }

procedure DoWrite;

begin

Match('(');

Expression;

WriteVar;

while Look = ',' do begin

Match(',');

Expression;

WriteVar;

end;

Match(')');

end;

{--------------------------------------------------------------}



Finally, we must expand procedure Block to handle the new statement types:

{--------------------------------------------------------------}


procedure Block;

begin

Scan;

while not(Token in ['e', 'l']) do begin

case Token of

'i': DoIf;

'w': DoWhile;

'R': DoRead;

'W': DoWrite;

else Assignment;

end;

Scan;

end;

end;

{--------------------------------------------------------------}

That's all there is to it. _NOW_ we have a language!



CONCLUSION

At this point we have TINY completely defined. It's not much ... actually a toy compiler. TINY has only one data type and no subroutines ... but it's a complete, usable language. While you're not likely to be able to write another compiler in it, or do anything else very seriously, you could write programs to read some input, perform calculations, and output the results. Not too bad for a toy.

Most importantly, we have a firm base upon which to build further extensions. I know you'll be glad to hear this: this is the last time I'll start over in building a parser ... from now on I intend to just add features to TINY until it becomes KISS. Oh, there'll be other times we will need to try things out with new copies of the Cradle, but once we've found out how to do those things they'll be incorporated into TINY.

What will those features be? Well, for starters we need subroutines and functions. Then we need to be able to handle different types, including arrays, strings, and other structures. Then we need to deal with the idea of pointers. All this will be upcoming in future installments.

See you then.

For reference purposes, the complete listing of TINY Version 1.0 is shown below:



{--------------------------------------------------------------}

program Tiny10;

{--------------------------------------------------------------}


const TAB = ^I;

CR = ^M;

LF = ^J;

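LCount: integer = 0;

NEntry: integer = 0;

{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;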



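{--------------------------------------------------------------}

{ Variable Declarations }

var Look : char; { Lookahead Character }

Token: char; { Encoded Token }

Value: string[16]; { Unencoded Token }

const MaxEntry = 100;

var ST : array[1..MaxEntry] of Symbol;

SType: array[1..MaxEntry] of char;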



{--------------------------------------------------------------}


const NKW = 11;

NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',

'READ', 'WRITE', 'VAR', 'BEGIN', 'END',

'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';




{--------------------------------------------------------------}


procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

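{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;

{--------------------------------------------------------------}

{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin

Abort('Undefined Identifier ' + n);

end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;
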

{--------------------------------------------------------------}



begin

IsDigit := c in ['0'..'9'];

end;

{--------------------------------------------------------------}



begin


end;

{--------------------------------------------------------------}



begin

IsAddop := c in ['+', '-'];

end;



{--------------------------------------------------------------}



begin

IsMulop := c in ['*', '/'];

end;

{--------------------------------------------------------------}



begin

IsOrop := c in ['|', '~'];

end;

{--------------------------------------------------------------}



begin

IsRelop := c in ['=', '#', '<', '>'];

end;




{--------------------------------------------------------------}



begin


end;

{--------------------------------------------------------------}



begin


GetChar;

end;



{--------------------------------------------------------------}


procedure NewLine;

begin

while Look = CR do begin

GetChar;


SkipWhite;

end;

end;

{--------------------------------------------------------------}



begin

NewLine;



SkipWhite;

end;




{--------------------------------------------------------------}

{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;


var i: integer;

found: Boolean;

begin

found := false;

i := n;

while (i > 0) and not found do


if s = T^[i] then

found := true

else

dec(i);

Lookup := i;

end;



{--------------------------------------------------------------}

{ Locate a Symbol in Table }

{ Returns the index of the entry. Zero if not present. }

function Locate(N: Symbol): integer;

begin

Locate := Lookup(@ST, n, MaxEntry);

end;

{--------------------------------------------------------------}



begin

InTable := Lookup(@ST, n, MaxEntry) <> 0;

end;




{--------------------------------------------------------------}



begin

if InTable(N) then Abort('Duplicate Identifier ' + N);


Inc(NEntry);

ST[NEntry] := N;

SType[NEntry] := T;

end;



{--------------------------------------------------------------}


procedure GetName;

begin

NewLine;


Value := '';



GetChar;

end;

SkipWhite;

end;




{--------------------------------------------------------------}

{ Get a Number }


var Val: integer;

begin

NewLine;


Val := 0;



GetChar;

end;

GetNum := Val;

SkipWhite;

end;



{--------------------------------------------------------------}


procedure Scan;

begin

GetName;


end;

{--------------------------------------------------------------}



begin


end;

{--------------------------------------------------------------}



begin

Write(TAB, s);

end;




{--------------------------------------------------------------}



begin

Emit(s);

WriteLn;

end;

{--------------------------------------------------------------}



var S: string;

begin

Str(LCount, S);


Inc(LCount);

end;



{--------------------------------------------------------------}



begin

WriteLn(L, ':');

end;

{---------------------------------------------------------------}


procedure Clear;

begin

EmitLn('CLR D0');

end;

{---------------------------------------------------------------}


procedure Negate;

begin

EmitLn('NEG D0');

end;




{---------------------------------------------------------------}


procedure NotIt;

begin

EmitLn('NOT D0');

end;

{---------------------------------------------------------------}


procedure LoadConst(n: integer);

begin

Emit('MOVE #');

WriteLn(n, ',D0');

end;

{---------------------------------------------------------------}


procedure LoadVar(Name: string);

begin



end;



{---------------------------------------------------------------}


procedure Push;

begin


end;

{---------------------------------------------------------------}


procedure PopAdd;

begin


end;

{---------------------------------------------------------------}


procedure PopSub;

begin


EmitLn('NEG D0');

end;




{---------------------------------------------------------------}


procedure PopMul;

begin


end;

{---------------------------------------------------------------}


procedure PopDiv;

begin


EmitLn('EXT.L D7');



end;



{---------------------------------------------------------------}


procedure PopAnd;

begin


end;

{---------------------------------------------------------------}


procedure PopOr;

begin


end;

{---------------------------------------------------------------}


procedure PopXor;

begin


end;




{---------------------------------------------------------------}



begin


end;

{---------------------------------------------------------------}


procedure SetEqual;

begin

EmitLn('SEQ D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}



begin

EmitLn('SNE D0');

EmitLn('EXT D0');

end;



{---------------------------------------------------------------}



begin

EmitLn('SLT D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}


procedure SetLess;

begin

EmitLn('SGT D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}



begin

EmitLn('SGE D0');

EmitLn('EXT D0');

end;




{---------------------------------------------------------------}



begin

EmitLn('SLE D0');

EmitLn('EXT D0');

end;

{---------------------------------------------------------------}


procedure Store(Name: string);

begin




end;

{---------------------------------------------------------------}



begin

EmitLn('BRA ' + L);

end;



{---------------------------------------------------------------}

{ Branch False }


begin

EmitLn('TST D0');

EmitLn('BEQ ' + L);

end;

{---------------------------------------------------------------}


procedure ReadVar;

begin

EmitLn('BSR READ');

Store(Value[1]);

end;

{ Write Variable from Primary Register }

procedure WriteVar;

begin


end;




{--------------------------------------------------------------}


procedure Header;

begin


end;

{--------------------------------------------------------------}


procedure Prolog;

begin

PostLabel('MAIN');

end;

{--------------------------------------------------------------}


procedure Epilog;

begin


EmitLn('END MAIN');

end;



{---------------------------------------------------------------}



procedure Factor;

begin


Match('(');

BoolExpression;

Match(')');

end

else if IsAlpha(Look) then begin

GetName;

LoadVar(Value);

end

else

LoadConst(GetNum);

end;




{--------------------------------------------------------------}

{ Parse and Translate a Negative Factor }

procedure NegFactor;

begin

Match('-');


LoadConst(-GetNum)

else begin

Factor;

Negate;

end;

end;



{--------------------------------------------------------------}

{ Parse and Translate a Leading Factor }

procedure FirstFactor;

begin

case Look of

'+': begin

Match('+');

Factor;

end;

'-': NegFactor;

else Factor;

end;

end;




{--------------------------------------------------------------}


procedure Multiply;

begin

Match('*');

Factor;

PopMul;

end;

{-------------------------------------------------------------}


procedure Divide;

begin

Match('/');

Factor;

PopDiv;

end;



{---------------------------------------------------------------}

{ Common Code Used by Term and FirstTerm }

procedure Term1;

begin


Push;

case Look of

'*': Multiply;

'/': Divide;

end;

end;

end;

{---------------------------------------------------------------}


procedure Term;

begin

Factor;

Term1;

end;




{---------------------------------------------------------------}

{ Parse and Translate a Leading Term }


begin

FirstFactor;

Term1;

end;

{--------------------------------------------------------------}


procedure Add;

begin

Match('+');

Term;

PopAdd;

end;



{-------------------------------------------------------------}


procedure Subtract;

begin

Match('-');

Term;

PopSub;

end;

{---------------------------------------------------------------}



begin

FirstTerm;


Push;

case Look of

'+': Add;

'-': Subtract;

end;

end;

end;




{---------------------------------------------------------------}


procedure Equal;

begin

Match('=');

Expression;

PopCompare;

SetEqual;

end;

{---------------------------------------------------------------}



begin

Match('=');

Expression;

PopCompare;

SetLessOrEqual;

end;



{---------------------------------------------------------------}


procedure NotEqual;

begin

Match('>');

Expression;

PopCompare;

SetNEqual;

end;




{---------------------------------------------------------------}


procedure Less;

begin

Match('<');

case Look of

'=': LessOrEqual;

'>': NotEqual;

else begin

Expression;

PopCompare;

SetLess;

end;

end;

end;



{---------------------------------------------------------------}


procedure Greater;

begin

Match('>');


Match('=');

Expression;

PopCompare;

SetGreaterOrEqual;

end

else begin

Expression;

PopCompare;

SetGreater;

end;

end;




{---------------------------------------------------------------}


procedure Relation;

begin

Expression;


Push;

case Look of

'=': Equal;

'<': Less;

'>': Greater;

end;

end;

end;



{---------------------------------------------------------------}



begin


Match('!');

Relation;

NotIt;

end

else

Relation;

end;




{---------------------------------------------------------------}


procedure BoolTerm;

begin

NotFactor;


Push;

Match('&');

NotFactor;

PopAnd;

end;

end;

{--------------------------------------------------------------}


procedure BoolOr;

begin

Match('|');

BoolTerm;

PopOr;

end;



{--------------------------------------------------------------}


procedure BoolXor;

begin

Match('~');

BoolTerm;

PopXor;

end;

{---------------------------------------------------------------}



begin

BoolTerm;


Push;

case Look of

'|': BoolOr;

'~': BoolXor;

end;

end;

end;




{--------------------------------------------------------------}



var Name: string;

begin

Name := Value;

Match('=');

BoolExpression;

Store(Name);

end;



{---------------------------------------------------------------}



procedure DoIf;

var L1, L2: string;

begin

BoolExpression;

L1 := NewLabel;

L2 := L1;

BranchFalse(L1);

Block;


L2 := NewLabel;

Branch(L2);

PostLabel(L1);

Block;

end;

PostLabel(L2);


end;




{--------------------------------------------------------------}


procedure DoWhile;

var L1, L2: string;

begin

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

BoolExpression;

BranchFalse(L2);

Block;


Branch(L1);

PostLabel(L2);

end;



{--------------------------------------------------------------}


procedure DoRead;

begin

Match('(');

GetName;

ReadVar;


Match(',');

GetName;

ReadVar;

end;

Match(')');

end;




{--------------------------------------------------------------}


procedure DoWrite;

begin

Match('(');

Expression;

WriteVar;


Match(',');

Expression;

WriteVar;

end;

Match(')');

end;



{--------------------------------------------------------------}


procedure Block;

begin

Scan;


case Token of

'i': DoIf;

'w': DoWhile;

'R': DoRead;

'W': DoWrite;

else Assignment;

end;

Scan;

end;

end;




{--------------------------------------------------------------}


procedure Alloc(N: Symbol);

begin


AddEntry(N, 'v');



Match('=');

If Look = '-' then begin

Write(Look);

Match('-');

end;

WriteLn(GetNum);

end

else

WriteLn('0');

end;



{--------------------------------------------------------------}


procedure Decl;

begin

GetName;

Alloc(Value);


Match(',');

GetName;

Alloc(Value);

end;

end;




{--------------------------------------------------------------}


procedure TopDecls;

begin

Scan;

while Token <> 'b' do begin

case Token of

'v': Decl;

else Abort('Unrecognized Keyword ' + Value);

end;

Scan;

end;

end;



{--------------------------------------------------------------}


procedure Main;

begin


Prolog;

Block;

MatchString('END');

Epilog;

end;




{--------------------------------------------------------------}


procedure Prog;

begin


Header;

TopDecls;

Main;

Match('.');

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

var i: integer;

begin

for i := 1 to MaxEntry do begin

ST[i] := '';

SType[i] := ' ';

end;

GetChar;

Scan;

end;



{--------------------------------------------------------------}

{ Main Program }

begin

Init;

Prog;

if Look <> CR then Abort('Unexpected data after ''.''');

end.

{--------------------------------------------------------------}




Part 11 - Lexical Scan Revisited

INTRODUCTION

I've got some good news and some bad news. The bad news is that this installment is not the one I promised last time. What's more, the one after this one won't be, either. The good news is the reason for this installment: I've found a way to simplify and improve the lexical scanning part of the compiler. Let me explain.



BACKGROUND

If you'll remember, we talked at length about the subject of lexical scanners in Part VII, and I left you with a design for a distributed scanner that I felt was about as simple as I could make it ... more than most that I've seen elsewhere. We used that idea in Part X. The compiler structure that resulted was simple, and it got the job done.

Recently, though, I've begun to have problems, and they're the kind that send a message that you might be doing something wrong.

The whole thing came to a head when I tried to address the issue of semicolons. Several people have asked me about them, and whether or not KISS will have them separating the statements. My intention has been NOT to use semicolons, simply because I don't like them and, as you can see, they have not proved necessary.

But I know that many of you, like me, have gotten used to them, and so I set out to write a short installment to show you how they could easily be added, if you were so inclined.

Well, it turned out that they weren't easy to add at all. In fact it was darned difficult.

I guess I should have realized that something was wrong, because of the issue of newlines. In the last couple of installments we've addressed that issue, and I've shown you how to deal with newlines with a procedure called, appropriately enough, NewLine. In TINY Version 1.0, I sprinkled calls to this procedure in strategic spots in the code.

It seems that every time I've addressed the issue of newlines, though, I've found it to be tricky, and the resulting parser turned out to be quite fragile ... one addition or deletion here or there and things tended to go to pot. Looking back on it, I realize that there was a message in this that I just wasn't paying attention to.

When I tried to add semicolons on top of the newlines, that was the last straw. I ended up with much too complex a solution. I began to realize that something fundamental had to change.

So, in a way this installment will cause us to backtrack a bit and revisit the issue of scanning all over again. Sorry about that. That's the price you pay for watching me do this in real time. But the new version is definitely an improvement, and will serve us well for what is to come.




As I said, the scanner we used in Part X was about as simple as one can get. But anything can be improved. The new scanner is more like the classical scanner, and not as simple as before. But the overall compiler structure is even simpler than before. It's also more robust, and easier to add to and/or modify. I think that's worth the time spent in this digression. So in this installment, I'll be showing you the new structure. No doubt you'll be happy to know that, while the changes affect many procedures, they aren't very profound and so we lose very little of what's been done so far.

Ironically, the new scanner is much more conventional than the old one, and is very much like the more generic scanner I showed you earlier in Part VII. Then I started trying to get clever, and I almost clevered myself clean out of business. You'd think one day I'd learn: K-I-S-S!



THE PROBLEM

The problem begins to show itself in procedure Block, which I've reproduced below:

{--------------------------------------------------------------}


{ Parse and Translate a Block of Statements }
procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;

{--------------------------------------------------------------}




As you can see, Block is oriented to individual program statements. At each pass through the loop, we know that we are at the beginning of a statement. We exit the block when we have scanned an END or an ELSE.

But suppose that we see a semicolon instead. The procedure as it's shown above can't handle that, because procedure Scan only expects and can only accept tokens that begin with a letter.

I tinkered around for quite a while to come up with a fix. I found many possible approaches, but none were very satisfying. I finally figured out the reason.

Recall that when we started with our single-character parsers, we adopted a convention that the lookahead character would always be prefetched. That is, we would have the character that corresponds to our current position in the input stream fetched into the global character Look, so that we could examine it as many times as needed. The rule we adopted was that EVERY recognizer, if it found its target token, would advance Look to the next character in the input stream.

That simple and fixed convention served us very well when we had single-character tokens, and it still does. It would make a lot of sense to apply the same rule to multi-character tokens.

But when we got into lexical scanning, I began to violate that simple rule. The scanner of Part X did indeed advance to the next token if it found an identifier or keyword, but it DIDN'T do that if it found a carriage return, a whitespace character, or an operator.

Now, that sort of mixed-mode operation gets us into deep trouble in procedure Block, because whether or not the input stream has been advanced depends upon the kind of token we encounter. If it's a keyword or the target of an assignment statement, the "cursor," as defined by the contents of Look, has been advanced to the next token OR to the beginning of whitespace. If, on the other hand, the token is a semicolon, or if we have hit a carriage return, the cursor has NOT advanced.

Needless to say, we can add enough logic to keep us on track. But it's tricky, and makes the whole parser very fragile.



There's a much better way, and that's just to adopt that same rule that's worked so well before, to apply to TOKENS as well as single characters. In other words, we'll prefetch tokens just as we've always done for characters. It seems so obvious once you think about it that way.

Interestingly enough, if we do things this way the problem that we've had with newline characters goes away. We can just lump them in as whitespace characters, which means that the handling of newlines becomes very trivial, and MUCH less prone to error than we've had to deal with in the past.




THE SOLUTION

Let's begin to fix the problem by re-introducing the two procedures:

{--------------------------------------------------------------}


{ Get an Identifier }
procedure GetName;
begin
   SkipWhite;
   if not IsAlpha(Look) then Expected('Identifier');
   Token := 'x';
   Value := '';
   repeat
      Value := Value + UpCase(Look);
      GetChar;
   until not IsAlNum(Look);
end;



{--------------------------------------------------------------}

{ Get a Number }

procedure GetNum;
begin
   SkipWhite;
   if not IsDigit(Look) then Expected('Number');
   Token := '#';
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until not IsDigit(Look);
end;

{--------------------------------------------------------------}

These two procedures are functionally almost identical to the ones I showed you in Part VII. They each fetch the current token, either an identifier or a number, into the global string Value. They also set the encoded version, Token, to the appropriate code. The input stream is left with Look containing the first character NOT part of the token.




We can do the same thing for operators, even multi-character operators, with a procedure such as:

{--------------------------------------------------------------}

{ Get an Operator }

procedure GetOp;
begin
   Token := Look;
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;

{--------------------------------------------------------------}

Note that GetOp returns, as its encoded token, the FIRST character of the operator. This is important, because it means that we can now use that single character to drive the parser, instead of the lookahead character.



We need to tie these procedures together into a single procedure that can handle all three cases. The following procedure will read any one of the token types and always leave the input stream advanced beyond it:

{--------------------------------------------------------------}

{ Get the Next Input Token }

procedure Next;

begin

SkipWhite;

if IsAlpha(Look) then GetName

else if IsDigit(Look) then GetNum

else GetOp;

end;

{--------------------------------------------------------------}

***NOTE that here I have put SkipWhite BEFORE the calls rather than after. This means that, in general, the variable Look will NOT have a meaningful value in it, and therefore we should NOT use it as a test value for parsing, as we have been doing so far. That's the big departure from our normal approach.

Now, remember that before I was careful not to treat the carriage return (CR) and line feed (LF) characters as white space. This was because, with SkipWhite called as the last thing in the scanner, the encounter with LF would trigger a read statement. If we were on the last line of the program, we couldn't get out until we input another line with a non-white character. That's why I needed the second procedure, NewLine, to handle the CRLF's.

But now, with the call to SkipWhite coming first, that's exactly the behavior we want. The compiler must know there's another token coming or it wouldn't be calling Next. In other words, it hasn't found the terminating END yet. So we're going to insist on more data until we find something.




All this means that we can greatly simplify both the program and the concepts, by treating CR and LF as whitespace characters, and eliminating NewLine. You can do that simply by modifying the function IsWhite:

{--------------------------------------------------------------}



{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF];
end;

{--------------------------------------------------------------}

We've already tried similar routines in Part VII, but you might as well try these new ones out. Add them to a copy of the Cradle and call Next with the following main program:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Next;

WriteLn(Token, ' ', Value);

until Token = '.';

end.

{--------------------------------------------------------------}



Compile it and verify that you can separate a program into a series of tokens, and that you get the right encoding for each token.

This ALMOST works, but not quite. There are two potential problems: First, in KISS/TINY almost all of our operators are single-character operators. The only exceptions are the relops >=, <=, and <>. It seems a shame to treat all operators as strings and do a string compare, when only a single character compare will almost always suffice. Second, and much more important, the thing doesn't WORK when two operators appear together, as in (a+b)*(c+d). Here the string following 'b' would be interpreted as a single operator ")*(."

It's possible to fix that problem. For example, we could just give GetOp a list of legal characters, and we could treat the parentheses as different operator types than the others. But this begins to get messy.

Fortunately, there's a better way that solves all the problems. Since almost all the operators are single characters, let's just treat them that way, and let GetOp get only one character at a time. This not only simplifies GetOp, but also speeds things up quite a bit. We still have the problem of the relops, but we were treating them as special cases anyway.
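To see the failure concretely, here is a minimal sketch that runs both styles of operator scanning over the expression from the text. It is not part of TINY: the names GreedyOp and SingleOp, the string constant Src, and the index argument are all assumptions, made only so the example can run on its own without the Cradle or keyboard input.

```pascal
program OpScanDemo;
{ Sketch only: scans a string constant instead of standard input.  }
{ GreedyOp, SingleOp and Src are hypothetical names, not from TINY. }

const
  Src: string = '(a+b)*(c+d)';

function IsAlNum(c: char): boolean;
begin
  IsAlNum := c in ['A'..'Z', 'a'..'z', '0'..'9'];
end;

{ Multi-character style: collect everything up to the next }
{ letter, digit or blank, starting at position p            }
function GreedyOp(var p: integer): string;
var s: string;
begin
  s := '';
  repeat
    s := s + Src[p];
    Inc(p);
  until (p > Length(Src)) or IsAlNum(Src[p]) or (Src[p] = ' ');
  GreedyOp := s;
end;

{ Single-character style: exactly one character per operator token }
function SingleOp(var p: integer): string;
begin
  SingleOp := Src[p];
  Inc(p);
end;

var p: integer;

begin
  p := 5;                       { position of the ')' that follows 'b' }
  WriteLn(GreedyOp(p));         { prints )*( }
  p := 5;
  Write(SingleOp(p), ' ');
  Write(SingleOp(p), ' ');
  WriteLn(SingleOp(p));         { the line reads ) * ( }
end.
```

Run as written, the first line shows the three characters lumped into one bogus operator, and the second shows them coming out as three separate tokens.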

So here's the final version of GetOp:

{--------------------------------------------------------------}

{ Get an Operator }
procedure GetOp;
begin
   SkipWhite;
   Token := Look;
   Value := Look;
   GetChar;
end;

{--------------------------------------------------------------}




Note that I still give the string Value a value. If you're truly concerned about efficiency, you could leave this out. When we're expecting an operator, we will only be testing Token anyhow, so the value of the string won't matter. But to me it seems to be good practice to give the thing a value just in case.

Try this new version with some realistic-looking code. You should be able to separate any program into its individual tokens, with the caveat that the two-character relops will scan into two separate tokens. That's OK ... we'll parse them that way.
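If you don't have the Cradle handy, the following self-contained sketch exercises the same scanner on a small multi-line input. The assumptions are mine, not part of TINY: the input comes from the string constant Src rather than the keyboard, Cursor is a hypothetical index into it, the character #0 stands in for end of input, and the error checks are left out for brevity.

```pascal
program NextDemo;
{ Sketch only. Src, Cursor and the #0 end marker are assumptions made }
{ so the scanner can run on a string instead of standard input.       }

const
  TAB = ^I;
  CR  = ^M;
  LF  = ^J;
  Src: string = 'x = 10'#13#10'while x <= 3;'#13#10'end.';

var
  Look:   char;      { lookahead character }
  Token:  char;      { encoded token       }
  Value:  string;    { unencoded token     }
  Cursor: integer;   { next position to read in Src }

procedure GetChar;
begin
  if Cursor <= Length(Src) then begin
    Look := Src[Cursor];
    Inc(Cursor);
  end
  else
    Look := #0;      { assumed end-of-input marker }
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

{ CR and LF now count as ordinary white space }
function IsWhite(c: char): boolean;
begin
  IsWhite := c in [' ', TAB, CR, LF];
end;

procedure SkipWhite;
begin
  while IsWhite(Look) do GetChar;
end;

procedure GetName;
begin
  Token := 'x';
  Value := '';
  repeat
    Value := Value + UpCase(Look);
    GetChar;
  until not (IsAlpha(Look) or IsDigit(Look));
end;

procedure GetNum;
begin
  Token := '#';
  Value := '';
  repeat
    Value := Value + Look;
    GetChar;
  until not IsDigit(Look);
end;

{ One character per operator token }
procedure GetOp;
begin
  Token := Look;
  Value := Look;
  GetChar;
end;

procedure Next;
begin
  SkipWhite;
  if IsAlpha(Look) then GetName
  else if IsDigit(Look) then GetNum
  else GetOp;
end;

begin
  Cursor := 1;
  GetChar;                       { prime the lookahead character }
  repeat
    Next;
    WriteLn(Token, ' ', Value);
  until Token = '.';
end.
```

In the output, the line breaks in the input vanish without any special handling, and the relop <= shows up as a '<' token followed by a separate '=' token.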

Now, in Part VII the function of Next was combined with procedure Scan, which also checked every identifier against a list of keywords and encoded each one that was found. As I mentioned at the time, the last thing we would want to do is to use such a procedure in places where keywords should not appear, such as in expressions. If we did that, the keyword list would be scanned for every identifier appearing in the code. Not good.

The right way to deal with that is to simply separate the functions of fetching tokens and looking for keywords. The version of Scan shown below does NOTHING but check for keywords. Notice that it operates on the current token and does NOT advance the input stream.

{--------------------------------------------------------------}

{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
   if Token = 'x' then
      Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{--------------------------------------------------------------}
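The keyword check can be tried in isolation with a sketch like the one below. Treat it as an illustration rather than the listing itself: Classify is a hypothetical stand-in for Scan that takes the identifier as an argument, Lookup is simplified to search KWlist directly, and the first five keyword strings are my inference from the code string 'xileweRWve'.

```pascal
program ScanDemo;
{ Sketch only. Classify is a hypothetical helper standing in for Scan, }
{ and the first five keywords are inferred from the code string.       }

type
  Symbol = string[8];

const
  NKW  = 9;
  NKW1 = 10;
  KWlist: array[1..NKW] of Symbol =
    ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
     'READ', 'WRITE', 'VAR', 'END');
  KWcode: string[NKW1] = 'xileweRWve';

{ Search the keyword list from the top down; 0 means "not a keyword" }
function Lookup(s: string): integer;
var
  i: integer;
  found: boolean;
begin
  found := false;
  i := NKW;
  while (i > 0) and not found do
    if s = KWlist[i] then
      found := true
    else
      Dec(i);
  Lookup := i;
end;

{ Map an identifier to its one-character token code.           }
{ Index 0 (no match) selects 'x', the code for a plain identifier. }
function Classify(s: string): char;
begin
  Classify := KWcode[Lookup(s) + 1];
end;

begin
  WriteLn(Classify('WHILE'));    { prints w }
  WriteLn(Classify('END'));      { prints e }
  WriteLn(Classify('COUNT'));    { prints x }
end.
```

The point of the "+ 1" is that a failed lookup returns zero, which lands on the leading 'x' in the code string, so an ordinary identifier keeps the code it already had.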



There is one last detail. In the compiler there are a few places that we must actually check the string value of the token. Mainly, this is done to distinguish between the different END's, but there are a couple of other places. (I should note in passing that we could always eliminate the need for matching END characters by encoding each one to a different character. Right now we are definitely taking the lazy man's route.)

The following version of MatchString takes the place of the character-oriented Match. Note that, like Match, it DOES advance the input stream.

{--------------------------------------------------------------}



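{ Match a Specific Input String }
procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
   Next;
end;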

{--------------------------------------------------------------}




FIXING UP THE COMPILER

Armed with these new scanner procedures, we can now begin to fix the compiler to use them properly. The changes are all quite minor, but there are quite a few places where changes are necessary. Rather than showing you each place, I will give you the general idea and then just give the finished product.

First of all, the code for procedure Block doesn't change, though its function does:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;

{--------------------------------------------------------------}



Remember that the new version of Scan doesn't advance the input stream, it only scans for keywords. The input stream must be advanced by each procedure that Block calls.

In general, we have to replace every test on Look with a similar test on Token. For example:

{---------------------------------------------------------------}



procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Token) do begin
      Push;
      case Token of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;

{--------------------------------------------------------------}




In procedures like Add, we don't have to use Match anymore. We need only call Next to advance the input stream:

{--------------------------------------------------------------}


procedure Add;

begin

Next;

Term;

PopAdd;

end;

{-------------------------------------------------------------}



Control structures are actually simpler. We just call Next to advance over the control keywords:

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Next;
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      Next;
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;

{--------------------------------------------------------------}




That's about the extent of the REQUIRED changes. In the listing of TINY Version 1.1 below, I've also made a number of other "improvements" that aren't really required. Let me explain them briefly:

(1) I've deleted the two procedures Prog and Main, and combined their functions into the main program. They didn't seem to add to program clarity ... in fact they seemed to just muddy things up a little.

(2) I've deleted the keywords PROGRAM and BEGIN from the keyword list. Each one only occurs in one place, so it's not necessary to search for it.

(3) Having been bitten by an overdose of cleverness, I've reminded myself that TINY is supposed to be a minimalist program. Therefore I've replaced the fancy handling of unary minus with the dumbest one I could think of. A giant step backwards in code quality, but a great simplification of the compiler. KISS is the right place to use the other version.

(4) I've added some error-checking routines such as CheckTable and CheckDup, and replaced in-line code by calls to them. This cleans up a number of routines.

(5) I've taken the error checking out of code generation routines like Store, and put it in the parser where it belongs. See Assignment, for example.

(6) There was an error in InTable and Locate that caused them to search all locations instead of only those with valid data in them. They now search only valid cells. This allows us to eliminate the initialization of the symbol table, which was done in Init.

(7) Procedure AddEntry now has two arguments, which helps to make things a bit more modular.

(8) I've cleaned up the code for the relational operators by the addition of the new procedures CompareExpression and NextExpression.

(9) I fixed an error in the Read routine ... the earlier version did not check for a valid variable name.
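Item (6) is perhaps easiest to see in isolation. The sketch below is an illustration under my own assumptions, not the TINY code: Lookup is simplified to search the global table ST directly, the entry 'JUNK' is hypothetical leftover data in an unused cell, and the program relies on global variables starting out zeroed, as they do under Free Pascal.

```pascal
program LocateDemo;
{ Sketch only. 'JUNK' is hypothetical stale data in an unused cell. }

type
  Symbol = string[8];

const
  MaxEntry = 100;

var
  ST: array[1..MaxEntry] of Symbol;   { symbol table        }
  NEntry: integer;                    { number of valid cells }

{ Search cells n down to 1 only; returns 0 if s is not found }
function Lookup(s: Symbol; n: integer): integer;
var
  i: integer;
  found: boolean;
begin
  found := false;
  i := n;
  while (i > 0) and not found do
    if s = ST[i] then
      found := true
    else
      Dec(i);
  Lookup := i;
end;

begin
  NEntry := 0;
  ST[50] := 'JUNK';                   { garbage beyond the valid entries }
  Inc(NEntry); ST[NEntry] := 'A';
  Inc(NEntry); ST[NEntry] := 'B';
  WriteLn(Lookup('B', NEntry));       { prints 2 }
  WriteLn(Lookup('JUNK', NEntry));    { prints 0: cell 50 is never examined }
  WriteLn(Lookup('JUNK', MaxEntry));  { prints 50: the old behavior finds the garbage }
end.
```

Because the search never looks past NEntry, whatever happens to be sitting in the unused cells no longer matters, which is why the table no longer needs to be cleared in Init.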




CONCLUSION

The resulting compiler for TINY is given below. Other than the removal of the keyword PROGRAM, it parses the same language as before. It's just a bit cleaner, and more importantly it's considerably more robust. I feel good about it.

The next installment will be another digression: the discussion of semicolons and such that got me into this mess in the first place. THEN we'll press on into procedures and types. Hang in there with me. The addition of those features will go a long way towards removing KISS from the "toy language" category. We're getting very close to being able to write a serious compiler.



TINY VERSION 1.1

{--------------------------------------------------------------}

program Tiny11;

{--------------------------------------------------------------}


const TAB = ^I;

CR = ^M;

LF = ^J;

LCount: integer = 0;


{--------------------------------------------------------------}




TabPtr = ^SymTab;




{--------------------------------------------------------------}







SType: array[1..MaxEntry] of char;

{--------------------------------------------------------------}


const NKW = 9;

NKW1 = 10;



'READ', 'WRITE', 'VAR', 'END');

const KWcode: string[NKW1] = 'xileweRWve';



{--------------------------------------------------------------}


procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

{ Report an Error }


begin

WriteLn;


end;

{--------------------------------------------------------------}



begin

Error(s);

Halt;

end;




{--------------------------------------------------------------}



begin


end;

{--------------------------------------------------------------}



begin


end;

{--------------------------------------------------------------}

{ Report a Duplicate Identifier }

procedure Duplicate(n: string);

begin

Abort('Duplicate Identifier ' + n);

end;



{--------------------------------------------------------------}

{ Check to Make Sure the Current Token is an Identifier }

procedure CheckIdent;

begin

if Token <> 'x' then Expected('Identifier');

end;

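{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;

{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;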

{--------------------------------------------------------------}

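{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   { Search backward from entry n; i reaches 0 if s is not found }
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;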



{--------------------------------------------------------------}

{ Locate a Symbol in Table }

{ Returns the index of the entry. Zero if not present. }

function Locate(N: Symbol): integer;

begin

Locate := Lookup(@ST, n, NEntry);

end;

{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;
begin
   InTable := Lookup(@ST, n, NEntry) <> 0;
end;

{--------------------------------------------------------------}

{ Check to See if an Identifier is in the Symbol Table }

{ Report an error if it's not. }

procedure CheckTable(N: Symbol);

begin

if not InTable(N) then Undefined(N);

end;




{--------------------------------------------------------------}

{ Check the Symbol Table for a Duplicate Identifier }

{ Report an error if identifier is already in table. }

procedure CheckDup(N: Symbol);

begin

if InTable(N) then Duplicate(N);

end;

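{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);
begin
   CheckDup(N);
   if NEntry = MaxEntry then Abort('Symbol Table Full');
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;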



{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
   SkipWhite;
   if Not IsAlpha(Look) then Expected('Identifier');
   Token := 'x';
   Value := '';
   repeat
      Value := Value + UpCase(Look);
      GetChar;
   until not IsAlNum(Look);
end;

{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
   SkipWhite;
   if not IsDigit(Look) then Expected('Number');
   Token := '#';
   Value := '';
   repeat
      Value := Value + Look;
      GetChar;
   until not IsDigit(Look);
end;

{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
   SkipWhite;
   Token := Look;
   Value := Look;
   GetChar;
end;



{--------------------------------------------------------------}

{ Get the Next Input Token }

procedure Next;

begin

SkipWhite;

if IsAlpha(Look) then GetName

else if IsDigit(Look) then GetNum

else GetOp;

end;

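{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
   if Token = 'x' then
      Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
   Next;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;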

{---------------------------------------------------------------}


procedure Clear;

begin

EmitLn('CLR D0');

end;




{---------------------------------------------------------------}


procedure Negate;

begin

EmitLn('NEG D0');

end;

{---------------------------------------------------------------}


procedure NotIt;

begin

EmitLn('NOT D0');

end;

{---------------------------------------------------------------}


procedure LoadConst(n: string);

begin

Emit('MOVE #');

WriteLn(n, ',D0');

end;



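{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;

{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;

{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;

{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;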

{---------------------------------------------------------------}


procedure SetEqual;

begin

EmitLn('SEQ D0');

EmitLn('EXT D0');

end;



{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
   EmitLn('SNE D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
   EmitLn('SLT D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
   EmitLn('SGT D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was <= }

procedure SetLessOrEqual;
begin
   EmitLn('SGE D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
   EmitLn('SLE D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: string);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;

{---------------------------------------------------------------}
{ Branch Unconditional }

procedure Branch(L: string);
begin
   EmitLn('BRA ' + L);
end;

{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);
begin
   EmitLn('TST D0');
   EmitLn('BEQ ' + L);
end;




{---------------------------------------------------------------}


procedure ReadIt(Name: string);

begin

EmitLn('BSR READ');

Store(Name);

end;

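{--------------------------------------------------------------}
{ Write from Primary Register }

procedure WriteIt;
begin
   EmitLn('BSR WRITE');
end;

{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
end;

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
   PostLabel('MAIN');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;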

{---------------------------------------------------------------}

{ Allocate Storage for a Static Variable }

procedure Allocate(Name, Val: string);

begin

WriteLn(Name, ':', TAB, 'DC ', Val);

end;




{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;

begin

if Token = '(' then begin

Next;

BoolExpression;

MatchString(')');

end

else begin

if Token = 'x' then

LoadVar(Value)

else if Token = '#' then

LoadConst(Value)

else Expected('Math Factor');

Next;

end;

end;



{--------------------------------------------------------------}


procedure Multiply;

begin

Next;

Factor;

PopMul;

end;

{-------------------------------------------------------------}


procedure Divide;

begin

Next;

Factor;

PopDiv;

end;




{---------------------------------------------------------------}


procedure Term;

begin

Factor;

while IsMulop(Token) do begin

Push;

case Token of

'*': Multiply;

'/': Divide;

end;

end;

end;

{--------------------------------------------------------------}


procedure Add;

begin

Next;

Term;

PopAdd;

end;



{-------------------------------------------------------------}


procedure Subtract;

begin

Next;

Term;

PopSub;

end;



{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;

begin

if IsAddop(Token) then

Clear

else

Term;

while IsAddop(Token) do begin

Push;

case Token of

'+': Add;

'-': Subtract;

end;

end;

end;

{---------------------------------------------------------------}

{ Get Another Expression and Compare }

procedure CompareExpression;

begin

Expression;

PopCompare;

end;



{---------------------------------------------------------------}

{ Get The Next Expression and Compare }

procedure NextExpression;

begin

Next;

CompareExpression;

end;

{---------------------------------------------------------------}


procedure Equal;

begin

NextExpression;

SetEqual;

end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;

begin

NextExpression;

SetLessOrEqual;

end;




{---------------------------------------------------------------}


procedure NotEqual;

begin

NextExpression;

SetNEqual;

end;

{---------------------------------------------------------------}


procedure Less;

begin

Next;

case Token of

'=': LessOrEqual;

'>': NotEqual;

else begin

CompareExpression;

SetLess;

end;

end;

end;



{---------------------------------------------------------------}


procedure Greater;

begin

Next;

if Token = '=' then begin

NextExpression;

SetGreaterOrEqual;

end

else begin

CompareExpression;

SetGreater;

end;

end;




{---------------------------------------------------------------}


procedure Relation;

begin

Expression;

if IsRelop(Token) then begin

Push;

case Token of

'=': Equal;

'<': Less;

'>': Greater;

end;

end;

end;



{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;

begin

if Token = '!' then begin

Next;

Relation;

NotIt;

end

else

Relation;

end;




{---------------------------------------------------------------}


procedure BoolTerm;

begin

NotFactor;

while Token = '&' do begin

Push;

Next;

NotFactor;

PopAnd;

end;

end;

{--------------------------------------------------------------}


procedure BoolOr;

begin

Next;

BoolTerm;

PopOr;

end;



{--------------------------------------------------------------}


procedure BoolXor;

begin

Next;

BoolTerm;

PopXor;

end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;

begin

BoolTerm;

while IsOrOp(Token) do begin

Push;

case Token of

'|': BoolOr;

'~': BoolXor;

end;

end;

end;



{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: string;

begin

CheckTable(Value);

Name := Value;

Next;

MatchString('=');

BoolExpression;

Store(Name);

end;



{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
   Next;
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin      { 'l' is the token code for ELSE }
      Next;
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;




{--------------------------------------------------------------}


procedure DoWhile;

var L1, L2: string;

begin

Next;

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

BoolExpression;

BranchFalse(L2);

Block;

MatchString('ENDWHILE');

Branch(L1);

PostLabel(L2);

end;



{--------------------------------------------------------------}

{ Read a Single Variable }

procedure ReadVar;

begin

CheckIdent;

CheckTable(Value);

ReadIt(Value);

Next;

end;

{--------------------------------------------------------------}


procedure DoRead;

begin

Next;

MatchString('(');

ReadVar;

while Token = ',' do begin

Next;

ReadVar;

end;

MatchString(')');

end;




{--------------------------------------------------------------}


procedure DoWrite;

begin

Next;

MatchString('(');

Expression;

WriteIt;

while Token = ',' do begin

Next;

Expression;

WriteIt;

end;

MatchString(')');

end;



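{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   { 'e' ends a block (END, ENDIF, ENDWHILE); 'l' is ELSE }
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;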




{--------------------------------------------------------------}


procedure Alloc;

begin

Next;

if Token <> 'x' then Expected('Variable Name');

CheckDup(Value);

AddEntry(Value, 'v');

Allocate(Value, '0');

Next;

end;



{--------------------------------------------------------------}


procedure TopDecls;

begin

Scan;

while Token = 'v' do

Alloc;

while Token = ',' do

Alloc;

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

GetChar;

Next;

end;




{--------------------------------------------------------------}

{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.

{--------------------------------------------------------------}


Part 12 - Miscellany


INTRODUCTION

This installment is another one of those excursions into side alleys that don't seem to fit into the mainstream of this tutorial series. As I mentioned last time, it was while I was writing this installment that I realized some changes had to be made to the compiler structure. So I had to digress from this digression long enough to develop the new structure and show it to you.

Now that that's behind us, I can tell you what I set out to in the first place. This shouldn't take long, and then we can get back into the mainstream.

Several people have asked me about things that other languages provide, but so far I haven't addressed in this series. The two biggies are semicolons and comments. Perhaps you've wondered about them, too, and wondered how things would change if we had to deal with them. Just so you can proceed with what's to come, without being bothered by that nagging feeling that something is missing, we'll address such issues here.




SEMICOLONS

Ever since the introduction of Algol, semicolons have been a part of almost every modern language. We've all used them to the point that they are taken for granted. Yet I suspect that more compilation errors have occurred due to misplaced or missing semicolons than any other single cause. And if we had a penny for every extra keystroke programmers have used to type the little rascals, we could pay off the national debt.

Having been brought up with FORTRAN, it took me a long time to get used to using semicolons, and to tell the truth I've never quite understood why they were necessary. Since I program in Pascal, and since the use of semicolons in Pascal is particularly tricky, that one little character is still by far my biggest source of errors.

When I began developing KISS, I resolved to question EVERY construct in other languages, and to try to avoid the most common problems that occur with them. That puts the semicolon very high on my hit list.

To understand the role of the semicolon, you have to look at a little history.

Early programming languages were line-oriented. In FORTRAN, for example, various parts of the statement had specific columns or fields that they had to appear in. Since some statements were too long for one line, the "continuation card" mechanism was provided to let the compiler know that a given card was still part of the previous line. The mechanism survives to this day, even though punched cards are now things of the distant past.

When other languages came along, they also adopted various mechanisms for dealing with multiple-line statements. BASIC is a good example. It's important to recognize, though, that the FORTRAN mechanism was not so much required by the line orientation of that language, as by the column-orientation. In those versions of FORTRAN where free-form input is permitted, it's no longer needed.

When the fathers of Algol introduced that language, they wanted to get away from line-oriented programs like FORTRAN and BASIC, and allow for free-form input. This included the possibility of stringing multiple statements on a single line, as in

a=b; c=d; e=e+1;



In cases like this, the semicolon is almost REQUIRED. The same line, without the semicolons, just looks "funny":

a=b c= d e=e+1

I suspect that this is the major ... perhaps ONLY ... reason for semicolons: to keep programs from looking funny.

But the idea of stringing multiple statements together on a single line is a dubious one at best. It's not very good programming style, and harks back to the days when it was considered important to conserve cards. In these days of CRT's and indented code, the clarity of programs is far better served by keeping statements separate. It's still nice to have the OPTION of multiple statements, but it seems a shame to keep programmers in slavery to the semicolon, just to keep that one rare case from "looking funny."

When I started in with KISS, I tried to keep an open mind. I decided that I would use semicolons when it became necessary for the parser, but not until then. I figured this would happen just about the time I added the ability to spread statements over multiple lines. But, as you can see, that never happened. The TINY compiler is perfectly happy to parse the most complicated statement, spread over any number of lines, without semicolons.

Still, there are people who have used semicolons for so long, they feel naked without them. I'm one of them. Once I had KISS defined sufficiently well, I began to write a few sample programs in the language. I discovered, somewhat to my horror, that I kept putting semicolons in anyway. So now I'm facing the prospect of a NEW rash of compiler errors, caused by UNWANTED semicolons. Phooey!

Perhaps more to the point, there are readers out there who are designing their own languages, which may include semicolons, or who want to use the techniques of these tutorials to compile conventional languages like C. In either case, we need to be able to deal with semicolons.




SYNTACTIC SUGAR

This whole discussion brings up the issue of "syntactic sugar" ... constructs that are added to a language, not because they are needed, but because they help make the programs look right to the programmer. After all, it's nice to have a small, simple compiler, but it would be of little use if the resulting language were cryptic and hard to program. The language FORTH comes to mind (a premature OUCH! for the barrage I know that one's going to fetch me). If we can add features to the language that make the programs easier to read and understand, and if those features help keep the programmer from making errors, then we should do so. Particularly if the constructs don't add much to the complexity of the language or its compiler.

The semicolon could be considered an example, but there are plenty of others, such as the 'THEN' in an IF-statement, the 'DO' in a WHILE-statement, and even the 'PROGRAM' statement, which I came within a gnat's eyelash of leaving out of TINY. None of these tokens add much to the syntax of the language ... the compiler can figure out what's going on without them. But some folks feel that they DO add to the readability of programs, and that can be very important.

There are two schools of thought on this subject, which are well represented by two of our most popular languages, C and Pascal.

To the minimalists, all such sugar should be left out. They argue that it clutters up the language and adds to the number of keystrokes programmers must type. Perhaps more importantly, every extra token or keyword represents a trap lying in wait for the inattentive programmer. If you leave out a token, misplace it, or misspell it, the compiler will get you. So these people argue that the best approach is to get rid of such things. These folks tend to like C, which has a minimum of unnecessary keywords and punctuation.

Those from the other school tend to like Pascal. They argue that having to type a few extra characters is a small price to pay for legibility. After all, humans have to read the programs, too. Their best argument is that each such construct is an opportunity to tell the compiler that you really mean for it to do what you said to. The sugary tokens serve as useful landmarks to help you find your way.



The differences are well represented by the two languages. The most oft-heard complaint about C is that it is too forgiving. When you make a mistake in C, the erroneous code is too often another legal C construct. So the compiler just happily continues to compile, and leaves you to find the error during debug. I guess that's why debuggers are so popular with C programmers.

On the other hand, if a Pascal program compiles, you can be pretty sure that the program will do what you told it. If there is an error at run time, it's probably a design error.

The best example of useful sugar is the semicolon itself. Consider the code fragment:

a=1+(2*b+c) b...

Since there is no operator connecting the token 'b' with the rest of the statement, the compiler will conclude that the expression ends with the ')', and the 'b' is the beginning of a new statement. But suppose I have simply left out the intended operator, and I really want to say:

a=1+(2*b+c)*b...

In this case the compiler will get an error, all right, but it won't be very meaningful since it will be expecting an '=' sign after the 'b' that really shouldn't be there.

If, on the other hand, I include a semicolon after the 'b', THEN there can be no doubt where I intend the statement to end. Syntactic sugar, then, can serve a very useful purpose by providing some additional insurance that we remain on track.
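To make the difference in error reporting concrete, here is a minimal sketch in Python (not the Pascal of TINY itself). It is a toy recognizer written only for this illustration: every character is its own token, and the names `parse`, `error_of`, and `require_semi` are hypothetical, not part of the compiler. It only shows where each grammar notices the missing operator.

```python
class ParseError(Exception):
    """Raised when the toy parser cannot continue."""


def parse(src, require_semi):
    """Recognize a run of assignments; return how many statements were found."""
    toks = [c for c in src if not c.isspace()]  # one character per token
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else ''

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise ParseError(f"expected {ch!r}, found {peek()!r} at token {pos}")
        pos += 1

    def factor():
        nonlocal pos
        if peek() == '(':
            pos += 1
            expression()
            expect(')')
        elif peek().isalnum():
            pos += 1
        else:
            raise ParseError(f"expected a factor, found {peek()!r} at token {pos}")

    def expression():
        nonlocal pos
        factor()
        # The expression simply ends when no operator follows.
        while peek() in {'+', '-', '*', '/'}:
            pos += 1
            factor()

    count = 0
    while peek():
        if not peek().isalpha():
            raise ParseError(f"expected a name, found {peek()!r} at token {pos}")
        pos += 1              # the assignment target
        expect('=')
        expression()
        if require_semi:
            expect(';')       # the "sugar" that pins down the statement end
        count += 1
    return count


def error_of(src, require_semi):
    """Return the parser's error message, or None if the source is accepted."""
    try:
        parse(src, require_semi)
        return None
    except ParseError as exc:
        return str(exc)


# Without semicolons the complaint arrives late, after the stray 'b'.
print(error_of("a=1+(2*b+c) b", require_semi=False))
# With semicolons the complaint points at the stray 'b' itself.
print(error_of("a=1+(2*b+c) b;", require_semi=True))
```

In this model the semicolon-free grammar reports a missing '=' one token after the real mistake, while the version that insists on a semicolon stops at the exact token where the operator was omitted.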

I find myself somewhere in the middle of all this. I tend to favor the Pascal-ers' view ... I'd much rather find my bugs at compile time than at run time. But I also hate to just throw verbosity in for no apparent reason, as in COBOL. So far I've consistently left most of the Pascal sugar out of KISS/TINY. But I certainly have no strong feelings either way, and I also can see the value of sprinkling a little sugar around just for the extra insurance that it brings. If you like this latter approach, things like that are easy to add. Just remember that, like the semicolon, each item of sugar is something that can potentially cause a compile error by its omission.




DEALING WITH SEMICOLONS

There are two distinct ways in which semicolons are used in popular languages. In Pascal, the semicolon is regarded as a statement SEPARATOR. No semicolon is required after the last statement in a block. The syntax is:

<block> ::= <statement> ( ';' <statement>)*

<statement> ::= <assignment> | <if> | <while> ... | null

(The null statement is IMPORTANT!)

Pascal also defines some semicolons in other places, such as after the PROGRAM statement.

In C and Ada, on the other hand, the semicolon is considered a statement TERMINATOR, and follows all statements (with some embarrassing and confusing exceptions). The syntax for this is simply:

<block> ::= ( <statement> ';')*

Of the two syntaxes, the Pascal one seems on the face of it more rational, but experience has shown that it leads to some strange difficulties. People get so used to typing a semicolon after every statement that they tend to type one after the last statement in a block, also. That usually doesn't cause any harm ... it just gets treated as a null statement. Many Pascal programmers, including yours truly, do just that. But there is one place you absolutely CANNOT type a semicolon, and that's right before an ELSE. This little gotcha has cost me many an extra compilation, particularly when the ELSE is added to existing code. So the C/Ada choice turns out to be better. Apparently Niklaus Wirth thinks so, too: In his Modula 2, he abandoned the Pascal approach.
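The two grammars can be compared side by side with a small sketch. This is Python rather than the Pascal used in TINY, and it works on an already tokenized list in which each statement is a single token and the block ends with the token END. The function names `parse_separated` and `parse_terminated` are hypothetical and exist only for this illustration.

```python
def parse_separated(tokens):
    """Pascal style: <block> ::= <statement> (';' <statement>)*, null allowed."""
    pos = 0
    stmts = []

    def statement():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] not in (';', 'END'):
            stmts.append(tokens[pos])
            pos += 1
        else:
            stmts.append(None)  # the null statement

    statement()
    while pos < len(tokens) and tokens[pos] == ';':
        pos += 1
        statement()
    if pos >= len(tokens) or tokens[pos] != 'END':
        raise SyntaxError("';' or END expected")
    return stmts


def parse_terminated(tokens):
    """C/Ada style: <block> ::= (<statement> ';')*."""
    pos = 0
    stmts = []
    while pos < len(tokens) and tokens[pos] != 'END':
        if tokens[pos] == ';':
            raise SyntaxError("statement expected")
        stmts.append(tokens[pos])
        pos += 1
        if pos >= len(tokens) or tokens[pos] != ';':
            raise SyntaxError("';' expected")
        pos += 1
    if pos >= len(tokens):
        raise SyntaxError("END expected")
    return stmts


# A trailing semicolon is harmless in the separator grammar:
# it just produces a null statement (shown here as None).
print(parse_separated("a ; b ; END".split()))   # ['a', 'b', None]
print(parse_terminated("a ; b ; END".split()))  # ['a', 'b']
```

The sketch leaves out the ELSE case described above; it is only meant to show why the null statement matters to the separator form and why the terminator form needs no such special case.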

Given either of these two syntaxes, it's an easy matter (now that we've reorganized the parser!) to add these features to our parser. Let's take the last case first, since it's simpler.



To begin, I've made things easy by introducing a new recognizer:

{--------------------------------------------------------------}

{ Match a Semicolon }

procedure Semi;

begin

MatchString(';');

end;

{--------------------------------------------------------------}

This procedure works very much like our old Match. It insists on finding a semicolon as the next token. Having found it, it skips to the next one.




Since a semicolon follows a statement, procedure Block is almost the only one we need to change:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
       'x': Assignment;
      end;
      Semi;
      Scan;
   end;
end;

{--------------------------------------------------------------}



Note carefully the subtle change in the case statement. The call to Assignment is now guarded by a test on Token. This is to avoid calling Assignment when the token is a semicolon (which could happen if the statement is null).

Since declarations are also statements, we also need to add a call to Semi within procedure TopDecls:

{--------------------------------------------------------------}


procedure TopDecls;

begin

Scan;

while Token = 'v' do begin

Alloc;

while Token = ',' do

Alloc;

Semi;

end;

end;

{--------------------------------------------------------------}




Finally, we need one for the PROGRAM statement:

{--------------------------------------------------------------}

{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Semi;
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.

{--------------------------------------------------------------}

It's as easy as that. Try it with a copy of TINY and see how you like it.



The Pascal version is a little trickier, but it still only requires minor changes, and those only to procedure Block. To keep things as simple as possible, let's split the procedure into two parts. The following procedure handles just one statement:

{--------------------------------------------------------------}

{ Parse and Translate a Single Statement }

procedure Statement;

begin

Scan;

case Token of

'i': DoIf;

'w': DoWhile;

'R': DoRead;

'W': DoWrite;

'x': Assignment;

end;

end;

{--------------------------------------------------------------}




Using this procedure, we can now rewrite Block like this:

{--------------------------------------------------------------}


procedure Block;

begin

Statement;

while Token = ';' do begin

Next;

Statement;

end;

end;

{--------------------------------------------------------------}

That sure didn't hurt, did it? We can now parse semicolons in Pascal-like fashion.



A COMPROMISE

Now that we know how to deal with semicolons, does that mean that I'm going to put them in KISS/TINY? Well, yes and no. I like the extra sugar and the security that comes with knowing for sure where the ends of statements are. But I haven't changed my dislike for the compilation errors associated with semicolons.

So I have what I think is a nice compromise: Make them OPTIONAL!

Consider the following version of Semi:

{--------------------------------------------------------------}

{ Match a Semicolon }

procedure Semi;

begin

if Token = ';' then Next;

end;

{--------------------------------------------------------------}

This procedure will ACCEPT a semicolon whenever it is called, but it won't INSIST on one. That means that when you choose to use semicolons, the compiler will use the extra information to help keep itself on track. But if you omit one (or omit them all) the compiler won't complain. The best of both worlds.

Put this procedure in place in the first version of your program (the one for C/Ada syntax), and you have the makings of TINY Version 1.2.
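The effect of the optional Semi can be checked with a small model. As before, this is a Python sketch rather than TINY's Pascal, it treats each statement as a single token, and `parse_block` and `semi` are hypothetical names that merely mirror the shape of Block and Semi above.

```python
def parse_block(tokens):
    """Terminator-style block in which every semicolon is optional."""
    pos = 0
    stmts = []

    def semi():
        # Accept a semicolon if one is present, but do not insist on it.
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == ';':
            pos += 1

    while pos < len(tokens) and tokens[pos] != 'END':
        if tokens[pos] != ';':        # guard, like the 'x': case in Block
            stmts.append(tokens[pos])
            pos += 1
        semi()
    if pos >= len(tokens):
        raise SyntaxError("END expected")
    return stmts


# All four spellings yield the same two statements.
for text in ("a ; b ; END", "a b END", "a ; b END", "a ; ; b END"):
    print(text, "->", parse_block(text.split()))
```

Each pass through the loop consumes at least one token, either a statement or a lone semicolon, so a run of stray semicolons is skipped as null statements and the loop cannot stall.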




COMMENTS

Up until now I have carefully avoided the subject of comments. You would think that this would be an easy subject ... after all, the compiler doesn't have to deal with comments at all; it should just ignore them. Well, sometimes that's true.

Comments can be just about as easy or as difficult as you choose to make them. At one extreme, we can arrange things so that comments are intercepted almost the instant they enter the compiler. At the other, we can treat them as lexical elements. Things tend to get interesting when you consider things like comment delimiters contained in quoted strings.

SINGLE-CHARACTER DELIMITERS

Here's an example. Suppose we assume the Turbo Pascal standard and use curly braces for comments. In this case we have single-character delimiters, so our parsing is a little easier.

One approach is to strip the comments out the instant we encounter them in the input stream; that is, right in procedure GetChar. To do this, first change the name of GetChar to something else, say GetCharX. (For the record, this is going to be a TEMPORARY change, so best not do this with your only copy of TINY. I assume you understand that you should always do these experiments with a working copy.)

Now, we're going to need a procedure to skip over comments. So key in the following one:

{--------------------------------------------------------------}

{ Skip A Comment Field }

procedure SkipComment;

begin

while Look <> '}' do

GetCharX;

GetCharX;

end;

{--------------------------------------------------------------}

Clearly, what this procedure is going to do is to simply read and discard characters from the input stream, until it finds a right curly brace. Then it reads one more character and returns it in Look.




Now we can write a new version of GetChar that calls SkipComment to strip out comments:

{--------------------------------------------------------------}

{ Get Character from Input Stream }

{ Skip Any Comments }

procedure GetChar;

begin

GetCharX;

if Look = '{' then SkipComment;

end;

{--------------------------------------------------------------}

Code this up and give it a try. You'll find that you can, indeed, bury comments anywhere you like. The comments never even get into the parser proper ... every call to GetChar just returns any character that's NOT part of a comment.

As a matter of fact, while this approach gets the job done, and may even be perfectly satisfactory for you, it does its job a little TOO well. First of all, most programming languages specify that a comment should be treated like a space, so that comments aren't allowed to be embedded in, say, variable names. This current version doesn't care WHERE you put comments.

Second, since the rest of the parser can't even receive a '{' character, you will not be allowed to put one in a quoted string.

Before you turn up your nose at this simplistic solution, though, I should point out that as respected a compiler as Turbo Pascal also won't allow a '{' in a quoted string. Try it. And as for embedding a comment in an identifier, I can't imagine why anyone would want to do such a thing, anyway, so the question is moot. For 99% of all applications, what I've just shown you will work just fine.
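Both side effects are easy to see in a simplified model of character-level stripping. The following Python sketch is not the GetChar/SkipComment pair itself; `strip_comments` is a hypothetical helper that filters a whole string at once, under the assumption that comments do not nest.

```python
def strip_comments(src):
    """Drop every '{ ... }' run before the parser ever sees a character."""
    out = []
    in_comment = False
    for ch in src:
        if in_comment:
            if ch == '}':
                in_comment = False   # the comment ends; resume copying
        elif ch == '{':
            in_comment = True        # start discarding characters
        else:
            out.append(ch)
    return ''.join(out)


# A comment inside an identifier silently glues the two halves together.
print(strip_comments("a{x}b=1"))
# A '{' inside a quoted string is mistaken for the start of a comment.
print(strip_comments("s='{'; t=2}"))
```

The first call returns the text with the identifier halves joined, and the second loses everything from the brace inside the string up to the next closing brace, which is exactly the pair of limitations described above.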



But, if you want to be picky about it and stick to the conventional treatment, then we need to move the interception point downstream a little further.

To do this, first change GetChar back to the way it was and change the name called in SkipComment. Then, let's add the left brace as a possible whitespace character:

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB, CR, LF, '{'];

end;

{--------------------------------------------------------------}




Now, we can deal with comments in procedure SkipWhite:

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do begin

if Look = '{' then

SkipComment

else

GetChar;

end;

end;

{--------------------------------------------------------------}

Note that SkipWhite is written so that we will skip over any combination of whitespace characters and comments, in one call.

OK, give this one a try, too. You'll find that it will let a comment serve to delimit tokens. It's worth mentioning that this approach also gives us the ability to handle curly braces within quoted strings, since within such strings we will not be testing for or skipping over whitespace.
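To see the difference from the earlier version, here is a small Python sketch (again not the tutorial's Pascal) of a scanner that treats a comment like a blank. The name tokens and the handling of quoted strings are assumptions made for the illustration; the real compiler works one lookahead character at a time.

```python
WHITE = " \t\r\n"


def tokens(source: str) -> list:
    """Split into tokens; a {...} comment counts as white space."""
    result, current, i = [], "", 0
    while i < len(source):
        ch = source[i]
        if ch == "'":                       # quoted string: copy verbatim
            end = source.find("'", i + 1)
            end = len(source) - 1 if end == -1 else end
            current += source[i:end + 1]
            i = end + 1
            continue
        if ch == "{":                       # comment: skip to closing brace
            end = source.find("}", i + 1)
            i = len(source) if end == -1 else end + 1
            ch = " "                        # then treat it like a blank
        else:
            i += 1
        if ch in WHITE:
            if current:
                result.append(current)
                current = ""
        else:
            current += ch
    if current:
        result.append(current)
    return result


print(tokens("to{hidden}tal = 1"))   # the comment now splits the identifier
print(tokens("s = 'a{b}c'"))         # braces inside the string survive
```

Because braces are only looked for where white space is allowed, the quoted string comes through untouched, while a comment in the middle of a name now separates two tokens.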

There's one last item to deal with: Nested comments. Some programmers like the idea of nesting comments, since it allows you to comment out code during debugging. The code I've given here won't allow that and, again, neither will Turbo Pascal.



But the fix is incredibly easy. All we need to do is to make SkipComment recursive:

{--------------------------------------------------------------}

{ Skip A Comment Field }

procedure SkipComment;

begin

while Look <> '}' do begin

GetChar;

if Look = '{' then SkipComment;

end;

GetChar;

end;

{--------------------------------------------------------------}

That does it. As sophisticated a comment-handler as you'll ever need.
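Nesting can also be pictured with a depth counter instead of recursion. The following Python sketch is an alternative formulation, not the routine from the text, and the name skip_nested_comment is invented for the example. It may be worth trying the Pascal version on inputs where one nested comment directly follows another, such as the one below, and comparing the results.

```python
def skip_nested_comment(text: str, start: int) -> int:
    """Given text[start] == '{', return the index just past the matching '}'."""
    depth = 0
    i = start
    while i < len(text):
        if text[i] == "{":
            depth += 1               # one level deeper
        elif text[i] == "}":
            depth -= 1               # one level back out
            if depth == 0:
                return i + 1         # the outermost comment is closed
        i += 1
    raise ValueError("unterminated comment")


src = "{outer {inner} {second} tail}x"
print(src[skip_nested_comment(src, 0):])   # what is left after the comment
```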




MULTI-CHARACTER DELIMITERS

That's all well and good for cases where a comment is delimited by single characters, but what about the cases such as C or standard Pascal, where two characters are required? Well, the principles are still the same, but we have to change our approach quite a bit. I'm sure it won't surprise you to learn that things get harder in this case.

For the multi-character situation, the easiest thing to do is to intercept the left delimiter back at the GetChar stage. We can "tokenize" it right there, replacing it by a single character.

Let's assume we're using the C delimiters '/*' and '*/'. First, we need to go back to the "GetCharX" approach. In yet another copy of your compiler, rename GetChar to GetCharX and then enter the following new procedure GetChar:

{--------------------------------------------------------------}

{ Read New Character. Intercept '/*' }

procedure GetChar;

begin

if TempChar <> ' ' then begin

Look := TempChar;

TempChar := ' ';

end

else begin

GetCharX;

if Look = '/' then begin

Read(TempChar);

if TempChar = '*' then begin

Look := '{';

TempChar := ' ';

end;

end;

end;

end;

{--------------------------------------------------------------}




As you can see, what this procedure does is to intercept every occurrence of '/'. It then examines the NEXT character in the stream. If the character is a '*', then we have found the beginning of a comment, and GetChar will return a single character replacement for it. (For simplicity, I'm using the same '{' character as I did for Pascal. If you were writing a C compiler, you'd no doubt want to pick some other character that's not used elsewhere in C. Pick anything you like ... even $FF, anything that's unique.)

If the character following the '/' is NOT a '*', then GetChar tucks it away in the new global TempChar, and returns the '/'.

Note that you need to declare this new variable and initialize it to ' '. I like to do things like that using the Turbo "typed constant" construct:

const TempChar: char = ' ';

Now we need a new version of SkipComment:

{--------------------------------------------------------------}

{ Skip A Comment Field }

procedure SkipComment;

begin

repeat

repeat

GetCharX;

until Look = '*';

GetCharX;

until Look = '/';

GetChar;

end;

{--------------------------------------------------------------}



A few things to note: first of all, function IsWhite and procedure SkipWhite don't need to be changed, since GetChar returns the '{' token. If you change that token character, then of course you also need to change the character in those two routines.

Second, note that SkipComment doesn't call GetChar in its loop, but GetCharX. That means that the trailing '/' is not intercepted and is seen by SkipComment. Third, although GetChar is the procedure doing the work, we can still deal with the comment characters embedded in a quoted string, by calling GetCharX instead of GetChar while we're within the string. Finally, note that we can again provide for nested comments by adding a single statement to SkipComment, just as we did before.
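The interplay of GetCharX, GetChar, TempChar and SkipComment can be sketched as a small reader object. This is a Python illustration under stated assumptions, not the Pascal code: the class Reader, its method names and the helper strip_comments are invented here, and `pending` stands in for TempChar (using None rather than a blank as the "empty" marker, so that a real blank after a '/' is not lost).

```python
class Reader:
    COMMENT_START = "{"          # stand-in token for '/*', as in the text

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.pending = None      # plays the role of TempChar

    def _raw(self):
        """Like GetCharX: next raw character, or '' at end of input."""
        ch = self.text[self.pos:self.pos + 1]
        self.pos += 1
        return ch

    def get_char(self):
        """Like the new GetChar: turn '/*' into a single token character."""
        if self.pending is not None:
            ch, self.pending = self.pending, None
            return ch
        ch = self._raw()
        if ch == "/":
            nxt = self._raw()
            if nxt == "*":
                return self.COMMENT_START
            self.pending = nxt   # not a comment: hand it out next time
        return ch

    def skip_comment(self):
        """Consume raw characters up to and including the closing '*/'."""
        prev = ""
        while True:
            ch = self._raw()
            if ch == "":
                raise ValueError("unterminated comment")
            if prev == "*" and ch == "/":
                return
            prev = ch


def strip_comments(text):
    reader, out = Reader(text), []
    while True:
        ch = reader.get_char()
        if ch == "":
            break
        if ch == Reader.COMMENT_START:
            reader.skip_comment()
        else:
            out.append(ch)
    return "".join(out)


print(strip_comments("a /* c */ b / 2"))
```

As in the text, a genuine '{' in the input would be mistaken for the token, which is why a character that cannot occur in the source language is the safer choice.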




ONE-SIDED COMMENTS

So far I've shown you how to deal with any kind of comment delimited on the left and the right. That only leaves the one-sided comments like those in assembler language or in Ada, that are terminated by the end of the line. In a way, that case is easier. The only procedure that would need to be changed is SkipComment, which must now terminate at the newline characters:

{--------------------------------------------------------------}

{ Skip A Comment Field }

procedure SkipComment;

begin

repeat

GetCharX;

until Look = CR;

GetChar;

end;

{--------------------------------------------------------------}

If the leading character is a single one, as in the ';' of assembly language, then we're essentially done. If it's a two-character token, as in the '--' of Ada, we need only modify the tests within GetChar. Either way, it's an easier problem than the balanced case.
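For comparison, the one-sided case reduces to cutting each line at the comment marker. The Python sketch below is only an illustration (the function name and the `marker` parameter are assumptions), and it deliberately ignores quoted strings.

```python
def strip_line_comments(source: str, marker: str = ";") -> str:
    """Remove everything from `marker` to the end of each line."""
    lines = []
    for line in source.splitlines():
        cut = line.find(marker)
        lines.append(line if cut == -1 else line[:cut])
    return "\n".join(lines)


print(strip_line_comments("x = 1 ; set x\ny = 2"))
print(strip_line_comments("a := b -- copy\nc := a", marker="--"))
```

The same routine serves both the single-character ';' and the two-character '--' simply by changing the marker, which mirrors the remark that only the test for the leading delimiter differs.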



CONCLUSION

At this point we now have the ability to deal with both comments and semicolons, as well as other kinds of syntactic sugar. I've shown you several ways to deal with each, depending upon the convention desired. The only issue left is: which of these conventions should we use in KISS/TINY?

For the reasons that I've given as we went along, I'm choosing the following:

(1) Semicolons are TERMINATORS, not separators

(2) Semicolons are OPTIONAL

(3) Comments are delimited by curly braces

(4) Comments MAY be nested

Put the code corresponding to these cases into your copy of TINY. You now have TINY Version 1.2.

Now that we have disposed of these sideline issues, we can finally get back into the mainstream. In the next installment, we'll talk about procedures and parameter passing, and we'll add these important features to TINY. See you then.




Part 13 - Procedures

INTRODUCTION

At last we get to the good part!

At this point we've studied almost all the basic features of compilers and parsing. We have learned how to translate arithmetic expressions, Boolean expressions, control constructs, data declarations, and I/O statements. We have defined a language, TINY 1.3, that embodies all these features, and we have written a rudimentary compiler that can translate them. By adding some file I/O we could indeed have a working compiler that could produce executable object files from programs written in TINY. With such a compiler, we could write simple programs that could read integer data, perform calculations with it, and output the results.

That's nice, but what we have is still only a toy language. We can't read or write even a single character of text, and we still don't have procedures.

It's the features to be discussed in the next couple of installments that separate the men from the toys, so to speak. "Real" languages have more than one data type, and they support procedure calls. More than any others, it's these two features that give a language much of its character and personality. Once we have provided for them, our languages, TINY and its successors, will cease to be toys and will take on the character of real languages, suitable for serious programming jobs.

For several installments now, I've been promising you sessions on these two important subjects. Each time, other issues came up that required me to digress and deal with them. Finally, we've been able to put all those issues to rest and can get on with the mainstream of things. In this installment, I'll cover procedures. Next time, we'll talk about the basic data types.



ONE LAST DIGRESSION

This has been an extraordinarily difficult installment for me to write. The reason has nothing to do with the subject itself ... I've known what I wanted to say for some time, and in fact I presented most of this at Software Development '89, back in February. It has more to do with the approach. Let me explain.

When I first began this series, I told you that we would use several "tricks" to make things easy, and to let us learn the concepts without getting too bogged down in the details. Among these tricks was the idea of looking at individual pieces of a compiler at a time, i.e. performing experiments using the Cradle as a base. When we studied expressions, for example, we dealt with only that part of compiler theory. When we studied control structures, we wrote a different program, still based on the Cradle, to do that part. We only incorporated these concepts into a complete language fairly recently. These techniques have served us very well indeed, and led us to the development of a compiler for TINY version 1.3.

When I first began this session, I tried to build upon what we had already done, and just add the new features to the existing compiler. That turned out to be a little awkward and tricky ... much too much to suit me.

I finally figured out why. In this series of experiments, I had abandoned the very useful techniques that had allowed us to get here, and without meaning to I had switched over into a new method of working, that involved incremental changes to the full TINY compiler.

You need to understand that what we are doing here is a little unique. There have been a number of articles, such as the Small C articles by Cain and Hendrix, that presented finished compilers for one language or another. This is different. In this series of tutorials, you are watching me design and implement both a language and a compiler, in real time.

In the experiments that I've been doing in preparation for this article, I was trying to inject the changes into the TINY compiler in such a way that, at every step, we still had a real, working compiler. In other words, I was attempting an incremental enhancement of the language and its compiler, while at the same time explaining to you what I was doing.




That's a tough act to pull off! I finally realized that it was dumb to try. Having gotten this far using the idea of small experiments based on single-character tokens and simple, special-purpose programs, I had abandoned them in favor of working with the full compiler. It wasn't working.

So we're going to go back to our roots, so to speak. In this installment and the next, I'll be using single-character tokens again as we study the concepts of procedures, unfettered by the other baggage that we have accumulated in the previous sessions. As a matter of fact, I won't even attempt, at the end of this session, to merge the constructs into the TINY compiler. We'll save that for later.

After all this time, you don't need more buildup than that, so let's waste no more time and dive right in.



THE BASICS

All modern CPU's provide direct support for procedure calls, and the 68000 is no exception. For the 68000, the call is a BSR (PC-relative version) or JSR, and the return is RTS. All we have to do is to arrange for the compiler to issue these commands at the proper place.

Actually, there are really THREE things we have to address. One of them is the call/return mechanism. The second is the mechanism for DEFINING the procedure in the first place. And, finally, there is the issue of passing parameters to the called procedure. None of these things are really very difficult, and we can of course borrow heavily on what people have done in other languages ... there's no need to reinvent the wheel here. Of the three issues, that of parameter passing will occupy most of our attention, simply because there are so many options available.




A BASIS FOR EXPERIMENTS

As always, we will need some software to serve as a basis for what we are doing. We don't need the full TINY compiler, but we do need enough of a program so that some of the other constructs are present. Specifically, we need at least to be able to handle statements of some sort, and data declarations.

The program shown below is that basis. It's a vestigial form of TINY, with single-character tokens. It has data declarations, but only in their simplest form ... no lists or initializers. It has assignment statements, but only of the kind

<ident> = <ident>

In other words, the only legal expression is a single variable name. There are no control constructs ... the only legal statement is the assignment.

Most of the program is just the standard Cradle routines. I've shown the whole thing here, just to make sure we're all starting from the same point:

{--------------------------------------------------------------}

program Calls;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;

CR = ^M;

LF = ^J;

{--------------------------------------------------------------}

{ Variable Declarations }

var Look: char; { Lookahead Character }

var ST: Array['A'..'Z'] of char;

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;

begin

Read(Look);

end;




{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;



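{--------------------------------------------------------------}

{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin

Abort('Undefined Identifier ' + n);

end;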

{--------------------------------------------------------------}

{ Report a Duplicate Identifier }

procedure Duplicate(n: string);

begin

Abort('Duplicate Identifier ' + n);

end;

{--------------------------------------------------------------}

{ Get Type of Symbol }

function TypeOf(n: char): char;

begin

TypeOf := ST[n];

end;




{--------------------------------------------------------------}

{ Look for Symbol in Table }

function InTable(n: char): Boolean;

begin

InTable := ST[n] <> ' ';

end;

{--------------------------------------------------------------}

{ Add a New Symbol to Table }

procedure AddEntry(Name, T: char);

begin

if Intable(Name) then Duplicate(Name);

ST[Name] := T;

end;

{--------------------------------------------------------------}

{ Check an Entry to Make Sure It's a Variable }

procedure CheckVar(Name: char);

begin

if not InTable(Name) then Undefined(Name);

if TypeOf(Name) <> 'v' then Abort(Name + ' is not a variable');

end;



{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := upcase(c) in ['A'..'Z'];

end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin

IsDigit := c in ['0'..'9'];

end;

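{--------------------------------------------------------------}

{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;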




{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

{ Recognize a Mulop }

function IsMulop(c: char): boolean;

begin

IsMulop := c in ['*', '/'];

end;

{--------------------------------------------------------------}

{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;

begin

IsOrop := c in ['|', '~'];

end;



{--------------------------------------------------------------}

{ Recognize a Relop }

function IsRelop(c: char): boolean;

begin

IsRelop := c in ['=', '#', '<', '>'];

end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;




{--------------------------------------------------------------}

{ Skip Over an End-of-Line }

procedure Fin;

begin

if Look = CR then begin

GetChar;

if Look = LF then

GetChar;

end;

end;

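{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin

if Look = x then GetChar

else Expected('''' + x + '''');

SkipWhite;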

end;



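{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: char;

begin

if not IsAlpha(Look) then Expected('Name');

GetName := UpCase(Look);

GetChar;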

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: char;

begin

if not IsDigit(Look) then Expected('Integer');

GetNum := Look;

GetChar;

SkipWhite;

end;




{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);

begin

Write(TAB, s);

end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin

Emit(s);

WriteLn;

end;

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);

begin

WriteLn(L, ':');

end;



{--------------------------------------------------------------}

{ Load a Variable to the Primary Register }

procedure LoadVar(Name: char);

begin

CheckVar(Name);

EmitLn('MOVE ' + Name + '(PC),D0');

end;

{--------------------------------------------------------------}

{ Store the Primary Register }

procedure StoreVar(Name: char);

begin

CheckVar(Name);

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

end;




{--------------------------------------------------------------}

{ Initialize }

procedure Init;

var i: char;

begin

GetChar;

SkipWhite;

for i := 'A' to 'Z' do

ST[i] := ' ';

end;

{--------------------------------------------------------------}

{ Parse and Translate an Expression }

{ Vestigial Version }

procedure Expression;

begin

LoadVar(GetName);

end;



{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

Expression;

StoreVar(Name);

end;

{--------------------------------------------------------------}

{ Parse and Translate a Block of Statements }

procedure DoBlock;

begin

while not(Look in ['e']) do begin

Assignment;

Fin;

end;

end;




{--------------------------------------------------------------}

{ Parse and Translate a Begin-Block }

procedure BeginBlock;

begin

Match('b');

Fin;

DoBlock;

Match('e');

Fin;

end;

{--------------------------------------------------------------}

{ Allocate Storage for a Variable }

procedure Alloc(N: char);

begin

if InTable(N) then Duplicate(N);

ST[N] := 'v';

WriteLn(N, ':', TAB, 'DC 0');

end;



{--------------------------------------------------------------}

{ Parse and Translate a Data Declaration }

procedure Decl;

var Name: char;

begin

Match('v');

Alloc(GetName);

end;

{--------------------------------------------------------------}

{ Parse and Translate Global Declarations }

procedure TopDecls;

begin

while Look <> 'b' do begin

case Look of

'v': Decl;

else Abort('Unrecognized Keyword ' + Look);

end;

Fin;

end;

end;




{--------------------------------------------------------------}

{ Main Program }

begin

Init;

TopDecls;

BeginBlock;

end.

{--------------------------------------------------------------}

Note that we DO have a symbol table, and there is logic to check a variable name to make sure it's a legal one. It's also worth noting that I have included the code you've seen before to provide for white space and newlines. Finally, note that the main program is delimited, as usual, by BEGIN-END brackets.

Once you've copied the program to Turbo, the first step is to compile it and make sure it works. Give it a few declarations, and then a begin-block. Try something like:

va (for VAR A)

vb (for VAR B)

vc (for VAR C)

b (for BEGIN)

a=b

b=c

e. (for END.)

As usual, you should also make some deliberate errors, and verify that the program catches them correctly.
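If no Turbo Pascal system is at hand, the behaviour of this base program can be approximated with a compact Python sketch. This is not the program above, only an illustration of the same structure: the class name Calls is borrowed from the program heading, while the method names, the use of SyntaxError for all errors, and the exact MOVE/LEA output lines (modelled on the 68000 code used elsewhere in this series) are assumptions.

```python
class Calls:
    """Vestigial translator: 'v' declarations, then b ... e with X=Y lines."""

    def __init__(self, source):
        self.src = source
        self.pos = 0
        self.symbols = {}            # name -> type code, e.g. 'v'
        self.out = []

    @property
    def look(self):
        return self.src[self.pos:self.pos + 1]

    def advance(self):
        self.pos += 1
        while self.look in (" ", "\t"):      # skip trailing white space
            self.pos += 1

    def fin(self):
        while self.look in ("\r", "\n"):     # skip the end of the line
            self.pos += 1

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError(f"'{ch}' Expected")
        self.advance()

    def get_name(self):
        if not self.look.isalpha():
            raise SyntaxError("Name Expected")
        name = self.look.upper()
        self.advance()
        return name

    def check_var(self, name):
        if name not in self.symbols:
            raise SyntaxError(f"Undefined Identifier {name}")
        if self.symbols[name] != "v":
            raise SyntaxError(f"{name} is not a variable")

    def assignment(self):
        target = self.get_name()
        self.match("=")
        source = self.get_name()
        self.check_var(source)
        self.out.append(f"\tMOVE {source}(PC),D0")
        self.check_var(target)
        self.out.append(f"\tLEA {target}(PC),A0")
        self.out.append("\tMOVE D0,(A0)")

    def top_decls(self):
        while self.look != "b":
            if self.look != "v":
                raise SyntaxError(f"Unrecognized Keyword {self.look}")
            self.match("v")
            name = self.get_name()
            if name in self.symbols:
                raise SyntaxError(f"Duplicate Identifier {name}")
            self.symbols[name] = "v"
            self.out.append(f"{name}:\tDC 0")
            self.fin()

    def begin_block(self):
        self.match("b")
        self.fin()
        while self.look != "e":
            self.assignment()
            self.fin()
        self.match("e")

    def translate(self):
        self.top_decls()
        self.begin_block()
        return self.out


print("\n".join(Calls("va\nvb\nvc\nb\na=b\nb=c\ne.").translate()))
```

Feeding it a deliberately wrong program, for example one that assigns from an undeclared name, should raise an error in the same spirit as the Pascal version's Abort.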



DECLARING A PROCEDURE

If you're satisfied that our little program works, then it's time to deal with the procedures. Since we haven't talked about parameters yet, we'll begin by considering only procedures that have no parameter lists.

As a start, let's consider a simple program with a procedure, and think about the code we'd like to see generated for it:

PROGRAM FOO;
.
.
PROCEDURE BAR;                     BAR:
BEGIN                                   .
.                                       .
.                                       .
END;                                    RTS

BEGIN { MAIN PROGRAM }             MAIN:
.                                       .
.                                       .
BAR;                                    BSR BAR
.                                       .
.                                       .
END.                                    END MAIN

Here I've shown the high-order language constructs on the left, and the desired assembler code on the right. The first thing to notice is that we certainly don't have much code to generate here! For the great bulk of both the procedure and the main program, our existing constructs take care of the code to be generated.

The key to dealing with the body of the procedure is to recognize that although a procedure may be quite long, declaring it is really no different than declaring a variable. It's just one more kind of declaration. We can write the BNF:

<declaration> ::= <data decl> | <procedure>

This means that it should be easy to modify TopDecls to deal with procedures. What about the syntax of a procedure? Well, here's a suggested syntax, which is essentially that of Pascal:




<procedure> ::= PROCEDURE <ident> <begin-block>

There is practically no code generation required, other than that generated within the begin-block. We need only emit a label at the beginning of the procedure, and an RTS at the end.

Here's the required code:

{--------------------------------------------------------------}

{ Parse and Translate a Procedure Declaration }

procedure DoProc;

var N: char;

begin

Match('p');

N := GetName;

Fin;

if InTable(N) then Duplicate(N);

ST[N] := 'p';

PostLabel(N);

BeginBlock;

Return;

end;

{--------------------------------------------------------------}

Note that I've added a new code generation routine, Return, which merely emits an RTS instruction. The creation of that routine is "left as an exercise for the student."

To finish this version, add the following line within the Case statement in TopDecls:



'p': DoProc;

I should mention that this structure for declarations, and the BNF that drives it, differs from standard Pascal. In the Jensen & Wirth definition of Pascal, variable declarations, in fact ALL kinds of declarations, must appear in a specific sequence, i.e. labels, constants, types, variables, procedures, and main program. To follow such a scheme, we should separate the two declarations, and have code in the main program something like

DoVars;

DoProcs;

DoMain;

However, most implementations of Pascal, including Turbo, don't require that order and let you freely mix up the various declarations, as long as you still don't try to refer to something before it's declared. Although it may be more aesthetically pleasing to declare all the global variables at the top of the program, it certainly doesn't do any HARM to allow them to be sprinkled around. In fact, it may do some GOOD, in the sense that it gives you the opportunity to do a little rudimentary information hiding. Variables that should be accessed only by the main program, for example, can be declared just before it and will thus be inaccessible by the procedures.

OK, try this new version out. Note that we can declare as many procedures as we choose (as long as we don't run out of single-character names!), and the labels and RTS's all come out in the right places.
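The shape of the output for one procedure (a label, the body, then an RTS) can be sketched in a few lines of Python. This is an illustration only; declare_procedure and its arguments are invented names, the body lines are supplied ready-made instead of being parsed, and the duplicate-name check is an assumption about how the symbol table would be guarded.

```python
def declare_procedure(name, body_lines, symbols):
    """Record `name` as a procedure and wrap its body in a label and an RTS."""
    if name in symbols:
        raise NameError(f"Duplicate Identifier {name}")
    symbols[name] = "p"                      # same type code as ST[N] := 'p'
    return [f"{name}:"] + [f"\t{line}" for line in body_lines] + ["\tRTS"]


symbols = {"A": "v", "B": "v"}
code = declare_procedure(
    "F", ["MOVE B(PC),D0", "LEA A(PC),A0", "MOVE D0,(A0)"], symbols
)
print("\n".join(code))
```

Everything between the label and the RTS comes from the existing block-handling code, which is why so little new code generation is needed.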




It's worth noting here that I do _NOT_ allow for nested procedures. In TINY, all procedures must be declared at the global level, the same as in C. There has been quite a discussion about this point in the Computer Language Forum of CompuServe. It turns out that there is a significant penalty in complexity that must be paid for the luxury of nested procedures. What's more, this penalty gets paid at RUN TIME, because extra code must be added and executed every time a procedure is called. I also happen to believe that nesting is not a good idea, simply on the grounds that I have seen too many abuses of the feature.

Before going on to the next step, it's also worth noting that the "main program" as it stands is incomplete, since it doesn't have the label and END statement. Let's fix that little oversight:

{--------------------------------------------------------------}

{ Parse and Translate a Main Program }

procedure DoMain;

begin

Match('b');

Fin;

Prolog;

DoBlock;

Epilog;

end;

{--------------------------------------------------------------}

.

.

.



{--------------------------------------------------------------}

{ Main Program }

begin

Init;

TopDecls;

DoMain;

end.

{--------------------------------------------------------------}

Note that DoProc and DoMain are not quite symmetrical. DoProc uses a call to BeginBlock, whereas DoMain cannot. That's because a procedure is signaled by the keyword PROCEDURE (abbreviated by a 'p' here), while the main program gets no keyword other than the BEGIN itself.

And _THAT_ brings up an interesting question: WHY?

If we look at the structure of C programs, we find that all functions are treated just alike, except that the main program happens to be identified by its name, "main." Since C functions can appear in any order, the main program can also be anywhere in the compilation unit.

In Pascal, on the other hand, all variables and procedures must be declared before they're used, which means that there is no point putting anything after the main program ... it could never be accessed. The "main program" is not identified at all, other than being that part of the code that comes after the global BEGIN. In other words, if it ain't anything else, it must be the main program.




This causes no small amount of confusion for beginning programmers, and for big Pascal programs sometimes it's difficult to find the beginning of the main program at all. This leads to conventions such as identifying it in comments:

BEGIN { of MAIN }

This has always seemed to me to be a bit of a kludge. The question comes up: Why should the main program be treated so much differently than a procedure? In fact, now that we've recognized that procedure declarations are just that ... part of the global declarations ... isn't the main program just one more declaration, also?

The answer is yes, and by treating it that way, we can simplify the code and make it considerably more orthogonal. I propose that we use an explicit keyword, PROGRAM, to identify the main program (Note that this means that we can't start the file with it, as in Pascal). In this case, our BNF becomes:

<declaration> ::= <data decl> | <procedure> | <main program>

<procedure> ::= PROCEDURE <ident> <begin-block>

<main program> ::= PROGRAM <ident> <begin-block>



The code also looks much better, at least in the sense that DoMain and DoProc look more alike:

{--------------------------------------------------------------}

{ Parse and Translate a Main Program }

procedure DoMain;

var N: char;

begin

Match('P');

N := GetName;

Fin;

if InTable(N) then Duplicate(N);

Prolog;

BeginBlock;

end;

{--------------------------------------------------------------}




{--------------------------------------------------------------}

{ Parse and Translate Global Declarations }

procedure TopDecls;

begin

while Look <> '.' do begin

case Look of

'v': Decl;

'p': DoProc;

'P': DoMain;

else Abort('Unrecognized Keyword ' + Look);

end;

Fin;

end;

end;



{--------------------------------------------------------------}

{ Main Program }

begin

Init;

TopDecls;

Epilog;

end.

{--------------------------------------------------------------}

Since the declaration of the main program is now within the loop of TopDecls, that does present some difficulties. How do we ensure that it's the last thing in the file? And how do we ever exit from the loop? My answer for the second question, as you can see, was to bring back our old friend the period. Once the parser sees that, we're done.

To answer the first question: it depends on how far we're willing to go to protect the programmer from dumb mistakes. In the code that I've shown, there's nothing to keep the programmer from adding code after the main program ... even another main program. The code will just not be accessible. However, we COULD access it via a FORWARD statement, which we'll be providing later. As a matter of fact, many assembler language programmers like to use the area just after the program to declare large, uninitialized data blocks, so there may indeed be some value in not requiring the main program to be last. We'll leave it as it is.

If we decide that we should give the programmer a little more help than that, it's pretty easy to add some logic to kick us out of the loop once the main program has been processed. Or we could at least flag an error if someone tries to include two mains.




CALLING THE PROCEDURE

If you're satisfied that things are working, let's address the second half of the equation ... the call.

Consider the BNF for a procedure call:

<proc_call> ::= <identifier>

For an assignment statement, on the other hand, the BNF is:

<assignment> ::= <identifier> '=' <expression>

At this point we seem to have a problem. The two BNF statements both begin on the right-hand side with the token <identifier>. How are we supposed to know, when we see the identifier, whether we have a procedure call or an assignment statement? This looks like a case where our parser ceases being predictive, and indeed that's exactly the case. However, it turns out to be an easy problem to fix, since all we have to do is to look at the type of the identifier, as recorded in the symbol table. As we've discovered before, a minor local violation of the predictive parsing rule can be easily handled as a special case.



Here's how to do it:

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment(Name: char);

begin

Match('=');

Expression;

StoreVar(Name);

end;




{--------------------------------------------------------------}

{ Decide if a Statement is an Assignment or Procedure Call }

procedure AssignOrProc;

var Name: char;

begin

Name := GetName;

case TypeOf(Name) of

' ': Undefined(Name);

'v': Assignment(Name);

'p': CallProc(Name);

else Abort('Identifier ' + Name +

' Cannot Be Used Here');

end;

end;



{--------------------------------------------------------------}

{ Parse and Translate a Block of Statements }

procedure DoBlock;

begin

while not(Look in ['e']) do begin

AssignOrProc;

Fin;

end;

end;

{--------------------------------------------------------------}

As you can see, procedure DoBlock now calls AssignOrProc instead of Assignment. The function of this new procedure is to simply read the identifier, determine its type, and then call whichever procedure is appropriate for that type. Since the name has already been read, we must pass it to the two procedures, and modify Assignment to match. Procedure CallProc is a simple code generation routine:

{--------------------------------------------------------------}

{ Call a Procedure }

procedure CallProc(N: char);

begin

EmitLn('BSR ' + N);

end;

{--------------------------------------------------------------}
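The decision made by AssignOrProc, looking up the identifier's type before choosing how to parse the rest of the statement, can be illustrated with a short Python sketch. The function translate_statement, its string-based input and the exact instruction text are assumptions for the example; the type codes 'v' and 'p' and the error messages follow the Pascal listing.

```python
def translate_statement(stmt, symbols):
    """Translate 'X' (a call) or 'X=Y' (an assignment) using X's recorded type."""
    name = stmt[0].upper()
    kind = symbols.get(name, " ")            # ' ' means "not in the table"
    if kind == " ":
        raise NameError(f"Undefined Identifier {name}")
    if kind == "p":
        return [f"BSR {name}"]               # procedure call
    if kind == "v":
        if stmt[1:2] != "=":
            raise SyntaxError("'=' Expected")
        source = stmt[2:3].upper()
        if symbols.get(source) != "v":
            raise NameError(f"{source} is not a variable")
        return [f"MOVE {source}(PC),D0", f"LEA {name}(PC),A0", "MOVE D0,(A0)"]
    raise SyntaxError(f"Identifier {name} Cannot Be Used Here")


symbols = {"A": "v", "B": "v", "F": "p"}
print(translate_statement("f", symbols))
print(translate_statement("a=b", symbols))
```

The parser itself stays simple: the one place where the grammar is not predictive is resolved by a table lookup rather than by extra lookahead.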




Well, at this point we have a compiler that can deal with procedures. It's worth noting that procedures can call procedures to any depth. So even though we don't allow nested DECLARATIONS, there is certainly nothing to keep us from nesting CALLS, just as we would expect to do in any language. We're getting there, and it wasn't too hard, was it?

Of course, so far we can only deal with procedures that have no parameters. The procedures can only operate on the global variables by their global names. So at this point we have the equivalent of BASIC's GOSUB construct. Not too bad ... after all lots of serious programs were written using GOSUBs, but we can do better, and we will. That's the next step.



PASSING PARAMETERS

Again, we all know the basic idea of passed parameters, but let's review them just to be safe.

In general the procedure is given a parameter list, for example

PROCEDURE FOO(X, Y, Z)

In the declaration of a procedure, the parameters are called formal parameters, and may be referred to in the body of the procedure by those names. The names used for the formal parameters are really arbitrary. Only the position really counts. In the example above, the name 'X' simply means "the first parameter" wherever it is used.

When a procedure is called, the "actual parameters" passed to it are associated with the formal parameters, on a one-for-one basis.

The BNF for the syntax looks something like this:

<procedure> ::= PROCEDURE <ident> '(' <param-list> ')' <begin-block>

<param-list> ::= <parameter> ( ',' <parameter> )* | null

Similarly, the procedure call looks like:

<proc call> ::= <ident> '(' <param-list> ')'




Note that there is already an implicit decision built into this syntax. Some languages, such as Pascal and Ada, permit parameter lists to be optional. If there are no parameters, you simply leave off the parens completely. Other languages, like C and Modula 2, require the parens even if the list is empty. Clearly, the example we just finished corresponds to the former point of view. But to tell the truth I prefer the latter. For procedures alone, the decision would seem to favor the "listless" approach. The statement

Initialize; ,

standing alone, can only mean a procedure call. In the parsers we've been writing, we've made heavy use of parameterless procedures, and it would seem a shame to have to write an empty pair of parens for each case.

But later on we're going to be using functions, too. And since functions can appear in the same places as simple scalar identifiers, you can't tell the difference between the two. You have to go back to the declarations to find out. Some folks consider this to be an advantage. Their argument is that an identifier gets replaced by a value, and what do you care whether it's done by substitution or by a function? But we sometimes _DO_ care, because the function may be quite time-consuming. If, by writing a simple identifier into a given expression, we can incur a heavy run-time penalty, it seems to me we ought to be made aware of it.

Anyway, Niklaus Wirth designed both Pascal and Modula 2. I'll give him the benefit of the doubt and assume that he had a good reason for changing the rules the second time around!

Needless to say, it's an easy thing to accommodate either point of view as we design a language, so this one is strictly a matter of personal preference. Do it whichever way you like best.



Before we go any further, let's alter the translator to handle a (possibly empty) parameter list. For now we won't generate any extra code ... just parse the syntax. The code for processing the declaration has very much the same form we've seen before when dealing with VAR-lists:

{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }

procedure FormalList;
begin
   Match('(');
   if Look <> ')' then begin
      FormalParam;
      while Look = ',' do begin
         Match(',');
         FormalParam;
      end;
   end;
   Match(')');
end;
{--------------------------------------------------------------}




Procedure DoProc needs to have a line added to call FormalList:

{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
   Match('p');
   N := GetName;
   FormalList;
   Fin;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   PostLabel(N);
   BeginBlock;
   Return;
end;
{--------------------------------------------------------------}



For now, the code for FormalParam is just a dummy one that simply skips the parameter name:

{--------------------------------------------------------------}
{ Process a Formal Parameter }

procedure FormalParam;
var Name: char;
begin
   Name := GetName;
end;
{--------------------------------------------------------------}

For the actual procedure call, there must be similar code to process the actual parameter list:

{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
var Name: char;
begin
   Name := GetName;
end;

{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }

procedure ParamList;
begin
   Match('(');
   if Look <> ')' then begin
      Param;
      while Look = ',' do begin
         Match(',');
         Param;
      end;
   end;
   Match(')');
end;

{--------------------------------------------------------------}
{ Process a Procedure Call }

procedure CallProc(Name: char);
begin
   ParamList;
   Call(Name);
end;
{--------------------------------------------------------------}



Note here that CallProc is no longer just a simple code generation routine. It has some structure to it. To handle this, I've renamed the code generation routine to just Call, and called it from within CallProc.

OK, if you'll add all this code to your translator and try it out, you'll find that you can indeed parse the syntax properly. I'll note in passing that there is _NO_ checking to make sure that the number (and, later, types) of formal and actual parameters match up. In a production compiler, we must of course do this. We'll ignore the issue now if for no other reason than that the structure of our symbol table doesn't currently give us a place to store the necessary information. Later on, we'll have a place for that data and we can deal with the issue then.
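If you'd like to watch the list-parsing logic run on its own, here is a minimal stand-alone sketch. It is not the translator itself: the input comes from a string variable instead of standard input, errors simply halt, and the names Source, Cursor and Count are made up for the illustration. Only the shape of ParamList is taken from the listing above.

```pascal
program ParamListDemo;

var
  Source: string;    { hypothetical input text, replaces standard input }
  Cursor: integer;   { index of the next unread character in Source }
  Look: char;        { lookahead character, as in the cradle }
  Count: integer;    { number of parameters seen so far }

{ Read the next character of Source into Look }
procedure GetChar;
begin
  if Cursor <= Length(Source) then begin
    Look := Source[Cursor];
    Inc(Cursor);
  end
  else
    Look := #0;      { sentinel for end of input }
end;

{ Insist on a specific input character }
procedure Match(C: char);
begin
  if Look <> C then begin
    WriteLn('Error: "', C, '" expected');
    Halt(1);
  end;
  GetChar;
end;

{ Accept one single-letter parameter name and count it }
procedure Param;
begin
  if not (Look in ['a'..'z']) then begin
    WriteLn('Error: name expected');
    Halt(1);
  end;
  Inc(Count);
  GetChar;
end;

{ Same structure as ParamList in the text }
procedure ParamList;
begin
  Match('(');
  if Look <> ')' then begin
    Param;
    while Look = ',' do begin
      Match(',');
      Param;
    end;
  end;
  Match(')');
end;

begin
  Source := '(a,b,c)';
  Cursor := 1;
  Count := 0;
  GetChar;
  ParamList;
  WriteLn('Parsed ', Count, ' parameter(s)');
end.
```

With the input shown, the program reports three parameters. Changing Source to '()' exercises the empty-list branch, and something like '(a,)' trips the error check in Param.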




THE SEMANTICS OF PARAMETERS

So far we've dealt with the SYNTAX of parameter passing, and we've got the parsing mechanisms in place to handle it. Next, we have to look at the SEMANTICS, i.e., the actions to be taken when we encounter parameters. This brings us square up against the issue of the different ways parameters can be passed.

There is more than one way to pass a parameter, and the way we do it can have a profound effect on the character of the language. So this is another of those areas where I can't just give you my solution. Rather, it's important that we spend some time looking at the alternatives so that you can go another route if you choose to.

There are two main ways parameters are passed:

o By value

o By reference (address)

The differences are best seen in the light of a little history.

The old FORTRAN compilers passed all parameters by reference. In other words, what was actually passed was the address of the parameter. This meant that the called subroutine was free to either read or write that parameter, as often as it chose to, just as though it were a global variable. This was actually quite an efficient way to do things, and it was pretty simple since the same mechanism was used in all cases, with one exception that I'll get to shortly.

There were problems, though. Many people felt that this method created entirely too much coupling between the called subroutine and its caller. In effect, it gave the subroutine complete access to all variables that appeared in the parameter list.

Many times, we didn't want to actually change a parameter, but only use it as an input. For example, we might pass an element count to a subroutine, and wish we could then use that count within a DO-loop. To avoid changing the value in the calling program, we had to make a local copy of the input parameter, and operate only on the copy. Some FORTRAN programmers, in fact, made it a practice to copy ALL parameters except those that were to be used as return values. Needless to say, all this copying defeated a good bit of the efficiency associated with the approach.

There was, however, an even more insidious problem, which was not really just the fault of the "pass by reference" convention, but a bad convergence of several implementation decisions.

Suppose we have a subroutine:

SUBROUTINE FOO(X, Y, N)

where N is some kind of input count or flag. Many times, we'd like to be able to pass a literal or even an expression in place of a variable, such as:

CALL FOO(A, B, J + 1)

Here the third parameter is not a variable, and so it has no address. The earliest FORTRAN compilers did not allow such things, so we had to resort to subterfuges like:

K = J + 1

CALL FOO(A, B, K)

Here again, there was copying required, and the burden was on the programmer to do it. Not good.

Later FORTRAN implementations got rid of this by allowing expressions as parameters. What they did was to assign a compiler-generated variable, store the value of the expression in the variable, and then pass the address of that variable.

So far, so good. Even if the subroutine mistakenly altered the anonymous variable, who was to know or care? On the next call, it would be recalculated anyway.




The problem arose when someone decided to make things more efficient. They reasoned, rightly enough, that the most common kind of "expression" was a single integer value, as in:

CALL FOO(A, B, 4)

It seemed inefficient to go to the trouble of "computing" such an integer and storing it in a temporary variable, just to pass it through the calling list. Since we had to pass the address of the thing anyway, it seemed to make lots of sense to just pass the address of the literal integer, 4 in the example above.

To make matters more interesting, most compilers, then and now, identify all literals and store them separately in a "literal pool," so that we only have to store one value for each unique literal. That combination of design decisions (passing expressions, optimization for literals as a special case, and use of a literal pool) is what led to disaster.

To see how it works, imagine that we call subroutine FOO as in the example above, passing it a literal 4. Actually, what gets passed is the address of the literal 4, which is stored in the literal pool. This address corresponds to the formal parameter, N, in the subroutine itself.

Now suppose that, unbeknownst to the programmer, subroutine FOO actually modifies N to be, say, -7. Suddenly, that literal 4 in the literal pool gets CHANGED, to a -7. From then on, every expression that uses a 4 and every subroutine that passes a 4 will be using the value of -7 instead! Needless to say, this can lead to some bizarre and difficult-to-find behavior. The whole thing gave the concept of pass-by-reference a bad name, although as we have seen, it was really a combination of design decisions that led to the problem.
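The effect is easy to mimic in Pascal. The sketch below is only an analogy, not FORTRAN and not any real compiler's literal pool: the variable PoolFour is a hypothetical stand-in for the single pool cell that holds the constant 4, and the VAR parameter plays the part of FORTRAN's pass-by-reference.

```pascal
program LiteralPoolDemo;

var
  PoolFour: integer;   { hypothetical stand-in for the pool cell holding 4 }

{ N is a VAR parameter, so Foo receives the address of whatever is passed,
  much as an old FORTRAN subroutine would. }
procedure Foo(var N: integer);
begin
  N := -7;
end;

begin
  PoolFour := 4;
  WriteLn('Before the call, the pooled literal is ', PoolFour);
  Foo(PoolFour);       { plays the role of CALL FOO(A, B, 4) }
  WriteLn('After the call, the pooled literal is ', PoolFour);
  WriteLn('So "4 + 1" now evaluates to ', PoolFour + 1);
end.
```

After the call the "literal" holds -7, and anything that later uses it (here, 4 + 1) quietly computes with the wrong value.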

In spite of the problem, the FORTRAN approach had its good points. Chief among them is the fact that we don't have to support multiple mechanisms. The same scheme, passing the address of the argument, works for EVERY case, including arrays. So the size of the compiler can be reduced.

Partly because of the FORTRAN gotcha, and partly just because of the reduced coupling involved, modern languages like C, Pascal, Ada, and Modula 2 generally pass scalars by value.



This means that the value of the scalar is COPIED into a separate value used only for the call. Since the value passed is a copy, the called procedure can use it as a local variable and modify it any way it likes. The value in the caller will not be changed.

It may seem at first that this is a bit inefficient, because of the need to copy the parameter. But remember that we're going to have to fetch SOME value to pass anyway, whether it be the parameter itself or an address for it. Inside the subroutine, using pass-by-value is definitely more efficient, since we eliminate one level of indirection. Finally, we saw earlier that with FORTRAN, it was often necessary to make copies within the subroutine anyway, so pass-by-value reduces the number of local variables. All in all, pass-by-value is better.

Except for one small detail: if all parameters are passed by value, there is no way for a called procedure to return a result to its caller! The parameter passed is NOT altered in the caller, only in the called procedure. Clearly, that won't get the job done.

There have been two answers to this problem, which are equivalent. In Pascal, Wirth provides for VAR parameters, which are passed by reference. What a VAR parameter is, in fact, is none other than our old friend the FORTRAN parameter, with a new name and paint job for disguise. Wirth neatly gets around the "changing a literal" problem as well as the "address of an expression" problem, by the simple expedient of allowing only a variable to be the actual parameter. In other words, it's the same restriction that the earliest FORTRANs imposed.
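A short Pascal program makes the difference between the two kinds of parameter concrete. The procedure names BumpCopy and BumpOriginal are invented for this illustration.

```pascal
program ValueVersusVar;

var
  Count: integer;

{ Value parameter: N is a private copy; the caller's variable is untouched. }
procedure BumpCopy(N: integer);
begin
  N := N + 1;
end;

{ VAR parameter: N is another name for the caller's variable. }
procedure BumpOriginal(var N: integer);
begin
  N := N + 1;
end;

begin
  Count := 10;
  BumpCopy(Count);
  WriteLn('After BumpCopy:     ', Count);
  BumpOriginal(Count);
  WriteLn('After BumpOriginal: ', Count);
  { BumpOriginal(Count + 1) would be rejected at compile time:
    only a variable may be the actual parameter for a VAR parameter. }
end.
```

Count is still 10 after BumpCopy and becomes 11 only after BumpOriginal. The commented-out call at the end is the restriction described above: an expression has no address to pass.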

C does the same thing, but explicitly. In C, _ALL_ parameters are passed by value. One kind of variable that C supports, however, is the pointer. So by passing a pointer by value, you in effect pass what it points to by reference. In some ways this works even better yet, because even though you can change the variable pointed to all you like, you still CAN'T change the pointer itself. In a function such as strcpy, for example, where the pointers are incremented as the string is copied, we are really only incrementing copies of the pointers, so the values of those pointers in the calling procedure still remain as they were. To modify a pointer, you must pass a pointer to the pointer.
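The same idea can be sketched with Pascal pointers, which are likewise passed by value unless declared VAR. This is an illustration of the principle only, not C; the names PInt and Poke are invented.

```pascal
program PointerByValue;

type
  PInt = ^integer;

var
  A, B: integer;
  P: PInt;

{ Q is a copy of the caller's pointer. Writing through it changes the
  target; re-aiming it changes only the copy. }
procedure Poke(Q: PInt);
begin
  Q^ := 99;      { alters the caller's A through the pointer }
  Q := @B;       { re-aims the local copy only }
end;

begin
  A := 1;
  B := 2;
  P := @A;
  Poke(P);
  WriteLn('A  = ', A);
  WriteLn('B  = ', B);
  WriteLn('P^ = ', P^);
end.
```

A comes back as 99, B is still 2, and P^ is 99, which shows that P still points at A even though the procedure re-aimed its own copy at B.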

Since we are simply performing experiments here, we'll look at BOTH pass-by-value and pass-by-reference. That way, we'll be able to use either one as we need to. It's worth mentioning that it's going to be tough to use the C approach to pointers here, since a pointer is a different type and we haven't studied types yet!




PASS-BY-VALUE

Let's just try some simple-minded things and see where they lead us. Let's begin with the pass-by-value case. Consider the procedure call:

FOO(X, Y)

Almost the only reasonable way to pass the data is through the CPU stack. So the code we'd like to see generated might look something like this:

MOVE X(PC),-(SP) ; Push X

MOVE Y(PC),-(SP) ; Push Y

BSR FOO ; Call FOO

That certainly doesn't seem too complex!

When the BSR is executed, the CPU pushes the return address onto the stack and jumps to FOO. At this point the stack will look like this:

.

.

Value of X (2 bytes)

Value of Y (2 bytes)

SP --> Return Address (4 bytes)

So the values of the parameters have addresses that are fixed offsets from the stack pointer. In this example, the addresses are:

X: 6(SP)

Y: 4(SP)



Now consider what the called procedure might look like:

PROCEDURE FOO(A, B)

BEGIN

A = B

END

(Remember, the names of the formal parameters are arbitrary ... only the positions count.)

The desired output code might look like:

FOO: MOVE 4(SP),D0

MOVE D0,6(SP)

RTS

Note that, in order to address the formal parameters, we're going to have to know which position they have in the parameter list. This means some changes to the symbol table stuff. In fact, for our single-character case it's best to just create a new symbol table for the formal parameters.

Let's begin by declaring a new table:

var Params: Array['A'..'Z'] of integer;

We also will need to keep track of how many parameters a given procedure has:

var NumParams: integer;




And we need to initialize the new table. Now, remember that the formal parameter list will be different for each procedure that we process, so we'll need to initialize that table anew for each procedure. Here's the initializer:

{--------------------------------------------------------------}
{ Initialize Parameter Table to Null }

procedure ClearParams;
var i: char;
begin
   for i := 'A' to 'Z' do
      Params[i] := 0;
   NumParams := 0;
end;
{--------------------------------------------------------------}



We'll put a call to this procedure in Init, and also at the end of DoProc:

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
   GetChar;
   SkipWhite;
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   ClearParams;
end;
{--------------------------------------------------------------}

.

.

.




{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
   Match('p');
   N := GetName;
   FormalList;
   Fin;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   PostLabel(N);
   BeginBlock;
   Return;
   ClearParams;
end;
{--------------------------------------------------------------}

Note that the call within DoProc ensures that the table will be clear when we're in the main program.

OK, now we need a few procedures to work with the table. The next few functions are essentially copies of InTable, TypeOf, etc.:

{--------------------------------------------------------------}
{ Find the Parameter Number }

function ParamNumber(N: char): integer;
begin
   ParamNumber := Params[N];
end;

{--------------------------------------------------------------}
{ See if an Identifier is a Parameter }

function IsParam(N: char): boolean;
begin
   IsParam := Params[N] <> 0;
end;

{--------------------------------------------------------------}
{ Add a New Parameter to Table }

procedure AddParam(Name: char);
begin
   if IsParam(Name) then Duplicate(Name);
   Inc(NumParams);
   Params[Name] := NumParams;
end;
{--------------------------------------------------------------}




Finally, we need some code generation routines:

{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 4 + 2 * (NumParams - N);
   Emit('MOVE ');
   WriteLn(Offset, '(SP),D0');
end;

{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 4 + 2 * (NumParams - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(SP)');
end;

{--------------------------------------------------------------}
{ Push The Primary Register to the Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}

(The last routine is one we've seen before, but it wasn't in this vestigial version of the program.)
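The offset arithmetic in LoadParam and StoreParam can be checked by hand with a few lines of Pascal. This is just a sketch of the formula, with a 4-byte return address and 2-byte parameters as in the stack picture earlier; the function name SpOffset and the two constants are made up for the example.

```pascal
program SpOffsets;

const
  ReturnAddressSize = 4;   { bytes pushed by BSR }
  ParamSize = 2;           { each parameter is a 16-bit word }

{ Same formula that LoadParam and StoreParam use:
  parameter N (counting from 1) out of NumParams. }
function SpOffset(N, NumParams: integer): integer;
begin
  SpOffset := ReturnAddressSize + ParamSize * (NumParams - N);
end;

var
  N: integer;

begin
  for N := 1 to 2 do
    WriteLn('Parameter ', N, ' of 2 is at ', SpOffset(N, 2), '(SP)');
end.
```

For a two-parameter procedure this gives 6(SP) for the first parameter and 4(SP) for the second, matching the X and Y addresses worked out above. The last parameter pushed always sits right on top of the return address.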

With those preliminaries in place, we're ready to deal with the semantics of procedures with calling lists (remember, the code to deal with the syntax is already in place).

Let's begin by processing a formal parameter. All we have to do is to add each parameter to the parameter symbol table:

{--------------------------------------------------------------}
{ Process a Formal Parameter }

procedure FormalParam;
begin
   AddParam(GetName);
end;
{--------------------------------------------------------------}




Now, what about dealing with a formal parameter when it appears in the body of the procedure? That takes a little more work. We must first determine that it IS a formal parameter. To do this, I've written a modified version of TypeOf:

{--------------------------------------------------------------}
{ Get Type of Symbol }

function TypeOf(n: char): char;
begin
   if IsParam(n) then
      TypeOf := 'f'
   else
      TypeOf := ST[n];
end;
{--------------------------------------------------------------}

(Note that, since TypeOf now calls IsParam, it may need to be relocated in your source.)



We also must modify AssignOrProc to deal with this new type:

{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }

procedure AssignOrProc;
var Name: char;
begin
   Name := GetName;
   case TypeOf(Name) of
     ' ': Undefined(Name);
     'v', 'f': Assignment(Name);
     'p': CallProc(Name);
     else Abort('Identifier ' + Name + ' Cannot Be Used Here');
   end;
end;
{--------------------------------------------------------------}




Finally, the code to process an assignment statement and an expression must be extended:

{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ Vestigial Version }

procedure Expression;
var Name: char;
begin
   Name := GetName;
   if IsParam(Name) then
      LoadParam(ParamNumber(Name))
   else
      LoadVar(Name);
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment(Name: char);
begin
   Match('=');
   Expression;
   if IsParam(Name) then
      StoreParam(ParamNumber(Name))
   else
      StoreVar(Name);
end;
{--------------------------------------------------------------}

As you can see, these procedures will treat every variable name encountered as either a formal parameter or a global variable, depending on whether or not it appears in the parameter symbol table. Remember that we are using only a vestigial form of Expression. In the final program, the change shown here will have to be added to Factor, not Expression.




The rest is easy. We need only add the semantics to the actual procedure call, which we can do with one new line of code:

{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
begin
   Expression;
   Push;
end;
{--------------------------------------------------------------}

That's it. Add these changes to your program and give it a try. Try declaring one or two procedures, each with a formal parameter list. Then do some assignments, using combinations of global and formal parameters. You can call one procedure from within another, but you cannot DECLARE a nested procedure. You can even pass formal parameters from one procedure to another. If we had the full syntax of the language here, you'd also be able to do things like read or write formal parameters or use them in complicated expressions.



WHAT'S WRONG?

At this point, you might be thinking: Surely there's more to this than a few pushes and pops. There must be more to passing parameters than this.

You'd be right. As a matter of fact, the code that we're generating here leaves a lot to be desired in several respects.

The most glaring oversight is that it's wrong! If you'll look back at the code for a procedure call, you'll see that the caller pushes each actual parameter onto the stack before it calls the procedure. The procedure USES that information, but it doesn't change the stack pointer. That means that the stuff is still there when we return. SOMEBODY needs to clean up the stack, or we'll soon be in very hot water!

Fortunately, that's easily fixed. All we have to do is to increment the stack pointer when we're finished.

Should we do that in the calling program, or the called procedure? Some folks let the called procedure clean up the stack, since that requires less code to be generated per call, and since the procedure, after all, knows how many parameters it's got. But that means that it must do something with the return address so as not to lose it.




I prefer letting the caller clean up, so that the callee need only execute a return. Also, it seems a bit more balanced, since the caller is the one who "messed up" the stack in the first place. But THAT means that the caller must remember how many items it pushed. To make things easy, I've modified the procedure ParamList to be a function instead of a procedure, returning the number of bytes pushed:

{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }

function ParamList: integer;
var N: integer;
begin
   N := 0;
   Match('(');
   if Look <> ')' then begin
      Param;
      inc(N);
      while Look = ',' do begin
         Match(',');
         Param;
         inc(N);
      end;
   end;
   Match(')');
   ParamList := 2 * N;
end;
{--------------------------------------------------------------}



Procedure CallProc then uses this to clean up the stack:

{--------------------------------------------------------------}
{ Process a Procedure Call }

procedure CallProc(Name: char);
var N: integer;
begin
   N := ParamList;
   Call(Name);
   CleanStack(N);
end;
{--------------------------------------------------------------}

Here I've created yet another code generation procedure:

{--------------------------------------------------------------}
{ Adjust the Stack Pointer Upwards by N Bytes }

procedure CleanStack(N: integer);
begin
   if N > 0 then begin
      Emit('ADD #');
      WriteLn(N, ',SP');
   end;
end;
{--------------------------------------------------------------}




OK, if you'll add this code to your compiler, I think you'll find that the stack is now under control.

The next problem has to do with our way of addressing relative to the stack pointer. That works fine in our simple examples, since with our rudimentary form of expressions nobody else is messing with the stack. But consider a different example as simple as:

PROCEDURE FOO(A, B)

BEGIN

A = A + B

END

The code generated by a simple-minded parser might be:

FOO: MOVE 6(SP),D0 ; Fetch A

MOVE D0,-(SP) ; Push it

MOVE 4(SP),D0 ; Fetch B

ADD (SP)+,D0 ; Add A

MOVE D0,6(SP) ; Store A

RTS

This would be wrong. When we push the first argument onto the stack, the offsets for the two formal parameters are no longer 4 and 6, but are 6 and 8. So the second fetch would not fetch B at all: 4(SP) now lands inside the return address.

This is not the end of the world. I think you can see that all we really have to do is to alter the offset every time we do a push, and that in fact is what's done if the CPU has no support for other methods.
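To see exactly what the second fetch picks up, here is a small simulation of the stack. It is a sketch under simplifying assumptions: memory is an array of 16-bit words, SP is a byte address, and the 4-byte return address is modelled as two words whose low half is the arbitrary marker value 9999. The names Mem, PushWord and WordAt are invented for the illustration.

```pascal
program SpDrift;

var
  Mem: array[0..31] of integer;   { simulated memory, one entry per 16-bit word }
  SP: integer;                    { simulated stack pointer, a byte address }

{ MOVE V,-(SP): pre-decrement SP, then store }
procedure PushWord(V: integer);
begin
  SP := SP - 2;
  Mem[SP div 2] := V;
end;

{ The word found at Offset(SP) }
function WordAt(Offset: integer): integer;
begin
  WordAt := Mem[(SP + Offset) div 2];
end;

begin
  SP := 64;
  PushWord(111);        { caller pushes A }
  PushWord(222);        { caller pushes B }
  PushWord(9999);       { BSR pushes the return address: low word... }
  PushWord(0);          { ...then high word }

  WriteLn('On entry:     6(SP) = ', WordAt(6), ', 4(SP) = ', WordAt(4));

  PushWord(WordAt(6));  { MOVE 6(SP),D0 followed by MOVE D0,-(SP) }

  WriteLn('After a push: 4(SP) = ', WordAt(4), ', 6(SP) = ', WordAt(6),
          ', 8(SP) = ', WordAt(8));
end.
```

On entry the program finds A (111) at 6(SP) and B (222) at 4(SP). After the push, 4(SP) yields the marker 9999, that is, half of the return address, while B has moved to 6(SP) and A to 8(SP).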



Fortunately, though, the 68000 does have such support. Recognizing that this CPU would be used a lot with high-order language compilers, Motorola decided to add direct support for this kind of thing.

The problem, as you can see, is that as the procedure executes, the stack pointer bounces up and down, and so it becomes an awkward thing to use as a reference to access the formal parameters. The solution is to define some _OTHER_ register, and use it instead. This register is typically set equal to the original stack pointer, and is called the frame pointer.

The 68000 LINK instruction lets you declare such a frame pointer, and sets it equal to the stack pointer, all in one instruction. As a matter of fact, it does even more than that. Since this register may have been in use for something else in the calling procedure, LINK also pushes the current value of that register onto the stack. It can also add a value to the stack pointer, to make room for local variables.

The complement of LINK is UNLK, which simply restores the stack pointer and pops the old value back into the register.

Using these two instructions, the code for the previous example becomes:

FOO: LINK A6,#0

MOVE 10(A6),D0 ; Fetch A

MOVE D0,-(SP) ; Push it

MOVE 8(A6),D0 ; Fetch B

ADD (SP)+,D0 ; Add A

MOVE D0,10(A6) ; Store A

UNLK A6

RTS
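The following simulation sketches what LINK and UNLK do to the stack, and why offsets from A6 stay put while SP moves. It makes the same simplifying assumptions as any toy model: memory is an array of 16-bit words, addresses are small enough to fit in one word, and long values are stored as a zero high word plus a data-carrying low word. The routine names LinkA6, UnlkA6, PushWord, PushLong and FrameWord, and the starting values, are invented for the example.

```pascal
program FramePointerDemo;

var
  Mem: array[0..31] of integer;   { simulated memory, one entry per 16-bit word }
  SP, A6: integer;                { simulated registers holding byte addresses }

procedure PushWord(V: integer);
begin
  SP := SP - 2;
  Mem[SP div 2] := V;
end;

{ Push a 32-bit quantity as two words; only the low word carries data here }
procedure PushLong(V: integer);
begin
  PushWord(V);     { low word, at the higher address }
  PushWord(0);     { high word, at the lower address }
end;

{ LINK A6,#Disp: save A6, point it at the saved copy, add Disp to SP }
procedure LinkA6(Disp: integer);
begin
  PushLong(A6);
  A6 := SP;
  SP := SP + Disp;
end;

{ UNLK A6: drop everything below the frame, then pop the old A6 }
procedure UnlkA6;
begin
  SP := A6;
  A6 := Mem[SP div 2 + 1];   { low word of the saved long }
  SP := SP + 4;
end;

{ The word found at Offset(A6) }
function FrameWord(Offset: integer): integer;
begin
  FrameWord := Mem[(A6 + Offset) div 2];
end;

begin
  SP := 64;
  A6 := 1000;                { whatever the caller had in A6 }
  PushWord(111);             { caller pushes A }
  PushWord(222);             { caller pushes B }
  PushLong(9999);            { BSR pushes the return address }
  LinkA6(0);                 { LINK A6,#0 }

  WriteLn('On entry:     10(A6) = ', FrameWord(10), ', 8(A6) = ', FrameWord(8));
  PushWord(FrameWord(10));   { push A as a temporary }
  WriteLn('After a push: 10(A6) = ', FrameWord(10), ', 8(A6) = ', FrameWord(8));

  UnlkA6;                    { UNLK A6 }
  WriteLn('After UNLK, SP = ', SP, ' and A6 = ', A6);
end.
```

Both lines report A (111) at 10(A6) and B (222) at 8(A6), so the extra push no longer disturbs the addressing. After the unlink, A6 is back to 1000 and SP is 56, which is where the return address sits, ready for RTS.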




Fixing the compiler to generate this code is a lot easier than it is to explain it. All we need to do is to modify the code generation created by DoProc. Since that makes the code a little more than one line, I've created new procedures to deal with it, paralleling the Prolog and Epilog procedures called by DoMain:

{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }

procedure ProcProlog(N: char);
begin
   PostLabel(N);
   EmitLn('LINK A6,#0');
end;

{--------------------------------------------------------------}
{ Write the Epilog for a Procedure }

procedure ProcEpilog;
begin
   EmitLn('UNLK A6');
   EmitLn('RTS');
end;
{--------------------------------------------------------------}



Procedure DoProc now just calls these:

{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }

procedure DoProc;
var N: char;
begin
   Match('p');
   N := GetName;
   FormalList;
   Fin;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   ProcProlog(N);
   BeginBlock;
   ProcEpilog;
   ClearParams;
end;
{--------------------------------------------------------------}




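Finally, we need to change the references to SP in procedures LoadParam and StoreParam:

{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (NumParams - N);
   Emit('MOVE ');
   WriteLn(Offset, '(A6),D0');
end;

{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (NumParams - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}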

(Note that the Offset computation changes to allow for the extra push of A6.)

That's all it takes. Try this out and see how you like it.

At this point we are generating some relatively nice code for procedures and procedure calls. Within the limitation that there are no local variables (yet) and that no procedure nesting is allowed, this code is just what we need.

There is still just one little small problem remaining:

WE HAVE NO WAY TO RETURN RESULTS TO THE CALLER!

But that, of course, is not a limitation of the code we're generating, but one inherent in the call-by-value protocol. Notice that we CAN use formal parameters in any way inside the procedure. We can calculate new values for them, use them as loop counters (if we had loops, that is!), etc. So the code is doing what it's supposed to. To get over this last problem, we need to look at the alternative protocol.




CALL-BY-REFERENCE

This one is easy, now that we have the mechanisms already in place. We only have to make a few changes to the code generation. Instead of pushing a value onto the stack, we must push an address. As it turns out, the 68000 has an instruction, PEA, that does just that.

We'll be making a new version of the test program for this. Before we do anything else,

>>>> MAKE A COPY <<<<

of the program as it now stands, because we'll be needing it again later.

Let's begin by looking at the code we'd like to see generated for the new case. Using the same example as before, we need the call

FOO(X, Y)

to be translated to:

PEA X(PC) ; Push the address of X

PEA Y(PC) ; Push the address of Y

BSR FOO ; Call FOO



That's a simple matter of a slight change to Param:

{--------------------------------------------------------------}
{ Process an Actual Parameter }

procedure Param;
begin
   EmitLn('PEA ' + GetName + '(PC)');
end;
{--------------------------------------------------------------}

(Note that with pass-by-reference, we can't have expressions in the calling list, so Param can just read the name directly.)

At the other end, the references to the formal parameters must be given one level of indirection:

FOO: LINK A6,#0
     MOVE.L 12(A6),A0 ; Fetch the address of A
     MOVE (A0),D0 ; Fetch A
     MOVE D0,-(SP) ; Push it
     MOVE.L 8(A6),A0 ; Fetch the address of B
     MOVE (A0),D0 ; Fetch B
     ADD (SP)+,D0 ; Add A
     MOVE.L 12(A6),A0 ; Fetch the address of A
     MOVE D0,(A0) ; Store A
     UNLK A6
     RTS




All of this can be handled by changes to LoadParam and StoreParam:

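{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }

procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 4 * (NumParams - N);
   Emit('MOVE.L ');
   WriteLn(Offset, '(A6),A0');
   EmitLn('MOVE (A0),D0');
end;

{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }

procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 4 * (NumParams - N);
   Emit('MOVE.L ');
   WriteLn(Offset, '(A6),A0');
   EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}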



To get the count right, we must also change one line in ParamList:

ParamList := 4 * N;

That should do it. Give it a try and see if it's generating reasonable-looking code. As you will see, the code is hardly optimal, since we reload the address register every time a parameter is needed. But that's consistent with our KISS approach here, of just being sure to generate code that works. We'll just make a little note here, that here's yet another candidate for optimization, and press on.

Now we've learned to process parameters using pass-by-value and pass-by-reference. In the real world, of course, we'd like to be able to deal with BOTH methods. We can't do that yet, though, because we have not yet had a session on types, and that has to come first.

If we can only have ONE method, then of course it has to be the good ol' FORTRAN method of pass-by-reference, since that's the only way procedures can ever return values to their caller.

This, in fact, will be one of the differences between TINY and KISS. In the next version of TINY, we'll use pass-by-reference for all parameters. KISS will support both methods.
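As a quick cross-check of the numbers involved, the sketch below tabulates the frame offsets and the stack cleanup count for the two-parameter FOO under both conventions. It assumes the layout used throughout this section: 4 bytes of saved A6 plus 4 bytes of return address, then 2 bytes per value or 4 bytes per address. The function name FrameOffset and the constant FrameOverhead are made up for the illustration.

```pascal
program FrameOffsets;

const
  FrameOverhead = 8;   { 4 bytes of saved A6 plus 4 bytes of return address }

{ Offset from A6 of parameter N (counting from 1) out of NumParams,
  where every parameter occupies ParamSize bytes. }
function FrameOffset(N, NumParams, ParamSize: integer): integer;
begin
  FrameOffset := FrameOverhead + ParamSize * (NumParams - N);
end;

var
  N: integer;

begin
  for N := 1 to 2 do
    WriteLn('Parameter ', N, ': by value at ', FrameOffset(N, 2, 2),
            '(A6), by reference at ', FrameOffset(N, 2, 4), '(A6)');
  WriteLn('Bytes for the caller to remove: ', 2 * 2, ' by value, ',
          4 * 2, ' by reference');
end.
```

The first parameter comes out at 10(A6) by value and 12(A6) by reference, the second at 8(A6) in both cases, in agreement with the two versions of FOO shown earlier. The caller removes 4 bytes in the first case and 8 in the second.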




LOCAL VARIABLES

So far, we've said nothing about local variables, and our definition of procedures doesn't allow for them. Needless to say, that's a big gap in our language, and one that needs to be corrected.

Here again we are faced with a choice: Static or dynamic storage?

In those old FORTRAN programs, local variables were given static storage just like global ones. That is, each local variable got a name and allocated address, like any other variable, and was referenced by that name.

That's easy for us to do, using the allocation mechanisms already in place. Remember, though, that local variables can have the same names as global ones. We need to somehow deal with that by assigning unique names for these variables.

The characteristic of static storage, of course, is that the data survives a procedure call and return. When the procedure is called again, the data will still be there. That can be an advantage in some applications. In the FORTRAN days we used to do tricks like initialize a flag, so that you could tell when you were entering a procedure for the first time and could do any one-time initialization that needed to be done.

Of course, the same "feature" is also what makes recursion impossible with static storage. Any new call to a procedure will overwrite the data already in the local variables.
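A small experiment shows the overwrite in action. In this sketch the global SavedN stands in for a statically allocated "local", while LocalN is a genuine stack-allocated local. Partial is an ordinary local in both functions, there only to pin down the order of evaluation. All of the names are invented for the illustration.

```pascal
program StaticVersusDynamic;

var
  SavedN: integer;   { plays the role of a statically allocated "local" }

{ Factorial whose working copy of N lives in static storage }
function StaticFact(N: integer): integer;
var
  Partial: integer;
begin
  SavedN := N;
  if N <= 1 then
    StaticFact := 1
  else begin
    Partial := StaticFact(N - 1);     { the inner calls overwrite SavedN }
    StaticFact := Partial * SavedN;
  end;
end;

{ Factorial whose working copy of N lives on the stack }
function DynamicFact(N: integer): integer;
var
  LocalN, Partial: integer;
begin
  LocalN := N;
  if N <= 1 then
    DynamicFact := 1
  else begin
    Partial := DynamicFact(N - 1);    { each call has its own LocalN }
    DynamicFact := Partial * LocalN;
  end;
end;

begin
  WriteLn('Static storage:  4! = ', StaticFact(4));
  WriteLn('Dynamic storage: 4! = ', DynamicFact(4));
end.
```

The static version reports 1, because by the time each level multiplies, SavedN has been overwritten with the innermost value. The dynamic version reports the correct 24.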

The alternative is dynamic storage, in which storage is allocated on the stack just as for passed parameters. We also have the mechanisms already for doing this. In fact, the same routines that deal with passed (by value) parameters on the stack can easily deal with local variables as well ... the code to be generated is the same. The offset in the 68000 LINK instruction is there for just that reason: we can use it to adjust the stack pointer to make room for locals. Dynamic storage, of course, inherently supports recursion.



When I first began planning TINY, I must admit to being prejudiced in favor of static storage.That's simply because those old FORTRAN programs were pretty darned efficient ... theearly FORTRAN compilers produced a quality of code that's still rarely matched by moderncompilers. Even today, a given program written in FORTRAN is likely to outperform the sameprogram written in C or Pascal, sometimes by wide margins. (Whew! Am I going to hearabout THAT statement!)

I've always supposed that the reason had to do with the two main differences between FOR-TRAN implementations and the others: static storage and pass-by-reference. I know thatdynamic storage supports recursion, but it's always seemed to me a bit peculiar to be willingto accept slower code in the 95% of cases that don't need recursion, just to get that featurewhen you need it. The idea is that, with static storage, you can use absolute addressingrather than indirect addressing, which should result in faster code.

More recently, though, several folks have pointed out to me that there really is no performance penalty associated with dynamic storage. With the 68000, for example, you shouldn't use absolute addressing anyway ... most operating systems require position independent code. And the 68000 instruction

MOVE 8(A6),D0

has exactly the same timing as

MOVE X(PC),D0.

So I'm convinced, now, that there is no good reason NOT to use dynamic storage.

Since this use of local variables fits so well into the scheme of pass-by-value parameters, we'll use that version of the translator to illustrate it. (I _SURE_ hope you kept a copy!)

The general idea is to keep track of how many local parameters there are. Then we use the integer in the LINK instruction to adjust the stack pointer downward to make room for them. Formal parameters are addressed as positive offsets from the frame pointer, and locals as negative offsets. With a little bit of work, the same procedures we've already created can take care of the whole thing.




Let's start by creating a new variable, Base:

var Base: integer;

We'll use this variable, instead of NumParams, to compute stack offsets. That means changing the two references to NumParams in LoadParam and StoreParam:

{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }
procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (Base - N);
   Emit('MOVE ');
   WriteLn(Offset, '(A6),D0');
end;

{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }
procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (Base - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}



The idea is that the value of Base will be frozen after we have processed the formal parameters, and won't increase further as the new local variables are inserted in the symbol table. This is taken care of at the end of FormalList:

{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }
procedure FormalList;
begin
   Match('(');
   if Look <> ')' then begin
      FormalParam;
      while Look = ',' do begin
         Match(',');
         FormalParam;
      end;
   end;
   Match(')');
   Fin;
   Base := NumParams;
   NumParams := NumParams + 4;
end;
{--------------------------------------------------------------}




(We add four words to make allowances for the return address and old frame pointer, which end up between the formal parameters and the locals.)
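As a quick check on the arithmetic, the sketch below plugs numbers into the same offset formula for a procedure with two formal parameters and two locals. FrameOffset is a hypothetical helper, and the way locals are numbered (each new entry bumping NumParams by one before it is recorded) is an assumption about how AddParam behaves in the earlier version of the translator:

```pascal
program FrameOffsets;

{ Same formula as in LoadParam and StoreParam }
function FrameOffset(Base, N: integer): integer;
begin
   FrameOffset := 8 + 2 * (Base - N);
end;

var Base, NumParams, i: integer;

begin
   NumParams := 2;               { two formal parameters have been parsed }
   Base := NumParams;            { frozen at the end of FormalList }
   NumParams := NumParams + 4;   { skip return address and old frame pointer }

   { Formal parameters are numbered 1..Base }
   for i := 1 to Base do
      WriteLn('param ', i, ': ', FrameOffset(Base, i), '(A6)');

   { Two locals; assumed to be numbered by incrementing NumParams }
   for i := 1 to 2 do begin
      Inc(NumParams);
      WriteLn('local ', i, ': ', FrameOffset(Base, NumParams), '(A6)');
   end;
end.
```

With Base = 2 this prints offsets 10 and 8 for the two parameters and -2 and -4 for the two locals: parameters sit above the saved frame pointer and return address, locals below.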

About all we need to do next is to install the semantics for declaring local variables into the parser. The routines are very similar to Decl and TopDecls:

{--------------------------------------------------------------}
{ Parse and Translate a Local Data Declaration }
procedure LocDecl;
var Name: char;
begin
   Match('v');
   AddParam(GetName);
   Fin;
end;

{--------------------------------------------------------------}
{ Parse and Translate Local Declarations }
function LocDecls: integer;
var n: integer;
begin
   n := 0;
   while Look = 'v' do begin
      LocDecl;
      inc(n);
   end;
   LocDecls := n;
end;
{--------------------------------------------------------------}

Note that LocDecls is a FUNCTION, returning the number of locals to DoProc.




Next, we modify DoProc to use this information:

{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
    k: integer;
begin
   Match('p');
   N := GetName;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   FormalList;
   k := LocDecls;
   ProcProlog(N, k);
   BeginBlock;
   ProcEpilog;
   ClearParams;
end;
{--------------------------------------------------------------}

(I've made a couple of changes here that weren't really necessary. Aside from rearranging things a bit, I moved the call to Fin to within FormalList, and placed one inside LocDecls as well. Don't forget to put one at the end of FormalList, so that we're together here.)



Note the change in the call to ProcProlog. The new argument is the number of WORDS (not bytes) to allocate space for. Here's the new version of ProcProlog:

{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }
procedure ProcProlog(N: char; k: integer);
begin
   PostLabel(N);
   Emit('LINK A6,#');
   WriteLn(-2 * k)
end;
{--------------------------------------------------------------}

That should do it. Add these changes and see how they work.




CONCLUSION

At this point you know how to compile procedure declarations and procedure calls, with parameters passed by reference and by value. You can also handle local variables. As you can see, the hard part is not in providing the mechanisms, but in deciding just which mechanisms to use. Once we make these decisions, the code to translate the constructs is really not that difficult. I didn't show you how to deal with the combination of local parameters and pass-by-reference parameters, but that's a straightforward extension to what you've already seen. It just gets a little more messy, that's all, since we need to support both mechanisms instead of just one at a time. I'd prefer to save that one until after we've dealt with ways to handle different variable types.

That will be the next installment, which will be coming soon to a Forum near you. See you then.


Part 14 - Types


INTRODUCTION

In the last installment (Part XIII: PROCEDURES) I mentioned that in that part and this one, we would cover the two features that tend to separate the toy language from a real, usable one. We covered procedure calls in that installment. Many of you have been waiting patiently, since August '89, for me to drop the other shoe. Well, here it is.

In this installment, we'll talk about how to deal with different data types. As I did in the last segment, I will NOT incorporate these features directly into the TINY compiler at this time. Instead, I'll be using the same approach that has worked so well for us in the past: using only fragments of the parser and single-character tokens. As usual, this allows us to get directly to the heart of the matter without having to wade through a lot of unnecessary code. Since the major problems in dealing with multiple types occur in the arithmetic operations, that's where we'll concentrate our focus.

A few words of warning: First, there are some types that I will NOT be covering in this installment. Here we will ONLY be talking about the simple, predefined types. We won't even deal with arrays, pointers or strings in this installment; I'll be covering them in the next few.

Second, we also will not discuss user-defined types. That will not come until much later, for the simple reason that I still haven't convinced myself that user-defined types belong in a language named KISS. In later installments, I do intend to cover at least the general concepts of user-defined types, records, etc., just so that the series will be complete. But whether or not they will be included as part of KISS is still an open issue. I am open to comments or suggestions on this question.




Finally, I should warn you: what we are about to do CAN add considerable extra complication to both the parser and the generated code. Handling variables of different types is straightforward enough. The complexity comes in when you add rules about conversion between types. In general, you can make the compiler as simple or as complex as you choose to make it, depending upon the way you define the type-conversion rules. Even if you decide not to allow ANY type conversions (as in Ada, for example) the problem is still there, and is built into the mathematics. When you multiply two short numbers, for example, you can get a long result.

I've approached this problem very carefully, in an attempt to Keep It Simple. But we can't avoid the complexity entirely. As has so often happened, we end up having to trade code quality against complexity, and as usual I will tend to opt for the simplest approach.
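The remark about multiplication is easy to check. The following is a minimal sketch, assuming Free Pascal, where smallint is a 16-bit signed type; the program and variable names are made up for the illustration:

```pascal
program ShortTimesShort;

var A, B: smallint;   { 16-bit signed operands (Free Pascal type name) }
    P: longint;       { 32-bit product }

begin
   A := 300;
   B := 300;
   { Widen before multiplying so the product is computed in 32 bits }
   P := longint(A) * longint(B);
   WriteLn(P);                    { prints 90000 }
   WriteLn(P > High(smallint));   { prints TRUE: too big for 16 bits }
end.
```

Both operands fit comfortably in a word, yet the product does not, which is why the size of a result cannot simply be taken from the size of the operands.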



WHAT'S COMING NEXT?

Before diving into the tutorial, I think you'd like to know where we are going from here ... especially since it's been so long since the last installment.

I have not been idle in the meantime. What I've been doing is reorganizing the compiler itself into Turbo Units. One of the problems I've encountered is that as we've covered new areas and thereby added features to the TINY compiler, it's been getting longer and longer. I realized a couple of installments back that this was causing trouble, and that's why I've gone back to using only compiler fragments for the last installment and this one. The problem is that it just seems dumb to have to reproduce the code for, say, processing boolean exclusive OR's, when the subject of the discussion is parameter passing.

The obvious way to have our cake and eat it, too, is to break up the compiler into separately compilable modules, and of course the Turbo Unit is an ideal vehicle for doing this. This allows us to hide some fairly complex code (such as the full arithmetic and boolean expression parsing) into a single unit, and just pull it in whenever it's needed. In that way, the only code I'll have to reproduce in these installments will be the code that actually relates to the issue under discussion.

I've also been toying with Turbo 5.5, which of course includes the Borland object-oriented extensions to Pascal. I haven't decided whether to make use of these features, for two reasons. First of all, many of you who have been following this series may still not have 5.5, and I certainly don't want to force anyone to have to go out and buy a new compiler just to complete the series. Secondly, I'm not convinced that the O-O extensions have all that much value for this application. We've been having some discussions about that in CompuServe's CLM forum, and so far we've not found any compelling reason to use O-O constructs. This is another of those areas where I could use some feedback from you readers. Anyone want to vote for Turbo 5.5 and O-O?




In any case, after the next few installments in the series, the plan is to upload to you a complete set of Units, and complete functioning compilers as well. The plan, in fact, is to have THREE compilers: One for a single-character version of TINY (to use for our experiments), one for TINY and one for KISS. I've pretty much isolated the differences between TINY and KISS, which are these:

o TINY will support only two data types: The character and the 16-bit integer. I may also try to do something with strings, since without them a compiler would be pretty useless. KISS will support all the usual simple types, including arrays and even floating point.

o TINY will only have two control constructs, the IF and the WHILE. KISS will support a very rich set of constructs, including one we haven't discussed here before ... the CASE.

o KISS will support separately compilable modules.

One caveat: Since I still don't know much about 80x86 assembler language, all these compiler modules will still be written to support 68000 code. However, for the programs I plan to upload, all the code generation has been carefully encapsulated into a single unit, so that any enterprising student should be able to easily retarget to any other processor. This task is "left as an exercise for the student." I'll make an offer right here and now: For the person who provides us the first robust retarget to 80x86, I will be happy to discuss shared copyrights and royalties from the book that's upcoming.

But enough talk. Let's get on with the study of types. As I said earlier, we'll do this one as we did in the last installment: by performing experiments using single-character tokens.



THE SYMBOL TABLE

It should be apparent that, if we're going to deal with variables of different types, we're going to need someplace to record what those types are. The obvious vehicle for that is the symbol table, and we've already used it that way to distinguish, for example, between local and global variables, and between variables and procedures.

The symbol table structure for single-character tokens is particularly simple, and we've used it several times before. To deal with it, we'll steal some procedures that we've used before.

First, we need to declare the symbol table itself:

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

    ST: Array['A'..'Z'] of char;   { *** ADD THIS LINE ***}
{--------------------------------------------------------------}

Next, we need to make sure it's initialized as part of procedure Init:

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
end;
{--------------------------------------------------------------}




We don't really need the next procedure, but it will be helpful for debugging. All it does is to dump the contents of the symbol table:

{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      WriteLn(i, ' ', ST[i]);
end;
{--------------------------------------------------------------}

It really doesn't matter much where you put this procedure ... I plan to cluster all the symbol table routines together, so I put mine just after the error reporting procedures.



If you're the cautious type (as I am), you might want to begin with a test program that does nothing but initializes, then dumps the table. Just to be sure that we're all on the same wavelength here, I'm reproducing the entire program below, complete with the new procedures. Note that this version includes support for white space:

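{--------------------------------------------------------------}
program Types;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
      CR  = ^M;
      LF  = ^J;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

    ST: Array['A'..'Z'] of char;

{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      WriteLn(i, ' ', ST[i]);
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;

{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure Fin;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   DumpTable;
end.
{--------------------------------------------------------------}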

OK, run this program. You should get a (very fast) printout of all the letters of the alphabet (potential identifiers), each followed by a question mark. Not very exciting, but it's a start.

Of course, in general we only want to see the types of the variables that have been defined. We can eliminate the others by modifying DumpTable with an IF test. Change the loop to read:


   for i := 'A' to 'Z' do
      if ST[i] <> '?' then
         WriteLn(i, ' ', ST[i]);


Now, run the program again. What did you get?



Well, that's even more boring than before! There was no output at all, since at this point NONE of the names have been declared. We can spice things up a bit by inserting some statements declaring some entries in the main program. Try these:

ST['A'] := 'a';

ST['P'] := 'b';

ST['X'] := 'c';

This time, when you run the program, you should get an output showing that the symbol table is working right.




ADDING ENTRIES

Of course, writing to the table directly is pretty poor practice, and not one that will help us much later. What we need is a procedure to add entries to the table. At the same time, we know that we're going to need to test the table, to make sure that we aren't redeclaring a variable that's already in use (easy to do with only 26 choices!). To handle all this, enter the following new procedures:

{--------------------------------------------------------------}
{ Report Type of a Variable }
function TypeOf(N: char): char;
begin
   TypeOf := ST[N];
end;

{--------------------------------------------------------------}
{ Report if a Variable is in the Table }
function InTable(N: char): boolean;
begin
   InTable := TypeOf(N) <> '?';
end;

{--------------------------------------------------------------}
{ Check for a Duplicate Variable Name }
procedure CheckDup(N: char);
begin
   if InTable(N) then Abort('Duplicate Name ' + N);
end;

{--------------------------------------------------------------}
{ Add Entry to Table }
procedure AddEntry(N, T: char);
begin
   CheckDup(N);
   ST[N] := T;
end;
{--------------------------------------------------------------}




Now change the three lines in the main program to read:

AddEntry('A', 'a');

AddEntry('P', 'b');

AddEntry('X', 'c');

and run the program again. Did it work? Then we have the symbol table routines needed to support our work on types. In the next section, we'll actually begin to use them.



ALLOCATING STORAGE

In other programs like this one, including the TINY compiler itself, we have already addressed the issue of declaring global variables, and the code generated for them. Let's build a vestigial version of a "compiler" here, whose only function is to allow us to declare variables. Remember, the syntax for a declaration is:

<data decl> ::= VAR <identifier>

Again, we can lift a lot of the code from previous programs. The following are stripped-down versions of those procedures. They are greatly simplified since I have eliminated niceties like variable lists and initializers. In procedure Alloc, note that the new call to AddEntry will also take care of checking for duplicate declarations:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   AddEntry(N, 'v');
   WriteLn(N, ':', TAB, 'DC 0');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
end;

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}



Now, in the main program, add a call to TopDecls and run the program. Try allocating a few variables, and note the resulting code generated. This is old stuff for you, so the results should look familiar. Note from the code for TopDecls that the program is ended by a terminating period.

While you're at it, try declaring two variables with the same name, and verify that the parser catches the error.




DECLARING TYPES

Allocating storage of different sizes is as easy as modifying procedure TopDecls to recognize more than one keyword. There are a number of decisions to be made here, in terms of what the syntax should be, etc., but for now I'm going to duck all the issues and simply declare by executive fiat that our syntax will be:

<data decl> ::= <typename> <identifier>

where:

<typename> ::= BYTE | WORD | LONG

(By an amazing coincidence, the first letters of these names happen to be the same as the 68000 assembly code length specifications, so this choice saves us a little work.)

We can create the code to take care of these declarations with only slight modifications. In the routines below, note that I've separated the code generation parts of Alloc from the logic parts. This is in keeping with our desire to encapsulate the machine-dependent part of the compiler.

{--------------------------------------------------------------}
{ Generate Code for Allocation of a Variable }
procedure AllocVar(N, T: char);
begin
   WriteLn(N, ':', TAB, 'DC.', T, ' 0');
end;

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N, T: char);
begin
   AddEntry(N, T);
   AllocVar(N, T);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Typ: char;
begin
   Typ := GetName;
   Alloc(GetName, Typ);
end;

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'b', 'w', 'l': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}

Make the changes shown to these procedures, and give the thing a try. Use the single characters 'b', 'w', and 'l' for the keywords (they must be lower case, for now). You will see that in each case, we are allocating the proper storage size. Note from the dumped symbol table that the sizes are also recorded for later use. What later use? Well, that's the subject of the rest of this installment.



ASSIGNMENTS

Now that we can declare variables of different sizes, it stands to reason that we ought to be able to do something with them. For our first trick, let's just try loading them into our working register, D0. It makes sense to use the same idea we used for Alloc; that is, make a load procedure that can load more than one size. We also want to continue to encapsulate the machine-dependent stuff. The load procedure looks like this:

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}

On the 68000, at least, it happens that many instructions turn out to be MOVE's. It turns out to be useful to create a separate code generator just for these instructions, and then call it as needed:

{---------------------------------------------------------------}
{ Generate a Move Instruction }
procedure Move(Size: char; Source, Dest: String);
begin
   EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest);
end;
{---------------------------------------------------------------}




Note that these two routines are strictly code generators; they have no error-checking or other logic. To complete the picture, we need one more layer of software that provides these functions.

First of all, we need to make sure that the type we are dealing with is a loadable type. This sounds like a job for another recognizer:

{--------------------------------------------------------------}
{ Recognize a Legal Variable Type }
function IsVarType(c: char): boolean;
begin
   IsVarType := c in ['B', 'W', 'L'];
end;
{--------------------------------------------------------------}



Next, it would be nice to have a routine that will fetch the type of a variable from the symbol table, while checking it to make sure it's valid:

{--------------------------------------------------------------}
{ Get a Variable Type from the Symbol Table }
function VarType(Name: char): char;
var Typ: char;
begin
   Typ := TypeOf(Name);
   if not IsVarType(Typ) then Abort('Identifier ' + Name +
                                    ' is not a variable');
   VarType := Typ;
end;
{--------------------------------------------------------------}

Armed with these tools, a procedure to cause a variable to be loaded becomes trivial:

{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure Load(Name: char);
begin
   LoadVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}




(NOTE to the concerned: I know, I know, all this is all very inefficient. In a production program, we probably would take steps to avoid such deep nesting of procedure calls. Don't worry about it. This is an EXERCISE, remember? It's more important to get it right and understand it, than it is to make it get the wrong answer, quickly. If you get your compiler completed and find that you're unhappy with the speed, feel free to come back and hack the code to speed it up!)

It would be a good idea to test the program at this point. Since we don't have a procedure for dealing with assignments yet, I just added the lines:

Load('A');

Load('B');

Load('C');

Load('X');

to the main program. Thus, after the declaration section is complete, they will be executed to generate code for the loads. You can play around with this, and try different combinations of declarations to see how the errors are handled.

I'm sure you won't be surprised to learn that storing variables is a lot like loading them. The necessary procedures are shown next:

{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure StoreVar(Name, Typ: char);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   Move(Typ, 'D0', '(A0)');
end;

{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }
procedure Store(Name: char);
begin
   StoreVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}

You can test this one the same way as the loads.

Now, of course, it's a RATHER small step to use these to handle assignment statements. What we'll do is to create a special version of procedure Block that supports only assignment statements, and also a special version of Expression that only supports single variables as legal expressions. Here they are:




{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
var Name: char;
begin
   Load(GetName);
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   Store(Name);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
   while Look <> '.' do begin
      Assignment;
      Fin;
   end;
end;
{--------------------------------------------------------------}

(It's worth noting that the new procedures that permit us to manipulate types are, if anything, even simpler and cleaner than what we've seen before. This is mostly thanks to our efforts to encapsulate the code generator procedures.)

There is one small, nagging problem. Before, we used the Pascal terminating period to get us out of procedure TopDecls. This is now the wrong character ... it's used to terminate Block. In previous programs, we've used the BEGIN symbol (abbreviated 'b') to get us out. But that is now used as a type symbol.

The solution, while somewhat of a kludge, is easy enough. We'll use an UPPER CASE 'B' to stand for the BEGIN. So change the character in the WHILE loop within TopDecls, from '.' to 'B', and everything will be fine.




Now, we can complete the task by changing the main program to read:

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   TopDecls;
   Match('B');
   Fin;
   Block;
   DumpTable;
end.
{--------------------------------------------------------------}

(Note that I've had to sprinkle a few calls to Fin around to get us out of Newline troubles.)



OK, run this program. Try the input:

ba { byte a } *** DON'T TYPE THE COMMENTS!!! ***

wb { word b }

lc { long c }

B { begin }

a=a

a=b

a=c

b=a

b=b

b=c

c=a

c=b

c=c

.

For each declaration, you should get code generated that allocates storage. For each assignment, you should get code that loads a variable of the correct size, and stores one, also of the correct size.

There's only one small little problem: The generated code is WRONG!




Look at the code for a=c above. The code is:

MOVE.L C(PC),D0

LEA A(PC),A0

MOVE.B D0,(A0)

This code is correct. It will cause the lower eight bits of C to be stored into A, which is a reasonable behavior. It's about all we can expect to happen.

But now, look at the opposite case. For c=a, the code generated is:

MOVE.B A(PC),D0

LEA C(PC),A0

MOVE.L D0,(A0)

This is NOT correct. It will cause the byte variable A to be stored into the lower eight bits of D0. According to the rules for the 68000 processor, the upper 24 bits are unchanged. This means that when we store the entire 32 bits into C, whatever garbage that was in those high bits will also get stored. Not good.
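The effect can be imitated in a few lines of Pascal. This is only a simulation, not generated code: the variable D0 stands in for the register, MoveB is a hypothetical helper that mimics a byte-sized move into it, and the leftover value in D0 is an arbitrary choice for the illustration:

```pascal
program StaleHighBits;

var D0: longint;   { stands in for the 32-bit register D0 }
    A:  byte;      { the byte variable a }
    C:  longint;   { the long variable c }

{ Mimic MOVE.B src,D0: replace only the low 8 bits of D0.
  Plain arithmetic is used, which is valid here because D0 is nonnegative. }
procedure MoveB(Src: byte);
begin
   D0 := (D0 div 256) * 256 + Src;
end;

begin
   D0 := $12345678;   { whatever earlier code happened to leave behind }
   A := 42;

   MoveB(A);          { MOVE.B A(PC),D0 }
   C := D0;           { LEA C(PC),A0 / MOVE.L D0,(A0) }

   WriteLn('C = ', C, ' (wanted ', A, ')');   { C = 305419818 (wanted 42) }
end.
```

The low byte of C is indeed 42, but the three bytes above it still hold $123456 from the earlier contents of the register, so the long value stored is nowhere near 42.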

So what we have run into here, early on, is the issue of TYPE CONVERSION, or COERCION.

Before we do anything with variables of different types, even if it's just to copy them, we have to face up to the issue. It is not the easiest part of a compiler. Most of the bugs I have seen in production compilers have had to do with errors in type conversion for some obscure combination of arguments. As usual, there is a tradeoff between compiler complexity and the potential quality of the generated code, and as usual, we will take the path that keeps the compiler simple. I think you'll find that, with this approach, we can keep the potential complexity in check rather nicely.



THE COWARD'S WAY OUT

Before we get into the details (and potential complexity) of type conversion, I'd like you to see that there is one super-simple way to solve the problem: simply promote every variable to a long integer when we load it!

This takes the addition of only one line to LoadVar, although if we are not going to COMPLETELY ignore efficiency, it should be guarded by an IF test. Here is the modified version:

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
   if Typ <> 'L' then
      EmitLn('CLR.L D0');
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}

(Note that StoreVar needs no similar change.)




If you run some tests with this new version, you will find that everything works correctly now, albeit sometimes inefficiently. For example, consider the case a=b (for the same declarations shown above). Now the generated code turns out to be:

CLR.L D0

MOVE.W B(PC),D0

LEA A(PC),A0

MOVE.B D0,(A0)

In this case, the CLR turns out not to be necessary, since the result is going into a byte-sized variable. With a little bit of work, we can do better. Still, this is not bad, and it is typical of the kinds of inefficiencies that we've seen before in simple-minded compilers.

I should point out that, by setting the high bits to zero, we are in effect treating the numbers as UNSIGNED integers. If we want to treat them as signed ones instead (the more likely case) we should do a sign extension after the load, instead of a clear before it. Just to tie this part of the discussion up with a nice, red ribbon, let's change LoadVar as shown below:

{---------------------------------------------------------------}{ Load a Variable to Primary Register }procedure LoadVar(Name, Typ: char);begin if Typ = 'B' then EmitLn('CLR.L D0'); Move(Typ, Name + '(PC)', 'D0');

if Typ = 'W' then

EmitLn('EXT.L D0');

end;

{---------------------------------------------------------------}

With this version, a byte is treated as unsigned (as in Pascal and C), while a word istreated as signed.



A MORE REASONABLE SOLUTION

As we've seen, promoting every variable to long while it's in memory solves the problem, but it can hardly be called efficient, and probably wouldn't be acceptable even for those of us who claim to be unconcerned about efficiency. It will mean that all arithmetic operations will be done to 32-bit accuracy, which will DOUBLE the run time for most operations, and make it even worse for multiplication and division. For those operations, we would need to call subroutines to do them, even if the data were byte or word types. The whole thing is sort of a cop-out, too, since it ducks all the real issues.

OK, so that solution's no good. Is there still a relatively easy way to get data conversion? Can we still Keep It Simple?

Yes, indeed. All we have to do is to make the conversion at the other end ... that is, we convert on the way _OUT_, when the data is stored, rather than on the way in.

But, remember, the storage part of the assignment is pretty much independent of the data load, which is taken care of by procedure Expression. In general the expression may be arbitrarily complex, so how can procedure Assignment know what type of data is left in register D0?

Again, the answer is simple: We'll just _ASK_ procedure Expression! The answer can be returned as a function value.

All of this requires several procedures to be modified, but the mods, like the method, are quite simple. First of all, since we aren't requiring LoadVar to do all the work of conversion, let's go back to the simple version:

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name, Typ: char);
begin
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}




Next, let's add a new procedure that will convert from one type to another:

{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }

procedure Convert(Source, Dest: char);
begin
   if Source <> Dest then begin
      if Source = 'B' then
         EmitLn('AND.W #$FF,D0');
      if Dest = 'L' then
         EmitLn('EXT.L D0');
   end;
end;
{---------------------------------------------------------------}



Next, we need to do the logic required to load and store a variable of any type. Here are the routines for that:

{---------------------------------------------------------------}
{ Load a Variable to the Primary Register }

function Load(Name: char): char;
var Typ : char;
begin
   Typ := VarType(Name);
   LoadVar(Name, Typ);
   Load := Typ;
end;
{---------------------------------------------------------------}
{ Store a Variable from the Primary Register }

procedure Store(Name, T1: char);
var T2: char;
begin
   T2 := VarType(Name);
   Convert(T1, T2);
   StoreVar(Name, T2);
end;
{---------------------------------------------------------------}




Note that Load is a function, which not only emits the code for a load, but also returns the variable type. In this way, we always know what type of data we are dealing with. When we execute a Store, we pass it the current type of the variable in D0. Since Store also knows the type of the destination variable, it can convert as necessary.

Armed with all these new routines, the implementation of our rudimentary assignment statement is essentially trivial. Procedure Expression now becomes a function, which returns its type to procedure Assignment:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
begin
   Expression := Load(GetName);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Store(Name, Expression);
end;
{---------------------------------------------------------------}



Again, note how incredibly simple these two routines are. We've encapsulated all the type logic into Load and Store, and the trick of passing the type around makes the rest of the work extremely easy. Of course, all of this is for our special, trivial case of Expression. Naturally, for the general case it will have to get more complex. But you're looking now at the FINAL version of procedure Assignment!

All this seems like a very simple and clean solution, and it is indeed. Compile this program and run the same test cases as before. You will see that all types of data are converted properly, and there are few if any wasted instructions. Only the byte-to-long conversion uses two instructions where one would do, and we could easily modify Convert to handle this case, too.
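As an illustration of that last remark, here is a minimal sketch of one way Convert might handle the byte-to-long case with a single instruction. It assumes that bytes stay unsigned, as in the version above, and that masking with AND.L is an acceptable substitute for the AND.W / EXT.L pair. The EmitLn stand-in and the calls in the main block are hypothetical, added only so the sketch runs on its own.

```pascal
program ConvertSketch;

{ Stand-in for the compiler's EmitLn: one instruction per line }
procedure EmitLn(S: string);
begin
   WriteLn(#9, S);
end;

{ Hypothetical variant of Convert: byte-to-long in one instruction }
procedure Convert(Source, Dest: char);
begin
   if Source <> Dest then begin
      if Source = 'B' then begin
         if Dest = 'L' then
            EmitLn('AND.L #$FF,D0')      { zero-extend byte to long }
         else
            EmitLn('AND.W #$FF,D0');     { zero-extend byte to word }
      end
      else if Dest = 'L' then
         EmitLn('EXT.L D0');             { sign-extend word to long }
   end;
end;

begin
   Convert('B', 'L');
   Convert('B', 'W');
   Convert('W', 'L');
   Convert('L', 'B');                    { narrowing: nothing emitted }
end.
```

Run on its own, the sketch prints one instruction for each of the first three calls and nothing for the narrowing case, which matches the behavior of the original routine apart from the shortened byte-to-long sequence.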

Although we haven't considered unsigned variables in this case, I think you can see that we could easily fix up procedure Convert to deal with these types as well. This is "left as an exercise for the student."




LITERAL ARGUMENTS

Sharp-eyed readers might have noticed, though, that we don't even have a proper form of a simple factor yet, because we don't allow for loading literal constants, only variables. Let's fix that now.

To begin with, we'll need a GetNum function. We've seen several versions of this, some returning only a single character, some a string, and some an integer. The one needed here will return a LongInt, so that it can handle anything we throw at it. Note that no type information is returned here: GetNum doesn't concern itself with how the number will be used:

{---------------------------------------------------------------}
{ Get a Number }

function GetNum: LongInt;
var Val: LongInt;
begin
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;
end;
{---------------------------------------------------------------}



Now, when dealing with literal data, we have one small problem. With variables, we know what type things should be because they've been declared to be that type. We have no such type information for literals. When the programmer says, "-1," does that mean a byte, word, or longword version? We have no clue. The obvious thing to do would be to use the largest type possible, i.e. a longword. But that's a bad idea, because when we get to more complex expressions, we'll find that it will cause every expression involving literals to be promoted to long, as well.

A better approach is to select a type based upon the value of the literal, as shown next:

{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }

function LoadNum(N: LongInt): char;
var Typ : char;
begin
   if abs(N) <= 127 then
      Typ := 'B'
   else if abs(N) <= 32767 then
      Typ := 'W'
   else Typ := 'L';
   LoadConst(N, Typ);
   LoadNum := Typ;
end;
{---------------------------------------------------------------}




(I know, I know, the number base isn't really symmetric. You can store -128 in a single byte, and -32768 in a word. But that's easily fixed, and not worth the time or the added complexity to fool with it here. It's the thought that counts.)
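For readers who do want the asymmetric ranges, the fix might look something like the sketch below. LiteralType is a hypothetical helper, not one of the routines in this series, and it assumes a signed two's-complement interpretation for all three sizes.

```pascal
program LiteralTypeSketch;

{ Pick the smallest signed type that holds N, using the full
  two's-complement range of each size (hypothetical helper) }
function LiteralType(N: LongInt): char;
begin
   if (N >= -128) and (N <= 127) then
      LiteralType := 'B'
   else if (N >= -32768) and (N <= 32767) then
      LiteralType := 'W'
   else
      LiteralType := 'L';
end;

begin
   WriteLn(LiteralType(127));      { B }
   WriteLn(LiteralType(-128));     { B }
   WriteLn(LiteralType(128));      { W }
   WriteLn(LiteralType(-32768));   { W }
   WriteLn(LiteralType(32768));    { L }
end.
```

LoadNum could call such a helper in place of its two abs tests; nothing else in the routine would need to change.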

Note that LoadNum calls a new version of the code generator routine LoadConst, which has an added argument to define the type:

{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }

procedure LoadConst(N: LongInt; Typ: char);
var temp: string;
begin
   Str(N, temp);
   Move(Typ, '#' + temp, 'D0');
end;
{---------------------------------------------------------------}



Now we can modify procedure Expression to accommodate the two possible kinds of factors:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
begin
   if IsAlpha(Look) then
      Expression := Load(GetName)
   else
      Expression := LoadNum(GetNum);
end;
{---------------------------------------------------------------}

(Wow, that sure didn't hurt too bad! Just a few extra lines do the job.)

OK, compile this code into your program and give it a try. You'll see that it now works for either variables or constants as valid expressions.




ADDITIVE EXPRESSIONS

If you've been following this series from the beginning, I'm sure you know what's coming next: We'll expand the form for an expression to handle first additive expressions, then multiplicative, then general expressions with parentheses.

The nice part is that we already have a pattern for dealing with these more complex expressions. All we have to do is to make sure that all the procedures called by Expression (Term, Factor, etc.) always return a type identifier. If we do that, the program structure gets changed hardly at all.



The first step is easy: We can rename our existing function Expression to Term, as we've done so many times before, and create the new version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

function Expression: char;
var Typ: char;
begin
   if IsAddop(Look) then
      Typ := Unop
   else
      Typ := Term;
   while IsAddop(Look) do begin
      Push(Typ);
      case Look of
         '+': Typ := Add(Typ);
         '-': Typ := Subtract(Typ);
      end;
   end;
   Expression := Typ;
end;
{---------------------------------------------------------------}




Note in this routine how each procedure call has become a function call, and how the local variable Typ gets updated at each pass.

Note also the new call to a function Unop, which lets us deal with a leading unary minus. This change is not necessary ... we could still use a form more like what we've done before. I've chosen to introduce Unop as a separate routine because it will make it easier, later, to produce somewhat better code than we've been doing. In other words, I'm looking ahead to optimization issues.

For this version, though, we'll retain the same dumb old code, which makes the new routine trivial:

{---------------------------------------------------------------}
{ Process a Term with Leading Unary Operator }

function Unop: char;
begin
   Clear;
   Unop := 'W';
end;
{---------------------------------------------------------------}

Procedure Push is a code-generator routine, and now has a type argument:

{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push(Size: char);
begin
   Move(Size, 'D0', '-(SP)');
end;
{---------------------------------------------------------------}



Now, let's take a look at functions Add and Subtract. In the older versions of these routines, we let them call code generator routines PopAdd and PopSub. We'll continue to do that, which makes the functions themselves extremely simple:

{---------------------------------------------------------------}
{ Recognize and Translate an Add }

function Add(T1: char): char;
begin
   Match('+');
   Add := PopAdd(T1, Term);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }

function Subtract(T1: char): char;
begin
   Match('-');
   Subtract := PopSub(T1, Term);
end;
{---------------------------------------------------------------}




The simplicity is deceptive, though, because what we've done is to defer all the logic to PopAdd and PopSub, which are no longer just code generation routines. They must also now take care of the type conversions required.

And just what conversion is that? Simple: Both arguments must be of the same size, and the result is also of that size. The smaller of the two arguments must be "promoted" to the size of the larger one.
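Stated as a pure function, with no code generation at all, the rule might be sketched as follows. Rank and WiderType are hypothetical names used only for this illustration; they are not part of the compiler.

```pascal
program WiderTypeSketch;

{ Rank the three sizes by their width in bytes }
function Rank(T: char): integer;
begin
   case T of
      'B': Rank := 1;
      'W': Rank := 2;
   else
      Rank := 4;      { 'L' }
   end;
end;

{ Result type of an additive operation on operands of type T1 and T2 }
function WiderType(T1, T2: char): char;
begin
   if Rank(T1) >= Rank(T2) then
      WiderType := T1
   else
      WiderType := T2;
end;

begin
   WriteLn(WiderType('B', 'W'));   { W }
   WriteLn(WiderType('L', 'B'));   { L }
   WriteLn(WiderType('W', 'W'));   { W }
end.
```

The compiler itself has to do more than this, of course, because it must also emit the instruction that widens the smaller operand.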

But this presents a bit of a problem. If the argument to be promoted is the second argument (i.e. in the primary register D0), we are in great shape. If it's not, however, we're in a fix: we can't change the size of the information that's already been pushed onto the stack.

The solution is simple but a little painful: We must abandon those lovely "pop the data and do something with it" instructions thoughtfully provided by Motorola.

The alternative is to assign a secondary register, which I've chosen to be D7. (Why not D1? Because I have later plans for the other registers.)

The first step in this new structure is to introduce a Pop procedure analogous to the Push. This procedure will always Pop the top element of the stack into D7:

{---------------------------------------------------------------}
{ Pop Stack into Secondary Register }

procedure Pop(Size: char);
begin
   Move(Size, '(SP)+', 'D7');
end;
{---------------------------------------------------------------}



The general idea is that all the "Pop-Op" routines can call this one. When this is done, we will then have both operands in registers, so we can promote whichever one we need to. To deal with this, procedure Convert needs another argument, the register name:

{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }

procedure Convert(Source, Dest: char; Reg: String);
begin
   if Source <> Dest then begin
      if Source = 'B' then
         EmitLn('AND.W #$FF,' + Reg);
      if Dest = 'L' then
         EmitLn('EXT.L ' + Reg);
   end;
end;
{---------------------------------------------------------------}




The next function does a conversion, but only if the current type T1 is smaller in size than the desired type T2. It is a function, returning the final type to let us know what it decided to do:

{---------------------------------------------------------------}
{ Promote the Size of a Register Value }

function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
   Typ := T1;
   if T1 <> T2 then
      if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin
         Convert(T1, T2, Reg);
         Typ := T2;
      end;
   Promote := Typ;
end;
{---------------------------------------------------------------}



Finally, the following function forces the two registers to be of the same type:

{---------------------------------------------------------------}
{ Force both Arguments to Same Type }

function SameType(T1, T2: char): char;
begin
   T1 := Promote(T1, T2, 'D7');
   SameType := Promote(T2, T1, 'D0');
end;
{---------------------------------------------------------------}




These new routines give us the ammunition we need to flesh out PopAdd and PopSub:

{---------------------------------------------------------------}
{ Generate Code to Add Primary to the Stack }

function PopAdd(T1, T2: char): char;
begin
   Pop(T1);
   T2 := SameType(T1, T2);
   GenAdd(T2);
   PopAdd := T2;
end;
{---------------------------------------------------------------}
{ Generate Code to Subtract Primary from the Stack }

function PopSub(T1, T2: char): char;
begin
   Pop(T1);
   T2 := SameType(T1, T2);
   GenSub(T2);
   PopSub := T2;
end;
{---------------------------------------------------------------}



After all the buildup, the final results are almost anticlimactic. Once again, you can see that the logic is quite simple. All the two routines do is to pop the top-of-stack into D7, force the two operands to be the same size, and then generate the code.

Note the new code generator routines GenAdd and GenSub. These are vestigial forms of the ORIGINAL PopAdd and PopSub. That is, they are pure code generators, producing a register-to-register add or subtract:

{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure GenAdd(Size: char);
begin
   EmitLn('ADD.' + Size + ' D7,D0');
end;
{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure GenSub(Size: char);
begin
   EmitLn('SUB.' + Size + ' D7,D0');
   EmitLn('NEG.' + Size + ' D0');
end;
{---------------------------------------------------------------}




OK, I grant you: I've thrown a lot of routines at you since we last tested the code. But you have to admit that each new routine is pretty simple and transparent. If you (like me) don't like to test so many new routines at once, that's OK. You can stub out routines like Convert, Promote, and SameType, since they don't read any inputs. You won't get the correct code, of course, but things should work. Then flesh them out one at a time.
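For anyone taking the stub route, the placeholders might look something like the sketch below. What each stub returns is an arbitrary choice made for this illustration; all that matters is that the signatures match the real routines so that the rest of the compiler links and runs.

```pascal
program StubSketch;

{ Stub: accepts the arguments and emits nothing }
procedure Convert(Source, Dest: char; Reg: string);
begin
end;

{ Stub: never promotes, so the type is passed back unchanged }
function Promote(T1, T2: char; Reg: string): char;
begin
   Promote := T1;
end;

{ Stub: reports the type of the operand in D0 without converting }
function SameType(T1, T2: char): char;
begin
   SameType := T2;
end;

begin
   Convert('B', 'L', 'D0');
   WriteLn(Promote('B', 'L', 'D0'));   { B }
   WriteLn(SameType('B', 'L'));        { L }
end.
```

With these in place the parser can be exercised on its own; the sizes in the generated code will simply be wrong whenever the operands differ.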

When testing the program, don't forget that you first have to declare some variables, and then start the "body" of the program with an upper-case 'B' (for BEGIN). You should find that the parser will handle any additive expressions. Once all the conversion routines are in, you should see that the correct code is generated, with type conversions inserted where necessary. Try mixing up variables of different sizes, and also literals. Make sure that everything's working properly. As usual, it's a good idea to try some erroneous expressions and see how the compiler handles them.



WHY SO MANY PROCEDURES?

At this point, you may think I've pretty much gone off the deep end in terms of deeply nested procedures. There is admittedly a lot of overhead here. But there's a method in my madness. As in the case of Unop, I'm looking ahead to the time when we're going to want better code generation. The way the code is organized, we can achieve this without major modifications to the program. For example, in cases where the value pushed onto the stack does _NOT_ have to be converted, it's still better to use the "pop and add" instruction. If we choose to test for such cases, we can embed the extra tests into PopAdd and PopSub without changing anything else much.
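To make that last point concrete, here is a rough sketch of what such a test inside PopAdd might look like. The fast path is an assumption made for illustration, not code from this series: when both operands already have the same size, it adds straight from the stack and never touches D7. The supporting routines are repeated, with the Move calls written out as EmitLn calls, only so that the sketch is self-contained.

```pascal
program PopAddSketch;

procedure EmitLn(S: string);
begin
   WriteLn(#9, S);
end;

procedure Convert(Source, Dest: char; Reg: string);
begin
   if Source <> Dest then begin
      if Source = 'B' then
         EmitLn('AND.W #$FF,' + Reg);
      if Dest = 'L' then
         EmitLn('EXT.L ' + Reg);
   end;
end;

function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
   Typ := T1;
   if T1 <> T2 then
      if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin
         Convert(T1, T2, Reg);
         Typ := T2;
      end;
   Promote := Typ;
end;

function SameType(T1, T2: char): char;
begin
   T1 := Promote(T1, T2, 'D7');
   SameType := Promote(T2, T1, 'D0');
end;

{ PopAdd with a hypothetical fast path for operands of equal size }
function PopAdd(T1, T2: char): char;
begin
   if T1 = T2 then
      EmitLn('ADD.' + T1 + ' (SP)+,D0')         { pop and add in one step }
   else begin
      EmitLn('MOVE.' + T1 + ' (SP)+,D7');       { general path, as before }
      T2 := SameType(T1, T2);
      EmitLn('ADD.' + T2 + ' D7,D0');
   end;
   PopAdd := T2;
end;

var T: char;

begin
   T := PopAdd('W', 'W');
   WriteLn('result type: ', T);
   T := PopAdd('B', 'L');
   WriteLn('result type: ', T);
end.
```

Nothing outside PopAdd has to know about the shortcut, which is the benefit of keeping the conversion logic in one place.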




MULTIPLICATIVE EXPRESSIONS

The procedure for dealing with multiplicative operators is much the same. In fact, at the first level, they are almost identical, so I'll just show them here without much fanfare. The first one is our general form for Factor, which includes parenthetical subexpressions:

{---------------------------------------------------------------}
{ Parse and Translate a Factor }

function Expression: char; Forward;

function Factor: char;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Load(GetName)
   else
      Factor := LoadNum(GetNum);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Multiply }

function Multiply(T1: char): char;
begin
   Match('*');
   Multiply := PopMul(T1, Factor);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Divide }

function Divide(T1: char): char;
begin
   Match('/');
   Divide := PopDiv(T1, Factor);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

function Term: char;
var Typ: char;
begin
   Typ := Factor;
   while IsMulop(Look) do begin
      Push(Typ);
      case Look of
         '*': Typ := Multiply(Typ);
         '/': Typ := Divide(Typ);
      end;
   end;
   Term := Typ;
end;
{---------------------------------------------------------------}

These routines parallel the additive ones almost exactly. As before, the complexity is encapsulated within PopMul and PopDiv. If you'd like to test the program before we get into that, you can build dummy versions of them, similar to PopAdd and PopSub. Again, the code won't be correct at this point, but the parser should handle expressions of arbitrary complexity.
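Dummy versions along those lines might look like the sketch below. They do no type conversion at all and always emit a word-sized operation; the returned types are placeholders chosen for this illustration. EmitLn is a stand-in for the compiler's own routine, and the Move calls are written out as EmitLn calls to keep the sketch self-contained.

```pascal
program DummyMulDiv;

procedure EmitLn(S: string);
begin
   WriteLn(#9, S);
end;

{ Dummy: word multiply, no conversion, type passed straight through }
function PopMul(T1, T2: char): char;
begin
   EmitLn('MOVE.' + T1 + ' (SP)+,D7');
   EmitLn('MULS D7,D0');
   PopMul := T2;
end;

{ Dummy: word divide, no conversion, result typed like the dividend }
function PopDiv(T1, T2: char): char;
begin
   EmitLn('MOVE.' + T1 + ' (SP)+,D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE.W D7,D0');
   PopDiv := T1;
end;

var T: char;

begin
   T := PopMul('W', 'W');
   WriteLn('result type: ', T);
   T := PopDiv('W', 'W');
   WriteLn('result type: ', T);
end.
```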



MULTIPLICATION

Once you've convinced yourself that the parser itself is working properly, we need to figure out what it will take to generate the right code. This is where things begin to get a little sticky, because the rules are more complex.

Let's take the case of multiplication first. This operation is similar to the "addops" in that both operands should be of the same size. It differs in three important respects:

o The type of the product is typically not the same as that of the two operands. For the product of two words, we get a longword result.

o The 68000 does not support a 32 x 32 multiply, so a call to a software routine is needed. This routine will become part of the run-time library.

o It also does not support an 8 x 8 multiply, so all byte operands must be promoted to words.


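          |     T1 = B      |     T1 = W      |     T1 = L      |
-----------------------------------------------------------------
 T2 = B   | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to W |                 |                 |
          | MULS            | MULS            | JSR MUL32       |
          | Result = W      | Result = L      | Result = L      |
-----------------------------------------------------------------
 T2 = W   | Convert D7 to W |                 | Convert D0 to L |
          | MULS            | MULS            | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
-----------------------------------------------------------------
 T2 = L   | Convert D7 to L | Convert D7 to L |                 |
          | JSR MUL32       | JSR MUL32       | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
-----------------------------------------------------------------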

This table shows the actions to be taken for each combination of operand types. There are three things to note: First, we assume a library routine MUL32 which performs a 32 x 32 multiply, leaving a >> 32-bit << (not 64-bit) product. If there is any overflow in the process, we choose to ignore it and return only the lower 32 bits.

Second, note that the table is symmetric ... the two operands enter in the same way. Finally, note that the product is ALWAYS a longword, except when both operands are bytes. (It's worth noting, in passing, that this means that many expressions will end up being longwords, whether we like it or not. Perhaps the idea of just promoting them all up front wasn't all that outrageous, after all!)
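The two decisions described above (which type the product gets, and whether the library routine is needed) can be written down as a pair of small functions. This is only a sketch of the rule as stated in the text; MulResultType and NeedsMul32 are hypothetical names, not routines from the compiler.

```pascal
program MulTypeSketch;

{ Type of the product: a longword unless both operands are bytes }
function MulResultType(T1, T2: char): char;
begin
   if (T1 = 'B') and (T2 = 'B') then
      MulResultType := 'W'
   else
      MulResultType := 'L';
end;

{ True when the 32 x 32 library routine is needed }
function NeedsMul32(T1, T2: char): boolean;
begin
   NeedsMul32 := (T1 = 'L') or (T2 = 'L');
end;

begin
   WriteLn(MulResultType('B', 'B'));   { W }
   WriteLn(MulResultType('B', 'W'));   { L }
   WriteLn(MulResultType('W', 'L'));   { L }
   WriteLn(NeedsMul32('W', 'W'));
   WriteLn(NeedsMul32('B', 'L'));
end.
```

Both functions give the same answer when their arguments are swapped, which is the symmetry noted above.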




Now, clearly, we are going to have to generate different code for the 16-bit and 32-bit multiplies. This is best done by having separate code generator routines for the two cases:

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Word) }

procedure GenMult;
begin
   EmitLn('MULS D7,D0')
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Long) }

procedure GenLongMult;
begin
   EmitLn('JSR MUL32');
end;
{---------------------------------------------------------------}



An examination of the code below for PopMul should convince you that the conditions in the table are met:

{---------------------------------------------------------------}
{ Generate Code to Multiply Primary by Stack }

function PopMul(T1, T2: char): char;
var T: char;
begin
   Pop(T1);
   T := SameType(T1, T2);
   Convert(T, 'W', 'D7');
   Convert(T, 'W', 'D0');
   if T = 'L' then
      GenLongMult
   else
      GenMult;
   if T = 'B' then
      PopMul := 'W'
   else
      PopMul := 'L';
end;
{---------------------------------------------------------------}




As you can see, the routine starts off just like PopAdd. The two arguments are forced to the same type. The two calls to Convert take care of the case where both operands are bytes. The data themselves are promoted to words, but the routine remembers the type so as to assign the correct type to the result. Finally, we call one of the two code generator routines, and then assign the result type. Not too complicated, really.

At this point, I suggest that you go ahead and test the program. Try all combinations of operand sizes.



DIVISION

The case of division is not nearly so symmetric. I also have some bad news for you:

All modern 16-bit CPUs support integer divide. The manufacturer's data sheet will describe this operation as a 32 x 16-bit divide, meaning that you can divide a 32-bit dividend by a 16-bit divisor. Here's the bad news:

THEY'RE LYING TO YOU!!!

If you don't believe it, try dividing any large 32-bit number (meaning that it has non-zero bits in the upper 16 bits) by the integer 1. You are guaranteed to get an overflow: a trap on some CPUs, or, on the 68000, a set overflow flag and no usable quotient.

The problem is that the instruction really requires that the resulting quotient fit into a 16-bit result. This won't happen UNLESS the divisor is sufficiently large. When any number is divided by unity, the quotient will of course be the same as the dividend, which had better fit into a 16-bit word.
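The condition can be checked with ordinary arithmetic. The sketch below assumes signed operands and a nonzero divisor; QuotientFitsWord is a hypothetical helper used only to illustrate the point, not something the compiler needs.

```pascal
program DivFitSketch;

{ True when Dividend div Divisor fits in a signed 16-bit word,
  which is what the hardware divide needs in order not to overflow.
  Assumes Divisor is not zero. }
function QuotientFitsWord(Dividend, Divisor: LongInt): boolean;
var Q: LongInt;
begin
   Q := Dividend div Divisor;
   QuotientFitsWord := (Q >= -32768) and (Q <= 32767);
end;

begin
   WriteLn(QuotientFitsWord(100000, 1));   { quotient 100000: too big }
   WriteLn(QuotientFitsWord(100000, 3));   { quotient 33333: too big }
   WriteLn(QuotientFitsWord(100000, 4));   { quotient 25000: fits }
end.
```

Only the last call reports that the quotient fits, even though all three dividends and divisors are perfectly legal operands for a "32 x 16" divide.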

Since the beginning of time (well, computers, anyway), CPU architects have provided this little gotcha in the division circuitry. It provides a certain amount of symmetry in things, since it is sort of the inverse of the way a multiply works. But since unity is a perfectly valid (and rather common) number to use as a divisor, the division as implemented in hardware needs some help from us programmers.

The implications are as follows:

o The type of the quotient must always be the same as that of the dividend. It is independent of the divisor.

o In spite of the fact that the CPU supports a longword dividend, the hardware-provided instruction can only be trusted for byte and word dividends. For longword dividends, we need another library routine that can return a long result.


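          |     T1 = B      |     T1 = W      |     T1 = L      |
-----------------------------------------------------------------
 T2 = B   | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to L | Convert D7 to L |                 |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
-----------------------------------------------------------------
 T2 = W   | Convert D7 to L | Convert D7 to L | Convert D0 to L |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
-----------------------------------------------------------------
 T2 = L   | Convert D7 to L | Convert D7 to L |                 |
          | JSR DIV32       | JSR DIV32       | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
-----------------------------------------------------------------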

(You may wonder why it's necessary to do a 32-bit division, when the dividend is, say, only a byte in the first place. Since the number of bits in the result can only be as many as that in the dividend, why bother? The reason is that, if the divisor is a longword, and there are any high bits set in it, the result of the division must be zero. We might not get that if we only use the lower word of the divisor.)
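A small worked example shows how far off the answer can be. The numbers are chosen purely for illustration: a byte-sized dividend of 100 and a longword divisor of 65537, whose lower word happens to be 1.

```pascal
program LowWordDivisor;

var Dividend, Divisor, LowWord: LongInt;

begin
   Dividend := 100;
   Divisor  := 65537;               { $00010001: high bits are set }
   LowWord  := Divisor and $FFFF;   { what a 16-bit divide would see: 1 }

   WriteLn(Dividend div Divisor);   { 0 }
   WriteLn(Dividend div LowWord);   { 100 }
end.
```

The true quotient is zero, but dividing by the lower word alone would return the dividend unchanged.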




The following code provides the correct function for PopDiv:

{---------------------------------------------------------------}
{ Generate Code to Divide Stack by the Primary }

function PopDiv(T1, T2: char): char;
begin
   Pop(T1);
   Convert(T1, 'L', 'D7');
   if (T1 = 'L') or (T2 = 'L') then begin
      Convert(T2, 'L', 'D0');
      GenLongDiv;
      PopDiv := 'L';
      end
   else begin
      Convert(T2, 'W', 'D0');
      GenDiv;
      PopDiv := T1;
   end;
end;
{---------------------------------------------------------------}



The two code generation procedures are:

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Word) }

procedure GenDiv;
begin
   EmitLn('DIVS D0,D7');
   Move('W', 'D7', 'D0');
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Long) }

procedure GenLongDiv;
begin
   EmitLn('JSR DIV32');
end;
{---------------------------------------------------------------}

Note that we assume that DIV32 leaves the (longword) result in D0.

OK, install the new procedures for division. At this point you should be able to generate code for any kind of arithmetic expression. Give it a whirl!




BEGINNING TO WIND DOWN

At last, in this installment, we've learned how to deal with variables (and literals) of different types. As you can see, it hasn't been too tough. In fact, in some ways most of the code looks even more simple than it does in earlier programs. Only the multiplication and division operators require a little thinking and planning.

The main concept that made things easy was that of converting procedures such as Expression into functions that return the type of the result. Once this was done, we were able to retain the same general structure of the compiler.

I won't pretend that we've covered every single aspect of the issue. I conveniently ignored unsigned arithmetic. From what we've done, I think you can see that to include it adds no new challenges, just extra possibilities to test for.

I've also ignored the logical operators And, Or, etc. It turns out that these are pretty easy to handle. All the logical operators are bitwise operations, so they are symmetric and therefore work in the same fashion as PopAdd. There is one difference, however: if it is necessary to extend the word length for a logical variable, the extension should be done as an UNSIGNED number. Floating point numbers, again, are straightforward to handle ... just a few more procedures to be added to the run-time library, or perhaps instructions for a math chip.

Perhaps more importantly, I have also skirted the issue of type CHECKING, as opposed to conversion. In other words, we've allowed for operations between variables of all combinations of types. In general this will not be true ... certainly you don't want to add an integer, for example, to a string. Most languages also don't allow you to mix up character and integer variables.

Again, there are really no new issues to be addressed in this case. We are already checking the types of the two operands ... much of this checking gets done in procedures like SameType. It's pretty straightforward to include a call to an error handler, if the types of the two operands are incompatible.
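As a rough idea of what such a check could look like, the sketch below tests a pair of operand types before any promotion is attempted. Everything in it is hypothetical: the names IsNumeric, Compatible and CheckTypes, the rule that only B, W and L may be mixed, and the invented type code 'S' for a string. A real compiler would call its own error handler where this sketch merely prints a message.

```pascal
program TypeCheckSketch;

{ True for the numeric types used in this installment }
function IsNumeric(T: char): boolean;
begin
   IsNumeric := T in ['B', 'W', 'L'];
end;

{ Hypothetical check to run before any promotion is attempted }
function Compatible(T1, T2: char): boolean;
begin
   Compatible := IsNumeric(T1) and IsNumeric(T2);
end;

{ Stand-in for a call to the compiler's error handler }
procedure CheckTypes(T1, T2: char);
begin
   if Compatible(T1, T2) then
      WriteLn(T1, ' and ', T2, ': ok')
   else
      WriteLn(T1, ' and ', T2, ': incompatible operand types');
end;

begin
   CheckTypes('B', 'L');
   CheckTypes('W', 'S');    { 'S' is an invented string type }
end.
```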



In the general case, we can think of every single operator as being handled by a different procedure, depending upon the type of the two operands. This is straightforward, though tedious, to implement simply by implementing a jump table with the operand types as indices. In Pascal, the equivalent operation would involve nested Case statements. Some of the called procedures could then be simple error routines, while others could effect whatever kind of conversion we need. As more types are added, the number of procedures goes up by a square-law rule, but that's still not an unreasonably large number of procedures.

What we've done here is to collapse such a jump table into far fewer procedures, simply by making use of symmetry and other simplifying rules.
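For comparison, here is a sketch of what the uncollapsed, nested-Case form might look like for a single operator, addition. To keep it short, the function returns a description of the action instead of emitting code. The name AddAction and the wording of the descriptions are assumptions made for this illustration; T1 is taken to be the type of the operand popped into D7, and T2 the type of the operand in D0.

```pascal
program DispatchSketch;

{ Action for an add, selected by the two operand types }
function AddAction(T1, T2: char): string;
begin
   AddAction := 'error: incompatible types';
   case T1 of
      'B': case T2 of
              'B': AddAction := 'ADD.B';
              'W': AddAction := 'promote D7 to W, ADD.W';
              'L': AddAction := 'promote D7 to L, ADD.L';
           end;
      'W': case T2 of
              'B': AddAction := 'promote D0 to W, ADD.W';
              'W': AddAction := 'ADD.W';
              'L': AddAction := 'promote D7 to L, ADD.L';
           end;
      'L': case T2 of
              'B', 'W': AddAction := 'promote D0 to L, ADD.L';
              'L': AddAction := 'ADD.L';
           end;
   end;
end;

begin
   WriteLn(AddAction('B', 'W'));
   WriteLn(AddAction('L', 'L'));
   WriteLn(AddAction('W', 'S'));   { 'S' is not a known type }
end.
```

Nine entries for three types is manageable, but the count grows as the square of the number of types, and every operator needs its own table. That is the bookkeeping that Promote and SameType let us avoid.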




TO COERCE OR NOT TO COERCE

In case you haven't gotten this message yet, it sure appears that TINY and KISS will probably _NOT_ be strongly typed languages, since I've allowed for automatic mixing and conversion of just about any type. Which brings up the next issue:

Is this really what we want to do?

The answer depends on what kind of language you want, and the way you'd like it to behave. What we have not addressed is the issue of when to allow and when to deny the use of operations involving different data types. In other words, what should be the SEMANTICS of our compiler? Do we want automatic type conversion for all cases, for some cases, or not at all?

Let's pause here to think about this a bit more. To do so, it will help to look at a bit of history.

FORTRAN II supported only two simple data types: Integer and Real. It allowed implicit type conversion between real and integer types during assignment, but not within expressions. All data items (including literal constants) on the right-hand side of an assignment statement had to be of the same type. That made things pretty easy ... much simpler than what we've had to do here.

This was changed in FORTRAN IV to support "mixed-mode" arithmetic. If an expression had any real data items in it, all the items were converted to reals and the expression itself was real. To round out the picture, functions were provided to explicitly convert from one type to the other, so that you could force an expression to end up as either type.

This led to two things: code that was easier to write, and code that was less efficient. That's because sloppy programmers would write expressions with simple constants like 0 and 1 in them, which the compiler would dutifully compile to convert at execution time. Still, the system worked pretty well, which would tend to indicate that implicit type conversion is a Good Thing.

C is also a weakly typed language, though it supports a larger number of types. C won't complain if you try to add a character to an integer, for example. Partly, this is helped by the C convention of promoting every char to integer when it is loaded, or passed through a parameter list. This simplifies the conversions quite a bit. In fact, in subset C compilers that don't support long or float types, we end up back where we were in our earlier, simple-minded first try: every variable has the same representation, once loaded into a register. Makes life pretty easy!

The ultimate language in the direction of automatic type conversion is PL/I. This language supports a large number of data types, and you can mix them all freely. If the implicit conversions of FORTRAN seemed good, then those of PL/I should have been Heaven, but it turned out to be more like Hell! The problem was that with so many data types, there had to be a large number of different conversions, AND a correspondingly large number of rules about how mixed operands should be converted. These rules became so complex that no one could remember what they were! A lot of the errors in PL/I programs had to do with unexpected and unwanted type conversions. Too much of a Good Thing can be bad for you!

Pascal, on the other hand, is a language which is "strongly typed," which means that in general you can't mix types, even if they differ only in _NAME_, and yet have the same base type! Niklaus Wirth made Pascal strongly typed to help keep programmers out of trouble, and the restrictions have indeed saved many a programmer from himself, because the compiler kept him from doing something dumb. Better to find the bug in compilation rather than the debug phase. The same restrictions can also cause frustration when you really WANT to mix types, and they tend to drive an ex-C-programmer up the wall.

Even so, Pascal does permit some implicit conversions. You can assign an integer to a real value. You can also mix integer and real types in expressions of type Real. The integers will be automatically coerced to real, just as in FORTRAN (and with the same hidden cost in run-time overhead).

You can't, however, convert the other way, from real to integer, without applying an explicit conversion function, Trunc. The theory here is that, since the numerical value of a real number is necessarily going to be changed by the conversion (the fractional part will be lost), you really shouldn't do it in "secret."

In the spirit of strong typing, Pascal will not allow you to mix Char and Integer variables, without applying the explicit coercion functions Chr and Ord.
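The rules just described can be tried out in a few lines. The following is only a sketch to make them concrete; the program and variable names are invented for the example, and the two statements left in comments are the ones a Pascal compiler should refuse.

```pascal
program Conversions;
{ Illustrative only: names are made up for this sketch. }
var
  i: integer;
  r: real;
  c: char;
begin
  i := 7;
  r := i;              { integer -> real: implicit, always allowed   }
  r := r / 2 + i;      { mixed expression: i is coerced to real      }
  { i := r; }          { rejected: real -> integer must be explicit  }
  i := Trunc(r);       { explicit; the fractional part is discarded  }
  WriteLn(i);          { 10.5 truncates to 10 }

  c := 'A';
  { i := c + 1; }      { rejected: Char and Integer don't mix        }
  i := Ord(c) + 1;     { explicit Char -> Integer                    }
  c := Chr(i);         { explicit Integer -> Char                    }
  WriteLn(c);          { B }
end.
```

Removing the braces around either commented statement should produce a type-mismatch error at compile time, which is exactly the protection Wirth intended.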




Turbo Pascal also includes the types Byte, Word, and LongInt. The first two are basically the same as unsigned integers. In Turbo, these can be freely intermixed with variables of type Integer, and Turbo will automatically handle the conversion. There are run-time checks, though, to keep you from overflowing or otherwise getting the wrong answer. Note that you still can't mix Byte and Char types, even though they are stored internally in the same representation.
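A small sketch of this mixing follows. It assumes range checking is switched on with the {$R+} directive, since the run-time checks are a compiler option; the program and variable names are hypothetical.

```pascal
program MixTypes;
{$R+}                  { assumption: range checking turned on }
var
  b: Byte;
  w: Word;
  i: Integer;
  c: Char;
begin
  b := 200;
  w := 1000;
  i := -5;
  i := b + w + i;      { Byte, Word and Integer mix freely }
  WriteLn(i);          { 1195 }

  { b := w; }          { compiles, but with w = 1000 the range check
                         would stop the program at run time }

  c := 'A';
  { b := c; }          { rejected: Byte and Char are different types }
  b := Ord(c);         { explicit conversion is still required }
  WriteLn(b);          { 65 }
end.
```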

The ultimate in a strongly-typed language is Ada, which allows _NO_ implicit type conversions at all, and also will not allow mixed-mode arithmetic. Jean Ichbiah's position is that conversions cost execution time, and you shouldn't be allowed to build in such cost in a hidden manner. By forcing the programmer to explicitly request a type conversion, you make it more apparent that there could be a cost involved.

I have been using another strongly-typed language, a delightful little language called Whimsical, by John Spray. Although Whimsical is intended as a systems programming language, it also requires explicit conversion EVERY time. There are NEVER any automatic conversions, even the ones supported by Pascal.

This approach does have certain advantages: The compiler never has to guess what to do: the programmer always tells it precisely what he wants. As a result, there tends to be a more nearly one-to-one correspondence between source code and compiled code, and John's compiler produces VERY tight code.

On the other hand, I sometimes find the explicit conversions to be a pain. If I want, for example, to add one to a character, or AND it with a mask, there are a lot of conversions to make. If I get it wrong, the only error message is "Types are not compatible." As it happens, John's particular implementation of the language in his compiler doesn't tell you exactly WHICH types are not compatible ... it only tells you which LINE the error is in.

I must admit that most of my errors with this compiler tend to be errors of this type, and I've spent a lot of time with the Whimsical compiler, trying to figure out just WHERE in the line I've offended it. The only real way to fix the error is to keep trying things until something works.



So what should we do in TINY and KISS? For the first one, I have the answer: TINY will support only the types Char and Integer, and we'll use the C trick of promoting Chars to Integers internally. That means that the TINY compiler will be _MUCH_ simpler than what we've already done. Type conversion in expressions is sort of moot, since none will be required! Since longwords will not be supported, we also won't need the MUL32 and DIV32 run-time routines, nor the logic to figure out when to call them. I _LIKE_ it!

KISS, on the other hand, will support the type Long.

Should it support both signed and unsigned arithmetic? For the sake of simplicity I'd rather not. It does add quite a bit to the complexity of type conversions. Even Niklaus Wirth has eliminated unsigned (Cardinal) numbers from his new language Oberon, with the argument that 32-bit integers should be long enough for anybody, in either case.

But KISS is supposed to be a systems programming language, which means that we should be able to do whatever operations that can be done in assembler. Since the 68000 supports both flavors of integers, I guess KISS should, also. We've seen that logical operations need to be able to extend integers in an unsigned fashion, so the unsigned conversion procedures are required in any case.
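The difference between the two kinds of extension is easy to see from Pascal itself. This is only an illustration using Turbo Pascal's 8-bit types ShortInt (signed) and Byte (unsigned); the program name and variables are invented for the example. Both variables hold the same bit pattern, $FF, yet they widen to different LongInt values.

```pascal
program Extend;
{ Illustrative only: same 8 bits, two different extensions. }
var
  s: ShortInt;
  b: Byte;
  l: LongInt;
begin
  s := -1;             { bit pattern $FF, read as signed   }
  b := 255;            { bit pattern $FF, read as unsigned }
  l := s;              { signed: upper bits copy the sign bit }
  WriteLn(l);          { -1 }
  l := b;              { unsigned: upper bits are filled with zeros }
  WriteLn(l);          { 255 }
end.
```

A compiler that supports both flavors has to pick the right one of these two conversions every time it widens an operand, which is where the extra complexity comes from.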




CONCLUSION

That wraps up our session on type conversions. Sorry you had to wait so long for it, but I hope you feel that it was worth the wait.

In the next few installments, we'll extend the simple types to include arrays and pointers, and we'll have a look at what to do about strings. That should pretty well wrap up the mainstream part of the series. After that, I'll give you the new versions of the TINY and KISS compilers, and then we'll start to look at optimization issues.

See you then.


Part 15 - Back To The Future


INTRODUCTION

Can it really have been four years since I wrote installment fourteen of this series? Is it really possible that six long years have passed since I began it? Funny how time flies when you're having fun, isn't it?

I won't spend a lot of time making excuses; only point out that things happen, and priorities change. In the four years since installment fourteen, I've managed to get laid off, get divorced, have a nervous breakdown, begin a new career as a writer, begin another one as a consultant, move, work on two real-time systems, and raise fourteen baby birds, three pigeons, six possums, and a duck. For a while there, the parsing of source code was not high on my list of priorities. Neither was writing stuff for free, instead of writing stuff for pay. But I do try to be faithful, and I do recognize and feel my responsibility to you, the reader, to finish what I've started. As the tortoise said in one of my son's old stories, I may be slow, but I'm sure. I'm sure that there are people out there anxious to see the last reel of this film, and I intend to give it to them. So, if you're one of those who's been waiting, more or less patiently, to see how this thing comes out, thanks for your patience. I apologize for the delay. Let's move on.




NEW STARTS, OLD DIRECTIONS

Like many other things, programming languages and programming styles change with time. In 1994, it seems a little anachronistic to be programming in Turbo Pascal, when the rest of the world seems to have gone bananas over C++. It also seems a little strange to be programming in a classical style when the rest of the world has switched to object-oriented methods. Still, in spite of the four-year hiatus, it would be entirely too wrenching a change, at this point, to switch to, say, C++ with object-orientation. Anyway, Pascal is still not only a powerful programming language (more than ever, in fact), but it's a wonderful medium for teaching. C is a notoriously difficult language to read ... it's often been accused, along with Forth, of being a "write-only language." When I program in C++, I find myself spending at least 50% of my time struggling with language syntax rather than with concepts. A stray "&" or "*" can not only change the functioning of the program, but its correctness as well. By contrast, Pascal code is usually quite transparent and easy to read, even if you don't know the language. What you see is almost always what you get, and we can concentrate on concepts rather than implementation details. I've said from the beginning that the purpose of this tutorial series was not to generate the world's fastest compiler, but to teach the fundamentals of compiler technology, while spending the least amount of time wrestling with language syntax or other aspects of software implementation.

Finally, since a lot of what we do in this course amounts to software experimentation, it's important to have a compiler and associated environment that compiles quickly and with no fuss. In my opinion, by far the most significant time measure in software development is the speed of the edit/compile/test cycle. In this department, Turbo Pascal is king. The compilation speed is blazing fast, and continues to get faster in every release (how do they keep doing that?). Despite vast improvements in C compilation speed over the years, even Borland's fastest C/C++ compiler is still no match for Turbo Pascal. Further, the editor built into their IDE, the make facility, and even their superb smart linker, all complement each other to produce a wonderful environment for quick turnaround.

For all of these reasons, I intend to stick with Pascal for the duration of this series. We'll be using Turbo Pascal for Windows, one of the compilers provided with Borland Pascal with Objects, version 7.0. If you don't have this compiler, don't worry ... nothing we do here is going to count on your having the latest version. Using the Windows version helps me a lot, by allowing me to use the Clipboard to copy code from the compiler's editor into these documents. It should also help you at least as much, copying the code in the other direction.



I've thought long and hard about whether or not to introduce objects to our discussion. I'm a big advocate of object-oriented methods for all uses, and such methods definitely have their place in compiler technology. In fact, I've written papers on just this subject (Refs. 1-3). But the architecture of a compiler which is based on object-oriented approaches is vastly different than that of the more classical compiler we've been building. Again, it would seem to be entirely too much to change these horses in midstream. As I said, programming styles change. Who knows, it may be another six years before we finish this thing, and if we keep changing the code every time programming style changes, we may NEVER finish.

So for now, at least, I've determined to continue the classical style in Pascal, though we might indeed discuss objects and object orientation as we go. Likewise, the target machine will remain the Motorola 68000 family. Of all the decisions to be made here, this one has been the easiest. Though I know that many of you would like to see code for the 80x86, the 68000 has become, if anything, even more popular as a platform for embedded systems, and it's for that application that this whole effort began in the first place. Compiling for the PC/MS-DOS platform, we'd have to deal with all the issues of DOS system calls, DOS linker formats, the PC file system and hardware, and all those other complications of a DOS environment. An embedded system, on the other hand, must run standalone, and it's for this kind of application, as an alternative to assembly language, that I've always imagined that a language like KISS would thrive. Anyway, who wants to deal with the 80x86 architecture if they don't have to?

The one feature of Turbo Pascal that I'm going to be making heavy use of is units. In the past, we've had to make compromises between code size and complexity, and program functionality. A lot of our work has been in the nature of computer experimentation, looking at only one aspect of compiler technology at a time. We did this to avoid having to carry around large programs, just to investigate simple concepts. In the process, we've re-invented the wheel and re-programmed the same functions more times than I'd like to count. Turbo units provide a wonderful way to get functionality and simplicity at the same time: You write reusable code, and invoke it with a single line. Your test program stays small, but it can do powerful things.

One feature of Turbo Pascal units is their initialization block. As with an Ada package, any code in the main begin-end block of a unit gets executed as the program is initialized. As you'll see later, this sometimes gives us neat simplifications in the code. Our procedure Init, which has been with us since Installment 1, goes away entirely when we use units. The various routines in the Cradle, another key feature of our approach, will get distributed among the units.




The concept of units, of course, is no different than that of C modules. However, in C (and C++), the interface between modules comes via preprocessor include statements and header files. As someone who's had to read a lot of other people's C programs, I've always found this rather bewildering. It always seems that whatever data structure you'd like to know about is in some other file. Turbo units are simpler for the very reason that they're criticized by some: The function interfaces and their implementation are included in the same file. While this organization may create problems with code security, it also reduces the number of files by half, which isn't half bad. Linking of the object files is also easy, because the Turbo compiler takes care of it without the need for make files or other mechanisms.



STARTING OVER?

Four years ago, in Installment 14, I promised you that our days of re-inventing the wheel, and recoding the same software over and over for each lesson, were over, and that from now on we'd stick to more complete programs that we would simply add new features to. I still intend to keep that promise; that's one of the main purposes for using units. However, because of the long time since Installment 14, it's natural to want to at least do some review, and anyhow, we're going to have to make rather sweeping changes in the code to make the transition to units. Besides, frankly, after all this time I can't remember all the neat ideas I had in my head four years ago. The best way for me to recall them is to retrace some of the steps we took to arrive at Installment 14. So I hope you'll be understanding and bear with me as we go back to our roots, in a sense, and rebuild the core of the software, distributing the routines among the various units, and bootstrapping ourselves back up to the point we were at lo, those many moons ago. As has always been the case, you're going to get to see me make all the mistakes and execute changes of direction, in real time. Please bear with me ... we'll start getting to the new stuff before you know it.

Since we're going to be using multiple modules in our new approach, we have to address the issue of file management. If you've followed all the other sections of this tutorial, you know that, as our programs evolve, we're going to be replacing older, more simple-minded units with more capable ones. This brings us to an issue of version control. There will almost certainly be times when we will overlay a simple file (unit), but later wish we had the simple one again. A case in point is embodied in our predilection for using single-character variable names, keywords, etc., to test concepts without getting bogged down in the details of a lexical scanner. Thanks to the use of units, we will be doing much less of this in the future. Still, I not only suspect, but am certain that we will need to save some older versions of files, for special purposes, even though they've been replaced by newer, more capable ones.

To deal with this problem, I suggest that you create different directories, with different versions of the units as needed. If we do this properly, the code in each directory will remain self-consistent. I've tentatively created four directories: SINGLE (for single-character experimentation), MULTI (for, of course, multi-character versions), TINY, and KISS.

Enough said about philosophy and details. Let's get on with the resurrection of the software.




THE INPUT UNIT

A key concept that we've used since Day 1 has been the idea of an input stream with one lookahead character. All the parsing routines examine this character, without changing it, to decide what they should do next. (Compare this approach with the C/Unix approach using getchar and ungetc, and I think you'll agree that our approach is simpler.) We'll begin our hike into the future by translating this concept into our new, unit-based organization. The first unit, appropriately called Input, is shown below:

{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface

var Look: char;              { Lookahead character }

procedure GetChar;           { Read new character  }

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Unit Initialization }

begin
   GetChar;
end.
{--------------------------------------------------------------}

As you can see, there's nothing very profound, and certainly nothing complicated, about this unit, since it consists of only a single procedure. But already, we can see how the use of units gives us advantages. Note the executable code in the initialization block. This code "primes the pump" of the input stream for us, something we've always had to do before, by inserting the call to GetChar in line, or in procedure Init. This time, the call happens without any special reference to it on our part, except within the unit itself. As I predicted earlier, this mechanism is going to make our lives much simpler as we proceed. I consider it to be one of the most useful features of Turbo Pascal, and I lean on it heavily.

Copy this unit into your compiler's IDE, and compile it. To test the software, of course, we always need a main program. I used the following, really complex test program, which we'll later evolve into the Main for our compiler:

{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;

begin
   WriteLn(Look);
end.
{--------------------------------------------------------------}




Note the use of the Borland-supplied unit, WinCRT. This unit is necessary if you intend to use the standard Pascal I/O routines, Read, ReadLn, Write, and WriteLn, which of course we intend to do. If you forget to include this unit in the "uses" clause, you will get a really bizarre and indecipherable error message at run time.

Note also that we can access the lookahead character, even though it's not declared in the main program. All variables declared within the interface section of a unit are global, but they're hidden from prying eyes; to that extent, we get a modicum of information hiding. Of course, if we were writing in an object-oriented fashion, we should not allow outside modules to access the unit's internal variables. But, although Turbo units have a lot in common with objects, we're not doing object-oriented design or code here, so our use of Look is appropriate.

Go ahead and save the test program as Main.pas. To make life easier as we get more and more files, you might want to take this opportunity to declare this file as the compiler's Primary file. That way, you can execute the program from any file. Otherwise, if you press Ctrl-F9 to compile and run from one of the units, you'll get an error message. You set the primary file using the main submenu, "Compile," in the Turbo IDE.

I hasten to point out, as I've done before, that the function of unit Input is, and always has been, considered to be a dummy version of the real thing. In a production version of a compiler, the input stream will, of course, come from a file rather than from the keyboard. And it will almost certainly include line buffering, at the very least, and more likely, a rather large text buffer to support efficient disk I/O. The nice part about the unit approach is that, as with objects, we can modify the code in the unit to be as simple or as sophisticated as we like. As long as the interface, as embodied in the public procedures and the lookahead character, doesn't change, the rest of the program is totally unaffected. And since units are compiled, rather than merely included, the time required to link with them is virtually nil. Again, the result is that we can get all the benefits of sophisticated implementations, without having to carry the code around as so much baggage.
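To make the point about swapping implementations concrete, here is a minimal sketch of a line-buffered variant of the unit that keeps exactly the same interface (Look and GetChar). It is not part of the tutorial's unit set: the private names Line and Next, and the choice to hand back a single carriage return at the end of each line, are assumptions made for this example, and end-of-file handling is left out.

```pascal
unit Input;

interface

var Look: char;              { Lookahead character }

procedure GetChar;           { Read new character  }

implementation

const CR = ^M;

var
  Line: string;              { hypothetical: the current input line  }
  Next: integer;             { hypothetical: index of next character }

{ Deliver the next character, refilling the buffer when needed }
procedure GetChar;
begin
  if Next > Length(Line) + 1 then begin   { line and its CR are used up }
    ReadLn(Line);
    Next := 1;
  end;
  if Next <= Length(Line) then
    Look := Line[Next]
  else
    Look := CR;                           { stands in for the end of line }
  Inc(Next);
end;

{ Unit Initialization }
begin
  Line := '';
  Next := 2;                 { forces a ReadLn on the first call }
  GetChar;
end.
```

Because nothing in the interface section differs from the simple version, a program that uses Input only has to be recompiled to pick up the new behavior.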

In later installments, I intend to provide a full-blown IDE for the KISS compiler, using a true Windows application generated by Borland's OWL applications framework. For now, though, we'll obey my #1 rule to live by: Keep It Simple.



THE OUTPUT UNIT

Of course, every decent program should have output, and ours is no exception. Our output routines included the Emit functions. The code for the corresponding output unit is shown next:

{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface

procedure Emit(s: string);       { Emit an instruction      }
procedure EmitLn(s: string);     { Emit an instruction line }

{--------------------------------------------------------------}
implementation

const TAB = ^I;

{--------------------------------------------------------------}
{ Emit an Instruction }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

end.
{--------------------------------------------------------------}

(Notice that this unit has no initialization clause, so it needs no begin-block.)

Test this unit with the following main program:

{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output;

begin
   WriteLn('MAIN:');
   EmitLn('Hello, world!');
end.
{--------------------------------------------------------------}

Did you see anything that surprised you? You may have been surprised to see that you needed to type something, even though the main program requires no input. That's because of the initialization in unit Input, which still requires something to put into the lookahead character. Sorry, there's no way out of that box, or rather, we don't _WANT_ to get out. Except for simple test cases such as this, we will always want a valid lookahead character, so the right thing to do about this "problem" is ... nothing.

Perhaps more surprisingly, notice that the TAB character had no effect; our line of "instructions" begins at column 1, same as the fake label. That's right: WinCRT doesn't support tabs. We have a problem.




There are a few ways we can deal with this problem. The one thing we can't do is to simply ignore it. Every assembler I've ever used reserves column 1 for labels, and will rebel to see instructions starting there. So, at the very least, we must space the instructions over one column to keep the assembler happy. That's easy enough to do: Simply change, in procedure Emit, the line:

Write(TAB, s);

to:

Write(' ', s);

I must admit that I've wrestled with this problem before, and find myself changing my mind as often as a chameleon changes color. For the purposes we're going to be using, 99% of which will be examining the output code as it's displayed on a CRT, it would be nice to see neatly blocked out "object" code. The line:

SUB1: MOVE #4,D0

just plain looks neater than the different, but functionally identical code,

SUB1:

MOVE #4,D0

In test versions of my code, I included a more sophisticated version of the procedure PostLabel, that avoids having labels on separate lines, but rather defers the printing of a label so it can end up on the same line as the associated instruction. As recently as an hour ago, my version of unit Output provided full support for tabs, using an internal column count variable and software to manage it. I had, if I do say so myself, some rather elegant code to support the tab mechanism, with a minimum of code bloat. It was awfully tempting to show you the "prettyprint" version, if for no other reason than to show off the elegance.



Nevertheless, the code of the "elegant" version was considerably more complex and larger. Since then, I've had second thoughts. In spite of our desire to see pretty output, the inescapable fact is that the two versions of the code fragment shown above are functionally identical; the assembler, which is the ultimate destination of the code, couldn't care less which version it gets. But the prettier one not only takes more code to generate, it also creates a larger output file, with many more space characters than the minimum needed, so it uses more disk space and takes longer to assemble. When you look at it that way, it's not very hard to decide which approach to use, is it?

What finally clinched the issue for me was a reminder to consider my own first commandment: KISS. Although I was pretty proud of all my elegant little tricks to implement tabbing, I had to remind myself that, to paraphrase Senator Barry Goldwater, elegance in the pursuit of complexity is no virtue. Another wise man once wrote, "Any idiot can design a Rolls-Royce. It takes a genius to design a VW." So the elegant, tab-friendly version of Output is history, and what you see is the simple, compact, VW version.




THE ERROR UNIT

Our next set of routines are those that handle errors. To refresh your memory, we take the approach, pioneered by Borland in Turbo Pascal, of halting on the first error. Not only does this greatly simplify our code, by completely avoiding the sticky issue of error recovery, but it also makes much more sense, in my opinion, in an interactive environment. I know this may be an extreme position, but I consider the practice of reporting all errors in a program to be an anachronism, a holdover from the days of batch processing. It's time to scuttle the practice. So there.

In our original Cradle, we had two error-handling procedures: Error, which didn't halt, and Abort, which did. But I don't think we ever found a use for the procedure that didn't halt, so in the new, lean and mean unit Errors, shown next, procedure Error takes the place of Abort.

{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface

procedure Error(s: string);
procedure Expected(s: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Write error Message and Halt }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
   Halt;
end;

{--------------------------------------------------------------}
{ Write "<something> Expected" }

procedure Expected(s: string);
begin
   Error(s + ' Expected');
end;

end.
{--------------------------------------------------------------}




As usual, here's a test program:

{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;

begin
   Expected('Integer');
end.
{--------------------------------------------------------------}

Have you noticed that the "uses" line in our main program keeps getting longer? That's OK. In the final version, the main program will only call procedures in our parser, so its use clause will only have a couple of entries. But for now, it's probably best to include all the units so we can test procedures in them.



SCANNING AND PARSING

The classical compiler architecture consists of separate modules for the lexical scanner, which supplies tokens in the language, and the parser, which tries to make sense of the tokens as syntax elements. If you can still remember what we did in earlier installments, you'll recall that we didn't do things that way. Because we're using a predictive parser, we can almost always tell what language element is coming next, just by examining the lookahead character. Therefore, we found no need to prefetch tokens, as a scanner would do.

But, even though there is no functional procedure called "Scanner," it still makes sense to separate the scanning functions from the parsing functions. So I've created two more units called, amazingly enough, Scanner and Parser. The Scanner unit contains all of the routines known as recognizers. Some of these, such as IsAlpha, are pure boolean routines which operate on the lookahead character only. The other routines are those which collect tokens, such as identifiers and numeric constants. The Parser unit will contain all of the routines making up the recursive-descent parser. The general rule should be that unit Parser contains all of the information that is language-specific; in other words, the syntax of the language should be wholly contained in Parser. In an ideal world, this rule should be true to the extent that we can change the compiler to compile a different language, merely by replacing the single unit, Parser.

In practice, things are almost never this pure. There's always a small amount of "leakage" of syntax rules into the scanner as well. For example, the rules concerning what makes up a legal identifier or constant may vary from language to language. In some languages, the rules concerning comments permit them to be filtered by the scanner, while in others they do not. So in practice, both units are likely to end up having language-dependent components, but the changes required to the scanner should be relatively trivial.




Now, recall that we've used two versions of the scanner routines: One that handled only single-character tokens, which we used for a number of our tests, and another that provided full support for multi-character tokens. Now that we have our software separated into units, I don't anticipate getting much use out of the single-character version, but it doesn't cost us much to provide for both. I've created two versions of the Scanner unit. The first one, called Scanner1, contains the single-character version of the recognizers:

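{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlnum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: char;
function GetNumber: char;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlnum(c: char): boolean;
begin
   IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+','-'];
end;

{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*','/'];
end;

{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNumber: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNumber := Look;
   GetChar;
end;

end.
{--------------------------------------------------------------}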



The following code fragment of the main program provides a good test of the scanner. For brevity, I'll only include the executable code here; the rest remains the same. Don't forget, though, to add the name Scanner1 to the "uses" clause.

   Write(GetName);
   Match('=');
   Write(GetNumber);
   Match('+');
   WriteLn(GetName);

This code will recognize all sentences of the form:

x=0+y

where x and y can be any single-character variable names, and 0 any digit. The code should reject all other sentences, and give a meaningful error message. If it did, you're in good shape and we can proceed.
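For readers who want to check the expected behavior without setting up the units, the test can be folded into a single file. This is only a sketch: to make the run repeatable, input comes from the typed constant Source instead of the keyboard, and Source, Next and the inlined copies of the scanner routines are arrangements made for this example.

```pascal
program ScanTest;
{ Single-file sketch of the Scanner1 test; input is a fixed string. }
const
  Source: string = 'x=0+y';     { hypothetical test sentence }
var
  Look: char;                   { lookahead character }
  Next: integer;                { index of next unread character }

procedure GetChar;
begin
  if Next <= Length(Source) then Look := Source[Next]
  else Look := ^M;              { pretend the line ended }
  Inc(Next);
end;

procedure Expected(s: string);
begin
  WriteLn;
  WriteLn('Error: ', s, ' Expected.');
  Halt;
end;

procedure Match(x: char);
begin
  if Look = x then GetChar
  else Expected('''' + x + '''');
end;

function GetName: char;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
  GetName := UpCase(Look);
  GetChar;
end;

function GetNumber: char;
begin
  if not (Look in ['0'..'9']) then Expected('Integer');
  GetNumber := Look;
  GetChar;
end;

begin
  Next := 1;
  GetChar;                      { prime the lookahead, as unit Input does }
  Write(GetName);
  Match('=');
  Write(GetNumber);
  Match('+');
  WriteLn(GetName);             { the whole run prints X0Y }
end.
```

Changing Source to something like 'x=a+y' should print the X, then stop with the "Integer Expected" message, which is the rejection behavior described above.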




THE SCANNER UNIT

The next, and by far the most important, version of the scanner is the one that handles the multi-character tokens that all real languages must have. Only the two functions, GetName and GetNumber, change between the two units, but just to be sure there are no mistakes, I've reproduced the entire unit here. This is unit Scanner:

{--------------------------------------------------------------}

unit Scanner;

{--------------------------------------------------------------}

interface

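uses Input, Errors;

function IsAlpha(c: char): boolean;

function IsDigit(c: char): boolean;

function IsAlnum(c: char): boolean;

function IsAddop(c: char): boolean;

function IsMulop(c: char): boolean;

procedure Match(x: char);

function GetName: string;

function GetNumber: string;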



{--------------------------------------------------------------}

implementation

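{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;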

{--------------------------------------------------------------}

{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;

begin

IsDigit := c in ['0'..'9'];

end;




{--------------------------------------------------------------}


function IsAlnum(c: char): boolean;

begin

IsAlnum := IsAlpha(c) or IsDigit(c);

end;

{--------------------------------------------------------------}

{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+','-'];

end;



{--------------------------------------------------------------}

{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;

begin

IsMulop := c in ['*','/'];

end;

{--------------------------------------------------------------}

{ Match One Character }

procedure Match(x: char);

begin

if Look = x then GetChar

else Expected('''' + x + '''');

end;




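{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: string;

var n: string;

begin

n := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlnum(Look) do begin

n := n + UpCase(Look);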

GetChar;

end;

GetName := n;

end;



{--------------------------------------------------------------}

{ Get a Number }

function GetNumber: string;

var n: string;

begin

n := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

n := n + Look;

GetChar;

end;

GetNumber := n;

end;

end.

{--------------------------------------------------------------}

The same test program will test this scanner, also. Simply change the "uses" clause to use Scanner instead of Scanner1. Now you should be able to type multi-character names and numbers.
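For readers who don't have the Input and Errors units at hand, here is a minimal, self-contained sketch of the same test. It rests on a few assumptions of my own: the source line lives in a string (Source and Cursor are hypothetical stand-ins for the keyboard input that the Input unit provides), and Expected is a simplified copy of the error routine. Treat it as an illustration of how the scanner behaves, not as a replacement for the units.

```pascal
program ScanDemo;

{ Hypothetical stand-ins for the Input unit: the source line is held
  in a string instead of being read from the keyboard. }
var
  Source: string;
  Cursor: integer;
  Look: char;

procedure GetChar;
begin
  if Cursor <= Length(Source) then begin
    Look := Source[Cursor];
    Cursor := Cursor + 1;
  end
  else
    Look := #0;              { sentinel: end of input }
end;

{ Simplified stand-in for the Errors unit }
procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function IsAlnum(c: char): boolean;
begin
  IsAlnum := IsAlpha(c) or IsDigit(c);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar
  else Expected('''' + x + '''');
end;

{ Collect a multi-character name, mapped to upper case }
function GetName: string;
var n: string;
begin
  n := '';
  if not IsAlpha(Look) then Expected('Name');
  while IsAlnum(Look) do begin
    n := n + UpCase(Look);
    GetChar;
  end;
  GetName := n;
end;

{ Collect a multi-digit number, kept as a string }
function GetNumber: string;
var n: string;
begin
  n := '';
  if not IsDigit(Look) then Expected('Integer');
  while IsDigit(Look) do begin
    n := n + Look;
    GetChar;
  end;
  GetNumber := n;
end;

begin
  Source := 'total=125+rate';
  Cursor := 1;
  GetChar;                   { prime the lookahead character }
  WriteLn(GetName);
  Match('=');
  WriteLn(GetNumber);
  Match('+');
  WriteLn(GetName);
end.
```

With the sample line shown, the program should print TOTAL, 125 and RATE on separate lines. Note that the names come back in upper case, while the number is returned exactly as typed.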




DECISIONS, DECISIONS

In spite of the relative simplicity of both scanners, a lot of thought has gone into them, and a lot of decisions had to be made. I'd like to share those thoughts with you now so you can make your own educated decision, appropriate for your application.

First, note that both versions of GetName translate the input characters to upper case. Obviously, there was a design decision made here, and this is one of those cases where the language syntax splatters over into the scanner. In the C language, the case of characters in identifiers is significant. For such a language, we obviously can't map the characters to upper case. The design I'm using assumes a language like Pascal, where the case of characters doesn't matter. For such languages, it's easier to go ahead and map all identifiers to upper case in the scanner, so we don't have to worry later on when we're comparing strings for equality.

We could have even gone a step further, and mapped the characters to upper case right as they come in, in GetChar. This approach works too, and I've used it in the past, but it's too confining. Specifically, it will also map characters that may be part of quoted strings, which is not a good idea. So if you're going to map to upper case at all, GetName is the proper place to do it.

Note that the function GetNumber in this scanner returns a string, just as GetName does. This is another one of those things I've oscillated about almost daily, and the last swing was all of ten minutes ago. The alternative approach, and one I've used many times in past installments, returns an integer result.

Both approaches have their good points. Since we're fetching a number, the approach that immediately comes to mind is to return it as an integer. But bear in mind that the eventual use of the number will be in a write statement that goes back to the outside world. Someone -- either us or the code hidden inside the write statement -- is going to have to convert the number back to a string again. Turbo Pascal includes such string conversion routines, but why use them if we don't have to? Why convert a number from string to integer form, only to convert it right back again in the code generator, only a few statements later?



Furthermore, as you'll soon see, we're going to need a temporary storage spot for the value of the token we've fetched. If we treat the number in its string form, we can store the value of either a variable or a number in the same string. Otherwise, we'll have to create a second, integer variable.

On the other hand, we'll find that carrying the number as a string virtually eliminates any chance of optimization later on. As we get to the point where we are beginning to concern ourselves with code generation, we'll encounter cases in which we're doing arithmetic on constants. For such cases, it's really foolish to generate code that performs the constant arithmetic at run time. Far better to let the parser do the arithmetic at compile time, and merely code the result. To do that, we'll wish we had the constants stored as integers rather than strings.
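To make the trade-off concrete, here is a small sketch of what compile-time arithmetic on constants might look like if the constants are carried as strings. FoldAdd is a hypothetical helper of my own, not part of any unit in this series; it leans on Turbo Pascal's standard Val and Str routines to go from string to integer and back, and it assumes both operands have already been validated as digit strings (which is what GetNumber guarantees).

```pascal
program FoldDemo;

{ Fold "a + b" at compile time when both operands are numeric strings.
  FoldAdd is a hypothetical helper, shown for illustration only. }
function FoldAdd(a, b: string): string;
var
  x, y: longint;
  code: integer;             { error position from Val; ignored here }
  s: string;
begin
  Val(a, x, code);           { string -> integer }
  Val(b, y, code);
  Str(x + y, s);             { integer -> string for the emitted operand }
  FoldAdd := s;
end;

begin
  { One instruction instead of a load, a push, a load and an add }
  WriteLn('MOVE #' + FoldAdd('2', '3') + ',D0');
end.
```

The two conversions on the way in and the one on the way out are exactly the overhead the string representation forces on us. With integer constants, the fold would be a single addition.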

What finally swung me back over to the string approach was an aggressive application of the KISS test, plus reminding myself that we've studiously avoided issues of code efficiency. One of the things that makes our simple-minded parsing work, without the complexities of a "real" compiler, is that we've said up front that we aren't concerned about code efficiency. That gives us a lot of freedom to do things the easy way rather than the efficient one, and it's a freedom we must be careful not to abandon voluntarily, in spite of the urges for efficiency shouting in our ear. In addition to being a big believer in the KISS philosophy, I'm also an advocate of "lazy programming," which in this context means, don't program anything until you need it. As P.J. Plauger says, "Never put off until tomorrow what you can put off indefinitely." Over the years, much code has been written to provide for eventualities that never happened. I've learned that lesson myself, from bitter experience. So the bottom line is: We won't convert to an integer here because we don't need to. It's as simple as that.




For those of you who still think we may need the integer version (and indeed we may), here it is:

{--------------------------------------------------------------}

{ Get a Number (integer version) }

function GetNumber: longint;

var n: longint;

begin

n := 0;

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

n := 10 * n + (Ord(Look) - Ord('0'));

GetChar;

end;

GetNumber := n;

end;

{--------------------------------------------------------------}

You might file this one away, as I intend to, for a rainy day.



PARSING

At this point, we have distributed all the routines that made up our Cradle into units that we can draw upon as we need them. Obviously, they will evolve further as we continue the process of bootstrapping ourselves up again, but for the most part their content, and certainly the architecture that they imply, is defined. What remains is to embody the language syntax into the parser unit. We won't do much of that in this installment, but I do want to do a little, just to leave us with the good feeling that we still know what we're doing. So before we go, let's generate just enough of a parser to process single factors in an expression. In the process, we'll also, by necessity, find we have created a code generator unit, as well.

Remember the very first installment of this series? We read an integer value, say n, and generated the code to load it into the D0 register via an immediate move:

MOVE #n,D0

Shortly afterwards, we repeated the process for a variable,

MOVE X(PC),D0




and then for a factor that could be either constant or variable. For old times' sake, let's revisit that process. Define the following new unit:

{--------------------------------------------------------------}

unit Parser;

{--------------------------------------------------------------}

interface

uses Input, Scanner, Errors, CodeGen;

procedure Factor;

{--------------------------------------------------------------}

implementation

{--------------------------------------------------------------}


procedure Factor;

begin

LoadConstant(GetNumber);

end;

end.

{--------------------------------------------------------------}



As you can see, this unit calls a procedure, LoadConstant, which actually effects the output of the assembly-language code. The unit also uses a new unit, CodeGen. This step represents the last major change in our architecture, from earlier installments: The removal of the machine-dependent code to a separate unit. If I have my way, there will not be a single line of code, outside of CodeGen, that betrays the fact that we're targeting the 68000 CPU. And this is one place I think that having my way is quite feasible.

For those of you who wish I were using the 80x86 architecture (or any other one) instead of the 68000, here's your answer: Merely replace CodeGen with one suitable for your CPU of choice.
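As a rough illustration of what such a replacement might look like, here is a sketch of the two load procedures retargeted to the 80x86. Everything specific to that CPU is my assumption, not something taken from this series: AX as the primary register, and the bracketed memory operand written in NASM style. EmitLn is a stand-in for the one in the Output unit, so that the sketch compiles on its own.

```pascal
program CodeGen86Demo;

{ Stand-in for the Output unit's EmitLn: a tab, then the instruction }
procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ Load the primary register with a constant.
  AX as the primary register is an assumed choice. }
procedure LoadConstant(n: string);
begin
  EmitLn('MOV AX,' + n);
end;

{ Load the primary register from a variable.
  The [Name] operand form is NASM-style, also an assumption. }
procedure LoadVariable(Name: string);
begin
  EmitLn('MOV AX,[' + Name + ']');
end;

begin
  LoadConstant('42');
  LoadVariable('X');
end.
```

The procedure names and parameter lists are unchanged, which is the whole point: the parser never needs to know which CPU is on the other side of the interface.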

So far, our code generator has only one procedure in it. Here's the unit:

{--------------------------------------------------------------}

unit CodeGen;

{--------------------------------------------------------------}

interface

uses Output;

procedure LoadConstant(n: string);

{--------------------------------------------------------------}

implementation




{--------------------------------------------------------------}

{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);

begin

EmitLn('MOVE #' + n + ',D0' );

end;

end.

{--------------------------------------------------------------}

Copy and compile this unit, and execute the following main program:

{--------------------------------------------------------------}

program Main;

uses WinCRT, Input, Output, Errors, Scanner, Parser;

begin

Factor;

end.

{--------------------------------------------------------------}

There it is, the generated code, just as we hoped it would be.



Now, I hope you can begin to see the advantage of the unit-based architecture of our new design. Here we have a main program that's all of five lines long. That's all of the program we need to see, unless we choose to see more. And yet, all those units are sitting there, patiently waiting to serve us. We can have our cake and eat it too, in that we have simple and short code, but powerful allies. What remains to be done is to flesh out the units to match the capabilities of earlier installments. We'll do that in the next installment, but before I close, let's finish out the parsing of a factor, just to satisfy ourselves that we still know how. The final version of CodeGen includes the new procedure, LoadVariable:

{--------------------------------------------------------------}

unit CodeGen;

{--------------------------------------------------------------}

interface

uses Output;

procedure LoadConstant(n: string);

procedure LoadVariable(Name: string);

{--------------------------------------------------------------}

implementation




{--------------------------------------------------------------}

{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);

begin

EmitLn('MOVE #' + n + ',D0' );

end;

{--------------------------------------------------------------}


procedure LoadVariable(Name: string);

begin

EmitLn('MOVE ' + Name + '(PC),D0');

end;

end.

{--------------------------------------------------------------}



The parser unit itself doesn't change, but we have a more complex version of procedure Factor:

{--------------------------------------------------------------}


procedure Factor;

begin

if IsDigit(Look) then

LoadConstant(GetNumber)

else if IsAlpha(Look) then

LoadVariable(GetName)

else

Error('Unrecognized character ' + Look);

end;

{--------------------------------------------------------------}

Now, without altering the main program, you should find that our program will process either a variable or a constant factor. At this point, our architecture is almost complete; we have units to do all the dirty work, and enough code in the parser and code generator to demonstrate that everything works. What remains is to flesh out the units we've defined, particularly the parser and code generator, to support the more complex syntax elements that make up a real language. Since we've done this many times before in earlier installments, it shouldn't take long to get us back to where we were before the long hiatus. We'll continue this process in Installment 16, coming soon. See you then.






Part 16 - Unit Construction


INTRODUCTION

This series of tutorials promises to be perhaps one of the longest-running mini-series in history, rivalled only by the delay in Volume IV of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990 when the "cares of this world," changes in priorities and interests, and the need to make a living seemed to stall it out after Installment 14. Those of you with loads of patience were finally rewarded, in the spring of last year, with the long-awaited Installment 15. In it, I began to try to steer the series back on track, and in the process, to make it easier to continue on to the goal, which is to provide you with not only enough understanding of the difficult subject of compiler theory, but also enough tools, in the form of canned subroutines and concepts, so that you would be able to continue on your own and become proficient enough to build your own parsers and translators.

Because of that long hiatus, I thought it appropriate to go back and review the concepts we have covered so far, and to redo some of the software, as well. In the past, we've never concerned ourselves much with the development of production-quality software tools ... after all, I was trying to teach (and learn) concepts, not production practice. To do that, I tended to give you, not complete compilers or parsers, but only those snippets of code that illustrated the particular point we were considering at the moment.

I still believe that's a good way to learn any subject; no one wants to have to make changes to 100,000 line programs just to try out a new idea. But the idea of just dealing with code snippets, rather than complete programs, also has its drawbacks in that we often seemed to be writing the same code fragments over and over. Although repetition has been thoroughly proven to be a good way to learn new ideas, it's also true that one can have too much of a good thing. By the time I had completed Installment 14 I seemed to have reached the limits of my abilities to juggle multiple files and multiple versions of the same software functions. Who knows, perhaps that's one reason I seemed to have run out of gas at that point.

Fortunately, the later versions of Borland's Turbo Pascal allow us to have our cake and eat it too. By using their concept of separately compilable units, we can still write small subroutines and functions, and keep our main programs and test programs small and simple. But, once written, the code in the Pascal units will always be there for us to use, and linking them in is totally painless and transparent.




Since, by now, most of you are programming in either C or C++, I know what you're thinking: Borland, with their Turbo Pascal (TP), certainly didn't invent the concept of separately compilable modules. And of course you're right. But if you've not used TP lately, or ever, you may not realize just how painless the whole process is. Even in C or C++, you still have to build a make file, either manually or by telling the compiler how to do so. You must also list, using "extern" statements or header files, the functions you want to import. In TP, you don't even have to do that. You need only name the units you wish to use, and all of their procedures automatically become available.

It's not my intention to get into a language-war debate here, so I won't pursue the subject any further. Even I no longer use Pascal on my job ... I use C at work and C++ for my articles in Embedded Systems Programming and other magazines. Believe me, when I set out to resurrect this series, I thought long and hard about switching both languages and target systems to the ones that we're all using these days, C/C++ and PC architecture, and possibly object-oriented methods as well. In the end, I felt it would cause more confusion than the hiatus itself has. And after all, Pascal still remains one of the best possible languages for teaching, not to mention production programming. Finally, TP still compiles at the speed of light, much faster than competing C/C++ compilers. And Borland's smart linker, used in TP but not in their C++ products, is second to none. Aside from being much faster than Microsoft-compatible linkers, the Borland smart linker will cull unused procedures and data items, even to the extent of trimming them out of defined objects if they're not needed. For one of the few times in our lives, we don't have to compromise between completeness and efficiency. When we're writing a TP unit, we can make it as complete as we like, including any member functions and data items we may think we will ever need, confident that doing so will not create unwanted bloat in the compiled and linked executable.

The point, really, is simply this: By using TP's unit mechanism, we can have all the advantages and convenience of writing small, seemingly stand-alone test programs, without having to constantly rewrite the support functions that we need. Once written, the TP units sit there, quietly waiting to do their duty and give us the support we need, when we need it.



Using this principle, in Installment 15 I set out to minimize our tendency to re-invent the wheel by organizing our code into separate Turbo Pascal units, each containing different parts of the compiler. We ended up with the following units:

*Input

*Output

*Errors

*Scanner

*Parser

*CodeGen

Each of these units serves a different function, and encapsulates specific areas of functionality. The Input and Output units, as their names imply, provide character stream I/O and the all-important lookahead character upon which our predictive parser is based. The Errors unit, of course, provides standard error handling. The Scanner unit contains all of our boolean functions such as IsAlpha, and the routines GetName and GetNumber, which process multi-character tokens.

The two units we'll be working with the most, and the ones that most represent the personality of our compiler, are Parser and CodeGen. Theoretically, the Parser unit should encapsulate all aspects of the compiler that depend on the syntax of the compiled language (though, as we saw last time, a small amount of this syntax spills over into Scanner). Similarly, the code generator unit, CodeGen, contains all of the code dependent upon the target machine. In this installment, we'll be continuing with the development of the functions in these two all-important units.




JUST LIKE CLASSICAL?

Before we proceed, however, I think I should clarify the relationship between, and the functionality of these units. Those of you who are familiar with compiler theory as taught in universities will, of course, recognize the names, Scanner, Parser, and CodeGen, all of which are components of a classical compiler implementation. You may be thinking that I've abandoned my commitment to the KISS philosophy, and drifted towards a more conventional architecture than we once had. A closer look, however, should convince you that, while the names are similar, the functionalities are quite different.

Together, the scanner and parser of a classical implementation comprise the so-called "front end," and the code generator, the back end. The front end routines process the language-dependent, syntax-related aspects of the source language, while the code generator, or back end, deals with the target machine-dependent parts of the problem. In classical compilers, the two ends communicate via a file of instructions written in an intermediate language (IL).

Typically, a classical scanner is a single procedure, operating as a co-procedure with the parser. It "tokenizes" the source file, reading it character by character, recognizing language elements, translating them into tokens, and passing them along to the parser. You can think of the parser as an abstract machine, executing "op codes," which are the tokens. Similarly, the parser generates op codes of a second abstract machine, which mechanizes the IL. Typically, the IL file is written to disk by the parser, and read back again by the code generator.

Our organization is quite different. We have no lexical scanner, in the classical sense; our unit Scanner, though it has a similar name, is not a single procedure or co-procedure, but merely a set of separate subroutines which are called by the parser as needed.

Similarly, the classical code generator, the back end, is a translator in its own right, reading an IL "source" file, and emitting an object file. Our code generator doesn't work that way. In our compiler, there IS no intermediate language; every construct in the source language syntax is converted into assembly language as it is recognized by the parser. Like Scanner, the unit CodeGen consists of individual procedures which are called by the parser as needed.



This "code 'em as you find 'em" philosophy may not produce the world's most efficient code -- for example, we haven't provided (yet!) a convenient place for an optimizer to work its magic -- but it sure does simplify the compiler, doesn't it?

And that observation prompts me to reflect, once again, on how we have managed to reduce a compiler's functions to such comparatively simple terms. I've waxed eloquent on this subject in past installments, so I won't belabor the point too much here. However, because of the time that's elapsed since those last soliloquies, I hope you'll grant me just a little time to remind myself, as well as you, how we got here. We got here by applying several principles that writers of commercial compilers seldom have the luxury of using. These are:

o The KISS philosophy -- Never do things the hard way without a reason

o Lazy coding -- Never put off until tomorrow what you can put off forever (with credits to P.J. Plauger)

o Skepticism -- Stubborn refusal to do something just because that's the way it's always been done.

o Acceptance of inefficient code

o Rejection of arbitrary constraints

As I've reviewed the history of compiler construction, I've learned that virtually every production compiler in history has suffered from pre-imposed conditions that strongly influenced its design. The original FORTRAN compiler of John Backus, et al, had to compete with assembly language, and therefore was constrained to produce extremely efficient code. The IBM compilers for the minicomputers of the 70's had to run in the very small RAM memories then available -- as small as 4k. The early Ada compiler had to compile itself. Per Brinch Hansen decreed that his Pascal compiler developed for the IBM PC must execute in a 64k machine. Compilers developed in Computer Science courses had to compile the widest variety of languages, and therefore required LALR parsers.

In each of these cases, these preconceived constraints literally dominated the design of the compiler.




A good example is Brinch Hansen's compiler, described in his excellent book, "Brinch Hansen on Pascal Compilers" (highly recommended). Though his compiler is one of the most clear and un-obscure compiler implementations I've seen, that one decision, to compile large files in a small RAM, totally drives the design, and he ends up with not just one, but many intermediate files, together with the drivers to write and read them.

In time, the architectures resulting from such decisions have found their way into computer science lore as articles of faith. In this one man's opinion, it's time that they were re-examined critically. The conditions, environments, and requirements that led to classical architectures are not the same as the ones we have today. There's no reason to believe the solutions should be the same, either.

In this tutorial, we've followed the leads of such pioneers in the world of small compilers for PCs as Leor Zolman, Ron Cain, and James Hendrix, who didn't know enough compiler theory to know that they "couldn't do it that way." We have resolutely refused to accept arbitrary constraints, but rather have done whatever was easy. As a result, we have evolved an architecture that, while quite different from the classical one, gets the job done in very simple and straightforward fashion.

I'll end this philosophizing with an observation re the notion of an intermediate language. While I've noted before that we don't have one in our compiler, that's not exactly true; we _DO_ have one, or at least are evolving one, in the sense that we are defining code generation functions for the parser to call. In essence, every call to a code generation procedure can be thought of as an instruction in an intermediate language. Should we ever find it necessary to formalize an intermediate language, this is the way we would do it: emit codes from the parser, each representing a call to one of the code generator procedures, and then process each code by calling those procedures in a separate pass, implemented in a back end. Frankly, I don't see that we'll ever find a need for this approach, but there is the connection, if you choose to follow it, between the classical and the current approaches.
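For the curious, the formalization described above might be sketched as follows. This is only one possible arrangement, and every name in it (OpCode, ILInstr, EmitIL, the fixed-size array standing in for an IL file) is hypothetical. The front end records which code generation procedure it would have called; a separate loop, playing the role of the back end, then turns each recorded code into assembly language.

```pascal
program ILDemo;

type
  { One IL instruction: which code generation procedure to call,
    plus its string operand (unused by some op codes). }
  OpCode = (opLoadConstant, opLoadVariable, opNegate);
  ILInstr = record
    Op: OpCode;
    Arg: string;
  end;

var
  IL: array[1..16] of ILInstr;   { hypothetical in-memory "IL file" }
  Count, i: integer;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ Front end: record the call instead of making it.
  No overflow check on the array; this is only a sketch. }
procedure EmitIL(o: OpCode; a: string);
begin
  Count := Count + 1;
  IL[Count].Op := o;
  IL[Count].Arg := a;
end;

begin
  Count := 0;

  { What the parser would record for the input "-X" }
  EmitIL(opLoadVariable, 'X');
  EmitIL(opNegate, '');

  { Back end: a separate pass that turns each code into assembly }
  for i := 1 to Count do
    case IL[i].Op of
      opLoadConstant: EmitLn('MOVE #' + IL[i].Arg + ',D0');
      opLoadVariable: EmitLn('MOVE ' + IL[i].Arg + '(PC),D0');
      opNegate:       EmitLn('NEG D0');
    end;
end.
```

The output is the same assembly the direct calls would have produced. The only thing gained is a place between the two passes where the instruction list could be inspected or rewritten.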



FLESHING OUT THE PARSER

Though I promised you, somewhere along about Installment 14, that we'd never again write every single function from scratch, I ended up starting to do just that in Installment 15. One reason: that long hiatus between the two installments made a review seem eminently justified ... even imperative, both for you and for me. More importantly, the decision to collect the procedures into modules (units), forced us to look at each one yet again, whether we wanted to or not. And, finally and frankly, I've had some new ideas in the last four years that warranted a fresh look at some old friends. When I first began this series, I was frankly amazed, and pleased, to learn just how simple parsing routines can be made. But this last time around, I've surprised myself yet again, and been able to make them just that last little bit simpler, yet.

Still, because of this total rewrite of the parsing modules, I was only able to include so much in the last installment. Because of this, our hero, the parser, when last seen, was a shadow of his former self, consisting of only enough code to parse and process a factor consisting of either a variable or a constant. The main effort of this current installment will be to help flesh out the parser to its former glory. In the process, I hope you'll bear with me if we sometimes cover ground we've long since been over and dealt with.




First, let's take care of a problem that we've addressed before: Our current version of procedure Factor, as we left it in Installment 15, can't handle negative arguments. To fix that, we'll introduce the procedure SignedFactor:

{--------------------------------------------------------------}

{ Parse and Translate a Factor with Optional Sign }

procedure SignedFactor;

var Sign: char;

begin

Sign := Look;

if IsAddop(Look) then

GetChar;

Factor;

if Sign = '-' then Negate;

end;

{--------------------------------------------------------------}



Note that this procedure calls a new code generation routine, Negate:

{--------------------------------------------------------------}

{ Negate Primary }

procedure Negate;

begin

EmitLn('NEG D0');

end;

{--------------------------------------------------------------}

(Here, and elsewhere in this series, I'm only going to show you the new routines. I'm counting on you to put them into the proper unit, which you should normally have no trouble identifying. Don't forget to add the procedure's prototype to the interface section of the unit.)

In the main program, simply change the procedure called from Factor to SignedFactor, and give the code a test. Isn't it neat how the Turbo linker and make facility handle all the details?

Yes, I know, the code isn't very efficient. If we input a number, -3, the generated code is:

MOVE #3,D0

NEG D0

which is really, really dumb. We can do better, of course, by simply prepending a minus sign to the string passed to LoadConstant, but it adds a few lines of code to SignedFactor, and I'm applying the KISS philosophy very aggressively here. What's more, to tell the truth, I think I'm subconsciously enjoying generating "really, really dumb" code, so I can have the pleasure of watching it get dramatically better when we get into optimization methods.




Most of you have never heard of John Spray, so allow me to introduce him to you here. John's from New Zealand, and used to teach computer science at one of its universities. John wrote a compiler for the Motorola 6809, based on a delightful, Pascal-like language of his own design called "Whimsical." He later ported the compiler to the 68000, and for a while it was the only compiler I had for my homebrewed 68000 system.

For the record, one of my standard tests for any new compiler is to see how the compiler deals with a null program like:

program main;

begin

end.

My test is to measure the time required to compile and link, and the size of the object file generated. The undisputed _LOSER_ in the test is the DEC C compiler for the VAX, which took 60 seconds to compile, on a VAX 11/780, and generated a 50k object file. John's compiler is the undisputed, once, future, and forever king in the code size department. Given the null program, Whimsical generates precisely two bytes of code, implementing the one instruction,

RET

By setting a compiler option to generate an include file rather than a standalone program, John can even cut this size, from two bytes to zero! Sort of hard to beat a null object file, wouldn't you say?

Needless to say, I consider John to be something of an expert on code optimization, and I like what he has to say: "The best way to optimize is not to have to optimize at all, but to produce good code in the first place." Words to live by. When we get started on optimization, we'll follow John's advice, and our first step will not be to add a peephole optimizer or other after-the-fact device, but to improve the quality of the code emitted before optimization. So make a note of SignedFactor as a good first candidate for attention, and for now we'll leave it be.
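For those who can't wait, here is a rough sketch of the kind of improvement meant here: folding the minus sign into the literal so that no NEG is needed. It is deliberately limited to constants, and SignedConstant is a hypothetical name of my own. Source and Cursor are made-up stand-ins for the Input unit, and GetNumber is stripped of its error check to keep the sketch short.

```pascal
program SignDemo;

var
  Source: string;            { hypothetical stand-in for keyboard input }
  Cursor: integer;
  Look: char;

procedure GetChar;
begin
  if Cursor <= Length(Source) then begin
    Look := Source[Cursor];
    Cursor := Cursor + 1;
  end
  else
    Look := #0;              { sentinel: end of input }
end;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

{ Simplified: no error check for a missing digit }
function GetNumber: string;
var n: string;
begin
  n := '';
  while IsDigit(Look) do begin
    n := n + Look;
    GetChar;
  end;
  GetNumber := n;
end;

{ Variant of SignedFactor, constants only:
  the sign becomes part of the immediate operand }
procedure SignedConstant;
var Sign: char;
begin
  Sign := Look;
  if Look in ['+', '-'] then GetChar;
  if Sign = '-' then
    EmitLn('MOVE #-' + GetNumber + ',D0')
  else
    EmitLn('MOVE #' + GetNumber + ',D0');
end;

begin
  Source := '-3';
  Cursor := 1;
  GetChar;
  SignedConstant;
end.
```

For the input -3 this emits the single instruction MOVE #-3,D0. A negated variable would still need the NEG, so a full version would have to tell the two cases apart.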



TERMS AND EXPRESSIONS

I'm sure you know what's coming next: We must, yet again, create the rest of the procedures that implement the recursive-descent parsing of an expression. We all know that the hierarchy of procedures for arithmetic expressions is:

expression

term

factor

However, for now let's continue to do things one step at a time, and consider only expressions with additive terms in them. The code to implement expressions, including a possibly signed first term, is shown next:

{--------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

SignedFactor;

while IsAddop(Look) do

case Look of

'+': Add;

'-': Subtract;

end;

end;

{--------------------------------------------------------------}




This procedure calls two other procedures to process the operations:

{--------------------------------------------------------------}
{ Parse and Translate an Addition Operation }

procedure Add;
begin
   Match('+');
   Push;
   Factor;
   PopAdd;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }

procedure Subtract;
begin
   Match('-');
   Push;
   Factor;
   PopSub;
end;
{--------------------------------------------------------------}



The three procedures Push, PopAdd, and PopSub are new code generation routines. As the name implies, procedure Push generates code to push the primary register (D0, in our 68000 implementation) to the stack. PopAdd and PopSub pop the top of the stack again, and add it to, or subtract it from, the primary register. The code is shown next:

{--------------------------------------------------------------}
{ Push Primary to Stack }

procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;

{--------------------------------------------------------------}
{ Add TOS to Primary }

procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Subtract TOS from Primary }

procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   Negate;
end;
{--------------------------------------------------------------}

Add these routines to Parser and CodeGen, and change the main program to call Expression. Voila!

The next step, of course, is to add the capability for dealing with multiplicative terms. To that end, we'll add a procedure Term, and code generation procedures PopMul and PopDiv. These code generation procedures are shown next:

{--------------------------------------------------------------}
{ Multiply TOS by Primary }

procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Divide Primary by TOS }

procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;
{--------------------------------------------------------------}

I admit, the division routine is a little busy, but there's no help for it. Unfortunately, while the 68000 CPU allows a division using the top of stack (TOS), it wants the arguments in the wrong order, just as it does for subtraction. So our only recourse is to pop the stack to a scratch register (D7), perform the division there, and then move the result back to our primary register, D0. Note the use of signed multiply and divide operations. This follows an implied, but unstated, assumption, that all our variables will be signed 16-bit integers. This decision will come back to haunt us later, when we start looking at multiple data types, type conversions, etc.
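The operand-order problem is easy to see in a small model. The sketch below is in Python rather than the tutorial's Pascal, and the helper names `pop_sub` and `pop_div` are made up for illustration. It assumes that PopSub emits `SUB (SP)+,D0` followed by a negate, that PopDiv uses the D7 sequence described above, and that DIVS truncates toward zero.

```python
def pop_sub(stack, d0):
    """Model PopSub for 'first - second': first is on the stack, second is in D0."""
    d0 = d0 - stack.pop()      # SUB (SP)+,D0 leaves second - first in D0
    return -d0                 # Negate flips it to first - second


def pop_div(stack, d0):
    """Model PopDiv for 'first / second': first is on the stack, second is in D0."""
    d7 = stack.pop()           # MOVE (SP)+,D7 : dividend into the scratch register
    d7 = int(d7 / d0)          # DIVS D0,D7    : signed divide, truncated toward zero
    return d7                  # MOVE D7,D0    : result back to the primary register


print(pop_sub([7], 2))   # models 7 - 2
print(pop_div([7], 2))   # models 7 / 2
```

Subtraction can be repaired after the fact with a single negate; division has no such cheap repair, which is why it takes the detour through D7.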




Our procedure Term is virtually a clone of Expression, and looks like this:

{--------------------------------------------------------------}
{ Parse and Translate a Term }

procedure Term;
begin
   Factor;
   while IsMulop(Look) do
      case Look of
         '*': Multiply;
         '/': Divide;
      end;
end;
{--------------------------------------------------------------}



Our next step is to change some names. SignedFactor now becomes SignedTerm, and the calls to Factor in Expression, Add, Subtract and SignedTerm get changed to call Term:

{--------------------------------------------------------------}
{ Parse and Translate a Term with Optional Leading Sign }

procedure SignedTerm;
var Sign: char;
begin
   Sign := Look;
   if IsAddop(Look) then
      GetChar;
   Term;
   if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}

...

{--------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   SignedTerm;
   while IsAddop(Look) do
      case Look of
         '+': Add;
         '-': Subtract;
      end;
end;
{--------------------------------------------------------------}

If memory serves me correctly, we once had BOTH a procedure SignedFactor and a procedure SignedTerm. I had reasons for doing that at the time ... they had to do with the handling of Boolean algebra and, in particular, the Boolean "not" function. But certainly, for arithmetic operations, that duplication isn't necessary. In an expression like:

-x*y

it's very apparent that the sign goes with the whole TERM, x*y, and not just the factor x, and that's the way Expression is coded.

Test this new code by executing Main. It still calls Expression, so you should now be able to deal with expressions containing any of the four arithmetic operators.

Our last bit of business, as far as expressions go, is to modify procedure Factor to allow for parenthetical expressions. By using a recursive call to Expression, we can reduce the needed code to virtually nothing. Five lines added to Factor do the job:

{--------------------------------------------------------------}
{ Parse and Translate a Factor }

procedure Factor;
begin
   if Look ='(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsDigit(Look) then
      LoadConstant(GetNumber)
   else if IsAlpha(Look)then
      LoadVariable(GetName)
   else
      Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}

At this point, your "compiler" should be able to handle any legal expression you can throw at it. Better yet, it should reject all illegal ones!

ASSIGNMENTS

As long as we're this close, we might as well create the code to deal with an assignment statement. This code needs only to remember the name of the target variable where we are to store the result of an expression, call Expression, then store the number. The procedure is shown next:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
   Name := GetName;
   Match('=');
   Expression;
   StoreVariable(Name);
end;
{--------------------------------------------------------------}

The assignment calls for yet another code generation routine:



{--------------------------------------------------------------}
{ Store the Primary Register to a Variable }

procedure StoreVariable(Name: string);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}

Now, change the call in Main to call Assignment, and you should see a full assignment statement being processed correctly. Pretty neat, eh? And painless, too.

In the past, we've always tried to show BNF relations to define the syntax we're developing. I haven't done that here, and it's high time I did. Here's the BNF:

<factor> ::= <variable> | <constant> | '(' <expression> ')'

<signed_term> ::= [<addop>] <term>

<term> ::= <factor> (<mulop> <factor>)*

<expression> ::= <signed_term> (<addop> <term>)*

<assignment> ::= <variable> '=' <expression>
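One way to check that the grammar means what we intend is to write an evaluator with one function per rule. The following is a minimal sketch in Python, not the tutorial's Pascal; `run_assignment` and the `env` dictionary are illustrative names, operands are limited to single letters and digits, and division is assumed to truncate toward zero.

```python
def run_assignment(src, env=None):
    """Parse '<variable>=<expression>' and return the updated variable table."""
    env = dict(env or {})
    pos = 0

    def look():
        return src[pos] if pos < len(src) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at position {pos}")
        pos += 1

    def factor():                      # <variable> | <constant> | '(' <expression> ')'
        nonlocal pos
        ch = look()
        if ch == "(":
            match("(")
            value = expression()
            match(")")
            return value
        if ch.isdigit():
            pos += 1
            return int(ch)
        if ch.isalpha():
            pos += 1
            return env[ch]
        raise SyntaxError(f"Unrecognized character {ch!r}")

    def term():                        # <factor> (<mulop> <factor>)*
        nonlocal pos
        value = factor()
        while look() in ("*", "/"):
            op = look()
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else int(value / rhs)
        return value

    def signed_term():                 # [<addop>] <term>
        nonlocal pos
        sign = look()
        if sign in ("+", "-"):
            pos += 1
        value = term()
        return -value if sign == "-" else value

    def expression():                  # <signed_term> (<addop> <term>)*
        nonlocal pos
        value = signed_term()
        while look() in ("+", "-"):
            op = look()
            pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    name = look()                      # <variable> '=' <expression>
    if not name.isalpha():
        raise SyntaxError("Name expected")
    pos += 1
    match("=")
    env[name] = expression()
    if pos != len(src):
        raise SyntaxError(f"Unexpected character {look()!r}")
    return env


print(run_assignment("x=-2*3+(4-1)"))
```

Note that in `-2*3` the sign is applied after the whole term has been evaluated, which is the behavior the `<signed_term>` rule describes.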



BOOLEANS

The next step, as we've learned several times before, is to add Boolean algebra. In the past, this step has at least doubled the amount of code we've had to write. As I've gone over this step in my mind, I've found myself diverging more and more from what we did in previous installments. To refresh your memory, I noted that Pascal treats the Boolean operators pretty much identically to the way it treats arithmetic ones. A Boolean "and" has the same precedence level as multiplication, and the "or" as addition. C, on the other hand, sets them at different precedence levels, and all told has a whopping 17 levels. In our earlier work, I chose something in between, with seven levels. As a result, we ended up with things called Boolean expressions, paralleling in most details the arithmetic expressions, but at a different precedence level. All of this, as it turned out, came about because I didn't like having to put parentheses around the Boolean expressions in statements like:

IF (c >= 'A') and (c <= 'Z') then ...

In retrospect, that seems a pretty petty reason to add many layers of complexity to the parser. Perhaps more to the point, I'm not sure I was even able to avoid the parens.

For kicks, let's start anew, taking a more Pascal-ish approach, and just treat the Boolean operators at the same precedence level as the arithmetic ones. We'll see where it leads us. If it seems to be down the garden path, we can always backtrack to the earlier approach.

For starters, we'll add the "addition-level" operators to Expression. That's easily done; first, modify the function IsAddop in unit Scanner to include two extra operators: '|' for "or," and '~' for "exclusive or":

{--------------------------------------------------------------}
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+','-', '|', '~'];
end;
{--------------------------------------------------------------}

Next, we must include the parsing of the operators in procedure Expression:

{--------------------------------------------------------------}
procedure Expression;
begin
   SignedTerm;
   while IsAddop(Look) do
      case Look of
         '+': Add;
         '-': Subtract;
         '|': _Or;
         '~': _Xor;
      end;
end;
{--------------------------------------------------------------}

(The underscores are needed, of course, because "or" and "xor" are reserved words in Turbo Pascal.)

Next, the procedures _Or and _Xor:

{--------------------------------------------------------------}
{ Parse and Translate a Boolean Or Operation }

procedure _Or;
begin
   Match('|');
   Push;
   Term;
   PopOr;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Exclusive Or Operation }

procedure _Xor;
begin
   Match('~');
   Push;
   Term;
   PopXor;
end;
{--------------------------------------------------------------}



And, finally, the new code generator procedures:

{--------------------------------------------------------------}
{ Or TOS with Primary }

procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Exclusive-Or TOS with Primary }

procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;
{--------------------------------------------------------------}

Now, let's test the translator (you might want to change the call in Main back to a call to Expression, just to avoid having to type "x=" for an assignment every time).

So far, so good. The parser nicely handles expressions of the form:

x|y~z

Unfortunately, it also does nothing to protect us from mixing Boolean and arithmetic algebra. It will merrily generate code for:

(a+b)*(c~d)

We've talked about this a bit, in the past. In general the rules for what operations are legal or not cannot be enforced by the parser itself, because they are not part of the syntax of the language, but rather its semantics. A compiler that doesn't allow mixed-mode expressions of this sort must recognize that c and d are Boolean variables, rather than numeric ones, and balk at multiplying them in the next step. But this "policing" can't be done by the parser; it must be handled somewhere between the parser and the code generator. We aren't in a position to enforce such rules yet, because we haven't got either a way of declaring types, or a symbol table to store the types in. So, for what we've got to work with at the moment, the parser is doing precisely what it's supposed to do.

Anyway, are we sure that we DON'T want to allow mixed-type operations? We made the decision some time ago (or, at least, I did) to adopt the value 0000 as a Boolean "false," and -1, or FFFFh, as a Boolean "true." The nice part about this choice is that bitwise operations work exactly the same way as logical ones. In other words, when we do an operation on one bit of a logical variable, we do it on all of them. This means that we don't need to distinguish between logical and bitwise operations, as is done in C with the operators & and &&, and | and ||. Reducing the number of operators by half certainly doesn't seem all bad.

From the point of view of the data in storage, of course, the computer and compiler couldn't care less whether the number FFFFh represents the logical TRUE, or the numeric -1. Should we? I sort of think not. I can think of many examples (though they might be frowned upon as "tricky" code) where the ability to mix the types might come in handy. Example, the Dirac delta function, which could be coded in one simple line:

-(x=0)

or the absolute value function (DEFINITELY tricky code!):

x*(1+2*(x<0))
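Both one-liners depend on the convention that a true relation yields -1. A minimal Python sketch of that convention follows; `rel`, `delta`, and `absolute` are illustrative names, not part of the tutorial's code, and Python's own True and False are deliberately converted to -1 and 0 to mimic the compiler's choice.

```python
def rel(cond):
    """Result of a relation under the tutorial's convention: -1 (FFFFh) true, 0 false."""
    return -1 if cond else 0


def delta(x):
    return -rel(x == 0)               # -(x=0): 1 when x is zero, otherwise 0


def absolute(x):
    return x * (1 + 2 * rel(x < 0))   # x*(1+2*(x<0)): multiplies by -1 when x is negative


# With -1 and 0 as the only truth values, bitwise operators double as logical ones.
print(rel(True) & rel(False), rel(True) | rel(False), ~rel(True))
print(delta(0), delta(5), absolute(-4), absolute(3))
```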

Please note, I'm not advocating coding like this as a way of life. I'd almost certainly write these functions in more readable form, using IFs, just to keep from confusing later maintainers. Still, a moral question arises: Do we have the right to ENFORCE our ideas of good coding practice on the programmer, by writing the language so he can't do anything else? That's what Niklaus Wirth did, in many places in Pascal, and Pascal has been criticized for it -- for not being as "forgiving" as C.

An interesting parallel presents itself in the example of the Motorola 68000 design. Though Motorola brags loudly about the orthogonality of their instruction set, the fact is that it's far from orthogonal. For example, you can read a variable from its address:

MOVE X,D0 (where X is the name of a variable)

but you can't write in the same way. To write, you must load an address register with the address of X. The same is true for PC-relative addressing:

MOVE X(PC),D0 (legal)

MOVE D0,X(PC) (illegal)

When you begin asking how such non-orthogonal behavior came about, you find that someone in Motorola had some theories about how software should be written. Specifically, in this case, they decided that self-modifying code, which you can implement using PC-relative writes, is a Bad Thing. Therefore, they designed the processor to prohibit it. Unfortunately, in the process they also prohibited _ALL_ writes of the forms shown above, however benign. Note that this was not something done by default. Extra design work had to be done, and extra gates added, to destroy the natural orthogonality of the instruction set.



One of the lessons I've learned from life: If you have two choices, and can't decide which one to take, sometimes the best thing to do is nothing. Why add extra gates to a processor to enforce some stranger's idea of good programming practice? Leave the instructions in, and let the programmers debate what good programming practice is. Similarly, why should we add extra code to our parser, to test for and prevent conditions that the user might prefer to do, anyway? I'd rather leave the compiler simple, and let the software experts debate whether the practices should be used or not.

All of which serves as rationalization for my decision as to how to prevent mixed-type arithmetic: I won't. For a language intended for systems programming, the fewer rules, the better. If you don't agree, and want to test for such conditions, we can do it once we have a symbol table.



BOOLEAN "AND"

With that bit of philosophy out of the way, we can press on to the "and" operator, which goes into procedure Term. By now, you can probably do this without me, but here's the code, anyway:

In Scanner,

{--------------------------------------------------------------}
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*','/', '&'];
end;
{--------------------------------------------------------------}

In Parser,

{--------------------------------------------------------------}
procedure Term;
begin
   Factor;
   while IsMulop(Look) do
      case Look of
         '*': Multiply;
         '/': Divide;
         '&': _And;
      end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Boolean And Operation }

procedure _And;
begin
   Match('&');
   Push;
   Factor;
   PopAnd;
end;
{--------------------------------------------------------------}

and in CodeGen,

{--------------------------------------------------------------}
{ And Primary with TOS }

procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;
{--------------------------------------------------------------}

Your parser should now be able to process almost any sort of logical expression, and (should you be so inclined), mixed-mode expressions as well.

Why not "all sorts of logical expressions"? Because, so far, we haven't dealt with the logical "not" operator, and this is where it gets tricky. The logical "not" operator seems, at first glance, to be identical in its behavior to the unary minus, so my first thought was to let the exclusive or operator, '~', double as the unary "not." That didn't work. In my first attempt, procedure SignedTerm simply ate my '~', because the character passed the test for an addop, but SignedTerm ignores all addops except '-'. It would have been easy enough to add another line to SignedTerm, but that would still not solve the problem, because note that Expression only accepts a signed term for the _FIRST_ argument.

Mathematically, an expression like:

-a * -b

makes little or no sense, and the parser should flag it as an error. But the same expression, using a logical "not," makes perfect sense:

not a and not b



In the case of these unary operators, choosing to make them act the same way seems an artificial force fit, sacrificing reasonable behavior on the altar of implementational ease. While I'm all for keeping the implementation as simple as possible, I don't think we should do so at the expense of reasonableness. Patching like this would be missing the main point, which is that the logical "not" is simply NOT the same kind of animal as the unary minus. Consider the exclusive or, which is most naturally written as:

a~b ::= (a and not b) or (not a and b)

If we allow the "not" to modify the whole term, the last term in parentheses would be interpreted as:

not(a and b)

which is not the same thing at all. So it's clear that the logical "not" must be thought of as connected to the FACTOR, not the term.

The idea of overloading the '~' operator also makes no sense from a mathematical point of view. The implication of the unary minus is that it's equivalent to a subtraction from zero:

-x <=> 0-x

In fact, in one of my more simple-minded versions of Expression, I reacted to a leading addop by simply preloading a zero, then processing the operator as though it were a binary operator. But a "not" is not equivalent to an exclusive or with zero ... that would just give back the original number. Instead, it's an exclusive or with FFFFh, or -1.
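The difference is easy to check on 16-bit values. In this Python sketch the helper names `neg16` and `not16` are assumptions for illustration, and the mask stands in for the 16-bit word size the tutorial assumes.

```python
MASK = 0xFFFF  # model a 16-bit register


def neg16(x):
    """Unary minus: subtract from zero."""
    return (0 - x) & MASK


def not16(x):
    """Logical/bitwise not: exclusive or with FFFFh (that is, -1)."""
    return (x ^ 0xFFFF) & MASK


x = 0x1234
print(hex(x ^ 0))      # exclusive or with zero returns the original value
print(hex(not16(x)))   # every bit flipped
print(hex(neg16(x)))   # two's-complement negation, a different result
```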

In short, the seeming parallel between the unary "not" and the unary minus falls apart under closer scrutiny. "not" modifies the factor, not the term, and it is not related to either the unary minus or the exclusive or. Therefore, it deserves a symbol to call its own. What better symbol than the obvious one, also used by C, the '!' character? Using the rules about the way we think the "not" should behave, we should be able to code the exclusive or (assuming we'd ever need to), in the very natural form:

a & !b | !a & b

Note that no parentheses are required -- the precedence levels we've chosen automatically take care of things.

If you're keeping score on the precedence levels, this definition puts the '!' at the top of the heap. The levels become:

1. !

2. - (unary)

3. *, /, &

4. +, -, |, ~

Looking at this list, it's certainly not hard to see why we had trouble using '~' as the "not" symbol!

So how do we mechanize the rules? In the same way as we did with SignedTerm, but at the factor level. We'll define a procedure NotFactor:

{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional "Not" }

procedure NotFactor;
begin
   if Look ='!' then begin
      Match('!');
      Factor;
      NotIt;
      end
   else
      Factor;
end;
{--------------------------------------------------------------}

and call it from all the places where we formerly called Factor, i.e., from Term, Multiply, Divide, and _And. Note the new code generation procedure:

{--------------------------------------------------------------}
{ Bitwise Not Primary }

procedure NotIt;
begin
   EmitLn('EOR #-1,D0');
end;
{--------------------------------------------------------------}



Try this now, with a few simple cases. In fact, try that exclusive or example,

a&!b|!a&b

You should get the code (without the comments, of course):

MOVE A(PC),D0     ; load a
MOVE D0,-(SP)     ; push it
MOVE B(PC),D0     ; load b
EOR #-1,D0        ; not it
AND (SP)+,D0      ; and with a
MOVE D0,-(SP)     ; push result
MOVE A(PC),D0     ; load a
EOR #-1,D0        ; not it
MOVE D0,-(SP)     ; push it
MOVE B(PC),D0     ; load b
AND (SP)+,D0      ; and with !a
OR (SP)+,D0       ; or with first term

That's precisely what we'd like to get. So, at least for both arithmetic and logical operators, our new precedence and new, slimmer syntax hang together. Even the peculiar, but legal, expression with leading addop:

~x

makes sense. SignedTerm ignores the leading '~', as it should, since the expression is equivalent to:

0~x,

which is equal to x.

When we look at the BNF we've created, we find that our boolean algebra now adds only one extra line:

<not_factor> ::= [!] <factor>

<factor> ::= <variable> | <constant> | '(' <expression> ')'

<signed_term> ::= [<addop>] <term>

<term> ::= <not_factor> (<mulop> <not_factor>)*

<expression> ::= <signed_term> (<addop> <term>)*

<assignment> ::= <variable> '=' <expression>
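The expression part of this grammar can be exercised end to end with a compact generator. What follows is a Python sketch, not the tutorial's Pascal: the function name `compile_expression` and the table-driven `POP` dictionary are illustrative choices, and the instruction strings for the arithmetic operators and for `NEG D0` are assumptions modeled on the AND, OR, and EOR lines in the listing shown earlier. Operands are single letters or digits.

```python
ADDOPS = ("+", "-", "|", "~")
MULOPS = ("*", "/", "&")

# Code emitted after the right-hand operand of each binary operator (assumed strings).
POP = {
    "+": ["ADD (SP)+,D0"],
    "-": ["SUB (SP)+,D0", "NEG D0"],
    "|": ["OR (SP)+,D0"],
    "~": ["EOR (SP)+,D0"],
    "*": ["MULS (SP)+,D0"],
    "/": ["MOVE (SP)+,D7", "EXT.L D7", "DIVS D0,D7", "MOVE D7,D0"],
    "&": ["AND (SP)+,D0"],
}


def compile_expression(src):
    """Return the list of 68000-style lines generated for one expression."""
    out = []
    pos = 0

    def look():
        return src[pos] if pos < len(src) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at position {pos}")
        pos += 1

    def factor():
        nonlocal pos
        ch = look()
        if ch == "(":
            match("(")
            expression()
            match(")")
        elif ch.isdigit():
            out.append(f"MOVE #{ch},D0")
            pos += 1
        elif ch.isalpha():
            out.append(f"MOVE {ch.upper()}(PC),D0")
            pos += 1
        else:
            raise SyntaxError(f"Unrecognized character {ch!r}")

    def not_factor():                  # <not_factor> ::= [!] <factor>
        if look() == "!":
            match("!")
            factor()
            out.append("EOR #-1,D0")
        else:
            factor()

    def binary(ops, operand):          # shared loop for the (<op> <operand>)* tails
        nonlocal pos
        while look() in ops:
            op = look()
            pos += 1
            out.append("MOVE D0,-(SP)")
            operand()
            out.extend(POP[op])

    def term():
        not_factor()
        binary(MULOPS, not_factor)

    def signed_term():
        nonlocal pos
        sign = look()
        if sign in ADDOPS:             # any leading addop is consumed; only '-' negates
            pos += 1
        term()
        if sign == "-":
            out.append("NEG D0")

    def expression():
        signed_term()
        binary(ADDOPS, term)

    expression()
    if pos != len(src):
        raise SyntaxError(f"Unexpected character {look()!r}")
    return out


print("\n".join(compile_expression("a&!b|!a&b")))
```

Run on `a&!b|!a&b`, the sketch emits the same twelve instructions as the listing above, and a leading `~` is silently consumed just as SignedTerm does.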

That's a big improvement over earlier efforts. Will our luck continue to hold when we get to relational operators? We'll find out soon, but it will have to wait for the next installment. We're at a good stopping place, and I'm anxious to get this installment into your hands. It's already been a year since the release of Installment 15. I blush to admit that all of this current installment has been ready for almost as long, with the exception of relational operators. But the information does you no good at all, sitting on my hard disk, and by holding it back until the relational operations were done, I've kept it out of your hands for that long. It's time for me to let go of it and get it out where you can get value from it. Besides, there are quite a number of serious philosophical questions associated with the relational operators, as well, and I'd rather save them for a separate installment where I can do them justice.

Have fun with the new, leaner arithmetic and logical parsing, and I'll see you soon with relationals.


CHAPTER 3 Practical problems and their solutions...

In this chapter we want to discuss important practical problems and their solutions. Most of the problems seem to have a simple solution, but when you go deeper into the development of the disassembler you will see that these problems deserve discussion.

Some of the problems are how to load the file into memory or how to catch the file's entry-point. We will also have a look at how you can do complex parsing in assembly language.

This chapter is for inexperienced users and should help to solve some basic problems.



Lesson 1 - Loading Files Into Memory


Lesson 2 - Receiving Infos Of The Sections Of A PE-File


Lesson 3 - Catching The Entry-Point Of A PE-File


Lesson 4 - Linked Lists


Assembler-Source-Code1

; #########################################################################
; LinkedList.inc
; The following code is for educational purposes only.
; However, since linkedlists are a fundamental part of programming,
; feel free to use this file as you please.
; #########################################################################
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Linkedlist Code:
; KillEntry, AddEntry, plus initial structure
; EvilHomer2k, 15 August 2002, 3:54 in the morning.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Example Program:
; bug fixes, and the addition of KillEntryPlusChildren
; Scronty, 15 August 2002, 11:24pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Linkedlist Code Update:
; References to false LinkedObject fields corrected.
; Code in AddEntry altered to include an ObjectSize param for
; new entries.
; EvilHomer2k, 15 August 2002, 9:00 pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. This code was taken from http://board.win32asmcommunity.net/showthread.php?s=&threadid=7361&high-light=linked+lists




; Example Program Update:
; Implemented EvilHomer2ks altered AddEntry param (ObjectSize).
; Added Sibling fieldnames in LINKEDOBJECT struct (no procs).
; Added Application-Specific fieldnames in LINKEDOBJECT
; struct (NAME).
; Added NewName and KillName procs for the NAME fieldnames.
; Scronty, 16 August 2002, 11:08am.
; --------
; Changed name of AddEntry procedure to AddChildEntry.
; Added procedure: AddSiblingEntry
; Added procedure: KillEntryPlusYoungerSiblings
; Modified procedure: KillEntry to also patch Sibling links.
; EvilHomer, 18 August 2002, 11:02pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Current Table of Procedures:
; ============================
; -AddChildEntry
; -AddSiblingEntry
; -NewName
; -KillName
; -KillEntry (not recursive)(Checks all links)
; -KillEntryPlusChildren (recursive)(does not check Sibling links)
; -KillEntryPlusYoungerSiblings (recursive)(does not check Parent-Child links)
;
;-------------------------------------------------------------------
;Structure of an Entry in a Linked List
;-------------------------------------------------------------------

_LINKEDOBJECT STRUCT            ;Example structure is minimal - add some more fields to it
    ; Everything between the "~~~~" lines is mandatory.
    ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    pParent         DWORD ?     ;Pointer to my parent if I have one (Parent)
    pChild          DWORD ?     ;Pointer to my child if I have one (Child)
    pOlderSibling   DWORD ?     ;Pointer to my older sibling if I have one (Older Sibling)
    pYoungerSibling DWORD ?     ;Pointer to my younger sibling if I have one (Younger Sibling)
    hLock           DWORD ?     ;Handle for freeing this memory
    ;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ; Add Application-Specific fields here
    ;_____________________________________
    ; NAME
    pName           DWORD ?     ;Pointer to the name of this object (Name)
    hNameLock       DWORD ?     ;Handle for freeing the Names' memory
    ;_____________________________________
_LINKEDOBJECT ENDS

LINKEDOBJECT    TYPEDEF _LINKEDOBJECT
LPLINKEDOBJECT  TYPEDEF PTR _LINKEDOBJECT

; ########################################################################
; macros:
; CTEXT macro
; eg.
; invoke MessageBox, NULL, CTEXT("Hello World!"), NULL, MB_OK
CTEXT macro y:vararg
    local sym
    const segment
    ifidni <y>, <>
        sym db 0
    else
        sym db y, 0
    endif
    const ends
    exitm <offset sym>
endm

;## 'return' Macro ##
return MACRO returnvalue
    mov eax, returnvalue
    ret
ENDM

; ########################################################################
.data

ERRbuff db 128 DUP (0)

; ########################################################################
.code

KillName PROC lpThis:PTR LINKEDOBJECT
    push edi

    mov eax, lpThis
    mov edi, eax

    .if [edi].LINKEDOBJECT.hNameLock != NULL        ;I have a Name
        mov eax, [edi].LINKEDOBJECT.hNameLock       ;This bit happens regardless...
        invoke GlobalUnlock,eax                     ;Unlock this memory
        mov eax, [edi].LINKEDOBJECT.hNameLock       ;Grab the handle to this memory
        invoke GlobalFree,eax                       ;Release this memory
        ;_____________________________________
        ;Null-out Name fields
        mov [edi].LINKEDOBJECT.pName, NULL
        mov [edi].LINKEDOBJECT.hNameLock, NULL
        ;_____________________________________

        invoke MessageBox,NULL,CTEXT("Killed Name!"),CTEXT("Success!"),MB_OK
    .endif

    pop edi

    return TRUE
KillName ENDP



NewName PROC lpThis:PTR LINKEDOBJECT, pszNewName:DWORD
    LOCAL dwSize:DWORD

    push edi
    push esi

    mov eax, lpThis
    mov edi, eax

    ;________________________________
    ;Get the string length
    mov eax, pszNewName
@@:
    mov dl, [eax]
    inc eax
    cmp dl, 0
    jne @B
    sub eax, pszNewName
    dec eax                                         ; correct count
    mov dwSize, eax
    ;________________________________
    mov eax, dwSize
    inc eax                                         ;one extra byte for the terminating 0
    invoke GlobalAlloc,GPTR,eax                     ;Allocate memory for name
    mov [edi].LINKEDOBJECT.hNameLock, eax           ;Remember the unlock handle

    .if eax != NULL
        invoke GlobalLock,[edi].LINKEDOBJECT.hNameLock
        mov [edi].LINKEDOBJECT.pName, eax
        .if eax != NULL
            ;________________________________
            ;Copy name into allocated memory
            cld
            mov esi, [pszNewName]
            mov eax, [edi].LINKEDOBJECT.pName
            mov edi, eax
            mov ecx, dwSize
            shr ecx, 2
            rep movsd
            mov ecx, dwSize
            and ecx, 3
            rep movsb
            mov BYTE PTR [edi], 0                   ;Append a 0 directly after the copied bytes
            ;________________________________

            mov eax, lpThis
            invoke MessageBox,NULL,CTEXT("Added Name!"),[eax].LINKEDOBJECT.pName,MB_OK

            ;Return the node-pointer back to the caller
            mov eax, lpThis
            pop esi
            pop edi
            return eax                              ;Return pointer to the Object in EAX
        .else                                       ;GlobalLock failed...
            invoke GetLastError
            invoke wsprintf,addr ERRbuff,CTEXT("GlobalLock err #%lu"),eax
            invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
            invoke GlobalFree,[edi].LINKEDOBJECT.hNameLock  ;Free the memory we Failed to Lock
            mov [edi].LINKEDOBJECT.hNameLock, NULL
        .endif
    .else                                           ;GlobalAlloc failed...
        invoke GetLastError
        invoke wsprintf,addr ERRbuff,CTEXT("GlobalAlloc err #%lu"),eax
        invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
    .endif

    pop esi
    pop edi

    xor eax,eax
    ret                                             ;Return ERROR in eax since we have Failed

NewName ENDP



;-------------------------------------------------------------------
;KillEntry Procedure
;-Removes an entry from a Linked-List.
;-Revised to handle parent-child and/or sibling links.
;-Examines Parent<--<THIS>-->Child links and
;-Patches- Parent<-->Child (Bypassing Self).
;-Examines OlderSibling<--<THIS>--YoungerSibling links and
;-Patches- OlderSibling<-->YoungerSibling (Bypassing Self).
;-Also detects and patches NewRoot and NewLast nodes.
;-Releases the allocated memory used by the killed entry.
;-In other words, flawless and transparent removal
;-of a single entry in our List.
;-------------------------------------------------------------------
KillEntry PROC lpThis:PTR LINKEDOBJECT              ;pointer to entry to be deleted
    push edi
    push esi

    mov eax, lpThis
    mov edi, eax

    ;Kill any Name for this node
    invoke KillName, edi

    .if ([edi].LINKEDOBJECT.pParent != NULL) || ([edi].LINKEDOBJECT.pOlderSibling != NULL)
        ;I have a Parent and thus am not Root..
        .if [edi].LINKEDOBJECT.pChild != NULL               ;..and I also have a Child..
            mov eax, [edi].LINKEDOBJECT.pParent             ;(fetch ParentPointer)
            mov esi, eax
            mov eax, [edi].LINKEDOBJECT.pChild              ;(fetch ChildPointer)
            mov [esi].LINKEDOBJECT.pChild, eax              ;link Parent to Child, bypassing me
            mov eax, [edi].LINKEDOBJECT.pChild              ;(ChildPointer)
            mov esi, eax
            mov eax, [edi].LINKEDOBJECT.pParent             ;(ParentPointer)
            mov [esi].LINKEDOBJECT.pParent, eax             ;link Child to Parent, bypassing me
        .else                                               ;..and I have no Child..
            mov eax, [edi].LINKEDOBJECT.pParent
            mov esi, eax
            mov [esi].LINKEDOBJECT.pChild,NULL              ;Kill Parent's link to me
        .endif
        ;----------
        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL      ;..and I have a younger sibling
            mov eax, [edi].LINKEDOBJECT.pOlderSibling
            mov esi, eax
            mov eax, [edi].LINKEDOBJECT.pYoungerSibling
            mov [esi].LINKEDOBJECT.pYoungerSibling, eax     ;link Older to Younger, bypassing me
            mov eax, [edi].LINKEDOBJECT.pYoungerSibling
            mov esi, eax
            mov eax, [edi].LINKEDOBJECT.pOlderSibling
            mov [esi].LINKEDOBJECT.pOlderSibling, eax       ;link Younger to Older, bypassing me
        .else                                               ;..I have no younger sibling..
            mov eax, [edi].LINKEDOBJECT.pOlderSibling
            mov esi, eax
            mov [esi].LINKEDOBJECT.pYoungerSibling,NULL     ;Kill Older Sibling's link to me
        .endif                                              ;(setting it as Last)
        invoke MessageBox,NULL,CTEXT("killed Child!"),CTEXT("Success!"),MB_OK

    .else                                                   ;I am Root and have no Parent..
        .if [edi].LINKEDOBJECT.pChild != NULL               ;..but I do have a Child
            mov eax, [edi].LINKEDOBJECT.pChild
            mov esi, eax
            mov [esi].LINKEDOBJECT.pParent,NULL             ;Kill Child's link to Parent
        .endif                                              ;(setting it as Root)
        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL      ;..but I do have a Younger Sibling
            mov eax, [edi].LINKEDOBJECT.pYoungerSibling
            mov esi, eax
            mov [esi].LINKEDOBJECT.pOlderSibling,NULL       ;Kill Younger Sibling's link to me
        .endif                                              ;(setting it as Root)
        invoke MessageBox,NULL,CTEXT("killed Root!"),CTEXT("Success!"),MB_OK
    .endif

    ; (no parent and no child? alone? nothing to repair then)

    mov eax,[edi].LINKEDOBJECT.hLock                ;This bit happens regardless...
    invoke GlobalUnlock,eax                         ;Unlock this memory
    mov eax,[edi].LINKEDOBJECT.hLock                ;Grab the handle to this memory
    invoke GlobalFree,eax                           ;Release this memory

    pop esi
    pop edi

    return TRUE                                     ;cya
KillEntry ENDP

;-------------------------------------------------------------------
;AddChildEntry Procedure
;-Adds an entry to a Linked-List...
;-Allocates memory for a new entry,
;-Examines Parent<--->Child links and
;-Patches- Parent<--<THIS>-->Child (Inserting Self).
;-Bidirection links are preserved.
;-------------------------------------------------------------------
AddChildEntry PROC lpParent:PTR LINKEDOBJECT, ObjectSize:DWORD
    LOCAL lpOldChild:PTR LINKEDOBJECT
    LOCAL hMem:DWORD

    push edi
    push esi

    mov eax, lpParent
    mov esi, eax

    invoke GlobalAlloc,GPTR,ObjectSize
    mov hMem,eax
    .if eax != NULL
        invoke GlobalLock,hMem
        mov edi, eax
        .if eax != NULL
            mov eax, hMem
            mov [edi].LINKEDOBJECT.hLock, eax               ;Remember my unlock handle
            .if esi != NULL                                 ;I have a Parent and thus am not Root
                mov eax, [esi].LINKEDOBJECT.pChild
                mov lpOldChild, eax                         ;store possible child
                mov [esi].LINKEDOBJECT.pChild, edi          ;Tell Parent hes my new daddy - APPENDING
                mov [edi].LINKEDOBJECT.pParent, esi         ;Tell Myself I have a Parent
                mov eax, lpOldChild
                .if eax != NULL                             ;and that Parent had a Child - INSERTING!
                    mov eax, lpOldChild
                    mov esi, eax
                    mov [esi].LINKEDOBJECT.pParent, edi     ;Tell Child I'm their new sugardaddy
                    mov [edi].LINKEDOBJECT.pChild, esi      ;Tell Myself I have a Child
                    invoke MessageBox,NULL,CTEXT("Child Inserted!"),CTEXT("Success!"),MB_OK
                .else
                    mov [edi].LINKEDOBJECT.pChild, NULL
                    invoke MessageBox,NULL,CTEXT("Child Appended!"),CTEXT("Success!"),MB_OK
                .endif                                      ;I have no kids to worry about or
            .else                                           ;I am Root with No Parent and No Child
                mov [edi].LINKEDOBJECT.pParent, NULL
                mov [edi].LINKEDOBJECT.pChild, NULL
                mov [edi].LINKEDOBJECT.pOlderSibling, NULL
                mov [edi].LINKEDOBJECT.pYoungerSibling, NULL
                invoke MessageBox,NULL,CTEXT("Added Root!"),CTEXT("Success!"),MB_OK
            .endif

            ;_____________________________________
            ;Null-out Application-Specific fields
            mov [edi].LINKEDOBJECT.pName, NULL
            mov [edi].LINKEDOBJECT.hNameLock, NULL
            ;_____________________________________
            mov eax, edi
            pop esi
            pop edi
            return eax                                      ;Return pointer to the new Object in EAX
        .else                                               ;GlobalLock failed...
            invoke GetLastError
            invoke wsprintf,addr ERRbuff,CTEXT("GlobalLock err #%lu"),eax
            invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
            invoke GlobalFree,hMem                          ;Free the memory we Failed to Lock
        .endif
    .else                                                   ;GlobalAlloc failed...
        invoke GetLastError
        invoke wsprintf,addr ERRbuff,CTEXT("GlobalAlloc err #%lu"),eax
        invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
    .endif

    pop esi
    pop edi

    xor eax,eax
    ret                                                     ;Return NULL in eax since we have Failed
AddChildEntry ENDP



;-------------------------------------------------------------------
;AddSiblingEntry Procedure
;-Adds an entry to a Linked-List...
;-Allocates memory for a new entry,
;-Examines OlderSibling<--->YoungerSibling links and
;-Patches- OlderSibling<--<THIS>-->YoungerSibling (Inserting Self).
;-Bidirectional links are preserved.
;-------------------------------------------------------------------
AddSiblingEntry PROC lpParent:PTR LINKEDOBJECT, ObjectSize:DWORD
LOCAL lpOldChild:PTR LINKEDOBJECT
LOCAL hMem:DWORD
    push edi
    push esi

    mov eax, lpParent                                   ;here "Parent" is the Older Sibling
    mov esi, eax

    invoke GlobalAlloc,GPTR,ObjectSize
    mov hMem,eax
    .if eax != NULL
        invoke GlobalLock,hMem
        mov edi, eax
        .if eax != NULL
            mov eax, hMem
            mov [edi].LINKEDOBJECT.hLock, eax           ;Remember my unlock handle
            .if esi != NULL                             ;I have an Older Sibling and thus am not Root
                mov eax, [esi].LINKEDOBJECT.pYoungerSibling
                mov lpOldChild, eax                     ;store possible younger sibling
                mov [esi].LINKEDOBJECT.pYoungerSibling, edi ;Tell Older Sibling about me - APPENDING
                mov [edi].LINKEDOBJECT.pOlderSibling, esi   ;Tell Myself I have an Older Sibling
                mov eax, lpOldChild
                .if eax != NULL                         ;and it already had a Younger Sibling - INSERTING!
                    mov eax, lpOldChild
                    mov esi, eax
                    mov [esi].LINKEDOBJECT.pOlderSibling, edi   ;Tell it I'm its new Older Sibling
                    mov [edi].LINKEDOBJECT.pYoungerSibling, esi ;Tell Myself I have a Younger Sibling
                    invoke MessageBox,NULL,CTEXT("Sibling Inserted!"),CTEXT("Success!"),MB_OK
                .else                                   ;I have no younger sibling to worry about
                    mov [edi].LINKEDOBJECT.pYoungerSibling, NULL
                    invoke MessageBox,NULL,CTEXT("Sibling Appended!"),CTEXT("Success!"),MB_OK
                .endif
            .else                                       ;I am Root with No Parent and No Child
                mov [edi].LINKEDOBJECT.pParent, NULL
                mov [edi].LINKEDOBJECT.pChild, NULL
                mov [edi].LINKEDOBJECT.pOlderSibling, NULL
                mov [edi].LINKEDOBJECT.pYoungerSibling, NULL
                invoke MessageBox,NULL,CTEXT("Added Root!"),CTEXT("Success!"),MB_OK
            .endif

            ;_____________________________________
            ;Null-out Application-Specific fields
            mov [edi].LINKEDOBJECT.pName, NULL
            mov [edi].LINKEDOBJECT.hNameLock, NULL
            ;_____________________________________
            mov eax, edi
            pop esi
            pop edi
            return eax                                  ;Return pointer to the new Object in EAX
        .else                                           ;GlobalLock failed...
            invoke GetLastError
            invoke wsprintf,addr ERRbuff,CTEXT("GlobalLock err #%lu"),eax
            invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
            invoke GlobalFree,hMem                      ;Free the memory we Failed to Lock
        .endif
    .else                                               ;GlobalAlloc failed...
        invoke GetLastError
        invoke wsprintf,addr ERRbuff,CTEXT("GlobalAlloc err #%lu"),eax
        invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
    .endif

    pop esi
    pop edi
    return NULL                                         ;Failure: no object was created
AddSiblingEntry ENDP
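The splice that AddChildEntry and AddSiblingEntry perform can be easier to follow in a high-level language. The following is only a rough C sketch of the same idea, not a translation of the procedures above: the `Node` type and the function names are made up for illustration, the application-specific fields are left out, and `calloc` stands in for GlobalAlloc/GlobalLock.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C mirror of the four link fields of LINKEDOBJECT. */
typedef struct Node {
    struct Node *parent;   /* pParent         */
    struct Node *child;    /* pChild          */
    struct Node *older;    /* pOlderSibling   */
    struct Node *younger;  /* pYoungerSibling */
} Node;

/* Allocate a zeroed node and splice it between parent and parent's old child.
   With parent == NULL the new node is a root. Returns NULL if allocation fails. */
Node *add_child(Node *parent) {
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) return NULL;
    if (parent != NULL) {
        Node *old = parent->child;   /* possible existing child */
        parent->child = n;           /* appending               */
        n->parent = parent;
        if (old != NULL) {           /* inserting               */
            old->parent = n;
            n->child = old;
        }
    }
    return n;
}

/* Same splice along the sibling axis: older <-> new <-> older's old younger. */
Node *add_sibling(Node *older) {
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) return NULL;
    if (older != NULL) {
        Node *old = older->younger;
        older->younger = n;
        n->older = older;
        if (old != NULL) {
            old->older = n;
            n->younger = old;
        }
    }
    return n;
}
```

Both directions of every link are written in the same step, which is what keeps the list walkable from either end.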




;-------------------------------------------------------------------
;KillEntryPlusChildren Procedure
;-Removes an entry from a Linked-List.
;-Examines Parent<--<THIS>-->Child links and
;-Patches- Parent (Deleting Self).
;-Also detects Child links and recursively removes them.
;-Releases the allocated memory used by the killed entry.
;-In other words, flawless and transparent removal
;-of a single entry in our List.
;-------------------------------------------------------------------
KillEntryPlusChildren PROC lpThis:PTR LINKEDOBJECT      ;pointer to entry to be deleted
    push edi
    push esi

    mov edi, lpThis                                     ;edi must point at the entry to delete

    .if [edi].LINKEDOBJECT.pParent != NULL              ;I have a Parent and thus am not Root..
        mov eax, [edi].LINKEDOBJECT.pParent
        mov esi, eax
        mov [esi].LINKEDOBJECT.pChild,NULL              ;Kill Parent's link to me
        .if [edi].LINKEDOBJECT.pChild != NULL           ;..and I also have a Child..
            invoke KillEntryPlusChildren, [edi].LINKEDOBJECT.pChild ;Kill Child
            ;Kill any Name for this node
            invoke KillName, lpThis
            invoke MessageBox,NULL,CTEXT("killed Child Node!"),CTEXT("Success!"),MB_OK
        .else
            ;Kill any Name for this node
            invoke KillName, lpThis
            invoke MessageBox,NULL,CTEXT("killed End Node!"),CTEXT("Success!"),MB_OK
        .endif

    .else                                               ;I am Root with No Parent

        .if [edi].LINKEDOBJECT.pChild != NULL           ;..and I also have a Child..
            invoke KillEntryPlusChildren, [edi].LINKEDOBJECT.pChild ;Kill Child
        .endif
        ;Kill any Name for this node
        invoke KillName, lpThis
        invoke MessageBox,NULL,CTEXT("Killed Root!"),CTEXT("Success!"),MB_OK

    .endif

    invoke GlobalUnlock,[edi].LINKEDOBJECT.hLock        ;Unlock this memory
    invoke GlobalFree,[edi].LINKEDOBJECT.hLock          ;Release this memory

    pop esi
    pop edi

    return TRUE                                         ;cya
KillEntryPlusChildren ENDP




;-------------------------------------------------------------------
;KillEntryPlusYoungerSiblings Procedure
;-Removes an entry from a Linked-List.
;-Examines OlderSibling<--<THIS>-->YoungerSibling links and
;-Patches- OlderSibling (Deleting Self).
;-Also detects YoungerSibling links and recursively removes them.
;-Releases the allocated memory used by the killed entry.
;-In other words, flawless and transparent removal
;-of a single entry in our List plus its Younger Siblings.
;-------------------------------------------------------------------
KillEntryPlusYoungerSiblings PROC lpThis:PTR LINKEDOBJECT ;pointer to entry to be deleted
    push edi
    push esi

    mov edi, lpThis                                     ;edi must point at the entry to delete

    .if [edi].LINKEDOBJECT.pOlderSibling != NULL        ;I have an Older Sibling..
        mov eax, [edi].LINKEDOBJECT.pOlderSibling
        mov esi, eax
        mov [esi].LINKEDOBJECT.pYoungerSibling,NULL     ;Kill Older Sibling's link to me
        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL  ;..and I also have a Younger Sibling..
            invoke KillEntryPlusYoungerSiblings, [edi].LINKEDOBJECT.pYoungerSibling ;Kill it
            ;Kill any Name for this node
            invoke KillName, lpThis
            invoke MessageBox,NULL,CTEXT("killed Child Sibling Node!"),CTEXT("Success!"),MB_OK
        .else
            ;Kill any Name for this node
            invoke KillName, lpThis
            invoke MessageBox,NULL,CTEXT("killed End Sibling Node!"),CTEXT("Success!"),MB_OK
        .endif

    .else                                               ;I am the first entry, with no Older Sibling

        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL  ;..and I also have a Younger Sibling..
            invoke KillEntryPlusYoungerSiblings, [edi].LINKEDOBJECT.pYoungerSibling ;Kill it
        .endif
        ;Kill any Name for this node
        invoke KillName, lpThis
        invoke MessageBox,NULL,CTEXT("Killed Root!"),CTEXT("Success!"),MB_OK

    .endif

    invoke GlobalUnlock,[edi].LINKEDOBJECT.hLock        ;Unlock this memory
    invoke GlobalFree,[edi].LINKEDOBJECT.hLock          ;Release this memory

    pop esi
    pop edi

    return TRUE                                         ;cya
KillEntryPlusYoungerSiblings ENDP
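The recursive removal can be sketched in C as well. This is an illustrative model only: `Node`, `new_child` and `kill_entry_plus_children` are hypothetical names, only the parent/child axis is shown, and `free` stands in for GlobalUnlock/GlobalFree. The point to notice is the order of operations: detach from the parent, recurse into the child, and release the node's own memory last.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node with only the parent/child links of LINKEDOBJECT. */
typedef struct Node {
    struct Node *parent;
    struct Node *child;
} Node;

/* Test helper: allocate a node and hang it under parent (parent may be NULL). */
Node *new_child(Node *parent) {
    Node *n = calloc(1, sizeof *n);
    if (n != NULL && parent != NULL) {
        n->parent = parent;
        parent->child = n;
    }
    return n;
}

/* Remove n and everything below it. Returns the number of nodes released. */
size_t kill_entry_plus_children(Node *n) {
    size_t freed = 0;
    if (n == NULL) return 0;
    if (n->parent != NULL)
        n->parent->child = NULL;                        /* kill parent's link to me */
    if (n->child != NULL)
        freed = kill_entry_plus_children(n->child);     /* kill child first         */
    free(n);                                            /* release myself last      */
    return freed + 1;
}
```

The sibling variant would follow the same pattern with the older/younger links in place of parent/child.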




Lesson 5 - Parsing [2]

Here we have some code-snippets which could be helpful. Parsing tasks vary a lot and can become very complex, so these are just some examples to help you on your way!

[2] This lesson contains various code-snippets from different authors. All of them were posted at http://board.win32asmcommunity.net. Respect the authors' work!

By Sliver

Find_First_Of
Find_Last_Of
Find_First_Not_Of

Required arguments:
1) the string to be searched
2) the separators to be found
3) the starting position

How it works:

1) Find_First_Of returns the position of the first instance of any of the given separators. Assuming the sentence was "Hello everyone how are you doing" and the separator was " " (a space), it would return 5 in eax. The first letter in the string is at starting position 0.

2) Find_Last_Of finds the last occurrence of a separator.

3) Find_First_Not_Of finds the first occurrence of something that is not a separator.
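To make the intended behaviour concrete, here is a small C reference model of the three searches. It is a sketch under assumptions, not a port of Sliver's code: the function names are lower-cased inventions, and returning -1 when nothing is found is a choice made for this sketch (the assembly version signals "not found" differently).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first character at or after start that is one of seps, or -1. */
long find_first_of(const char *s, const char *seps, size_t start) {
    assert(s != NULL && seps != NULL);
    size_t len = strlen(s);
    for (size_t i = start; i < len; i++)
        if (strchr(seps, s[i]) != NULL) return (long)i;
    return -1;
}

/* Index of the last character at or after start that is one of seps, or -1. */
long find_last_of(const char *s, const char *seps, size_t start) {
    assert(s != NULL && seps != NULL);
    size_t len = strlen(s);
    long last = -1;
    for (size_t i = start; i < len; i++)
        if (strchr(seps, s[i]) != NULL) last = (long)i;
    return last;
}

/* Index of the first character at or after start that is NOT one of seps, or -1. */
long find_first_not_of(const char *s, const char *seps, size_t start) {
    assert(s != NULL && seps != NULL);
    size_t len = strlen(s);
    for (size_t i = start; i < len; i++)
        if (strchr(seps, s[i]) == NULL) return (long)i;
    return -1;
}
```

All three return positions counted from the first character of the string (position 0), independent of the starting position.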



code:

; #########################################################################
;
; Find_First_Of / Find_Last_Of / Find_First_Not_Of
;
; Suppose you had a string -- a paragraph of prose, perhaps -- and you wanted
; to break it up into individual words. You would need to find where the
; separators were, and those could be any of a number of different characters;
; there could be spaces, commas, periods, colons and so on. These procedures
; find where any one of a given set of characters occurs in a string -- this
; can tell you where the delimiters for the words are. I hope this makes
; someone's life a little easier :-) Cheers, Walter Reid (Sliver)
;
; Works like this:
; invoke Find_First_Of, string to be searched, separators, starting position
; returns the location of the first separator in eax
;
; invoke Find_Last_Of, string to be searched, separators, starting position
; returns the location of the last separator in eax
; #########################################################################

.386
.model flat, stdcall
option casemap :none ; case sensitive

; #########################################################################

include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc
include \masm32\include\debug.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\debug.lib

Main PROTO
Find_First_Of PROTO :DWORD, :DWORD, :DWORD
Find_Last_Of PROTO :DWORD, :DWORD, :DWORD
Find_First_Not_Of PROTO :DWORD, :DWORD, :DWORD

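.code                                   ; the procedures must live in the code segment

; Note: the separators are tried in the order given; each scan resumes at the
; position of the previous hit.
Find_Last_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    LOCAL val:DWORD

    mov val, 0
    mov edi, lpszTarget                 ; edi -> current separator
    xor ecx, ecx
start_scan:
    mov esi, lpszSource
    add esi, StartPos
    add esi, ecx                        ; resume at the previous hit
next:
    mov al, byte ptr [esi]
    inc esi
    cmp al, byte ptr [edi]
    je found
    inc ecx

    cmp al, 0
    jne next

found2:
    mov ecx, val
    inc edi                             ; try the next separator
    cmp byte ptr [edi], 0
    jne start_scan
    jmp done
found:
    mov val, ecx
    jmp found2

done:
    mov eax, ecx                        ; offset of the last hit, counted from StartPos
    ret
Find_Last_Of endp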

Find_First_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    LOCAL val:DWORD
    mov val, 100                        ; search limit, also the "nothing found" value
    mov edi, lpszTarget                 ; edi -> current separator
start_scan:
    mov esi, lpszSource
    add esi, StartPos
    xor ecx, ecx
next:
    mov al, byte ptr [esi]
    inc esi

    cmp ecx, val                        ; never look beyond the best hit so far
    je found

    cmp al, byte ptr [edi]
    je found
    inc ecx
    cmp al, 0
    je next_sep                         ; end of string: go on with the next separator
                                        ; (jumping back to start_scan would loop forever)
    jmp next

found:
    mov val, ecx
next_sep:
    xor ecx, ecx
    inc edi
    cmp byte ptr [edi], 0
    jne start_scan
done:
    mov eax, val
    add eax, StartPos
    ret
Find_First_Of endp

Find_First_Not_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    LOCAL val:DWORD
    mov val, 100
    xor ecx, ecx
    mov esi, lpszSource
    add esi, StartPos
start_scan:
    mov edi, lpszTarget
next:
    mov al, byte ptr [esi]
    cmp al, 0
    je done
    cmp al, byte ptr [edi]
    jne no_match

match:
    inc esi
    cmp byte ptr [esi], 0
    je done

    inc ecx
    jmp start_scan

no_match:
    inc edi
    cmp byte ptr [edi], 0
    jne next
    mov val, ecx
done:
    mov eax, val
    add eax, StartPos
    ret
Find_First_Not_Of endp
; #########################################################################

.data
Msg1 db "Hi! My name is Walter. How are you?",0
Msg2 db "aeioufaefeaio",0
Txt  db " ?.!,",0
Txt2 db "uoaei",0
; #########################################################################

.code

start:
    invoke Main
    invoke ExitProcess,0

Main proc

    invoke Find_First_Of, ADDR Msg1, ADDR Txt, 0
    PrintText "Find the first separator ( ?.!,) -- starting at pos 0"
    PrintText "in the sentence 'Hi! My name is Walter. How are you?'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"

    PrintText " "
    PrintText " "
    invoke Find_First_Of, ADDR Msg1, ADDR Txt2, 14
    PrintText "Find the first vowel (uoaei) -- starting at pos 14 (space after 'is')"
    PrintText "in the sentence 'Hi! My name is Walter. How are you?'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"

    PrintText " "
    PrintText " "
    PrintText " "
    PrintText " "
    invoke Find_Last_Of, ADDR Msg1, ADDR Txt, 0
    PrintText "Find the last separator ( ?.!,) -- starting at pos 0"
    PrintText "in the sentence 'Hi! My name is Walter. How are you?'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"

    PrintText " "
    PrintText " "
    invoke Find_Last_Of, ADDR Msg1, ADDR Txt2, 0
    PrintText "Find the last vowel (uoaei) -- starting at pos 0"
    PrintText "in the sentence 'Hi! My name is Walter. How are you?'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"

    PrintText " "
    PrintText " "
    PrintText " "
    PrintText " "
    invoke Find_First_Not_Of, ADDR Msg2, ADDR Txt2, 0
    PrintText "Find the first not of separator (uoaei) -- starting at pos 0"
    PrintText "in the sentence 'aeioufaefeaio'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"

    PrintText " "
    PrintText " "
    invoke Find_First_Not_Of, ADDR Msg2, ADDR Txt2, 6
    PrintText "Find the first not of separator (uoaei) -- starting at pos 6"
    PrintText "in the sentence 'aeioufaefeaio'"
    PrintDec eax
    PrintText "Value returned is from the first character (0)"
    ret
Main endp

end start


by Eóin
code:

ParseString Proc uses ebx esi edi pStr:DWORD,sPos:DWORD,pBuf:DWORD

InRange MACRO a,b,c
    lea ecx,[a-b]
    lea edx,[a-c-1]
    xor edx,ecx
    or ebx,edx
EndM

Ranges MACRO
    InRange eax,'a','z'
    InRange eax,'0','9'
    InRange eax,'A','Z'
EndM

    mov esi,pStr
    mov edi,pBuf
    add esi,sPos
    assume esi:ptr byte
    assume edi:ptr byte

@@:
    movzx eax,[esi]
    xor ebx,ebx
    test eax,eax
    jz nlb
    Ranges
    js @F
    inc esi
    jmp @B

@@:
    mov [edi],al
    inc esi
    inc edi

    movzx eax,[esi]
    xor ebx,ebx
    test eax,eax
    jz nlb
    Ranges
    js @B

nlb:
    mov [edi],0
    mov eax,esi
    sub eax,pStr
    ret
ParseString EndP

Usage is simple: call the function with a pointer to the string you wish to parse, the start position and a pointer to a buffer that receives the parsed part.

.data
szTest db "This is a test",0
.data?
Pos dd ?
Buf db 64 dup (?)

.code
    Invoke ParseString,addr szTest,0,addr Buf
    mov Pos,eax ; Buf contains "This",0

    Invoke ParseString,addr szTest,Pos,addr Buf
    mov Pos,eax ; Buf contains "is",0
    …
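The InRange macro relies on a sign trick that is easy to miss in the assembly: `(a - lo)` and `(a - hi - 1)` have different signs exactly when `lo <= a <= hi`, so their XOR is negative only for characters inside the range. The C sketch below models that test and the two loops of ParseString. The names `in_range`, `is_token_char` and `parse_string` are assumptions made for this illustration, and the sketch assumes the usual two's-complement `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Range test as in the InRange macro: the XOR is negative iff lo <= a <= hi. */
static int in_range(int a, int lo, int hi) {
    return ((a - lo) ^ (a - hi - 1)) < 0;
}

/* The three ranges of the Ranges macro: a-z, 0-9, A-Z. */
static int is_token_char(int c) {
    return in_range(c, 'a', 'z') || in_range(c, '0', '9') || in_range(c, 'A', 'Z');
}

/* Skip non-token characters starting at pos, copy the next run of token
   characters into buf (NUL-terminated), and return the position just after it.
   buf must be large enough for the token; no bounds check, as in the original. */
size_t parse_string(const char *str, size_t pos, char *buf) {
    const unsigned char *base = (const unsigned char *)str;
    const unsigned char *p = base + pos;
    assert(str != NULL && buf != NULL);
    while (*p != 0 && !is_token_char(*p)) p++;          /* first loop: skip  */
    while (*p != 0 && is_token_char(*p)) *buf++ = (char)*p++; /* second: copy */
    *buf = '\0';
    return (size_t)(p - base);
}
```

Feeding the returned position back in as the next start position walks through the string token by token, exactly as in the usage example above.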


By Stryker

String Reverse

Output: dlrow leurc olleh
code:

.386
.MODEL flat, stdcall
option casemap:none

INCLUDE \masm32\include\windows.inc
INCLUDE \masm32\include\kernel32.inc
INCLUDELIB \masm32\lib\kernel32.lib
INCLUDE \masm32\include\user32.inc
INCLUDELIB \masm32\lib\user32.lib

.data

mystringdata db "hello cruel world", 0
buffer db 20 DUP(0)

.code

Start:

    invoke lstrlen, OFFSET mystringdata
    mov ecx, eax
    mov esi, OFFSET mystringdata
    mov edi, OFFSET buffer
@@:
    dec ecx
    mov dl, BYTE ptr [esi+ecx]
    mov BYTE ptr [edi], dl
    inc edi
    or ecx, ecx
    ja @B
    invoke MessageBox, 0, OFFSET buffer, 0, 0
    invoke ExitProcess, 0
END Start



Reverses the string until the center character, then reverses up the string again.

Output: dlrow leuel world
code:

.386
.MODEL flat, stdcall
option casemap:none

INCLUDE \masm32\include\windows.inc
INCLUDE \masm32\include\kernel32.inc
INCLUDELIB \masm32\lib\kernel32.lib
INCLUDE \masm32\include\user32.inc
INCLUDELIB \masm32\lib\user32.lib

.data

mystringdata db "hello cruel world", 0

.code

Start:

    invoke lstrlen, OFFSET mystringdata
    mov ecx, eax
    mov esi, OFFSET mystringdata
    mov edi, OFFSET mystringdata
@@:
    dec ecx
    mov dl, BYTE ptr [esi+ecx]
    mov BYTE ptr [edi], dl
    inc edi
    or ecx, ecx
    ja @B
    mov BYTE ptr [edi], cl
    invoke MessageBox, 0, OFFSET mystringdata, 0, 0
    invoke ExitProcess, 0
END Start
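The difference between the two listings comes down to whether source and destination overlap. As a rough C illustration (function names invented for this sketch): copying backwards into a separate buffer gives the full reversal, running the same loop with source and destination identical gives the mirrored result shown above, and a real in-place reversal needs swaps from both ends.

```c
#include <assert.h>
#include <string.h>

/* First listing: copy backwards into a separate buffer. */
void reverse_copy(const char *src, char *dst) {
    size_t n = strlen(src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
    dst[n] = '\0';
}

/* Second listing: same loop, but source and destination are the same string.
   Past the centre the loop reads characters it has already overwritten. */
void reverse_overlapping(char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] = s[n - 1 - i];
}

/* A true in-place reversal swaps characters from both ends towards the centre. */
void reverse_in_place(char *s) {
    size_t n = strlen(s);
    if (n < 2) return;
    for (size_t i = 0, j = n - 1; i < j; i++, j--) {
        char t = s[i];
        s[i] = s[j];
        s[j] = t;
    }
}
```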


Lesson 6 - OOP [3]

Main-File of OOP

.386
.model flat,stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
include \masm32\include\masm32.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\masm32.lib

include \masm32\include\Objects.inc     ; Our Object Include Macro Set
include myClass.asm                     ; The Class Definition File

.data?
myNiceClass dd ?                        ; Class Instance Handle

.code
start:

    mov myNiceClass, $NEW( myClass )            ; init class: myNiceClass = new myClass()
    METHOD myNiceClass, myClass, SetVariable    ; now set variable: myNiceClass.setVariable();
    METHOD myNiceClass, myClass, Print          ; and print: print (myNiceClass.myVariable);

    DESTROY myNiceClass                         ; Must Clean up when finished.

    invoke ExitProcess, NULL
end start

3. This is copyrighted by NaN ( [email protected] ). He submitted this source at board.win32assembler.net after I asked for help on OOP for MASM32. Please respect this.


Class-File of OOP

IFNDEF _myClass_
_myClass_ equ 1

; --=====================================================================================--
; #CLASS: myClass
; #VERSION: 1.0
; --=====================================================================================--
; Built by NaN's Object Class Creator
; © Sept 19, 2001
;
; By NaN ( [email protected] )
; http://nan32asm.cjb.net
;
; --=====================================================================================--
; #AUTHOR: NaN
; #DATE: Sept. 25, 2001
;
; #DESCRIPTION:
;
; Test "Hello World" class for example.
;
; --=====================================================================================--
; CLASS METHOD PROTOS
; --=====================================================================================--
myClass_Init PROTO :DWORD

; --=====================================================================================--
; FUNCTION POINTER PROTOS
; --=====================================================================================--
myCl_destructorPto TYPEDEF PROTO :DWORD
myCl_PrintPto TYPEDEF PROTO :DWORD
myCl_SetVariablePto TYPEDEF PROTO :DWORD

; --=====================================================================================--
; CLASS STRUCTURE
; --=====================================================================================--
CLASS myClass, myCl
    CMETHOD destructor          ; MUST BE THE FIRST, OR OBJECTS.INC WILL FAIL
    CMETHOD Print               ; Used to create a Message Box, and Print the Variable Data
    CMETHOD SetVariable         ; Used to fill the internal buffer with a text string
    PrivateBuffer dd 32 dup(?)  ; 128 byte buffer
myClass ENDS

.data

BEGIN_INIT
    dd offset myCl_destructor_Funct
    dd offset myCl_Print_Funct
    dd offset myCl_SetVariable_Funct
    dd 32 dup( 0 )              ; 32 NULL's for initial buffer.
END_INIT

.code

; --=====================================================================================--
; #METHOD: CONSTRUCTOR (NONE)
;
; #DESCRIPTION: Empty Constructor, that does nothing specific..
;
; --=====================================================================================--
myClass_Init PROC uses edi esi lpTHIS:DWORD
    SET_CLASS myClass
    SetObject edi, myClass

    ReleaseObject edi
    ret
myClass_Init ENDP

; --=====================================================================================--
; #METHOD: destructor (NONE)
;
; #DESCRIPTION: Empty Destructor, that does nothing specific..
;
; --=====================================================================================--
myCl_destructor_Funct PROC uses edi lpTHIS:DWORD
    SetObject edi, myClass

    ReleaseObject edi
    ret
myCl_destructor_Funct ENDP

; --=====================================================================================--
; #METHOD: Print()
;
; #DESCRIPTION: Creates a Message Box with the String Data, IF and only IF the
;               string is set using the SetVariable Method..
;
; --=====================================================================================--
myCl_Print_Funct PROC uses edi lpTHIS:DWORD
    SetObject edi, myClass
    mov al, BYTE PTR [edi].PrivateBuffer    ; Get first buffer byte
    cmp al, 0                               ; See if its NULL
    je @F                                   ; Yes, then dont print, and exit
    invoke MessageBox, NULL, addr [edi].PrivateBuffer, NULL, MB_OK ; Print Out the Message
@@:                                         ; Exit
    ReleaseObject edi
    ret
myCl_Print_Funct ENDP

; --=====================================================================================--
; #METHOD: SetVariable (R)
;
; #DESCRIPTION: Fills a private class buffer with string data..
;
; --=====================================================================================--
myCl_SetVariable_Funct PROC uses edi lpTHIS:DWORD
    SetObject edi, myClass
    .data
    SetDataString db "Hello ASM Coder, this is the NaN/Thomas OOP model!",0
    .code
    invoke StrLen, addr SetDataString
    mov edx, eax
    invoke MemCopy, addr SetDataString, addr [edi].PrivateBuffer, edx
    ReleaseObject edi
    ret
myCl_SetVariable_Funct ENDP

ENDIF
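The layout that the class file builds (method pointers first, data afterwards, and an initialisation image that is copied into every new instance) can be pictured with a plain C struct. The following sketch is only an analogy and says nothing about how Objects.inc is implemented internally: all C names are invented, `puts` replaces the message box, and `malloc`/`free` replace whatever allocator the macros use.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct MyClass MyClass;

/* Method pointers first (destructor in slot 0), then the private data. */
struct MyClass {
    void (*destructor)(MyClass *self);
    int  (*print)(MyClass *self);          /* returns 1 if something was printed */
    void (*set_variable)(MyClass *self);
    char private_buffer[128];              /* 128 byte buffer */
};

static void my_destructor(MyClass *self) { (void)self; }   /* nothing to clean up */

static int my_print(MyClass *self) {
    if (self->private_buffer[0] == '\0') return 0;         /* not set: do not print */
    puts(self->private_buffer);
    return 1;
}

static void my_set_variable(MyClass *self) {
    strcpy(self->private_buffer,
           "Hello ASM Coder, this is the NaN/Thomas OOP model!");
}

/* Counterpart of the BEGIN_INIT/END_INIT block: image copied into each new object. */
static const MyClass my_class_init = {
    my_destructor, my_print, my_set_variable, {0}
};

MyClass *my_class_new(void) {
    MyClass *obj = malloc(sizeof *obj);
    if (obj != NULL) *obj = my_class_init;
    return obj;
}

void my_class_destroy(MyClass *obj) {
    if (obj == NULL) return;
    obj->destructor(obj);
    free(obj);
}
```

A call such as `obj->set_variable(obj)` then plays the role of `METHOD myNiceClass, myClass, SetVariable` in the main file: the object pointer is passed explicitly as the first argument.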




Lesson 7 - SEH [4]

SEH.asm

.386
.model flat,stdcall
option casemap:none

INCLUDE \masm32\include\user32.inc
INCLUDELIB \masm32\lib\user32.lib

INCLUDE \masm32\include\windows.inc
INCLUDE SEH.inc

.DATA
szGood DB "SEH succeed :)",0
szCap  DB "OK",0

.CODE
main:
    InstSehFrame <OFFSET SavePlace1>

    ; CRASH CODE 1 (access to address 0)
    XOR EAX, EAX
    XCHG DWORD PTR [EAX], EAX

SavePlace1:
    KillSehFrame
    InstSehFrame <OFFSET SavePlace2>

    ; CRASH CODE 2 (division by zero)
    XOR EBX, EBX
    XOR EDX, EDX
    MOV EAX, 2
    DIV EBX

SavePlace2:
    KillSehFrame

    INVOKE MessageBox,0,offset szGood,offset szCap,MB_OK
    RET
end main

4. Code-snippet coded by yoda (http://y0da.cjb.net)

SEH.inc

COMMENT @

    SEH.inc (MASM)
    -------
    ...a lame include file for SEH macro's.

    by yoda

@

;---- STRUCTs ----
sSEH STRUCT
    OrgEsp  DD ?
    OrgEbp  DD ?
    SaveEip DD ?
sSEH ENDS

;---- MACROs ----
InstSehFrame MACRO ContinueAddr
    ASSUME FS : NOTHING

    IFNDEF SehStruct
        SehStruct EQU 1
        .DATA
        SEH sSEH <>
    ENDIF
    .CODE
    MOV SEH.SaveEip, ContinueAddr
    MOV SEH.OrgEbp, EBP
    PUSH OFFSET SehHandler
    PUSH FS:[0]
    MOV SEH.OrgEsp, ESP
    MOV FS:[0], ESP
ENDM

KillSehFrame MACRO
    POP FS:[0]
    ADD ESP, 4
ENDM

;---- ROUTINEs ----
.CODE
SehHandler PROC C pExcept:DWORD,pFrame:DWORD,pContext:DWORD,pDispatch:DWORD

    MOV EAX, pContext
    ASSUME EAX : PTR CONTEXT

    PUSH SEH.SaveEip
    POP [EAX].regEip
    PUSH SEH.OrgEsp
    POP [EAX].regEsp
    PUSH SEH.OrgEbp
    POP [EAX].regEbp

    MOV EAX, ExceptionContinueExecution

    RET
SehHandler ENDP
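The pattern in these macros (remember a safe place together with the stack state, let the risky code run, and on a fault resume at the safe place with the remembered state) has a loose portable analogue in C's `setjmp`/`longjmp`. The sketch below is only that analogy: it does not use Windows SEH, the fault is simulated by an explicit `longjmp` instead of a real CPU exception, and all names are invented for illustration.

```c
#include <assert.h>
#include <setjmp.h>

/* Plays the role of the SEH variable: saved stack state plus resume address. */
static jmp_buf resume_point;

/* Stand-in for the "crash code": instead of a real exception we jump directly
   to the recovery point, as the handler does by patching EIP/ESP/EBP. */
static void simulated_fault(void) {
    longjmp(resume_point, 1);
}

static void harmless(void) {
    /* nothing goes wrong here */
}

/* Comparable to InstSehFrame ... SavePlace: ... KillSehFrame.
   Returns 0 if risky() ran to completion, 1 if execution was resumed
   at the save place after a fault. */
int run_guarded(void (*risky)(void)) {
    if (setjmp(resume_point) == 0) {   /* remember the save place */
        risky();
        return 0;
    }
    return 1;                          /* we arrive here after the "fault" */
}
```

As in SEH.asm, the guarded code does not return normally after a fault; control simply continues at the remembered location.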




Lesson 8 - Trees [5]

5. The following code-snippets were coded by stryker and published at http://board.win32asmcommunity.net. I have added parts of the thread.

CurNode == rootNode
code:

PrefixPrint PROC CurNode:DWORD
    mov eax, CurNode
    or eax, eax
    jnz @F
    ret
@@:
    push eax
    invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET Newline
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET tmpBufr
    pop eax
    push eax
    mov eax, (BINTREE PTR[eax]).ptLeft
    invoke PrefixPrint, eax
    pop eax
    mov eax, (BINTREE PTR[eax]).ptRight
    invoke PrefixPrint, eax
    ret
PrefixPrint ENDP

IDValue == id/key of the structure
code:

CreateNode PROC IDValue:DWORD

    invoke GetProcessHeap
    mov hPrcs, eax
    invoke HeapAlloc, eax, HEAP_ZERO_MEMORY, SIZEOF BINTREE
    mov hMem, eax
    or eax, eax
    jz @F                               ; allocation failed: return NULL
    mov edx, IDValue
    mov (BINTREE PTR [eax]).ID, edx
    mov (BINTREE PTR [eax]).ptLeft, NULL
    mov (BINTREE PTR [eax]).ptRight, NULL
@@:
    ret

CreateNode ENDP

CurNode == rootNode
IDValue == id/key of the structure
code:

FindASpot PROC CurNode:DWORD, IDValue:DWORD

    mov eax, CurNode
    or eax, eax
    jnz @@NodeNotNull
    ret
@@NodeNotNull:
    mov edx, IDValue
    push eax
    push edx
    cmp edx, (BINTREE PTR [eax]).ID
    jb @@GoLeft                         ; unsigned compare, matching ja below
    ja @@GoRight
    invoke MessageBox, NULL, OFFSET BinTreeError, OFFSET BinTreeTitle, MB_OK
    pop edx
    pop eax
    ret
@@GoLeft:
    cmp (BINTREE PTR [eax]).ptLeft, NULL
    jne @@RecurseOnLeft
    push eax
    invoke CreateNode, edx
    pop ecx
    mov (BINTREE PTR [ecx]).ptLeft, eax
    jmp @@FoundASpot
@@RecurseOnLeft:
    mov eax, (BINTREE PTR [eax]).ptLeft
    invoke FindASpot, eax, edx
    jmp @@FoundASpot
@@GoRight:
    cmp (BINTREE PTR [eax]).ptRight, NULL
    jne @@RecurseOnRight
    push eax
    invoke CreateNode, edx
    pop ecx
    mov (BINTREE PTR [ecx]).ptRight, eax
    jmp @@FoundASpot
@@RecurseOnRight:
    mov eax, (BINTREE PTR [eax]).ptRight
    invoke FindASpot, eax, edx
@@FoundASpot:
    pop edx
    pop eax

    ret

FindASpot ENDP


nNode == rootNode
code:

DestroyTree PROC nNode:DWORD

    mov eax, nNode
    or eax, eax
    jnz @F
    ret
@@:
    push eax
    mov eax, (BINTREE PTR[eax]).ptLeft
    invoke DestroyTree, eax
    pop eax
    push eax
    mov eax, (BINTREE PTR[eax]).ptRight
    invoke DestroyTree, eax
    pop eax
    invoke HeapFree, hPrcs, NULL, eax
    ret

DestroyTree ENDP

code:

@@BTN_ADDTOBTREE:
    invoke GetDlgItemInt, hWnd, IDE_INTOBTREE, OFFSET bState, FALSE
    cmp rootNode, NULL
    jne @@rootExists
    invoke CreateNode, eax
    mov rootNode, eax
    jmp @@PRINTCURRENTBINTREE
@@rootExists:
    invoke FindASpot, rootNode, eax
@@PRINTCURRENTBINTREE:
    invoke SetDlgItemText, hWnd, IDE_BINTREEOUTPUT, NULL
    invoke PrefixPrint, rootNode
    jmp @@RETURN_TRUE

Assuming we clicked a button called AddToTree...

1. First we get the key/id number of the node from an edit box.

2. We then check whether a root node exists. If not, we call CreateNode; its return value is a pointer to the memory that was allocated. If it exists, we call the FindASpot procedure, which recurses until it finds the right spot to place the leaf node.

3. Then we print the current members of the tree.
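The three steps can be modelled in C roughly as follows. This is a sketch with invented names, not stryker's code: the walk down the tree is written as a loop where the assembly recurses, duplicate keys are rejected by a return value instead of a message box, and the "print" step collects the ids into an array so that the order can be inspected.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C counterpart of the BINTREE structure. */
typedef struct BinTree {
    unsigned id;
    struct BinTree *left;
    struct BinTree *right;
} BinTree;

BinTree *create_node(unsigned id) {
    BinTree *n = calloc(1, sizeof *n);   /* zeroed, like HEAP_ZERO_MEMORY */
    if (n != NULL) n->id = id;
    return n;
}

/* Walk down from cur until a free slot for id is found.
   Returns 1 if a node was added, 0 if the id already exists or allocation failed. */
int find_a_spot(BinTree *cur, unsigned id) {
    while (cur != NULL) {
        BinTree **slot;
        if (id < cur->id)      slot = &cur->left;
        else if (id > cur->id) slot = &cur->right;
        else return 0;                   /* id already on the tree */
        if (*slot == NULL) {
            *slot = create_node(id);
            return *slot != NULL;
        }
        cur = *slot;
    }
    return 0;
}

/* Step 2 of the list above: create the root if needed, otherwise find a spot. */
int add_to_tree(BinTree **root, unsigned id) {
    if (*root == NULL) {
        *root = create_node(id);
        return *root != NULL;
    }
    return find_a_spot(*root, id);
}

/* Step 3: prefix order (node, left, right). Writes ids to out, returns new count. */
size_t prefix_collect(const BinTree *n, unsigned *out, size_t pos) {
    if (n == NULL) return pos;
    out[pos++] = n->id;
    pos = prefix_collect(n->left, out, pos);
    return prefix_collect(n->right, out, pos);
}
```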

I forgot to add these two things on how to print the tree.
code:

InfixPrint PROC CurNode:DWORD

    mov eax, CurNode
    or eax, eax
    jnz @F
    ret
@@:
    push eax
    mov eax, (BINTREE PTR[eax]).ptLeft
    invoke InfixPrint, eax
    pop eax
    push eax
    invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET Newline
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET tmpBufr
    pop eax
    mov eax, (BINTREE PTR[eax]).ptRight
    invoke InfixPrint, eax
    ret

InfixPrint ENDP

PostfixPrint PROC CurNode:DWORD
    mov eax, CurNode
    or eax, eax
    jnz @F
    ret
@@:
    push eax
    mov eax, (BINTREE PTR[eax]).ptLeft
    invoke PostfixPrint, eax
    pop eax
    push eax
    mov eax, (BINTREE PTR[eax]).ptRight
    invoke PostfixPrint, eax
    pop eax
    invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET Newline
    invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL, OFFSET tmpBufr
    ret
PostfixPrint ENDP



If spring semester is over I'll release the whole source. Here are the structures and "variables" I used.
code:

_DATA SEGMENT
    Newline      DB 0Dh, 0Ah, 0
    BinTreeTitle DB "stryker", 0
    BinTreeError DB "Cannot Add To The Tree. Current", 0Dh, 0Ah
                 DB "ID Value Already Exists On The Tree.", 0
_DATA ENDS

_BSS SEGMENT
    tmpBufr  DB 12 DUP(?)               ; room for a signed 32-bit decimal string plus 0
    bState   DD ?
    hPrcs    DD ?
    hMem     DD ?
    rootNode DD ?
    wHndle   DD ?
_BSS ENDS

BINTREE STRUCT
    ID      DD ?
    ptLeft  DD ?
    ptRight DD ?
BINTREE ENDS

Here's another one on searching for a node in the tree.
code:

BSearch PROC nNode:DWORD, IDValue:DWORD

    mov eax, nNode
    mov edx, IDValue

@@SwingNode:

    cmp eax, NULL
    je @@BSearchExit

    cmp edx, (BINTREE PTR [eax]).ID
    jb @@GoLeft                         ; unsigned compare, matching ja below
    ja @@GoRight
    ret                                 ; found: eax = pointer to the node

@@GoLeft:

    mov eax, (BINTREE PTR [eax]).ptLeft
    jmp @@SwingNode

@@GoRight:

    mov eax, (BINTREE PTR [eax]).ptRight
    jmp @@SwingNode

@@BSearchExit:

    xor eax, eax
    ret

BSearch ENDP

The return value will be in EAX. If it returns 0 then the key/id doesn't exist, else ...

Preliminary call: invoke BSearch, rootNode, IDorKeyToSearch

Here's another update. This one will count all the nodes in a tree.
code:

BCount PROC nNode:DWORD, nCount:DWORD

    mov eax, nCount
    mov ecx, nNode
    or ecx, ecx
    jz @F
    inc eax
    push ecx
    mov ecx, (BINTREE PTR [ecx]).ptLeft
    invoke BCount, ecx, eax
    pop ecx
    mov ecx, (BINTREE PTR [ecx]).ptRight
    invoke BCount, ecx, eax
@@:
    ret
BCount ENDP

Return value will be in eax.

Preliminary Call: invoke BCount, rootNode, 0

Here's another update:
code:

Max PROC A:DWORD, B:DWORD

    mov eax, A
    mov edx, B
    cmp eax, edx
    jb @F
    ret
@@:
    mov eax, edx
    ret
Max ENDP

BHeight PROC nNode:DWORD

    mov ecx, nNode
    or ecx, ecx
    jnz @F
    mov eax, -1
    ret
@@:
    push ecx
    mov ecx, (BINTREE PTR [ecx]).ptLeft
    invoke BHeight, ecx
    pop ecx
    push eax
    mov ecx, (BINTREE PTR [ecx]).ptRight
    invoke BHeight, ecx
    pop edx
    inc eax
    inc edx
    invoke Max, eax, edx
    ret

BHeight ENDP

BLevel PROC nNode:DWORD

    mov ecx, nNode
    or ecx, ecx
    jnz @F
    xor eax, eax
    ret
@@:
    push ecx
    mov ecx, (BINTREE PTR [ecx]).ptLeft
    invoke BLevel, ecx
    pop ecx
    push eax
    mov ecx, (BINTREE PTR [ecx]).ptRight
    invoke BLevel, ecx
    pop edx
    inc eax
    inc edx
    invoke Max, eax, edx
    ret

BLevel ENDP

- BHeight will give you the height of the tree.
- BLevel will give you the level of the tree.
- Preliminary Call: invoke Function, rootNode
- Return Value/s: in EAX.

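For readers who want to check their understanding of BSearch, BCount, BHeight and BLevel, here is a compact C model of the four routines. It is a sketch with invented names: the count is returned directly instead of being threaded through a second parameter, and the level is expressed as height plus one, which matches the values the assembly produces (an empty tree has height -1 and level 0, a single node has height 0 and level 1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of the BINTREE structure. */
typedef struct BinTree {
    unsigned id;
    struct BinTree *left;
    struct BinTree *right;
} BinTree;

/* Returns the node with the given id, or NULL if it is not on the tree. */
const BinTree *b_search(const BinTree *n, unsigned id) {
    while (n != NULL && n->id != id)
        n = (id < n->id) ? n->left : n->right;
    return n;
}

/* Number of nodes on the tree. */
size_t b_count(const BinTree *n) {
    if (n == NULL) return 0;
    return 1 + b_count(n->left) + b_count(n->right);
}

/* Height in edges: -1 for an empty tree, 0 for a single node. */
int b_height(const BinTree *n) {
    if (n == NULL) return -1;
    int l = b_height(n->left);
    int r = b_height(n->right);
    return 1 + (l > r ? l : r);
}

/* Level count: 0 for an empty tree, 1 for a single node. */
int b_level(const BinTree *n) {
    return b_height(n) + 1;
}
```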
Here's the final installment for the binary trees. I don't know if this one works perfectly, but I did my best to hunt down the bugs.

To remove a node from the tree, just pass the root node and the key of the node to delete and call BDelete.
code:

BRemove PROC nParent:DWORD, nNode:DWORD

    mov ecx, nParent
    mov eax, nNode

    cmp (BINTREE PTR [eax]).ptLeft, NULL
    jne @@CheckRightNode
    cmp (BINTREE PTR [eax]).ptRight, NULL
    jne @@ChildOnRight

    ;Leaf Node

    cmp eax, rootNode
    jne @F
    mov rootNode, 0
    jmp @@DeallocateNode
@@:
    cmp (BINTREE PTR [ecx]).ptLeft, eax
    jne @@NullifyRight
    mov (BINTREE PTR [ecx]).ptLeft, NULL
    jmp @@DeallocateNode

@@NullifyRight:

    mov (BINTREE PTR [ecx]).ptRight, NULL
    jmp @@DeallocateNode

@@CheckRightNode:

    cmp (BINTREE PTR [eax]).ptRight, NULL
    jne @@TwoChildren

    ;Child On Left

    or ecx, ecx
    jnz @F

    mov ecx, (BINTREE PTR [eax]).ptLeft
    mov rootNode, ecx
    jmp @@DeallocateNode

@@:
    cmp (BINTREE PTR [ecx]).ptLeft, eax
    je @@JoinLeft

    mov edx, (BINTREE PTR [eax]).ptLeft
    mov (BINTREE PTR [ecx]).ptRight, edx
    jmp @@DeallocateNode

@@JoinLeft:

    mov edx, (BINTREE PTR [eax]).ptLeft
    mov (BINTREE PTR [ecx]).ptLeft, edx
    jmp @@DeallocateNode

@@ChildOnRight:

    ;Child On Right

    or ecx, ecx
    jnz @F

    mov ecx, (BINTREE PTR [eax]).ptRight
    mov rootNode, ecx
    jmp @@DeallocateNode

@@:
    cmp (BINTREE PTR [ecx]).ptLeft, eax
    je @@JoinRight

    mov edx, (BINTREE PTR [eax]).ptRight
    mov (BINTREE PTR [ecx]).ptRight, edx
    jmp @@DeallocateNode

@@JoinRight:

    mov edx, (BINTREE PTR [eax]).ptRight
    mov (BINTREE PTR [ecx]).ptLeft, edx
    jmp @@DeallocateNode

@@TwoChildren:

    ;Two Child Nodes

    mov edx, eax
    mov ecx, eax
    mov eax, (BINTREE PTR [eax]).ptLeft

@@FindTheLargestKey:

    cmp (BINTREE PTR [eax]).ptRight, NULL
    je @@Replace
    mov ecx, eax
    mov eax, (BINTREE PTR [eax]).ptRight
    jmp @@FindTheLargestKey

@@Replace:

    ;Just copy the contents to its new location
    ;(the field is called ID in the BINTREE structure)

    push (BINTREE PTR [eax]).ID
    pop (BINTREE PTR [edx]).ID

    ;Process other structure field names.

    ;Revert to the 2 cases above. Because the one to replace
    ;cannot be and will not have 2 child nodes. The one to replace
    ;will either fall into cases 1 and 2 which is either a leaf node
    ;or a node with only one child.

    invoke BRemove, ecx, eax
    ret

@@DeallocateNode:

    invoke HeapFree, hPrcs, NULL, eax
    ret

BRemove ENDP

BDelete PROC nNode:DWORD, IDValue:DWORD

    mov eax, nNode
    mov edx, IDValue
    xor ecx, ecx                        ; no parent yet (the root has none)

@@SwingNode:

    or eax, eax
    jz @@BSearchExit

    cmp edx, (BINTREE PTR [eax]).ID
    jb @@GoLeft                         ; unsigned compare, matching ja below
    ja @@GoRight

    invoke BRemove, ecx, eax
    ret
@@GoLeft:

    mov ecx, eax
    mov eax, (BINTREE PTR [eax]).ptLeft
    jmp @@SwingNode

@@GoRight:

    mov ecx, eax
    mov eax, (BINTREE PTR [eax]).ptRight
    jmp @@SwingNode

@@BSearchExit:

    xor eax, eax
    ret

BDelete ENDP
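The three removal cases (leaf, one child, two children) can be summarised in a short C sketch. It follows the same idea as BRemove, namely replacing a node that has two children by the largest key of its left subtree, but it is not a translation: the names are invented, and instead of carrying a parent pointer and testing "am I the left or the right child" it keeps a pointer to the link that points at the current node, which also covers the root.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C counterpart of the BINTREE structure. */
typedef struct BinTree {
    unsigned id;
    struct BinTree *left;
    struct BinTree *right;
} BinTree;

/* Helper for the example: insert id, returns 0 on duplicate or failed allocation. */
int b_insert(BinTree **root, unsigned id) {
    BinTree **link = root;
    while (*link != NULL) {
        if (id == (*link)->id) return 0;
        link = (id < (*link)->id) ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof **link);
    if (*link == NULL) return 0;
    (*link)->id = id;
    return 1;
}

/* Remove the node with the given id. Returns 1 if removed, 0 if not found. */
int b_delete(BinTree **root, unsigned id) {
    BinTree **link = root;                 /* the pointer that points at the node */
    while (*link != NULL && (*link)->id != id)
        link = (id < (*link)->id) ? &(*link)->left : &(*link)->right;

    BinTree *node = *link;
    if (node == NULL) return 0;

    if (node->left != NULL && node->right != NULL) {
        /* Two children: copy the largest key of the left subtree into this
           node, then remove that other node, which has at most one child. */
        BinTree **pred = &node->left;
        while ((*pred)->right != NULL) pred = &(*pred)->right;
        node->id = (*pred)->id;
        link = pred;
        node = *pred;
    }

    /* Leaf or single child: hook the (possibly NULL) child into the link. */
    *link = (node->left != NULL) ? node->left : node->right;
    free(node);
    return 1;
}
```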


CHAPTER 4 The Basic Skeleton Of The Disassembler

Before we start coding our disassembler, we should define what it will look like.

First we need to design the GUI. There we define which buttons, lists and menu items we need.

Next we try to modularise the project. We need to define what functionality the disassembler needs and try to package it into modules. If you are experienced in coding in HLLs like C++, you know that you normally package your "real" procedures into modules, procedures and libraries. We do the same here.

Imagine: your code is getting longer and longer and you have just one file! Your wheel mouse will be thankful if you avoid that…

Another important point is that changing a module is faster than making the same change in one big long file.

This chapter is the beginning of our "real" disassembler. Even experienced users should have a look at it, because this code design is our base layout. Sure, you can adapt the layout later to your needs, but please let us all talk about the same thing.


To go further with our disassembler we should have the same design and GUI for all readers.

Well, this is it...


This skeleton shows us HOW the disassembler will look after we are finished. As you can see, we will get all the necessary information about our file: ImageBase, EntryPoint RVA and File Offset, the number of sections and much more.

So now we have designed the GUI of the disassembler, and as you can see it normally reflects all the modules and procedures we need to code.

But where can we start with it?

Well, designing a software product should always be where you begin. Never start by hacking some code into your IDE... when the project grows, you will be lost in your own source and debugging will be a pain!

In this chapter we will make a working skeleton of our disassembler-engine with its GUI.

Therefore we will need the following files:

AodBasicDisasm.Asm
AodBasicDisasm.rc
Const.inc
Idata.inc
Main.inc
PE.asm
Protos.inc
Struct.inc
Types.inc
Udata.inc
The resource files

We will discuss each file on its own. First we look at the main files, then at the include files, and last we take a deeper look at the PE file.




Part 1 - AodBasicDisasm.asm

As you can guess, this is the main file of our disassembler engine. Here we "draw" our GUI for the engine and include all necessary files.

One of the necessary commands is

include Main.inc ;Libraries, Definitions & Modules

Here we bind the needed libraries and modules into the engine. We will take a deeper look into these files later.

This is what I have learned from the Iczelion tutorials: I will now give you the source code of this file, and then we will discuss it nearly line by line...

;======================================================================
; AoD Basic Disassembler
; http://aod.anticrack.de/
;======================================================================

.686
.model flat, stdcall  ;32 bit memory model
option casemap :none  ;case sensitive

include Main.inc  ;Libraries, Definitions & Modules

.code

start:
    invoke GetModuleHandle,NULL  ;Get the Main hInstance
    mov hInstance,eax
    mov icex.dwICC,ICC_PROGRESS_CLASS
    invoke InitCommonControlsEx,addr icex
    mov AllocatedMem,0  ;First use
    invoke LoadIconA,hInstance,IDI_ICON
    mov hIcon,eax
    invoke DialogBoxParam,hInstance,IDD_MAIN,NULL,addr DlgProc,NULL  ;Show Main Dialog
    invoke ExitProcess,0

;>-- Dialog Proc --<;
DlgProc proc uses esi edi ebx ebp hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
    push hWin
    pop hWnd  ;Store Dialog Window Handle
    mov eax,uMsg  ;Window Msg in EAX
    .if eax==WM_INITDIALOG
        ;-------------------------- Dialog Init ------------------
        invoke SendMessage,hWnd,WM_SETICON,ICON_SMALL,hIcon  ;Set Icon
        invoke GetDlgItem,hWin,IDC_DISASM  ;Get some handles
        mov hDisassembler,eax
        invoke GetDlgItem,hWnd,IDC_STATUSBAR
        mov hStatusbar,eax
        invoke GetDlgItem,hWnd,IDC_PROGRESS
        mov hProgressbar,eax
        invoke lstrcpy,addr lf.lfFaceName,addr FontC  ;Set Font
        mov lf.lfCharSet,DEFAULT_CHARSET  ;CharSet
        mov lf.lfHeight,-12  ;Height
        mov lf.lfWidth,FW_DONTCARE  ;Width
        mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_MODERN  ;Pitch & Family
        invoke CreateFontIndirect,addr lf  ;Create & Get Font Handle
        mov hLfnt,eax  ;Store Font Handle
        invoke SendMessage,hDisassembler,WM_SETFONT,hLfnt,FALSE  ;Set Font for Disassembler
        invoke lstrcpy,addr lf.lfFaceName,addr FontT  ;Set Font
        mov lf.lfHeight,-11  ;Height
        mov lf.lfWidth,FW_DONTCARE  ;Width
        mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_DONTCARE  ;Pitch & Family
        invoke CreateFontIndirect,addr lf  ;Create & Get Font Handle
        mov hSfnt,eax  ;Store Font Handle
        invoke SendMessage,hStatusbar,WM_SETFONT,hSfnt,FALSE  ;Set Font for Status
        invoke CreateSolidBrush,dwListBoxBack  ;ListBox BackGround
        mov hListBoxBack,eax  ;Store the brush handle
    .elseif eax==WM_CTLCOLORLISTBOX
        ;----------------- Colorize Our Disassembler -----
        mov eax,wParam  ;wParam = Handle to HDC
        mov ebx,lParam  ;lParam = Control Handle
        .if ebx==hDisassembler
            invoke SetTextColor,eax,dwDisasmFore  ;Set ForeGround Color
        .elseif ebx==hStatusbar
            invoke SetTextColor,eax,dwStatusFore  ;Set ForeGround Color
        .endif
        mov eax,wParam  ;wParam = Handle to HDC
        invoke SetBkColor,eax,dwListBoxBack  ;Set BackGround Color
        mov eax,hListBoxBack  ;Return the brush handle
        ret
    .elseif eax==WM_COMMAND
        ;---------------------------- WM_COMMAND -----------------
        mov eax,wParam
        .if ax==IDM_OPEN
            ;----- MenuItem OPEN ---------------------------------
            invoke ResetVars  ;Reset Variables & Close Files if needed
            invoke OpenTheFile  ;Open the file to be disassembled
            cmp eax,0  ;If the function succeeds the file is mapped in memory
            jz ErrInOpening
            invoke CheckPE  ;Check for valid PE file
            cmp eax,0
            jz ErrInPE
            invoke DisplayWelcome
            ;invoke DisassembleFile, CodeSection, dwCodeSize  ;Disassemble it!
            ;invoke AddLine, offset disNewLine
            ;invoke AddLine, offset disEnd
        .elseif ax==IDM_GOTOOFFSET
            ;----- MenuItem GOTO OFFSET ----------------------------
            invoke DialogBoxParam,hInstance,IDD_GOTOOFFSET,hWin,addr GotoOffsetDlgProc,NULL
        .elseif ax==IDM_GOTOENTRY
            ;----- MenuItem GOTO ENTRY POINT -----------------------
            invoke SendMessage,hDisassembler,LB_FINDSTRING,-1,addr szEntryPoint
            cmp eax,LB_ERR
            jz NotFound
            invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,0  ;If found, move the cursor at this position
            ret
NotFound:
            invoke MessageBeep,-1  ;If not, BEEPs
            ret
        .elseif ax==IDM_ABOUT
            ;----- MenuItem ABOUT -------------------------------------
            invoke MessageBox,hWnd,addr About,addr CapAbout,MB_OK  ;Show About Box
        .elseif ax==IDM_EXIT
            ;----- MenuItem EXIT --------------------------------------
            invoke SendMessage,hWnd,WM_CLOSE,NULL,NULL  ;Same as WM_CLOSE
        .endif
    .elseif eax==WM_CLOSE
        ;------------------------------ WM_CLOSE --------------------------
ErrInOpening:
        invoke MessageBox,hWnd,addr AreYouSure,addr Exit,MB_YESNO
        cmp eax,IDYES
        jnz NoExit
        invoke DeleteObject,hListBoxBack  ;Delete the brush
        invoke DeleteObject,hLfnt  ;Delete Font Handles
        invoke DeleteObject,hSfnt
        invoke ResetVars  ;Close Files
        invoke EndDialog,hWnd,0  ;The End
    .else
NoExit:
ErrInPE:
        mov eax,FALSE
        ret
    .endif
    mov eax,TRUE
    ret
DlgProc endp
end start




Part 1 - Discussion of AodBasicDisasm.asm

First we have to define our program:

______________________________________________________________________
.686
.model flat, stdcall  ;32 bit memory model
option casemap :none  ;case sensitive

include Main.inc  ;Libraries, Definitions & Modules
______________________________________________________________________

Well, this is not very impressive. We define our memory model and include our Main.inc file, which pulls in our libraries, definitions and modules.

______________________________________________________________________
start:
    invoke GetModuleHandle,NULL  ;Get the Main hInstance
    mov hInstance,eax
    mov icex.dwICC,ICC_PROGRESS_CLASS
    invoke InitCommonControlsEx,addr icex
    mov AllocatedMem,0  ;First use
    invoke LoadIconA,hInstance,IDI_ICON
    mov hIcon,eax
    invoke DialogBoxParam,hInstance,IDD_MAIN,NULL,addr DlgProc,NULL  ;Show Main Dialog
    invoke ExitProcess,0
______________________________________________________________________

Here we have the main routine of our disassembler. We initialise the common controls, set AllocatedMem to 0 to mark that no file is mapped yet, load the icon of our application, show the application as a dialog box and finally exit.


______________________________________________________________________
DlgProc proc uses esi edi ebx ebp hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
______________________________________________________________________

Here we define our main dialog procedure. I don't have to explain this...

The next block contains the routine which sets up our GUI when the disassembler starts (I won't go into details here, since this is not an assembly course):
______________________________________________________________________
.if eax==WM_INITDIALOG
    ;-------------------------- Dialog Init ------------------
    invoke SendMessage,hWnd,WM_SETICON,ICON_SMALL,hIcon  ;Set Icon
    invoke GetDlgItem,hWin,IDC_DISASM  ;Get some handles
    mov hDisassembler,eax
    invoke GetDlgItem,hWnd,IDC_STATUSBAR
    mov hStatusbar,eax
    invoke GetDlgItem,hWnd,IDC_PROGRESS
    mov hProgressbar,eax
    invoke lstrcpy,addr lf.lfFaceName,addr FontC  ;Set Font
    mov lf.lfCharSet,DEFAULT_CHARSET  ;CharSet
    mov lf.lfHeight,-12  ;Height
    mov lf.lfWidth,FW_DONTCARE  ;Width
    mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_MODERN  ;Pitch & Family
    invoke CreateFontIndirect,addr lf  ;Create & Get Font Handle
    mov hLfnt,eax  ;Store Font Handle
    invoke SendMessage,hDisassembler,WM_SETFONT,hLfnt,FALSE  ;Set Font for Disassembler
    invoke lstrcpy,addr lf.lfFaceName,addr FontT  ;Set Font
    mov lf.lfHeight,-11  ;Height
    mov lf.lfWidth,FW_DONTCARE  ;Width
    mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_DONTCARE  ;Pitch & Family
    invoke CreateFontIndirect,addr lf  ;Create & Get Font Handle
    mov hSfnt,eax  ;Store Font Handle
    invoke SendMessage,hStatusbar,WM_SETFONT,hSfnt,FALSE  ;Set Font for Status
    invoke CreateSolidBrush,dwListBoxBack  ;ListBox BackGround
    mov hListBoxBack,eax  ;Store the brush handle
______________________________________________________________________




Next we add some color to our ListBox elements, since we want differences in the disassembled code to be easier to see:
______________________________________________________________________
.elseif eax==WM_CTLCOLORLISTBOX
    ;----------------- Colorize Our Disassembler -----
    mov eax,wParam  ;wParam = Handle to HDC
    mov ebx,lParam  ;lParam = Control Handle
    .if ebx==hDisassembler
        invoke SetTextColor,eax,dwDisasmFore  ;Set ForeGround Color
    .elseif ebx==hStatusbar
        invoke SetTextColor,eax,dwStatusFore  ;Set ForeGround Color
    .endif
    mov eax,wParam  ;wParam = Handle to HDC
    invoke SetBkColor,eax,dwListBoxBack  ;Set BackGround Color
    mov eax,hListBoxBack  ;Return the brush handle
    ret
______________________________________________________________________

And now we can start with the interesting part: our disassembling routines! First we have to check which command the user has sent:
______________________________________________________________________
.elseif eax==WM_COMMAND
    ;---------------------------- WM_COMMAND -----------------
    mov eax,wParam
______________________________________________________________________


Next comes THE disassembler main routine! For now there are 3 things we want to handle:

1. Open a file
2. Go to a specific offset
3. Go to the Entry Point

Don't be shocked! The next lines are few and do not contain much information!
______________________________________________________________________
.if ax==IDM_OPEN
    ;----- MenuItem OPEN ---------------------------------
    invoke ResetVars  ;Reset Variables & Close Files if needed
    invoke OpenTheFile  ;Open the file to be disassembled
    cmp eax,0  ;If the function succeeds the file is mapped in memory
    jz ErrInOpening
    invoke CheckPE  ;Check for valid PE file
    cmp eax,0
    jz ErrInPE
    invoke DisplayWelcome
    ;invoke DisassembleFile, CodeSection, dwCodeSize  ;Disassemble it!
    ;invoke AddLine, offset disNewLine
    ;invoke AddLine, offset disEnd
______________________________________________________________________

As we can see, we first reset all variables before we do anything else. This is necessary in case another disassembled file is still in memory. Imagine we mixed up offsets or other data from 2 different files...

After this we call a routine which opens the wanted file and maps it into memory. However this function works internally, it does the job. We will take a deeper look at it later.
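To make the idea of "mapping the file into memory" a bit more concrete before we reach the real OpenTheFile routine, here is a rough sketch in Python rather than MASM. The function name map_file_readonly is our own invention for this illustration, and the sketch only mirrors the idea of a read-only view over the whole file; it is not a translation of the assembly code.

```python
import mmap


def map_file_readonly(path):
    """Return (view, size): a read-only memory view of the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ gives a read-only view,
        # comparable in spirit to PAGE_READONLY / FILE_MAP_READ.
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The file object is closed here, but the view stays valid,
    # much like closing hFile while the mapping remains open.
    return view, len(view)
```

The caller is responsible for calling view.close() when a new file is opened, which is the job ResetVars does for the real mapping.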

Before we come to the interesting part we do some small error handling (the cmp eax,0 / jz ErrInOpening pair).

Now we are ready to do the important parts...
______________________________________________________________________
invoke CheckPE  ;Check for valid PE file
______________________________________________________________________

This is an important function! We have to check that we have a valid file. If not, we should stop the disassembling process, or our machine may hang or worse!
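What "valid" means here can be sketched in a few lines of Python: the file must start with the 'MZ' signature, and the e_lfanew field at offset 0x3C must point to the 'PE' signature. The helper name find_pe_header and the bounds checks are our own assumptions for this sketch; the CheckPE routine shown later does more work than this.

```python
import struct


def find_pe_header(data):
    """Return the offset of the PE header, or None if data is no PE image."""
    # A DOS header is 0x40 bytes long and starts with 'MZ'.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew (a little-endian DWORD at 0x3C) holds the PE header offset.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # The PE signature is the four bytes 'P', 'E', 0, 0.
    if e_lfanew + 4 > len(data) or data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return None
    return e_lfanew
```

Returning None for every failure keeps the caller simple: it stops the disassembling process as soon as the check fails.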

Well, after this we show a little "Welcome Message" - whatever this means. We don't have to know yet.
______________________________________________________________________
;invoke DisassembleFile, CodeSection, dwCodeSize  ;Disassemble it!
;invoke AddLine, offset disNewLine
;invoke AddLine, offset disEnd
______________________________________________________________________

Yes. This is the heart of our main application. Finally we have reached the core. The core contains 3 main procedure calls (still commented out at this stage): disassembling the file and adding our output so that we can see it in our GUI.
______________________________________________________________________
.elseif ax==IDM_GOTOOFFSET
    ;----- MenuItem GOTO OFFSET ----------------------------
    invoke DialogBoxParam,hInstance,IDD_GOTOOFFSET,hWin,addr GotoOffsetDlgProc,NULL
______________________________________________________________________

This handles our offset problem. Whatever it does is not important here. We will take a deeper look at this later as well.


______________________________________________________________________
.elseif ax==IDM_GOTOENTRY
    ;----- MenuItem GOTO ENTRY POINT -----------------------
    invoke SendMessage,hDisassembler,LB_FINDSTRING,-1,addr szEntryPoint
    cmp eax,LB_ERR
    jz NotFound
    invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,0  ;If found, move the cursor at this position
    ret
______________________________________________________________________

This routine handles the jump to our entry point. As I said before: we will discuss this later.
______________________________________________________________________
NotFound:
    invoke MessageBeep,-1  ;If not, BEEPs
    ret
.elseif ax==IDM_ABOUT
    ;----- MenuItem ABOUT -------------------------------------
    invoke MessageBox,hWnd,addr About,addr CapAbout,MB_OK  ;Show About Box
.elseif ax==IDM_EXIT
    ;----- MenuItem EXIT --------------------------------------
    invoke SendMessage,hWnd,WM_CLOSE,NULL,NULL  ;Same as WM_CLOSE
.endif
.elseif eax==WM_CLOSE
    ;------------------------------ WM_CLOSE --------------------------
______________________________________________________________________

Here we handle the rest of our possible commands and something we never want: NotFound is our error path (a beep) if we do not find the entry point.
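The jump to the entry point relies on every disassembly line starting with its address in the form " %08X:" (the findEP template in Idata.inc), so that LB_FINDSTRING can locate the line by its prefix. The same idea can be sketched in Python; the function name find_line and the sample lines in the usage note are made up for this illustration.

```python
def find_line(lines, address):
    """Return the index of the first line that starts with ' XXXXXXXX:', or -1."""
    prefix = " %08X:" % address
    for index, text in enumerate(lines):
        # LB_FINDSTRING matches prefixes without regard to case,
        # so we compare upper-cased text against the upper-case prefix.
        if text.upper().startswith(prefix):
            return index
    return -1  # plays the role of LB_ERR
```

For example, with the lines " 00401000: push ebp" and " 00401001: mov ebp,esp", find_line(lines, 0x401001) selects the second line, while an address that is not listed yields -1 and would trigger the beep.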




______________________________________________________________________
ErrInOpening:
    invoke MessageBox,hWnd,addr AreYouSure,addr Exit,MB_YESNO
    cmp eax,IDYES
    jnz NoExit
    invoke DeleteObject,hListBoxBack  ;Delete the brush
    invoke DeleteObject,hLfnt  ;Delete Font Handles
    invoke DeleteObject,hSfnt
    invoke ResetVars  ;Close Files
    invoke EndDialog,hWnd,0  ;The End
.else
______________________________________________________________________

Here we end up when we cannot open the wanted file. Maybe it is damaged or opened by another application - who knows, but we handle this. Note that the ErrInOpening label sits at the top of the WM_CLOSE handler, so a failed open shows the same exit confirmation as closing the dialog.

Well, this was the easy beginning of our disassembler engine. I promise that you will get much harder stuff when we go into details.


Part 2 - PE.asm

.code

;>-- Get Sections Info --<;
GetSections proc uses esi edi ebx  ;esi points to PE-HEADER
    assume esi:ptr IMAGE_NT_HEADERS
    xor eax,eax
    mov ax,word ptr [esi].FileHeader.NumberOfSections  ;Get # of Sections
    mov wSections,ax  ;Store it
    push eax
    invoke wsprintfA,addr StatusText,addr stTempSections,eax  ;Display in Status
    invoke SetStatus,addr StatusText
    pop eax
    push eax
    invoke wsprintf,addr StatusText,addr stHex,eax
    invoke SetDlgItemText,hWnd,IDC_SECTIONS,addr StatusText
    pop eax  ;Display correct number of sections
    cmp ax,MAX_SECTIONS  ;But check if they fit on sections buffer
    jbe NSectionsOk  ;And adjust if they don't
    mov wSections,MAX_SECTIONS

NSectionsOk:
    add esi,sizeof IMAGE_NT_HEADERS  ;1st Section's name (esi points to IMAGE_SECTION_HEADER)
    assume edi: ptr SECTION
    lea edi,FileSections  ;edi points to Section's data
    assume esi: ptr IMAGE_SECTION_HEADER  ;Assume esi as an IMAGE_SECTION_HEADER
    assume edi: ptr SECTION  ;Assume edi as a SECTION
    xor ebx,ebx  ;Section Index = 0

GetSectionsInfo:
    push esi
    push edi
    mov ecx,8  ;Section's Name Length
    rep movsb  ;Copy Name
    pop edi
    pop esi
    mov ax,word ptr [esi].Misc.VirtualSize  ;VirtualSize
    mov word ptr [edi].VirtualSize,ax
    mov ax,word ptr [esi].VirtualAddress  ;VirtualAddress
    mov word ptr [edi].VirtualAddress,ax
    mov ax,word ptr [esi].SizeOfRawData  ;PhysicalSize
    mov word ptr [edi].RawSize,ax
    mov ax,word ptr [esi].PointerToRawData  ;PhysicalOffset
    mov word ptr [edi].RawAddress,ax
    mov eax,[esi].Characteristics  ;Characteristics
    mov [edi].Characteristics,eax
    add esi,sizeof IMAGE_SECTION_HEADER  ;Next Section (Source)
    add edi,sizeof SECTION  ;Next Section (Destination)
    inc ebx  ;Inc Section Index
    cmp bx,word ptr [wSections]  ;Last Section?
    jnz GetSectionsInfo
    ret
GetSections endp


;>-- Display Sections Info --<;
DisplaySections proc uses esi
    LOCAL dwVOffset:DWORD,\
          dwVSize:DWORD,\
          dwROffset:DWORD,\
          dwRSize:DWORD,\
          dwChars:DWORD

    assume esi: ptr SECTION
    lea esi,FileSections
    xor ecx,ecx  ;Section Index = 0

ShowSections:
    push ecx
    movzx eax,word ptr [esi].VirtualAddress  ;VirtualAddress
    mov dwVOffset,eax
    mov ax,word ptr [esi].VirtualSize  ;VirtualSize
    mov dwVSize,eax
    mov ax,word ptr [esi].RawAddress  ;PhysicalOffset
    mov dwROffset,eax
    mov ax,word ptr [esi].RawSize  ;PhysicalSize
    mov dwRSize,eax
    mov eax,dword ptr [esi].Characteristics  ;Characteristics
    mov dwChars,eax
    invoke wsprintfA,addr StatusText,addr stSectionsFound,\
           dwVOffset,dwVSize,dwROffset,dwRSize,dwChars  ;Display In Status ListBox
    invoke lstrcat,addr StatusText,esi  ;Append the section's name
    invoke lstrcat,addr StatusText,addr stRightBracket
    invoke SetStatus,addr StatusText
    add esi,sizeof SECTION  ;Next Section
    pop ecx  ;Restore Index
    inc cx  ;Next Section
    cmp cx,word ptr [wSections]  ;Last Section?
    jb ShowSections
    ret
DisplaySections endp




;>-- Detects Code Section --<;
GetCodeSection proc uses esi ebx
    assume esi: ptr SECTION
    movzx eax,EntryPointRVA
    push eax
    invoke wsprintf,addr StatusText,addr stHex,eax
    invoke SetDlgItemText,hWnd,IDC_EPRVA,addr StatusText
    pop eax
    push eax
    invoke RVAToOffset,eax  ;Convert EntryPointRVA to Offset
    mov EntryPointOffset,ax  ;eax = EntryPointOffset
    invoke wsprintf,addr StatusText,addr stHex,eax
    invoke SetDlgItemText,hWnd,IDC_EPOFFSET,addr StatusText
    pop ebx
    add ebx,ImageBase  ;ebx = ImageBase + EntryPointRVA
    invoke wsprintf,addr szEntryPoint,addr findEP,ebx
    movzx ecx,byte ptr [CodeSectionIndex]  ;Get Code Section Index
    mov eax,sizeof SECTION
    mul ecx
    add eax,offset FileSections  ;Get the stored Code Section
    mov esi,eax
    movzx eax,word ptr [esi].RawAddress  ;Get Code Section PhysicalOffset
    mov dword ptr [CodeSection],eax
    invoke wsprintf,addr StatusText,addr stEntryPoint,ebx,eax
    invoke SetStatus,addr StatusText
    mov ecx,AllocatedMem
    add [CodeSection],ecx  ;CodeSection = Offset of Code Section in memory
    movzx eax,word ptr [esi].VirtualSize  ;Get Code Section Size
    mov dword ptr [dwCodeSize],eax  ;Store Code Size
    movzx eax,word ptr [esi].VirtualAddress
    add [VirtualAddr],eax  ;VirtualAddr holds the Virtual Address of the first
    ret  ;instruction in code section
GetCodeSection endp


;>-- Check For Valid PE --<;
CheckPE proc uses esi ebx
    mov esi,AllocatedMem  ;esi points to beginning of mapped file
    cmp word ptr [esi],IMAGE_DOS_SIGNATURE  ;Check For 'MZ'
    jnz NotValidMZ  ;Jump if not valid
    invoke SetStatus,addr stValidMZ  ;Valid 'MZ' Signature!
    assume esi: ptr IMAGE_DOS_HEADER
    movzx eax,word ptr [esi].e_lfanew  ;Get The PE Offset
    add esi,eax  ;esi points to PE Header
    cmp word ptr [esi],IMAGE_NT_SIGNATURE  ;Check For 'PE'
    jnz NotValidPE  ;Jump if not valid
    push esi  ;Store Pointer to PE Header
    add esi,sizeof IMAGE_NT_HEADERS - sizeof IMAGE_OPTIONAL_HEADER32
    assume esi: ptr IMAGE_OPTIONAL_HEADER32
    mov ax,word ptr [esi].AddressOfEntryPoint  ;Entry Point RVA
    mov word ptr [EntryPointRVA],ax
    mov ebx,dword ptr [esi].ImageBase  ;ImageBase
    mov ImageBase,ebx
    invoke wsprintf,addr StatusText,addr stHex,ebx
    invoke SetDlgItemText,hWnd,IDC_IMAGEBASE,addr StatusText
    movzx eax,word ptr [esi].Subsystem  ;SubSystem
    mov eax,SubSystem[eax*4]  ;Retrieve SubSystem Text from Array
    invoke wsprintf,addr StatusText,addr String,eax
    invoke SetDlgItemText,hWnd,IDC_SUBSYSTEM,addr StatusText
    mov eax,[esi].DataDirectory.VirtualAddress  ;Exports RVA
    invoke wsprintf,addr StatusText,addr sRVA,eax
    invoke SetDlgItemText,hWnd,IDC_AEXPORT,addr StatusText
    mov eax,[esi].DataDirectory.isize  ;Exports Size
    invoke wsprintf,addr StatusText,addr sSize,eax
    invoke SetDlgItemText,hWnd,IDC_SEXPORT,addr StatusText
    mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress  ;Imports RVA
    invoke wsprintf,addr StatusText,addr sRVA,eax
    invoke SetDlgItemText,hWnd,IDC_AIMPORT,addr StatusText
    mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY].isize  ;Imports Size
    invoke wsprintf,addr StatusText,addr sSize,eax
    invoke SetDlgItemText,hWnd,IDC_SIMPORT,addr StatusText
    mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY*2].VirtualAddress  ;Rsrc RVA
    invoke wsprintf,addr StatusText,addr sRVA,eax
    invoke SetDlgItemText,hWnd,IDC_ARESOURCE,addr StatusText
    mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY*2].isize  ;Rsrc Size
    invoke wsprintf,addr StatusText,addr sSize,eax
    invoke SetDlgItemText,hWnd,IDC_SRESOURCE,addr StatusText
    mov eax,dword ptr [esi].SizeOfImage  ;ImageBaseSize
    mov ImageBaseSize,eax
    mov dword ptr [VirtualAddr],ebx  ;VirtualAddr = ImageBase
    pop esi  ;Retrieve Pointer to PE Header
    invoke SetStatus,addr stValidPE  ;Valid PE Detected.
    invoke GetSections  ;Get the file sections
    invoke DisplaySections  ;Display their Infos
    invoke GetCodeSection  ;Determine the Code Section
    cmp dword ptr [CodeSection],0  ;Valid Section?
    jz NoCodeSection  ;No Code Section? then Exit
    ;-------------------------------------------------------;
    ; VirtualAddr = VirtualAddr + CodeSection's VirtualAddr
    ;-------------------------------------------------------;
    mov eax,1
    ret  ;No Errors, return (1)

NotValidMZ:
    invoke SetStatus,addr stNotValidMZ  ;Not a valid MZ...
    jmp PExit

NotValidPE:
    invoke SetStatus,addr stNotValidPE  ;Not a valid PE...
    jmp PExit

NoCodeSection:
    invoke SetStatus,addr stNoCodeSection  ;No Code Section...

PExit:
    invoke ByeBye
    xor eax,eax
    ret
CheckPE endp


Part 2 - Discussion of PE.asm
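
As a starting point for this discussion, the work of GetSections can be sketched in Python. The layout used below (NumberOfSections at PE+6, SizeOfOptionalHeader at PE+20, 40-byte section headers) comes from the PE format itself; the names Section and read_sections are our own. Two things differ from the listing and are assumptions of this sketch: it locates the section table via SizeOfOptionalHeader instead of sizeof IMAGE_NT_HEADERS, and it keeps the full 32-bit field values, whereas the SECTION structure of the listing stores only the low 16 bits.

```python
import struct
from collections import namedtuple

Section = namedtuple(
    "Section",
    "name virtual_size virtual_address raw_size raw_address characteristics",
)
MAX_SECTIONS = 10  # same limit as in Const.inc


def read_sections(data, pe_offset):
    """Return a list of Section tuples read from the PE section table."""
    (count,) = struct.unpack_from("<H", data, pe_offset + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    # Signature (4) + file header (20) + optional header = start of table.
    table = pe_offset + 24 + opt_size
    sections = []
    for i in range(min(count, MAX_SECTIONS)):
        fields = struct.unpack_from("<8sIIIIIIHHI", data, table + 40 * i)
        name, vsize, vaddr, rsize, raddr = fields[:5]
        chars = fields[9]
        # Section names are padded with zero bytes up to 8 characters.
        text = name.rstrip(b"\0").decode("ascii", "replace")
        sections.append(Section(text, vsize, vaddr, rsize, raddr, chars))
    return sections
```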




Part 3 - Tools.asm

.code

;>-- Clear Disassembler ListBox --<;
ResetDisassembler proc
    invoke SendMessage,hDisassembler,LB_RESETCONTENT,0,0
    ret
ResetDisassembler endp

;>-- Add Line In Disasm --<;
AddLine proc dwLineToAdd:DWORD
    invoke SendMessage,hDisassembler,LB_ADDSTRING,0,dwLineToAdd
    ret
AddLine endp

;>-- Display Welcome Message --<;
DisplayWelcome proc
    invoke ResetDisassembler
    invoke AddLine,addr Welcome1
    invoke AddLine,addr Welcome2
    invoke AddLine,addr Welcome3
    invoke AddLine,addr EmptyLine
    ret
DisplayWelcome endp

;>-- Reset Variables for a New File --<;
ResetVars proc
    cmp AllocatedMem,0
    jz NotAllocatedBefore  ;Test if we must release memory used
    invoke UnmapViewOfFile,AllocatedMem  ;and reset variables
    invoke CloseHandle,hmapFile

NotAllocatedBefore:
    mov FileSize,0
    mov AllocatedMem,0
    mov AllocatedMemEnd,0
    mov CodeSection,0
    mov wSections,0
    mov ImageBase,0
    mov ImageBaseSize,0
    mov dwCodeSize,0
    mov CurVirtualOffset,0
    mov VirtualAddr,0
    ret
ResetVars endp

;>-- Show Msgs in Status Bar --<;
SetStatus proc dwMsg:DWORD
    invoke SendMessage,hStatusbar,LB_ADDSTRING,0,dwMsg
    invoke SendMessage,hStatusbar,LB_SETTOPINDEX,eax,0
    ret
SetStatus endp

;>-- Clear Status Messages --<;
ClearStatus proc
    invoke SendMessage,hStatusbar,LB_RESETCONTENT,0,0
    ret
ClearStatus endp

;>-- Display Msg & Exit --<;
ByeBye proc
    invoke SetStatus,addr stExiting
    mov byte ptr [FileName],0  ;Clear FileName
    invoke SetDlgItemText,hWnd,IDC_FILENAME,addr FileName
    ret
ByeBye endp




;>-- Goto Any Virtual Offset --<;
GotoOffsetDlgProc proc hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
    mov eax,uMsg
    .if eax==WM_COMMAND
        mov eax,wParam
        .if ax==IDC_GOTO  ;--> Goto Button
            invoke GetDlgItemText,hWin,IDC_GOTOOFFSET,addr OffsetToGoto+1,9
            mov byte ptr [OffsetToGoto],' '  ;First char is a space
            invoke SendMessage,hDisassembler,LB_FINDSTRING,0,addr OffsetToGoto  ;Find Offset
            cmp eax,LB_ERR
            jz OffNotFound
            invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,0  ;Select Line With Offset in Disasm
            jmp OffFound

OffNotFound:
            invoke MessageBox,hWin,addr OffsetNotFound,addr Err,MB_ICONINFORMATION
            jmp TryAgain
        .elseif ax==IDC_GOTOCANCEL  ;--> Cancel Button
            invoke SendMessage,hWin,WM_CLOSE,NULL,NULL  ;Close
        .endif
    .elseif eax==WM_CLOSE  ;--> Close Dialog

OffFound:
        invoke EndDialog,hWin,0
    .else

TryAgain:
        mov eax,FALSE
        ret
    .endif
    mov eax,TRUE
    ret
GotoOffsetDlgProc endp


;>-- Convert RVA To Offset --<;
RVAToOffset proc uses ebx esi dwRVA:DWORD
    assume esi: ptr SECTION
    lea esi,FileSections  ;esi = Section's Data
    xor ecx,ecx  ;Section index
    mov edx,dwRVA  ;Move RVA to edx

SearchNewSection:
    movzx eax,word ptr [esi].VirtualAddress  ;RVA Section Start
    cmp edx,eax
    jl @F  ;RVA >= RVA Section Start
    movzx ebx,word ptr [esi].RawSize  ;Get Section's Size
    add ebx,eax  ;RVA Section End
    cmp edx,ebx
    jbe SectionFound  ;RVA Section Start <= RVA <= RVA Section End

@@:
    add esi,sizeof SECTION  ;Next Section
    inc cx
    cmp cx,wSections  ;Check if we looped through all sections
    jnz SearchNewSection
    xor eax,eax  ;If nothing found return -1
    dec eax
    ret

SectionFound:
    mov byte ptr [CodeSectionIndex],cl  ;Store the Code Section Index
    mov ebx,eax
    movzx eax,word ptr [esi].RawAddress  ;Get CodeSection's PhysicalOffset
    sub ebx,eax  ;ebx = RVA Section Start - Offset Section Start
    sub edx,ebx  ;edx = RVA - (VirtualAddr - PhysicalOffset)
    mov eax,edx  ;Return File Offset in EAX
    ret
RVAToOffset endp




;>-- Procedure to open files --<;
OpenTheFile proc
    invoke MessageBox,hWnd,addr msgLoadDefault,addr msgCaption,MB_ICONQUESTION or MB_YESNO  ;Load Default?
    cmp eax,IDYES
    jnz LoadTheFile
    invoke lstrcpy,addr FileName,addr TestingFile
    jmp ContinueLoading

LoadTheFile:
    mov ofn.lStructSize,SIZEOF ofn  ;Prepare ofn structure
    push hInstance
    pop ofn.hInstance
    mov ofn.lpstrFilter,OFFSET FilterString
    mov ofn.lpstrFile,OFFSET FileName
    mov ofn.nMaxFile,MAX_BUFFER  ;Size of the FileName buffer
    mov ofn.Flags,OFN_FILEMUSTEXIST or \
                  OFN_PATHMUSTEXIST or OFN_LONGNAMES or \
                  OFN_EXPLORER or OFN_HIDEREADONLY
    invoke GetOpenFileName,addr ofn

ContinueLoading:
    invoke CreateFile,addr FileName,\  ;Open the file as READ_ONLY
           GENERIC_READ,\
           FILE_SHARE_READ,\
           NULL,OPEN_EXISTING,\
           FILE_ATTRIBUTE_ARCHIVE,NULL
    cmp eax,-1
    jz ErrorHappened  ;CreateFile Failed?
    mov hFile,eax  ;Store the file handle
    invoke GetFileSize,hFile,0
    mov FileSize,eax  ;Store the file size
    mov AllocatedMemEnd,eax  ;Store the end of the AllocatedMem (1)
    invoke CreateFileMapping,hFile,0,PAGE_READONLY,0,0,0  ;Map the file in memory
    cmp eax,0
    jz ErrorHappened  ;Error Creating File Mapping?
    mov hmapFile,eax  ;Store the file mapped handle
    invoke MapViewOfFile,eax,FILE_MAP_READ,0,0,0  ;Map View of File
    cmp eax,0
    jz ErrorMapping  ;Error Mapping View of File? :)
    mov AllocatedMem,eax  ;Store allocated file offset
    add AllocatedMemEnd,eax  ;Store the end of the AllocatedMem (2)
    invoke ClearStatus  ;Clear the status bar
    invoke SetDlgItemText,hWnd,IDC_FILENAME,addr FileName  ;Display File Name
    invoke SetStatus,offset stFileLoaded  ;In the status bar too
    invoke CloseHandle,hFile  ;Close the file (still mapped in memory)
    mov eax,1
    ret  ;Return (1) = succeeded

ErrorMapping:
    invoke CloseHandle,hmapFile  ;Close the mapped file

ErrorHappened:
    xor eax,eax
    ret  ;Something's wrong return (0)
OpenTheFile endp




Part 3 - Discussion of Tools.asm
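
The most interesting routine in this file is RVAToOffset: it walks the section list, finds the section whose range contains the RVA and subtracts the difference between the section's virtual address and its file offset. A rough Python sketch of that arithmetic follows. The function name rva_to_offset and the (virtual_address, raw_size, raw_address) tuples are assumptions made for this illustration; the sketch also uses an exclusive upper bound and plain integers, where the listing compares with jbe and works on 16-bit fields.

```python
def rva_to_offset(sections, rva):
    """Convert an RVA to a file offset; return -1 if no section contains it.

    sections is a list of (virtual_address, raw_size, raw_address) tuples.
    """
    for virtual_address, raw_size, raw_address in sections:
        # The RVA belongs to this section if it lies inside its raw data.
        if virtual_address <= rva < virtual_address + raw_size:
            # offset = RVA - (VirtualAddress - RawAddress)
            return rva - (virtual_address - raw_address)
    return -1
```

For a section at RVA 0x1000 whose raw data starts at file offset 0x400, the RVA 0x1010 maps to 0x1010 - (0x1000 - 0x400) = 0x410.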


Part 4 - Const.inc

.const
;-- Main Dialog equates -----
IDI_ICON        equ 300

IDD_MAIN        equ 101
IDD_GOTOOFFSET  equ 103

IDC_DISASM      equ 1001
IDC_STATUSBAR   equ 1002
IDC_FILENAME    equ 1005
IDC_PROGRESS    equ 1006
IDC_SUBSYSTEM   equ 1008
IDC_IMAGEBASE   equ 1010
IDC_EPRVA       equ 1012
IDC_EPOFFSET    equ 1014
IDC_SECTIONS    equ 1016
IDC_AEXPORT     equ 1019
IDC_SEXPORT     equ 1020
IDC_AIMPORT     equ 1022
IDC_SIMPORT     equ 1023
IDC_ARESOURCE   equ 1025
IDC_SRESOURCE   equ 1026

;-- Main Menu Items ---------
IDM_FILE        equ 3001
IDM_OPEN        equ 3002
IDM_EXIT        equ 3003
IDM_VIEW        equ 3010
IDM_VEXPORT     equ 3011
IDM_VIMPORT     equ 3012
IDM_VRSRC       equ 3013
IDM_VAPI        equ 3014
IDM_VSTRINGS    equ 3015
IDM_GOTO        equ 3020
IDM_GOTOENTRY   equ 3021
IDM_GOTOOFFSET  equ 3022
IDM_HELP        equ 3090
IDM_GETHELP     equ 3091
IDM_ABOUT       equ 3092

;-- GotoOffset Dialog equates ---
IDC_GOTOOFFSET  equ 1002
IDC_GOTO        equ 1003
IDC_GOTOCANCEL  equ 1004

;-- Constants ---------------------------------------------
MAX_SECTIONS    equ 10   ;Max number of sections allowed
MAX_BUFFER      equ 256  ;Size of Buffers


Part 4 - Discussion of Const.inc




Part 5 - Idata.inc

.data
;== MESSAGES ==========================================================
;-- GotoOffset Msgs ---------------------------------------------------
OffsetNotFound  db "No matching offset found!",0
Err             db "Err..",0

;-- Titles & Msgs -----------------------------------------------------
CapAbout        db "AoD Basic Disassembler",0
About           db "AoD Basic Disassembler Stage-1",13,10,"September, 2002",0

AreYouSure      db "Are you sure you want to exit?",0
Exit            db "Are you nuts!? :p",0

msgLoadDefault  db "Load the default file ""test.exe""?",0
msgCaption      db "Load File",0

FilterString    db "(*.exe)",0,"*.exe",0,"(*.dll)",0,"*.dll",0,0
TestingFile     db "test.exe",0

Welcome1        db "---------------------------------",0
Welcome2        db " AoD Basic Disassembler Stage - 1",0
Welcome3        db "---------------------------------",0

;-- Status Msgs -------------------------------------------------------
stFileLoaded    db "File loaded.",0
stValidPE       db "Valid PE Detected.",0
stValidMZ       db "Valid MZ Detected.",0
stNotValidPE    db "Invalid PE Detected.",0
stNotValidMZ    db "Invalid MZ Detected.",0
stExiting       db "Exiting...",0
stTempSections  db "Found %X Section(s).",0
stSectionsFound db "Virtual Address %08X - Virtual Size %08X"
                db " - Raw Offset %08X - Raw Size %08X. - Chars. %08X ( ",0
stRightBracket  db " )",0
stEntryPoint    db "EntryPoint (RVA) %08X - EntryPoint (Offset) %08X.",0
stNoCodeSection db "Code section couldn't be found! at least in this version :P",0
stHex           db "%08X",0

;-- Misc --------------------------------------------------------------
EmptyLine       db " ",0

;== TEMPLATES =========================================================
;-- To find EntryPoint ------------------------------------------------
szEntryPoint    db " 00000000:",0
findEP          db " %08X:",0

;-- To Show RVA's & Sizes ---------------------------------------------
sRVA            db "RVA: %08X",0
sSize           db "Size: %08X",0

;-- Misc --------------------------------------------------------------
String          db "%s",0

;== ARRAYS ============================================================
;--- SubSystem types --------------------------------------------------
S0 BYTE "Unknown",0
S1 BYTE "Native",0
S2 BYTE "Windows-GUI",0
S3 BYTE "Windows-Console",0
S5 BYTE "OS/2 Console",0
S7 BYTE "Posix Console",0
S8 BYTE "Native Win9x Driver",0
S9 BYTE "Windows CE",0

SubSystem PBYTE S0,S1,S2,S3,S0,S5,S0,S7,S8,S9

;== COLORS ============================================================
;-- List Boxes Colors -------------------------------------------------
dwListBoxBack   COLORREF White       ; Back Color for both List Boxes
dwDisasmFore    COLORREF 000490093h  ; Fore Color for Disassembler List Box
dwStatusFore    COLORREF Blue        ; Fore Color for Status List Box

;== FONTS =============================================================
;-- Disassembler Font -------------------------------------------------
FontC           db "Courier New",0
FontT           db "Tahoma",0


Part 5 - Discussion of Idata.inc
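
One detail of this file deserves a short note: the SubSystem array is indexed directly by the Subsystem value from the optional header, and S0 ("Unknown") fills the unused slots 4 and 6. The lookup can be sketched in Python as follows. The names SUBSYSTEM_NAMES and subsystem_name are made up for this illustration, and the fallback for values above 9 is our own addition; the ten-entry array in the listing has no entry for them.

```python
# Subsystem value -> display text, taken from the S0..S9 strings above.
SUBSYSTEM_NAMES = {
    1: "Native",
    2: "Windows-GUI",
    3: "Windows-Console",
    5: "OS/2 Console",
    7: "Posix Console",
    8: "Native Win9x Driver",
    9: "Windows CE",
}


def subsystem_name(value):
    """Return the display text for a Subsystem value, 'Unknown' otherwise."""
    return SUBSYSTEM_NAMES.get(value, "Unknown")
```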




Part 6 - Main.inc

;-- System includes -------
include windows.inc
include kernel32.inc
include user32.inc
include comdlg32.inc
include Comctl32.inc
include gdi32.inc

;-- System libraries ------
includelib kernel32.lib
includelib user32.lib
includelib Comctl32.lib
includelib comdlg32.lib
includelib gdi32.lib

;-- Includes --------------
include Protos.inc
include Types.inc
include Const.inc
include Idata.inc
include Udata.inc
include Struct.inc

;-- Modules ---------------
include Tools.asm
include PE.asm


Part 6 - Discussion of Main.inc




Part 7 - Protos.inc

;-- Main Module Prototypes --------------------------------------------
DlgProc           PROTO :HWND,:UINT,:WPARAM,:LPARAM

;-- PE.asm Prototypes -------------------------------------------------
CheckPE           PROTO         ; Check for a valid PE
GetSections       PROTO         ; Get the sections names & info
DisplaySections   PROTO         ; Display info about sections
GetCodeSection    PROTO         ; Detect the code section

;-- Tools.asm Prototypes ----------------------------------------------
ResetVars         PROTO         ; Reset Variables & Close Files
OpenTheFile       PROTO         ; Open the file to disasm
SetStatus         PROTO :DWORD  ; Display a Status Msg
ByeBye            PROTO         ; Display exit message & exit
ClearStatus       PROTO         ; Clear the Status List Box
RVAToOffset       PROTO :DWORD  ; Converts RVA to Offset
AddLine           PROTO :DWORD  ; Display a line in disassembler
DisplayWelcome    PROTO         ; Display the silly welcome text
ResetDisassembler PROTO         ; Clear the text in disassembler
GotoOffsetDlgProc PROTO :HWND,:UINT,:WPARAM,:LPARAM  ; Goto any Virtual Offset


Part 7 - Discussion of Protos.inc
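
Protos.inc declares RVAToOffset as the routine that converts an RVA to a file offset, but its body lives in Tools.asm and is not reproduced in this part. As a rough sketch of how such a conversion is commonly done, here is a C version that walks a section table. The SectionInfo type, its 32-bit field widths (taken from the PE section header rather than from Struct.inc) and the function name are assumptions made for illustration, not the project's own code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section record. The field names follow Struct.inc; the
   32-bit widths follow the PE section header, not the listing. */
typedef struct {
    uint32_t virtual_address; /* RVA of the section              */
    uint32_t virtual_size;    /* size of the section in memory   */
    uint32_t raw_address;     /* file offset of the raw data     */
    uint32_t raw_size;        /* size of the raw data on disk    */
} SectionInfo;

/* Returns 1 and stores the file offset in *offset when rva falls inside
   the raw data of one of the sections; returns 0 otherwise. */
int rva_to_offset(const SectionInfo *sections, size_t count,
                  uint32_t rva, uint32_t *offset)
{
    for (size_t i = 0; i < count; i++) {
        const SectionInfo *s = &sections[i];
        if (rva >= s->virtual_address &&
            rva - s->virtual_address < s->raw_size) {
            *offset = s->raw_address + (rva - s->virtual_address);
            return 1;
        }
    }
    return 0;
}
```

The sketch compares against the raw size rather than the virtual size, so an RVA that only exists in memory (for example in the zero-filled tail of a section) is reported as having no file offset instead of pointing past the data on disk.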




Part 8 - Struct.inc

.data
;== DEFINITIONS =========================================================================
;-- Section info ------------------------------------------------------------------------
SECTION struct 2                    ;Section's:
    sName           DWORD 0         ;Name
    sName1          DWORD 0         ;Name (cont.)
    sName2          BYTE 0          ;Name end
    VirtualAddress  WORD ?          ;RVA
    VirtualSize     WORD ?          ;Size
    RawAddress      WORD ?          ;File Offset
    RawSize         WORD ?          ;File Size
    Characteristics DWORD ?         ;Characteristics (i.e. executable code, initialized/uninitialized data)
SECTION ends

;== INITIALIZATIONS =====================================================================
icex            INITCOMMONCONTROLSEX <sizeof INITCOMMONCONTROLSEX,0> ;Common Controls
lf              LOGFONT <>                      ;Font
ofn             OPENFILENAME <>                 ;FileNameDialog Parameters
FileSections    SECTION MAX_SECTIONS dup ({})   ;Sections info


Part 8 - Discussion of Struct.inc
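
The SECTION record reserves nine bytes for the name (sName, sName1 and sName2). The Name field of a PE section header is eight bytes long and is not null-terminated when the name uses all eight characters, so the ninth byte presumably leaves room for a terminator. The C sketch below shows that copy under this assumption; the function and constant names are hypothetical and not taken from PE.asm.

```c
#include <string.h>

#define PE_SECTION_NAME_LEN 8  /* size of the Name field in a PE section header */

/* Copies the 8-byte name field into a 9-byte buffer and terminates it.
   dst must have room for PE_SECTION_NAME_LEN + 1 bytes. */
void copy_section_name(char dst[PE_SECTION_NAME_LEN + 1],
                       const unsigned char src[PE_SECTION_NAME_LEN])
{
    memcpy(dst, src, PE_SECTION_NAME_LEN);
    dst[PE_SECTION_NAME_LEN] = '\0';
}
```

Shorter names are padded with zero bytes in the header, so the copied string simply ends earlier; a full eight-character name such as ".textbss" relies on the extra byte.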




Part 9 - Types.inc

;-- Type Definitions ----
PBYTE TYPEDEF PTR BYTE              ; Pointer to Byte


Part 9 - Discussion of Types.inc




Part 10 - Udata.inc

.data?
;== BUFFERS ========================================================================
OffsetToGoto        db 10 dup (?)           ;Buffer for Offset to look for
FileName            db MAX_BUFFER dup (?)   ;Buffer to hold the file name
StatusText          db MAX_BUFFER dup (?)   ;Buffer to store the Status Text

;== HANDLES ========================================================================
hInstance           HINSTANCE ?     ;Main hInstance
hWnd                HWND ?          ;Main hWnd
hIcon               HICON ?         ;Icon Handle
hLfnt               HFONT ?         ;Font Handle for Disasm
hSfnt               HFONT ?         ;Font Handle for Status
hListBoxBack        HBRUSH ?        ;Brush Handle
hDisassembler       HWND ?          ;Disassembler (listbox) Handle
hStatusbar          HWND ?          ;Status (ListBox) Handle
hProgressbar        HWND ?          ;ProgressBar Handle
hFile               HANDLE ?        ;File Handle
hmapFile            HANDLE ?        ;Mapped File Handle

;== GLOBALS ========================================================================
FileSize            DWORD ?         ; File Size
ImageBase           DWORD ?         ; PE Image Base
ImageBaseSize       DWORD ?         ; PE Image Base Size
AllocatedMem        DWORD ?         ; Mapped File Offset
AllocatedMemEnd     DWORD ?         ; AllocatedMem + FileSize
CodeSection         DWORD ?         ; Mapped File Code Section Offset
EntryPointRVA       WORD ?          ; EntryPoint (RVA)
EntryPointOffset    WORD ?          ; EntryPoint (Offset)
VirtualAddr         DWORD ?         ; First Instruction to disassemble (VA)
CurVirtualOffset    DWORD ?         ; Current Virtual Address being disassembled
dwCodeSize          DWORD ?         ; Size of Code Section
wSections           WORD ?          ; # of Sections in file
dwCurSection        WORD ?
CodeSectionIndex    BYTE ?          ; Code Section Index




Part 10 - Discussion of Udata.inc


Part 11 - AoDBasicDisasm.rc

#define IDI_ICON 300
IDI_ICON ICON DISCARDABLE Res\chip.ico
#include <Res\BasicDisasmMainDlg.rc>
#include <Res\MainMenuMnu.rc>
#include <Res\GotoOffsetDlg.rc>




Part 12 - BasicDisasmMainDlg.Rc

#define IDD_MAIN 101
#define IDC_DISASM 1001
#define IDC_SEP 1004
#define IDC_FILENAME 1005
#define IDC_STATUSBAR 1002
#define IDC_PROGRESS 1006
#define IDC_PEINFO 1003
#define IDC_LSUBSYSTEM 1007
#define IDC_SUBSYSTEM 1008
#define IDC_LIBASE 1009
#define IDC_IMAGEBASE 1010
#define IDC_LEP 1011
#define IDC_EPRVA 1012
#define IDC_LEPO 1013
#define IDC_EPOFFSET 1014
#define IDC_LSECTIONS 1015
#define IDC_SECTIONS 1016
#define IDC_DIRECTORY 1017
#define IDC_LEXPORT 1018
#define IDC_AEXPORT 1019
#define IDC_SEXPORT 1020
#define IDC_LIMPORT 1021
#define IDC_AIMPORT 1022
#define IDC_SIMPORT 1023
#define IDC_LRESOURCE 1024
#define IDC_ARESOURCE 1025
#define IDC_SRESOURCE 1026
#define IDC_FNAME 1027
IDD_MAIN DIALOGEX 6,5,546,387
CAPTION "AoD Basic Disassembler Stage-1"
FONT 8,"MS Sans Serif"
MENU 3000
STYLE 0x10CA0800
EXSTYLE 0x00040000
BEGIN
  LISTBOX IDC_DISASM,4,31,450,310,NOT 0x00820000|0x502100C0,0x00000201
  CONTROL "",IDC_SEP,"Static",NOT 0x00830000|0x50000012,4,16,450,1,0x00000000
  LTEXT "",IDC_FILENAME,44,1,410,12,NOT 0x00830000|0x50001000,0x00000000
  LISTBOX IDC_STATUSBAR,4,328,538,56,NOT 0x00820000|0x50210040,0x00000201
  CONTROL "",IDC_PROGRESS,"msctls_progress32",NOT 0x10830000|0x40000000,4,20,450,7,0x00000300
  LTEXT "PE Information",IDC_PEINFO,460,3,84,15,NOT 0x00830000|0x50000001,0x00000201
  LTEXT "SubSystem:",IDC_LSUBSYSTEM,460,24,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_SUBSYSTEM,460,36,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "ImageBase:",IDC_LIBASE,460,49,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_IMAGEBASE,460,62,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "EntryPoint RVA:",IDC_LEP,460,75,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_EPRVA,460,88,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "EntryPoint File Offset:",IDC_LEPO,460,101,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_EPOFFSET,460,114,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "# of Sections:",IDC_LSECTIONS,460,127,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_SECTIONS,460,140,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "Directory",IDC_DIRECTORY,460,153,84,13,NOT 0x00830000|0x50000001,0x00000001
  LTEXT "Export Table:",IDC_LEXPORT,460,169,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_AEXPORT,460,182,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "",IDC_SEXPORT,460,195,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "Import Table:",IDC_LIMPORT,460,208,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_AIMPORT,460,221,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "",IDC_SIMPORT,460,234,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "Resource Table:",IDC_LRESOURCE,460,247,84,9,NOT 0x00830000|0x50000000,0x00000000
  LTEXT "",IDC_ARESOURCE,460,260,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "",IDC_SRESOURCE,460,273,84,9,NOT 0x00830000|0x50000002,0x00000000
  LTEXT "File Name:",IDC_FNAME,4,1,36,12,NOT 0x00830000|0x50001000,0x00000000
END




Part 13 - GotoOffsetDlg.Rc

#define IDD_GOTOOFFSET 103
#define IDC_STC5 1001
#define IDC_GOTOOFFSET 1002
#define IDC_GOTO 1003
#define IDC_GOTOCANCEL 1004
IDD_GOTOOFFSET DIALOGEX 6,6,108,36
CAPTION "Goto Offset:"
FONT 8,"MS Sans Serif"
STYLE 0x10CF0000
EXSTYLE 0x00000080
BEGIN
  LTEXT "Offset:",IDC_STC5,10,7,24,9,NOT 0x00830000|0x50000000,0x00000000
  EDITTEXT IDC_GOTOOFFSET,38,5,64,11,NOT 0x00820000|0x50010000,0x00000200
  PUSHBUTTON "Go",IDC_GOTO,8,22,44,11,NOT 0x00820000|0x50010001,0x00000000
  PUSHBUTTON "Cancel",IDC_GOTOCANCEL,54,22,44,11,NOT 0x00820000|0x50010000,0x00000000
END


Part 14 - MainMenuMnu.Rc

#define IDM_FILE 3001
#define IDM_OPEN 3002
#define IDM_EXIT 3003
#define IDM_VIEW 3010
#define IDM_VEXPORT 3011
#define IDM_VIMPORT 3012
#define IDM_VRSRC 3013
#define IDM_VAPI 3014
#define IDM_VSTRINGS 3015
#define IDM_GOTO 3020
#define IDM_GOTOENTRY 3021
#define IDM_GOTOOFFSET 3022
#define IDM_HELP 3090
#define IDM_GETHELP 3091
#define IDM_ABOUT 3092
3000 MENU
BEGIN
  POPUP "&File"
  BEGIN
    MENUITEM "&Open",IDM_OPEN
    MENUITEM "E&xit",IDM_EXIT
  END
  POPUP "&View"
  BEGIN
    MENUITEM "&Exports",IDM_VEXPORT,GRAYED
    MENUITEM "&Imports",IDM_VIMPORT,GRAYED
    MENUITEM "&Resources",IDM_VRSRC,GRAYED
    MENUITEM "&API Calls",IDM_VAPI,GRAYED
    MENUITEM "&String References",IDM_VSTRINGS,GRAYED
  END
  POPUP "&GoTo"
  BEGIN
    MENUITEM "Goto &Entry Point",IDM_GOTOENTRY,GRAYED
    MENUITEM "Goto Virtual &Offset",IDM_GOTOOFFSET,GRAYED
  END
  POPUP "&Help"
  BEGIN
    MENUITEM "&Help",IDM_GETHELP,GRAYED
    MENUITEM "&About",IDM_ABOUT
  END
END


Lesson 2 - Modules And Procedures





CHAPTER 5 A Simple Disassembler-Engine



Lesson 1 - Theory


Lesson 2 - Practice


Lesson 3 - Result And Sources


CHAPTER 6 Building A DLL As Disassembler-Engine



CHAPTER 7 An Advanced Disassembler-Engine



Lesson 1 - Theory


Lesson 2 - Practice


Lesson 3 - Results and Sources


CHAPTER 8 Improving The Disassembler-Engine - String-References, APIs and more...



CHAPTER 9 Disassembler Extreme - Polymorphic Code and more...



CHAPTER 10 Appendix


