
Sleipnir - An Instruction-Level Simulator Generator

Tor E. Jeremiassen
Texas Instruments*
tor@ti.com

Abstract

Instruction-level simulators occupy a central role in the software development for embedded processors. They provide a convenient virtual platform for testing, debugging and optimizing code. They can be made available long before any hardware is available, and are not as awkward to work with as test/evaluation boards.

However, many available instruction-level simulators are lacking in desired functionality. Moreover, instruction-level simulators suitable to the task are tedious to write from scratch.

This paper presents the Sleipnir simulator generator, a convenient tool for writing instruction-level simulators. Sleipnir allows simulators for simple architectures to be generated with a minimum of overhead, yet allows sufficient micro-architectural detail to be expressed to generate cycle accurate simulators for most embedded processors.

Sleipnir has been used to successfully generate fast instruction-level simulators for six different architectures, including a RISC processor, two microcontrollers and three DSPs.

1. Introduction

Software development for embedded systems relies heavily on the use of simulators. Unlike general purpose computers, the embedded hardware, by its nature, offers very little support for the software development cycle. Often software development has to start long before the actual hardware is available. Therefore, simulators are necessary to provide a convenient platform for testing, debugging, profiling and optimizing embedded applications. When properly instrumented, simulators can provide much more detailed statistics and profile information than any hardware platform.

The downside is that simulators are much slower than the hardware they simulate. The more detailed the simulation is, the slower the simulator runs. Logic and functional level simulators have simulation speeds on the order of only tens to hundreds of cycles per second. Instruction-level simulators are much faster. Speeds range from thousands of simulated cycles per second to millions.

*This work was performed while the author was at Bell Labs, Lucent Technologies, Murray Hill, NJ.

Instruction-level simulators for a given embedded architecture are usually available from the vendor and/or one or more third party suppliers. While they often are part of a complete tool chain, their usefulness is often limited in several ways.

First, commercial simulators tend only to provide a vanilla simulation capability. While a debugger interface is frequently available, the simulators typically report only the total cycle count. No other statistics or profiling information is provided. Second, these simulators are distributed in binary form, making them impossible to extend or customize to collect new statistics or profile new events. Third, the simulators are normally only available for a small number of host platforms, which may or may not match well with the customer's computing environment. Fourth, the simulation speed is usually not high enough to be convenient for large applications, large data sets, or frequent testing. For instance, validating an implementation of the GSM-AMR speech transcoder [5], an important application in cellular telephony, against the set of standard test vectors requires on the order of 200 billion simulated instructions for some DSPs. In order to complete the validation in one CPU week, a simulator must be capable of simulating roughly 450,000 instructions per second. To complete it in one CPU day requires almost 2.5 million instructions per second.

One solution is to write a fast instruction-level simulator from scratch. However, writing an instruction-level simulator by hand is labor intensive, tedious and error-prone. Moreover, some of the more advanced simulation techniques, such as dynamic cross-compilation, used to produce very fast simulators [3], are complex, non-portable, and produce simulators that are hard to modify.

This paper describes Sleipnir, an instruction-level simulator generator. Sleipnir was designed with four major goals in mind. First, it had to make it easy to write simulators for a wide range of architectures, and to instrument these simulators to provide detailed statistics, profiling, and/or timing information. Sleipnir has been used to generate simulators for six different architectures: Texas Instruments C62xx series DSPs, including detailed cycle accurate simulation of both the C6201 [20] and C6211 [19] devices¹, Arm/Thumb (architecture version 4) [6], Motorola M*Core [14], Lucent DSP1600 [11], Mips (integer instructions only) [8], as well as the StarCore SC140 DSP [13]. All but the SC140 simulator were written entirely by the author.

Second, the generated simulators had to be portable to a variety of host platforms. So far, Sleipnir generated simulators have been run on IA32/Windows, IA32/Linux, HP-PA/HP-UX, Compaq Alpha/OSF, Sparc/Solaris and SGI Mips/Irix.

Third, the generated simulators had to be as fast as possible without compromising portability and ease of writing the simulator. This meant that simulation techniques such as dynamic cross-compilation [3] could not be used. Nevertheless, the simulation speed of the Sleipnir generated simulators is quite acceptable. For instance, both the Motorola M*Core and the Mips simulators ran in excess of 5 million instructions per second on a relatively slow 250 MHz Mips R10000 system.

Fourth, the simulator generator needed to be relatively simple to implement. As the author was the only one working on Sleipnir, it was clear that the programming effort had to be kept reasonable. Still, the Sleipnir machine description provides a very convenient level of abstraction for the simulator writer.

Given that Sleipnir is an instruction-level simulator generator, it is important to distinguish between simulating an instruction set architecture and simulating a processor implementation of an instruction set architecture. Sleipnir does the former. Sleipnir does not provide a hardware description language for describing the detailed operation of an implementation, something that tends to both increase the programming effort as well as decrease simulation speed. Instead, Sleipnir makes it easy to create basic simulators, and provides the programmer enough flexibility to control and extend the simulator to perform cycle-accurate simulation. This is particularly true for implementations that reflect the instruction set architecture closely, which is often the case for embedded processors. However, it is less convenient to use Sleipnir to simulate implementations that differ significantly from the instruction set architecture, such as those of advanced out-of-order superscalar processors.

The remainder of this paper is structured as follows: Section 2 describes related work and provides a context for Sleipnir. Section 3 describes the basic simulation mechanism provided by Sleipnir, and Section 4 gives an overview of the machine description format. Section 5 provides some detail and performance results for five instruction-level simulators generated using Sleipnir, while Section 6 concludes.

¹The C6201 and C6211 have the same instruction set and core architecture, but differ in their memory organizations.

2. Related Work

One way of generating a simulator is to describe an implementation using a hardware description language like Verilog [22] or VHDL [1]. However, while these simulators can be extremely accurate, they are too slow for software development. Moreover, since an actual hardware implementation of an instruction set architecture must be defined, the programming effort is very high. Any changes to the architecture may require extensive changes in the description of the implementation.

There are several tools that work at a higher, organizational level, where the microarchitecture is specified, e.g., the number of pipelines and their stages, but the logic level implementation is not. Most of these tools originate in efforts to generate simulators for RISC processors. The description languages and the structure of declarations of semantics and instruction encodings work well with the orthogonal architectures and simple encoding schemes of their target architectures. However, some of the complex encodings and instruction semantics of embedded architectures often are harder to describe using these tools. Compared to the approach embodied in Sleipnir, these tools also require more programming effort to produce slower, but cycle accurate, simulators.

UPFAST [15] is such a tool. It defines an architectural description language that is used to describe the microarchitecture of a processor. This ADL description is then used to generate a simulator/debugger, an assembler, a disassembler, and a linker. Results indicate that the ADL description for a simplified pipelined implementation of the Mips architecture, both integer and floating point instructions, is on the order of 5,000 lines. This is significantly bigger than the 700 lines in the Sleipnir machine description, even when taking into account that the Sleipnir machine description only implemented the integer instructions. The reported simulation speed of the UPFAST simulator is an order of magnitude slower than the Sleipnir generated simulator.

LISA [16] is a language tailored to expressing instruction set architectures and their implementations. LISA specifications are used to generate compiled, or program specific, simulators [17]. That is, the generated simulator can only simulate the program it was generated for. To simulate a different program, a new simulator has to be generated. By removing simulation overhead at compile time, this approach can generate very fast simulators. However, it is predicated on being able to identify and decode the instructions at compile time, which is not possible for all architectures in general. Moreover, it cannot be used to simulate a device with a multiprogrammed or changing workload. The portability of the generated simulators is also limited.


The SimpleScalar Tool Set [2] is a suite of simulation tools for performing simulations at several levels of detail and speed. While the simulated architecture is based on an architecture description file, the instructions are all encoded in fixed width 64-bit instruction words. Thus, unlike Sleipnir, it cannot accept binaries from vendor tool chains. Instead it requires the use of its own (GNU based) compiler and assembler tools.

SLED [18] is the specification language for encoding and decoding machine instructions used in the New Jersey Machine-Code Toolkit. SLED provides a powerful way of compactly describing the mapping between abstract, binary and assembly language representations of machine instructions, particularly for RISC style machine architectures. It does not, however, provide for the specification of the semantics of instructions.

Spawn [10] extends SLED by providing a mechanism for expressing instruction semantics. While the language is not fully described, there seem to be limitations on the type of semantics that can be conveniently expressed. In particular, it is unclear how well the semantics of some signal processing instructions can be described.

The work described in [9] is probably closest to Sleipnir, though it too was motivated by describing RISC style general purpose architectures. Both use C to describe the semantics of the target instructions. However, Sleipnir uses a slower, but more portable, simulation mechanism. [9] uses threaded code, which takes advantage of gcc-specific constructs. The Sleipnir intermediate representation is less space efficient, but more time efficient. Unlike [9], the values stored in the TID do not require additional decoding before they can be used.

3. Simulation Mechanism

Sleipnir works by compiling a machine description file into a set of C source files. These files consist of a number of .h files and three .c files: ops.c contains the instruction semantic functions and some user defined C code, decode.c contains the instruction decode functions, and core.c contains the core simulation mechanism. In addition to the generated source files, some functionality is provided in libraries.

3.1. Instruction Intermediate Representation

The simulators generated by Sleipnir use a predecode [3] simulation mechanism. Instead of decoding a target instruction every time it is executed, it is decoded once, and the decoded information is stored into an easy to use intermediate representation. When an instruction is executed repeatedly, the information stored in this intermediate representation is reused, thereby amortizing the overhead of decoding instructions and thus increasing overall simulation speed. In Sleipnir, the intermediate representation is implemented as a C structure type, and is referred to as a target instruction descriptor (TID). A separate TID is maintained for each decoded instruction in code memory (except when using the cached mode as detailed below).
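
As a concrete illustration, a TID of this kind might look roughly like the C structure below. All field names (semantic, pred, addr, raw, rs, rt, imm, count) are hypothetical placeholders for this sketch, not Sleipnir's actual generated layout.

    /* Illustrative sketch of a target instruction descriptor (TID).
       The real structure is generated from the machine description
       plus any user-defined %%runinst fields. */
    struct tid {
        unsigned long (*semantic)(struct tid *t); /* semantic function, bound at decode */
        int           (*pred)(void);              /* optional predicate hook (Sec. 3.2) */
        unsigned long addr;                       /* instruction address (cached mode)  */
        unsigned long raw;                        /* raw instruction word(s)            */
        int           rs, rt;                     /* example decoded register fields    */
        long          imm;                        /* example decoded immediate          */
        unsigned long count;                      /* example user-added statistics      */
    };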

The TIDs are stored in an array called the instruction store. The instruction store can be used either in an exhaustive or a cached mode. The mode is selected by a command line option when the Sleipnir generator tool is invoked, and thus is fixed for any given generated simulator.

In the exhaustive mode there is a TID for every possible target instruction location in simulated memory. For a 32-bit, word-aligned, instruction set architecture like the Mips, one TID is allocated for every 4 bytes of simulated instruction memory. For an architecture like the SC140, which has 2, 4 and 6 byte long instructions, there is one TID for every 2 bytes (the instruction alignment unit). Standard executable files provide sufficient information about the size and location of the program text section to avoid having to allocate TIDs for all but the part of the instruction memory occupied by the program text.

A benefit of the exhaustive mode is that data stored in the TID persists through the entire simulation. This makes the TID a convenient place to store instruction-level statistics and profiling information that is reported or processed at the end of the simulation.

The cached mode can be used when either a smaller memory footprint is desired or when the entire instruction store of a device needs to be simulated. In the cached mode the instruction store operates as a direct mapped TID cache. As a consequence, TIDs for multiple target instructions may map to a single location in this cache. This means that prior to using a TID, the TID must be validated. The validation is done by storing the target instruction's address into its corresponding TID at the time it is decoded, and comparing it against the PC prior to use. If there is no match, the instruction has to be decoded again, overwriting any previous information stored in that particular cache entry. Note, since the TIDs can be overwritten, they are no longer useful for storing any statistics or profile information.
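
A minimal sketch of how the cached-mode lookup and validation might be written, reusing the hypothetical TID above; the array name, its size, the alignment unit and the decode() helper are assumptions for illustration only.

    /* Direct-mapped TID cache lookup (illustrative sketch only). */
    #define TID_ALIGN   2     /* assumed instruction alignment unit, in bytes */
    #define TID_ENTRIES 4096  /* assumed number of cache entries (power of 2) */

    extern struct tid instruction_store[TID_ENTRIES];
    extern void decode(unsigned long pc, struct tid *t);

    struct tid *lookup_tid(unsigned long pc)
    {
        struct tid *t = &instruction_store[(pc / TID_ALIGN) & (TID_ENTRIES - 1)];
        if (t->addr != pc) {   /* entry holds a different (or no) instruction */
            decode(pc, t);     /* re-decode into this slot, overwriting it    */
            t->addr = pc;      /* validate the slot for this address          */
        }
        return t;
    }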

While there may be a concern that simulation speed may be decreased in the cached mode because of the replacement and re-decoding of instructions, experiments have shown that the performance impact is negligible. The TID cache has roughly equivalent performance to processor instruction caches of similar size (number of entries, not KB) and organization. Also, the additional overhead of validating a TID prior to use is too small to be significant.

The semantics of each target instruction is implemented by its own C function. This function is referred to as the semantic function for that target instruction. During decode, the semantics of a target instruction is bound to its TID by assigning the address of the semantic function to a function pointer in the TID. This makes instruction dispatch as simple as an indirect function call, similar to the dispatch mechanism used in MINT [23].

{
    identify TID for inst. referenced by pc;
    pre-dispatch C code;
    target instruction dispatch, set new pc;
    post-dispatch C code;
}

Figure 1: Simulator main loop.

3.2. Instruction Decoding

The decoding of target instructions can be done in two ways, eagerly or lazily. MINT uses an eager approach. It decodes the instructions of a program when it is loaded into simulated memory. Eager decoding works best when instructions are clearly separate from data, and requires that the beginning of each instruction can be determined a priori. While this might be reasonable for a RISC style architecture, it is not possible in general for architectures with variable length instructions. For this reason Sleipnir generates code to perform instruction decoding lazily. This is achieved by initializing the semantic function pointer of every TID to point to a special semantic function for undecoded instructions. When this semantic function is called it first calls the decode function. This decodes the target instruction and initializes the appropriate fields in the TID. Then it calls the correct semantic function using the new value of the semantic function pointer. Any returned value is passed through to the original call site.
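
A minimal sketch of what that special semantic function might look like, under the hypothetical TID layout sketched earlier; the decode() helper and the use of t->addr are assumed names, not Sleipnir's generated code.

    /* Lazy decoding: every TID's semantic pointer initially points here. */
    extern void decode(unsigned long pc, struct tid *t); /* fills the TID, sets t->semantic */

    static unsigned long sem_undecoded(struct tid *t)
    {
        decode(t->addr, t);     /* bind the real semantic function and decoded fields */
        return t->semantic(t);  /* dispatch it; the returned next PC passes through   */
    }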

The lazy decode approach lends itself to simulation of programs that contain self-modifying or run-time generated code, or code that is loaded dynamically during execution. This is achieved by reinitializing the semantic pointers of the affected TIDs (i.e., those that correspond to the modified code) to point to the undecoded function. This forces the simulator to decode these instructions anew when they are next encountered, guaranteeing that the updated program will be simulated correctly.

The decoder function is implemented as a series of if-statements that compare the bit pattern of a target instruction against the bit pattern of every instruction in the machine description in the order they appear. There is no early return if a match is found. Multiple matches may occur. Each match causes the initialization of fields in the TID. The set of TID fields that are initialized need not be identical across different instructions. However, some fields, like the pointer to the semantic function, are always initialized. Thus the semantic action for an instruction is specified by the last instruction that matched during decode.
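
The generated decoder therefore has roughly the following shape; the mask/value pair below corresponds to a Mips-style addi and is illustrative only, as are the fetch_word() and sem_addi() names.

    /* Decode function sketch: one if-statement per instruction description,
       tested in description order, with no early return. */
    extern unsigned long fetch_word(unsigned long pc);  /* assumed memory-read helper */
    extern unsigned long sem_addi(struct tid *t);       /* semantic function for addi */

    void decode(unsigned long pc, struct tid *t)
    {
        unsigned long w = fetch_word(pc);

        /* addi-like description: fixed opcode in the top 6 bits */
        if ((w & 0xFC000000UL) == 0x20000000UL) {
            t->rs  = (int)((w >> 21) & 0x1F);
            t->rt  = (int)((w >> 16) & 0x1F);
            t->imm = (short)(w & 0xFFFF);  /* sign-extended 16-bit immediate   */
            t->semantic = sem_addi;        /* the last matching description wins */
        }
        /* ... one if-statement per remaining instruction description ... */
    }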

The multiple match feature of the decode mechanism can be used to perform default initializations for classes of instructions that have encodings with identical subfields. For instance, both the TI C62xx and the Arm architectures have full conditional execution specified by a predicate encoded in the first 4 bits of the instruction encoding. Since the predicate is known at decode time, the code to evaluate it should be bound to the instruction during decode. Moreover, since the parsing of these 4 bits is identical for every instruction, it is convenient to express it only once. An efficient way of achieving this is to first write a function for each predicate and declare an additional function pointer in the TID to point to one of these predicate functions. This way, the predicate can be evaluated at execution time by an indirect function call. A default instruction description (which matches any instruction) is inserted into the machine description, before any other instruction descriptions, to initialize this predicate function pointer correctly. This way, the predicate function pointer is initialized once for all instructions. The remaining fields of the TID, including the semantic function pointer, are correctly initialized by subsequent successful decodings.
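
In C terms, the predicate machinery could look something like the sketch below; the flag names, the table layout and the binding in the default description are assumptions for illustration, not the actual Arm or C62xx machine descriptions.

    /* Predicate functions and a lookup table (illustrative sketch only). */
    extern struct { int N, Z, C, V; } flags;        /* assumed global condition flags */

    static int pred_eq(void) { return  flags.Z; }   /* EQ: equal     */
    static int pred_ne(void) { return !flags.Z; }   /* NE: not equal */
    static int pred_al(void) { return  1;       }   /* AL: always    */

    /* indexed by the 4-bit predicate field; remaining entries elided */
    static int (*const pred_table[16])(void) = { pred_eq, pred_ne, pred_al };

    /* The default instruction description binds t->pred = pred_table[cond]
       at decode time; a semantic function then simply tests:
           if (t->pred()) { ... instruction effect ... }                    */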

An important feature of making the ordering of instructions significant is that it allows for proper decoding of variable length instructions where shorter instructions may be prefixes of longer instructions. An example of one such architecture is the StarCore SC140 DSP [13]. The SC140 has 16-bit instructions that match the initial 16 bits of some 32-bit instructions. By putting the instruction descriptions for the 16-bit instructions before the 32-bit instructions, the correct decode is performed.

Even though the decode function may be quite inefficient by some standards, there is actually very little gain in trying to speed it up, as was done in [9]. Profiling information across multiple applications shows that the decode function accounts for less than 0.1% of the execution time of any reasonable-length simulation.

3.3. Main Simulation Loop

Figure 1 shows the main simulation loop. The loop can be broken down into four parts. First, the address of the TID for the target instruction, pointed to by the simulated program counter, is computed. Second, C code, copied verbatim from a section of the machine description, is executed. Third, the instruction dispatch itself. This consists of an indirect function call to the semantic function specified in the TID. The return value is copied to the simulated program counter. Last, more C code, again copied verbatim from a section of the machine description, is executed.
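
Put together, the generated core loop might look roughly like this; lookup_tid() is the hypothetical helper sketched earlier, and the before/after comments stand in for the verbatim C code taken from the machine description.

    /* Main simulation loop sketch (illustrative; the real core.c is generated). */
    void run(unsigned long pc)
    {
        for (;;) {                            /* termination/halt handling omitted */
            struct tid *t = lookup_tid(pc);   /* 1. TID for the current PC         */

            /* 2. pre-dispatch code copied from before{},
                  e.g. clearing a hard-wired zero register */

            pc = t->semantic(t);              /* 3. dispatch; returns the next PC  */

            /* 4. post-dispatch code copied from after{},
                  e.g. cycle counting or delayed-branch handling */
        }
    }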

The ability to introduce arbitrary C code before and after target instruction dispatch is one of the key features that gives Sleipnir the flexibility to modify, augment or otherwise control the simulation mechanism in order to better simulate more complex target architectures.

For instance, the TI C62xx is a statically scheduled, variable width issue, VLIW-like architecture. Its instruction latencies are exposed to the programmer, e.g., a load has 4 delay slots during which the previous value of the destination register is accessible. In order to correctly simulate this architecture, both the parallel issue of instructions as well as their latencies must be modeled with full cycle accuracy. The basic Sleipnir simulation mechanism performs the simulation of instructions in strict sequence. Since the semantics of the architecture specifies parallel issue, the results and side-effects of these instructions must be buffered until the end of the current cycle, for instructions with unit cycle latency, or the end of some subsequent cycle, for instructions with more than one cycle latency, e.g. multiply instructions with 2 cycle latency. One way of implementing this is to have the semantic function of every instruction write the instruction's side-effects to a series of buffers, and then write code that is inserted after instruction dispatch to detect the end of a cycle and write back the pending values from the buffers to the register file.
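
One hedged sketch of such buffering, using assumed names and a simple per-entry latency counter rather than the actual C6201 description:

    /* Pending-write buffer for exposed-latency/VLIW semantics (illustrative). */
    #define MAX_PENDING 64

    struct pending_write {
        int           reg;     /* destination register number   */
        unsigned long value;   /* value to commit               */
        int           cycles;  /* cycles remaining until commit */
    };

    static struct pending_write wq[MAX_PENDING];
    static int n_pending;

    /* called by a semantic function instead of writing the register file */
    static void queue_write(int reg, unsigned long value, int latency)
    {
        wq[n_pending].reg    = reg;
        wq[n_pending].value  = value;
        wq[n_pending].cycles = latency;  /* 1 = commit at the end of this cycle */
        n_pending++;
    }

    /* called from the after{} code once the end of a cycle is detected */
    static void end_of_cycle(unsigned long *regfile)
    {
        int i, kept = 0;
        for (i = 0; i < n_pending; i++) {
            if (--wq[i].cycles == 0)
                regfile[wq[i].reg] = wq[i].value;  /* latency expired: write back */
            else
                wq[kept++] = wq[i];                /* still pending: keep it      */
        }
        n_pending = kept;
    }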

Code inserted before or after instruction dispatch can also be used to collect simulation statistics and profiling information, as well as to simulate the effects of interrupts.

3.4. Additional Simulation Support

In addition to the TID, there are two other important data types that are visible in the machine description. One is a structure for global state information. This structure is intended to hold both the machine state as well as any statistics counters. Some internal simulation data structures are also stored in this structure. The other is called an action (or sub-instruction) descriptor, and acts much like the TID for sub-instructions, a way of decomposing the semantics of instructions.

Simulated memory is handled by library routines. A memory access library provides an API that enables the loader to specify the address ranges of the required simulated memory. Allocating the memory and maintaining the mapping from the simulated address space to the simulator's address space is done within the library. The API includes all the load and store functions the simulator requires.
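
The memory library's interface is not specified beyond this description; a plausible sketch of such an API, with entirely hypothetical names and signatures, might be:

    /* Hypothetical simulated-memory API (names and signatures assumed). */
    #include <stddef.h>
    #include <stdint.h>

    void     sim_mem_map(uint32_t base, size_t size);  /* loader: declare a range */

    uint8_t  sim_ld8 (uint32_t addr);                   /* typed loads ...         */
    uint16_t sim_ld16(uint32_t addr);
    uint32_t sim_ld32(uint32_t addr);

    void     sim_st8 (uint32_t addr, uint8_t  v);       /* ... and stores          */
    void     sim_st16(uint32_t addr, uint16_t v);
    void     sim_st32(uint32_t addr, uint32_t v);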

The generated simulators are intended to be used with any executable file format. Therefore, the program loading function has been divided into two parts. A machine independent part is provided in a library. Its functionality is to allocate and initialize memory based on the data provided by the machine dependent part. The machine dependent part is the responsibility of the programmer. It parses and extracts data from the executable file when called by the machine independent loader.

Many tool-chains for embedded architectures include libraries that assume that the simulator provides certain basic low-level I/O functionality. The mechanisms by which such simulator functions are invoked by the simulated programs are tool-chain dependent, but may include software interrupts and function calls made to predefined addresses. Since Sleipnir is intended to generate simulators that can interoperate with different software generation tool chains, the specification and implementation of this system interface is left to the programmer. Fortunately, the Sleipnir machine description is flexible enough to make this an easy task.

4. Machine Description

4.1. Introduction

The Sleipnir machine description specifies the architecture dependent information for a simulator. It specifies the encoding and semantics of each instruction in the instruction set, as well as additional field declarations for the TID, action descriptor and global data structures. Additionally it may contain modifications or additions to the basic core simulation mechanism, such as support for VLIW issue of instructions or the collection of statistics or profiling information.

The design of the machine description emphasized simplicity and flexibility. On one hand, the machine description had to be concise and simple to write. An ISA level simulator for a small, simple architecture should only require a minimal machine description that someone familiar with the tool should be able to write very quickly. On the other hand, the machine description needed to be flexible enough to specify a very wide range of computer architectures. It had to easily support the specification of scalar, single-issue architectures, while providing few if any obstacles to specifying a complex VLIW or EPIC style architecture. The machine description had to support both fixed and variable length instructions, as well as mode selectable instruction sets like Arm/Thumb [6].

Given the broad requirements for flexibility and the necessity of a simple implementation, it was decided against using an architecture or hardware description language. Instead, a philosophy similar to that of yacc [7] was adopted. C would be used to the greatest extent possible. Additional syntactic elements would be added to impose an overall structure on the machine description, pass information directly to the simulator generator, and to express those features of the architecture that would be awkward and tedious to do in C. This approach allowed for a simple implementation of the simulator generator. Instead of having to parse and analyze a hardware description language, C code is generated, most of which is code pasted directly in from the machine description.

As a result, the semantics of the instructions, the simulated state of the architecture, as well as user defined fields of the TID and other structures are specified in C. Extra syntax was added to conveniently express the encoding of instructions, the specification and handling of bitfields, as well as providing a way of structuring the C code in a reasonable and meaningful manner.

The chosen form of the machine description has gone a long way towards meeting the design goals. For instance, it took the author less than half a day to write a working simulator for the Motorola M*Core architecture [14], capable of running executable files generated by the Green Hills C compiler. Similarly, writing simulators for both the Arm (including Thumb instructions) and the Lucent DSP1600 took less than a week each. Both simulators accepted binaries from the vendors' tool chains. The machine description format also made it very easy to write full cycle accurate simulators for both the TI C6201 and C6211 DSPs.

4.2. Machine Description Format

Figure 2 shows the overall structure of the machine description. The machine description is divided into a number of sections.

The initial simulator directives allow stricter alignment requirements than byte aligned to be specified. Specifying the correct alignment reduces the overall space requirement of the instruction store when used in the exhaustive mode.

C code written in the %%hdefs section is copied verbatim to the header file that contains the prototypes for all the semantic functions. This file is included in all the generated C source files. It is a convenient place to define types, variables and macros that should be visible across the simulator source code.

The %%cdefs section and the %%aux section contain code that is inserted before and after, respectively, the definitions of the C functions that implement the instruction semantics.

The %%global, %%action and %%runinst sections are all used to add user defined fields to three C data structures: global state, action descriptor and TID. The text written in these sections is appended to the C structure definitions for these data types. The macros GP(x), AP(x) and IP(x) are used to reference field x in the current instantiation of each of these data types, respectively.
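
A plausible expansion of these macros, assuming generated variables named global_state, current_action and current_tid (the real generated names and indirection may differ):

    /* Hypothetical expansions of the accessor macros. */
    #define GP(x) (global_state.x)     /* field of the global state structure    */
    #define AP(x) (current_action->x)  /* field of the current action descriptor */
    #define IP(x) (current_tid->x)     /* field of the current TID               */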

The %%ops section is where the instruction semantics are described. The syntax is detailed in the next section. This section is also used to specify C code that is added to the core simulation mechanism in three places: init{}, before{}, and after{}.

simulator directives
%%hdefs
    C code, copied to ops.h
%%cdefs
    C declarations copied to ops.c
%%global
    C field defs added to global data struct
%%action
    C field defs added to action struct type
%%runinst
    C field defs added to TID struct type
%%ops
    init {
        Code executed before main loop starts
    }
    before {
        Code executed just before inst. dispatch
    }
    after {
        Code executed just after inst. dispatch
    }
    Instruction descriptions start here
%%instset
    Optional additional instruction set
    May be repeated
%%aux
    C code appended to ops.c

Figure 2: Structure of the machine description.

These pieces of code are executed at simulator initialization time, before every instruction dispatch, and after every instruction dispatch, respectively. For instance, in the Mips simulator, C code that clears register zero every cycle was written in before{} so that the instruction semantics could be written with the assurance that register zero would always be zero, even following instructions that may have written to it.

Instruction semantics for additional instruction sets can be described in optional %%instset sections. One %%instset separator is required for each additional instruction set. This feature was used to implement the 16-bit Thumb mode instructions of the Arm architecture.

4.3. Instruction Descriptions

Unlike most modern RISC style general purpose architectures, embedded architectures often have ISAs with encoding schemes and instruction semantics that are hard to capture with simple descriptions. Examples include:

• Variable length instructions, e.g., [11, 12, 6, 21], including the case where shorter instructions may be valid prefixes of longer instructions [13].

• Immediates, signed and unsigned, that are split over multiple non-contiguous bitfields [13].

• Large number of instruction formats, e.g. [12], which has over 30.

• Register specifiers that depend on prefixes to select register banks [13].

/* register number i is GP(gp[i]) */
addi: 0b_001000_sssss_ttttt_i{16}
    decode {
        /* optional */
    }
    {
        GP(gp[$t]) = GP(gp[$s]) + $~i;
    }

Figure 3: Simple instruction description.

This means that the description of the encoding and semantics of an ISA must be flexible enough to allow expression of very irregular architectures, yet remain easy and intuitive to use. The instruction description syntax in Sleipnir satisfies these goals.

The Sleipnir instruction description consists of three parts: encoding, C source for an optional decode action, and C source for instruction semantics.

Instruction encodings are specified by strings of the form 0bX, where X is a string of binary digits (0, 1), unnamed don't cares (?), and/or named don't cares ([a-zA-Z]). A digit or symbol may be repeated n times by following it with {n}. Underscores may be used to visually separate fields of the instruction, but have no other meaning. Named don't cares are bit fields that can be referenced within the C code of the decode or semantic action of the instruction by prepending a $ to their name. Multiple occurrences of the same named don't care in the encoding string are allowed as long as they are contiguous, and are automatically concatenated when the field is referenced. Non-contiguous bitfields can be concatenated by referencing the concatenation of their names. Binary constant values can also be concatenated with named fields. As an example, $a0b refers to the unsigned value of the concatenation of bitfield a, the constant 0 and bitfield b. The sign extended value is referenced by following the $ immediately by a "~", i.e., $~a0b.

The decode action is used to specify C code that is executed during decode when the bit pattern of the instruction description has matched the instruction word(s) that are being decoded. It can be used to compute values, initialize fields in the TID, or perform any other desirable side-effects. It is particularly useful in conjunction with the multiple match decode feature of Sleipnir.

The C code specified as the instruction semantics is executed exactly as specified, except that bitfield references are replaced by variable references. In addition, C code is appended to the semantic function to return the PC of the next sequential instruction by default. For many architectures, branches can be implemented by explicitly returning the value of the new PC instead. For other architectures with delayed branches, such as Mips, code can be written in the after{} action to modify the PC appropriately.

Figure 3 shows the description for the Mips addi instruction. The instruction encoding is divided into 4 fields. The first field is a constant that specifies the opcode. The next two fields are 5-bit values that encode the source and destination register operands (named s and t respectively). The last field is a signed immediate value. Notice that the reference to i in the instruction semantics uses "~" to reference the sign-extended value.
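
For illustration, the semantic function Sleipnir might generate from this description could look roughly as follows; the local variable names, the use of t->addr, and the GP() expansion are assumptions, but the body and the appended default return correspond to the description in Figure 3.

    /* Sketch of a generated semantic function for addi (illustrative only). */
    unsigned long sem_addi(struct tid *t)
    {
        int  s = t->rs;             /* $s: source register field            */
        int  d = t->rt;             /* $t: destination register field       */
        long i = t->imm;            /* $~i: sign-extended immediate         */

        GP(gp[d]) = GP(gp[s]) + i;  /* body copied from the description     */

        return t->addr + 4;         /* appended default: next sequential PC */
    }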

4.4. Sub-Instructions

In order to make it simpler to write simulators for some architectures it is necessary to provide the means for decomposing the semantics of an instruction into multiple sub-instructions. This is particularly useful if the sub-instructions can be reused across multiple instructions. For instance, in the Arm architecture a number of the vanilla data processing instructions (e.g., ADD, SUB, etc.) pre-process one of the operands through a shifter [6] in what is called addressing mode 1. The input to the shifter and the specification of one of its 11 operations is determined by 13 bits in the affected Arm instructions - bit 25 and the 12 least significant bits (11..0). It is desirable to factor out this common part of these instructions into a set of sub-instructions, or in Sleipnir terminology, actions.

Sleipnir facilitates this decomposition by extending the use of the instruction description format to include named scopes. A set of actions can be grouped in a scope by putting the name of the scope, enclosed by square brackets, directly following the instruction identifier. All actions (sub-instructions) within a scope must have the same word width. Other than the scope identifier, the instruction description for an action has all the same features of a regular instruction description.

A scope can be referenced within the semantics section of an instruction description using a syntax similar to a C function call. The scope identifier, with an "@" prefix, acts as the "function name". The "parameter" is a bitfield reference. A particular action is bound to the scope reference upon decoding the referencing instruction when the parameter of the action reference is matched with the encoding of an action. A separate decode function is generated for each named scope.


/* Sub-instructions, addressing mode 1 */

immediate[mode1]: 0b_1_i{4}_I{8}
    {
        GP(sh_out) = rotate_right($I, $i0);
        if ($i == 0)
            GP(sh_C) = GP(C_flag);
        else
            GP(sh_C) = (GP(sh_out) >> 31) & 0x1;
    }

reg_left_shift[mode1]: 0b_0_s{5}_000_m{4}
    {
        GP(sh_out) = GP(reg[$m]) << $s;
        if ($s == 0)
            GP(sh_C) = GP(C_flag);
        else
            GP(sh_C) = (GP(reg[$m]) >> (32 - $s)) & 0x1;
    }

/* Arm instructions */

and: 0b_cccc_00_I_0000_S_n{4}_d{4}_o{12}
    {
        if (IP(cond)()) {   /* predicate */
            @mode1($Io);
            GP(reg[$d]) = GP(reg[$n]) & GP(sh_out);
            ...
        }
    }

eor: 0b_cccc_00_I_0001_S_n{4}_d{4}_o{12}
    {
        if (IP(cond)()) {   /* predicate */
            @mode1($Io);
            GP(reg[$d]) = GP(reg[$n]) ^ GP(sh_out);
            ...
        }
    }

Figure 4: Decomposition of instruction semantics using scoped sub-instructions (actions).

If an instruction description references a scope and no action is successfully bound to that reference, the decode fails, just as if the bit-pattern of the description failed to match the current instruction word(s). The actual dispatch of the action is performed through an indirect function call.

An action can in turn make references to other named scopes, though recursive references, direct or indirect, are not allowed.

Figure 4 shows the use of actions to handle the Arm shifter operations.

5. Results

Table 1 shows the size of the machine descriptions and the simulation speeds (running simple loops) of the five simulators that were written entirely by the author. The experiments were performed on a 250 MHz Mips R10000 based machine.

Table 1: Architecture, src size (lines), Sim. speed, Accuracy.

The simulators represent a wide range of computer architectures. The Mips simulator implements the integer portion of the Mips-I (R2000/R3000) instruction set [8]. M*Core is a simple, load/store RISC-like, 32-bit microcontroller architecture with 16-bit instructions from Motorola [14]. The Arm architecture [6] is a well known 32-bit microcontroller architecture, which also includes the 16-bit Thumb mode instructions. The TI C6201 DSP implements an 8-wide VLIW-like architecture [20]. The Lucent DSP1600 is a 16-bit DSP with a non-orthogonal instruction set architecture and exposed pipeline [11].

Table 1 clearly confirms three important features of Sleipnir. First, Sleipnir is general enough to generate instruction-level simulators for a wide range of architectures, even providing cycle accurate simulation of some processor implementations.

Second, the sizes of the machine descriptions are quite small. This attests to the simplicity and speed with which architectures can be expressed in Sleipnir.

Third, the simulation speeds of the generated simulators are quite high, especially considering the relatively slow speed of the host processor. In fact, when running on a 533 MHz Compaq Alpha workstation, the C6201 simulator was capable of simulating the GSM-EFR decoder [4] fast enough to decode speech in real-time. The equivalent Texas Instruments simulator, by comparison, only reaches a speed of 8,000 instructions per second on a 400 MHz Pentium II², and reports far fewer statistics.

In terms of validation, the correctness of these simulators was validated for the most part by verifying the correct execution of reasonably complex C programs, comparing the simulator output to the output of vendor and/or third party simulation tools. More rigorous validation would of course be possible with the availability of appropriate sets of test vectors.

²The simulation speed increases dramatically (100x) if memory bank conflicts are ignored, though the simulator is then no longer cycle accurate.


6. Conclusion

This paper has presented Sleipnir, a simulator generation tool, aimed at providing a convenient level of abstraction for writing fast and accurate instruction-level simulators in a short amount of time.

Fast instruction-level simulators have great utility in software development for embedded systems. They offer greater convenience than hardware platforms, more detailed statistics and profiling information, and can be made available prior to any working silicon. Moreover, these simulators are also great tools for architectural exploration.

The Sleipnir machine description provides a great deal of flexibility to express the variety of architectural features found in embedded processors, from VLIW issue to compactly encoded, variable length instructions. The five simulators presented in this paper clearly confirm this. Moreover, the small size and short time needed to write several of these simulators attests to the ease with which these instruction-level simulators can be written.

The speed at which these generated simulators are capable of running far surpasses that of many vendor supplied simulators. This makes them very desirable for use in software development. The short simulation time aids in shortening the development cycle. The availability of the machine description source allows for far more detailed profiling and performance statistics gathering than what is available from most vendor or third party supplied simulators.

Acknowledgements

I would like to thank Jesse Thilo and Kent Wires of Lucent Microelectronics for their help in shaping Sleipnir's development and improving its portability, as well as for being very patient users of early versions of the tool.

References

[1] P.J. Ashenden. The Designer's Guide to VHDL. Morgan Kaufmann Publishers, Inc., 1996.

[2] D. Burger and T.M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, University of Wisconsin-Madison, June 1997.

[3] T.M. Conte and C.E. Gimarc, editors. Fast Simulation of Computer Architectures, chapter 2. Kluwer Academic Publishers, 1995.

[4] European Telecommunications Standards Institute. Digital Cellular Telecommunications System (Phase 2+); Enhanced Full Rate (EFR) Speech Transcoding (GSM 06.60), March 1997.

[5] European Telecommunications Standards Institute. Digital Cellular Telecommunications System (Phase 2+); Adaptive Multi-Rate (AMR) Speech Transcoding (GSM 06.90), December 1999.

[6] D. Jaggar, editor. ARM Architecture Reference Manual. Prentice Hall, 1996.

[7] S.C. Johnson. Yacc - yet another compiler compiler. Computing Science Technical Report 32, AT&T Bell Laboratories, Murray Hill, NJ, 1975.

[8] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.

[9] F. Larsson. Generating efficient simulators from a specification language. Technical report, Swedish Institute of Computer Science, 1997.

[10] J.R. Larus and E. Schnarr. EEL: Machine-independent executable editing. In Proceedings of the Conference on Programming Language Design and Implementation, pages 291-300, 1995.

[11] Microelectronics Group, Lucent Technologies. DSP1611, DSP1617, DSP1618, DSP1627 Information Manual, February 1996.

[12] Microelectronics Group, Lucent Technologies. DSP16210 Digital Signal Processor, October 1998.

[13] Microelectronics Group, Lucent Technologies and Motorola Inc. SC140 DSP Core Reference Manual, December 1999.

[14] Motorola, Inc. M*Core Programmer's Reference Manual.

[15] S. Onder and R. Gupta. Automatic generation of microarchitecture simulators. In 1998 IEEE International Conference on Computer Languages, pages 80-89, May 1998.

[16] S. Pees, A. Hoffman, V. Zivojnovic, and H. Meyr. LISA - machine description language for cycle-accurate models of programmable DSP architectures. In 36th Design Automation Conference, pages 933-938, May 1999.

[17] S. Pees, V. Zivojnovic, A. Ropers, and H. Meyr. Fast simulation of the TI TMS320C54x DSP. In International Conference on Signal Processing Applications and Technology, pages 995-999, September 1997.

[18] N. Ramsey and M.F. Fernández. Specifying representations of machine instructions. ACM Transactions on Programming Languages and Systems, 19(3):492-524, May 1997.

[19] Texas Instruments. TMS320C6211 Fixed-Point Digital Signal Processor - Product Preview, August 1998.

[20] Texas Instruments. TMS320C62x/C67x CPU and Instruction Set Reference Guide, March 1998.

[21] Texas Instruments. TMS320C54x DSP Reference Set, April 1999.

[22] D.E. Thomas and P.R. Moorby. The Verilog Hardware Description Language. Kluwer Academic Publishers, second edition, 1995.

[23] J.E. Veenstra and R.J. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 201-207, January 1994.
