IEEE 2011 International Conference on Energy Aware Computing (ICEAC), Istanbul, Turkey



Multiple Instruction Sets Architecture (MISA)

Hussein Karaki and Haitham Akkary
Department of Electrical and Computer Engineering
American University of Beirut, FEA
Beirut, Lebanon
{hak40,ha95}@aub.edu.lb

Abstract — In the computer hardware industry, there are currently two highly successful instruction set architectures (ISAs): the CISC X86 ISA, an established standard in the personal computer and server markets, and the RISC ARM ISA, currently used in many ultra-mobile computing devices such as smart-phones and tablets. Platforms that run one standard ISA cannot run the other ISA's application binaries without recompiling the source code. We are investigating the technical feasibility of designing an energy-efficient multiple instruction sets architecture (MISA) processor that can run both X86 and ARM binaries. We propose an approach in which special decoders interpret the binary instructions of the running ISA and translate them to a native target machine ISA that executes within the processor pipeline. We discuss the completed initial stage of our work, the design of XAM, an X86 hardware binary interpreter for a MISA processor that runs native ARM instructions, and describe the design in detail. We present performance and energy simulation results of our MISA processor design for a set of synthetic benchmarks, including Dhrystone 2.1, measured using the ARM SimpleScalar microarchitecture and power simulators. We also discuss design issues of an ARM-to-X86 hardware interpreter we are currently developing. We expect the completed X86-to-ARM design and the ongoing ARM-to-X86 work to lay a foundation for a well-optimized processor with a new native ISA that can efficiently run both X86 and ARM binaries using direct hardware interpretation.

I. INTRODUCTION

In the application development industry, applications are said to be architecture dependent because their source code is compiled and assembled into binaries for a specific target platform architecture. In order to make precompiled binaries compatible across different architectures, i.e. cross-platform binary compatible, we require a layer that handles binary code translation from a source to a target architecture. This layer can be implemented in software or in hardware.

There are two approaches to achieving cross-platform binary compatibility [3]. The first is "Interpretation", which maps each instruction from the source ISA to its equivalent in the target ISA at runtime and without caching (i.e. without using a software cache for translated code). The second approach is "Translation", which can be dynamic or static. Dynamic Translation is similar to Interpretation, but a cache is used to save and reuse pieces of translated code, whereas Static Translation performs the translation offline and thus can apply more rigorous optimizations.
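The distinction can be sketched in a few lines. Assuming a toy `translate_one` mapping (our own placeholder, not from the paper), the only difference between interpretation and dynamic translation is the translation cache:

```python
def translate_one(src_instr):
    """Map one source-ISA instruction to its target-ISA equivalent(s)."""
    return [("NATIVE", src_instr)]  # placeholder 1:1 mapping

def interpret(program):
    # Interpretation: re-translate every instruction on every visit,
    # with no caching of translated code.
    executed = []
    for instr in program:
        executed.extend(translate_one(instr))
    return executed

def dynamic_translate(program):
    # Dynamic translation: identical mapping, but translated code is
    # saved in a translation cache (T-Cache) and reused on later visits.
    tcache = {}
    executed = []
    for instr in program:
        if instr not in tcache:
            tcache[instr] = translate_one(instr)
        executed.extend(tcache[instr])
    return executed, len(tcache)
```

For a loop that revisits the same instructions, the cache fills once and is reused on every later iteration, which is where dynamic translation recovers its translation cost.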

The different implementations of the aforementioned approaches can be roughly partitioned into two categories: software and hardware. Some, such as HP's DYNAMO and IBM-SNU's LATTE [2], are completely software-based. These systems achieve cross-platform binary compatibility by emulating the target architecture using a layer of software called a Virtual Machine Monitor (VMM), also known as Code Morphing Software (CMS) in some references. The VMM is installed on top of the host's operating system and manages the binary translation. Others, such as IBM's DAISY [7] and TRANSMETA's CRUSOE [2], are hardware-aided; that is, hardware components are integrated within the platform in order to support and accelerate a binary translation process orchestrated by firmware.

Even though software translation, hardware-aided or not, provides a great deal of flexibility, it comes at the expense of memory, execution resources, performance, and complexity. Several issues related to software binary translation have to be overcome [2]. We list a few here:

- Address Translation: The virtual machine monitor computes addresses in the translated code that differ from the original ones used by the source application binary.

- Self-Referential Code: Some programs checksum themselves to guarantee correctness. The virtual machine monitor must handle this behavior properly, which is very difficult.

- Translation Cache Management: Binary translation using a virtual machine usually involves a Translation Cache (T-Cache), which adds complications related to managing the cache's efficiency and space.

- Real-Time Behavior: Time-critical code segments are problematic for the VMM because the execution time of translated code often differs from that of the original code.

- Platform Resources: The VMM needs memory resources and additional execution cycles from the platform on which it is installed.

In order to avoid the complications that arise from binary translation, researchers have proposed Virtual Instruction Set Architectures (V-ISA) [1][11]. In this approach, all user and operating system code target the V-ISA. An implementation-specific layer of software is co-designed with the hardware to translate the V-ISA to the implementation processor ISA (I-ISA). The hope is that many of the complications associated with software binary translation and optimization can be avoided by making the V-ISA independent of any specific hardware processor implementation and by including program information that is usually discarded in code compiled for specific hardware.

Although Virtual ISA is an elegant solution for future software and for compatibility issues that arise as hardware capabilities expand over time, V-ISA does not help with existing binaries compiled for different standard ISAs when cross-platform compatibility is desired.

978-1-4673-0465-8/11/$26.00 ©2011 IEEE

We propose another approach for cross-platform binary compatibility in which special hardware decodes the binary instructions of the running source ISAs and interprets them to a native target machine ISA binary code that executes within the MISA processor pipeline. As an example, one hardware interpreter/decoder may transform at run time X86 binaries [6] into native target machine code, while another interpreter/decoder does the same for an ARM application binary [8]. The target machine native ISA could be X86, ARM or some other ISA optimized specifically to run the multiple desired source ISAs. Our MISA approach provides lower cost and, with simultaneous multithreading [12], better energy efficiency than two heterogeneous ARM and X86 cores. To the best of our knowledge, no pure hardware implementation to perform binary translation similar to our approach has been proposed before.

The rest of the paper is organized as follows. Section II describes our MISA architecture. Section III provides a description of our approach and some key challenges that we faced, along with the solutions. In section IV, we describe our methodology. We then present performance and power simulation results in section V. We discuss ARM-to-X86 hardware interpreter issues in section VI and conclude the paper in section VII.

Figure 1: XAM system components within ARMv5 architecture

Figure 2: XAM Detailed Design

II. MISA PROCESSOR OVERVIEW

Figure 1 shows how the XAM hardware interpreter fits within the ARMv5 processor architecture. Depending on the execution mode, ARM instructions fetched from the instruction cache proceed to the execution core pipeline through a simple ARM I-decode block. In X86 execution mode, bytes fetched from the instruction cache go to a variable-length X86-decoder, which identifies the length of each instruction, extracts the instruction bytes, and then forwards them to an X86-to-ARM interpreter block (XAM).

Instead of executing either ARM or X86 binaries using two different modes, another option is to execute both ARM and X86 binaries concurrently using simultaneous multithreading [12]. This provides the most energy-efficient performance due to the full resource sharing and utilization of all the hardware blocks in the processor.

Figure 2 shows a diagram of the XAM block. The XAM block is conceptually similar to the decoders implemented in Intel X86 processors, which translate X86 macro-instructions into one or more RISC-like micro-operations (uops) in order to fully exploit instruction-level parallelism in the superscalar processor pipeline [9]. Our XAM decoder contains three translation tables, D1, D2 and D3, which transform an X86 instruction into 1, 2, or 3 equivalent ARM instructions, depending on the X86 instruction's complexity. If an X86 instruction requires more than 3 ARM instructions, it is executed as a macro from the special microcode sequencer block M, also shown in Figure 2. For example, due to the X86 complex memory-register addressing modes and the stack organization of the floating-point registers, we had to translate the X86 floating-point instruction "FADDP memory" into 17 equivalent ARM instructions.
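The dispatch among D1, D2, D3 and the microcode sequencer M can be sketched as follows. The table contents below are illustrative placeholders of our own, not the actual XAM entries:

```python
# Hypothetical translation-table contents, keyed by decoded X86 operation.
D1 = {"INC reg": ["ADD Rd, Rd, #1"]}                          # 1 ARM instruction
D2 = {"ADD mem, reg": ["LDR R10, [Rs]",
                       "ADDS Rd, Rd, R10"]}                   # 2 ARM instructions
D3 = {"XCHG mem, reg": ["LDR R10, [Rs]",
                        "STR Rd, [Rs]",
                        "MOVE Rd, R10"]}                      # 3 ARM instructions
MICROCODE_M = {"FADDP mem": ["<arm-uop>"] * 17}               # >3: macro from M

def xam_translate(x86_instr):
    # Probe the fixed tables in order of increasing translation length;
    # anything needing more than 3 ARM instructions executes as a
    # macro from the microcode sequencer block M.
    for table in (D1, D2, D3):
        if x86_instr in table:
            return table[x86_instr]
    return MICROCODE_M[x86_instr]
```

The point of the three fixed tables is that common X86 instructions expand into a small, bounded number of ARM instructions, so only rare complex cases pay the microcode sequencing cost.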

III. XAM HARDWARE BINARY INTERPRETER

In this work, the XAM hardware binary interpreter handles user-level 32-bit X86 instructions, while privileged instructions are handled by the VMM software. For simplicity, we assume a flat memory model, i.e. no memory segmentation, with all segments set to span the full physical memory space; this is not a limitation for most new X86 code.

Moreover, in our design we do not modify the original ARM architecture. Hence, we do not add any additional registers or any new customized instructions to the original ARM ISA. The only additions to the ARM architecture are the X86-decoder and XAM blocks.

We faced and resolved many translation challenges during the design. Due to space limitations, we describe only a few of the significant translation issues we faced and present a translation example for each case.

A. Register Mapping

In order to use the same register file for both ARM and X86 binaries, we had to map the X86 registers onto ARM registers. The X86 architecture has a total of 16 registers: 6 "general-purpose" registers, the "Stack Pointer" register, the "Base Pointer" register, the "Flag" register, an inaccessible "Instruction Pointer" register, and 6 "Segment" registers.

On the other hand, the ARM architecture has a total of 16 accessible registers: thirteen "general-purpose" registers and three registers with special functions, namely the "Link" register (R14), the "Stack Pointer" register (R13), and the "Program Counter" register (R15).

The register mappings between the two architectures implemented in XAM are shown in Table 1. Registers R10 and R11 are used as temporary registers when needed (e.g. for indirect memory addresses, for extracting a byte or word from an operand, and so on). Registers R6, R7, and R8 are reserved for future use as temporary registers. The Link Register (R14) has no equivalent in the X86 ISA and is therefore not present in the mapping table. Furthermore, this particular mapping was chosen to avoid compatibility issues when handling system calls in the ARM simulator.

X86 Registers          ARM Mapped Registers
EAX / AX / AH, AL      R0
EBX / BX / BH, BL      R5
ECX / CX / CH, CL      R4
EDX / DX / DH, DL      R3
EDI / DI               R1
ESI / SI               R2
EBP / BP               R9
ESP / SP               SP (R13)
EIP / IP               PC (R15)

Table 1: X86-to-ARM mapped registers
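Table 1 can be read as a simple lookup. The `map_reg` helper below is our own illustrative addition for resolving the 16-bit and 8-bit sub-register aliases to the same mapped ARM register; it is not part of the XAM design:

```python
# Table 1 as a lookup structure (32-bit names only).
X86_TO_ARM = {
    "EAX": "R0", "EBX": "R5", "ECX": "R4", "EDX": "R3",
    "EDI": "R1", "ESI": "R2", "EBP": "R9",
    "ESP": "R13",  # SP
    "EIP": "R15",  # PC
}

def map_reg(name):
    """Resolve an X86 register name to its Table 1 ARM register."""
    if name in X86_TO_ARM:
        return X86_TO_ARM[name]
    if len(name) == 2 and name[1] in "HL":      # AH, AL, BH, BL, CH, CL, DH, DL
        return X86_TO_ARM["E" + name[0] + "X"]
    return X86_TO_ARM["E" + name]               # AX, BX, CX, DX, SI, DI, BP, SP, IP
```

Since every alias of a register resolves to the same ARM register, byte and word accesses must be emulated with shifts and masks on that register, which is one of the uses of the R10/R11 temporaries mentioned above.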

B. X86 to ARM Translation Issues

We describe in this section a few of the translation issues we have encountered and our chosen solutions.

Example 1: ARM employs a load-store architecture [10], which means that instructions that process data operate only on registers. To access memory contents, load and store instructions must be used to retrieve data from or save data to memory locations. The X86 architecture, on the other hand, allows memory locations to be used as operands in data processing instructions. Table 2 shows a translation example of an X86 "add" instruction with a memory source operand. In the example, the memory operand is identified by the parentheses around a source register.

X86 pseudo-instruction        ARM equivalent instruction(s)
ADD (Source), Destination     LOAD Temp, [Source]
                              ADDS Destination, Destination, Temp

Table 2: Translation of ADD instruction with a memory operand
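The Table 2 rewrite generalizes to any X86 ALU operation with a memory source operand: load the operand into one of the scratch registers (the design reserves R10/R11 for this), then issue the register-only, flag-setting ARM form. The function below is our own sketch of that pattern, not XAM's actual logic:

```python
def translate_mem_alu(op, src_mem, dest):
    # Load-store split: the memory operand is first fetched into a
    # scratch register from the XAM register mapping (R10 here), then
    # the ALU op runs register-only; the "S" suffix sets the CPSR flags
    # to mirror X86 flag behavior.
    temp = "R10"
    return [
        f"LOAD {temp}, [{src_mem}]",
        f"{op}S {dest}, {dest}, {temp}",
    ]
```

The same split applies to SUB, AND, OR, etc.; only instructions with a memory *destination* additionally need a trailing store.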

Example 2: Flag register and conditional branching. The ARM architecture does not have a Flag Register (FR); instead it has a Current Program Status Register (CPSR). Although ARM's CPSR has some flags in common with the X86 FR (i.e. Negative, Zero, Carry and Overflow), it behaves differently. Contrary to X86, ARM updates the condition flags only after comparison operations or after ALU operations whose instruction carries the "s" suffix [10]. For example, in X86 assembly "TEST reg, reg" is equivalent to "AND reg, reg", and both have the exact same effect on the flag register. In ARM assembly, however, "TST reg, reg" is not equivalent to "AND reg, reg, reg", because the former updates the CPSR flags whereas the latter does not. By adding the "s" suffix to the "and" instruction, "TST reg, reg" becomes equivalent to "ANDS reg, reg, reg". Due to this issue, we had to explicitly apply the "s" suffix to some of the translated ARM instructions, such as AND, ADD, SUB, OR, XOR, etc., in order to keep cross-platform "equivalent" flag states.

Example 3: Segment Registers. A 16-bit segmented memory model was first introduced with the 8086 architecture in order to extend memory addressing to 20-bit addresses. From the 80386 onward, X86 processors have offered both 16-bit and 32-bit segmented memory models [6], in which six Segment Registers, each 16 bits long, index the Global Descriptor Table (GDT), which contains the actual 32-bit starting address of each segment. A 32-bit offset is then added to that starting address to compute the address of the memory operand [6] (see Figure 3). The ARM architecture has no segmented memory model, so to handle this situation we used an X86 "flat memory" model, which sets all segment base registers to 0. With the base value removed from the address computation, the segment offset becomes the actual memory operand address.
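The address computation and its flat-model simplification can be illustrated as follows; the GDT contents here are made-up values for the sketch:

```python
def effective_address(gdt, selector, offset, flat=False):
    # Segmented mode: the 16-bit segment register selects a GDT entry
    # holding the 32-bit segment base, and the 32-bit offset is added.
    # Flat mode forces every base to 0, so the offset alone is the
    # operand address.
    base = 0 if flat else gdt[selector]
    return (base + offset) & 0xFFFFFFFF

# Hypothetical descriptor table: selector 0x08 -> base 0x00400000.
gdt = {0x08: 0x00400000}
```

With `flat=True`, the GDT lookup drops out entirely, which is why the flat model lets XAM ignore the segment registers altogether.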

Example 4: Call and return from subroutines. When an X86 processor executes a call instruction, the address of the instruction that follows the call is pushed onto the stack, and the processor then transfers control to the subroutine code [6]. In contrast, an ARM processor puts the return address in a register called the "Link Register" (LR) [10] before transferring control. Similarly, when an X86 processor executes a return instruction, it pops the address stored on the stack, loads it into the Instruction Pointer (EIP), and transfers control back from the subroutine to the calling point. An ARM processor, on the other hand, loads the value saved in the "Link Register" into the "Program Counter" (PC). Table 3 shows our translation for subroutine entry and exit.

Figure 3: X86 address computation using Segment Register

Subroutine     X86 pseudo-instruction     ARM equivalent instructions
Entry point    PUSH EBP                   PUSH {R9, LR}
               MOVE ESP, EBP              MOVE R9, SP
Exit point     LEAVE (MOV ESP, EBP;       MOVE SP, R9
               POP EBP)                   POP {R9}
               RET                        POP {PC}

Table 3: Translation example for entry and exit point of subroutine

Example 5: Floating-point stack. The X86 architecture has a stack of 8 registers used by floating-point instructions. The ARM architecture does not have a floating-point stack; instead, ARM supports floating-point operations such as MVFD, LDFS, STFS, ADFE, FDVS, etc., using a floating-point register file of size eight (F0 - F7). In order to handle the X86 floating-point register stack, we decided, when translating each of the X86 floating-point instructions that push (FLD, FILD, etc.) or pop (FSTP, etc.) the stack, to emulate the 8-register stack push and pop by adding seven instructions, equal to the number of registers minus one. The 7 additional instructions maintain a "logical" floating-point register stack without changing the ARM floating-point register file hardware or requiring new special instructions. Table 4 shows a translation example of how we emulated push and pop with 7 move instructions each.

X86 Instruction       ARM equivalent instructions
PUSH Floating-point   MoveFloat F7, F6
                      MoveFloat F6, F5
                      MoveFloat F5, F4
                      MoveFloat F4, F3
                      MoveFloat F3, F2
                      MoveFloat F2, F1
                      MoveFloat F1, F0
POP Floating-point    MoveFloat F0, F1
                      MoveFloat F1, F2
                      MoveFloat F2, F3
                      MoveFloat F3, F4
                      MoveFloat F4, F5
                      MoveFloat F5, F6
                      MoveFloat F6, F7

Table 4: Translation of the X86 Floating-point stack instructions
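The seven-move sequences in Table 4 follow a regular pattern, generated here as a sketch; "MoveFloat" stands in for the ARM register-to-register floating-point move used in the translation:

```python
def fp_stack_moves(push):
    # Push shifts F6..F0 up into F7..F1, freeing F0 for the new value;
    # pop shifts F1..F7 down into F0..F6, discarding the old top (F0).
    if push:
        return [f"MoveFloat F{i + 1}, F{i}" for i in range(6, -1, -1)]
    return [f"MoveFloat F{i}, F{i + 1}" for i in range(7)]
```

Either direction costs exactly seven moves, which is why FADDP-style instructions that also touch memory expand so far (17 ARM instructions in the example cited earlier).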

Example 6: Handling immediate operands. The X86 instruction set architecture has instructions that can operate directly on 32-bit immediate values, which is possible because of the variable-length instruction format of the X86 ISA. ARM is a fixed-length instruction ISA, and an immediate operand value must fit within the 32-bit fixed-length instruction. ARM data processing instructions use a total of 12 bits to specify an immediate operand: 8 bits encode a constant, and the other 4 bits rotate that constant to generate the final 32-bit immediate value. The ARM ISA therefore supports the full 32-bit immediate range, but with reduced precision.

In order to overcome this incompatibility and to translate X86 instructions that have long immediate operands, we used the program-relative “LDR” instruction provided in the ARM ISA to load the 32-bit immediate value into the destination register. Table 5 shows such translation.

X86 pseudo-instruction        ARM equivalent instruction
Move $number, Destination     LDR Destination, [PC, #offset]

Table 5: Translating X86 instructions with Immediate Value
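The encodability rule behind this fallback can be checked directly: an ARM data-processing immediate is valid only if some even rotation reduces the value to 8 bits. The helpers below are our own sketch; `#offset` stands for the literal-pool offset an assembler would fill in:

```python
def arm_immediate_encodable(value):
    # A value is a valid ARM data-processing immediate if it equals an
    # 8-bit constant rotated right by an even amount, i.e. some even
    # left-rotation of the value fits in 8 bits.
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        rotated = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if rotated < 256:
            return True
    return False

def load_immediate(dest, value):
    # Encodable constants use a plain MOV; everything else falls back
    # to the program-relative literal load shown in Table 5.
    if arm_immediate_encodable(value):
        return [f"MOV {dest}, #{value:#x}"]
    return [f"LDR {dest}, [PC, #offset]"]
```

Most small constants and aligned masks (0xFF, 0xFF000000, ...) are encodable; arbitrary 32-bit values like 0x12345678 are not, and those are the ones that force the literal-pool load.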

Example 7: Microcode routines. Several X86 instructions, such as IDIV, IMUL, and instructions that use complex addressing, have no equivalent within the ARMv5 ISA. In order to translate these instructions, we wrote microcode routines, stored in the microcode sequencer block M of XAM's decoder, that emulate the same operations. Furthermore, we used the same technique to translate X86 conditional instructions such as SETEQ, SETNE, SETG, etc., which have no counterpart in the ARM architecture. Table 6 shows the microcode routine for the SETEQ instruction.

X86 pseudo-instruction    ARM equivalent instruction(s)
SETEQ Destination         MOVNE Temp, #0
                          MOVEQ Temp, #1
                          MOVE Destination, Destination, LSR #8
                          MOVE Destination, Destination, LSL #8
                          ADD Destination, Destination, Temp

Table 6: Translation of the X86 conditional instruction in which the destination is a byte
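The LSR/LSL pair in Table 6 clears only the low byte of the destination, matching the X86 SETcc semantics of writing a byte subregister while leaving the upper bytes intact. A quick check of this behavior (our own sketch, modeling the register as a 32-bit integer):

```python
def setcc_byte(dest, cond):
    # MOVEQ Temp, #1 / MOVNE Temp, #0: materialize the condition as 0 or 1.
    temp = 1 if cond else 0
    # LSR #8 then LSL #8: clear byte 0, preserving the upper 24 bits.
    dest = ((dest >> 8) << 8) & 0xFFFFFFFF
    # ADD: merge the 0/1 flag value into the cleared low byte.
    return dest + temp
```

Because the low byte is cleared before the add, the result is exact regardless of what the byte previously held.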

IV. SIMULATION METHODOLOGY

In order to assess our MISA XAM interpreter, we conducted a performance evaluation and a power simulation. For the performance evaluation, we used the ARM sim-outorder model from the SimpleScalar simulation infrastructure [4]. To compare the energy consumption of our MISA processor to the ARM baseline architecture, we used Sim-Panalyzer [5], an augmentation of the SimpleScalar performance simulator that estimates the power consumption of a StrongARM-like (SA-1100) processor [14]. The power simulator accounts for the switching capacitance of the various microarchitecture blocks and the logic activity factor provided by a cycle-accurate microarchitecture model. In addition to the simulation tools, we utilized 5 synthetic benchmarks: Dhrystone 2.1 [13], a simple calculator, a linked-list builder, a binary-tree builder, and an array manipulator. We compiled each benchmark twice using GCC 4.4.2: once to X86 and once to ARM. We ran the resulting ARM binaries on both the ARM SimpleScalar processor simulation model and the power estimation model. Afterward, we executed the X86 binaries on the same models, but with the XAM interpreter option enabled.

Our SimpleScalar simulation models execute user code directly and handle privileged system calls by proxy on the simulation host machine [4]. This emulates our target universal platform architecture, which will use a VMM software layer to execute privileged instructions.

V. SIMULATION RESULTS

A. Performance Simulation Results

We report in this section the following performance metrics: total number of committed instructions, execution time in clock cycles, and the mapping type distribution for both the ARM binaries and the X86 translated binaries.

Benchmark                       Cross-compiled    X86-to-ARM
                                to ARM Binary     translated Binary
Dhrystone2.1 (1000 iterations)  1020360           985783
SimpleCalculator                144288            144312
BasicLinkedList                 165605            165687
BinaryTreesOps                  314050            315575
ArrayManipulate                 129804            138484

Table 7: Total number of committed instructions

Table 7 shows the total number of committed ARM instructions for each ARM cross-compiled binary vs. the corresponding X86 translated binary. In most cases the translated binaries have a slightly higher instruction count. This is what we expected, since we anticipated the compiler-generated ARM code to be better than the X86 translated code, given that we did not focus much on optimizing the manually generated translation mappings of the X86 instructions. Surprisingly, the Dhrystone translated code is slightly smaller than the ARM cross-compiled code, indicating that our choice of translation mappings is reasonably optimized, at least as measured by the translated ARM instruction count.

In Table 8, we present the execution time of each benchmark, measured in cycles. Two factors influence this metric: 1) the number of simulated instructions and 2) the type of instructions chosen. Even though we made several optimizations during the XAM translator design regarding the choice of instructions, which significantly reduced the number of generated instructions, this was not our primary goal. Still, the performance of the translated binaries is very close to that of the ARM binaries, validating the feasibility of our translation approach.

Benchmarks                      Cross-compiled    X86-to-ARM
                                to ARM Binary     translated Binary
Dhrystone2.1 (1000 iterations)  690123            661566
SimpleCalculator                117576            118045
BasicLinkedList                 144191            147940
BinaryTreesOps                  241134            241560
ArrayManipulate                 117663            116205

Table 8: Execution Time (in Cycles)

Benchmarks          X86-to-ARM Instruction Mapping Types
                    One-to-one   One-to-two   One-to-three   Microcode
SimpleCalculator    238          72           33             4
BasicLinkedList     1814         419          83             1
BinaryTreesOps      3797         865          87             1
ArrayManipulate     2564         1037         533            2

Table 9: Statistics of mapping type on dynamic execution

In Table 9, we show statistics on the different types of mappings for 4 benchmarks. The results show that most instructions require a one-to-one translation, followed by one-to-two, then one-to-three, and lastly microcode. This is consistent with the results reported by Intel designers in [9].

Dhrystone2.1        Cross-compiled    X86-to-ARM
(1000 iterations)   to ARM Binary     translated Binary
I-Cache             315477            268155
ITLB                12019             11871
RF                  92077             87167
ALU                 476               455
DL1                 182427            178774
DTLB                12037             11559
Clock               123627            119538
A-IO                176059            168352
D-IO                630970            649552
XAM                 0                 300000
Total               1545169           1795423

Table 10: Simulated power dissipation of SA-1100 processor in micro-watts

B. Power Simulation Results

We report in this section, for the Dhrystone2.1 benchmark, the Sim-Panalyzer estimated average power consumption of all functional blocks of the simulated StrongARM (SA-1100) [14] based MISA processor, including: first-level instruction cache and TLB (I-Cache, ITLB), ALU, register file (RF), first-level data cache and TLB (D-Cache, DTLB), clock, address input/output (A-IO), data input/output (D-IO), and the XAM interpreter.

Table 10 shows the estimated power dissipation of the various functional blocks of our MISA processor, expressed in micro-watts. The total power dissipation of the MISA processor when running Dhrystone2.1 compiled to ARM is about 1.55 Watts. When Dhrystone2.1 is compiled to X86 and translated using the XAM hardware interpreter, the estimated total power is about 1.80 Watts. The difference in estimated total processor power between executing ARM and X86 binaries comes from two sources: 1) the additional power consumed in the XAM block, which we estimate to be 0.3 Watts, and 2) reduced power consumption in the I-Cache resulting from a 15% decrease in code size when running the variable-length CISC X86 instruction binaries.

In conclusion, when running in translation mode, our MISA processor achieves the same level of performance with a power overhead of about 14% of total processor power. Considering the 40% to 2x performance overhead reported in the literature for various software binary translators, our MISA hardware translation approach provides a higher level of performance and better energy efficiency than previous binary translation solutions.
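The ~14% figure follows from the Table 10 totals (in micro-watts); the snippet below simply redoes that arithmetic:

```python
# Totals from Table 10, in micro-watts.
arm_total = 1_545_169    # Dhrystone2.1 compiled to ARM
misa_total = 1_795_423   # X86 binary translated through XAM

# Overhead as a fraction of total processor power in translation mode.
overhead = (misa_total - arm_total) / misa_total
# The absolute difference (~0.25 W) is smaller than the 0.3 W XAM block
# estimate because the I-Cache consumes less with the denser X86 code.
```

This is roughly 0.14, i.e. about 14% of total processor power.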

VI. ARM-TO-X86 TRANSLATION

We are currently working on an alternative MISA processor design in which the processor's native ISA is X86 and a hardware interpreter transforms ARM binaries into X86 machine instructions. We have identified several issues with this ARM-to-X86 hardware translation approach, which we describe in this section.

1. With ARM-to-X86 translation, some of the benefits of CISC are lost. Examples of these benefits include reduced code size, a reduced number of instructions, improved addressing modes, and other features that benefit applications compiled directly to X86 binaries. On the other hand, all the disadvantages of CISC are introduced by the translation. For example, in addition to the power overhead of the hardware interpreter, the processor has to incur the overhead of decoding and executing the variable-length, complex X86 binary instructions generated by the hardware interpreter.

2. CPSR register. The “s” instruction suffix in ARM specifies whether or not an instruction sets the CPSR flags. Since X86 and ARM sometimes handle their flags differently, as we discussed earlier in the paper, we have found that in some cases we need to add an extra instruction to update the flags properly. An example is the translation of the MOVS instruction shown in Table 11.

ARM instruction                X86 equivalent instruction(s)

MOVS Destination, Source       MOV Destination, Source
                               TST Destination, Destination

MOVNE Destination, Source      CMP Destination, Source
                               JE Next_Instruction
                               MOV Destination, Source
                               Next_Instruction:

Table 11: Translating ARM instructions with suffixes to X86

3. Conditional suffix. This suffix in ARM controls whether or not an instruction executes, based on the state of the condition flags. These conditional instructions require microcode routines, including a conditional branch, when translated. In addition to increasing the number of instructions in the executed program, conditional branches, when mispredicted, cause pipeline flushes, which increase power due to bogus-path execution and reduce performance due to wasted execution cycles. Table 11 shows one possible translation of the conditional MOVNE instruction.
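The flag-setting and conditional expansions of Table 11 can be sketched as a small lookup-style translator. This is purely illustrative, not our XAM or interpreter implementation: the function names and string-based instruction encoding are assumptions, and the MOVNE expansion shows only the minimal branch-around pattern, relying on flags already set by a preceding instruction:

```python
def translate_movs(dst, src):
    """ARM 'MOVS dst, src' also sets the condition flags; X86 MOV does
    not, so a TST of the result is appended to update the flags."""
    return [f"MOV {dst}, {src}", f"TST {dst}, {dst}"]

def translate_movne(dst, src, label="Next_Instruction"):
    """ARM 'MOVNE dst, src' executes only when the Z flag is clear; the
    expansion branches around the MOV when Z is set (condition false)."""
    return [f"JE {label}", f"MOV {dst}, {src}", f"{label}:"]

print(translate_movs("EAX", "EBX"))
# prints "['MOV EAX, EBX', 'TST EAX, EAX']"
```

The sketch makes the cost visible: one ARM instruction becomes two or three X86 instructions, and the conditional case introduces a branch that the pipeline must predict.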

4. ARM to X86 register mappings. There are multiple issues related to mapping ARM registers to X86 registers.

• The inaccessibility of the EIP register. The X86 instruction pointer (EIP) is not accessible by code, while the ARM program counter register (PC) is. Therefore, a direct mapping of the ARM PC register to EIP is not possible.

• Different number of registers. When we selected the X86-to-ARM register mappings for XAM, we assumed a flat memory model, thus eliminating the need for the X86 segment registers. Using a flat memory model left us with 10 architected X86 registers to map onto the 16 available ARM registers. We were able to select a mapping and assign some of the remaining free ARM registers as temporary registers for translating complex instructions. The smaller number of X86 registers makes the reverse ARM-to-X86 register mapping impossible.
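The register-count argument in the two bullets above can be made concrete. The mapping below is purely illustrative (the specific assignments are assumptions, not those chosen for XAM); it only demonstrates that 10 architected X86 registers fit into ARM's 16 with registers to spare, while the reverse direction cannot fit:

```python
# Illustrative X86-to-ARM register map under a flat memory model
# (segment registers dropped): 8 GPRs + EIP + EFLAGS = 10 architected
# X86 registers. The individual assignments here are hypothetical.
X86_TO_ARM = {
    "EAX": "r0", "EBX": "r1", "ECX": "r2", "EDX": "r3",
    "ESI": "r4", "EDI": "r5", "EBP": "r6", "ESP": "r13",
    "EIP": "r15",    # ARM's PC is an architecturally visible register
    "EFLAGS": "r7",  # hypothetical shadow copy of the X86 flags
}
ARM_REGS = [f"r{i}" for i in range(16)]

spare = [r for r in ARM_REGS if r not in X86_TO_ARM.values()]
print(f"{len(spare)} ARM registers left free as temporaries")
# prints "6 ARM registers left free as temporaries"
```

The same counting in the other direction fails: 16 ARM registers cannot be mapped one-to-one onto the 8 X86 general-purpose registers, which is why an ARM-to-X86 interpreter must spill some ARM architectural state to memory.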

VII. CONCLUSION

In this paper, we have shown that it is possible to design a multiple instruction sets architecture (MISA) processor that runs both ARM and X86 binaries. In our initial work, we have succeeded in translating 32-bit X86 binaries to ARM instructions at run time, using a special X86 hardware decoder and interpreter integrated within an ARM processor pipeline, without requiring any other hardware changes. The translated code runs at performance equivalent to that of compiled ARM binaries, with a 14% power overhead. Our ultimate goal is to design an optimized MISA processor that effectively translates either ARM or X86 binaries into optimized target ISA code that runs on the MISA processor at high performance and energy efficiency. The novelty of our work is not the concept of binary translation itself, but the method we use. Our proposed translation scheme is an interesting option toward designing a universal platform architecture, capable of running binaries compiled for the two most successful ISAs currently used in industry.

This work is only the first step, and significant additional research is still to come. We intend to continue this work and evaluate our MISA design using larger, more realistic benchmarks. We also plan to develop a VMM software layer to handle privileged instructions, as the next step towards realizing our universal X86/ARM multiple instruction sets platform architecture.

REFERENCES

[1] V. Adve, C. Lattner, M. Brukman, A. Shukla, and B. Gaeke, “LLVA: A Low-level Virtual Instruction Set Architecture”, Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, December 2003.

[2] E. R. Altman, K. Ebcioglu, M. Gschwind, and S. Sathaye, “Advances and Future Challenges in Binary Translation and Optimization”, Proceedings of the IEEE, vol.89, no.11, pp.1710-1722, Nov 2001.

[3] E. R. Altman, D. Kaeli and Y. Sheffer, “Welcome to the Opportunities of Binary Translation”, Computer, vol.33, no.3, pp.40-45, Mar 2000.

[4] T. M. Austin and D. Burger, "The SimpleScalar tool set, version 2.0", ACM SIGARCH Computer Architecture News, Volume 25 Issue 3, June 1997, http://www.simplescalar.com/.

[5] T. M. Austin, T. Mudge, and D. Grunwald, “Sim-Panalyzer: The SimpleScalar-Arm Power Modeling Project”, Sim-Panalyzer 2.0 Reference Manual, http://www.eecs.umich.edu/~panalyzer/.

[6] R. C. Detmer, “Introduction to 80x86 Assembly Language and Computer Architecture”, Jones and Bartlett Publishers, Inc., 2001.

[7] K. Ebcioglu and E. R. Altman, “DAISY: Dynamic Compilation for 100% Architectural Compatibility”, Proceeding of the 24th Annual International Symposium on Computer Architecture, 1997.

[8] S. Furber, “ARM System-on-Chip Architecture”, 2nd edition, Addison-Wesley Professional, 2000.

[9] D. B. Papworth, “Tuning the Pentium Pro Microarchitecture”, IEEE Micro, vol.16, no.2, pp.8-15, Apr 1996.

[10] A. Sloss, D. Symes and C. Wright, "ARM System Developer's Guide: Designing and Optimizing System Software", the Morgan Kaufmann Series in Computer Architecture and Design, 2004.

[11] J. E. Smith, T. Heil, S. Sastry, and T. Bezenek, “Achieving High Performance via Co-designed Virtual Machines”, International Workshop on Innovative Architecture (IWIA), 1999.

[12] D. Tullsen, S. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism”, Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), June 1995.

[13] R. P. Weicker, “Dhrystone: A Synthetic Systems Programming Benchmark”, Communications of the ACM, vol.27, no.10, pp.1013-1030, October 1984.

[14] “DIGITAL Semiconductor SA-1100 Microprocessor for Portable Applications Product Brief”, EC–R59EC–TE, 19 February 1998.