
Microprocessors and Microsystems 22 (1998) 49-60

A scalable register file architecture for superscalar processors

Steven Wallace, Nader Bagherzadeh*
Department of Electrical and Computer Engineering, University of California, Irvine, CA 92697, U.S.A.
* Corresponding author. E-mail: [email protected]

Received 17 January 1997; received in revised form 17 December 1997; accepted 18 December 1997

Abstract

A major obstacle in designing superscalar processors is the size and port requirement of the register file. Multiple register files of a scalar processor can be used in a superscalar processor if results are renamed when they are written to the register file. Consequently, a scalable register file architecture can be implemented without performance degradation. Another benefit is that the cycle time of the register file is significantly shortened, potentially producing an increase in the speed of the processor. © 1998 Elsevier Science B.V.

Keywords: Register file; Superscalar; Register renaming

1. Introduction

The instruction level parallelism in programs allows a superscalar microprocessor to execute multiple instructions per cycle. It can be exploited by hardware providing resources to perform dynamic instruction scheduling. Independent instructions can be discovered at run-time and scheduled to functional units out-of-order. To avoid anti- and output dependencies, registers can be renamed by using a reorder buffer [1] or a mapping table [2]. The Power PC [3], Pentium Pro [4], and MIPS R10000 [2] are examples of superscalar microprocessors that perform out-of-order issue and register renaming.

The register file is a design obstacle for superscalar microprocessors. If N instructions can be issued in a cycle, then a superscalar microprocessor's register file needs 2N read ports and N write ports to handle the worst-case scenario. The area complexity of the register file grows proportionally to N² [5]. Therefore, a new architecture is needed to keep the number of ports of a register file cell constant as N increases.

In addition, the register requirements for high performance and exception handling can be quite high. Farkas et al. conclude that for best performance, 160 registers are needed for a 4-way issue machine, and 256 registers are needed for an 8-way issue machine [6]. Therefore, it is desirable to reduce the register requirements and still maintain performance.

A major difficulty in the simultaneous multithreading (SMT) architecture, introduced by Tullsen et al., was the size of the register file [7]. They supported eight threads on an 8-way issue machine, using 356 total registers with 16 read ports and 8 write ports. Compared to a standard 32-register, 2 read port, 1 write port register file of a scalar processor, the area of the SMT register file is estimated to be over 700 times larger. To account for the size of the register file, they took two cycles to read registers instead of one. This underscores the need for a mechanism to scale the register file, yet still have the area and access time benefits of a scalar processor's register file.

To attack this problem, we approach the scalability of the register file in two directions. First, we show how a multiple-banked register file can be used to reduce the port requirement to that of a scalar processor: 2 read ports and 1 write port. Second, we show how to reduce the total register requirement by improving the utilization of the registers.

This paper begins by giving background information related to register renaming, branch misprediction recovery, and the complexity of the register file. The experimental methodology used in the remainder of the paper is then explained. Section 4 describes a hybrid renaming technique to reduce the penalty of mispredicted branches. Section 5 examines register file utilization, and Section 6 presents how registers can be renamed at result write time, thereby leading to a scalable register file implementation. Section 7 explains the implementation details for hybrid renaming with dynamic result renaming. The performance of these techniques is evaluated in the next section. The last section gives some concluding remarks.


Fig. 1. Block diagram of renaming logic (free list, mapping table, physical register file, and recovery list).

2. Background

2.1. Register renaming

Register renaming is an important tool to allow a superscalar to perform out-of-order execution. One method to rename registers is to use a reorder buffer [1,8]. It can dynamically rename a logical (programmable) register to a unique tag identifier. The result is written back into this buffer. If there are no exceptions or branch mispredictions, the result eventually commits by writing the result to the register file.

Another register renaming technique maps logical registers into physical registers (the index into the physical data array), as performed by the MIPS R10000 [2]. A block diagram of the renaming process is shown in Fig. 1. When an instruction is decoded, a new physical register from a free list is allocated for its destination register and entered into a mapping table. The old physical register for that destination is entered into a recovery list. The recovery list (also called the active list) maintains the in-order state of instructions and can be used to undo the mappings in the event of a mispredicted branch or exception. After an instruction completes and all previous instructions have completed, its register is committed and the old value is discarded by freeing the old physical register contained in the recovery list. During decoding of a source operand, its logical register number is used as an index into the mapping table to read the corresponding physical register. The advantage of this renaming technique is that one data array is used to store both committed registers (part of the state of the machine) and speculative registers (extra registers reserved for speculative results until committed).
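To make the flow in Fig. 1 concrete, the following minimal Python sketch models decode-time renaming with a mapping table, free list, and recovery list. It is illustrative only: the class and method names (RenameUnit, rename_dest, and so on) are ours, not the paper's, and timing, multi-issue, and exceptions are ignored.

# Minimal sketch of decode-time register renaming with a mapping table,
# a free list, and a recovery list (after Fig. 1).

class RenameUnit:
    def __init__(self, num_logical, num_physical):
        # committed state: logical register i initially maps to physical i
        self.map_table = list(range(num_logical))
        self.free_list = list(range(num_logical, num_physical))
        self.recovery_list = []            # in-order list of (logical, old physical)

    def rename_source(self, logical):
        # source operands simply read the current mapping
        return self.map_table[logical]

    def rename_dest(self, logical):
        # allocate a new physical register and remember the old mapping
        new_preg = self.free_list.pop(0)
        old_preg = self.map_table[logical]
        self.recovery_list.append((logical, old_preg))
        self.map_table[logical] = new_preg
        return new_preg

    def commit_oldest(self):
        # the old value is no longer needed once the instruction commits
        _, old_preg = self.recovery_list.pop(0)
        self.free_list.append(old_preg)

    def recover(self):
        # undo speculative mappings, most recent first (misprediction/exception)
        while self.recovery_list:
            logical, old_preg = self.recovery_list.pop()
            self.free_list.append(self.map_table[logical])
            self.map_table[logical] = old_preg

A real implementation recovers only the mappings younger than the mispredicted branch and bounds the recovery list by the instruction window size.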

2.2. Bad branch recovery

A major drawback of using a mapping table to rename logical registers to physical registers is the large penalty required to recover from a branch misprediction or exception. Recovery proceeds by reading the recovery list from the most recent entry until the mispredicted branch or the instruction which caused an exception is encountered. After an entry is read, the old physical register is used to replace the mapping of that entry's logical register number. After all appropriate entries from the recovery list have been read and re-mapped into the mapping table, the mapping table will reflect the state of the machine after that mispredicted branch.

The large penalty required to recover the mapping table can be minimized by using a checkpoint mechanism. The R10000 uses checkpointing for up to four branches, but not for exceptions [2]. In order for checkpointing to be realistic in hardware, checkpoint storage must be integrated into the basic cell of the mapping table to have direct access for a single cycle recovery. Thus, a mapping table's cell size is greatly increased and is not scalable with the number of branch checkpoints. With increased speculation from a wide-issue superscalar and larger mapping tables from SMT, using standard RAM cells becomes imperative for speed and scalability. Section 4.2 will introduce a new hybrid register renaming technique which significantly reduces this penalty yet still retains some of the benefits of a mapping table.

2.3. Register file complexity

The design of a register file for a superscalar processor has become an increasingly difficult task to accomplish. With increasing issue rates, larger instruction windows, deeper pipelines, and the advent of simultaneous multithreading, the pressure on the register file to supply multiple values and store speculative results has dramatically increased. In order to issue N instructions per cycle, 2N registers need to be read for operands and N results need to be written back. For instance, issuing eight instructions per cycle requires 16 read ports and 8 write ports on a register file.

Unfortunately, the area of the register file increases proportionally to the square of the number of ports [5]. Because of the implementation of a data cell in a register file, each time a port is added, more hardware is required: a transistor, a wire to access the cell, and a wire to read/write the cell. This increases both the cell's length and the cell's width.
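A rough cell-area model makes the quadratic growth explicit. The notation below is ours (w0 and h0 are the port-independent cell dimensions, Δw and Δh the per-port wire pitch), not the paper's:

\[
A_{\mathrm{cell}} \approx (w_0 + p\,\Delta w)(h_0 + p\,\Delta h) = O(p^2),
\qquad p = 2N + N = 3N \;\Rightarrow\; A_{\mathrm{cell}} = O(N^2).
\]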

In addition, the access time increases with the number of ports. For example, using 0.5 μm CMOS technology, the cycle times of a register file with 64 registers for N = 1, 2, 4, 8 are estimated to be 2.8, 3.0, 3.2, 3.6 ns, respectively [9]. The increased cycle time must be taken into account in comparing the effective performance between different values of N, as will be shown in Section 8.

3. Experimental methodology

The performance of the concepts we describe was simulated by running the SPEC95 benchmark suite on the SPARC architecture using the Shade instruction-set simulator [10]. Each program was compiled using the SunPro compiler with standard optimizations (-O) and simulated for the first fifty million instructions. Also, we simulated the SPECint95 benchmark suite on the SDSP (Superscalar Digital Signal Processor) architecture [11], using the GNU CC compiler and second-level optimizations (-O2). The instruction set of the SDSP is very similar to MIPS.


Table 1
Functional unit parameters

Quantity          Type            Latency
4-way   8-way
4       8         ALU             1
2       4         Load unit       1
2       4         Store unit      -
1       2         Int. multiply   2
1       2         Int. divide     10
4       8         FP add          3
1       2         FP multiply     3
1       2         FP divide       16
1       2         FP other        3

To verify that our new register file architecture performs well, we chose a reasonable machine model that resembles commercial processors such as the Power PC 604 [3] and MIPS R10000 [2]. Table 1 lists the quantity, type, and latency of the different functional units modeled. The quantity of functional units for the 8-way superscalar architecture is twice that of the 4-way superscalar architecture.

The machine model parameters used in simulation are:

• instruction cache: 64 Kbyte, 2-way set associative LRU, 16 byte line size, 2 banks, self-aligned fetching, 10 cycle miss penalty;
• data cache: 64 Kbyte, 2-way set associative LRU, 16 byte line size, 4 banks, 2 simultaneous accesses per cycle, lockup-free, write-around, write-through, 4 outstanding cache request capability, 10 cycle miss penalty;
• branch prediction: 2K × 2-bit pattern history table indexed by the exclusive-or of the PC and global history register;
• speculative execution: enabled;
• interrupts: precise;
• instruction window: centralized; 32 entries for 4-way, 64 entries for 8-way;
• register file: separate general purpose and floating point register files; logical registers are mapped to physical registers;
• recovery list: 32 entries for 4-way, 64 entries for 8-way;
• store buffer: 16 entries.

The instruction scheduling logic uses a single instruction window for all functional units [1]. A reasonable size for the instruction window, 32 entries for 4-way and 64 entries for 8-way, was chosen that would give good performance and produce a strong demand for registers during issue and write back. A variable number of instructions, up to the decode width of 4 or 8, may be inserted into the instruction window, if entries are available. Instructions are issued out-of-order using an oldest-first algorithm. Store instructions are issued in-order, and load instructions may be issued out-of-order in between store instructions.

Fig. 2. Pipeline stages of a superscalar processor: instruction fetch (IF), decode/rename (D/R), issue (IS), register read (RR), execute (EX), and result commit (RC). A misfetch costs 1 cycle; a mispredicted branch costs 4 or more cycles.

Each cycle, a variable number of registers, up to the decode width, may be retired from the recovery list. If entries are available, then they may be used by new instructions, up to the decode width. The old physical register corresponding to the same destination register is inserted into the recovery list.

The pipeline stages of the processor modeled are instruction fetch, decode and rename, issue, register read, execute, and result commit, as shown in Fig. 2. Consequently, two levels of bypassing are required for back-to-back execution of dependent instructions. Also, instructions dependent on a load are optimistically issued, in expectation of a cache hit. If a cache miss occurs, then the dependent instructions must be re-issued, similar to the design in [12]. In addition, the simulator continues to fetch instructions from the wrong path until an incorrectly predicted branch is resolved.

4. Register renaming

Registers can be renamed from logical to physical registers using a mapping table. Unfortunately, a mispredicted conditional branch or an exception may result in a large penalty to recover the mapping table. Section 4.2 introduces a hybrid renaming technique to reduce this penalty. The hybrid renaming technique uses both content addressable memory (CAM), as used in a reorder buffer, and random access memory (RAM), as used with a mapping table. The number of ports for the CAM and RAM cells can be reduced by detecting the dependencies within the decode block, as will be described in Section 4.3.

4.1. Recovery

A major performance penalty with using a mapping table and a recovery list is the time it takes to recover from a mispredicted branch. Using the recovery list, the mapping table can be recovered by undoing each entry in the list one at a time. The mapping table can be updated in groups, but the rate is limited by the number of read and write ports. If P ports are available and M register mappings need to be recovered, then it will take ⌈M/P⌉ cycles to recover.
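As a worked example with hypothetical values (M = 24 outstanding mappings, P = 8 ports; these numbers are chosen only for illustration):

\[
\left\lceil \frac{M}{P} \right\rceil = \left\lceil \frac{24}{8} \right\rceil = 3 \text{ cycles of recovery.}
\]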

Fig. 3. Block diagram of hybrid renaming (intrablock detection, CAM lookup, mapping table, and recovery list; renamed operands are sent to the instruction window).

Hence, the longer it takes for a branch to be found incorrectly predicted, the longer it takes to recover the mapping table. In fact, this nearly doubles the branch misprediction penalty, compared to a mechanism that can recover from a mispredicted branch in one cycle, as shown in the next section.

In addition, the number of ports of the RAM cells used by the mapping table grows proportionally to N, since more instructions are being renamed per cycle. Therefore, it is not ideally scalable. On the other hand, since the number of logical registers remains fixed, N would have to be large in order to cause real problems from a practical standpoint. Nevertheless, it would be beneficial to reduce the number of ports of the mapping table's RAM cell.


4.2. CAM/table hybrid

The advantage of a register renaming mechanism that uses CAM, such as a reorder buffer [1], is a one cycle recovery time from a mispredicted branch or exception. This is accomplished by simply invalidating the appropriate entries relative to the branch. A one cycle recovery time is desirable. However, CAM cells scale worse than the RAM cells used in a mapping table. To begin with, a CAM cell is more expensive and slower than a normal RAM cell. In addition, the lookup array, which searches for the most recent register instance, grows as the number of speculative registers increases. Therefore, it is desirable to have the significant performance benefits of the CAM and the area benefits of RAM.

As a compromise, CAM is used to rename a limited number of speculative registers (extra registers reserved for speculative results until committed), and a mapping table, which is implemented using RAM, renames the remaining speculative registers and the committed registers. Fig. 3 is a block diagram of renaming hardware which uses both CAM and RAM. The hardware which uses CAM is called the CAM lookup. The CAM lookup is a FIFO queue. When a new instance of a logical register is created, it is inserted into the CAM lookup. Only instructions which produce a result need to be entered into the CAM lookup. When an instance exits the CAM lookup, it is entered into the appropriate mapping table entry, and the old physical register is entered into the recovery list. When an instruction commits its result, the old physical register in the recovery list is freed.

Source operands are first searched for matching entries in the CAM lookup list for the most recent destination register. If a match is not found, then the register is looked up in the mapping table. When a mispredicted branch is resolved, if its instruction tag identifier indicates that subsequent instructions are still in the CAM lookup entries, then there is no additional penalty. If it has been entered into the mapping table, then the recovery list is used to undo the mappings. However, the penalty is not as severe since instances are first entered into the CAM lookup. The number of entries to recover is reduced by the number of entries in the CAM lookup. Consequently, there is a significant reduction in the average misprediction or exception penalty.
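The following Python sketch illustrates the hybrid source-operand lookup described above: a small CAM FIFO is searched for the most recent matching destination, and the mapping table is consulted only on a miss. It is a sketch under our own naming and data layout, not the paper's implementation; the recovery-list bookkeeping performed when an entry spills from the CAM lookup into the table is omitted.

# Sketch of hybrid source-operand renaming: a small CAM FIFO holds the most
# recent destination registers; older mappings live in a RAM mapping table.

from collections import deque

class HybridRename:
    def __init__(self, num_logical, cam_entries):
        self.cam = deque(maxlen=cam_entries)   # FIFO of dicts: logical, itag, preg, ready
        self.table = {r: {"itag": None, "preg": r, "ready": True}
                      for r in range(num_logical)}

    def insert_dest(self, logical, itag):
        # a full FIFO spills its oldest entry into the mapping table
        # (recording the displaced mapping in the recovery list is omitted here)
        if len(self.cam) == self.cam.maxlen:
            old = self.cam.popleft()
            self.table[old["logical"]] = {"itag": old["itag"],
                                          "preg": old["preg"],
                                          "ready": old["ready"]}
        self.cam.append({"logical": logical, "itag": itag,
                         "preg": None, "ready": False})

    def lookup_source(self, logical):
        # search the CAM for the most recent matching destination first
        for entry in reversed(self.cam):
            if entry["logical"] == logical:
                return entry["preg"] if entry["ready"] else entry["itag"]
        # otherwise fall back to the mapping table
        entry = self.table[logical]
        return entry["preg"] if entry["ready"] else entry["itag"]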

To compare the performance benefit of a full mapping table, a full CAM lookup, and the hybrid, Table 2 lists the average misprediction penalty and instructions per cycle (IPC) for CAM depths (number of entries divided by decode width) of 0, 2, 4, and 8.

Table 2
Bad branch penalty and performance

                       CAM = 0         CAM = 2         CAM = 4         CAM = 8
Arch   Suite  Issue    Penalty  IPC    Penalty  IPC    Penalty  IPC    Penalty  IPC
SDSP   Int    4        7.7      2.63   6.4      2.70   5.8      2.74   5.5      2.75
SPARC  Int    4        8.0      2.20   6.6      2.29   6.0      2.32   5.7      2.35
SPARC  FP     4        8.1      1.58   6.8      1.61   6.1      1.63   5.8      1.64
SDSP   Int    8        8.8      3.70   7.4      3.85   6.7      3.93   6.2      3.99
SPARC  Int    8        10.0     2.70   8.4      2.84   7.5      2.91   7.0      2.96
SPARC  FP     8        10.6     1.91   9.1      1.96   8.1      2.00   7.4      2.03


The misprediction penalty includes all the pipeline bubbles in the various stages. A branch may wait several cycles to be issued; these cycles are also included, except when the pipeline stalls due to data dependencies. In addition, an extra cycle may be included if instructions have to be re-fetched from the mispredicted branch's block. A significant performance improvement is observed with the CAM lookup because the average misprediction penalty is reduced. Beyond about half the total depth (CAM = 4), there is only a marginal improvement in performance compared to a full CAM lookup (CAM = 8). Therefore, the hybrid CAM/table is a good compromise between cost and performance.

4.3. Intrablock decoding

Fig. 4. Dependence distance for SDSP/SPARC: percentage of source operands as a function of the instruction distance (1-64) to the instruction that produced the register.

Many operands are dependent on a result in the same block or in the recent past. This is shown in Fig. 4, which plots the number of instructions between a source operand and the creation of the register it is referencing. In a block of four instructions, each with one register operand (since about half the operands are usually constant or not used), about 1.2 operands are expected to be dependent within that block. If intrablock dependencies are detected, the number of CAM ports required to search the lookup entries may be reduced. If there is not enough time in the decode stage to determine the intrablock dependencies, then pre-decode bits in the instruction block can be used. Each source operand in a block requires log2 N bits to encode which of the previous N-1 instructions it is dependent on, or none at all. In addition, if an operand is dependent on an instruction which comes before the starting position of a block, the dependency information is ignored. When each line is brought into the instruction cache, or after the first access, the line containing a block of instructions is annotated with 2N log2 N bits (for two source operands) indicating whether an instruction is dependent on another one in the same block.
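The pre-decode annotation can be computed once per instruction block, for example when a line enters the instruction cache. The sketch below is our own formulation of that step, not the paper's: for each source operand it records the distance to the most recent producer inside the block (0 meaning no intrablock producer) and reports the roughly 2N log2 N annotation bits.

# Sketch of intrablock dependence pre-decoding for a block of N instructions.
# Each instruction is (dest_reg, [src_regs]); dest_reg may be None.

import math

def predecode_block(block):
    n = len(block)
    bits_per_operand = max(1, math.ceil(math.log2(n)))
    annotations = []
    for i, (_, srcs) in enumerate(block):
        deps = []
        for src in srcs:
            dist = 0
            for j in range(i - 1, -1, -1):
                if block[j][0] == src:        # most recent producer in the block
                    dist = i - j
                    break
            deps.append(dist)
        annotations.append(deps)
    total_bits = 2 * n * bits_per_operand      # two source operands per slot
    return annotations, total_bits

# example: a 4-instruction block; r3 produced by slot 0 is consumed by slot 1
block = [(3, [1, 2]), (4, [3, 5]), (None, [4, 6]), (7, [1, 4])]
print(predecode_block(block))                  # ([[0, 0], [1, 0], [1, 0], [0, 2]], 16)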

After running simulations using intrablock detection, decoding four instructions requires one less CAM port to search the lookup array for equivalent performance: instead of needing five or six CAM ports, the CAM lookup can use four or five. When the block size is doubled to eight instructions, about four out of eight register operands are expected to be intrablock dependent. Therefore, instead of doubling the number of CAM ports when the decode size doubles, an increase of only one port is needed, to about five or six.

Referring back to Fig. 3, the intrablock decoding can now be done before any CAM lookups or table mappings. As N increases, the number of operands needed for CAM and table lookup increases only slightly, because it becomes more likely that operands will be dependent within the same block. Hence, the hybrid CAM/table renaming scheme is scalable.

5. Register file utilization

A disadvantage of allocating a physical register at decode time is that physical registers go unused until they receive their result value. As a result, a good portion of the register file (RF) is wasted most of the time. The total register file utilization is defined as the ratio of the number of physical registers with a useful value to the total number of physical registers. In addition, the speculative register file utilization is the ratio of the number of physical registers with a useful speculative value to the total number of physical registers reserved for speculative results (it does not include committed registers). A value is considered to be useful if it is needed to ensure proper execution of the machine. With speculative execution and precise interrupts, this is the case from the time a register receives its result until it is committed.

Table 3 shows the average speculative and total register file utilization per cycle for 4-way and 8-way superscalar processors. The mean, median, and 90th percentile of the number of useful speculative registers are shown. Physical registers used to store the state of the logical register file will always be active, so the total register file utilization is not as meaningful as the speculative register file utilization.


Table 3
Average register file utilization per cycle

Arch   Suite  Issue  %spec  %total  Mean   Median  90th%
SDSP   Int    4      26.3   63.2    8.43   8       17
SPARC  Int    4      16.0   84.0    5.11   4       12
SPARC  FP     4      6.3    53.2    2.03   1       6
SDSP   Int    8      24.6   49.7    15.73  15      32
SPARC  Int    8      12.4   72.0    7.91   5       21
SPARC  FP     8      4.8    54.4    2.80   1       9

From the results presented, it is observed that less than one quarter of the registers reserved for speculative execution are used on average. Less than half of the available speculative registers are used 90% of the time. The floating point RF used for the SPARC SPECfp95 has an extremely low utilization: less than 6%. Hence, the majority of speculative registers are going to waste most of the time.

6. Dynamic result renaming

As has been shown, many physical registers in the register file have no value or contain a useless value. Therefore, one way to reduce the size of the RF is to improve its utilization. Physical registers allocated with no value can be virtually eliminated by allocating at result write time instead of decode time. This is accomplished by splitting the register file into multiple banks, each bank with two read ports and one write port, as shown in Fig. 5. Each bank maintains its own free list (see Fig. 3), and old physical registers are freed when an instruction commits. In addition, each bank is directly connected to one result bus. When functional units arbitrate for result buses, a free register is needed in each bank. The allocation of a register cannot be done at decode time, since it is not known exactly on which functional unit and bus a result will eventually arrive.

Fig. 5. Block diagram of the scalable register file (instruction window with opcode and renamed operands, register file banks, result bus muxes, bypass network, and result buses).

On the other hand, by allocating the entry when results are written, multiple banks can be used with one write port each and have no conflicts when writing results into the same bank. As a result of allocating physical registers at result write time, the size of each bank can remain constant as the number of banks increases in proportion to the issue width.

Although allocating physical registers at write time creates no conflicts for the single write port in a bank, the two read ports on one bank can cause contention with the instruction window. For example, three ALUs could require three operands from a single bank. With only two read ports, one ALU would not be able to issue its instruction. Even though this event can happen, it is not likely, for two reasons. First, not every instruction issued requires two register operands: some have one operand, while others require an immediate value. Second, most instructions issued bypass one of their operands from the result of an instruction completed in the previous cycle. Consequently, such a limited number of read ports per bank has a very limited impact on performance.

Table 4 demonstrates this fact by showing the distribution of read operand types and the percentage of individual operand requests that failed due to insufficient ports. If the operand is a register, then it can originate from the first or second level of bypassing, from the first or second read port of a bank, or be an identical register read from the first or second read port. On the other hand, if it is not a register operand, then it can be an immediate value, a zero value, or no operand at all. Interestingly, although a significant percentage of register operands came from the first read port, few required the second read port. Furthermore, a large percentage of registers are bypassed, especially at the first level. Since many operands are bypassed, a traditional RF with mapping on decode could reduce the number of read ports by about 50%. The number of write ports, however, must remain the same. Consequently, although the size of its RF may be reduced, this does not lead to a scalable solution.

The allocation of registers for results is pipelined. Two cycles before the result will be ready for writing, write arbitration takes place. The free list of each bank is searched for an available register. Result buses are assigned to a bank with a free register in a round-robin process. For example, consider a register file with four banks. In one cycle, three results are assigned to the first three banks. In the next cycle, the first result bus is assigned to the fourth bank, and the remaining results are assigned starting with the first bank. If a bank does not have a free register, that bank is skipped in the assignment process.
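A minimal sketch of this write-time arbitration follows. It assumes results can be assigned to any bank with a free register (the paper fixes one result bus per bank; bus-to-bank arbitration details are simplified here, and all names are ours, not the paper's).

# Sketch of round-robin assignment of results to banks with free registers.

class BankAllocator:
    def __init__(self, num_banks, regs_per_bank):
        self.free = [set(range(regs_per_bank)) for _ in range(num_banks)]
        self.next_bank = 0                      # round-robin starting point

    def allocate(self, num_results):
        """Return a list of (bank, register) pairs, or None for a stalled result."""
        assignments = []
        for _ in range(num_results):
            chosen = None
            for k in range(len(self.free)):     # scan banks round-robin
                bank = (self.next_bank + k) % len(self.free)
                if self.free[bank]:             # a bank without a free register is skipped
                    chosen = (bank, self.free[bank].pop())
                    self.next_bank = (bank + 1) % len(self.free)
                    break
            assignments.append(chosen)          # None => the functional unit stalls
        return assignments

    def free_register(self, bank, reg):
        # old physical registers return to their bank's free list at commit
        self.free[bank].add(reg)

# example: four banks, three results in one cycle, then three more
alloc = BankAllocator(num_banks=4, regs_per_bank=24)
print(alloc.allocate(3))   # banks 0, 1, 2
print(alloc.allocate(3))   # banks 3, 0, 1

Running the example reproduces the four-bank scenario described above: the first cycle fills banks 0-2, and the next cycle starts with the fourth bank and wraps around to the first.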


Table 4
Read operand category distribution (%)

                   4 Issue                   8 Issue
Architecture       SDSP    SPARC   SPARC     SDSP    SPARC   SPARC
SPEC95             Int     Int     FP        Int     Int     FP
Bypass level 1     25.1    19.7    19.4      19.7    13.8    21.5
Bypass level 2     4.8     4.3     2.6       4.3     2.9     2.3
Read port 1        13.2    18.1    20.4      18.1    9.6     18.8
Read port 2        0.8     1.5     5.2       1.5     1.1     3.4
Identical          0.3     0.6     1.6       0.6     1.0     2.5
Zero value         9.3     9.3     8.7       9.3     13.8    8.7
Immediate value    32.7    32.7    38.8      32.7    54.5    38.9
No operand         13.8    13.8    3.3       13.8    3.5     4.0
Failed read        0.2     0.2     7.3       0.2     7.4     5.1

If there are more results than available banks, then the pipeline of the functional unit which cannot write its result is stalled. Also, another instruction may be issued to the functional unit before the instruction window is notified that the functional unit is stalled. Consequently, two instructions with results may be waiting in the functional unit's pipeline. This does not create a problem since two levels of bypassing already exist. If subsequent instructions require a stalled result, then the result continues to use the bypass network until it is written to the register file.

In order to be able to allocate registers at result write time and still do register renaming for out-of-order execution, two types of renaming must take place. Before an instruction is placed into the instruction window, its destination operand is renamed to a unique instruction tag (itag) and inserted into the mapping table. This contrasts with renaming the register to a physical register, since allocation has not taken place yet. After the physical register has been allocated for the result, the instruction window, mapping table, and recovery list need to be notified of the physical register (preg). Using the result's destination register identifier, the mapping table is updated, if necessary. Matching itag entries in the instruction window receive the preg. Entries in the instruction window are already matched to mark their operands ready, so there is little additional cost involved besides storage.

A problem exists with updating the preg in the recovery list. The recovery list may contain an entry with an invalid old preg because its value is not ready. As a result, there needs to be some way of finding that entry so it can be updated. The old preg in the recovery list could be updated by matching the itag using CAM cells. A more efficient mechanism is to store the next itag in the recovery list. When the entry in the recovery list is marked complete, the next itag is read, and the preg is written into the entry indexed by the next itag.

The greatest hardware cost of using dynamic result renaming is the full multiplexer network used for reading and writing registers. This cost, however, does not begin to compare to the enormous time and space savings of using a two read port, one write port register file. The bypass network is still required in the map-on-decode case. In addition, the bypass network might be reduced by implementing suggestions by Ahuja et al. [13].

6.1. Deadlocks

The most critical aspect of using dynamic result renaming is avoiding deadlock situations. If there are fewer speculative registers than entries in the recovery list, then it is possible for all registers to be allocated while results are still pending, creating a deadlock situation. To guarantee that a deadlock will not occur, two conditions must hold:

1. The oldest instruction must be able to issue its instruction.
2. The oldest instruction must be able to write its result.

It may occur that the oldest instruction is not able to issue if the functional unit it requires is stalled and there are no free registers available. Therefore, if this situation arises, the oldest instruction, and only the oldest instruction, is permitted to issue to a functional unit whose results are stalled and latched into its two stages. When it completes, the result is delivered to the register file and written using the old physical register stored in the recovery list (thereby guaranteeing the second condition).

Moreover, if the oldest instruction is unable to issue or complete, then the processor temporarily executes in a scalar manner. In addition, each functional unit must have access to enough banks to cover at least as many registers as the number of entries in the recovery list plus the number of logical registers (in most situations, the entire register file).

7. Implementation

This section describes implementation details for hybrid renaming with dynamic result renaming. First, the structures and procedures used for renaming source operands are explained. Then the structures and procedures used for renaming destination operands during decode and at result write time are explained.

Table 5
CAM lookup fields

Name   Bits        Type   Description
Reg    log2 L      CAM    Logical destination register identifier
Itag   log2 S      RAM    Unique instruction tag
PReg   ⌈log2 R⌉    RAM    Physical register number
Ready  1           RAM    Indicates if PReg field is valid (result has completed)

7.1. Source operand renaming

Register renaming can be performed by using a mapping table. A CAM lookup can also be used to perform hybrid renaming. In addition, intrablock detection of dependent operands can be used. This section describes some of the implementation requirements for a CAM lookup and a mapping table. The steps involved in renaming source operands and destination operands are also described.

7.1.1. CAM lookup

The CAM lookup is composed of a destination register identifier, an instruction tag field, a physical register field, and a ready field. The name, number of bits required, type of cell, and description of each field of the CAM lookup are given in Table 5. The number of logical registers is represented by L, the number of speculative instructions allowed by S, and the number of physical registers by R. For L = 32, S = 64, and R = 96, 5 bits are required for the destination register, 6 bits for the instruction tag, and 7 bits for the physical register number.
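These widths follow directly from the sizes quoted above; a short check in Python (the variable names are ours):

# Field widths for the CAM lookup, using the example sizes from the text.
import math
L, S, R = 32, 64, 96                      # logical regs, speculative insts, physical regs
print(math.ceil(math.log2(L)),            # 5 bits: destination register identifier
      math.ceil(math.log2(S)),            # 6 bits: instruction tag
      math.ceil(math.log2(R)))            # 7 bits: physical register number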

7.1.2. Mapping table

The mapping table is composed of an instruction tag field, a physical register field, and a ready field. The name, number of bits required, and description of each field of the mapping table are given in Table 6. These fields are the same as in the CAM lookup, except that the logical register indexes into the table instead of being stored as CAM.

Table 6
Mapping table fields

Name   Bits      Description
Itag   log2 S    Unique instruction tag
PReg   log2 R    Physical register number
Ready  1         Indicates if PReg field is valid (result has completed)

7.1.3. Renaming procedure

The steps performed in renaming a source operand are summarized as follows (refer to Fig. 3):

1. Intrablock detection: Pre-decode bits are used to indicate which instruction within the block, if any, an operand depends on. The instruction tag of the instruction it is dependent upon is returned to the instruction window.

2. CAM lookup: Remaining operands from intrablock detection are searched for the most recent matching destination register identifier using content addressable memory. If the register has its value ready, then the physical register is used for the operand. Otherwise, the instruction tag is used for the operand.

3. Mapping table: Remaining operands from CAM lookup are indexed into the mapping table. If the register has its value ready, the physical register is returned. Otherwise, the instruction tag is returned to the instruction window.

Intrablock detection is optional; it reduces the number of required ports for the CAM lookup. In addition, the CAM lookup is not required for renaming operands; the mapping table can be used by itself.

Instead of accessing the CAM lookup and then the mapping table in series, they may be accessed in parallel. This requires 2N ports for the mapping table's RAM cells, which would satisfy the worst case possible. A serial access, on the other hand, allows the number of ports of the mapping table's RAM cells to be significantly reduced. However, accessing the CAM lookup and mapping table sequentially may not be possible in one cycle. Most likely another cycle will be required to read the mapping table for the remaining operands. The instruction window can tolerate an additional cycle of delay in receiving the remaining renamed operands with no observable impact on performance. This is possible because the majority of instructions are waiting for their counterpart operand to become ready before being issued.

7.2. Destination operand renaming

This section describes the procedure used in renaming a destination operand. The steps for dynamic result renaming are also listed, and the structures used for the recovery list and free list are detailed.

7.2.1. Recovery list

When a destination register is renamed and entered into the mapping table, the old mapping is recorded in the recovery list. Table 7 lists the name, number of bits, and description of its three fields: old physical register, next instruction tag, and completed bit.

Table 7
Recovery list fields

Name        Bits       Description
Old PReg    ⌈log2 R⌉   Physical register number of the previous instance of the same logical register
Next Itag   log2 S     Instruction tag of the next instance of the same logical register
Completed   1          Indicates if the instruction has completed

7.2.2. Operand events

The sequence of events a destination operand experiences from decode time until it commits its result is summarized as follows (a sketch of this recovery-list bookkeeping follows the list):

1. Rename the destination logical register identifier to a unique instruction tag. Instruction tags are numbered sequentially.

2. Enter instruction tag and destination register identifier into CAM lookup.

3. Destination register exits CAM lookup and is entered into the mapping table. Its logical register identifier is used to index into the mapping table. The entry is read and used in the next two steps. The new instruction tag, physical register number, and ready bit are written into that entry.

4. The old physical register from the mapping table entry is written into the recovery list.

5. The old instruction tag is used to index the recovery list, and the new instruction tag is written into that entry's Next Itag field.

6. A physical register is allocated when the result is ready. The physical register is written to the appropriate entry in either the CAM lookup or the mapping table (explained in the following section).

7. When the register commits its result, the old physical register number is read from the recovery entry and then freed.
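The sketch below models the recovery-list bookkeeping of steps 4-7, in particular the Next Itag linkage used to fill in an old physical register that had not yet been allocated when the entry was created. It is illustrative only; the dictionary layout and function names are ours, not the paper's.

# Sketch of recovery-list bookkeeping for dynamic result renaming.
# Entries are indexed by instruction tag and carry the old physical register,
# the tag of the next instance of the same logical register, and a completed bit.

recovery = {}          # itag -> {"old_preg": .., "next_itag": .., "completed": ..}
last_itag = {}         # logical register -> itag of its most recent instance

def rename_dest(itag, logical, current_preg):
    # current_preg is the mapping being replaced; it may be None if the
    # previous instance has not yet received its physical register
    recovery[itag] = {"old_preg": current_preg, "next_itag": None,
                      "completed": False}
    prev = last_itag.get(logical)
    if prev is not None:
        recovery[prev]["next_itag"] = itag     # link the previous instance
    last_itag[logical] = itag

def result_written(itag, preg):
    # a physical register was just allocated for this result at write time
    entry = recovery[itag]
    entry["completed"] = True
    nxt = entry["next_itag"]
    if nxt is not None:
        recovery[nxt]["old_preg"] = preg       # fill in the missing old preg

def commit(itag, free_list):
    # at commit, the previous value of the logical register is discarded
    old = recovery.pop(itag)["old_preg"]
    if old is not None:
        free_list.append(old)

# example: two instances of the same logical register r3
free_list = []
rename_dest(10, logical=3, current_preg=7)     # first tracked instance of r3
rename_dest(11, logical=3, current_preg=None)  # next instance; old preg unknown yet
result_written(10, preg=42)                    # instance 10 gets preg 42; entry 11 updated
commit(10, free_list)                          # frees preg 7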

7.2.3. Dynamic result renaming

In order to perform dynamic result renaming with multiple banks of a register file, a free list must be maintained for each bank. Only a single bit is required to indicate whether a register is allocated, so the storage for this is not complicated; for example, an RF with 96 registers using 8 banks would require a 12-bit free list per bank. When results are committed, the old physical registers are read from the recovery list. A mask can be generated for each bank according to the registers to be freed, and the old register mask for each bank is ORed with the corresponding free list.
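A minimal sketch of the per-bank free-list masks, assuming 12 registers per bank as in the example above (the function names are ours, not the paper's):

# Sketch of a per-bank free list kept as a bit mask (one bit per register).

REGS_PER_BANK = 12

def make_free_mask():
    return (1 << REGS_PER_BANK) - 1            # all registers initially free

def allocate(mask):
    """Return (register_index, new_mask), or (None, mask) if the bank is full."""
    if mask == 0:
        return None, mask
    reg = (mask & -mask).bit_length() - 1      # lowest set bit = lowest free register
    return reg, mask & ~(1 << reg)

def free_registers(mask, freed_regs):
    freed_mask = 0
    for reg in freed_regs:                     # old physical registers read from
        freed_mask |= 1 << reg                 # the recovery list at commit
    return mask | freed_mask                   # OR the freed mask into the free list

# example: allocate two registers from one bank, then free them at commit
m = make_free_mask()
r0, m = allocate(m)
r1, m = allocate(m)
m = free_registers(m, [r0, r1])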

The steps involved in renaming a destination register during result write time are summarized as follows:

1. Two cycles before the result is ready, registers are allocated for all banks by searching the appropriate free lists.

2. If the allocation fails for the bank which the result is writing to, writing of the result must stall. Otherwise, the result writes to the assigned bank as scheduled.

3. The instruction tag of the result is used to determine if the destination operand has moved from the CAM lookup to the mapping table.

4. The allocated physical register is written into the appropriate entry in either the CAM lookup or the mapping table, depending on that determination. The corresponding ready bit is also set.

5. If the destination operand has been written to the mapping table, a corresponding entry in the recovery list will exist. The instruction tag of the result is used to read the appropriate entry in the recovery list and mark the result as completed. This entry provides the next instruction tag for that register.

6. The allocated physical register number is written to the recovery list entry pointed to by the next instruction tag, provided that tag is valid.

8. Performance

This section compares the performance of dynamic result renaming with renaming during decoding. The performance is first compared using the instructions per cycle (IPC) metric and then using the billions of instructions per second (BIPS) metric.

To begin with, the overall performance is compared using the IPC metric while the number of speculative physical registers is varied. Figs. 6 and 7 show the IPC for a 4-way and an 8-way issue processor, respectively. The map-on-decode scheme (referred to as the base case) does not need to vary the speculative physical registers because the recovery list is constant at 32 (or 64) entries and requires a fixed amount. Result renaming, on the other hand, can operate with more or fewer registers than this amount.

The performance difference from the base case is negligible, except for SPECint95 on the SPARC architecture, where a 5% decrease is observed. Increasing the number of physical registers reduced this gap for the 8-way issue processor but did not help much for the 4-way issue processor.

Fig. 6. Register file performance comparison for a 4-way superscalar (IPC versus the number of speculative physical registers).

Fig. 7. Register file performance comparison for an 8-way superscalar (IPC versus the number of speculative physical registers).

The reason why a slight performance decrease is observed in the SPARC architecture and not in the SDSP architecture is the difference in the number of logical integer registers. The ratio of logical to speculative registers in the SDSP is 1:1, while the ratio is 4.25 for the SPARC (136 integer registers) for a 4-way issue processor. This ratio is cut in half when the decode width and recovery list are doubled. When the ratio is large, speculative registers have a difficult time competing against logical registers in a bank. Logical registers can pool into one particular bank, thereby restricting its usage; registers then tend to be allocated from a bank with most of the free registers, and functional units stall because writing becomes limited. This performance problem can be avoided by reducing the ratio and/or increasing the number of banks.

With no or only a small performance decrease, the number of speculative registers can be reduced by 25-50%. As a result, the total and speculative utilization of the register file increases. Sometimes the performance can actually increase by decreasing the number of registers; this is not due to the renaming, but to a slight reduction in the branch misprediction penalty.

In Section 2.3, it was pointed out that the complexity of the register file increases the cycle time of the register file. As a result, the cycle time of the processor may have to increase. By assuming that the cycle time of the processor is equal to the cycle time of the register file, the performance can be compared more accurately. The cycle time of the register file was estimated using the timing model of Wilton and Jouppi [9]. The technology and implementation parameters are identical to those used in their report, except the results are factored for a 0.5 μm CMOS technology instead of a 0.8 μm CMOS technology. The register file is similar to a direct-mapped cache without a tag. The number of ports in the register file was taken into consideration by appropriately increasing the capacitance of the word and bit lines.

Figs. 8 and 9 show the billions of instructions per second (BIPS) and the cycle time for a 4-way and an 8-way processor, respectively. The BIPS metric reflects the cycle time, since it is the IPC divided by the cycle time.
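In equation form, with the cycle time expressed in nanoseconds; the numerical example combines figures quoted earlier in the paper (an IPC of 2.63 and a 2.8 ns register file) purely for illustration:

\[
\mathrm{BIPS} = \frac{\mathrm{IPC}}{t_{\mathrm{cycle}}\ [\mathrm{ns}]},
\qquad \text{e.g.}\quad \frac{2.63}{2.8\ \mathrm{ns}} \approx 0.94\ \mathrm{BIPS}.
\]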

Fig. 8. BIPS and cycle time performance comparison for a 4-way superscalar (BIPS and register file cycle time versus the number of speculative physical registers; SDSP and SPARC, integer and FP, scalable versus base).

Fig. 9. BIPS and cycle time performance comparison for an 8-way superscalar (same format as Fig. 8).

As with Figs. 6 and 7, the number of speculative registers is varied and the performance is compared to the base case. The performance of the scalable register file is significantly higher than the performance of the base case for both the SDSP and the SPARC; the increase in BIPS is approximately 25%. This can be entirely attributed to the reduction in the cycle time of the register file. In fact, by reducing the number of speculative registers, the performance continues to increase, because the slight loss in IPC is small compared to the improvement in cycle time. The reduction in cycle time for the SDSP is greater than for the SPARC because the ratio of logical to speculative registers is much higher for the SPARC; as a result, the percentage reduction of registers is much smaller for the SPARC register file.

9. Conclusion

The MIPS R10000 uses a powerful register renaming technique, which renames registers from a logical register to a physical register when instructions are decoded. This work introduced a new technique which renames registers during result writing instead of during instruction decoding. As a result, multiple scalar register files can be used, and the cost of the register file scales with the number of instructions issued per cycle. Furthermore, the register file utilization can be increased by reducing the number of physical registers. On the other hand, renaming at result write time creates problems with reading registers and deadlock considerations to avoid, but simulation results show that the proposed technique did not have any significant performance drawbacks compared to mapping during instruction decoding.

Another benefit of using a scalable register file architecture is that the cycle time can be close to that of a scalar register file. Consequently, a tremendous increase in instructions executed per second is observed when the cycle time of the register file is taken into account. In addition, the performance can continue to increase by decreasing the number of physical registers.

References

[1] M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, Englewood Cliffs, 1991.

[2] K.C. Yeager, MIPS R10000 superscalar microprocessor, IEEE Micro (April 1996) 28-40.

[3] S.P. Song, M. Denman, J. Chang, The Power PC 604 RISC microprocessor, IEEE Micro (October 1994) 8-17.

[4] D.B. Papworth, Tuning the Pentium Pro microarchitecture, IEEE Micro (April 1996) 8-15.

[5] A. Capitanio, N. Dutt, A. Nicolau, Partitioned register files for VLIWs: a preliminary analysis of tradeoffs, in 25th Annual International Symposium on Microarchitecture, Portland, Oregon, December 1992, pp. 292-300.

[6] K.I. Farkas, N.P. Jouppi, P. Chow, Register file design considerations in dynamically scheduled processors, in Second International Symposium on High-Performance Computer Architecture, February 1996, pp. 40-51.

[7] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, in 23rd Annual International Symposium on Computer Architecture, May 1996.

[8] J.E. Smith, A.R. Pleszkun, Implementing precise interrupts in pipelined processors, IEEE Transactions on Computers C-37 (May 1988) 562-573.

[9] S.J.E. Wilton, N.P. Jouppi, An enhanced access and cycle time model for on-chip caches, TR 93/5, Digital Equipment Corporation Western Research Lab, July 1994.

[10] B. Cmelik, D. Keppel, Shade: a fast instruction-set simulator for execution profiling, in ACM SIGMETRICS, 1994.

[11] S. Wallace, N. Bagherzadeh, Performance issues of a superscalar microprocessor, Microprocessors and Microsystems 19(4) (May 1995) 187-199.

[12] S. Wallace, N. Dagli, N. Bagherzadeh, Design and implementation of a 100 MHz centralized instruction window for a superscalar microprocessor, in 1995 International Conference on Computer Design, October 1995.

[13] P. Ahuja, D. Clark, A. Rogers, The performance impact of incomplete bypassing in processor pipelines, in 28th Annual International Symposium on Microarchitecture, November 1995.

Steven Wallace received a B.S. in information and computer science and a B.S. in electrical and computer engineering from the University of California, Irvine. He also received an M.S. degree and a Ph.D. degree in electrical and computer engineering from the same university. Currently he is a post-doctoral researcher at the University of California, San Diego. His research interests are in computer architecture and VLSI design.

Nader Bagherzadeh received a Ph.D. degree in computer engineering from the University of Texas at Austin in 1987. From 1980 to 1984 he was with AT&T Bell Laboratories in Holmdel, New Jersey. Since 1988 he has been with the Department of Electrical and Computer Engineering at the University of California, Irvine. His research interests are parallel processing, reconfigurable computing, computer architecture, and VLSI design.