
[IEEE 2009 22nd International Conference on VLSI Design: concurrently with the 8th International Conference on Embedded Systems - New Delhi, India (2009.01.5-2009.01.9)]

Exploring the Limits of Port Reduction in Centralized Register Files Sandeep Sirsi and Aneesh Aggarwal

Electrical and Computer Engineering Department Binghamton University, Binghamton, NY

{[email protected]}

Abstract

Register file access falls on the critical path of a microprocessor because large, heavily ported register files are used to exploit more parallelism. In this paper, we focus on reducing register file complexity by reducing the number of register file read ports. The goal of this paper is to explore the limits of read port reduction in a centralized integer register file, i.e., how few read ports can be provided to a centralized integer register file while still maintaining performance. A naïve port reduction may result in significant performance degradation and does not give a true measure of the limits, while clever techniques may be able to further reduce the number of ports. Hence, in this paper, we drastically reduce the number of ports and then investigate techniques to improve the performance of the reduced-ported register file. Our experiments show that the techniques allow further port reduction by improving the performance of reduced-ported RFs. For instance, with our experimental parameters, the naïve port reduction method requires at least five read ports to maintain a performance impact of less than 5%, whereas our techniques require only three ports.

Keywords: Complexity-effective design, Register file, Port reduction, Instructions per cycle

1 Introduction

Modern microprocessors use heavily ported large register files (RFs) for exploiting instruction level parallelism. Since the register file lies in the critical path of dependent instruction execution, heavily ported large register files have significant clock cycle time and energy dissipation implications in microprocessor design [1, 3, 4]. In fact, a study [5] has shown that for current out-of-order superscalar processor designs such as MIPS R10000 [7] and Alpha 21264 [6], RF consumes the largest fraction of the total chip power consumption.

One method to reduce RF complexity is to have multiple RF banks. In this method, a single centralized RF is constructed from multiple interleaved register banks [1, 13], or the banks are used in a decentralized fashion for clustered processors [6, 21]. Our work differs from multi-banked register files, as we consider a centralized RF. For a centralized RF, there are two possible design options to reduce RF complexity – reduce the number of registers in the RF and/or reduce the number of RF ports. Both options can significantly reduce the amount of instruction level parallelism that can be exploited. However, the relative performance degradation between the two options depends on whether the registers or the RF ports are in higher demand.

To be able to achieve good performance from reduced-complexity RFs, it is important to investigate techniques that increase the parallelism exploited when using them. Several techniques [8, 9, 10] have been proposed to improve the parallelism from an RF with fewer registers. Unfortunately, there has been only one other effort to investigate the performance of a centralized reduced-ported RF [12]. The goal of this paper is to explore the limits of RF read port reduction (i.e., the lowest number of RF read ports that can be provided) in a centralized integer register file, while still maintaining good performance. A brute-force method of port reduction may result in significant performance degradation, thus restricting the extent of port reduction. Clever port reduction techniques, on the other hand, may be able to further reduce the number of ports. Hence, in this paper, we drastically reduce the number of read ports and then propose and investigate techniques to efficiently utilize the reduced read ports provided in a centralized RF. In this paper, we focus only on the integer register file.

To alleviate the performance impact of a reduced-ported RF, our techniques (i) limit the drop in the number of operands read from the forwarding path, and (ii) increase the effective number of read ports in the reduced-ported RF. Our experiments show that our techniques significantly improve the parallelism that is exploited from reduced-ported RFs, thus allowing further reduction in the number of RF read ports. For instance, the brute-force method incurs a performance impact of 25% for a two-ported RF, whereas our techniques incur a performance impact of only about 9%.

2 Register File Design

2.1 Brute-Force Reduced-Ported Register File (TraRP—RF) design

Fewer RF read ports can be provided because not all instructions have both register operands, and many operands are read from the forwarding path. The scheduler can be modified so that only those instructions that can be serviced with the reduced number of ports are issued (for instance, using bypass hints as discussed in [12]). This implementation can have a significant impact on the already complex dynamic scheduler design [28]. We use an alternative approach that performs port arbitration on the issued instructions and requires only minor modifications to the scheduler. For a full-ported RF, the issued instructions read their operands in the next cycle. However, with the basic implementation of a reduced number of ports, the issued instructions arbitrate for ports, and only those instructions that acquire the required number of ports proceed to the next stage. The port arbitration logic (PAL) involves port requirement analysis followed by the actual port arbitration. The PAL ensures that only the required number of ports is requested and that a consumer instruction is not executed before its producer. The instructions that acquire the ports proceed to the following stages, whereas those that do not acquire the required ports stall and arbitrate for the ports again in the next cycle. We do not discuss the detailed implementation of the PAL, to conserve space.

2009 22nd International Conference on VLSI Design
1063-9667/09 $25.00 © 2009 IEEE
DOI 10.1109/VLSI.Design.2009.29
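As a behavioral sketch (not the authors' hardware implementation; the function name and data layout are illustrative), the brute-force, all-or-nothing arbitration step can be modeled as:

```python
# Hedged sketch of brute-force read-port arbitration: each issued
# instruction requests 0-2 RF read ports, and an instruction proceeds
# only if ALL of its requested ports are granted in that cycle;
# otherwise it stalls and re-arbitrates next cycle.

def arbitrate_brute_force(instructions, num_ports):
    """instructions: list of (name, ports_needed) in latch order.
    Returns (proceed, stalled). An instruction that cannot get its full
    requirement gets nothing -- the source of port wastage."""
    free = num_ports
    proceed, stalled = [], []
    for name, need in instructions:
        if need <= free:
            free -= need
            proceed.append(name)
        else:
            stalled.append(name)
    return proceed, stalled
```

For example, with two ports and instructions X (2 ports), Y (1 port), Z (2 ports), only X proceeds; with one free port and every waiter needing two, the port goes unused, which is exactly the wastage Section 2.2 describes.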

2.2 Limitations with the brute-force approach

In the basic implementation of reduced-ported RFs, at least two ports are required because instructions may read both their operands from the RF. In this implementation, the stalled instructions result in more register file reads. For instance, consider an instruction that requires one operand from the RF while the other operand is available from the forwarding logic. By the time the instruction acquires an RF read port, the other operand's value may already have been written into the RF and may no longer be available from the forwarding path. The stalled instructions also reduce the effective issue width of the processor (the issue slots where the instructions are stalled are frozen), delaying the issue of dependent instructions.

The brute-force approach also results in wastage of read ports, which are a precious commodity in reduced-ported RFs. Read ports are wasted because an instruction cannot proceed until it has all the required ports. For instance, if there is just one free read port available and all the instructions arbitrating for the read ports require two operands from the RF, then the read port is wasted because it cannot be allocated to any of the instructions.

2.3 Split-instruction Reduced-Ported Register File (SiRP—RF) Design

As discussed in the previous section, operands of the stalled instructions can be dropped from the forwarding path by the time these instructions acquire the required RF read ports. These dropped operands must then be read from the register file, increasing the RF read port requirements. In this technique, we propose splitting an issued instruction's operands into those whose values are available and those whose values have not yet been read from the RF. Hence, for issued instructions, the register operands for which RF ports are allocated, and those that are available from the forwarding path, are forwarded to the next pipe stage. In this approach, it may happen that only one register operand (out of the two) for an instruction is forwarded, splitting the instruction among the two pipeline stages. The register operands that are forwarded to the next stage obtain their values using the ports or off the forwarding path in the next cycle, irrespective of whether the instruction is split or not, and move to the following stage. A register operand that is left behind arbitrates for a port again in the next cycle. In the following cycles, the remaining operands of a "split" instruction are given priority for RF ports. When all the operands of an instruction are available, the instruction executes on the corresponding functional unit. This technique alleviates the limitations of the TraRP—RF approach.
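The per-operand arbitration that distinguishes SiRP—RF from the brute-force design can be sketched as follows (illustrative code, not the paper's implementation; the function name and tuple layout are assumptions):

```python
# Hedged sketch of the split-instruction (SiRP) idea: ports are granted
# per OPERAND rather than per instruction, so an instruction can advance
# with one operand and leave the other behind to re-arbitrate. Operands
# of already-split instructions get priority, as the paper states.

def arbitrate_per_operand(pending, num_ports, split_set):
    """pending: list of (instr, operand) pairs still needing an RF port.
    split_set: set of instructions already split across pipe stages.
    Returns (granted, leftover)."""
    # False sorts before True, so split instructions' operands come first;
    # Python's sort is stable, preserving latch order within each group.
    ordered = sorted(pending, key=lambda p: p[0] not in split_set)
    return ordered[:num_ports], ordered[num_ports:]
```

With a single port, an operand of an already-split instruction Y wins over both operands of a new instruction X, so Y can complete rather than starve.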

Figure 1 illustrates an example of this RF design. In Figure 1, instruction X requires both the operands from the RF and instruction Y requires one operand from the RF and the other from the forwarding path. The RF is provided with a single read port. In the next cycle, X acquires the one port and hence op1 of X and op2 (obtained from the forwarding path) of Y proceed to the allocated port latches. In the next cycle, these two operands get their values, which are written into the corresponding ready-to-execute latches, and X again acquires the one read port and its op2 also moves forward. In addition, another issued instruction Z takes the place of X. In the next cycle, both Z and Y again arbitrate for the ports.

Figure 1: Example Illustrating the Working of the SiRP—RF design

2.4 Split Ported Reduced-Ported Register File (SpoRP—RF) Design

This technique is proposed to increase the effective number of ports in a reduced-ported RF and is implemented on top of the SiRP—RF design. The design exploits the observation that many of the operands used by instructions, and read from the RF, are narrow in size, i.e., they require fewer bits for representation [14]. Our experiments showed that about 50% of the operands read from the RF are narrow. A value is defined as narrow if it has 16 or more (out of a maximum of 32) leading zero bits. In this design, we partition each available RF read port into two portions. The portion that can read the upper bits of a register is called the upper port and the other portion is called the lower port. For a 32-bit register, the upper port reads the upper 16 bits and the lower port reads the lower 16 bits. Of course, this technique requires one decoder per narrow port. In this design, the instructions specifically arbitrate for ½, 1, 1½, or 2 ports, depending on the size of their operands. The sizes of the values written into the RF are recorded using a size bit for each register. Note that it is not required to determine the size of the operands read from the forwarding path.
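The narrowness test and the fractional port demand it induces can be sketched as follows (illustrative helpers, not the paper's hardware; counting demand in half-port units is our encoding choice):

```python
# Hedged sketch: a 32-bit value is "narrow" if its upper 16 bits are
# zero, and a narrow operand costs half a read port in the SpoRP scheme.

def is_narrow(value):
    """True if the upper 16 of 32 bits are zero (the paper's definition:
    16 or more leading zero bits)."""
    return (value & 0xFFFF0000) == 0

def port_cost(operand_values):
    """Total read-port demand of the operands, in half-port units
    (2 == one full port), mirroring the 1/2, 1, 1 1/2, 2 arbitration."""
    return sum(1 if is_narrow(v) else 2 for v in operand_values)
```

An instruction reading one narrow and one full-width operand thus arbitrates for 1½ ports (3 half-ports).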

This RF design would still result in the lower ports being utilized more often while the upper ports sit idle. For instance, if two issued instructions both want to read a narrow operand from the RF, they both will want to acquire the lower port. Hence, we replicate a narrow value in the register so that it can be read by either the upper or the lower port. For instance, the value 0x000067f3 will be stored in the register as 0x67f367f3. The replication operation is performed in the dummy memory pipeline stage for non-memory instructions and in parallel to the cache output multiplexers for memory instructions.
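The replication step itself is a simple bit operation; a sketch (function name is ours) matching the paper's 0x000067f3 → 0x67f367f3 example:

```python
# Hedged sketch: duplicate the low 16 bits of a narrow value into the
# upper half of the register, so the value can later be read from EITHER
# the upper or the lower half-port.

def replicate_narrow(value):
    low = value & 0xFFFF          # the significant 16 bits
    return (low << 16) | low      # store a copy in the upper half too
```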

In this design, upper or lower portions of ports may still be wasted. For instance, for a single-ported RF, if the lower port is already acquired by an instruction and another instruction wants to read a normal-sized operand, then that instruction can acquire both the lower and the upper ports only in the next cycle. So, in this cycle, the upper port is wasted. To resolve this issue, we further split the operands across the pipeline stages.

Figure 2(a) shows the schematic organization of a conventional register file (RF) with four read ports dedicated to the two functional units (FUs). Figure 2(b) shows the TraRP—RF organization with two ports for the two FUs. In this case, the output of each port is supplied to both the operands of the two FUs. Figure 2(c) shows the SpoRP—RF organization with one partitioned read port for the two FUs. In SpoRP—RF, the operand values read from the register file can be "0L", "0U", or "UL", where "0" specifies 16 zero bits, "L" specifies the 16 bits read using the lower port, and "U" specifies the 16 bits read using the upper port. Note that this multiplexer implementation will have minimal effect on latency because the control signals for the multiplexers are set in parallel to reading the registers.
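The operand-assembly multiplexing of Figure 2(c) can be modeled as below (an illustrative sketch; the mode strings follow the paper's "0L"/"0U"/"UL" notation, the function name is ours):

```python
# Hedged sketch of the operand-assembly mux: a 32-bit operand is rebuilt
# from the 16-bit halves delivered by the split read port.

def assemble(mode, lower=0, upper=0):
    """mode '0L': narrow value via the lower half-port (upper bits zero).
    mode '0U': narrow value via the upper half-port (works because
               narrow values are replicated into the upper half).
    mode 'UL': full-width value using both halves of the port."""
    if mode == '0L':
        return lower
    if mode == '0U':
        return upper
    if mode == 'UL':
        return (upper << 16) | lower
    raise ValueError(f"unknown mode: {mode}")
```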

Figure 2: (a) Conventional RF with a dedicated port for each operand; (b) TraRP—RF with two ports; and (c) SpoRP—RF with one partitioned port

2.5 Arbitration of Ports

The performance impact of reduced read ports also depends on the arbitration policy used for allocating the ports. We experimented with four different arbitration policies for read port allocation. Latch-position-based is the simplest of the allocation policies: priority is given to the instruction in the topmost pipeline latch, then to the instruction in the following latch, and so on. The load-preference-based policy is still latch-position based; however, higher priority is given to load instructions, followed by ALU instructions and then by store instructions. In the age-based policy, an older instruction (in terms of its position in program order) is given a higher priority during allocation of ports. The implementation of this policy will be more complex than the latch-position-based one. We observed that the port-requirement-based policy, in which the PAL gives the highest priority to the instruction requiring the fewest ports, is the best performing one. A tie is broken using latch-position-based priority.
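The best-performing policy is a two-key priority sort, which can be sketched as (illustrative; the tuple encoding is an assumption):

```python
# Hedged sketch of the port-requirement-based arbitration policy:
# instructions requiring the fewest ports are served first, with ties
# broken by latch position (lower latch index wins).

def priority_order(instructions):
    """instructions: list of (latch_pos, ports_needed).
    Returns the order in which the PAL would consider them."""
    return sorted(instructions, key=lambda i: (i[1], i[0]))
```

For example, an instruction needing one port in latch 2 is served before an instruction needing two ports in latch 0.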

3 Experimental Results

3.1 Experimental Setup

Table 1 gives the experimental parameters for our baseline superscalar processor without a reduction in RF ports. For this study, we experiment with a relatively wide machine because the port requirements are in any case low for a narrow machine. In this paper, the schemes are applied only to the Integer RF accessed by the Integer FUs, and the write ports are kept intact. To measure IPCs, we modify the SimpleScalar simulator [18], simulating a 32-bit PISA architecture. For benchmarks, we use a collection of 13 SPEC2000 integer and floating point benchmarks. The statistics are collected for 200M instructions after skipping the first 1B instructions. Our base pipeline has 8 frontend stages. An additional stage is introduced after issue for the PAL. We assume a single pipe stage for register access and two pipe stages for data memory access.

Parameter               Value
Fetch/Commit Width      8 instructions
Unified Register File   128 Int / 128 FP
Integer FUs             4 ALU, 2 AGUs, 1 Mul/Div
Floating-point FUs      3 ALU, 1 Mul/Div
Integer RF              10 read / 5 write ports
FP RF                   6 read / 3 write ports
Issue Width             5 Int / 3 FP
Issue Queue Size        96 Int / 64 FP
Branch Predictor        4K Gshare
BTB Size                4K entries, 2-way
L1 Icache               32K, direct mapped, 2-cycle latency
L1 Dcache               32K, 4-way, 2-cycle latency
L2 Cache                Unified 512K, 8-way, 10-cycle latency
Memory Latency          100 cycles first word, 2 cycles inter-word
ROB Size                256 instructions
Load/Store Buffer       64 entries

Table 1: Default Experimental Parameters for the Base Superscalar Processor

3.2 Experimental Results

Our experiments showed that, on average, more than 50% of the integer operands are obtained from the forwarding path and about 40-50% of the operands read from the RF are narrow in size. Interestingly, we observed that even though the percentage of narrow-sized register values produced was significantly higher than 20%, most of the narrow values produced are consumed off the forwarding path, resulting in only about 20% being read from the RF.

Next, we evaluate the performance of the different RFs in terms of instructions per cycle (IPC). We present only the results with the best-performing port-requirement-based port allocation policy, to conserve space. We observed an average IPC drop of about 25% for the two-ported TraRP—RF configuration with respect to the base superscalar with a 10-ported RF. Figure 3 compares the IPCs of single- and two-ported SiRP—RF and SpoRP—RF configurations with that of a two-ported TraRP—RF. Figure 3 shows that the parallelism exploited increases in SiRP—RF and SpoRP—RF. On average, the two-ported SiRP—RF and SpoRP—RF configurations perform about 5% and 10% better than TraRP—RF. However, the single-ported SiRP—RF and SpoRP—RF configurations perform, on average, about 30% and 17% worse than TraRP—RF. The best performing configuration – two-ported SiRP—RF – still performs about 16% worse than the base 10-ported RF. This suggests that either further techniques are required to reduce the performance impact, or more read ports have to be provided.

Figure 3: Normalized Performance of SiRP—RF and SpoRP—RF with respect to the TraRP—RF

4 Enhancements to Improve Performance

In the SpoRP—RF design, the provided ports are fully utilized, i.e. none of the lower or upper ports are idle if there are issued instructions that want to read a value from the RF. However, even the SpoRP—RF design still incurs delays in the issue of dependent instructions because of the waiting instructions. Delays in the issue of instructions reduce the effective issue width and further increase the RF read port requirements. Figure 4 compares the percentage of operands for integer instructions obtained from the forwarding path for one- and two-ported SpoRP—RF and two-ported TraRP—RF. Figure 4 shows that the percentage of operands obtained from the forwarding path reduces by about 15% for two-ported SpoRP—RF and by about 18% for two-ported TraRP—RF. The reduction for one-ported SpoRP—RF is about 20%. The enhancements proposed in this section further improve the parallelism exploited from the SpoRP—RF by (i) further reducing the port requirement by making more operands available from the forwarding path, and (ii) further increasing the effective number of ports available for the instructions. We discuss the results of all the enhancements in Section 4.4.

4.1 Latching the Forwarded Values

In a traditional back-end pipeline, a result value can be forwarded from several pipe stages, and different pipe stages may be forwarding different results. Hence, to determine the register whose value is present on a particular forwarding path, a buffer, the register identifier (RID) buffer, is typically used to store the register identifiers of the values on the forwarding paths. As the register values pass through the backend pipeline stages, their register identifiers are written into the RID buffer. Once a value is available from a register, the valid bit for the register is set. From then onwards, all the instructions that require that value read it from the register.

Figure 4: Percentage of operands read from the forwarding path for the integer instructions

In the SpoRP—RF design, many of the functional units (FUs) may not be utilized every cycle because the instructions issued to those FUs are waiting for RF read ports to read their operands. Hence, no new values are produced at the outputs of these FUs. Also, some instructions, such as branches, do not produce any result. We propose a technique where the values produced by FUs are latched onto the forwarding paths until new valid values are present. Note that the latched values are still driven on the forwarding path every cycle. With this enhancement, a value is available from the forwarding path as long as its register identifier is still present in the RID buffer. In this technique, even the delayed dependent instructions may be able to read their operands from the forwarding path, increasing the number of operands read from the forwarding path. For instance, this enhancement increased the operands obtained from the forwarding path from about 38% to about 44% for the two-ported SpoRP—RF.
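The latching behavior of a single forwarding path, together with its RID-buffer entry, can be sketched as (an illustrative model, not the authors' circuit; the class and method names are ours):

```python
# Hedged sketch of the latch-the-forwarded-values (LFV) enhancement:
# each forwarding path holds its last result, tagged with the register
# identifier (RID), until the FU drives a new valid value. A delayed
# consumer can therefore still pick the operand off the path.

class ForwardPath:
    def __init__(self):
        self.rid = None      # RID of the value currently latched
        self.value = None

    def drive(self, rid, value):
        """The FU produced a new result; it overwrites the latch."""
        self.rid, self.value = rid, value

    def read(self, rid):
        """Return the value if this register is still latched on the
        path, else None (the consumer must then use an RF read port)."""
        return self.value if self.rid == rid else None
```

Without the latch, the value would be gone one cycle after being produced; with it, any later reader of the same RID is serviced for free until the FU produces a new result.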

4.2 Servicing Multiple Instructions with a Single Port

This technique also attempts to increase the effective number of read ports. With multiple instructions vying for RF read ports, it may happen that in the same cycle, multiple instructions require a value from the same register. In all our techniques so far, different RF read ports are used to read the same register for these instructions. So, with limited read ports, one instruction may have to wait for a read port even though its operand value is already being read from the RF. We therefore experimented with a technique in which the value read from the RF using a single port (which could be the lower or the upper port, or the entire port) is passed on to all the instructions that require that value. To implement this technique, in parallel to vying for the RF read ports, an issued instruction also compares its source operands' register identifiers with those of all the other issued instructions. If an instruction (say, X) does not acquire an RF read port, but the register identifier of its source operand (op1) matches that of an instruction (Y) that has acquired a port, then op1 of instruction X also proceeds to the following stages. In this case, the value read from the RF is forwarded to both instructions.
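The register-identifier match that lets one port service several consumers can be sketched as (illustrative; the function name and data shapes are assumptions):

```python
# Hedged sketch of servicing multiple instructions with a single port
# (SMP): a waiting instruction is serviced "for free" if some other
# instruction that DID win a port is reading the same register this
# cycle, since the single RF read can be broadcast to both.

def share_reads(granted_rids, waiting):
    """granted_rids: set of register ids being read this cycle.
    waiting: list of (instr, rid) operand requests that lost arbitration.
    Returns the waiting requests that piggyback on a granted read."""
    return [(instr, rid) for (instr, rid) in waiting
            if rid in granted_rids]
```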

4.3 Additional RF Read Port for a Few Registers

We also experiment with a technique that increases the overall effective number of RF read ports by providing just one additional read port to only a few registers. We observed in our experiments that about 63% of the normal-sized operands from the register file are read for the load/store instructions, especially for address computation. This is because an address value cannot be narrow. Hence, we propose a technique where a few registers are provided with an additional unpartitioned read port. For a centralized RF, this technique can be implemented by providing the additional port for the bottom registers. For instance, for a 128-entry RF, the registers numbered 118 to 127 can be provided with the additional port. We call such registers addport registers. The addport registers can also be read using the other (non-additional) ports. The addport registers are allocated to instructions whose results can be used for address computation in load and store instructions. For this technique, we provide an additional bit for each instruction in the instruction cache. This bit is dynamically set for the instructions whose result can be used in address computation. Such instructions are identified when they execute for the first time. Note that the output of this additional port is dedicated only to the AGUs used for address computation of load/store instructions.

In this technique, when an instruction that produces a result used in address computation of load/store instructions is renamed, it is allocated an addport register. If no such registers are free, then it is allocated any other register. In addition, the addport registers can also be allocated to other instructions (whose results are not used in load/store address computation) if the remaining registers are all allocated. However, only the load/store instructions that need to read the addport registers can vie for the additional port. We implemented this technique by providing the additional unpartitioned port for 16 registers (out of the 128 registers).
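The rename-time allocation preference described above can be sketched as (illustrative; the free lists and function name are our modeling choices, not the paper's rename hardware):

```python
# Hedged sketch of addport-register allocation at rename: a producer
# whose result feeds load/store address computation prefers the small
# pool of extra-ported (addport) registers; any instruction falls back
# to the other pool when its preferred pool is empty.

def allocate(free_addport, free_normal, feeds_address):
    """free_addport / free_normal: lists of free physical register
    numbers in each pool. Pops and returns the chosen register, or
    None if the RF is full."""
    first, second = ((free_addport, free_normal) if feeds_address
                     else (free_normal, free_addport))
    if first:
        return first.pop()
    if second:
        return second.pop()
    return None
```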

4.4 Results

Figure 5 presents the performance results for the SpoRP—RF organization when the techniques – latching the forwarded values (LFV), providing an additional read port for a few registers (ARP), and servicing multiple instructions with a single port (SMP) – are applied individually, and when all the enhancements are applied together (ALL). As compared to the two-ported TraRP—RF, the ALL enhancements improve IPC (parallelism) by about 21% for the two-ported SpoRP—RF (a maximum of more than 50% for bzip2) and about 10% for the one-ported SpoRP—RF (a maximum of more than 30% for vortex). All the enhancements achieve reasonable performance improvement even when applied individually. Note that the ALL enhancements take the one-ported SpoRP—RF from almost 20% below to about 10% above the two-ported TraRP—RF configuration. The two-ported SpoRP—RF organization with ALL enhancements still performs about 9% worse than the full-ported RF, but this is down from 25% for the two-ported TraRP—RF.

Next, we investigate the performance of the SpoRP—RF configuration with ALL enhancements as the number of ports is increased. Figure 6 presents the normalized IPC of the one-, two- and three-ported SpoRP—RF configurations with ALL enhancements, and of the three-ported TraRP—RF, as compared to the two-ported TraRP—RF. As can be seen, the two-ported SpoRP—RF configuration performs better than the three-ported TraRP—RF. The three-ported SpoRP—RF configuration is about 6% better than the two-ported SpoRP—RF, about 11% better than the three-ported TraRP—RF, and only about 4% less than the full-ported RF. The three-ported TraRP—RF is about 16% lower than the full-ported RF. To maintain a performance impact of less than 5%, we observed that the brute-force port reduction method requires at least five read ports.

Figure 5: Normalized performance of one- and two-ported SpoRP-RF (with respect to two-ported TraRP-RF) with and without the enhancements, using the port-requirement-based policy

Figure 6: Normalized IPC with respect to two-ported TraRP-RF for varying number of ports

5 Related Work

Most of the low-complexity RF schemes propose banking or partitioning the register file or reducing the number of registers. The only other technique proposed to reduce the number of ports of a centralized RF also uses bypass hints to reduce the RF port requirements [12]. The bypass-hint mechanism uses the wakeup tag search to determine bypassibility, which can significantly increase the complexity of the dynamic scheduler. In our schemes, we avoid modifications to the complex dynamic scheduler hardware.



We also propose many more schemes to further improve the parallelism with reduced RF ports.

Software-controlled two-level RFs have been proposed in [15, 16, 17]. Cruz et al. [3] propose a two-level hierarchical RF: they use a fully associative upper-level RF, with run-time caching and prefetching to store the critical register values in the upper level, as was also proposed in [19]. The authors in [1] propose a two-level RF that uses a Usage Table to record information for every physical register. The L1 RF is the conventional RF, to which an L2 RF is added; registers are written to the L2 RF only when the number of free physical registers falls below a pre-set threshold.

Partitioning or replication of monolithic register files has been proposed in the context of clustered processors [20, 21, 6, 22]. These organizations reduce the porting requirements per cluster, at the cost of inter-cluster communication. Partitioned register files have also been proposed for a VLIW processor [23]. Balasubramonian et al. [1] also propose reducing the ports per bank by modifying the scheduler.

Researchers have exploited the inefficiencies in register usage to reduce the number of registers in three major ways. One set of solutions delays the actual allocation of a physical register until the result is written back [e.g. 24, 25]. A second set reduces the number of registers through register sharing [e.g. 26]. A third set reduces register file pressure through early deallocation of physical registers [e.g. 8].

6 Summary and Conclusions

Heavily ported large register files are provided to achieve good performance for modern processors with large issue queue sizes and issue widths. Such a large register file has a significant impact on the processor clock cycle time and power consumption. The register file size can be reduced by reducing the number of read ports that are provided. However, a naïve reduction of the register file read ports results in significant performance degradation, thus limiting the extent of port reduction. In this paper, we propose and investigate various techniques to improve the parallelism that can be exploited from a reduced-ported RF, thus enabling further reduction in ports.

We propose a unique RF organization in which operand reads are split among pipeline stages and the ports are partitioned so that a single port can read two narrow operands from different registers. These techniques reduce the number of operands dropped from the forwarding path and increase the effective number of available read ports. We also proposed further enhancements to this organization, such as latching the values in the forwarding paths, forwarding a register value read from the RF to all the instructions requiring that value, and providing an additional read port for a few registers. With these techniques, the parallelism exploited from a reduced-ported RF is significantly increased, allowing the number of integer RF read ports to be reduced from ten to three while maintaining an average performance impact of less than 5%.
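The port-partitioning idea above can be illustrated with a minimal software sketch. This is not the hardware design itself: it only models, under an assumed 64-bit port split into two 32-bit halves, how one physical read port can serve two narrow operands from different registers; the function names are hypothetical.

```python
PORT_WIDTH = 64                 # assumed full read-port width in bits
HALF = PORT_WIDTH // 2          # each port partition carries 32 bits
NARROW_MAX = (1 << HALF) - 1    # largest value that counts as "narrow" here

def pack_reads(val_a, val_b):
    """Return one port-wide bundle if both operands are narrow, else None."""
    if val_a <= NARROW_MAX and val_b <= NARROW_MAX:
        # A single port access serves both operands at once.
        return (val_a << HALF) | val_b
    return None  # a wide operand needs a full port by itself

def unpack_reads(bundle):
    """Split a packed port read back into its two narrow operands."""
    return bundle >> HALF, bundle & NARROW_MAX
```

In effect, whenever two pending reads both target narrow register values, the partitioned port doubles its delivery rate, which is one source of the increase in effective read-port count.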

References
[1] R. Balasubramonian, et al., "Reducing the Complexity of the Register File in Dynamic Superscalar Processors," Proc. MICRO-34, 2001.
[2] K. Farkas, et al., "Register File Considerations in Dynamically Scheduled Processors," Proc. HPCA, 1996.
[3] J.-L. Cruz, et al., "Multiple-Banked Register File Architectures," Proc. ISCA-27, 2000.
[4] D. M. Tullsen, et al., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. ISCA, 1996.
[5] V. Zyuban and P. Kogge, "Optimization of High-Performance Superscalar Architectures for Energy Efficiency," Proc. ISLPED, 2000.
[6] R. Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, March/April 1999.
[7] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996.
[8] J. Martinez, J. Renau, M. Huang, M. Prvulovic, and J. Torrellas, "Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors," Proc. MICRO-35, 2002.
[9] Monreal, et al., "Delaying Register Allocation Through Virtual-Physical Registers," Proc. MICRO-32, 1999.
[10] O. Ergin, et al., "Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure," Proc. MICRO, 2004.
[11] P. Shivakumar and N. Jouppi, "CACTI 3.0: An Integrated Cache Timing, Power, and Area Model," Technical Report, DEC Western Lab, 2002.
[12] I. Park, M. Powell, and T. Vijaykumar, "Reducing Register Ports for Higher Speed and Lower Energy," Proc. MICRO-35, 2002.
[13] J. Tseng and K. Asanovic, "Banked Multiported Register Files for High Frequency Superscalar Microprocessors," Proc. ISCA-30, 2003.
[14] D. Brooks and M. Martonosi, "Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance," Proc. HPCA, 1999.
[15] J. Zalamea, et al., "Two-Level Hierarchical Register File Organization for VLIW Processors," Proc. MICRO-33, 2000.
[16] R. Russell, "The Cray-1 Computer System," Readings in Computer Architecture, 2000.
[17] J. Swensen and Y. Patt, "Hierarchical Registers for Scientific Computers," Proc. ICS, 1988.
[18] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Computer Architecture News, June 1997.
[19] R. Yung and N. Wilhelm, "Caching Processor General Registers," Proc. ICCD, 1995.
[20] A. Capitanio, et al., "Partitioned Register Files for VLIWs: A Preliminary Analysis of Trade-offs," Proc. MICRO-25, 1992.
[21] K. Farkas, et al., "The Multicluster Architecture: Reducing Cycle Time Through Partitioning," Proc. ISCA-24, 1997.
[22] S. Rixner, et al., "Register Organization for Media Processing," Proc. HPCA, 2000.
[23] J. Janssen and H. Corporaal, "Partitioned Register File for TTAs," Proc. MICRO-28, 1995.
[24] A. Gonzalez, J. Gonzalez, and M. Valero, "Virtual-Physical Registers," Proc. HPCA-4, 1998.
[25] S. Wallace and N. Bagherzadeh, "A Scalable Register File Architecture for Dynamically Scheduled Processors," Proc. PACT-5, 1996.
[26] S. Srinivasan, et al., "Continual Flow Pipelines," Proc. ASPLOS, 2004.
[27] G. Loh, "Exploiting Data-Width Locality to Increase Superscalar Execution Bandwidth," Proc. MICRO-35, 2002.
[28] S. Palacharla, et al., "Complexity-Effective Superscalar Processors," Proc. ISCA, 1997.
