
Streamlining the Continual Flow Processor Architecture with Fast Replay Loop

Komal Jothi and Haitham Akkary

Department of Electrical and Computer Engineering, American University of Beirut
Beirut, Lebanon
{kmj04, ha95}@aub.edu.lb

Abstract—We present a streamlined continual flow processor architecture for scheduling instructions behind loads that miss the data cache. Instructions that do not encounter cache misses execute quickly, releasing their allocated hardware resources for other instructions. Instructions that depend on data cache misses wait in their reservation stations for the data, as long as the reservation station resources are not full. If the reservation stations become full and block the pipeline, instructions dependent on cache misses give back their reservation stations and move directly into a large single-ported SRAM waiting buffer, without having to go through pseudo execution and commit in the reorder buffer as required by previous continual flow architectures. Afterwards, when the missing data cache block is fetched, these instructions are replayed from the waiting buffer, i.e., re-inserted into the reservation stations to be scheduled for execution. Shortening the replay loop by removing the reorder buffer and the pseudo execute and commit stages from the replay path improves performance on benchmarks with a large number of loads that miss the L1 but hit the on-chip L2 data cache. Performance measurements using the SimpleScalar microarchitecture simulator and SPEC 2006 benchmarks show that our streamlined continual flow pipeline architecture outperforms the conventional continual flow pipeline architecture by 16% on average.

Keywords: superscalar processors, instruction level parallelism, continual flow pipelines, latency tolerant processors, virtual register renaming

I. INTRODUCTION

Microprocessor architects face a daunting power-constraint challenge as they try to balance the need to integrate multiple cores on future processors for high-throughput performance against the need to provide energy-efficient single-thread performance for applications that have little thread-level parallelism.

To improve superscalar processor core performance on difficult-to-parallelize applications, architects have been increasing the capacity of reorder buffers, reservation stations (RS), physical register files, and load and store queues [1] with every new out-of-order processor core. Larger instruction buffers provide two performance benefits: 1) they allow the hardware to dynamically identify and schedule independent instructions concurrently, thus taking advantage of the wide pipeline and parallel functional units, and 2) they increase the instruction- and memory-level parallelism available to the scheduling hardware, thus minimizing the performance impact of stalls from multi-cycle instructions, such as loads that miss the data cache.

Increasing buffer sizes to achieve single-thread performance comes at a significant cost in power, area and circuit complexity. In fact, this approach has reached a point where it hardly provides any benefit. Current buffer sizes are more than sufficient for code that hits the L1 data cache, and far too small for code that misses it. Load latency to the last level cache on current processors is more than 20 clock cycles, and the latency of a load miss to DRAM, even with an on-chip DRAM controller, exceeds a hundred cycles. It is not practical to increase buffer sizes to the capacity necessary to handle long load latencies to the last level cache or DRAM.

A different design strategy is to size the instruction buffers to the minimum capacity necessary to handle the common case of an L1 data cache hit, and to use new scalable out-of-order execution algorithms to handle code that misses the L1 data cache.

Over the past decade, there have been various studies of scalable large instruction window architectures that aim to reduce the impact of data cache misses on performance without increasing instruction buffer and physical register file sizes [2]-[7]. These architectures share common characteristics but vary in implementation details. They all break away from conventional reorder buffer mechanisms for managing speculative out-of-order execution and handling exceptions and branch mispredictions. Instead of using a reorder buffer for tracking, buffering, and sequentially committing, one by one, the current set of in-flight instructions, also called the instruction window, these architectures use bulk commit of execution results and register state checkpoints [8]-[12] to recover from branch mispredictions and exceptions. Instead of residing completely in the hardware instruction buffers, the instruction window becomes virtual and is composed physically of a partial, discontiguous subset of the conventional, sequential instruction window. As such, instructions that do not encounter cache misses, called miss-independent instructions, enter the instruction window, execute and leave quickly, freeing all their pipeline hardware resources. On the other hand, instructions that depend on data cache misses, called miss-dependent instructions, enter a waiting buffer outside the execution pipeline where they can wait as long as necessary for their input operands without blocking the execution pipeline. Afterwards, when the missing data cache block is fetched, the miss-dependent instructions are replayed from the waiting buffer, i.e., renamed and re-inserted into the reservation stations to be scheduled for execution. Mechanisms used to implement such large virtual instruction windows with less hardware include, for example, virtual reorder buffers [13] and miss-dependent slice data buffers [7].

A. Paper Contribution

Previous scalable large instruction window superscalar architectures either target L2 data cache misses that go to DRAM [2], [3], [7] or target, for better performance, the more frequent L1 data cache misses [5], [14], [15], but the latter require the reorder buffer in the replay loop to reorder miss-dependent instructions before they are renamed again and re-inserted into the execution pipeline.

This work targets L1 data cache misses. Its key contribution over prior work is to evaluate the impact of replay loop optimizations on performance when processing instructions that depend on loads that miss the L1 data cache, including those that miss all the way to DRAM.

Our simulated architecture is based on a simultaneous continual flow pipeline (simultaneous CFP) [5]. Like simultaneous CFP, it uses a set of reservation stations (RS) sized appropriately to handle the common case of instructions that hit the L1 data cache, augmented with a single-ported SRAM waiting buffer for instructions that miss the L1 data cache. However, unlike simultaneous CFP, it uses an order list of instructions in the reservation stations to order miss-dependent instructions when they have to move into the waiting buffer. The architecture improves performance over previously studied simultaneous CFP superscalars due to five advantages:

• It keeps miss-dependent instructions in the reservation stations as long as possible after the miss before they are evicted to the waiting buffer. This reduces the number of miss-dependent instructions that are replayed in case of medium latency load misses, which are those loads that miss the L1 data cache but hit the L2 data cache.

• Removing the reorder buffer from the replay loop reduces the replay latency of miss-dependent instructions that are evicted to the waiting buffer and thus reduces the total execution time.

• On CFP architectures, mispredicted branches that depend on load misses have a very high misprediction penalty. The reason is that CFP recovers from these mispredictions by rolling back execution to a checkpoint taken at the load miss. Reducing the replay loop delay minimizes the execution look-ahead window after load misses, and thus reduces the chance of encountering miss-dependent branch mispredictions and their high recovery penalty. This improves performance significantly on benchmarks that have many branches that depend on load misses.

• On CFP architectures, once the load miss is moved into the waiting buffer, all miss-dependent instructions have to be moved into the waiting buffer and then replayed. This is necessary because when the miss load pseudo commits and moves into the waiting buffer, it releases its renamed destination register ID (also called tag). This breaks the dependence links between the miss load and its dependents, requiring the full dependent thread to be replayed and renamed again to re-establish the dependence links. The architecture this paper evaluates uses virtual register renaming [16], which allows partial replay of the miss load and its dependents, thus significantly reducing the number of replayed instructions and the total execution time.

• Branch instructions that move into the waiting buffer and are later found to have been mispredicted lead to severe performance degradation, because the large number of runahead instructions following the branch causes excessive replay and checkpoint rollback. This paper presents a hardware predictor, used as a branch confidence mechanism, that identifies miss-dependent branches that are likely to mispredict and stalls the pipeline when such low-confidence branches are about to be moved into the waiting buffer. This prediction mechanism reduces the checkpoint rollback risk and improves performance on benchmarks that have many mispredicted branches that depend on load misses.

The rest of this paper is organized as follows. Section II presents as background to this work a brief overview of the simultaneous CFP architecture. Section III follows with a description of the streamlined simultaneous CFP architecture (SS-CFP) we evaluate in this study, and the changes and mechanisms used to shorten the CFP replay loop. We outline our simulation methodology in Section IV. Section V evaluates SS-CFP performance. Section VI discusses related work and we conclude in Section VII.

II. BACKGROUND: SIMULTANEOUS CONTINUAL FLOW PIPELINE ARCHITECTURE

The simultaneous CFP core microarchitecture [5] was based on the Intel P6 architecture [17]. Unlike previous latency-tolerant out-of-order architectures, the simultaneous CFP core executes cache-miss-dependent and miss-independent instructions concurrently using two different hardware thread contexts. The simultaneous CFP hardware is similar to simultaneous multithreading (SMT) architectures [18], except that in simultaneous CFP, the two simultaneous threads are formed of miss-dependent and miss-independent instructions constructed dynamically from the same program, instead of being two different programs that run simultaneously in the same core. In order to support two hardware threads, simultaneous CFP has two register alias tables (RAT) for renaming the independent and the dependent thread instructions. Simultaneous CFP also has two retirement register file (RRF) contexts, one for retiring miss-independent instruction results and the other for retiring miss-dependent instruction results. The two threads share the reorder buffer, load queue, store queue, reservation stations and data cache.

The independent hardware thread is responsible for instruction fetch and decode of all instructions, branch prediction, memory dependence prediction, identifying miss-dependent instructions and moving them into the waiting buffer (WB). The dependent thread execution starts when the load miss data is brought into the cache, waking up the load instruction in the WB, and continues until the WB empties. At the end of dependent execution, when all the instructions from the WB have committed without mispredictions or exceptions, the independent and dependent execution results are merged together with a flash copy of the dependent and independent register contexts within the retirement register file [5]. To maintain proper memory ordering of loads and stores from the independent and dependent threads execution, simultaneous CFP uses load and store queues (LSQ) [1], a Store Redo Log (SRL) [19] and a store-set memory dependence predictor [20].

Simultaneous CFP renames instructions using the reorder buffer (ROB) [1]. When a load miss reaches the head of the ROB, it is pseudo-retired and moved into the WB immediately. The load miss instruction releases all its pipeline resources, including its ROB ID, before entering the waiting buffer, breaking its links with its readers that are still in the pipeline. For this reason, the entire dependence chain of the load that has missed and its dependents must be pseudo-retired in the ROB, moved into the WB, and then replayed from there. This chain can extend a long distance forward in the program, causing excessive replays and rollbacks. The architecture evaluated in this paper aims to overcome this excessive, wasteful replay of simultaneous CFP.

III. A STREAMLINED SIMULTANEOUS CFP ARCHITECTURE

The streamlined simultaneous CFP architecture (SS-CFP), evaluated in this paper, uses virtual register renaming and a waiting buffer directly coupled to the reservation stations to scale up the instruction window. We next describe virtual register renaming and the support provided in the reservation stations and waiting buffer arrays for scaling up the SS-CFP instruction window. We also describe the mechanism for moving instructions from the RS to the waiting buffer, and the mechanism for replaying miss-dependent instructions and integrating their results with those of the independent instructions after replay.

A. Virtual Register Renaming

Fig. 1 shows a block diagram of the SS-CFP core architecture. Like other superscalar architectures, SS-CFP uses a reorder buffer to commit instructions and update architectural register and memory state in program order. However, SS-CFP does not use the reorder buffer for register renaming. Instead, it performs register renaming using virtual register IDs generated by a special counter. These virtual register IDs are not mapped to any fixed storage locations in the core, and can therefore be large in number and remain allocated to instructions throughout their lifetime, including miss-dependent instructions evicted to the waiting buffer. Since the virtual register IDs are plentiful and are not associated with actual physical storage, SS-CFP does not run the risk of execution pipeline buffer-full stalls resulting from miss-dependent instructions holding on to their renamed registers for a long time while waiting for the long latency load miss.

Virtual register renaming gives SS-CFP an advantage over previous CFP architectures. Past CFP architectures require all miss-dependent instructions to be replayed and renamed again to re-establish dependence links, which is necessary for the reservation stations to re-dispatch the miss-dependent instructions in correct data flow order. In contrast, since the virtual register renaming IDs are permanent from the time the miss-dependent instructions are renamed until they execute and commit, SS-CFP can perform partial replay of dependent instructions. This means that if the load miss data is fetched from memory after the load is moved to the waiting buffer but before its dependents have been moved, SS-CFP simply replays the load, saving the execution time that would be incurred if all the miss-dependent instructions still in the reservation stations had to be replayed through the waiting buffer and renamed again.
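
For illustration, the following C sketch shows one way a rename stage could allocate virtual register IDs from a counter and reset the counter opportunistically on a pipeline flush. The structure and names (vid_counter, rat, rename_inst) are assumptions made for this example, not details taken from the SS-CFP implementation.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LOGICAL_REGS 32
#define VID_BITS 20                          /* counter width is a design choice */
#define VID_MAX ((1u << VID_BITS) - 1)

static uint32_t vid_counter = 1;             /* 0 is reserved to mean "unmapped" */
static uint32_t rat[NUM_LOGICAL_REGS];       /* logical register -> virtual ID   */

/* Rename one instruction. Returns false if the counter would overflow, in
   which case the front end must stall until a flush resets the counter. */
static bool rename_inst(int src1, int src2, int dst,
                        uint32_t *vsrc1, uint32_t *vsrc2, uint32_t *vdst)
{
    if (vid_counter == VID_MAX)
        return false;
    *vsrc1 = rat[src1];                      /* current mappings of the sources  */
    *vsrc2 = rat[src2];
    *vdst  = vid_counter++;                  /* fresh virtual ID; no physical
                                                storage is reserved for it       */
    rat[dst] = *vdst;
    return true;
}

/* Opportunistic reset, e.g. when the pipeline is flushed after a branch
   misprediction. A real design would restore the RAT to the architectural
   mappings; it is simply cleared here for brevity. */
static void reset_virtual_ids(void)
{
    vid_counter = 1;
    for (int r = 0; r < NUM_LOGICAL_REGS; r++)
        rat[r] = 0;
}

When the counter nears overflow, the rename stage in this sketch simply stalls until a flush allows a reset, which is consistent with the correctness requirement discussed in Section III.F.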

B. SS-CFP Replay Loop

Fig. 1 shows the reduced replay loop in SS-CFP consisting of two stages: the reservation stations (RS) and the waiting buffer. The waiting buffer basically acts as a second level storage for the reservation stations. With virtual register renaming, entries can be freely evicted from the RS to the WB and then loaded back again to the RS to be scheduled for execution at a later time.

In contrast to this short replay loop, the previous simultaneous CFP architecture has a significantly longer replay loop, consisting of the RS, pseudo execute (EX), writeback to the ROB, pseudo-commit (ROB), waiting buffer (WB), and rename (DEP RAT) stages [5]. As we will show later in the results section, the short replay loop is a key advantage of SS-CFP, providing it a 16% performance improvement over simultaneous CFP.

Fig. 1. Streamlined Simultaneous CFP microarchitecture block diagram (ICache, Decode, Rename, RS, WB, EX, LSQ/SRL, DCache, ROB, RRF, and VID counter).

C. SS-CFP Reservation Stations

SS-CFP uses a centralized array of conventional data-capture reservation stations [17]. Each reservation station entry is extended with a poison bit per source operand and an L1-DCache-miss bit. The L1-DCache-miss bit is set to 1 if the entry contains a load instruction that has missed the L1 data cache. We say an instruction is poisoned if one of its source poison bits or its L1-DCache-miss bit is set to 1. A source operand of an instruction is poisoned if and only if it is the destination of another poisoned instruction. In other words, the poison bits propagate dependences on L1 data cache misses to later instructions in the program, identifying instructions that may encounter long data cache miss delays. These instructions are candidates to move to the waiting buffer, to avoid the pipeline stalls that could occur if the reservation stations array becomes full.
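
As a rough illustration of how the poison bits could propagate through the RS, the following C sketch models a simplified data-capture RS entry; the field and function names are assumptions made for this example, not taken from the paper.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SRC 2

typedef struct {
    bool     valid;
    uint32_t src_vid[NUM_SRC];      /* virtual register IDs of the sources     */
    bool     src_ready[NUM_SRC];    /* operand value already captured          */
    bool     src_poison[NUM_SRC];   /* operand depends on an L1 D-cache miss   */
    bool     l1_miss;               /* entry is a load that missed the L1      */
    uint32_t dst_vid;               /* virtual register ID of the result       */
} rs_entry_t;

/* An instruction is poisoned if any source poison bit or its own
   L1-DCache-miss bit is set. */
static bool is_poisoned(const rs_entry_t *e)
{
    if (e->l1_miss)
        return true;
    for (int i = 0; i < NUM_SRC; i++)
        if (e->src_poison[i])
            return true;
    return false;
}

/* When a poisoned producer is deferred instead of producing a value, its
   destination ID is broadcast so that consumers mark the matching source
   as poisoned rather than ready, propagating the miss dependence. */
static void broadcast_poison(rs_entry_t rs[], int rs_size, uint32_t dst_vid)
{
    for (int i = 0; i < rs_size; i++) {
        if (!rs[i].valid)
            continue;
        for (int s = 0; s < NUM_SRC; s++)
            if (!rs[i].src_ready[s] && rs[i].src_vid[s] == dst_vid)
                rs[i].src_poison[s] = true;
    }
}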

The reservation stations array is augmented with a free list and an order list. SS-CFP uses the free list to track the occupancy of the reservation stations. The order list tracks the allocation order of the reservation stations, which matches the program order of the instructions, since reservation stations are allocated in the rename pipeline stage. The RS order list could be implemented as part of the reorder buffer, by adding the reservation station ID of each instruction to its allocated reorder buffer entry, or as a special array separate from the reorder buffer.

Four conditions are checked to determine if an instruction should be moved to the waiting buffer: 1) the instruction is at the head of the RS order list, 2) the instruction is poisoned, 3) the RS array is full, and 4) every source operand of the instruction is either poisoned or ready. The last condition ensures that the miss-dependent instructions carry their non-poisoned input values with them when they are replayed, since there is no guarantee that these values would not be overwritten in the register file by replay time.
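
The four conditions above could be checked as in the following sketch, which reuses the rs_entry_t and is_poisoned() definitions from the previous example; at_head_of_order_list and rs_full are assumed to be supplied by the order list and free list logic.

/* Returns true if this RS entry should be evicted to the waiting buffer. */
static bool should_move_to_wb(const rs_entry_t *e,
                              bool at_head_of_order_list,
                              bool rs_full)
{
    /* 1) oldest entry in the RS order list, 2) poisoned, 3) RS array full */
    if (!at_head_of_order_list || !is_poisoned(e) || !rs_full)
        return false;

    /* 4) every source is either poisoned or ready, so the entry carries its
       non-poisoned input values with it; they may be overwritten in the
       register file before replay time. */
    for (int s = 0; s < NUM_SRC; s++)
        if (!e->src_poison[s] && !e->src_ready[s])
            return false;

    return true;
}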

In addition, each reservation station has a state bit that indicates whether the instruction in the entry has been replayed before, i.e., it was moved earlier to the waiting buffer and then back to the RS array. Each reservation station also contains a load miss identifier, in case it holds a load instruction that misses the data cache. An implementation could use the ID of the L1 data cache fill buffer handling the load miss for this purpose.

D. Waiting Buffer

The waiting buffer is a wide, single-ported SRAM array managed as a circular buffer using head and tail pointers. Miss-dependent RS entries at the head of the RS array move to the tail of the waiting buffer when the RS fills up due to data cache misses. When a data cache miss completes, SS-CFP replays the miss-dependent entries by loading them back from the head of the waiting buffer to the tail of the RS. Ideally, the width of the two buses connecting the RS and the waiting buffer would match the pipeline width. A narrower interconnect can also be used, trading some performance for reduced hardware complexity.
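
A minimal sketch of the waiting buffer as a circular array follows, reusing the rs_entry_t type from the earlier example; the wb_push/wb_pop names are illustrative, and the 256-entry size is taken from Table I.

#define WB_SIZE 256                     /* matches the 256-entry WB in Table I */

typedef struct {
    rs_entry_t entries[WB_SIZE];
    int head;                           /* oldest entry, next to be replayed   */
    int tail;                           /* next free slot                      */
    int count;
} waiting_buffer_t;

/* Eviction: a miss-dependent entry leaves the RS and is appended at the tail. */
static bool wb_push(waiting_buffer_t *wb, const rs_entry_t *e)
{
    if (wb->count == WB_SIZE)
        return false;                   /* WB full: the front end must stall   */
    wb->entries[wb->tail] = *e;         /* the entry carries captured operands */
    wb->tail = (wb->tail + 1) % WB_SIZE;
    wb->count++;
    return true;
}

/* Replay: entries are drained from the head, i.e. in program order. */
static bool wb_pop(waiting_buffer_t *wb, rs_entry_t *out)
{
    if (wb->count == 0)
        return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_SIZE;
    wb->count--;
    return true;
}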

A key to the efficiency of the SS-CFP large window implementation is the fact that the waiting buffer has no CAM ports, no connections to the writeback buses for capturing data operands, and no conventional ready/schedule logic. All these functions are handled in the RS array after the data cache miss completes and the miss-dependent instructions are replayed. Therefore, the waiting buffer array can be designed using non-tagged SRAM and made significantly larger than the RS array at much lower area and power cost than enlarging the RS array to hold the entire instruction window.

In order to wake up miss dependents from the waiting buffer and replay them, the L1 data cache fill buffer handling a load miss has to receive and save the waiting buffer ID of its load miss. When the miss completes, SS-CFP replays the load miss and its dependents in program order, as described earlier, from the head of the waiting buffer back into the RS allocate/write stage of the execution pipeline. These replayed instructions do not need to be renamed again. Their virtual register renames are still valid and can therefore be used by the RS to schedule these instructions and to grab their results from the writeback bus into the reservation stations of any dependent instructions, including instructions that have not been replayed but are still waiting in the RS.
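
The replay path could then be sketched as below. This is a simplification that assumes the drained entries depend only on the miss that just completed, and it relies on a hypothetical rs_alloc() helper that returns a free reservation station slot; both are assumptions for illustration, not details from the paper.

int rs_alloc(rs_entry_t rs[], int rs_size);  /* hypothetical: returns a free RS slot or -1 */

/* Drain the waiting buffer back into the RS after a cache-miss fill completes.
   Simplified: assumes the drained entries depend only on this completed miss. */
static void replay_on_fill(waiting_buffer_t *wb, rs_entry_t rs[], int rs_size)
{
    rs_entry_t e;
    while (wb->count > 0) {
        int slot = rs_alloc(rs, rs_size);
        if (slot < 0)
            break;                       /* RS full; resume on a later cycle    */
        if (!wb_pop(wb, &e))
            break;
        e.l1_miss = false;               /* the miss has now completed          */
        for (int s = 0; s < NUM_SRC; s++)
            e.src_poison[s] = false;     /* operands will be re-captured from   */
                                         /* the writeback bus; the virtual IDs  */
                                         /* are still valid, so no re-renaming  */
        rs[slot] = e;
    }
}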

E. Miss-Dependent Branch Predictor

A key reason for CFP performance degradation is dependent branches that move into the waiting buffer and are later found to be mispredicted. Compare the non-CFP baseline, which stalls when the instruction buffers become full, keeping the miss-dependent branch misprediction penalty low, to a CFP architecture, which continues to process a large number of runahead instructions beyond the branch. In this situation, two major factors bring down performance. One factor is the re-execution of all instructions between the load miss and the mispredicted branch after the branch is resolved. The other factor is the flood of wrong-path instructions, which increases pressure on the instruction buffers, resulting in more instructions being moved to the waiting buffer. This not only increases the amount of replay and the probability of rollback, but also potentially delays the eventual resolution of miss-dependent branches, depending on when they get a chance to replay from the waiting buffer.

This paper addresses this problem by identifying branches that are likely to mispredict and taking action when they are waiting to move into the WB. We observed that in multiple benchmarks there is a strong correlation between a dependent mispredicted branch and its PC address. A small, 64-entry hardware predictor that contains the PC addresses of previously mispredicted dependent branches is used to estimate the confidence of each branch. The front end (rename unit) is stalled if a low-confidence branch is next in line to be moved into the WB. This not only prevents excessive instructions from being moved into the WB, because the instruction buffers do not clog, but also reduces rollbacks from the checkpoint when the predictor is correct. The front end of the processor is unblocked only after the load miss data is delivered to the cache and the branch is resolved.
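
A possible organization of the 64-entry predictor is sketched below as a direct-mapped table of PC tags; the indexing scheme and names are assumptions, since the paper specifies only the entry count and the PC-based correlation.

#include <stdint.h>
#include <stdbool.h>

#define DEP_BR_PRED_ENTRIES 64

static uint32_t dep_mispred_pc[DEP_BR_PRED_ENTRIES];    /* PCs of past dependent mispredictions */
static bool     dep_mispred_valid[DEP_BR_PRED_ENTRIES];

static int dep_br_index(uint32_t pc)
{
    return (pc >> 2) % DEP_BR_PRED_ENTRIES;              /* word-aligned PCs assumed */
}

/* Train: record a branch that depended on a load miss and was mispredicted. */
static void train_dep_branch(uint32_t pc)
{
    int i = dep_br_index(pc);
    dep_mispred_pc[i] = pc;
    dep_mispred_valid[i] = true;
}

/* Query: a hit means low confidence. The front end (rename) stalls instead of
   letting this branch move into the WB, and is unblocked once the miss data
   returns and the branch resolves. */
static bool is_low_confidence(uint32_t pc)
{
    int i = dep_br_index(pc);
    return dep_mispred_valid[i] && dep_mispred_pc[i] == pc;
}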

F. Tying Some Architectural Loose Ends Together

Finally, a few important SS-CFP architecture details:

• The virtual register ID counter is finite in size and cannot be allowed to overflow, for correctness reasons. The SS-CFP architecture opportunistically resets the virtual register ID counter whenever it can, e.g., when the pipeline is flushed to recover from a mispredicted branch.

• The replayed miss-dependent instructions do not need to be renamed again, but they sometimes need to read source operands that have already been computed and retired from the ROB to the register file (RRF). State bits that track whether the last instructions to write logical registers have been retired are stored in a special storage structure. These state bits are checked during replay to determine if the operands are ready in the RRF when instructions are moved from the WB to the RS (a sketch of these state bits follows this list).

• The SS-CFP register file is similar to the simultaneous CFP register file described in [5]. SS-CFP handles integration of miss-dependent and miss-independent execution results, load and store execution and miss-dependent misses in the same way as described in [5].
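
A minimal sketch of the state bits mentioned in the second bullet follows, assuming an illustrative per-logical-register structure updated at rename and at ROB retirement; the names are not from the paper.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LOGICAL_REGS 32

typedef struct {
    bool     last_writer_retired[NUM_LOGICAL_REGS];  /* value is in the RRF        */
    uint32_t last_writer_vid[NUM_LOGICAL_REGS];      /* virtual ID of last writer  */
} rrf_status_t;

/* At rename: a new in-flight writer means the RRF copy will become stale. */
static void note_new_writer(rrf_status_t *st, int lreg, uint32_t vid)
{
    st->last_writer_retired[lreg] = false;
    st->last_writer_vid[lreg] = vid;
}

/* At ROB retirement of the instruction writing lreg with this virtual ID. */
static void note_retirement(rrf_status_t *st, int lreg, uint32_t vid)
{
    if (st->last_writer_vid[lreg] == vid)
        st->last_writer_retired[lreg] = true;
}

/* Checked when an instruction moves from the WB back to the RS: a source can
   be read directly from the RRF only if its last writer has already retired. */
static bool operand_in_rrf(const rrf_status_t *st, int lreg)
{
    return st->last_writer_retired[lreg];
}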

IV. EVALUATION METHODOLOGY

We built our SS-CFP architecture model on the SimpleScalar ARM ISA simulation infrastructure (www.simplescalar.com) and used all 14 C benchmarks from SPEC 2000 and SPEC 2006 that we succeeded in compiling with the SimpleScalar cross-compiler tool. After skipping the initialization code, we simulated 200 million instructions from each benchmark, consisting of four different samples carefully selected from representative execution phases. The selected execution phases display wide variation in cache miss rate as well as branch misprediction rate, both of which significantly impact CFP execution behavior. The results section below reports the average performance of these selected samples for each benchmark.

Table I shows the simulated machine configurations. We used a two-level cache hierarchy with sizes that are representative of current multicore processors, scaled down proportionally to the number of cores, since we simulate a single SS-CFP core.

V. SIMULATION RESULTS

We present in this section three key simulation results to highlight how our SS-CFP optimizations perform on our simulated benchmarks. We first compare SS-CFP performance to simultaneous CFP when used to handle 1) all L1 data cache misses, or 2) only L1 data cache misses that go to DRAM. Second, we compare SS-CFP performance to that of a non-CFP baseline ROB machine, also establishing the SS-CFP performance upper bound with an oracle dependent branch predictor. Finally, we isolate the contribution of the different optimization schemes discussed in this paper to overall SS-CFP performance.

TABLE I
SIMULATED MACHINE CONFIGURATIONS

Pipeline and buffers:
  Baseline core: 16-stage, 4-wide, 128-entry ROB, 64-entry RS, 64-entry LQ, 48-entry SQ
  SS-CFP and simultaneous CFP core: 16-stage, 4-wide, 128-entry ROB, 64-entry RS, 64-entry LQ, 48-entry SQ, 256-entry WB, 256-entry SRL
L1 DCache: 8 KB, 8-way, 3 cycles, 64-byte line
L1 ICache: 8 KB, 8-way, 3 cycles, 64-byte line
L2 cache: Unified, 512 KB, 8-way, 16 cycles, 64-byte line
DRAM latency: 150 cycles from L2 miss to data return, with on-chip DRAM controller
Branch predictor: Combined bimodal and gshare; 4K meta, 4K bimodal, 4K gshare, 4K BTB, 16-entry return address stack

A. SS-CFP speedup over simultaneous CFP

Fig. 2 shows the speedup of SS-CFP over simultaneous CFP when targeting only last level data cache misses that go to DRAM (L2-CFP) or data cache misses at all levels (L1-CFP).

SS-CFP, with its virtual register renaming, short replay loop and reduced replay/rollback, outperforms simultaneous CFP by an average of 16% when the CFP algorithm is applied to L1 data cache misses. The maximum improvement occurs on gcc (58% speedup), which displays a high rate of L1 cache misses as well as a high branch misprediction rate. The reduced replay loop of SS-CFP is very favorable to benchmarks like gcc and perl, since reducing the amount of replay also significantly reduces the number of costly miss-dependent branch mispredictions. The variation in improvement between benchmarks is mainly due to the variation in the cache miss rates and branch misprediction rates of our simulation samples.

Since L2 cache misses are less frequent than L1 cache misses, CFP encounters significantly less replay when applied to L2 cache misses only. Nevertheless, our simulations show that the reduced replay loop of SS-CFP is still beneficial, albeit to a lesser degree, with an average 2.75% speedup when applied to L2 cache misses. Most importantly, gcc again benefits from SS-CFP (16% speedup). Even though the average benefit of SS-CFP for L2 cache misses is small, the long replay loop of simultaneous CFP is a glass jaw that is exposed on benchmarks like gcc that display high cache miss and branch misprediction rates. A carefully designed replay loop is necessary when designing CFP for general-purpose processors that target many applications with widely different execution characteristics.

To make a fair comparison, both the SS-CFP and simultaneous CFP models are simulated with the dependent branch predictor that stalls the pipeline when a branch likely to mispredict is about to be moved into the WB. Notice that simultaneous CFP does not benefit fully from this stall, because the poisoned branch is still replayed from the WB to establish its link with the load miss. In such cases, simultaneous CFP would still have to roll back to the checkpoint to recover the architectural state. On the other hand, SS-CFP always avoids a rollback, because stalling the front end at this stage avoids moving the dependent branch into the WB, allowing the branch to be resolved as soon as the miss returns without needing to be replayed. This also means that, in such cases, branch recovery can be done using the less costly ROB mechanism.

B. SS-CFP speedup over baseline ROB

Fig. 3 shows the percent speedup of three different SS-CFP configurations over a conventional superscalar with similar pipeline, reorder buffer and cache configurations. SS-CFP, when applied only to L2 cache misses that go to DRAM, gives an average speedup of 2.25% (shown as L2-CFP).

When SS-CFP targets the more frequent L1 data cache misses to avoid pipeline stalls, it performs better, with an average speedup of 3.7% (shown as L1-CFP). This SS-CFP configuration uses the simple dependent branch predictor described in Section III.E. The same figure also shows the performance of an oracle predictor that stalls the pipeline perfectly when a mispredicted branch is about to be moved into the WB, thus establishing the performance upper bound that can be achieved with SS-CFP (L1-CFP-Orc). Notice that the hardware predictor used in this paper comes within 0.8% of the oracle predictor, on average.

Also notice that we do not get as much speedup over the baseline as reported in [5], because the benchmarks and simulation traces used in this paper are different from those in [5].

C. SS-CFP optimizations

This paper mainly discusses three optimizations over prior CFP work: 1) partial replay of the load miss dependence chain (PR), 2) moving miss-dependent instructions into the WB only when a resource or buffer becomes full (BF), and 3) a dependent branch confidence predictor (DP). To keep the design simple, the ROB doubles as the RS order list; therefore the impact of the short replay loop is not quantified separately. In order to see the impact of the other optimization schemes, we choose as our base the SS-CFP configuration that performs partial replay of miss-dependent instructions (PR) and normalize the other incremental contributions to this configuration. As shown in Fig. 2, the combined optimizations give an average speedup of 16% over the simultaneous CFP architecture. The intention of this experiment is not to compare SS-CFP with simultaneous CFP, but to isolate the contribution of each optimization to overall SS-CFP performance. Fig. 4 shows the contribution of incremental optimizations to SS-CFP performance, as well as an upper bound with perfect dependent branch prediction (DP_Orc).

We see from the first column (PR + BF) that benchmarks like equake, gcc and libquantum benefit significantly from deferring the movement of dependent instructions into the WB until they become blocking, mainly because of reduced replay and rollback risk and faster resolution of dependent branches.

The second (PR + BF + DP) and third (PR + BF + DP_Orc) columns highlight the significance of the dependent branch predictor. Some benchmarks, like mcf and twolf, have dependent branches that mispredict often but show predictable patterns that can be detected by the hardware predictor. However, there are other benchmarks, like gobmk and gzip, where the dependent branch predictor is seemingly ineffective (column 2 shows little or no improvement over column 1), resulting in the performance difference between the oracle and practical models. It is clear from column 3 that a more accurate dependent branch predictor would further improve SS-CFP performance over the non-CFP baseline. Striking the right balance is crucial when dealing with dependent branches flagged as low confidence by the hardware predictor. On one hand, it is not desirable to be conservative and stall on low-confidence branches frequently, because this would offset the benefit of CFP speculative execution. On the other hand, it is also not desirable to be over-aggressive and allow a large number of runahead instructions into the window, since this leads to excessive replay and rollback. As future work, we plan to investigate methods to better predict the behavior of dependent branches by including the history of other dependent branches in the prediction mechanism.

On a related note, simultaneous CFP also benefits from the hardware predictor, with an average 4% improvement in performance. Even though simultaneous CFP replays the poisoned dependent branch from the WB to resolve it, the predictor reduces the likelihood of miss-dependent mispredicted branches, because there is relatively less replay activity compared to having an uncontrolled runahead window of instructions in the pipeline.

VI. RELATED WORK

Latency tolerant microarchitectures include the Out-of-Order Commit Processor [3], the Waiting Instruction Buffer [4], Continual Flow Pipelines [7], Checkpoint Processing and Recovery [8], Cherry [12] and the Virtual ROB [13]. None of these, however, deal with L1 data cache misses or execute miss-dependent and miss-independent instructions concurrently using multiple register file contexts.

Fig. 2. Streamlined Simultaneous CFP percent speedup over Simultaneous CFP when CFP is applied to L1 and L2 data cache misses


Fig. 3. Streamlined Simultaneous CFP percent speedup over a conventional superscalar processor of equal configuration when CFP execution is applied only to L2 data cache misses (L2-CFP), to all data cache misses (L1-CFP), and with an oracle predictor that perfectly predicts miss-dependent branches (L1-CFP-Orc)

Fig. 4. Percent speedup contributed by buffer full optimization (PR + BF), buffer full and dependent branch confidence prediction (PR + BF + DP) and buffer full optimization with oracle dependent branch prediction (PR + BF + DP-Orc) over a baseline SS-CFP that only performs virtual register renaming optimization

Runahead execution increases memory-level parallelism on in-order cores [21] and on out-of-order cores [22] without having to build large reorder buffers. In runahead execution, the processor state is checkpointed at a load miss to DRAM. Execution continues speculatively past the miss for its data prefetch benefits. When the miss data returns, runahead execution terminates, the execution pipeline is flushed, and execution rolls back to the checkpoint. Except for the prefetch benefit, all work performed during runahead is discarded. SS-CFP executes ahead of L1 data cache misses and does not waste energy by discarding a large number of instructions.

Flea-Flicker [23], [24] executes a program on two in-order back-end pipelines coupled by a queue. An advance pipeline executes independent instructions without stalling on long-latency cache misses while deferring dependent instructions. A backup pipeline executes the instructions deferred by the advance pipeline and merges them with the advance pipeline results stored in the queue. Flea-Flicker and SS-CFP differ in their execution methods, result integration methods, and instruction deferral queues. Flea-Flicker executes instructions in an in-order pipeline, saving all advanced instructions and results in its queue and merging results sequentially during backup pipeline execution.

iCFP [15] tolerates cache misses at all levels in the cache hierarchy, but uses an in-order pipeline, which is less suitable for the performance needs of conventional single-thread applications.

BOLT [25] utilizes additional map tables in a simultaneous multithreading architecture to re-rename the L2 miss-dependent slice, combined with a program order slice and a unified physical register file that supports aggressive register reclamation. BOLT's reuse of SMT hardware is aimed at improving energy efficiency, but it does not extend the use of SMT to simultaneous execution of the dependent and independent slices to improve performance. Neither does BOLT use virtual register renaming or a streamlined replay loop.

Sun Microsystems' Rock is a single-die multicore processor for high-throughput computing. Rock uses Simultaneous Speculative Threading [14] to defer dependent instructions into a buffer and executes the deferred instructions from the checkpoint after the miss data returns. The deferred instruction execution uses a simultaneous hardware thread and merges its results into the scout thread's future file. Rock uses an in-order pipeline, while the SS-CFP core is out-of-order and thus provides better performance than Rock on single-thread applications.

Gonzalez et al. [26], [27] proposed using virtual registers to shorten the lifetime of physical registers. The idea was to use virtual-physical registers to delay the allocation of physical registers from the time instructions are renamed until they execute and produce results that need the physical registers. Until the physical destination registers are allocated at execution time, virtual registers are used for register renaming. Kilo-instruction processors [2] also used virtual renaming and ephemeral registers to perform late allocation of physical registers. In contrast to virtual-physical registers and ephemeral registers, Virtual Register Renaming (VRR) [16] does not allocate physical registers for execution results at all. SS-CFP uses a renaming method similar to VRR.

VII. CONCLUSION

This paper evaluates a streamlined simultaneous continual flow pipeline architecture that improves performance by an average of 16% over previous CFP designs. It achieves this by reducing the replay latency associated with processing miss-dependent instructions after the miss data arrives in the L1 data cache and wakes up the miss load. SS-CFP achieves this reduction in miss-dependent instruction processing latency by 1) keeping these instructions as long as possible in the reservation stations and moving them to the waiting buffer only when RS resources are needed, 2) removing pseudo commit and the reorder buffer from the replay loop, 3) applying virtual register renaming, which eliminates the need to rename dependent instructions again during replay, 4) allowing partial replay of the miss-dependent chain of instructions, requiring only the subset of these instructions that have already moved to the waiting buffer to be replayed, and 5) stalling the pipeline when a branch likely to mispredict is next in line to be moved into the waiting buffer.

ACKNOWLEDGMENT

This research has been supported by a grant from Intel Corporation.

REFERENCES

[1] J. E. Smith and G. S. Sohi, “The microarchitecture of superscalar processors,” in Proceedings of the IEEE, vol. 83, no. 12, Dec. 1995.

[2] A. Cristal, O. J. Santana, M. Valero and J. F. Martinez, “Toward kilo-instruction processors,” in ACM Transactions on Architecture and Code Optimization, vol. 1, issue 4, pp 389-417, Dec 2004.

[3] A. Cristal, D. Ortega, J. Llosa, and M. Valero, “Out-of-order commit processors,” in Proceedings of HPCA-10, Feb 2004.

[4] R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, “A large, fast instruction window for tolerating cache misses,” in Proceedings of ISCA-29, May 2002.

[5] K. Jothi, H. Akkary, and M. Sharafeddine, “Simultaneous continual flow pipeline architecture,” in Proceedings of ICCD-29, Oct 2011.

[6] S. Nekkalapu, H. Akkary, K. Jothi, R. Retnamma, and X. Song, “A simple latency tolerant processor,” in Proceedings of ICCD-26, Oct 2008.

[7] S. T. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton, “Continual flow pipelines,” in Proceedings of ASPLOS-11, Oct 2004.

[8] H. Akkary, R. Rajwar, and S. Srinivasan, “Checkpoint processing and recovery: towards scalable large instruction window processors,” in Proceedings of MICRO-36, Dec 2003.

[9] H. Akkary, R. Rajwar, and S. Srinivasan, “Checkpoint processing and recovery: an efficient, scalable alternative to reorder buffers,” in IEEE MICRO, vol. 23, issue 6, pp. 11-19, Nov/Dec 2003.

[10] H. Akkary, R. Rajwar, and S. Srinivasan, “An analysis of a resource efficient checkpoint architecture,” in ACM Transactions on Architecture and Code Optimization, vol. 1, issue 4, pp 418-444, Dec 2004.

[11] W. W. Hwu and Y. N. Patt, “Checkpoint repair for out-of-order execution machines,” in Proceedings of ISCA-14, June 1987.

[12] J. F. Martinez, J. Renau, M. C. Huang, M. Prvulovic, and J. Torrellas, “Cherry: checkpointed early resource recycling in out-of-order microprocessors,” in Proceedings of MICRO-35, Nov 2002.

[13] A. Cristal, M. Valero, J. Llosa, and A. Gonzalez, “Large virtual ROBs by processor checkpointing,” Tech. Report, UPC-DAC-2002-39, Department of Computer Science, Barcelona, Spain, July 2002.

[14] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay, “Simultaneous speculative threading: a novel pipeline architecture implemented in Sun’s Rock processor,” in Proceedings of ISCA-36, June 2009.

[15] A. Hilton, S. Nagarakatte, and A. Roth, “Tolerating all-level cache misses in in-order processors,” in Proceedings of HPCA-15, Feb 2009.

[16] M. Sharafeddine, H. Akkary and D. Carmean, “Virtual register renaming,” in Proceedings of the 26th International Conference on Computing Systems, Feb 2013.

[17] D. B. Papworth, “Tuning the Pentium Pro microarchitecture,” in IEEE MICRO, vol. 16, no. 2, April 1996.

[18] D. Tullsen, S. Eggers, and H. M. Levy, “Simultaneous multithreading: maximizing on-chip parallelism,” in Proceedings of ISCA-22, June 1995.

[19] A. Gandhi, H. Akkary, R. Rajwar, S. T. Srinivasan and K. Lai, “Scalable load and store processing in latency tolerant processors,” in Proceedings of ISCA-32, June 2005.

[20] G. Z. Chrysos and J. Emer, “Memory dependence prediction using store sets,” in Proceedings of ISCA-25, June 1998.

[21] J. Dundas and T. Mudge, “Improving data cache performance by pre-executing instructions under a cache miss,” in Proceedings of the International Conference on Supercomputing, June 1997.

[22] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: an alternative to very large instruction windows for out-of-order processors,” in Proceedings of HPCA-9, Feb. 2003.

[23] R. D. Barnes, E. M. Nystrom, J. W. Sias, S. J. Patel, N. Navarro, and W. W. Hwu, “Beating in-order stalls with flea flicker two-pass pipelining,” in Proceedings of MICRO-36, Dec 2003.

[24] R. D. Barnes, S. Ryoo, W. W. Hwu. “Flea flicker multi-pass pipelining: an alternative to the high power out-of-order offense,” in Proceedings of MICRO-38, Nov 2005.

[25] A. Hilton and A. Roth, “BOLT: energy-efficient out-of-order latency-tolerant execution,” in Proceedings of HPCA-16, Feb 2010.

[26] A. Gonzalez, J. Gonzalez and M. Valero, “Virtual-physical registers,” in Proceedings of HPCA-4, Feb 1998.

[27] A. Gonzalez, M. Valero, J. Gonzalez, and T. Monreal, “Virtual registers,” in Proceedings of HPCA-3, Feb 1997.
