Advanced Computer Architecture: Lessons 5 and 6

Lesson 5: Reduced Instruction Set Computers (RISC)


Page 1: Advanced computer architecture lesson 5 and 6

Reduced Instruction Set Computers

Lesson 5 (RISC)

Page 2: Advanced computer architecture lesson 5 and 6

The semantic gap is the difference between the operations provided in HLLs (High Level Languages) and those provided in computer architecture.

Symptoms of this gap are alleged to include execution inefficiency, excessive machine program size, and compiler complexity. Designers responded with architectures intended to close this gap. Key features include large instruction sets, dozens of addressing modes, and various HLL statements implemented in hardware. An example of the latter is the CASE machine instruction on the VAX. Such complex instruction sets are intended to

• Ease the task of the compiler writer.

• Improve execution efficiency, because complex sequences of operations can be implemented in microcode.

• Provide support for even more complex and sophisticated HLLs.

Page 3: Advanced computer architecture lesson 5 and 6

Reduced Instruction Set Computer (RISC) Architecture

The RISC architecture is a dramatic departure from the historical trend in processor architecture. An analysis of the RISC architecture brings into focus many of the important issues in computer organization and architecture.

Although RISC systems have been defined and designed in a variety of ways by different groups, the key elements shared by most designs are these:

• A large number of general-purpose registers, and/or the use of compiler technology to optimize register usage

• A limited and simple instruction set

• An emphasis on optimizing the instruction pipeline

Page 4: Advanced computer architecture lesson 5 and 6

Semantic gap

In order to improve the efficiency of software development, new and powerful programming languages have been developed (Ada, C++, Java).

They provide a high level of abstraction, conciseness, and power.

• With this evolution, the semantic gap grows.

Page 5: Advanced computer architecture lesson 5 and 6

Problem: How should new HLL programs be compiled and executed efficiently on a processor architecture?

Two possible answers:

1. The CISC approach: design very complex architectures, including a large number of instructions and addressing modes; also include instructions close to those present in HLLs.

2. The RISC approach: simplify the instruction set and adapt it to the real requirements of user programs.

Page 6: Advanced computer architecture lesson 5 and 6

Why RISC is needed

RISC architectures represent an important innovation in the area of computer organization.

• The RISC architecture is an attempt to produce more CPU power by simplifying the instruction set of the CPU.

• The opposing trend to RISC is that of complex instruction set computers (CISC).

Both RISC and CISC architectures have been developed as attempts to close the semantic gap.

Page 7: Advanced computer architecture lesson 5 and 6

INSTRUCTION EXECUTION CHARACTERISTICS IN RISC

Operations performed: These determine the functions to be performed by the processor and its interaction with memory.

Operands used: The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them.

Execution sequencing: This determines the control and pipeline organization.

Page 8: Advanced computer architecture lesson 5 and 6

Evaluation of Program Execution

Several studies have been conducted to determine the execution characteristics of machine instruction sequences generated from HLL programs.

• Aspects of interest:

1. The frequency of operations performed.
2. The types of operands and their frequency of use.
3. Execution sequencing (frequency of jumps, loops, subprogram calls).

Page 9: Advanced computer architecture lesson 5 and 6

Frequency of Instructions Executed

• Frequency distribution of executed machine instructions: moves 33%, conditional branches 20%, arithmetic/logic 16%, others between 0.1% and 10%.

• Addressing modes: the overwhelming majority of instructions use simple addressing modes, in which the address can be calculated in a single cycle (register, register indirect, displacement); complex addressing modes (memory indirect, indexed+indirect, displacement+indexed, stack) are used by only ~18% of the instructions.

Page 10: Advanced computer architecture lesson 5 and 6

Operand Types

• 74 to 80% of the operands are scalars (integers, reals, characters, etc.), which can be held in registers;

• the rest (20-26%) are arrays/structures; 90% of them are global variables;

• 80% of the scalars are local variables.

NB: The majority of operands are local variables of scalar type, which can be stored in registers.

Page 11: Advanced computer architecture lesson 5 and 6

Some statistics concerning procedure calls:

• Only 1.25% of called procedures have more than six parameters.

• Only 6.7% of called procedures have more than six local variables.

• Chains of nested procedure calls are usually short and only very seldom longer than 6.

Page 12: Advanced computer architecture lesson 5 and 6

Conclusions from Evaluation of Program Execution

• An overwhelming preponderance of simple (ALU and move) operations over complex operations.

• Preponderance of simple addressing modes.

• Large frequency of operand accesses; on average each instruction references 1.9 operands.

• Most of the referenced operands are scalars (so they can be stored in a register) and are local variables or parameters.

• Optimizing the procedure CALL/RETURN mechanism promises large benefits in speed.

These conclusions were the starting point for the Reduced Instruction Set Computer (RISC) approach.

Page 13: Advanced computer architecture lesson 5 and 6

Characteristics of Reduced Instruction Set Architectures

Although a variety of different approaches to reduced instruction set architecture have been taken, certain characteristics are common to all of them:

• One instruction per cycle

• Register-to-register operations

• Simple addressing modes

• Simple instruction formats

Page 14: Advanced computer architecture lesson 5 and 6

The first characteristic listed is that there is one machine instruction per machine cycle. A machine cycle is defined to be the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register. Thus, RISC machine instructions should be no more complicated than, and execute about as fast as, microinstructions on CISC machines. With simple, one-cycle instructions, there is little or no need for microcode; the machine instructions can be hardwired. Such instructions should execute faster than comparable machine instructions on other machines, because it is not necessary to access a microprogram control store during instruction execution.

Page 15: Advanced computer architecture lesson 5 and 6

The goal is to create an instruction set containing instructions that execute quickly; most RISC instructions are executed in a single machine cycle (once fetched and decoded).

- RISC instructions, being simple, are hard-wired, while CISC architectures have to use microprogramming in order to implement complex instructions.

- Having only simple instructions results in reduced complexity of the control unit and the data path; as a consequence, the processor can work at a high clock frequency.

- The pipelines are used efficiently if instructions are simple and of similar execution time.

- Complex operations on RISCs are executed as a sequence of simple RISC instructions. On CISCs they are executed as one or a few complex instructions.

Page 16: Advanced computer architecture lesson 5 and 6

Example: consider a program in which 80% of the executed instructions are simple and 20% complex;

- on a CISC machine, simple instructions take 4 cycles and complex instructions take 8 cycles; the cycle time is 100 ns (10^-7 s);

- on a RISC machine, simple instructions execute in one cycle; complex operations are implemented as a sequence of instructions; we assume on average 14 instructions (14 cycles) per complex operation; the cycle time is 75 ns (0.75 × 10^-7 s).

Page 17: Advanced computer architecture lesson 5 and 6

How long does a program of 1,000,000 instructions take?

CISC: (10^6 × 0.80 × 4 + 10^6 × 0.20 × 8) × 10^-7 s = 0.48 s

RISC: (10^6 × 0.80 × 1 + 10^6 × 0.20 × 14) × 0.75 × 10^-7 s = 0.27 s

• complex operations take more time on the RISC, but their number is small;

• because of its simplicity, the RISC works at a smaller cycle time; with the CISC, simple instructions are slowed down because of the increased data path length and the increased control complexity.
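The arithmetic of this comparison can be checked with a short script; the instruction mix, cycle counts, and cycle times are the example's assumed figures, not measurements:

```python
# Execution-time comparison from the example: 10^6 instructions,
# 80% simple / 20% complex, with CISC vs. RISC cycle counts and cycle times.

def exec_time(n_instr, frac_simple, cycles_simple, cycles_complex, cycle_time_s):
    """Total time = (simple cycles + complex cycles) * cycle time."""
    cycles = n_instr * frac_simple * cycles_simple \
           + n_instr * (1 - frac_simple) * cycles_complex
    return cycles * cycle_time_s

cisc = exec_time(1_000_000, 0.80, 4, 8, 100e-9)   # 100 ns cycle time
risc = exec_time(1_000_000, 0.80, 1, 14, 75e-9)   # 75 ns cycle time

print(cisc, risc)  # roughly 0.48 s vs. 0.27 s
```

Varying the assumed 14-cycle cost of a complex RISC operation shows how robust the conclusion is: even at 20 cycles per complex operation, the RISC total stays below the CISC total.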

Page 18: Advanced computer architecture lesson 5 and 6

A second characteristic is that most operations should be register to register, with only simple LOAD and STORE operations accessing memory.

Only LOAD and STORE instructions reference data in memory; all other instructions operate only on registers (they are register-to-register instructions); thus, only the few instructions accessing memory need more than one cycle to execute (once fetched and decoded).

Page 19: Advanced computer architecture lesson 5 and 6

Third Characteristic: Instructions use only a few addressing modes

- Addressing modes are usually register, direct, register indirect, displacement.

Almost all RISC instructions use simple register addressing.

Fourth Characteristic: Instructions are of fixed length and uniform format

- This makes the loading and decoding of instructions simple and fast; there is no need to wait until the length of one instruction is known in order to start decoding the following one;

- Decoding is simplified because the opcode and address fields are located in the same position for all instructions.
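As an illustration of why fixed, uniform formats make decoding fast, here is a sketch of field extraction for the 32-bit MIPS R-type layout (the field positions are the standard MIPS ones; the helper name is ours):

```python
def decode_rtype(word):
    """Split a 32-bit MIPS R-type instruction into its fixed fields.
    Every field sits at the same bit position in every R-type word,
    so decoding is a fixed set of shifts and masks, with no sequential
    length computation."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs":     (word >> 21) & 0x1F,   # bits 25..21
        "rt":     (word >> 16) & 0x1F,   # bits 20..16
        "rd":     (word >> 11) & 0x1F,   # bits 15..11
        "shamt":  (word >> 6)  & 0x1F,   # bits 10..6
        "funct":  word & 0x3F,           # bits 5..0
    }

# ADD $8, $9, $10 encodes as 0x012A4020
print(decode_rtype(0x012A4020))
# {'opcode': 0, 'rs': 9, 'rt': 10, 'rd': 8, 'shamt': 0, 'funct': 32}
```

Because the register fields are in the same place in every format, the register file can even be read speculatively while the opcode is still being examined.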

Page 20: Advanced computer architecture lesson 5 and 6

Fifth Characteristic: A large number of registers is available

- Variables and intermediate results can be stored in registers and do not require repeated loads and stores from/to memory.

- All local variables of procedures and the passed parameters can be stored in registers.

Page 21: Advanced computer architecture lesson 5 and 6

What happens when a new procedure is called?

- Normally the registers have to be saved in memory (they contain values of variables and parameters for the calling procedure); at return to the calling procedure, the values have to be again loaded from memory. This takes a lot of time.

- If a large number of registers is available, a new set of registers can be allocated to the called procedure and the register set assigned to the calling one remains untouched.

Page 22: Advanced computer architecture lesson 5 and 6

Is the strategy above realistic?

- The strategy is realistic, because the number of local variables in procedures is not large. Chains of nested procedure calls are only exceptionally longer than 6.

- If the chain of nested procedure calls becomes long, at a certain call there will be no registers left to assign to the called procedure; in this case local variables and parameters have to be stored in memory.
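This spill/fill behaviour can be modelled with a toy simulation that keeps only the newest N call frames in the register file; it is deliberately simpler than real register-window designs such as SPARC's, which also overlap windows for parameter passing:

```python
def simulate(call_trace, num_windows=6):
    """call_trace: '+' for a procedure call, '-' for a return.
    Returns (spills, fills) under a 'keep the newest N frames
    in the register file' policy -- a simplified register-window model."""
    resident = 0   # frames currently held in the register file
    in_memory = 0  # frames spilled out to memory
    spills = fills = 0
    for event in call_trace:
        if event == "+":
            if resident == num_windows:
                in_memory += 1   # evict the oldest frame to memory
                spills += 1
            else:
                resident += 1    # a free window is still available
        else:  # return: the newest frame is discarded
            resident -= 1
            if in_memory:
                in_memory -= 1   # restore a caller frame from memory
                fills += 1
                resident += 1
    return spills, fills

# Nesting depth 8 with 6 windows forces 2 spills (and later 2 fills);
# depth 4 causes no memory traffic at all.
print(simulate("+" * 8 + "-" * 8, 6), simulate("+" * 4 + "-" * 4, 6))
```

The model matches the statistics quoted above: since call chains only rarely exceed depth 6, a machine with 6 to 8 windows does almost all procedure linkage without touching memory.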

Page 23: Advanced computer architecture lesson 5 and 6

Why is a large number of registers typical for RISC architectures?

- Because of the reduced complexity of the processor there is enough space on the chip to be allocated to a large number of registers. This, usually, is not the case with CISCs.

Page 24: Advanced computer architecture lesson 5 and 6

The delayed load problem

• LOAD instructions (like STORE instructions) require a memory access, so their execution cannot be completed in a single clock cycle.

However, in the next cycle the processor starts a new instruction.

Two possible solutions:

1. The hardware delays the execution of the instruction following the LOAD if that instruction needs the loaded value.

2. A more efficient, compiler-based solution, which has similarities with delayed branching, is the delayed load:

Page 25: Advanced computer architecture lesson 5 and 6

With delayed load, the processor always executes the instruction following a LOAD, without a stall; it is the programmer's (compiler's) responsibility to ensure that this instruction does not need the loaded value.
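A compiler pass in this spirit can be sketched as follows; the three-address tuple format and the dependence checks are simplified for illustration (a real scheduler must also respect anti- and output dependencies across all skipped instructions):

```python
def fill_load_delay_slots(instrs):
    """instrs: list of (op, dest, sources) tuples.
    If the instruction after a LOAD reads the loaded register,
    try to move up a later instruction that is independent of the load."""
    out = list(instrs)
    i = 0
    while i < len(out) - 1:
        op, dest, _ = out[i]
        if op == "LOAD" and dest in out[i + 1][2]:
            for j in range(i + 2, len(out)):
                cand_dest, cand_srcs = out[j][1], out[j][2]
                # The candidate must not read the loaded value, must not
                # clobber it, and must not produce a value the dependent
                # instruction reads (kept deliberately simple here).
                if (dest not in cand_srcs and cand_dest != dest
                        and cand_dest not in out[i + 1][2]):
                    out.insert(i + 1, out.pop(j))  # fill the delay slot
                    break
        i += 1
    return out

prog = [
    ("LOAD", "R1", ("R2",)),       # R1 <- mem[R2]
    ("ADD",  "R3", ("R1", "R4")),  # needs R1: would stall after the LOAD
    ("SUB",  "R5", ("R6", "R7")),  # independent of the load
]
print(fill_load_delay_slots(prog))  # the SUB now fills the delay slot
```

After the pass, the instruction in the delay slot no longer touches R1, so the processor can execute it while the load completes.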

Page 26: Advanced computer architecture lesson 5 and 6

CISC versus RISC Characteristics

After the initial enthusiasm for RISC machines, there has been a growing realization that

(1) RISC designs may benefit from the inclusion of some CISC features

(2) CISC designs may benefit from the inclusion of some RISC features.

The result is that the more recent RISC designs, notably the PowerPC, are no longer “pure” RISC and the more recent CISC designs, notably the Pentium II and later Pentium models, do incorporate some RISC characteristics.

Page 27: Advanced computer architecture lesson 5 and 6

Typical RISC characteristics

1. A single instruction size.

2. That size is typically 4 bytes.

3. A small number of data addressing modes, typically fewer than five. This parameter is difficult to pin down. In the table, register and literal modes are not counted, and different formats with different offset sizes are counted separately.

4. No indirect addressing that requires you to make one memory access to get the address of another operand in memory.

5. No operations that combine load/store with arithmetic (e.g., add from memory, add to memory).

Page 28: Advanced computer architecture lesson 5 and 6

6. No more than one memory-addressed operand per instruction.

7. Does not support arbitrary alignment of data for load/store operations.

8. Maximum number of uses of the memory management unit (MMU) for a data address in an instruction.

9. Number of bits for integer register specifier equal to five or more. This means that at least 32 integer registers can be explicitly referenced at a time.

10. Number of bits for floating-point register specifier equal to four or more. This means that at least 16 floating-point registers can be explicitly referenced at a time.

Page 29: Advanced computer architecture lesson 5 and 6

RISC PIPELINING

Instruction pipelining is often used to enhance performance.

Let us reconsider this in the context of a RISC architecture. Most instructions are register to register, and an instruction cycle has the following two stages:

• I: Instruction fetch.
• E: Execute. Performs an ALU operation with register input and output.

For load and store operations, three stages are required:

• I: Instruction fetch.
• E: Execute. Calculates the memory address.
• D: Memory. Register-to-memory or memory-to-register operation.

Page 30: Advanced computer architecture lesson 5 and 6

The two stages of the pipeline are an instruction fetch stage, and an execute/memory stage that executes the instruction, including register-to-memory and memory-to-register operations. Thus the instruction fetch stage of the second instruction can be performed in parallel with the first part of the execute/memory stage. However, the execute/memory stage of the second instruction must be delayed until the first instruction clears the second stage of the pipeline. This scheme can yield up to twice the execution rate of a serial scheme.

Two problems prevent the maximum speedup from being achieved. First, we assume that a single-port memory is used and that only one memory access is possible per stage. This requires the insertion of a wait state in some instructions. Second, a branch instruction interrupts the sequential flow of execution. To accommodate this with minimum circuitry, a NOOP instruction can be inserted into the instruction stream by the compiler or assembler.

Page 31: Advanced computer architecture lesson 5 and 6

Pipelining can be improved further by permitting two memory accesses per stage. Now up to three instructions can be overlapped, and the improvement is as much as a factor of 3. Again, branch instructions cause the speedup to fall short of the maximum possible. Also, note that data dependencies have an effect: if an instruction needs an operand that is altered by the preceding instruction, a delay is required. Again, this can be accomplished by a NOOP.
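The speedup factors quoted here follow from simple cycle counting: an ideal k-stage pipeline finishes n instructions in k + (n - 1) cycles instead of n × k, and each inserted NOOP or wait state adds a cycle. A sketch, ignoring all other effects:

```python
def pipeline_cycles(n_instructions, n_stages, n_stalls=0):
    """Ideal pipeline: the first instruction takes n_stages cycles,
    each following instruction retires one cycle later; each stall
    (e.g. an inserted NOOP or wait state) adds one cycle."""
    return n_stages + (n_instructions - 1) + n_stalls

def speedup(n_instructions, n_stages, n_stalls=0):
    """Speedup over unpipelined execution (n_instructions * n_stages)."""
    serial = n_instructions * n_stages
    return serial / pipeline_cycles(n_instructions, n_stages, n_stalls)

print(speedup(1000, 3))        # approaches the stage count, 3
print(speedup(1000, 3, 200))   # branch/data NOOPs eat into the gain
```

This is why the text says the three-stage scheme gives "up to" a factor of 3: the bound is the stage count, and every NOOP inserted for a branch or data dependency pushes the real figure below it.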

Page 32: Advanced computer architecture lesson 5 and 6

The pipelining discussed so far works best if the three stages are of approximately equal duration. Because the E stage usually involves an ALU operation, it may be longer. In this case, we can divide it into two substages:

• Register file read
• ALU operation and register write

Because of the simplicity and regularity of a RISC instruction set, the design of the phasing into three or four stages is easily accomplished. Figure 13.6d shows the result with a four-stage pipeline. Up to four instructions at a time can be under way, and the maximum potential speedup is a factor of 4. Note again the use of NOOPs to account for data and branch delays.

Page 33: Advanced computer architecture lesson 5 and 6

Optimization of Pipelining

Because of the simple and regular nature of RISC instructions, pipelining schemes can be efficiently employed. There are few variations in instruction execution duration, and the pipeline can be tailored to reflect this. However, we have seen that data and branch dependencies reduce the overall execution rate.

DELAYED BRANCH

To compensate for these dependencies, code reorganization techniques have been developed. First, let us consider branching instructions. Delayed branch, a way of increasing the efficiency of the pipeline, makes use of a branch that does not take effect until after execution of the following instruction (hence the term delayed).

Page 34: Advanced computer architecture lesson 5 and 6

LOOP UNROLLING

Another compiler technique to improve instruction parallelism is loop unrolling [BACO94]. Unrolling replicates the body of a loop some number of times, called the unrolling factor (u), and iterates by step u instead of step 1.

Unrolling can improve the performance by

• reducing loop overhead

• increasing instruction parallelism by improving pipeline performance

• improving register, data cache, or TLB locality
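A hand-unrolled loop with unrolling factor u = 4 illustrates the transformation (the function names and data are illustrative); the cleanup loop handles trip counts that are not a multiple of u:

```python
def scaled_sum(a, s):
    """Rolled loop: one multiply-add per iteration, plus the loop
    overhead (index update and branch) on every iteration."""
    total = 0
    for i in range(len(a)):
        total += s * a[i]
    return total

def scaled_sum_unrolled(a, s):
    """Same computation, unrolled by u = 4: four independent
    multiply-adds per iteration give the pipeline more work to
    overlap, and the loop test runs a quarter as often."""
    total = 0
    n = len(a)
    i = 0
    while i + 4 <= n:
        total += s*a[i] + s*a[i+1] + s*a[i+2] + s*a[i+3]
        i += 4
    while i < n:          # cleanup when n is not a multiple of 4
        total += s * a[i]
        i += 1
    return total

data = [1, 2, 3, 4, 5, 6, 7]
print(scaled_sum(data, 2), scaled_sum_unrolled(data, 2))  # both print 56
```

The transformation preserves the result exactly; the benefit comes from fewer loop-overhead instructions and from exposing independent operations to the pipeline, as the bullet list above states.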

Page 35: Advanced computer architecture lesson 5 and 6

Instruction Set

The table below lists the basic instruction set for all MIPS R-series processors. All processor instructions are encoded in a single 32-bit word format. All data operations are register to register; the only memory references are pure load/store operations.

The R4000 makes no use of condition codes. If an instruction generates a condition, the corresponding flags are stored in a general-purpose register. This avoids the need for special logic to deal with condition codes as they affect the pipelining mechanism and the reordering of instructions by the compiler. Instead, the mechanisms already implemented to deal with register-value dependencies are employed. Further, conditions mapped onto the register files are subject to the same compile-time optimizations in allocation and reuse as other values stored in registers.

Page 36: Advanced computer architecture lesson 5 and 6

As with most RISC-based machines, the MIPS uses a single 32-bit instruction length. This single instruction length simplifies instruction fetch and decode, and it also simplifies the interaction of instruction fetch with the virtual memory management unit (i.e., instructions do not cross word or page boundaries). The three instruction formats share common formatting of opcodes and register references, simplifying instruction decode. The effect of more complex instructions can be synthesized at compile time.

Only the simplest and most frequently used memory-addressing mode is implemented in hardware. All memory references consist of a 16-bit offset from a 32-bit register.

Page 37: Advanced computer architecture lesson 5 and 6

MIPS R-Series Instruction Set (OP & Description)

Load/Store Instructions
LB   Load Byte
LBU  Load Byte Unsigned
LH   Load Halfword
LHU  Load Halfword Unsigned
LW   Load Word
LWL  Load Word Left
LWR  Load Word Right
SB   Store Byte
SH   Store Halfword
SW   Store Word
SWL  Store Word Left
SWR  Store Word Right

Page 38: Advanced computer architecture lesson 5 and 6

Arithmetic Instructions (3-operand, R-type)
ADD   Add
ADDU  Add Unsigned
SUB   Subtract
SUBU  Subtract Unsigned
SLT   Set on Less Than
SLTU  Set on Less Than Unsigned
AND   AND
OR    OR
XOR   Exclusive-OR
NOR   NOR

Arithmetic Instructions (ALU Immediate)
ADDI   Add Immediate
ADDIU  Add Immediate Unsigned
SLTI   Set on Less Than Immediate
SLTIU  Set on Less Than Immediate Unsigned
ANDI   AND Immediate
ORI    OR Immediate
XORI   Exclusive-OR Immediate
LUI    Load Upper Immediate

Page 39: Advanced computer architecture lesson 5 and 6

Multiply/Divide Instructions
MULT   Multiply
MULTU  Multiply Unsigned
DIV    Divide
DIVU   Divide Unsigned
MFHI   Move From HI
MTHI   Move To HI
MFLO   Move From LO
MTLO   Move To LO

Shift Instructions
SLL   Shift Left Logical
SRL   Shift Right Logical
SRA   Shift Right Arithmetic
SLLV  Shift Left Logical Variable
SRLV  Shift Right Logical Variable
SRAV  Shift Right Arithmetic Variable

Page 40: Advanced computer architecture lesson 5 and 6

Coprocessor Instructions
LWCz  Load Word to Coprocessor
SWCz  Store Word to Coprocessor
MTCz  Move To Coprocessor
MFCz  Move From Coprocessor
CTCz  Move Control To Coprocessor
CFCz  Move Control From Coprocessor
COPz  Coprocessor Operation
BCzT  Branch on Coprocessor z True
BCzF  Branch on Coprocessor z False

Special Instructions
SYSCALL  System Call
BREAK    Break

Jump and Branch Instructions
J       Jump
JAL     Jump and Link
JR      Jump to Register
JALR    Jump and Link Register
BEQ     Branch on Equal
BNE     Branch on Not Equal
BLEZ    Branch on Less Than or Equal to Zero
BGTZ    Branch on Greater Than Zero
BLTZ    Branch on Less Than Zero
BGEZ    Branch on Greater Than or Equal to Zero
BLTZAL  Branch on Less Than Zero And Link
BGEZAL  Branch on Greater Than or Equal to Zero And Link

Page 41: Advanced computer architecture lesson 5 and 6

Instruction Pipeline

With its simplified instruction architecture, the MIPS can achieve very efficient pipelining. It is instructive to look at the evolution of the MIPS pipeline, as it illustrates the evolution of RISC pipelining in general.

The initial experimental RISC systems and the first generation of commercial RISC processors achieve execution speeds that approach one instruction per system clock cycle. To improve on this performance, two classes of processors have evolved to offer execution of multiple instructions per clock cycle: superscalar and superpipelined architectures. In essence, a superscalar architecture replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously.

Page 42: Advanced computer architecture lesson 5 and 6

A superpipelined architecture is one that makes use of more, and more fine-grained, pipeline stages. With more stages, more instructions can be in the pipeline at the same time, increasing parallelism.

Both approaches have limitations. With superscalar pipelining, dependencies between instructions in different pipelines can slow down the system. Also, overhead logic is required to coordinate these dependencies. With superpipelining, there is overhead associated with transferring instructions from one stage to the next.

Page 43: Advanced computer architecture lesson 5 and 6

RISC VERSUS CISC CONTROVERSY

The work that has been done on assessing merits of the RISC approach can be grouped into two categories:

• Quantitative: Attempts to compare program size and execution speed of programs on RISC and CISC machines that use comparable technology

• Qualitative: Examines issues such as high-level language support and optimum use of VLSI real estate

Page 44: Advanced computer architecture lesson 5 and 6

Most of the work on quantitative assessment has been done by those working on RISC systems [PATT82b, HEAT84, PATT84], and it has been, by and large, favorable to the RISC approach. Others have examined the issue and come away unconvinced [COLW85a, FLYN87, DAVI87]. There are several problems with attempting such comparisons [SERL86]:

• There is no pair of RISC and CISC machines that are comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, and so on.

• No definitive test set of programs exists. Performance varies with the program.

• It is difficult to sort out hardware effects from effects due to skill in compiler writing.

• Most of the comparative analysis on RISC has been done on “toy” machines rather than commercial products. Furthermore, most commercially available machines advertised as RISC possess a mixture of RISC and CISC characteristics. Thus, a fair comparison with a commercial, “pure-play” CISC machine (e.g., VAX, Pentium) is difficult.

The qualitative assessment is, almost by definition, subjective.

Page 45: Advanced computer architecture lesson 5 and 6

INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS

Lesson 6

Page 46: Advanced computer architecture lesson 5 and 6

A superscalar processor is one in which multiple independent instruction pipelines are used. Each pipeline consists of multiple stages, so that each pipeline can handle multiple instructions at a time. Multiple pipelines introduce a new level of parallelism, enabling multiple streams of instructions to be processed at a time. A superscalar processor exploits what is known as instruction-level parallelism, which refers to the degree to which the instructions of a program can be executed in parallel.

Page 47: Advanced computer architecture lesson 5 and 6

A superscalar processor typically fetches multiple instructions at a time and then attempts to find nearby instructions that are independent of one another and can therefore be executed in parallel. If the input of one instruction depends on the output of a preceding instruction, the dependent instruction cannot complete execution at the same time as, or before, the instruction it depends on. Once such dependencies have been identified, the processor may issue and complete instructions in an order that differs from that of the original machine code.

The processor may eliminate some unnecessary dependencies by the use of additional registers and the renaming of register references in the original code.
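Register renaming can be sketched as giving every write a fresh physical register and routing every read through the current mapping; this removes output and antidependencies while preserving true (read-after-write) dependencies. The tuple encoding below is illustrative, not any particular machine's rename hardware:

```python
def rename_registers(instrs, n_arch):
    """instrs: list of (dest, src1, src2) architectural register numbers.
    Each write is assigned a fresh physical register; each read goes
    through the latest mapping, so only true dependencies remain."""
    mapping = {r: r for r in range(n_arch)}   # architectural -> physical
    next_phys = n_arch                        # next free physical register
    renamed = []
    for dest, s1, s2 in instrs:
        p1, p2 = mapping[s1], mapping[s2]     # read through current mapping
        mapping[dest] = next_phys             # fresh register for the write
        renamed.append((mapping[dest], p1, p2))
        next_phys += 1
    return renamed

# R3 = R3 op R5;  R3 = R5 op R6  -- an output dependency on R3
print(rename_registers([(3, 3, 5), (3, 5, 6)], n_arch=8))
# [(8, 3, 5), (9, 5, 6)]  -- the two writes no longer collide
```

After renaming, the two instructions write different physical registers, so the processor may complete them out of order without corrupting either result.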

Whereas pure RISC processors often employ delayed branches to maximize the utilization of the instruction pipeline, this method is less appropriate to a superscalar machine. Instead, most superscalar machines use traditional branch prediction methods to improve efficiency

Page 48: Advanced computer architecture lesson 5 and 6

A superscalar implementation of a processor architecture is one in which common instructions—integer and floating-point arithmetic, loads, stores, and conditional branches—can be initiated simultaneously and executed independently. Such implementations raise a number of complex design issues related to the instruction pipeline

Page 49: Advanced computer architecture lesson 5 and 6

The term superscalar, first coined in 1987 [AGER87], refers to a machine that is designed to improve the performance of the execution of scalar instructions. In most applications, the bulk of the operations are on scalar quantities. Accordingly, the superscalar approach represents the next step in the evolution of high-performance general-purpose processors

The essence of the superscalar approach is the ability to execute instructions independently and concurrently in different pipelines

Page 50: Advanced computer architecture lesson 5 and 6

Superscalar versus Superpipelined

An alternative approach to achieving greater performance is referred to as superpipelining. Superpipelining exploits the fact that many pipeline stages perform tasks that require less than half a clock cycle. Thus, a doubled internal clock speed allows the performance of two tasks in one external clock cycle.

Page 51: Advanced computer architecture lesson 5 and 6

The pipeline has four stages: instruction fetch, operation decode, operation execution, and result write back. The execution stage is crosshatched for clarity. Note that although several instructions are executing concurrently, only one instruction is in its execution stage at any one time.

Both the superpipeline and the superscalar implementations have the same number of instructions executing at the same time in the steady state. The superpipelined processor falls behind the superscalar processor at the start of the program and at each branch target

Page 52: Advanced computer architecture lesson 5 and 6

Limitations

The superscalar approach depends on the ability to execute multiple instructions in parallel.

The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel. A combination of compiler-based optimization and hardware techniques can be used to maximize instruction-level parallelism.

Before examining the design techniques used in superscalar machines to increase instruction-level parallelism, we need to look at the fundamental limitations to parallelism with which the system must cope. [JOHN91] lists five limitations:

• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency

Page 53: Advanced computer architecture lesson 5 and 6

A typical RISC processor takes two or more cycles to perform a load from memory when the load is a cache hit. It can take tens or even hundreds of cycles for a cache miss on all cache levels, because of the delay of an off-chip memory access.

One way to compensate for this delay is for the compiler to reorder instructions so that one or more subsequent instructions that do not depend on the memory load can begin flowing through the pipeline. This scheme is less effective in the case of a superscalar pipeline: the independent instructions executed during the load are likely to be executed on the first cycle of the load, leaving the processor with nothing to do until the load completes.

Page 54: Advanced computer architecture lesson 5 and 6

DESIGN ISSUES

Instruction-level parallelism exists when instructions in a sequence are independent and thus can be executed in parallel by overlapping.

As an example of the concept of instruction-level parallelism, consider the following two code fragments [JOUP89b]:

Fragment 1 (independent):      Fragment 2 (dependent):
Load  R1 ← R2                  Add   R3 ← R3, "1"
Add   R3 ← R3, "1"             Add   R4 ← R3, R2
Add   R4 ← R4, R2              Store [R4] ← R0

Page 55: Advanced computer architecture lesson 5 and 6

The three instructions on the left are independent, and in theory all three could be executed in parallel. In contrast, the three instructions on the right cannot be executed in parallel because the second instruction uses the result of the first, and the third instruction uses the result of the second.

The degree of instruction-level parallelism is determined by the frequency of true data dependencies and procedural dependencies in the code. These factors, in turn, are dependent on the instruction set architecture and on the application.

Instruction-level parallelism is also determined by what [JOUP89a] refers to as operation latency: the time until the result of an instruction is available for use as an operand in a subsequent instruction. The latency determines how much of a delay a data or procedural dependency will cause.

Page 56: Advanced computer architecture lesson 5 and 6

Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism. Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time (the number of parallel pipelines) and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions.

Both instruction-level and machine parallelism are important factors in enhancing performance. A program may not have enough instruction-level parallelism to take full advantage of machine parallelism. The use of a fixed-length instruction set architecture, as in a RISC, enhances instruction-level parallelism. On the other hand, limited machine parallelism will limit performance no matter what the nature of the program

Page 57: Advanced computer architecture lesson 5 and 6

Instruction Issue Policy

The processor must also be able to identify instruction-level parallelism and orchestrate the fetching, decoding, and execution of instructions in parallel.

We use the term instruction issue to refer to the process of initiating instruction execution in the processor’s functional units, and the term instruction issue policy to refer to the protocol used to issue instructions.

In general, we can say that instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline. In essence, the processor is trying to look ahead of the current point of execution to locate instructions that can be brought into the pipeline and executed. Three types of orderings are important in this regard:

• The order in which instructions are fetched
• The order in which instructions are executed
• The order in which instructions update the contents of register and memory locations

Page 58: Advanced computer architecture lesson 5 and 6

In general terms, we can group superscalar instruction issue policies into the following categories:

• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion

Page 59: Advanced computer architecture lesson 5 and 6

IN-ORDER ISSUE WITH IN-ORDER COMPLETION The simplest instruction issue policy is to issue instructions in the exact order that would be achieved by sequential execution (in-order issue) and to write results in that same order (in-order completion). Not even scalar pipelines follow such a simple-minded policy. However, it is useful to consider this policy as a baseline for comparing more sophisticated approaches.

Page 60: Advanced computer architecture lesson 5 and 6

IN-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION Out-of-order completion is used in scalar RISC processors to improve the performance of instructions that require multiple cycles.

With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional units. Instruction issuing is stalled by a resource conflict, a data dependency, or a procedural dependency.

Page 61: Advanced computer architecture lesson 5 and 6

OUT-OF-ORDER ISSUE WITH OUT-OF-ORDER COMPLETION With in-order issue, the processor will only decode instructions up to the point of a dependency or conflict. No additional instructions are decoded until the conflict is resolved. As a result, the processor cannot look ahead of the point of conflict to subsequent instructions that may be independent of those already in the pipeline and that may be usefully introduced into the pipeline.

To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline. This is done with a buffer referred to as an instruction window. With this organization, after a processor has finished decoding an instruction, it is placed in the instruction window. As long as this buffer is not full, the processor can continue to fetch and decode new instructions. When a functional unit becomes available in the execute stage, an instruction from the instruction window may be issued to the execute stage. Any instruction may be issued, provided that

(1) it needs the particular functional unit that is available, and (2) no conflicts or dependencies block this instruction.
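The instruction-window behavior can be sketched as a toy cycle-by-cycle model in Python. The (name, dest, sources, latency) encoding, the latencies, and the two-wide issue width are all illustrative assumptions, not from the lesson:

```python
# Toy model of out-of-order issue from an instruction window.
# Instructions are (name, dest, sources, latency); this encoding and the
# example program are illustrative assumptions, not from the lesson text.

def simulate(window, width=2):
    in_flight = {}                    # dest register -> cycle result is ready
    pending = list(window)
    log = []                          # (issue cycle, instruction name)
    cycle = 0
    while pending and cycle < 100:    # cycle cap guards the toy loop
        pending_dests = {d for _, d, _, _ in pending}
        issued = 0
        for inst in list(pending):
            name, dest, sources, lat = inst
            waiting = any(s in pending_dests for s in sources)    # producer not yet issued
            busy = any(in_flight.get(s, 0) > cycle for s in sources)  # producer still executing
            if not waiting and not busy and issued < width:
                in_flight[dest] = cycle + lat
                log.append((cycle, name))
                pending.remove(inst)
                issued += 1
        cycle += 1
    return log

window = [("I1", "R1", [],     3),   # long-latency load
          ("I2", "R2", ["R1"], 1),   # depends on I1
          ("I3", "R3", ["R4"], 1)]   # independent: issues ahead of I2
print(simulate(window))  # [(0, 'I1'), (0, 'I3'), (3, 'I2')]
```

Note that I3 issues in cycle 0, ahead of the older I2, which must wait until the load result is ready in cycle 3: this is exactly the lookahead capability that decoupling decode from execute provides.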

Page 62: Advanced computer architecture lesson 5 and 6

The result of this organization is that the processor has a lookahead capability, allowing it to identify independent instructions that can be brought into the execute stage. Instructions are issued from the instruction window with little regard for their original program order. As before, the only constraint is that the program execution behaves correctly.

Page 63: Advanced computer architecture lesson 5 and 6

One common technique that is used to support out-of-order completion is the reorder buffer. The reorder buffer is temporary storage for results completed out of order that are then committed to the register file in program order. A related concept is Tomasulo’s algorithm.

The term antidependency is used because the constraint is similar to that of a true data dependency, but reversed: Instead of the first instruction producing a value that the second instruction uses, the second instruction destroys a value that the first instruction uses.
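A concrete antidependency, sketched in Python with the same illustrative (dest, sources) tuple encoding used above (the register names are my own example):

```python
# A write-after-read (antidependency) check between two instructions,
# each modeled as a (dest, sources) tuple; this encoding is illustrative.

def has_antidependency(first, second):
    """True if `second` writes a register that `first` reads."""
    return second[0] in first[1]

# I1: R4 <- R3 + 1  reads R3;  I2: R3 <- R5 + 1  overwrites R3.
# I2 must not complete before I1 has read R3, even though no value
# flows from I1 to I2.
print(has_antidependency(("R4", ["R3"]), ("R3", ["R5"])))  # True
```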

Page 64: Advanced computer architecture lesson 5 and 6

Register Renaming

One method for coping with these types of storage conflicts is based on a traditional resource-conflict solution: duplication of resources. In this context, the technique is referred to as register renaming. In essence, registers are allocated dynamically by the processor hardware, and they are associated with the values needed by instructions at various points in time.

Page 65: Advanced computer architecture lesson 5 and 6

When a new register value is created (i.e., when an instruction executes that has a register as a destination operand), a new register is allocated for that value. Subsequent instructions that access that value as a source operand in that register must go through a renaming process: the register references in those instructions must be revised to refer to the register containing the needed value. Thus, the same original register reference in several different instructions may refer to different actual registers, if different values are intended.
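A minimal renaming sketch in Python: each architectural destination gets a fresh physical register, and sources are read through the current mapping table. The register naming scheme and (dest, sources) encoding are my own illustration:

```python
# Minimal register-renaming sketch. Architectural registers R0..Rn are
# mapped to physical registers; each new result gets a fresh physical
# register. The encoding (dest, sources) is an illustrative assumption.

def rename(prog, num_arch):
    table = {f"R{i}": f"P{i}" for i in range(num_arch)}  # initial mapping
    next_phys = num_arch
    out = []
    for dest, sources in prog:
        srcs = [table[s] for s in sources]   # read through current mapping
        table[dest] = f"P{next_phys}"        # fresh register for the result
        next_phys += 1
        out.append((table[dest], srcs))
    return out

# R3 <- R3 + 1 ; R4 <- R3 + R2 ; R3 <- R5 + 1
# The second write to R3 creates WAW/WAR hazards that renaming removes.
prog = [("R3", ["R3"]), ("R4", ["R3", "R2"]), ("R3", ["R5"])]
print(rename(prog, 6))
# [('P6', ['P3']), ('P7', ['P6', 'P2']), ('P8', ['P5'])]
```

After renaming, the third instruction (P8 ← P5) shares no registers with the first two, so it may issue and complete in any order relative to them.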

Page 66: Advanced computer architecture lesson 5 and 6

An alternative to register renaming is scoreboarding. In essence, scoreboarding is a bookkeeping technique that allows instructions to execute whenever they are not dependent on previous instructions and no structural hazards are present.

Page 67: Advanced computer architecture lesson 5 and 6

Branch Prediction

Any high-performance pipelined machine must address the issue of dealing with branches. For example, the Intel 80486 addressed the problem by fetching both the next sequential instruction after a branch and speculatively fetching the branch target instruction. However, because there are two pipeline stages between prefetch and execution, this strategy incurs a two-cycle delay when the branch is taken.

Page 68: Advanced computer architecture lesson 5 and 6

With the advent of RISC machines, the delayed branch strategy was explored. With this strategy, the processor calculates the result of a conditional branch before any unusable instructions have been prefetched: the processor always executes the single instruction that immediately follows the branch. This keeps the pipeline full while the processor fetches a new instruction stream.

With the development of superscalar machines, the delayed branch strategy has less appeal. The reason is that multiple instructions need to execute in the delay slot, raising several problems relating to instruction dependencies. Thus, superscalar machines have returned to pre-RISC techniques of branch prediction. Some, like the PowerPC 601, use a simple static branch prediction technique. More sophisticated processors, such as the PowerPC 620 and the Pentium 4, use dynamic branch prediction based on branch history analysis.
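A two-bit saturating counter is the classic dynamic scheme based on branch history. The sketch below is a simplified illustration of the general technique, not the actual PowerPC 620 or Pentium 4 implementation:

```python
# Two-bit saturating-counter branch predictor (a simplified sketch of the
# classic dynamic branch-history scheme, not a real processor's design).

class TwoBitPredictor:
    def __init__(self):
        self.state = 2            # 0-1 predict not taken, 2-3 predict taken;
                                  # initialized weakly taken

    def predict(self):
        return self.state >= 2    # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
history = [True, True, False, True, True]   # e.g. a mostly-taken loop branch
hits = 0
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(history), "predicted correctly")  # 4 of 5
```

The two-bit hysteresis means a single not-taken outcome (the loop exit) costs only one misprediction and does not flip the prediction for the next loop entry.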

Page 69: Advanced computer architecture lesson 5 and 6

Superscalar Implementation

Based on our discussion so far, we can make some general comments about the processor hardware required for the superscalar approach. [SMIT95] lists the following key elements:

• Instruction fetch strategies that simultaneously fetch multiple instructions, often by predicting the outcomes of, and fetching beyond, conditional branch instructions. These functions require the use of multiple pipeline fetch and decode stages, and branch prediction logic.

• Logic for determining true dependencies involving register values, and mechanisms for communicating these values to where they are needed during execution.

Page 70: Advanced computer architecture lesson 5 and 6

• Mechanisms for initiating, or issuing, multiple instructions in parallel.

• Resources for parallel execution of multiple instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references.

• Mechanisms for committing the process state in correct order.

Page 71: Advanced computer architecture lesson 5 and 6

PENTIUM 4

Although the concept of superscalar design is generally associated with the RISC architecture, the same superscalar principles can be applied to a CISC machine. Perhaps the most notable example of this is the Pentium. The evolution of superscalar concepts in the Intel line is interesting to note. The 386 is a traditional CISC nonpipelined machine.

The 486 introduced pipelining to the x86 line, reducing the average latency of integer operations from between two and four cycles to one cycle, but it was still limited to executing a single instruction each cycle, with no superscalar elements. The original Pentium had a modest superscalar component, consisting of the use of two separate integer execution units. The Pentium Pro introduced a full-blown superscalar design. Subsequent Pentium models have refined and enhanced the superscalar design.

The figure below shows a general block diagram of the Pentium 4, depicting its structure in a way suitable for the pipeline discussion in this section. The operation of the Pentium 4 can be summarized as follows:

1. The processor fetches instructions from memory in the order of the static program.

2. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations, or micro-ops.

Page 72: Advanced computer architecture lesson 5 and 6

3. The processor executes the micro-ops on a superscalar pipeline organization, so that the micro-ops may be executed out of order.

4. The processor commits the results of each micro-op execution to the processor’s register set in the order of the original program flow.

Page 73: Advanced computer architecture lesson 5 and 6

Pentium 4 Block Diagram

Pg 538

Page 74: Advanced computer architecture lesson 5 and 6

Pentium 4 Pipeline

Page 75: Advanced computer architecture lesson 5 and 6

Front End

GENERATION OF MICRO-OPS The Pentium 4 organization includes an in-order front end that can be considered outside the scope of the pipeline depicted in figure above. This front end feeds into an L1 instruction cache, called the trace cache, which is where the pipeline proper begins. Usually, the processor operates from the trace cache; when a trace cache miss occurs, the in-order front end feeds new instructions into the trace cache.