

DETERMINISTIC FEATURES IN THE EVOLUTION OF MICROPROCESSORS
-----------------------------------------------------------------------------------------
DEZSŐ SIMA, MEMBER, IEEE

The fierce demand for higher performance has provoked a dramatic evolution in the field of microprocessors. In this paper we show that this immense performance increase could only be achieved by the subsequent introduction of temporal, issue and intra-instruction parallelism, in such a way that exploiting the full potential along one dimension gives rise to the additional introduction of parallelism along a further dimension. Moreover, the debut of each basic technique used to implement parallelism along a given dimension inevitably calls for the introduction of further innovative techniques in order to fully capitalize on the potential of the basic technique. In this way an underlying deterministic framework can be identified for the fascinating evolution of microprocessors, which is presented in our paper.

Keywords: processor performance, microarchitecture, ILP, temporal parallelism, issue parallelism, intra-instruction parallelism.

I. INTRODUCTION

Since the birth of microprocessors in 1971 the IC industry has succeeded in maintaining an incredibly rapid increase in performance. This is demonstrated in Figure 1 by reviewing how the integer performance of the Intel family of microprocessors has been raised over the last 20 years [1], [2]. Given in terms of SPECint92, the performance has increased at an astonishingly large rate of about two orders of magnitude per decade.


[Figure 1 plots the relative integer performance (SPECint92, log scale from 0.2 to 1000) of the Intel x86 processors against the date of first volume shipments, from the 8088/5 (1979) through the 80286, 386, 486, Pentium, Pentium Pro and Pentium II to the PIII/600 (1999); the trend line corresponds to roughly a 100-fold increase per 10 years.]

Figure 1: The increase of the relative integer performance of the Intel x86 processors

This impressive development and all the innovative techniques which were necessary to achieve it have inspired a number of overview papers [3] - [7]. These reviews emphasized either the techniques introduced or the quantitative aspects of the evolution. In contrast, our paper focuses on the incentives and implications of the major steps of processor evolution, i.e. on the deterministic features of this evolution.

Our discussion begins in Section II with a reinterpretation of the notion of absolute processor performance. Our definition is aimed at giving the number of operations rather than the number of instructions executed by the processor per second. Based on this and on an assumed model of processor operation, we then identify the main dimensions of processor performance. In the subsequent Sections III - VI we discuss feasible approaches to achieving a performance increase along each main dimension. From these, we identify the basic techniques which became part of the mainstream evolution of processors. We also point out the implications of their introduction by highlighting resulting bottlenecks and the techniques brought into use to cope with them. Section VII reviews and discusses the main steps of processor evolution, whereas Section VIII summarizes its deterministic features.


II. THE DESIGN SPACE OF INCREASING PROCESSOR PERFORMANCE

Most recently used industry standard benchmarks, including the SPEC benchmark suite [8] - [10], Ziff-Davis's Winstone [11] and CPUmark ratings [12] or BAPCo's SYSmark scores [13], are relative performance measures. This means that they give an indication of how fast a processor will run a set of applications under given conditions in comparison to a reference installation. These benchmarks are commonly used for performance comparisons in processor presentations and in articles discussing the quantitative aspects of the evolution. However, for a discussion of the subsequent steps of processor evolution as well as of their incentives and implications, the notion of absolute processor performance is a more appropriate starting point. This notion reflects the operation speed of the processor-memory complex, and paves the way for an analysis of diverse ways to increase processor performance.

The notion of absolute processor performance (PP) is usually interpreted as the average number of instructions executed by the processor per second, and is nowadays typically given in units such as MIPS (million instructions per second) or GIPS (giga instructions per second). We note that earlier synthetic benchmarks, like Whetstone [14] or Dhrystone [15], were also given as absolute measures.

P_P is clearly a product of the clock frequency (f_C) and the average number of instructions executed per clock cycle, called the throughput (T_IPC), as indicated in Figure 2.

P_P = f_C * T_IPC  [MIPS, etc.]    (1)

where T_IPC is the throughput, interpreted as the average number of instructions executed per cycle.

[Figure 2 depicts a program (e.g. add r1,r2,r3; mul r4,r5,r6) being executed by a processor P.]

Figure 2: Usual, instruction-based interpretation of the notion of absolute processor performance

Nevertheless, by taking into account that recent processor developments increasingly involve multi-operation instructions, it is appropriate to reinterpret the notion of absolute processor performance by focusing on the number of operations rather than on the number of instructions executed per second. In this way, the notion of absolute processor performance more aptly reflects the work actually done. Here, again, the absolute processor performance (denoted in this case by P_PO) can be given as the product of the clock frequency (f_C) and the throughput (T_OPC), which is now interpreted as the average number of operations executed per cycle (see Figure 3).

P_PO = f_C * T_OPC  [MOPS, etc.]    (2)

where T_OPC is the throughput, interpreted as the average number of operations executed per cycle.

Figure 3: Operations-based interpretation of the notion of absolute processor performance
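Equations (1) and (2) are simple products and can be evaluated directly. The sketch below does so for a hypothetical processor; the clock frequency and throughput values are invented purely for illustration.

```python
# Sketch of equations (1) and (2): absolute performance as the product of
# clock frequency and throughput. All numbers are hypothetical.

def perf_instructions(f_c_hz: float, t_ipc: float) -> float:
    """P_P = f_C * T_IPC, in instructions per second (equation 1)."""
    return f_c_hz * t_ipc

def perf_operations(f_c_hz: float, t_opc: float) -> float:
    """P_PO = f_C * T_OPC, in operations per second (equation 2)."""
    return f_c_hz * t_opc

f_c = 200e6          # a hypothetical 200 MHz clock
t_ipc = 0.9          # 0.9 instructions issued per cycle on average
t_opc = t_ipc * 1.5  # 1.5 operations per instruction on average

print(round(perf_instructions(f_c, t_ipc) / 1e6, 1), "MIPS")  # 180.0 MIPS
print(round(perf_operations(f_c, t_opc) / 1e6, 1), "MOPS")    # 270.0 MOPS
```

The operations-based figure exceeds the instruction-based one exactly by the average number of operations per instruction, which is the point of the reinterpretation.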

Next, we want to express T_OPC by operational parameters of the processor. For this reason we assume the following model of processor operation (see Figure 4).

[Figure 4 shows instructions issued over a sequence of issue intervals s_1, s_2, ..., s_j, ..., s_m, with an example interval s_j in which n_ILP^j = 2 instructions are issued, carrying on average n_OPI^j = 1.5 operations each, and lasting n_CPI^j = 3 cycles. Legend:
n_ILP^j: number of instructions issued at the beginning of s_j
n_OPI^j: average number of operations included in the instructions issued in s_j
n_CPI^j: length of s_j (in cycles)]

Figure 4: Assumed model of processor operation

(a) We take for granted that the processor operates in cycles, issuing in each cycle 0, 1, ..., n_i instructions, where n_i is the issue rate of the processor.

(b) We allow instructions to include more than one operation.

(c) Out of the cycles needed to execute a given program we focus on those in which the processor issues at least one instruction. We call these cycles issue cycles and denote them by c_j, j = 1...m. The issue cycles c_j subdivide the execution time of the program into issue intervals s_j, j = 1...m, such that each issue interval begins with an issue cycle and lasts until the next issue cycle begins. s_1 is the first issue interval, whereas s_m is the last one belonging to the given program.

(d) We describe the operation of the processor by a set of three parameters which are given for each of the issue intervals s_j, j = 1...m. The set of the chosen parameters is as follows (see Figure 4):

n_ILP^j = the number of instructions issued at the beginning of the issue interval s_j, j = 1...m,

n_OPI^j = the average number of operations included in the instructions which are issued in the issue interval s_j, j = 1...m,

n_CPI^j = the length of the issue interval s_j in cycles, j = 1...m. Here n_CPI^m is the length of the last issue interval, which is interpreted as the number of cycles to be passed until the processor is ready to issue instructions again.

Then in the issue interval s_j the processor issues n_OPC^j operations per cycle, where:

n_OPC^j = (n_ILP^j * n_OPI^j) / n_CPI^j    (3)

Now let us consider n_OPC^j to be a stochastic variable, which is derived from the stochastic variables n_ILP^j, n_OPI^j and n_CPI^j, as indicated in (3). Assuming that the stochastic variables involved are independent, the throughput of the processor (T_OPC), that is the average value of n_OPC (denoted n̄_OPC), can be calculated from the averages of the three stochastic variables included, as indicated below:

T_OPC = n̄_OPC = (1/n̄_CPI) * n̄_ILP * n̄_OPI    (4)

Here the three factors 1/n̄_CPI, n̄_ILP and n̄_OPI express the temporal parallelism, the issue parallelism and the intra-instruction parallelism, respectively.

The components of T_OPC can be interpreted as follows:

n̄_CPI is the average length of the issue intervals. Assuming a given ISA (instruction set architecture), this reflects the temporal parallelism of the processor. For a traditional microprogrammed processor n̄_CPI >> 1, whereas for a pipelined processor n̄_CPI ≈ 1.

n̄_ILP is the average number of instructions issued per issue interval. This term indicates the issue parallelism of the processor. For a scalar processor n̄_ILP = 1, whereas for a superscalar one n̄_ILP > 1. Finally,

n̄_OPI shows the average number of operations per instruction, which reveals the intra-instruction parallelism. In the case of a traditional ISA n̄_OPI = 1, whereas e.g. for a VLIW processor n̄_OPI >> 1.
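As a small numeric illustration of this model, the sketch below evaluates a short hypothetical trace of issue intervals: it computes the average of the per-interval operations-per-cycle values of equation (3), and the factored form of equation (4) built from the three separate averages. All numbers are invented for illustration.

```python
# Numeric illustration of the model of Figure 4: each issue interval s_j is
# described by (n_ILP, n_OPI, n_CPI), and per interval the processor delivers
# n_OPC = n_ILP * n_OPI / n_CPI operations per cycle (equation 3).

intervals = [
    (2, 1.5, 3),   # 2 instructions, 1.5 ops each on average, 3 cycles long
    (1, 1.0, 1),
    (2, 1.0, 2),
    (1, 2.0, 1),
]

m = len(intervals)

# Average of the per-interval operations-per-cycle values:
n_opc_mean = sum(ilp * opi / cpi for ilp, opi, cpi in intervals) / m

# The factored form of equation (4), built from the three averages:
n_ilp_mean = sum(i for i, _, _ in intervals) / m
n_opi_mean = sum(o for _, o, _ in intervals) / m
n_cpi_mean = sum(c for _, _, c in intervals) / m
t_opc_factored = (1 / n_cpi_mean) * n_ilp_mean * n_opi_mean

print(round(n_opc_mean, 3), round(t_opc_factored, 3))
```

On this short, correlated trace the two values differ (1.25 versus roughly 1.18); the factored form of equation (4) rests on the independence assumption stated above, which an arbitrary four-interval trace does not satisfy.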

Based on this model, processor performance P_PO can be reinterpreted as:

P_PO = f_C * (1/n̄_CPI) * n̄_ILP * n̄_OPI    (5)

where the four factors correspond to the clock frequency, the temporal parallelism, the issue parallelism and the intra-instruction parallelism, respectively.

Here the clock frequency of the processor (f_C) depends first of all on the sophistication of the IC technology, whereas the remaining three components, the temporal, issue and intra-instruction parallelism, are determined mainly by the efficiency of the processor-level architecture, that is by both the ISA and the microarchitecture of the processor (see Figure 5).

[Figure 5 annotates equation (5): f_C reflects the sophistication of the technology, while (1/n̄_CPI) * n̄_ILP * n̄_OPI reflects the efficiency of the processor-level architecture (ISA/microarchitecture).]

Figure 5: Constituents of processor performance

This recognition provides an appealing framework for discussing the main possibilities to increase processor performance. Accordingly, the main possibilities for increasing processor performance are as follows: to increase the clock frequency, or to introduce and increase temporal, issue and intra-instruction parallelism, as summarized in Figure 6.

[Figure 6 annotates the four factors of equation (5) with the corresponding options: raising the clock frequency, and the introduction/increase of temporal, issue and intra-instruction parallelism.]

Figure 6: Main possibilities to increase processor performance

In the following sections we address each of these possibilities individually.


III. INCREASING THE CLOCK FREQUENCY

A. Increases in the Clock Frequency of the Processors

Figure 7 illustrates the phenomenal increase in the clock frequency of the Intel x86 line of processors [1] over the past two decades.

[Figure 7 plots the clock frequency (MHz, log scale from 1 to 1000) of the Intel x86 processors against the date of first volume shipments (1978-1999), from the 8088 through the 286, 386, 486, 486-DX2, 486-DX4, Pentium, Pentium Pro and Pentium II to the Pentium III; feature sizes from 1.5µ down to 0.25µ are annotated along the curve. The trend is roughly 10-fold per decade until the mid-1990s and roughly 100-fold per decade afterwards.]

Figure 7: Historical increase in the clock frequency of the Intel x86 line of processors

As Figure 7 indicates, the clock frequency was raised until the middle of the 1990s by approximately an order of magnitude per decade, and subsequently by about two orders of magnitude per decade. This massive frequency boost was achieved mainly by a continuous scaling down of the chips through improved IC process technology, furthermore by using longer pipelines in the processors and by improving the circuit layouts.
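The two decade-scale growth rates quoted above translate into annual factors by taking the tenth root; a one-line sketch (pure arithmetic, using no data beyond the decade factors named in the text):

```python
# Convert a per-decade growth factor into the equivalent annual factor.

def annual_factor(per_decade: float) -> float:
    return per_decade ** (1 / 10)

print(round(annual_factor(10.0), 3))   # 10x per decade  -> about 26% per year
print(round(annual_factor(100.0), 3))  # 100x per decade -> about 58% per year
```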

By taking into account the fact that an increase of processor performance may be achieved either by raising the clock frequency or by increasing the efficiency of the microarchitecture, or by both (see Figure 5), Intel's example of how it increased the efficiency of the microarchitecture in its processors is very telling.


[Figure 8 plots the efficiency of the microarchitecture (SPECint92 per 100 MHz, log scale) of the i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III against the year of first volume shipment (1985-1999); the increase up to the mid-1990s corresponds to roughly 10-fold per decade, with little change thereafter.]

Figure 8: Increase in the efficiency of the microarchitecture of Intel's x86 processors

As Figure 8 shows, the overall efficiency (cycle by cycle performance) of the Intel processors [1] was raised in the decade between 1985 and 1995 by about an order of magnitude. In this decade both the clock frequency and the efficiency of the microarchitecture were raised by about an order of magnitude, which resulted in an approximately two order of magnitude performance increase. But after the introduction of the Pentium Pro [16], Intel continued to use basically the same processor core in both its Pentium II [17] and Pentium III [18] processors. The enhancements introduced, including multimedia (MM) and 3D support, higher cache capacity, increased bus frequency etc., made only a marginal contribution to the efficiency of the microarchitecture for general purpose applications, as is reflected in the SPEC benchmark figures. Thus, Intel was forced to boost the clock frequency at about two orders of magnitude per decade in order to achieve the same rate of performance increase as before.

B. Implications of Increasing the Clock Frequency

The increase in the clock frequency of the processor brings along with it the need to raise the bandwidth of the processor bus in order to avoid a bottleneck in the system level architecture. This fact calls for a suitable and relevant development of processor bus standards.

[Figure 9 charts the major processor bus standards on a 1988-1999 timeline by maximum frequency and data width: ISA (8.33 MHz, 8-bit), EISA (8.33 MHz, 32-bit), PCI (33 MHz, 32-bit), PCI v. 2 (33 MHz, 64-bit), PCI v. 2.1 (66 MHz, 64-bit) and the PCI-X proposal (133 MHz, 64-bit).]

Figure 9: Evolution of processor bus standards

Accordingly, Figure 9 indicates how the major processor bus standards evolved in terms of their data width and maximum frequency. As depicted in the figure, the standardized 8-bit wide AT-bus, known as the ISA bus (Industry Standard Architecture) [19], was first extended to provide 32-bit data width (called the EISA bus [20]). This was subsequently replaced by the PCI bus and its wider and faster versions (PCI versions 2 and 2.1 [21] and the PCI-X proposal [22]). Figure 9 demonstrates that the maximum processor bus frequency was raised at roughly the same rate as the clock frequency of the processors.
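The bandwidth figures behind Figure 9 follow directly from data width and frequency. The sketch below computes the peak (burst) transfer rate of each standard, ignoring protocol overhead; widths and frequencies are those given in the figure.

```python
# Peak (burst) transfer rate of the bus standards of Figure 9, computed as
# data width in bytes times maximum bus frequency; overhead is ignored.

buses = {                    # name: (data width in bits, max frequency in Hz)
    "ISA":       (8,  8.33e6),
    "EISA":      (32, 8.33e6),
    "PCI":       (32, 33e6),
    "PCI v.2":   (64, 33e6),
    "PCI v.2.1": (64, 66e6),
    "PCI-X":     (64, 133e6),
}

peak = {name: bits / 8 * f_hz / 1e6 for name, (bits, f_hz) in buses.items()}

for name, mb_per_s in peak.items():
    print(f"{name:10s} {mb_per_s:7.1f} MB/s")
```

The jump from ISA's roughly 8 MB/s to the proposed PCI-X's roughly 1 GB/s mirrors the clock-frequency growth discussed above.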

IV. INTRODUCTION AND INCREASE OF TEMPORAL PARALLELISM

A. Overview of Possible Approaches to Introduce Temporal Parallelism

For a given ISA, n̄_CPI, the average length of the issue intervals, reflects the temporal parallelism of the instruction execution. A traditional von Neumann processor executes instructions strictly sequentially. In this case, n̄_CPI indicates the average execution time of the instructions in cycles. Usually, n̄_CPI >> 1. Obviously, RISC ISAs induce smaller n̄_CPI values than CISC ISAs.

For a given ISA, n̄_CPI can be reduced by overlapping the processing of subsequent instructions, that is by making use of temporal parallelism during instruction processing. Basically, there are three main possibilities for this: (a) overlap the fetch and execute phases, (b) overlap the execute phases, and (c) overlap all phases of instruction processing, as indicated in Figure 10.

[Figure 10 contrasts sequential processing with the three main approaches to temporal parallelism (reduction of n̄_CPI): overlapping the fetch and further phases (prefetching), overlapping the execute phases (either through multiple EUs or through pipelined EUs), and overlapping all phases (pipelined processors). Example machines cited include the Stretch (1961), Atlas (1963), CDC 6600 (1964), IBM 360/91 (1967) and CDC 7600 (1969) among mainframes, and the i80286 (1982), M68020 (1985), i80386 (1985), M68030 (1988) and R2000 (1988) among microprocessors.]

Figure 10: Main approaches to temporal parallelism (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle)

The superscripts after the machine or processor designations are references to the related machines or processors.

In this and in the subsequent figures the dates indicate the year of first shipment (in the case of mainframes) or that of first volume shipment (in the case of microprocessors).


(a) Overlapping the fetch and execute phases of subsequent instructions is called prefetching [23]. Assuming that the processor operates as indicated in Figure 10, this technique reduces the average execution time by one cycle compared to fully sequential processing. However, control transfer instructions (CTIs), which divert instruction execution from the sequential path, make prefetched instructions obsolete. This lessens the performance gain of prefetching to less than one cycle per instruction.

(b) The next possibility is to overlap the execution phases of subsequent instructions, either by providing multiple execution units (EUs) [24] or by pipelining the instruction execution phase [25], [26]. If the processor has multiple EUs, subsequent instructions can be executed in different EUs. In contrast, pipelined EUs allow a processor to accept a new instruction for execution in every new clock cycle, provided that no dependencies between subsequent instructions exist. In this way, elements of vectors can be processed in a more effective way than in sequential processing. Thus, overlapping the execution phases of subsequent instructions results in a substantial performance gain.

(c) Finally, the ultimate solution to exploit temporal parallelism is to extend pipelining to all phases of instruction processing [27], [28]. Fully pipelined instruction processing results in a one-cycle mean time between subsequent instructions (n̄_CPI = 1), provided that the instructions processed are free of dependencies. Dependencies give rise to delays and increase n̄_CPI. The related processors are known as pipelined processors. They include, in principle, multiple pipelined EUs. But, for efficiency reasons, the execution phase of some instructions (such as division or square root calculation) is not pipelined. This fact causes a slight increase of n̄_CPI. With pipelined processors, instruction level parallel processors, or ILP processors for short, arrived. In fact, both prefetching and overlapping of the execution phases of subsequent instructions provide partial parallel execution. Nevertheless, processors providing these techniques alone are usually not considered to be ILP processors.
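The way dependencies push n̄_CPI above its ideal value of 1 can be sketched with a simple additive stall model. This is a common textbook approximation, not a formula from this paper, and the hazard frequencies below are hypothetical:

```python
# Additive stall model for a fully pipelined processor: the ideal CPI of 1
# grows by (frequency * penalty) for each class of hazard. Numbers invented.

def effective_cpi(base_cpi: float = 1.0, hazards=()) -> float:
    """hazards: iterable of (fraction_of_instructions, penalty_cycles)."""
    return base_cpi + sum(freq * penalty for freq, penalty in hazards)

# E.g. 20% branches costing one bubble each, and 5% of instructions
# stalling one cycle on an operand dependency:
print(round(effective_cpi(hazards=[(0.20, 1), (0.05, 1)]), 2))  # 1.25
```

Even modest per-instruction stall rates thus keep the achieved n̄_CPI noticeably above 1, which motivates the supporting techniques discussed below.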

Of the possibilities mentioned above, which were introduced into mainframes in the 1960s (see Figure 10), the first 16-bit microprocessors made use of prefetching. Subsequent processors introduced pipelined instruction processing because it had the highest performance/cost potential among the alternatives discussed. Thus, pipelined processors became the first major step in the evolution of the microarchitecture of prevailing microprocessors. As Figure 11 shows, pipelined microprocessors emerged and came into widespread use in the second half of the 1980s.

Here we note that the very first step of the evolution of microprocessors was marked by increasing the word length from 4 bits to 16 bits, as exemplified by the Intel processors 4004 [34], 8008, 8080 and 8086 [35]. This evolution gave rise to the introduction of a new ISA for each wider word length until 16-bit ISAs arrived. For this reason we discuss the evolution of processor performance beginning with the 16-bit ISAs.


[Figure 11 shows, on a 1980-1992 timeline, when pipelined processors appeared in three lines: the 80286, 80386 and 80486 in the x86 line; the 68020, 68030 and 68040 in the M68000 line; and the R2000, R3000, R6000 and R4000 in the MIPS R line.]

Figure 11: The introduction of pipelined processors

C. Implications of the Introduction of Pipelined Instruction Processing

1) Overview: The introduction of pipelined instruction processing, however, leads to two problems which undermine the performance gain. First, pipelining gives rise to a memory bottleneck, and second, pipelined processing of CTIs (control transfer instructions) substantially impedes the achievable performance gain, as detailed below. Thus, in order to fully exploit the potential of pipelined instruction processing, two appropriate techniques also need to be introduced: caches and speculative branch processing.

2) The memory bottleneck and the introduction of caches: The first major implication of pipelined instruction processing is that it has a higher memory bandwidth requirement than sequential processing, a demand which cannot be met by the slow memory subsystem. Consider that a pipelined processor intends to fetch a new instruction in every new clock cycle. This clearly calls for a higher memory bandwidth while the processor fetches instructions. Furthermore, overlapped instruction processing results in more frequent load and store operations and reading and writing of memory operands in the case of memory architectures. Consequently, pipelined instruction processing requires a higher memory bandwidth for both instructions and data. As the memory is typically inherently slower than the processor, the increased memory bandwidth requirement calls for a memory enhancement. Thus, pipelined instruction processing accelerated and made inevitable the introduction of caches. With caches, frequently used program segments (e.g. loops) can be held in a fast memory, which allows instruction and data requests to be served at a higher rate. For this reason, caches, pioneered in the IBM 360/85 [36] in 1968, came into widespread use in microprocessors in the second half of the 1980s, in essence along with the introduction of pipelined instruction processing (see Figure 12).
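The benefit of a cache can be quantified with the usual average-memory-access-time formula. This is a standard approximation, not taken from this paper, and all latencies below are hypothetical:

```python
# Average memory access time (AMAT) with and without a cache: a fast hit
# path serves most requests, and only misses pay the full memory latency.
# All latencies (in processor cycles) are invented for illustration.

def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

no_cache = 10.0                      # every access pays full memory latency
with_cache = amat(1.0, 0.05, 10.0)   # 1-cycle hit, 5% of accesses miss

print(no_cache, with_cache)  # 10.0 1.5
```

Under these assumed numbers the cache cuts the average access time from 10 cycles to 1.5, which is what allows a pipeline to keep fetching an instruction nearly every cycle.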


[Figure 12 shows, on the 1980-1992 timeline of Figure 11, when caches and speculative branch processing appeared in the x86, M68000 and MIPS R lines, e.g. the 80486 with C(8), Spe and the 68040 with C(4,4), Spe. Legend: C(n) = universal cache, size in kB; C(n/m) = instruction/data cache, sizes in kB; Spe = speculative execution of branches.]

Figure 12: The introduction of caches and speculative branch processing

3) The performance degradation caused by CTIs and the introduction of speculative branch processing: The trouble with pipelined processing of CTIs is that until a CTI is recognized in the decode stage, the processor has already fetched the next sequential instruction instead of the branch target instruction, which is needed in most cases. As a consequence, pipelined processing of each unconditional CTI gives rise to at least one wasted cycle, known as a bubble. This clearly impedes performance.

Pipelined execution of conditional CTIs reduces processor performance even more. Consider that in the case of conditional CTIs the processor needs to know the specified condition prior to deciding whether to fetch the next sequential instruction or the branch target instruction. Thus, a long latency instruction, such as a division, can cause dozens of wasted cycles if the specified condition of a subsequent CTI relates to its result. A vital reduction in the number of wasted cycles can be achieved by introducing speculative branch processing (speculative execution of CTIs) [37] - [40]. With speculative branch processing the processor simply makes a guess whether the fetched CTI will cause branching or not, and continues to fetch and to process instructions along the presumed path. Later, when the specified condition becomes known, the processor checks whether it guessed right. In the case of a correct guess it acknowledges the instructions processed; else it cancels the incorrectly executed instructions and resumes execution along the correct path. As Figure 12 demonstrates, microprocessors introduced speculative execution of CTIs basically along with pipelining.
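One classic guessing scheme usable for speculative branch processing is a 2-bit saturating counter kept per branch. The sketch below illustrates the idea; it is a generic textbook predictor design, not the mechanism of any particular processor or reference named above:

```python
# A 2-bit saturating counter branch predictor: two consecutive wrong
# guesses are needed to flip the prediction, so occasional deviations
# (e.g. a loop exit) do not destroy a well-established guess.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1: predict not taken; 2,3: predict taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 7 + [False]  # a loop-like branch: taken 7 times, then exits
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)

print(correct, "of", len(outcomes), "guesses correct")  # 5 of 8
```

After the counter warms up, the strongly-taken state absorbs the single not-taken outcome, which is exactly why such schemes work well on loop branches.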

Thus, in order to exploit the intrinsic potential of pipelined instruction processing, it is mandatory to simultaneously introduce both caches and speculative branch processing.

4) Limits of utilizing temporal parallelism: With the massive introduction of temporal parallelism into pipelined instruction processing, the average length of the issue intervals was decreased to almost one clock cycle. But n̄_CPI = 1 marks the limit achievable through temporal parallelism. It follows that processor performance could only be further increased if additional parallelism was introduced along a second dimension as well. There are two possibilities for this: either to introduce issue parallelism or intra-instruction parallelism. Following the evolutionary path, next we discuss the former.

V. INTRODUCTION AND INCREASE OF ISSUE PARALLELISM

A. Introduction of Superscalar Instruction Issue

Superscalar instruction issue [5], [41] represents a second dimension of parallelism in the operation of processors. By enabling the processor to issue more than one instruction per clock cycle, the average number of issued instructions per issue interval (n̄_ILP) is increased from 1 (for pipelined processors) to higher values. Clearly, this results in a proportional increase of the processor performance.

We note that with pipelined instruction processing the average length of the issue intervals (n̄_CPI) is approximately one cycle. Then n̄_ILP represents roughly the average number of instructions issued per cycle (n̄_IPC).

As Figure 13 indicates, superscalar processors appeared around 1990, soon after the potential of pipelined instruction processing became fully utilized. Superscalars rapidly became predominant in all major processor lines.

[Figure 13 charts, on a 1987-1996 timeline, the transition to superscalar models in the major processor lines, with issue rates in brackets. CISC lines: Intel x86: i486, Pentium (2); M 68000: M 68040, M 68060 (2); Gmicro: Gmicro/100p, Gmicro 500 (2); AMD K5: K5 (~2)^3; Cyrix M1: M1 (2). RISC lines: Intel 960: 960KA/KB, 960CA (3); M 88000: MC 88100, MC 88110 (2); HP PA: PA 7000, PA 7100 (2); SPARC: MicroSparc, SuperSparc (3); MIPS R: R 4000^1,2, R 8000 (4); Am 29000: 29040, 29000 sup (4); IBM Power: RS/6000, Power1 (4); DEC Alpha: Alpha 21064 (2); PowerPC: PPC 601 (3), PPC 603 (3).]

1 We do not take into account the low-cost R 4200 (1992), since superscalar architectures are intended to extend the performance of the high-end models of a particular line.
2 We omit processors offered by manufacturers other than MIPS Inc., such as the R 4400 (1994) from IDT, Toshiba and NEC.
3 The K5 issues 4 RISC-like microoperations per cycle, roughly equivalent to two CISC instructions per cycle.

The figures in brackets denote the issue rate of the processors.

Figure 13: The appearance and spread of superscalar processors

B. Implications of the Introduction of Superscalar Instruction Issue

1) Overview: With the introduction of superscalar instruction issue two new implementation problems arose: issue and decode bottlenecks. These bottlenecks sharply limit the achievable performance gain unless they are resolved by appropriate techniques. In the subsequent sections we discuss these bottlenecks and the techniques elicited to cope with them.


2) Issue Bottlenecks and the Techniques Used to Counteract Them: When superscalarinstruction issue is implemented in a straightforward way, control, data and resourcedependencies between subsequent instructions cause issue blockages. This leads to anissue bottleneck which restricts the average number of issued instructions per cycle(a n IPC) in general purpose applications to about two [42], [43]. In Figure 14 weillustrate how the issue bottleneck arises assuming a four-wide superscalar processor,which issues instructions in an unsophisticated way, called the straightforward directissue scheme.

In this case a four-wide superscalar processor has a four-instruction wide instruction window that holds the last four entries of the instruction buffer. This is the window from where the processor issues executable instructions to the execution units (EUs).

Figure 14: Principle of the direct issue scheme. (In the figure we have assumed a four-instruction wide issue window; dependent instructions block instruction issue from the window to the EUs.)

In the direct issue scheme the processor checks in every clock cycle the instructions held in the instruction window for control, data and resource dependencies, and forwards executable instructions directly to the EUs. A direct issue scheme is called straightforward if the processor issues instructions in order and needs to issue all instructions from the window before it refills it with the subsequent four instructions. In Figure 15 we demonstrate how this scheme works, assuming that in cycle ci instructions i2 and i4 are dependent on instructions which are actually in execution. In cycle ci instruction i1 will be issued, but neither i2 nor any of the subsequent instructions is issued, since i2 blocks instruction issue. Let us assume that afterwards both i2 and i4 become executable in the next cycle (ci+1). Thus, instruction i2 and all subsequent instructions will be issued from the window. In the next cycle (ci+2) the window is filled with the subsequent four instructions (i1-i4) and the issue process resumes in a similar way as described before.


Figure 15: Instruction issue in a simple direct issue scheme. (Executable and dependent instructions i1-i4 are shown in the issue window over the cycles ci, ci+1 and ci+2.)

By taking into account that in general purpose programs approximately every fifth instruction is a conditional branch and conditional branches are often unresolved when they come to execution, it becomes plausible that in the assumed issue scheme issue blockages restrict the average number of issued instructions to about two per cycle.
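The throughput loss caused by issue blocking can be illustrated with a toy model of the straightforward direct issue scheme (a hypothetical Python sketch, not from the paper; the one-cycle wake-up assumption for blocked instructions is ours):

```python
def issued_per_cycle(instructions, window_size=4):
    """instructions: booleans, True = executable in the current cycle.
    Returns the number of instructions issued in each cycle."""
    issued = []
    i = 0
    while i < len(instructions):
        window = instructions[i:i + window_size]
        # in-order issue stops at the first blocked instruction
        n = 0
        while n < len(window) and window[n]:
            n += 1
        issued.append(n)
        # assume blocked instructions become executable one cycle later,
        # so the rest of the window drains in a second cycle
        if n < len(window):
            issued.append(len(window) - n)
        # the window is refilled only after it has been fully emptied
        i += window_size
    return issued

# cycle ci of Figure 15: i1 executable, i2 dependent, i3 executable, i4 dependent
print(issued_per_cycle([True, False, True, False]))   # [1, 3]
```

With i2 and i4 dependent, the four instructions need two cycles, i.e. an average of two issued instructions per cycle, in line with the measurements cited above.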

The issue bottleneck discussed above can be prevented by introducing a set of three advanced techniques: (a) speculative branches, (b) instruction shelving and (c) register renaming.

(a) Speculative branches have already been introduced in pipelined processors in order to avoid wasted pipeline cycles caused by both unconditional and conditional branches. This technique contributes effectively to removing the issue bottleneck as well. Consider here that during speculative branch processing the processor makes a guess about the outcome of each conditional branch and resumes fetching and issuing instructions along the presumed path, as described before. In this way, unresolved but speculated conditional branches do not cause issue blockages. In other words, speculated conditional branches result in a virtual enlargement of the basic blocks, as indicated in Figure 16. In this figure we assume that the processor is able to speculate only until the next conditional branch. Along the speculated path only data and resource dependencies can restrict instruction issue. Clearly, this reduces the frequency of issue blockages to a significant degree.


Figure 16: Virtual enlargement of the basic blocks due to speculative branch processing. (Along the guessed path, consecutive basic blocks delimited by conditional branches merge into a virtual basic block.)

How far speculative branch processing enlarges basic blocks depends primarily on how many consecutive conditional branches the processor is able to speculate. Clearly, the more consecutive branches a guessed path may involve, the larger the basic blocks virtually become and the higher is the achievable throughput gain. But at the same time a wrong guess results in more wasted cycles, which leads to a practical limit for the depth of consecutive guesses.
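The "guess" step itself is typically implemented with a simple history-based predictor. The paper does not prescribe a particular scheme, so the following sketch uses the classic 2-bit saturating counter merely as an illustration:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken."""
    def __init__(self):
        self.state = 2          # start in the weakly 'taken' state

    def predict(self):
        return self.state >= 2

    def update(self, taken):    # move the counter toward the actual outcome
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
outcomes = [True, True, False, True, True]   # a loop branch, mostly taken
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes), "guesses correct")   # 4 of 5 guesses correct
```

The single mispredicted iteration does not flip the prediction, which is exactly why two bits of history tolerate the occasional loop exit better than one.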

(b) The second technique used to remove the issue bottleneck is instruction shelving (dynamic instruction issue) [4], [5], [44], [45]. Shelving assumes that the processor has dedicated buffers, called shelving buffers, in front of the EUs, as shown in Figure 17. With shelving the processor first issues the instructions into the respective shelving buffers without checking for data or control dependencies or for busy execution units. Issued instructions remain in the shelving buffers until they can be dispatched for execution. Thus, with shelving the processor is able to issue in each cycle as many instructions into the shelving buffers as its issue rate, provided that no hardware restrictions confine the issue, such as missing free buffer entries or datapath restrictions where multiple instructions need to be issued into the same buffer.


Figure 17: The principle of shelving, assuming that the processor has individual shelving buffers (called reservation stations) in front of the execution units. (Instructions are issued to the shelving buffers without checking for dependencies; shelved instructions are scheduled for execution by the dependency checking/dispatch logic of each EU.)

Shelving improves processor performance not only by removing the issue bottleneck but also by significantly widening the instruction window. Without shelving the processor tries to find executable instructions in a small instruction window, whose width equals the issue rate (usually 2-4). When shelving is used, the processor scans the instructions held in the shelving buffers for executable ones. Thus, with shelving the instruction window is comprised of all occupied places in the shelving buffers. Clearly, the maximum width of the window equals the total capacity of all shelving buffers available, but its actual width changes dynamically. Since typically many more shelving buffer entries are available than the issue rate of the processor, shelving usually greatly widens the instruction window as well.
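The mechanism can be sketched as follows (a hypothetical Python model; the EU assignment, register names and single-cycle latencies are our simplifying assumptions):

```python
from collections import deque

def run(program, n_eus=3, issue_rate=4):
    """program: list of (eu, srcs, dst); returns the number of cycles
    needed until every instruction has been dispatched."""
    buffers = [deque() for _ in range(n_eus)]
    ready = set()               # registers whose values are available
    pc, cycles, done = 0, 0, 0
    while done < len(program):
        cycles += 1
        # issue stage: up to issue_rate instructions enter the shelving
        # buffers with no dependency checking at all
        for _ in range(issue_rate):
            if pc < len(program):
                eu, srcs, dst = program[pc]
                buffers[eu].append((srcs, dst))
                pc += 1
        # dispatch stage: each EU picks its oldest instruction whose
        # operands were produced in an earlier cycle
        avail = set(ready)
        for buf in buffers:
            for k, (srcs, dst) in enumerate(buf):
                if all(s in avail for s in srcs):
                    del buf[k]
                    ready.add(dst)
                    done += 1
                    break
    return cycles

prog = [(0, [], "r1"),        # r1 <- ...    (independent)
        (1, ["r1"], "r2"),    # r2 <- f(r1)  (RAW dependency on r1)
        (2, [], "r3")]        # r3 <- ...    (independent)
print(run(prog))              # 2 cycles: r1 and r3 first, then r2
```

Note that the dependent instruction (r2) does not block issue; it merely waits in its shelving buffer while independent work proceeds on the other EUs.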

(c) The third major technique which is necessary to make superscalar instruction issue really effective is register renaming [4], [5], [39]. This technique aims at removing false register data dependencies. These are write after read (WAR) and write after write (WAW) dependencies of register data between instructions in execution and those to be issued, or among the instructions to be issued at the same time. The elimination of false data dependencies increases the rate of executable instructions in the instruction window, leading to clear performance gains.
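A minimal rename-stage sketch (hypothetical code; the physical register naming is ours) shows how fresh destination registers dissolve WAW and WAR dependencies while true (RAW) dependencies survive through the mapping:

```python
def rename(program, n_arch=8):
    """program: list of (dst, src1, src2) architectural register numbers.
    Returns the program rewritten onto physical registers."""
    mapping = {r: f"p{r}" for r in range(n_arch)}   # arch -> physical
    next_phys = n_arch
    renamed = []
    for dst, s1, s2 in program:
        ps1, ps2 = mapping[s1], mapping[s2]   # RAW deps follow the current map
        mapping[dst] = f"p{next_phys}"        # fresh dest: WAW/WAR dissolve
        next_phys += 1
        renamed.append((mapping[dst], ps1, ps2))
    return renamed

# r1 = r2 op r3 ; r1 = r4 op r5  -- a WAW pair on r1
print(rename([(1, 2, 3), (1, 4, 5)]))   # [('p8', 'p2', 'p3'), ('p9', 'p4', 'p5')]
```

After renaming, the two writes target distinct physical registers (p8, p9), so both instructions may execute in any order or in parallel.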

Both shelving and renaming were introduced to remove the issue bottleneck in the second wave of superscalars around the middle of the 1990s (see Figures 18 and 19). Here we note that in Figures 18, 19 and 21 the numbers in brackets after the processor designations indicate the issue rate. Furthermore, for simplification we give references to the given processors only in Figure 18. Both techniques mentioned emerged more or less in two phases. First, these techniques were introduced only for a single or a few instruction types, e.g. for floating point instructions, known as partial shelving and partial renaming. Subsequently, their use spread to all instruction types.


Figure 18: Introduction of shelving, for RISC and CISC processors over the years 1990-1999. (The numbers in brackets denote the issue rate; the figure distinguishes partial shelving from full shelving. Notes: * PPC designates PowerPC. ** The Nx586 has scalar issue for CISC instructions but a 3-way superscalar core for converted RISC instructions. *** The issue rate of the Power2 and P2SC is 6 along the sequential path while only 4 immediately after a branch.)

Figure 19: Introduction of register renaming, for RISC and CISC processors over the years 1990-1999. (The numbers in brackets denote the issue rate; the figure distinguishes partial renaming from full renaming. Notes: * PPC designates PowerPC. ** The Nx586 has scalar issue for CISC instructions but a 3-way superscalar core for converted RISC instructions. *** The issue rate of the Power2 and P2SC is 6 along the sequential path while only 4 immediately after a branch.)

3) The Decode Bottleneck and Predecoding: With the introduction of superscalar instruction issue, the issue process becomes considerably more complex than it was in the scalar case, for many reasons. First of all, superscalar issue forces the processor to decode multiple instructions in a single clock cycle. Furthermore, shelving and renaming, which are used to resolve the issue bottleneck, make instruction issue even more intricate. In the case of shelving, for example, the processor also needs to decide to which shelving buffer each of the decoded instructions should be issued, and for each of the decoded instructions it needs to check whether there are any issue restrictions, such as missing free buffer entries etc. Additionally, the issue policy of the processor may also call for fetching the operands of the issued instructions during the issue cycle. Renaming also burdens the processor with additional tasks that need to be performed in


connection with instruction issue. Since these tasks would excessively lengthen the issue cycle, renaming is usually performed in a separate cycle dedicated solely to renaming. But even without renaming, higher issue rates (rates of 3 or higher) can unduly lengthen the time needed to perform the complex issue task. This either decreases the clock frequency or gives rise to multiple issue cycles, which increases the penalty of mispredicted branches. Thus, in order to avoid the impediments of higher issue rates the complexity of the issue task should be reduced. An appropriate technique for this is predecoding [39], [45].

The fundamental idea behind predecoding is to perform partial decoding already when the processor fetches instructions into the instruction buffer, as indicated in Figure 20. During predecoding the processor may decode the instruction type, recognize branches, determine the length of the instruction in the case of a CISC processor, etc.

Figure 20: The basic idea of predecoding. (Instructions arrive from the second-level cache or memory at typically 128 bits/cycle; when they are written into the I-cache, the predecode unit appends 4-7 bits to each RISC instruction, so the I-cache delivers e.g. 148 bits/cycle.)
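As a rough illustration, the predecode step can be modeled as a pure function applied while a cache line is filled. All encodings below (the opcode field position, the branch opcodes, the EU classes) are invented for this sketch:

```python
def predecode(word):
    """Append a few classification bits to a 32-bit instruction word."""
    opcode = word >> 26                  # hypothetical 6-bit opcode field
    is_branch = opcode in (0x12, 0x13)   # hypothetical branch opcodes
    eu_class = opcode & 0x3              # hypothetical EU-routing class
    tag = (is_branch << 2) | eu_class    # 3 predecode bits
    return (word, tag)                   # stored side by side in the I-cache

line_in = [0x48000010, 0x7C221A14]       # raw words arriving from L2/memory
line_out = [predecode(w) for w in line_in]
print(line_out)
```

Because the tag travels with the instruction in the I-cache, the issue stage can route and classify instructions without re-deriving these facts in its critical path.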

Predecoding emerged with the second wave of superscalars about the middle of the 1990s and soon became a standard feature in RISC processors with issue rates of four, as shown in Figure 21. We note that recent CISC processors typically have a lower issue rate than RISC processors (in most cases three) and usually include a RISC/CISC conversion [16]-[18], [39], [45], [81], [84], [87], which decouples decoding and instruction issue. Thus, recent CISC processors have much less need for predecoding than RISCs.


Figure 21: The emergence of predecoding, for RISC and CISC processors over the years 1990-1999. (The numbers in brackets denote the issue rate. Notes: * PPC designates PowerPC. ** The Nx586 has scalar issue for CISC instructions but a 3-way superscalar core for converted RISC instructions. *** The issue rate of the Power2 and P2SC is 6 along the sequential path while only 4 immediately after a branch.)

All in all, in order to fully utilize the potential of issue parallelism, both the issue and the decode bottlenecks should be removed. This calls on the one hand for the use of speculative branches, shelving and renaming, and on the other for the introduction of predecoding, at least in RISC processors.

In addition, we note that superscalar instruction issue also gives rise to higher memory traffic. Thus, superscalar processors require larger and more efficient caches than their scalar peers.

4) Requirements for Further Increases in Throughput: Clearly, the starting point for further increases in the throughput of a superscalar processor is to raise its issue rate. However, raising the issue rate will only result in an increase in throughput if all relevant cross sections of the microarchitecture are augmented appropriately and no processing bottlenecks arise. It follows that for a further increase in throughput it is necessary to raise the issue rate and at the same time to augment all further pertinent cross sections of the microarchitecture. This basically means:

- widening the instruction window by providing more shelving buffers,
- increasing the dispatch rate by providing more and/or faster EUs,
- increasing the retire rate,
- improving the accuracy of the branch prediction,
- enhancing the cache subsystem by using larger and faster caches and/or faster buses,
- exceeding the dataflow limit, which is caused by RAW dependencies between subsequent instructions, by introducing speculative loads [61], [66], [70], [80] and/or store forwarding [40], [72], [78], [80], [83], [85], [86].

Related research work focuses on the last mentioned topic. In order to exceed the dataflow limit various novel techniques have been proposed, such as value prediction [87]-[90], value reuse [91]-[93], dynamically speculated loads [96], [98], speculative store forwarding [40], load value prediction [40], [98] and load value reuse [85], [93], [98].

5) Limits of Utilizing Issue Parallelism: Obviously, there is a practical limit beyond which the average issue rate cannot be increased efficiently. This limit is set by the


instruction level parallelism available in the executed programs. General purpose programs include an average instruction level parallelism of about 4-8 [99], and recent processors approach an average issue parallelism of about four. Given this, it seems that there is not too much room for an efficient increase in the throughput of present superscalars. Further performance increases can only be achieved by the introduction of parallelism along the third dimension, that of intra-instruction parallelism.

VI. INTRODUCTION AND INCREASE OF INTRA-INSTRUCTION PARALLELISM

A. Major Approaches to Introduce Intra-instruction Parallelism

The third dimension of parallelism which can be utilized to boost processor performance is to include multiple operations into the instructions, that is, to introduce intra-instruction parallelism.

Traditional instructions include only a single data operation per instruction, i.e. nOPI = 1. As Figure 22 indicates, there are basically three different approaches for the introduction of intra-instruction parallelism: (a) three-operand instructions, (b) SIMD instructions and (c) VLIW instructions.

Figure 22: Possibilities to introduce intra-instruction parallelism. (Three approaches are compared: (a) three-operand instructions (nOPI = 2, e.g. i = a*b+c; ISA extension; dedicated use, average nOPI -> 1+ε), (b) SIMD instructions (FX-SIMD with 2/4/8/16 operations for MM support, FP-SIMD with 2 operations for 3D support; ISA extension; dedicated use, average nOPI ~ 2), and (c) VLIW instructions (narrow VLIWs with 2/3 operations, e.g. EPIC, and wide VLIWs with ~n*10 operations; new ISA; general use, average nOPI >> 2). nOPI: number of operations per instruction.)

(a) Three-operand instructions include two operations in the same instruction. The most widely used one is the multiply-add instruction (multiply-and-accumulate or fused multiply-add instruction), which evaluates the expression x = a * b + c, the basic step of a dot product, for floating-point data. Clearly, the introduction of three-operand instructions calls for an appropriate extension of the ISA.


Multiply-add instructions were introduced in the early 1990s into the POWER [100], PowerPC [101], PA-RISC [102] and MIPS-IV [103] ISAs and into the respective models. The multiply-add instruction is effective only for numeric computations. Thus, in general purpose applications it only marginally increases the average number of operations per instruction (nOPI).
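The effect is easy to see on a dot product, where each element pair needs one multiply-add instead of a separate multiply and a separate add (a scalar Python sketch, not tied to any of the ISAs above):

```python
def dot(xs, ys):
    """Dot product built from repeated a*b+c (multiply-add) steps."""
    acc = 0.0
    for a, b in zip(xs, ys):
        acc = a * b + acc     # one multiply-add per element pair
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0
```

With a fused multiply-add, each loop iteration maps to a single instruction, halving the instruction count of the inner loop.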

(b) SIMD instructions allow the same operation to be performed on more than one set of operands. E.g., in Intel's MMX multimedia extension [104], the

PADDW MM1, MM2

SIMD instruction carries out four fixed point additions on the four 16-bit operand pairs held in the 64-bit registers MM1 and MM2 (which are actually aliased to the floating-point registers FP1, FP2) and places the results into register MM1.
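The semantics of this PADDW example can be modeled behaviorally as four independent 16-bit lane additions with wrap-around arithmetic (a sketch; MMX also provides saturating variants such as PADDSW):

```python
def paddw(mm1, mm2):
    """Wrap-around addition of the four 16-bit lanes of two 64-bit values."""
    result = 0
    for lane in range(4):
        a = (mm1 >> (16 * lane)) & 0xFFFF
        b = (mm2 >> (16 * lane)) & 0xFFFF
        result |= ((a + b) & 0xFFFF) << (16 * lane)   # each lane wraps alone
    return result

mm1 = 0x0001_0002_0003_0004
mm2 = 0x0010_0020_0030_0040
print(hex(paddw(mm1, mm2)))   # 0x11002200330044
```

The key point is that a carry out of one lane never propagates into the next, which is what distinguishes the SIMD addition from one 64-bit addition.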

As Figure 22 indicates, SIMD instructions may be defined either for fixed point data or for floating point data. Fixed point SIMD instructions support multimedia applications, i.e. multiple (2/4/8/16) operations on pixels, whereas floating point SIMD instructions accelerate 3D graphics by executing usually two floating point operations in parallel. Clearly, the introduction of SIMD instructions into a traditional ISA requires an appropriate ISA extension.

Fixed point SIMD instructions were pioneered in 1993-1994 in the MC88110 [49] and PA-7100LC [105] processors, as shown in Figure 23. Driven by the spread of multimedia applications, SIMD extensions soon became a standard feature of most established processor families (such as AltiVec from Motorola [106], MVI from Compaq [107], MDMX from MIPS [108], MAX-2 from hp [109], VIS from Sun [110] or MMX from Intel [111]). Floating point SIMD instructions, such as those found in 3DNow from AMD, CYRIX and IDT [112] and SSE from Intel [113], have emerged since 1998, first in the K6-2 [82], K6-3 [83] and Pentium III [18], followed by the G4 [64] and K7 [84], in order to support 3D applications.

Figure 23: The emergence of FX-SIMD and FP-SIMD instructions in processors. (The figure distinguishes multimedia support (FX-SIMD) from 3D support (FP-SIMD), for RISC and CISC processors over the years 1992-1999.)


Clearly, multimedia and 3D support results in a considerable performance boost merely in the respective applications. For instance, Intel published a per cycle performance gain of about 37 % for its Pentium II providing multimedia support, as compared to the PentiumPro, based on Media Benchmark ratings [114]. Furthermore, measured scores of 3D Winbench99 - 3D Lighting and Transformation Test runs reveal a cycle by cycle gain of about 61 % for the Pentium III over the Pentium II [115]. In contrast, for general purpose applications Pentium IIs offer only a slight (3-5 %) cycle by cycle performance increase over PentiumPros, whereas Pentium IIIs show a similarly small cycle by cycle benefit over the Pentium IIs in terms of SPECint95 ratings [1].

(c) The third major possibility to introduce intra-instruction parallelism is the VLIW (Very Long Instruction Word) approach. In this case available EUs or groups of EUs are simultaneously controlled by different subwords of the same instruction word. As a consequence, VLIW instruction words often become very long.

VLIWs are scheduled statically, which means that the compiler is responsible for keeping track of and resolving all types of dependencies. In addition, the compiler is expected to perform a parallel optimization in order to find enough executable operations to provide a high throughput.

It is useful to differentiate between narrow VLIWs and wide VLIWs. Narrow VLIWs, which include only two or three subinstructions, emerged in the second half of the 1970s (such as the FPS 120B [116], FPS-164 [116] and CDC AFP [117]). They were used as attached processors to boost the performance of minicomputers in numeric applications and were usually designated as array processors. Narrow VLIWs disappeared with the decline of minicomputers. Nevertheless, along with Intel's IA-64 (Merced [118]) to be introduced in 2000, they seem to be enjoying a revival for general purpose applications and may be poised to become rivals of superscalars.

In contrast, instructions in wide VLIWs usually incorporate a large number of operations (typically on the order of 10). Wide VLIWs emerged as prototypes in the first half of the 1980s (Polycyclic architecture [119], ELI-512 [120]), followed by two commercial machines (TRACE [121], Cydra-5 [122]) in the second half of the 1980s. Wide VLIWs disappeared quickly from the market because of inadequate software support and further problems detailed in [4].

Unlike three-operand and SIMD instructions, VLIWs are intended for general purpose use. As far as the average number of operations per instruction (nOPI) is concerned, VLIWs, and especially wide VLIWs, can be expected to include on average considerably more than one operation per instruction.
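The static-scheduling burden placed on the compiler can be illustrated with a toy list scheduler that greedily packs independent operations into fixed-slot instruction words (hypothetical code; real VLIW compilers also model latencies, registers and resource conflicts):

```python
def pack_vliw(ops, slots=3):
    """ops: list of (dst, srcs); returns a list of VLIW words, each a
    list of at most `slots` mutually independent operations."""
    words, scheduled = [], set()
    remaining = list(ops)
    while remaining:
        word, defs = [], set()
        for op in list(remaining):
            dst, srcs = op
            # an op may join this word only if all of its sources were
            # produced by earlier words (not by ops in this same word)
            if len(word) < slots and all(s in scheduled for s in srcs):
                word.append(op)
                defs.add(dst)
                remaining.remove(op)
        if not word:
            raise RuntimeError("cyclic dependencies, cannot schedule")
        scheduled |= defs
        words.append(word)
    return words

ops = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", [])]
print(pack_vliw(ops))   # [[('a', []), ('b', []), ('d', [])], [('c', ['a', 'b'])]]
```

All dependency resolution happens here at compile time; the hardware merely executes the slots of each word in lockstep, which is exactly what keeps VLIW control logic simple.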

VII. THE MAIN ROAD OF THE EVOLUTION

As pointed out before, processor throughput can be increased along three main dimensions, i.e. by introducing and raising temporal, issue and intra-instruction parallelism. Although theoretically any sequence is imaginable, the main path of evolution is marked by the sequence temporal, issue and intra-instruction parallelism (see Figure 24). This sequence has been determined basically by the decreasing efficiency of hardware utilization.


Scalar pipelined processors, which provide temporal parallelism, lead to the best hardware utilization since, in essence, all stages of the pipeline always take part in the processing of instructions. Superscalar processors, which allow issue parallelism, follow with lower hardware utilization because of the availability of multiple (parallel) execution paths. SIMD hardware extensions, which enable intra-instruction parallelism, are utilized least of all, as they are used only for MM and 3D applications. To put it the other way around, higher throughput necessarily leads to higher hardware redundancy, as indicated in Figure 24.

Figure 24: Main steps of the evolution of the microarchitecture of the processors. (From traditional von Neumann processors through pipelined processors (temporal parallelism, ~1985/88) and superscalar processors (+ issue parallelism, ~1990/93) to superscalar processors with MM/3D support (+ intra-instruction parallelism, ~1994/97); along this path the extent of operation-level parallelism increases while the extent of the utilization of the hardware decreases.)

We note that beyond the main path of evolution, processor history - as opposed to microprocessor history - revealed a second possible scenario for the evolution. This scenario is characterized by only two consecutive phases, as indicated in Figure 25.

Figure 25: Possible scenarios for the evolution of processors. (a. Evolutionary scenario (superscalar approach): introduction/increase of temporal parallelism, then of issue parallelism, then of intra-instruction parallelism. b. Revolutionary scenario (VLIW approach): introduction/increase of temporal parallelism, then of intra-instruction parallelism.)


The second scenario represents a departure from the main path in its second phase. On this path, the introduction of temporal parallelism was followed by intra-instruction parallelism in the form of VLIW instructions instead of issue parallelism. Clearly, introducing multiple operations per instruction instead of multiple instructions per clock cycle is a competing alternative for boosting throughput. But at the end of the 1980s this alternative turned out to be a dead end for wide VLIWs, mainly due to the strongly enhanced requirements for the performance of the compilers needed to resolve dependencies and to optimize parallel operations in VLIW processors.

In very broad terms, the main path was chosen since it allowed the manufacturers to retain the basic ISA throughout, even though it became a necessity to extend the basic ISA in the last phase. In contrast, the second scenario represents in a sense a more revolutionary path, as the introduction of multi-operation VLIW instructions demanded a completely new ISA.

VIII. CONCLUSIONS

The massive demand for higher processor performance could only be satisfied by the successive introduction of temporal, issue and intra-instruction parallelism into the processor operation. This sequence was basically determined by hardware efficiency aspects. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark subsequent evolutionary phases of microprocessors, as indicated in Figure 26.

Figure 26: Major deterministic features of the evolution of processors. (Traditional sequential processors are followed by pipelined processors (~1985/88), which introduce temporal parallelism by pipelined instruction processing and call for caches and speculative branch processing; by superscalar processors (~1990/93), which introduce issue parallelism by superscalar instruction issue and call for shelving, renaming and predecoding, then for increasing the issue rate, widening the instruction window, increasing the dispatch rate, improving the accuracy of branch prediction, enhancing the cache subsystem and exceeding the dataflow limit; and by superscalar processors with MM/3D support (~1994/97), which introduce intra-instruction parallelism by SIMD instructions and call for an ISA extension.)

The effective utilization of pipelined instruction processing inevitably called for the introduction of caches and of speculative branch processing, for reasons discussed earlier. On the other hand, the debut of issue parallelism by means of superscalar instruction issue resulted in an appropriate performance increase only if shelving, renaming and - in the case of RISC processors - also predecoding were introduced additionally. Further performance gains required a further increase in the issue rate and an appropriate tuning of the microarchitecture in order to remove bottlenecks. The latter called first of all for widening the instruction window, increasing the dispatch rate, increasing the accuracy of the branch prediction, enhancing the cache subsystem, and exceeding the dataflow limit caused by RAW dependencies. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA. All in all, these deterministic features constitute a framework which explains the sequence of major innovations in the course of processor evolution.

REFERENCES

[1] ___, "Intel Microprocessor Quick Reference Guide," http://developer.intel.com/pressroom/kits/processors/quickref.html

[2] L. Gwennap, "Processor performance climbs steadily," Microprocessor Report, vol. 9, no. 1, pp. 17-23, 1995.

[3] J. L. Hennessy, "VLSI processor architecture," IEEE Transactions on Computers, vol. C-33, no. 12, pp. 1221-1246, Dec. 1984.

[4] B. R. Rau and J. A. Fisher, "Instruction level parallel processing: history, overview and perspective," The Journal of Supercomputing, vol. 7, no. 1, pp. 9-50, 1993.

[5] J. E. Smith and G. S. Sohi, "The microarchitecture of superscalar processors," Proc. IEEE, vol. 83, no. 12, pp. 1609-1624, Dec. 1995.

[6] A. Yu, "The future of microprocessors," IEEE Micro, vol. 16, no. 6, pp. 46-53, Dec. 1996.

[7] K. Diefendorff, "PC processor microarchitecture, a concise review of the techniques used in modern PC processors," Microprocessor Report, vol. 13, no. 9, pp. 16-22, 1999.

[8] ___, “SPEC Benchmark Suite, Release 1.0,”SPEC, Santa Clara, CA, Oct. 1989.

[9] ___, “SPEC CPU92 Benchmarks,” http://www.specbench.org/osg/cpu92/

[10]___, “SPEC CPU95 Benchmarks,” http://www.specbench.org/osg/cpu95/

[11] ___, “Winstone 99,” http://www1.zdnet.com/zdbob/winstone/winstone.html.

[12] ___, “WinBench 99,” http://www.zdnet.com/zdbop/winbench/winbench.html

[13] ___, “SYSmark Bench Suite,” http://www.babco.com/

[14] H. J. Curnow and B. A. Wichmann, “Asynthetic benchmark,” The Computer J., vol. 19,no. 1, pp. 43-49, 1976.

[15] R. P. Weicker, “Drystone: A synthetic systemsprogramming benchmark,” Comm. ACM, vol.27, no. 10, pp. 1013-1030, 1984.

[16] L. Gwennap, “Intel’s P6 uses decoupledsuperscalar design,” Microprocessor Report,vol. 9, no. 2, pp. 9-15, 1995.

[17] L. Gwennap, “Klamath extends P6 family,”Microprocessor Report, vol. 11, no. 2, pp. 1, 6-9, 1997.


[18] K. Diefendorff, “Pentium III = Pentium II + SSE,” Microprocessor Report, vol. 13, no. 3, pp. 1, 6-11, 1999.

[19] D. Anderson and T. Shanley, ISA System Architecture, 3rd ed. Reading, MA: Addison-Wesley Developers Press, 1995.

[20] D. Anderson and T. Shanley, EISA System Architecture, 2nd ed. Reading, MA: Addison-Wesley Developers Press, 1995.

[21] D. Anderson and T. Shanley, PCI System Architecture, 4th ed. Reading, MA: Addison-Wesley Developers Press, 1999.

[22] ___, “PCI-X Addendum Released for Member Review,” http://www.pcisig.com/

[23] E. Bloch, “The engineering design of the STRETCH computer,” in Proc. East. Joint Comp. Conf., New York: Spartan Books, 1959, pp. 48-58.

[24] J. E. Thornton, Design of a Computer: The Control Data 6600. Glenview: Scott and Foresman, 1970.

[25] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Res. and Dev., vol. 11, no. 1, pp. 25-33, 1967.

[26] R. W. Hockney and C. R. Jesshope, Parallel Computers. Bristol: Adam Hilger, 1981.

[27] T. Kilburn, D. B. G. Edwards, M. J. Lanigan and F. H. Sumner, “One-level storage system,” IRE Trans. EC-11, vol. 2, pp. 223-235, Apr. 1962.

[28] D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM System/360 Model 91: Machine philosophy and instruction-handling,” IBM Journal, vol. 11, no. 1, pp. 8-24, Jan. 1967.

[29] ___, 80286 High performance microprocessor with memory management and protection, Microprocessors, vol. 1. Mt. Prospect, IL: Intel, 1991, pp. 3.60-3.115.

[30] T. L. Johnson, “A comparison of M68000 family processors,” BYTE, vol. 11, no. 9, pp. 205-218, 1986.

[31] G. Kane and J. Heinrich, MIPS RISC Architecture. Englewood Cliffs, NJ: Prentice Hall, 1992.

[32] ___, 80386 DX High performance 32-bit CHMOS microprocessor with memory management and protection, Microprocessors, vol. 1. Mt. Prospect, IL: Intel, 1991, pp. 5.287-5.424.

[33] ___, “The 68030 microprocessor: a window on 1988 computing,” Computer Design, vol. 27, no. 1, pp. 20-23, Jan. 1988.

[34] F. Faggin, M. Shima, M. E. Hoff, Jr., H. Feeney and S. Mazor, “The MCS-4: An LSI Micro Computer System,” in Proc. IEEE Region 6 Conf., 1972, pp. 8-11.

[35] S. P. Morse, B. W. Ravenel, S. Mazor and W. B. Pohlman, “Intel microprocessors: 8008 to 8086,” Intel Corp., 1978; reprinted in D. P. Siewiorek, C. G. Bell and A. Newell, Computer Structures: Principles and Examples. New York: McGraw-Hill, 1982.

[36] C. J. Conti, D. H. Gibson and S. H. Pitkowsky, “Structural aspects of the System/360 Model 85, Part 1: General organization,” IBM Syst. J., vol. 7, no. 1, pp. 2-14, 1968.

[37] J. E. Smith, “A study of branch prediction strategies,” in Proc. 8th ISCA, pp. 135-148, May 1981.

[38] K. F. Lee and A. J. Smith, “Branch prediction strategies and branch target buffer design,” Computer, vol. 17, no. 1, pp. 6-22, Jan. 1984.

[39] D. Sima, T. Fountain and P. Kacsuk, Advanced Computer Architectures. Harlow: Addison-Wesley, 1997.

[40] M. H. Lipasti and J. P. Shen, “Superspeculative microarchitecture for beyond AD 2000,” IEEE Computer, pp. 59-66, Sept. 1997.

[41] D. Sima, “Superscalar instruction issue,” IEEE Micro, vol. 17, no. 5, pp. 28-39, 1997.

[42] N. P. Jouppi and D. W. Wall, “Available instruction-level parallelism for superscalar and superpipelined machines,” in Proc. ASPLOS-III, 1989, pp. 272-282.

[43] M. S. Lam and R. P. Wilson, “Limits of control flow on parallelism,” in Proc. 19th Ann. Intern. Symp. Comp. Arch. (AISCA), 1992, pp. 46-57.

[44] D. Sima, “The design space of shelving,” J. Systems Architecture, vol. 45, no. 11, pp. 863-885, 1999.

[45] B. Shriver and B. Smith, The Anatomy of a High-Performance Microprocessor. Los Alamitos, CA: IEEE Computer Society Press, 1998.

[46] ___, “DECchip 21064 and DECchip 21064A Alpha AXP Microprocessors Hardware Reference Manual,” Maynard, MA: DEC, 1994.

[47] ___, “Alpha 21164 Microprocessor Hardware Reference Manual,” Maynard, MA: DEC, 1994.

[48] D. Leibholz and R. Razdan, “The Alpha 21264: a 500 MIPS out-of-order execution microprocessor,” in Proc. COMPCON, 1997, pp. 28-36.

[49] K. Diefendorff and M. Allen, “Organization of the Motorola 88110 superscalar RISC microprocessor,” IEEE Micro, vol. 12, no. 2, pp. 40-62, 1992.

[50] T. Asprey, G. S. Averill, E. DeLano, B. Weiner and J. Yetter, “Performance features of the PA7100 microprocessor,” IEEE Micro, vol. 13, no. 3, pp. 22-35, 1993.

[51] G. Kurpanek, K. Chan, J. Zheng, E. DeLano and W. Bryg, “PA-7200: A PA-RISC processor with integrated high performance MP bus interface,” in Proc. COMPCON, 1994, pp. 375-382.

[52] A. P. Scott et al., “Four-Way Superscalar PA-RISC Processors,” Hewlett-Packard Journal, pp. 1-9, Aug. 1997.

[53] G. Lesartre and D. Hunt, “PA-8500: The Continuing Evolution of the PA-8000 Family,” PA-8500 Document, Hewlett-Packard Company, 1998, pp. 1-11.

[54] G. F. Grohoski, “Machine organization of the IBM RISC System/6000 processor,” IBM J. Research and Development, vol. 34, no. 1, pp. 37-58, 1990.

[55] S. White and J. Reysa, “PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000,” Austin, TX: IBM Corp., 1994.

[56] L. Gwennap, “IBM crams Power2 onto single chip,” Microprocessor Report, vol. 10, no. 11, pp. 14-16, 1996.

[57] M. Becker et al., “The PowerPC 601 microprocessor,” IEEE Micro, vol. 13, no. 5, pp. 54-68, 1993.

[58] B. Burgess et al., “The PowerPC 603 microprocessor,” Comm. ACM, vol. 37, no. 6, pp. 34-42, 1994.

[59] S. P. Song et al., “The PowerPC 604 RISC microprocessor,” IEEE Micro, vol. 14, no. 5, pp. 8-17, 1994.

[60] D. Ogden et al., “A new PowerPC microprocessor for low power computing systems,” in Proc. COMPCON, 1995, pp. 281-284.

[61] D. Levitan et al., “The PowerPC 620 microprocessor: a high performance superscalar RISC microprocessor,” in Proc. COMPCON, 1995, pp. 285-291.

[62] L. Gwennap, “Arthur revitalizes PowerPC line,” Microprocessor Report, vol. 11, no. 2, pp. 10-13, 1997.

[63] S. P. Song, “IBM's Power3 to replace P2SC,” Microprocessor Report, vol. 11, no. 15, pp. 23-27, 1997.

[64] A. Patrizio and M. Hachman, “Motorola announces G4 chip,” http://www.techweb.com/wire/story/TWB19981016S0013

[65] P. Y-T. Hsu, “Designing the TFP microprocessor,” IEEE Micro, vol. 14, no. 2, pp. 23-33, 1994.

[66] ___, “R10000 Microprocessor Product Overview,” MIPS Technologies Inc., Oct. 1994, pp. 1-12.

[67] L. Gwennap, “MIPS R12000 to hit 300 MHz,” Microprocessor Report, vol. 11, no. 13, pp. 1, 6-7, 17, 1997.

[68] ___, “The SuperSPARC microprocessor Technical White Paper,” Mountain View, CA: Sun Microsystems, 1992.

[69] L. Gwennap, “UltraSparc unleashes SPARC performance,” Microprocessor Report, vol. 8, no. 13, pp. 1-10, 1994.

[70] N. Patkar, A. Katsuno, S. Li, T. Maruyama, S. Savkar, M. Simone, G. Shen, R. Swami and D. Tovey, “Microarchitecture of Hal's CPU,” in Proc. COMPCON, 1995, pp. 259-266.

[71] G. Goldman and P. Tirumalai, “UltraSPARC-II: the advancement of UltraComputing,” in Proc. COMPCON, 1996, pp. 417-423.

[72] P. Song, “UltraSparc-3 aims at MP servers,” Microprocessor Report, vol. 11, no. 14, pp. 29-34, 1997.

[73] D. Alpert and D. Avnon, “Architecture of the Pentium microprocessor,” IEEE Micro, vol. 13, no. 3, pp. 11-21, 1993.

[74] M. Slater, “Intel's long awaited P55C disclosed,” Microprocessor Report, vol. 10, no. 14, pp. 20-23, 1996.

[75] J. S. Liptay, “Design of the IBM Enterprise System/9000 high-end processor,” IBM J. Research and Development, vol. 36, no. 4, pp. 713-731, July 1992.

[76] K. Uchiyama, F. Arakawa, S. Narita, H. Aoki, I. Kawasaki, S. Matsui, M. Yamamoto, N. Nakagawa and I. Kudo, “The Gmicro/500 superscalar microprocessor with branch buffers,” IEEE Micro, vol. 13, no. 5, pp. 12-22, 1993.

[77] ___, “The Cyrix M1 architecture,” Richardson, TX: Cyrix Corp., 1995.

[78] ___, “Cyrix 686 MX processor,” Richardson, TX: Cyrix Corp., 1997.

[79] J. Circello and F. Goodrich, “The Motorola 68060 microprocessor,” in Proc. COMPCON, 1993, pp. 73-78.

[80] L. Gwennap, “NexGen enters market with 66-MHz Nx586,” Microprocessor Report, vol. 8, no. 4, pp. 12-17, 1994.

[81] M. Slater, “AMD's K5 designed to outrun Pentium,” Microprocessor Report, vol. 8, no. 14, pp. 1-11, 1994.

[82] L. Gwennap, “AMD deploys K6-2 with 3DNow,” Microprocessor Report, vol. 12, no. 7, pp. 16-17, 1998.

[83] L. Gwennap, “AMD gets the IIIrd degree,” Microprocessor Report, vol. 13, no. 3, pp. 22-24, 1999.


[84] K. Diefendorff, “Athlon outruns Pentium III,” Microprocessor Report, vol. 13, no. 11, pp. 1, 6-11, 1999.

[85] M. Franklin and G. S. Sohi, “ARB: a hardware mechanism for dynamic reordering of memory references,” IEEE Trans. Computers, vol. 45, no. 5, pp. 552-571, 1996.

[86] K. Diefendorff, “Jalapeno Powers Cyrix's M3,” Microprocessor Report, vol. 12, no. 15, pp. 24-29, 1998.

[87] C. Fu, M. D. Jennings, S. Y. Larin and T. M. Conte, “Value speculation scheduling for high performance processors,” in Proc. ASPLOS-VIII, 1998, pp. 262-271.

[88] M. H. Lipasti and J. P. Shen, “Exceeding the dataflow limit via value prediction,” in Proc. MICRO29, 1996, pp. 226-237.

[89] Y. Sazeides and J. E. Smith, “The predictability of data values,” in Proc. MICRO30, 1997, pp. 248-258.

[90] B. Calder, P. Feller and A. Eustace, “Value profiling,” in Proc. MICRO30, 1997, pp. 259-269.

[91] D. Michie, “Memo functions and machine learning,” Nature, no. 218, pp. 19-22, 1968.

[92] S. Richardson, “Exploiting trivial and redundant computation,” in Proc. 11th Symp. Computer Arithmetic, 1993, pp. 220-227.

[93] A. Sodani and G. S. Sohi, “Dynamic instruction reuse,” in Proc. 24th ISCA, 1997, pp. 194-205.

[94] A. Sodani and G. S. Sohi, “An empirical analysis of instruction repetition,” in Proc. ASPLOS VIII, 1998, pp. 35-45.

[95] D. Citron, D. Feitelson and L. Rudolph, “Accelerating multi-media processing by implementing memoing in multiplication and division,” in Proc. ASPLOS VIII, 1998, pp. 252-261.

[96] A. Moshovos et al., “Dynamic speculation and synchronization of data dependencies,” in Proc. 24th ISCA, 1997, pp. 181-193.

[97] G. Z. Chrysos and J. S. Emer, “Memory dependence prediction using store sets,” in Proc. 25th ISCA, 1998, pp. 142-153.

[98] M. H. Lipasti, C. B. Wilkerson and J. P. Shen, “Value locality and load value prediction,” in Proc. ASPLOS VII, 1996, pp. 138-147.

[99] D. W. Wall, “Limits of instruction level parallelism,” in Proc. ASPLOS IV, 1991, pp. 176-188.

[100] R. R. Oehler and M. W. Blasgen, “IBM RISC System/6000: Architecture and performance,” IEEE Micro, vol. 11, no. 3, pp. 14-17, 56-62, 1991.

[101] K. Diefendorff and E. Shilha, “The PowerPC user instruction set architecture,” IEEE Micro, vol. 14, no. 5, pp. 30-41, 1994.

[102] D. Hunt, “Advanced performance features of the 64-bit PA-8000,” in Proc. COMPCON, 1995, pp. 123-128.

[103] ___, MIPS IV Instruction Set Architecture, White Paper, MIPS Technologies Inc., Mountain View, CA, 1994.

[104] ___, Intel Architecture Software Developer’s Manual, http://developer.intel.com/design/PentiumII/manuals/

[105] R. B. Lee, “Accelerating multimedia with enhanced microprocessors,” IEEE Micro, vol. 15, no. 2, pp. 22-32, 1995.

[106] L. Gwennap, “AltiVec vectorizes PowerPC,” Microprocessor Report, vol. 12, no. 6, pp. 1, 6-9, 1998.

[107] ___, Advanced Technology for Visual Computing: Alpha Architecture with MVI (White Paper), http://www.digital.com/semiconductor/mvi-backgrounder.htm

[108] D. Sweetman, See MIPS Run. San Francisco, CA: Morgan Kaufmann, 1999.

[109] R. B. Lee, “Subword parallelism with MAX-2,” IEEE Micro, vol. 16, no. 4, pp. 51-59, 1996.

[110] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu and G. Zyner, “The Visual Instruction Set (VIS) in UltraSPARC,” in Proc. COMPCON, 1995, pp. 462-469.

[111] A. Peleg and U. Weiser, “MMX technology extension to the Intel architecture,” IEEE Micro, vol. 16, no. 4, pp. 42-50, 1996.

[112] S. Oberman, G. Favor and F. Weber, “AMD 3DNow! technology: architecture and implementations,” IEEE Micro, vol. 19, no. 2, pp. 37-48, 1999.

[113] ___, Intel Architecture Software Developer’s Manual, http://developer.intel.com/design/PentiumIII/manuals/

[114] M. Mittal, A. Peleg and U. Weiser, “MMX technology overview,” Intel Technology Journal, pp. 1-10, 3rd Quarter 1997.

[115] ___, “3D Winbench 99 - 3D Lightning and Transformation Test,” http://developer.intel.com/procs/perf/PentiumIII/ed/3dwinbench.html

[116] A. E. Charlesworth, “An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family,” Computer, vol. 14, no. 9, pp. 18-27, 1981.

[117] ___, CDC Advanced Flexible Processor Microcode Cross Assembler (MICA) Reference Manual, Publication no. 77900500, Control Data Corp., Apr. 1980.

[118] L. Gwennap, “Intel discloses new IA-64 features,” Microprocessor Report, vol. 13, no. 3, pp. 16-19, 1999.


[119] B. R. Rau, C. D. Glaser and R. L. Picard, “Efficient code generation for horizontal architectures: compiler techniques and architectural support,” in Proc. 9th AISCA, 1982, pp. 131-139.

[120] J. A. Fisher, “Very long instruction word architectures and the ELI-512,” in Proc. 10th AISCA, 1983, pp. 140-150.

[121] R. P. Colwell, R. P. Nix, J. O'Donnell, D. B. Papworth and P. K. Rodman, “A VLIW architecture for a trace scheduling compiler,” IEEE Trans. Computers, vol. 37, no. 8, pp. 967-979, 1988.

[122] B. R. Rau, D. W. L. Yen, W. Yen and R. A. Towle, “The Cydra 5 departmental supercomputer,” Computer, vol. 22, no. 1, pp. 12-35, 1989.