
Page 1: Chapter 4 Multiple-Issue Processors

Page 2: Multiple-issue processors

This chapter concerns multiple-issue processors, i.e. superscalar and VLIW (very long instruction word) processors.

Most of today's general-purpose microprocessors are four- or six-issue superscalars, often with an enhanced Tomasulo scheme.

VLIW is the choice for most signal processors.

VLIW is proposed as EPIC (explicitly parallel instruction computing) by Intel for its IA-64 ISA.

Page 3: Components of a superscalar processor

[Figure: block diagram of a superscalar processor: bus interface unit, MMU and I-cache, instruction fetch unit, branch unit (BHT, BTAC, RAS), instruction decode and register rename unit, instruction buffer, instruction issue unit, reorder buffer, rename registers, retire unit, load/store unit(s), integer unit(s), floating-point unit(s), multi-media unit(s), general-purpose, floating-point and multi-media registers, MMU and D-cache.]

Page 4: Floorplan of the PowerPC 604

[Figure: floorplan of the PowerPC 604.]

Page 5: Superscalar pipeline (PowerPC and enhanced Tomasulo scheme)

Instructions in the instruction window are free from control dependencies due to branch prediction, and free from name dependences due to register renaming.

So, only (true) data dependences and structural conflicts remain to be solved.

[Figure: superscalar pipeline stages: instruction fetch, instruction decode and rename, instruction window, issue, reservation stations, execution, retire and write back.]

Page 6: Superscalar pipeline without reservation stations

[Figure: pipeline stages: fetch, decode, rename, issue window (wakeup/select), register read, execute/bypass, D-cache access, register write/commit; with register file and data cache.]

Page 7: Superscalar pipeline with decoupled instruction windows

[Figure: pipeline with multiple decoupled issue windows feeding the stages: fetch, decode, rename, wakeup, select, register read, execute/bypass, D-cache access, register write/commit.]

Page 8: Issue

The issue logic examines the waiting instructions in the instruction window and simultaneously assigns (issues) a number of instructions to the FUs, up to a maximum issue bandwidth.

Several instructions can be issued simultaneously (the issue bandwidth). The program order of the issued instructions is stored in the reorder buffer. Instruction issue from the instruction window can be:
– in-order (only in program order) or out-of-order,
– subject to simultaneous checking of data dependences and resource constraints,
– or divided into two (or more) stages:
• checking structural conflicts in the first stage and data dependences in the next stage (or vice versa);
• in the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the FUs, where the issued instructions await missing operands (PowerPC/enhanced Tomasulo scheme).

Page 9: Reservation station(s)

Two definitions exist in the literature:
– A reservation station is a buffer for a single instruction with its operands (original Tomasulo paper, Flynn's book, Hennessy/Patterson book).
– A reservation station is a buffer (in front of one or more FUs) with one or more entries, where each entry can buffer an instruction with its operands (e.g. PowerPC literature).

Depending on the specific processor, reservation stations can be central to a number of FUs, or each FU can have one or more reservation stations of its own.

Instructions await their operands in the reservation stations, as in the Tomasulo algorithm.

Page 10: Dispatch (PowerPC and enhanced Tomasulo scheme)

An instruction is said to be dispatched from a reservation station to the FU when all operands are available and execution starts.

If all its operands are available during issue and the FU is not busy, an instruction is immediately dispatched and starts execution in the cycle after issue. So dispatch is usually not a separate pipeline stage.

An issued instruction may stay in the reservation station for zero to several cycles. Dispatch and execution are performed out of program order.

Other authors interchange the meaning of issue and dispatch, or use different semantics.

Page 11: Completion

When the FU finishes the execution of an instruction and the result is ready for forwarding and buffering, the instruction is said to complete.

Instruction completion is out of program order.

During completion the reservation station is freed and the state of the execution is noted in the reorder buffer. The state of the reorder buffer entry can denote an interrupt occurrence. The instruction can be completed and still be speculatively assigned, which is also monitored in the reorder buffer.

Page 12: Commitment

After completion, operations are committed in order. An instruction can be committed:
– if all previous instructions in program order are already committed or can be committed in the same cycle,
– if no interrupt occurred before or during instruction execution, and
– if the instruction is no longer on a speculative path.

By or after commitment, the result of an instruction is made permanent in the architectural register set, usually by writing the result back from the rename register to the architectural register.
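The in-order commit rules above can be sketched in software. The following C fragment is a minimal, hypothetical model of a reorder-buffer commit loop (the field names and the commit width are assumptions, not taken from any particular processor): each cycle it retires instructions from the head of the ROB only while they are completed, exception-free and no longer speculative.

#include <stdbool.h>

#define ROB_SIZE      64
#define COMMIT_WIDTH   4   /* assumed maximum commits per cycle */

typedef struct {
    bool completed;    /* execution finished, result buffered        */
    bool exception;    /* interrupt/exception noted at completion    */
    bool speculative;  /* still on an unresolved speculative path    */
    int  rename_reg;   /* rename register holding the result         */
    int  arch_reg;     /* architectural destination register         */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head, count;
} Rob;

/* Commit up to COMMIT_WIDTH instructions in program order. */
static void commit_cycle(Rob *rob, int arch_rf[], const int rename_rf[])
{
    for (int n = 0; n < COMMIT_WIDTH && rob->count > 0; n++) {
        RobEntry *e = &rob->entry[rob->head];
        if (!e->completed || e->speculative)
            break;                       /* head not ready: stop (in-order rule) */
        if (e->exception)
            break;                       /* precise exception: older instructions are
                                            already committed, younger ones would be flushed */
        /* make the result permanent: rename register -> architectural register */
        arch_rf[e->arch_reg] = rename_rf[e->rename_reg];
        rob->head = (rob->head + 1) % ROB_SIZE;   /* free the ROB slot (retire) */
        rob->count--;
    }
}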

Page 13: Precise interrupt (precise exception)

If an interrupt occurs, all instructions that are in program order before the interrupt-signaling instruction are committed, and all later instructions are removed.

Precise exception means that all instructions before the faulting instruction are committed and those after it can be restarted from scratch.

Depending on the architecture and the type of exception, the faulting instruction should be committed or removed without any lasting effect.

Page 14: Retirement

An instruction retires when its reorder buffer slot is freed, either
– because the instruction commits (the result is made permanent), or
– because the instruction is removed (without making permanent changes).

A result is made permanent by copying the result value from the rename register to the architectural register. This is often done in a separate stage after the commitment of the instruction, with the effect that the rename register is freed one cycle after commitment.

Page 15: Explanation of the term "superscalar"

Definition:

Superscalar machines are distinguished by their ability to (dynamically) issue multiple instructions each clock cycle from a conventional linear instruction stream.

In contrast to superscalar processors, VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously.

Page 16: Explanation of the term "superscalar"

Instructions are issued from a sequential stream of normal instructions (in contrast to VLIW, where a sequential stream of instruction tuples is used).

The instructions that are issued are scheduled dynamically by the hardware (in contrast to VLIW processors, which rely on static scheduling by the compiler).

More than one instruction can be issued each cycle (motivating the term superscalar instead of scalar).

The number of issued instructions is determined dynamically by hardware; that is, the actual number of instructions issued in a single cycle can range from zero up to a maximum instruction issue bandwidth (in contrast to VLIW, where the number of scheduled instructions is fixed, the slots being padded with no-ops when the full issue bandwidth cannot be filled).

Page 17: Explanation of the term "superscalar"

Dynamic issue in superscalar processors can allow instructions to be issued in order, or also out of program order. Only in-order issue is possible with VLIW processors.

Dynamic instruction issue complicates the hardware scheduler of a superscalar processor compared with a VLIW. The scheduler complexity increases further when multiple instructions are issued out of order from a large instruction window.

Superscalar execution presumes that multiple FUs are available. The number of available FUs is at least the maximum issue bandwidth, but often higher, to diminish potential resource conflicts.

The superscalar technique is a microarchitecture technique, not an architecture technique.

Page 18: Please recall: architecture, ISA, microarchitecture

The architecture of a processor is defined as the instruction set architecture (ISA), i.e. everything that is visible outside the processor.

In contrast, the microarchitecture comprises implementation techniques such as the number and type of pipeline stages, issue bandwidth, number of FUs, and the size and organization of on-chip cache memories.
– The maximum issue bandwidth and the internal structure of the processor can be changed.
– Several architecturally compatible processors may exist with different microarchitectures, all able to execute the same code.

An optimizing compiler may also exploit knowledge of the microarchitecture.

Page 19: Sections of a superscalar processor

The ability to issue and execute instructions out of order partitions a superscalar pipeline into three distinct sections:
– an in-order section with the instruction fetch, decode and rename stages (the issue stage also belongs to the in-order section in case of in-order issue),
– an out-of-order section starting with the issue stage in case of an out-of-order issue processor, the execution stage, and usually the completion stage, and again
– an in-order section that comprises the retirement and write-back stages.

Page 20: Temporal vs. spatial parallelism

Instruction pipelining, superscalar and VLIW techniques all exploit fine-grain (instruction-level) parallelism.

Pipelining utilizes temporal parallelism.

Superscalar and VLIW techniques also utilize spatial parallelism.

Performance can be increased by longer pipelines (deeper pipelining) and faster transistors (a faster clock), emphasizing improved pipelining.

Provided that enough fine-grain parallelism is available, performance can also be increased by more FUs and a higher issue bandwidth, using more transistors, in the superscalar and VLIW cases.

Page 21: I-cache access and instruction fetch

Harvard architecture: separate instruction and data memory and access paths; used internally in high-performance microprocessors in the form of separate on-chip primary I-cache and D-cache.

The I-cache is less complicated to control than the D-cache, because
– it is read-only, and
– it is not subject to cache coherence, in contrast to the D-cache.

Sometimes the instructions in the I-cache are predecoded on their way from the memory interface to the I-cache to simplify the decode stage.

Page 22: Instruction fetch

The main problem of instruction fetching is control transfer performed by jump, branch, call, return, and interrupt instructions:
– If the starting PC address is not the address of the cache line, then fewer instructions than the fetch width are returned.
– Instructions after a control transfer instruction are invalidated.
– A fetch from multiple cache lines at different locations may be needed in future very wide-issue processors, where a single contiguous fetch block will often contain more than one branch.

Problem with target instruction addresses that are not aligned to the cache line addresses:
– A self-aligned instruction cache reads and concatenates two consecutive lines within one cycle, so that it can always return the full fetch bandwidth. Possible implementations (see the sketch below):
• a dual-ported I-cache,
• two separate cache accesses in a single cycle, or
• a two-banked I-cache (preferred).
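The following C fragment is a minimal sketch (not from the slides) of the address arithmetic behind a self-aligned, two-banked fetch: the fetch block may straddle a line boundary, in which case the two consecutive lines fall into different banks and can be read in the same cycle. The cache geometry and the function name are assumptions for illustration.

#include <stdint.h>

#define LINE_BYTES   32u            /* assumed I-cache line size          */
#define FETCH_BYTES  16u            /* assumed fetch block (4 x 4 bytes)  */
#define NUM_BANKS     2u            /* two-banked I-cache                 */

/* Determine which line(s) and bank(s) a fetch starting at 'pc' touches. */
static void self_aligned_fetch(uint32_t pc,
                               uint32_t line_addr[2], unsigned bank[2],
                               unsigned *lines_needed)
{
    uint32_t first_line  = pc & ~(LINE_BYTES - 1);
    uint32_t last_byte   = pc + FETCH_BYTES - 1;
    uint32_t second_line = last_byte & ~(LINE_BYTES - 1);

    line_addr[0] = first_line;
    bank[0]      = (first_line / LINE_BYTES) % NUM_BANKS;

    if (second_line != first_line) {       /* fetch block straddles two lines  */
        line_addr[1]  = second_line;
        bank[1]       = (second_line / LINE_BYTES) % NUM_BANKS;
        *lines_needed = 2;                 /* consecutive lines land in different banks */
    } else {
        *lines_needed = 1;
    }
}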

Page 23: Prefetching and instruction fetch prediction

Prefetching improves the instruction fetch performance, but fetching is still limited because instructions after a control transfer must be invalidated.

Instruction fetch prediction helps to determine the next instructions to be fetched from the memory subsystem.

Instruction fetch prediction is applied in conjunction with branch prediction.

Page 24: Branch prediction

Branch prediction foretells the outcome of conditional branch instructions. Excellent branch handling techniques are essential for today's and for future microprocessors.

The task of high-performance branch handling consists of the following requirements:
– an early determination of the branch outcome (the so-called branch resolution),
– buffering of the branch target address in a BTAC after its first calculation, and an immediate reload of the PC after a BTAC match,
– an excellent branch predictor (i.e. branch prediction technique) and speculative execution mechanism,
– the ability to pursue two or more speculation levels, since often another branch is predicted while a previous branch is still unresolved, and
– an efficient rerolling mechanism when a branch is mispredicted (minimizing the branch misprediction penalty).

Page 25: Misprediction penalty

The performance of branch prediction depends on the prediction accuracy and the cost of misprediction.

Prediction accuracy can be improved by inventing better branch predictors. The misprediction penalty depends on many organizational features:
– the pipeline length (favoring shorter over longer pipelines),
– the overall organization of the pipeline,
– whether misspeculated instructions can be removed from internal buffers, or have to be executed and can only be removed in the retire stage,
– the number of speculative instructions in the instruction window or the reorder buffer (typically only a limited number of instructions can be removed each cycle).

Rerolling when a branch is mispredicted is expensive:
– 4 to 9 cycles in the Alpha 21264,
– 11 or more cycles in the Pentium II.

Page 26: Branch-target buffer or branch-target address cache

The branch-target buffer (BTB) or branch-target address cache (BTAC) stores branch and jump target addresses.

It should be known already in the IF stage whether the as-yet-undecoded instruction is a jump or branch, so the BTB is accessed during the IF stage. The BTB consists of a table with branch addresses, the corresponding target addresses, and prediction information.

Variations:
– Branch target cache (BTC): additionally stores one or more target instructions.
– Return address stack (RAS): a small stack of return addresses for procedure calls and returns, used in addition to and independently of a BTB (see the sketch below).
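A minimal sketch of the return address stack idea, assuming a small fixed-depth circular stack (the depth and the function names are illustrative, not from the slides): a call pushes the fall-through address, a return pops it as the predicted target.

#include <stdint.h>

#define RAS_DEPTH 8u                 /* assumed depth; real designs use a few entries */

typedef struct {
    uint32_t addr[RAS_DEPTH];
    unsigned top;                    /* index of the next free slot (wraps around)   */
} ReturnAddressStack;

/* On a predicted call: push the address of the instruction after the call. */
static void ras_push(ReturnAddressStack *ras, uint32_t return_addr)
{
    ras->addr[ras->top] = return_addr;
    ras->top = (ras->top + 1) % RAS_DEPTH;   /* oldest entry is overwritten on overflow */
}

/* On a predicted return: pop the most recent return address as the target. */
static uint32_t ras_pop(ReturnAddressStack *ras)
{
    ras->top = (ras->top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras->addr[ras->top];
}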

Page 27: Branch-target buffer or branch-target address cache

[Figure: the BTB organized as a table with one entry per branch: branch address, target address, prediction bits.]

Page 28: Static branch prediction

Static branch prediction always predicts the same direction for the same branch during the whole program execution. It comprises hardware-fixed prediction and compiler-directed prediction.

Simple hardware-fixed direction mechanisms can be:
– predict always not taken,
– predict always taken,
– backward branch predict taken, forward branch predict not taken.

Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.
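The backward-taken/forward-not-taken rule from the list above reduces to a single comparison. A minimal illustrative sketch (not tied to any particular ISA): a branch whose target lies at or below its own address is assumed to close a loop and is predicted taken.

#include <stdbool.h>
#include <stdint.h>

/* Static BTFNT heuristic: backward branches (loops) taken, forward branches not taken. */
static bool predict_taken_btfnt(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc <= branch_pc;
}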

Page 29: Dynamic branch prediction

In a dynamic branch prediction scheme the hardware influences the prediction while execution proceeds. The prediction is based on the computation history of the program.

After a start-up phase of the program execution, during which a static branch prediction might be effective, the history information is gathered and dynamic branch prediction becomes effective.

In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.

Page 30: One-bit predictor

[Figure: one-bit predictor state diagram with two states, "predict taken" and "predict not taken"; a taken branch (T) moves to or stays in "predict taken", a not-taken branch (NT) moves to or stays in "predict not taken".]

Page 31: One-bit vs. two-bit predictors

A one-bit predictor correctly predicts a branch at the end of a loop iteration, as long as the loop does not exit.

In nested loops, a one-bit prediction scheme causes two mispredictions for the inner loop:
– one at the end of the loop, when the iteration exits the loop instead of looping again, and
– one when executing the first loop iteration, when it predicts exit instead of looping.

Such a double misprediction in nested loops is avoided by a two-bit predictor scheme.

Two-bit prediction: a prediction must miss twice before it is changed when a two-bit prediction scheme is applied.

Page 32: Two-bit predictors (saturation counter scheme)

[Figure: two-bit saturation counter state diagram with the states (11) predict strongly taken, (10) predict weakly taken, (01) predict weakly not taken, (00) predict strongly not taken; each taken branch (T) moves one state towards 11, each not-taken branch (NT) moves one state towards 00.]
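A compact software model of the saturation counter scheme (a sketch, not a description of any particular processor): the counter value corresponds to the state label in the figure, the upper bit gives the prediction, and updates saturate at 0 and 3.

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter: 0/1 = predict not taken, 2/3 = predict taken. */
typedef uint8_t TwoBitCounter;

static bool predict_taken(TwoBitCounter c)
{
    return c >= 2;                       /* states (10) and (11) predict taken */
}

static TwoBitCounter update(TwoBitCounter c, bool taken)
{
    if (taken)
        return (c < 3) ? c + 1 : 3;      /* move towards strongly taken, saturate     */
    else
        return (c > 0) ? c - 1 : 0;      /* move towards strongly not taken, saturate */
}

The hysteresis scheme on the next page keeps the same four states but changes the update rule for the weak states, as described there.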

Page 33: Two-bit predictors (hysteresis scheme)

[Figure: two-bit hysteresis counter state diagram with the same four states (11) predict strongly taken, (10) predict weakly taken, (01) predict weakly not taken, (00) predict strongly not taken, but with different transitions: a correctly predicted branch in a weak state moves to the strong state of the same direction, while a mispredicted branch in a weak state moves directly to the opposite strong state.]

Page 34: Two-bit predictors

The two-bit prediction scheme is extendable to an n-bit scheme. Studies showed that a two-bit prediction scheme does almost as well as an n-bit scheme with n > 2.

Two-bit predictors can be implemented in the branch target buffer (BTB) by assigning two state bits to each BTB entry. Another solution is to use a BTB for the target addresses and a separate branch history table (BHT) as prediction buffer.

A misprediction out of the BHT occurs for two reasons:
– either a wrong guess for that branch,
– or the branch history of a wrong branch is used because the table is indexed.

In an indexed table lookup, part of the instruction address is used as an index to identify a table entry.

Page 35: Two-bit predictors and correlation-based prediction

Two-bit predictors work well for programs which contain many frequently executed loop-control branches (floating-point intensive programs).

Shortcomings arise from dependent (correlated) branches, which are frequent in integer-dominated programs.

Page 36: Example

Source code:

if (d == 0)      /* branch b1 */
    d = 1;
if (d == 1)      /* branch b2 */
    ...

Assembly:

      bnez R1, L1       ; branch b1 (d != 0)
      addi R1, R0, #1   ; d == 0, so d = 1
L1:   subi R3, R1, #1
      bnez R3, L2       ; branch b2 (d != 1)
      ...
L2:   ...

Consider a sequence where d alternates between 0 and 2; this yields the outcome sequence NT-T-NT-T-NT-T for branches b1 and b2.

The execution behavior is given in the following table:

initial d   d == 0?   b1   d before b2   d == 1?   b2
0           yes       NT   1             yes       NT
2           no        T    2             no        T

Page 37: One-bit predictor initialized to "predict taken"

(Code of page 36; d alternates between 0 and 2.)

Predictor state after each execution (equal to the prediction for the next occurrence); the first column is the initial prediction:

      initial   d==0   d==2   d==0
b1:   T         NT     T      NT
b2:   T         NT     T      NT

The state always reflects the previous, opposite outcome, so every execution of b1 and b2 is mispredicted.

Page 38: Two-bit saturation counter predictor initialized to "predict weakly taken"

(Code of page 36; d alternates between 0 and 2.)

Predictor state after each execution; the first column is the initial prediction:

      initial   d==0   d==2   d==0
b1:   WT        WNT    WT     WNT
b2:   WT        WNT    WT     WNT

[Figure: the two-bit saturation counter state diagram of page 32, repeated for reference.]

Page 39: Two-bit predictor (hysteresis counter) initialized to "predict weakly taken"

(Code of page 36; d alternates between 0 and 2.)

Predictor state after each execution; the first column is the initial prediction:

      initial   d==0   d==2   d==0
b1:   WT        SNT    WNT    SNT
b2:   WT        SNT    WNT    SNT

[Figure: the two-bit hysteresis counter state diagram of page 33, repeated for reference.]

Page 40: Predictor behavior in the example

With a one-bit predictor initialized to "predict taken" for branches b1 and b2, every branch is mispredicted.

With a two-bit predictor of the saturation counter scheme starting from the state "predict weakly taken", every branch is also mispredicted.

The two-bit predictor of the UltraSPARC (the hysteresis scheme of page 39) mispredicts every second execution of b1 and b2.

A (1,1) correlating predictor takes advantage of the correlation of the two branches; it mispredicts only in the first iteration, when d == 2.

Page 41: Correlation-based predictor

The two-bit predictor scheme uses only the recent behavior of a single branch to predict the future of that branch. Correlations between different branch instructions are not taken into account.

Correlation-based predictors (correlating predictors) are branch predictors that additionally use the behavior of other branches to make a prediction. While two-bit predictors use self-history only, the correlating predictor additionally uses neighbor history.

Notation: an (m,n)-correlation-based predictor, or (m,n)-predictor, uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.

Branch history register (BHR): the global history of the most recent m branches can be recorded in an m-bit shift register where each bit records whether the branch was taken or not taken.
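The following C sketch models a small (m,n)-predictor with m = 2 and n = 2 (the sizes, indexing and names are illustrative assumptions, not a particular hardware design): part of the branch address selects a table entry, and the last m global outcomes in the BHR select one of the 2^m two-bit counters of that entry.

#include <stdbool.h>
#include <stdint.h>

#define M            2u                      /* global history bits            */
#define ENTRIES      1024u                   /* branch-address-indexed entries */

/* One 2-bit saturating counter per (entry, history pattern) pair. */
static uint8_t  pht[ENTRIES][1u << M];
static uint8_t  bhr;                         /* global m-bit branch history    */

static bool predict(uint32_t branch_pc)
{
    uint32_t entry = (branch_pc >> 2) % ENTRIES;   /* part of the branch address */
    return pht[entry][bhr] >= 2;                   /* upper bit of the counter   */
}

static void update(uint32_t branch_pc, bool taken)
{
    uint32_t entry = (branch_pc >> 2) % ENTRIES;
    uint8_t *c = &pht[entry][bhr];
    if (taken  && *c < 3) (*c)++;                  /* saturating update          */
    if (!taken && *c > 0) (*c)--;
    bhr = ((bhr << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);  /* shift in outcome */
}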

Page 42: Correlation-based prediction: (2,2)-predictor

[Figure: a (2,2)-predictor. The branch address selects an entry in each of the pattern history tables (PHTs) of 2-bit predictors; the 2-bit global branch history register (BHR, a 2-bit shift register) selects which PHT supplies the prediction.]

Pages 43-47: Prediction behavior of the (1,1) correlating predictor

(Code of page 36; d alternates between 0 and 2; both branches start with the initial prediction "taken".)

[Figures: a step-by-step animation showing, for each execution of b1 and b2, the prediction made, the contents of the 1-bit branch history register (BHR), and the pattern history table (PHT) entries for b1 and b2. As summarized on page 40, the predictor mispredicts only in the first iteration with d == 2; afterwards the global history bit selects the correct PHT entry for both branches.]

Page 48: Two-level adaptive predictors

Developed by Yeh and Patt (1992) at the same time as the correlation-based prediction scheme.

The basic two-level predictor uses a single global branch history register (BHR) of k bits to index into a pattern history table (PHT) of 2-bit counters.

Global history schemes correspond to correlation-based predictor schemes.

Notation:
GAg:
– a single global BHR (denoted G), and
– a single global PHT (denoted g),
– A stands for adaptive.

All PHT implementations of Yeh and Patt use 2-bit predictors. A GAg predictor with a 4-bit BHR is denoted GAg(4).

Page 49: Implementation of a GAg(4) predictor

[Figure: a 4-bit branch history register (BHR), holding e.g. the pattern 1100, indexes the branch pattern history table (PHT); the selected 2-bit counter (here 11) yields the prediction "taken". Branch outcomes are shifted into the BHR.]

In the GAg predictor schemes the PHT lookup depends entirely on the bit pattern in the BHR and is completely independent of the branch address.

Page 50: Mispredictions can be restrained by additionally using:

– the full branch address to distinguish multiple PHTs (called per-address PHTs),
– a subset of branches (e.g. n bits of the branch address) to distinguish multiple PHTs (called per-set PHTs),
– the full branch address to distinguish multiple BHRs (called per-address BHRs),
– a subset of branches to distinguish multiple BHRs (called per-set BHRs),
– or a combination scheme.

Page 51: Implementation of a GAp(4) predictor

[Figure: a single 4-bit global BHR indexes into per-address PHTs; the branch address selects which PHT is used.]

GAp(4) means a 4-bit BHR and a PHT for each branch.

Page 52: GAs(4, 2^n)

[Figure: a single 4-bit global BHR indexes into per-set PHTs; n bits of the branch address select among the 2^n PHTs.]

GAs(4, 2^n) means a 4-bit BHR and n bits of the branch address used to choose among 2^n PHTs with 2^4 entries each.

Page 53: Compare the correlation-based (2,2)-predictor (left) with the two-level adaptive GAs(4, 2^n) predictor (right)

[Figure: side-by-side comparison of the (2,2)-predictor of page 42 and the GAs(4, 2^n) predictor of page 52.]

Page 54: Two-level adaptive predictors: per-address history schemes

The first-level branch history refers to the last k occurrences of the same branch instruction (using self-history only). Therefore a BHR is associated with each branch instruction.

The per-address branch history registers are combined in a table called the per-address branch history table (PBHT).

In the simplest per-address history scheme, the BHRs index into a single global PHT; this is denoted PAg (multiple per-address indexed BHRs and a single global PHT).

Page 55: PAg(4)

[Figure: the branch address selects a 4-bit BHR in the per-address BHT; that BHR indexes a single global PHT.]

Page 56: PAp(4)

[Figure: the branch address selects a 4-bit BHR in the per-address BHT and also selects one of the per-address PHTs; the BHR indexes within the selected PHT (shown for two branches b1 and b2).]

Page 57: Two-level adaptive predictors: per-set history schemes

Per-set history schemes (SAg, SAs, and SAp): the first-level branch history means the last k occurrences of branch instructions from the same subset. Each BHR is associated with a set of branches.

Possible set attributes:
– the branch opcode,
– the branch class assigned by the compiler, or
– the branch address (most important!).

Page 58: SAg(4)

[Figure: n bits of the branch address select a 4-bit BHR in the per-set BHT; that BHR indexes a single global PHT.]

Page 59: SAs(4)

[Figure: n bits of the branch address select a 4-bit BHR in the per-set BHT and also select one of the per-set PHTs; the BHR indexes within the selected PHT (branches b1 and b2 in the same set share a BHR and a PHT).]

Page 60: Two-level adaptive predictors

Full table of the nine scheme variants:

                     single global PHT   per-set PHTs   per-address PHTs
single global BHR    GAg                 GAs            GAp
per-address BHT      PAg                 PAs            PAp
per-set BHT          SAg                 SAs            SAp

Page 61: Estimation of hardware costs

Scheme name     BHR length   No. of PHTs   Hardware cost
GAg(k)          k            1             k + 2^k * 2
GAs(k, p)       k            p             k + p * 2^k * 2
GAp(k)          k            b             k + b * 2^k * 2
PAg(k)          k            1             b * k + 2^k * 2
PAs(k, p)       k            p             b * k + p * 2^k * 2
PAp(k)          k            b             b * k + b * 2^k * 2
SAg(k)          k            1             s * k + 2^k * 2
SAs(k, s * p)   k            p             s * k + p * 2^k * 2
SAp(k)          k            b             s * k + b * 2^k * 2

In the table, b is the number of PHTs or entries in the BHT for the per-address schemes; p and s denote the number of PHTs and the number of BHT entries, respectively, for the per-set schemes. The factor 2 accounts for the 2-bit predictors in the PHT entries.
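As a worked example with assumed parameter values (not from the slides): a GAs(8, 16) predictor, i.e. k = 8 and p = 16, costs 8 + 16 * 2^8 * 2 = 8 + 8192 = 8200 bits, whereas the corresponding GAg(8) needs only 8 + 2^8 * 2 = 520 bits, but then all branches share a single PHT.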

Page 62: Two-level adaptive predictors

Simulations of Yeh and Patt using the SPEC89 benchmarks:

The performance of the global history schemes is sensitive to the branch history length. Interference of different branches that are mapped to the same pattern history table is decreased by lengthening the global BHR. Similarly, adding PHTs reduces the possibility of pattern history interference by mapping interfering branches into different tables.

Global history schemes are better than the per-address schemes for the integer SPEC89 programs, because they utilize branch correlation, which often arises in the frequent if-then-else statements of integer programs.

Per-address schemes are better for the floating-point intensive programs, because they are better at predicting the loop-control branches which are frequent in the floating-point SPEC89 benchmark programs.

The per-set history schemes lie in between the other two.

Page 63: gselect and gshare predictors

gselect predictor: concatenates some lower-order bits of the branch address with the global history.

gshare predictor: uses the bitwise exclusive OR of part of the branch address and the global history as a hash function.

McFarling: gshare is slightly better than gselect.

Branch address   BHR        gselect 4/4   gshare 8/8
00000000         00000001   00000001      00000001
00000000         00000000   00000000      00000000
11111111         00000000   11110000      11111111
11111111         10000000   11110000      01111111
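A small C sketch of the two index functions, with the bit widths of the table above (illustrative only): gselect 4/4 concatenates the low 4 bits of the branch address with the low 4 bits of the global history, gshare 8/8 XORs 8 bits of each. The functions reproduce the four example rows of the table.

#include <stdint.h>

/* gselect 4/4: low 4 address bits concatenated with low 4 history bits. */
static uint8_t gselect_4_4(uint8_t branch_addr, uint8_t bhr)
{
    return (uint8_t)(((branch_addr & 0x0F) << 4) | (bhr & 0x0F));
}

/* gshare 8/8: 8 address bits XORed with 8 history bits. */
static uint8_t gshare_8_8(uint8_t branch_addr, uint8_t bhr)
{
    return (uint8_t)(branch_addr ^ bhr);
}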

Page 64: Hybrid predictors

The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.

Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor:
– McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
– Young and Smith: a compiler-based static branch prediction combined with a two-level adaptive type,
– and many more combinations.

Hybrid predictors are often better than single-type predictors.

Page 65: Simulations [Grunwald]

SAg, gshare and McFarling's combining predictor:

             committed       conditional     taken      misprediction rate (%)
Application  instructions    branches        branches   SAg    gshare   combining
             (in millions)   (in millions)   (%)
compress     80.4            14.4            54.6       10.1   10.1     9.9
gcc          250.9           50.4            49.0       12.8   23.9     12.2
perl         228.2           43.8            52.6       9.2    25.9     11.4
go           548.1           80.3            54.5       25.6   34.4     24.1
m88ksim      416.5           89.8            71.7       4.7    8.6      4.7
xlisp        183.3           41.8            39.5       10.3   10.2     6.8
vortex       180.9           29.1            50.1       2.0    8.3      1.7
jpeg         252.0           20.0            70.0       10.3   12.5     10.4
mean         267.6           46.2            54.3       8.6    14.5     8.1

Page 66: Results

A simulation by Keeton et al. (1998) using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.

The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.

Two different conclusions may be drawn from these simulation results:
– branch predictors should be further improved,
– and/or branch prediction is only effective if the branch is predictable.

If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.

Question: what is the confidence of a branch prediction?

Page 67: Predicated instructions and multipath execution: confidence estimation

Confidence estimation is a technique for assessing the quality of a particular prediction. Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.

A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable.

Four classes are possible:
– correctly predicted with high confidence C(HC),
– correctly predicted with low confidence C(LC),
– incorrectly predicted with high confidence I(HC), and
– incorrectly predicted with low confidence I(LC).

Page 68: Implementation of a confidence estimator

Information from the branch prediction tables is used:
– Use of saturation counter information to construct a confidence estimator: speculate more aggressively when the confidence level is higher.
– Use of a miss distance counter table (MDC): each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise (see the sketch below).
– A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.

Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
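A minimal software sketch of the miss-distance-counter idea from the list above (table size, threshold and names are assumptions): each counter records how many correct predictions a branch has accumulated since its last misprediction; a branch whose counter exceeds the threshold is treated as high confidence.

#include <stdbool.h>
#include <stdint.h>

#define MDC_ENTRIES   1024u
#define MDC_THRESHOLD 4u      /* assumed: 4+ correct predictions in a row => high confidence */
#define MDC_MAX       15u     /* 4-bit saturating miss distance counters                     */

static uint8_t mdc[MDC_ENTRIES];

static unsigned mdc_index(uint32_t branch_pc)
{
    return (branch_pc >> 2) % MDC_ENTRIES;
}

/* Query at prediction time: high confidence if the miss distance exceeds the threshold. */
static bool high_confidence(uint32_t branch_pc)
{
    return mdc[mdc_index(branch_pc)] >= MDC_THRESHOLD;
}

/* Update at branch resolution time: reset on a misprediction, count up otherwise. */
static void mdc_update(uint32_t branch_pc, bool mispredicted)
{
    uint8_t *c = &mdc[mdc_index(branch_pc)];
    if (mispredicted)
        *c = 0;
    else if (*c < MDC_MAX)
        (*c)++;
}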

Page 69: Predicated instructions

Provide predicated or conditional instructions and one or more predicate registers. Predicated instructions use a predicate register as an additional input operand. The Boolean result of a condition test is recorded in a (one-bit) predicate register.

Predicated instructions are fetched, decoded and placed in the instruction window like non-predicated instructions.

It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved:
– A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
– Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but commits only if the predicate is true; otherwise the result is discarded.

Page 70: Predication example

if (x == 0) {   /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;      /* instruction independent of branch b1 */

Predicated version:

Pred = (x == 0)           /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;   /* the operations are only performed */
if Pred then d = e - f;   /* if Pred is set to true */
g = h * i;

Page 71: Predication

Predication is able to eliminate a branch and therefore the associated branch prediction, increasing the distance between mispredictions.

The run length of a code block is increased, enabling better compiler scheduling.

Predication affects the instruction set, adds a port to the register file, and complicates instruction execution. Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.

Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body. The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

Page 72: Eager (multipath) execution

Execution proceeds down both paths of a branch, and no prediction is made. When a branch resolves, all operations on the non-taken path are discarded.

Oracle execution: eager execution with unlimited resources gives the same theoretical maximum performance as perfect branch prediction.

With limited resources, the eager execution strategy must be employed carefully, so a mechanism is required that decides when to employ prediction and when eager execution, e.g. a confidence estimator.

Rarely implemented (IBM mainframes), but studied in several research projects: the Dansoft processor, the Polypath architecture, selective dual path execution, simultaneous speculation scheduling, and disjoint eager execution.

Page 73: (a) Single path speculative execution, (b) full eager execution, (c) disjoint eager execution

[Figure: branch trees comparing the three strategies, annotated with cumulative path probabilities (e.g. 0.7, 0.49, 0.34, 0.24, 0.17, 0.12 along the predicted path); disjoint eager execution assigns resources to the paths with the highest cumulative probability.]

Page 74: Prediction of indirect branches

Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.

Indirect branches occur frequently in machine code compiled from object-oriented programs like C++ and Java programs.

One simple solution is to update the PHT to include the branch target addresses.

Page 75: Branch handling techniques and implementations

Technique                              Implementation examples
No branch prediction                   Intel 8086
Static prediction
  always not taken                     Intel i486
  always taken                         Sun SuperSPARC
  backward taken, forward not taken    HP PA-7x00
  semistatic with profiling            early PowerPCs
Dynamic prediction:
  1-bit                                DEC Alpha 21064, AMD K5
  2-bit                                PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                   Intel PentiumPro, Pentium II, AMD K6
Hybrid prediction                      DEC Alpha 21264
Predication                            Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201 and many others
Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution               none yet

Page 76: High-bandwidth branch prediction

Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle; e.g. the GAg predictor is independent of the branch address.

When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
– Possible solution: a trace cache in combination with next-trace prediction.

Most likely a combination of branch handling techniques will be applied, e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

Page 77: Details of the superscalar pipeline

In-order section:
– instruction fetch (BTAC access, simple branch prediction)
  (fetch buffer)
– instruction decode
  • often: more complex branch prediction techniques
  • register rename
  (instruction window)
Out-of-order section:
– instruction issue to FU or reservation station
– execute till completion
In-order section:
– retire (commit or remove)
– write-back

Page 78: Decode stage

Superscalar processor: in-order delivery of instructions to the out-of-order execution kernel!

Instruction delivery:
– fetch and decode instructions at a higher bandwidth than they are executed,
– delivery task: keep the instruction window full, since deeper instruction look-ahead makes it possible to find more instructions to issue to the execution units.

Today's processors fetch and decode about 1.4 to 2 times as many instructions as they commit (because of mispredicted branch paths).

Typically the decode bandwidth is the same as the instruction fetch bandwidth. Multiple instruction fetch and decode is supported by a fixed instruction length.

Page 79: Decoding variable-length instructions

Variable instruction length is often the case for legacy CISC instruction sets such as the Intel x86 ISA; a multistage decode is then necessary:
– The first stage determines the instruction boundaries within the instruction stream.
– The second stage decodes the instructions, generating one or several micro-ops from each instruction.

Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions.

Page 80: Predecoding

Predecoding can be done when the instructions are transferred from memory or the secondary cache to the I-cache; the decode stage then becomes simpler.

MIPS R10000: predecodes each 32-bit instruction into a 36-bit format stored in the I-cache.
– The four extra bits indicate which functional unit should execute the instruction.
– The predecoding also rearranges operand- and destination-select fields to be in the same position for every instruction, and
– modifies opcodes to simplify decoding of integer or floating-point destination registers.

The decoder can decode this expanded format more rapidly than the original instruction format.

Page 81: Rename stage

Aim of register renaming: remove anti and output dependences dynamically in the processor hardware.

Register renaming is the process of dynamically associating physical registers (rename registers) with the architectural registers (logical registers) referred to in the instruction set of the architecture.

Implementation:
– a mapping table;
– a new physical register is allocated for every destination register specified in an instruction.

Each physical register is written only once after each assignment from the free list of available registers. If a subsequent instruction needs its value, that instruction must wait until it is written (true data dependence).
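The following C sketch illustrates the mapping-table approach described above (register counts, data structures and function names are illustrative assumptions): each architectural destination register is mapped to a fresh physical register from a free list, and source operands are translated through the current map.

#include <stdbool.h>

#define ARCH_REGS  32
#define PHYS_REGS  80                 /* assumed: architectural + rename registers */

typedef struct {
    int  map[ARCH_REGS];              /* architectural -> physical mapping         */
    int  free_list[PHYS_REGS];
    int  free_count;
} RenameTable;

/* Rename one instruction: translate sources through the map,
   then allocate a new physical register for the destination. */
static bool rename_instr(RenameTable *rt,
                         int src1, int src2, int dest,
                         int *psrc1, int *psrc2, int *pdest)
{
    *psrc1 = rt->map[src1];           /* read current mappings for the sources     */
    *psrc2 = rt->map[src2];

    if (rt->free_count == 0)
        return false;                 /* no free physical register: rename stalls  */

    *pdest = rt->free_list[--rt->free_count];
    rt->map[dest] = *pdest;           /* later readers of 'dest' see the new physical register */
    return true;
}

In a full design the previous mapping of the destination would also be remembered (e.g. in the reorder buffer) so that the old physical register can be freed at retirement; that bookkeeping is omitted in this sketch.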

Page 82: Two principal techniques to implement renaming

Separate sets of architectural registers and rename (physical) registers are provided:
– the physical registers contain values of completed but not yet retired instructions,
– the architectural (or logical) registers store the committed values,
– after commitment of an instruction, its result must be copied from the rename register to the architectural register.

Only a single set of registers is provided, and architectural registers are dynamically mapped to physical registers:
– the physical registers contain committed values and temporary results,
– after commitment of an instruction, the physical register is made permanent and no copying is necessary.

An alternative to dynamic renaming is the use of a large register file as defined for the Intel IA-64 (Itanium).

Page 83: Register rename logic

A multi-ported map table is accessed with the logical register designators as index. Additional dependence-check logic detects cases where a logical register is written by an earlier instruction of the same rename group and sets up the output MUXes accordingly.

[Figure: rename logic with a map table indexed by the logical source and destination registers, dependence-check logic (slice), and output MUXes selecting the physical register mapped to logical register R.]

Issue and dispatch

The notion of the instruction window comprises all the waiting stations between decode (rename) and execute stages.

The instruction window isolates the decode/rename from the execution stages of the pipeline.

Instruction issue is the process of initiating instruction execution in the processor's functional units:
– issue to a FU or a reservation station;
– dispatch, if a second issue stage exists, denotes when an instruction starts to execute in the functional unit.
The instruction-issue policy is the protocol used to issue instructions.
The processor's lookahead capability is the ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.

Instruction window organizations

Single-stage issue out of a central instruction window.
Multi-stage issue: operand availability and resource availability checking is split into two separate stages.
Decoupling of instruction windows: each instruction window is shared by a group of (usually related) functional units; most common: separate floating-point window and integer window.
Combination of multi-stage issue and decoupling of instruction windows:
– In a two-stage issue scheme with resource-dependent issue preceding the data-dependent dispatch, the first stage is done in-order, the second stage is performed out-of-order.

The following issue schemes are commonly used

Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor.
[Figure: Decode and Rename feeding a central Issue and Dispatch window, which issues to the Functional Units.]

Single-level, two-window issue

Single-level, two-window issue: single-level issue with instruction window decoupling using two separate windows
– most common: separate floating-point and integer windows, as in the HP PA-8000 processor.
[Figure: Decode and Rename feeding two separate Issue and Dispatch windows, each issuing to its own Functional Units.]

Two-level issue with multiple windows

Two-level issue with multiple windows, with a centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors).
[Figure: Decode and Rename feeding a central issue window, which dispatches into per-unit Reservation Stations in front of the Functional Units.]

Wakeup logic

[Figure: wakeup logic: issue window entries inst0 ... instN-1, each holding rdyL, opd tagL, opd tagR and rdyR fields; the broadcast result tags tag1 ... tagIW are compared (=) with the operand tags and ORed into the ready flags.]

The result tag is broadcast to all instructions in the window. If it matches an operand tag, the corresponding rdyL or rdyR flag is set. If both operands are ready, the ready flag is set and the REQ signal is raised (see the sketch below).
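The following C fragment is a small, hypothetical model of this tag-broadcast wakeup over a window of reservation entries; the window size and field names are assumptions for illustration only.

/* Hypothetical wakeup by result-tag broadcast -- illustration only. */
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 8

struct entry {
    bool valid;
    int  tagL, tagR;      /* tags of the two source operands     */
    bool rdyL, rdyR;      /* operand-ready flags                 */
    bool req;             /* raised when both operands are ready */
};

static void wakeup(struct entry win[], int result_tag)
{
    for (int i = 0; i < WINDOW; i++) {
        if (!win[i].valid)
            continue;
        if (win[i].tagL == result_tag) win[i].rdyL = true;
        if (win[i].tagR == result_tag) win[i].rdyR = true;
        win[i].req = win[i].rdyL && win[i].rdyR;
    }
}

int main(void)
{
    struct entry win[WINDOW] = {
        { .valid = true, .tagL = 5, .tagR = 7 },  /* waits on tags 5 and 7 */
    };
    wakeup(win, 5);
    wakeup(win, 7);
    printf("entry 0 requests issue: %s\n", win[0].req ? "yes" : "no");
    return 0;
}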

Selection logic

[Figure: selection logic over the issue window: a tree of arbiter cells; each cell ORs the req0-req3 inputs of its subtree into an anyreq signal towards the root cell and distributes enable/grant signals back down, with a priority encoder generating grant0-grant3.]

REQ signals are raised when all operands are available; the selection logic then grants one of the raised requests (a minimal sketch of this behaviour follows below).
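As an illustration of the selection step, the following C fragment models a simple priority selection over the raised REQ signals; real hardware builds this as a tree of arbiter cells, so this is only the functional behaviour, not the circuit.

/* Hypothetical priority selection over raised requests. */
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 8

/* Returns the index of the granted entry, or -1 if no REQ is raised.
 * Position 0 is treated as the highest-priority entry. */
static int select_one(const bool req[WINDOW])
{
    for (int i = 0; i < WINDOW; i++)
        if (req[i])
            return i;
    return -1;
}

int main(void)
{
    bool req[WINDOW] = { false, false, true, false, true };
    printf("granted entry: %d\n", select_one(req));   /* prints 2 */
    return 0;
}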

Execution stages

Various types of FUs, classified as:
– single-cycle (latency of one) or
– multiple-cycle (latency of more than one) units.
Single-cycle units produce a result one cycle after an instruction has started execution. Usually they are also able to accept a new instruction each cycle (throughput of one).
Multi-cycle units perform more complex operations that cannot be implemented within a single cycle.
Multi-cycle units
– can be pipelined to accept a new operation every cycle or every other cycle,
– or they are non-pipelined.
Another class of units exists that performs operations with variable cycle times.

Types of FUs

Single-cycle (single latency) units:
– (simple) integer and (integer-based) multimedia units.
Multicycle units that are pipelined (throughput of one):
– complex integer, floating-point, and (floating-point-based) multimedia units (also called multimedia vector units).
Multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):
– often the 64-bit floating-point operations in a floating-point unit.
Multicycle units that are often not pipelined:
– division unit, square root units, complex multimedia units.
Variable cycle time units:
– load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.

Media processors and multimedia units

Media processing (digital multimedia information processing) is the decoding, encoding, interpretation, enhancement, and rendering of digital multimedia information.

Today's video and 3D graphics require high bandwidth and processing performance:
– Separate special-purpose video chips, e.g. for MPEG-2, 3D graphics, etc., and multi-algorithm video chip sets
– Programmable video processors (very sophisticated DSPs): TMS320C82, Siemens TriCore, Hyperstone
– Media processors and media coprocessors: Chromatic MPACT media processor, Philips TriMedia TM-1, MicroUnity Media processor
– Multimedia units: multimedia extensions for general-purpose processors (VIS, MMX, MAX)


Utilization of subword parallelism (data-parallel instructions, SIMD).
Saturation arithmetic.
Additional arithmetic instructions, e.g. pavgusb (average instruction), masking and selection instructions, reordering and conversion.
(A small sketch of subword-parallel saturating arithmetic follows after the figure below.)

Media processors and multimedia units

[Figure: subword-parallel (SIMD) multiply: R1 = (x1, x2, x3, x4) and R2 = (y1, y2, y3, y4) yield R3 = (x1*y1, x2*y2, x3*y3, x4*y4).]
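The following C fragment sketches what such a subword-parallel, saturating operation computes: four 16-bit lanes packed into a 64-bit register are added lane by lane and clamped instead of wrapping. It is a scalar model for illustration, not a real SIMD intrinsic.

/* Hypothetical packed saturating add over four 16-bit lanes. */
#include <stdint.h>
#include <stdio.h>

static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;   /* clamp instead of wrapping */
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Packed saturating add: R3 = R1 +sat R2, lane-wise. */
static uint64_t padds16(uint64_t r1, uint64_t r2)
{
    uint64_t r3 = 0;
    for (int lane = 0; lane < 4; lane++) {
        int16_t a = (int16_t)(r1 >> (16 * lane));
        int16_t b = (int16_t)(r2 >> (16 * lane));
        r3 |= (uint64_t)(uint16_t)sat16((int32_t)a + b) << (16 * lane);
    }
    return r3;
}

int main(void)
{
    uint64_t r1 = 0x7FFF000100020003ULL;   /* x4 x3 x2 x1 */
    uint64_t r2 = 0x0001000100010001ULL;   /* y4 y3 y2 y1 */
    printf("R3 = %016llx\n", (unsigned long long)padds16(r1, r2));
    /* the top lane saturates at 0x7FFF instead of wrapping to 0x8000 */
    return 0;
}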

Multimedia extensions in today's microprocessors

Multimedia acceleration extensions (MAX-1, MAX-2) for HP PA-8000 and PA-8500
Visual instruction set (VIS) for UltraSPARC
Matrix manipulation extensions (MMX, MMX2) for the Intel P55C and Pentium II
AltiVec extensions for Motorola processors
Motion video instructions (MVI) for Alpha processors and
MIPS digital media extensions (MDMX) for MIPS processors.
3D graphical enhancements:
– ISSE (internet streaming SIMD extension) extends MMX in the Pentium III
– 3DNow! of AMD K6-2 and Athlon

3D graphical enhancement

The ultimate goal is the integrated real-time processing of multiple audio, video, and 2-D and 3-D graphics streams on a system CPU.

To speed up 3D applications on the main processor, fast low-precision floating-point operations are required:
– reciprocal instructions are of specific importance,
– e.g. square root reciprocal with low precision.
3D graphical enhancements apply so-called vector operations:
– execute two paired single-precision floating-point operations in parallel on two single-precision floating-point values stored in a 64-bit floating-point register.
Such vector operations are defined by the 3DNow! extension of AMD and by ISSE of Intel's Pentium III.
3DNow! defines 21 new instructions which are mainly paired single-precision floating-point operations. A sketch of refining a low-precision reciprocal square root estimate follows below.
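As an illustration of why a low-precision reciprocal square root estimate is useful, the following C sketch refines such an estimate with one Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y); the estimate function is a deliberately inaccurate stand-in, not the 3DNow! or SSE instruction sequence.

/* Hypothetical refinement of a low-precision rsqrt estimate. */
#include <math.h>
#include <stdio.h>

/* Stand-in for a hardware "rsqrt estimate" instruction (low precision). */
static float rsqrt_estimate(float x)
{
    return (1.0f / sqrtf(x)) * 1.01f;          /* deliberately ~1% off */
}

static float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);      /* one Newton-Raphson step */
}

int main(void)
{
    float x = 2.0f;
    printf("estimate: %.7f\n", rsqrt_estimate(x));
    printf("refined : %.7f\n", rsqrt_refined(x));
    printf("exact   : %.7f\n", 1.0f / sqrtf(x));
    return 0;
}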


Finalizing pipelined execution - completion, commitment, retirement and write-back

An instruction is completed when the FU has finished the execution of the instruction and the result is made available for forwarding and buffering.
– Instruction completion is out of program order.
Committing an operation means that the results of the operation have been made permanent and the operation is retired from the scheduler.
Retiring means removal from the scheduler with or without the commitment of operation results, whichever is appropriate.
– Retiring an operation does not imply that the results of the operation are either permanent or not permanent.
A result is made permanent:
– either by making the mapping of architectural to physical register permanent (if no separate physical registers exist), or
– by copying the result value from the rename register to the architectural register (in the case of separate physical and architectural registers) in a separate write-back stage after the commitment.

Precise interrupts

An interrupt or exception is called precise if the saved processor state corresponds with the sequential model of program execution where one instruction execution ends before the next begins.

The saved state should fulfil the following conditions:
– All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the processor state correctly.

– All instructions following the instruction indicated by the saved program counter are unexecuted and have not modified the processor state.

– If the interrupt is caused by an exception condition raised by an instruction in the program, the saved program counter points to the interrupted instruction.

– The interrupted instruction may or may not have been executed, depending on the definition of the architecture and the cause of the interrupt. Whichever is the case, the interrupted instruction has either ended execution or has not started.

Precise interrupts

Interrupts belong to two classes:
– Program interrupts or traps result from exception conditions detected during fetching and execution of specific instructions:
• illegal opcodes, numerical errors such as overflow, or
• part of normal execution, e.g., page faults.
– External interrupts are caused by sources outside of the currently executing instruction stream:
• I/O interrupts and timer interrupts.
• For such interrupts, restarting from a precise processor state should be made possible.
When an exception condition can be detected prior to issue, instruction issuing is simply halted and the processor waits until all previously issued instructions are retired.
Processors often have two modes of operation: one mode guarantees precise exceptions, and another mode, which is often 10 times faster, does not.

Reorder buffers

The reorder buffer keeps the original program order of the instructions after instruction issue and allows result serialization during the retire stage.

State bits record whether an instruction is on a speculative path and, once the branch is resolved, whether the instruction is on the correct path or must be discarded.
When an instruction completes, the state is marked in its entry.
Exceptions are marked in the reorder buffer entry of the triggering instruction.
The reorder buffer is implemented as a circular FIFO buffer.
Reorder buffer entries are allocated in the (first) issue stage and deallocated serially when the instruction retires; a minimal sketch of this circular buffer follows below.
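A minimal, hypothetical sketch of such a circular-FIFO reorder buffer in C; the size and the entry fields are illustrative only.

/* Hypothetical reorder buffer as a circular FIFO. */
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

struct rob_entry {
    int  dest;        /* architectural destination register    */
    bool completed;   /* set when the FU finishes              */
    bool exception;   /* marked by the triggering instruction  */
};

static struct rob_entry rob[ROB_SIZE];
static int head = 0, tail = 0, count = 0;

static int rob_alloc(int dest)                 /* at issue, in order  */
{
    if (count == ROB_SIZE) return -1;          /* ROB full: stall     */
    rob[tail] = (struct rob_entry){ .dest = dest };
    int idx = tail;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return idx;
}

static void rob_complete(int idx) { rob[idx].completed = true; }

static void rob_retire(void)                   /* serially, from head */
{
    while (count > 0 && rob[head].completed && !rob[head].exception) {
        printf("retire: commit result to r%d\n", rob[head].dest);
        head = (head + 1) % ROB_SIZE;
        count--;
    }
}

int main(void)
{
    int a = rob_alloc(3), b = rob_alloc(4);
    rob_complete(b);            /* completion may be out of order ... */
    rob_retire();               /* ... but nothing retires before a   */
    rob_complete(a);
    rob_retire();               /* now both retire in program order   */
    return 0;
}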

Reorder buffer variations

Reorder buffer holds only instruction execution states (results are in rename registers).
– Johnson's description of a reorder buffer in combination with a so-called future file. The future file is similar to the set of rename registers that are separate from the architectural registers.
– In contrast, Smith and Pleszkun describe a reorder buffer in combination with a future file, whereby the reorder buffer and the future file receive and store results at the same time.
Other reorder buffer type: the reorder buffer holds the result values of completed instructions instead of rename registers.
Moreover, the instruction window can be combined with the reorder buffer into a single buffer unit.

Other recovery mechanisms

Checkpoint repair mechanism:
– The processor provides a set of logical spaces, where each logical space consists of a full set of software-visible registers and memory.
– One is used for current execution, the others contain back-up copies of the in-order state that corresponds to previous points in execution.
– At various times during execution, a checkpoint is made by copying the architectural state of the current logical space to the back-up space.
– Restarting is accomplished by loading the contents of the appropriate back-up space into the current logical space.
History buffer:
– The (architectural) register file contains the current state, and the history buffer contains old register values which have been replaced by new values.
– The history buffer is managed as a LIFO stack, and the old values are used to restore a previous state if necessary (a minimal sketch follows below).
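A minimal sketch of the history-buffer idea in C, assuming a simple register file and a LIFO of (register, old value) pairs; the sizes are arbitrary.

/* Hypothetical history buffer: save old values, restore on rollback. */
#include <stdio.h>

#define NUM_REGS 8
#define HB_SIZE  16

static int regs[NUM_REGS];

struct hb_entry { int reg; int old_value; };
static struct hb_entry hb[HB_SIZE];
static int hb_top = 0;

/* Speculative write: save the old value, then update the register. */
static void spec_write(int reg, int value)
{
    hb[hb_top++] = (struct hb_entry){ reg, regs[reg] };
    regs[reg] = value;
}

/* Roll back to the state before the last n speculative writes. */
static void rollback(int n)
{
    while (n-- > 0 && hb_top > 0) {
        struct hb_entry e = hb[--hb_top];   /* LIFO order */
        regs[e.reg] = e.old_value;
    }
}

int main(void)
{
    regs[1] = 10;
    spec_write(1, 20);       /* speculative path writes r1 twice */
    spec_write(1, 30);
    rollback(2);             /* misprediction: undo both writes  */
    printf("r1 = %d\n", regs[1]);   /* prints 10 */
    return 0;
}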

Relaxing in-order retirement

The only relaxation that can exist is in the order of load and store instructions.
Result serialization is demanded by the serial instruction flow of the von Neumann architecture.
A fully parallel and highly speculative processor must look like a simple von Neumann processor as it was state-of-the-art in the fifties.
Possible relaxation:
– Assume an instruction sequence A ends with a branch that predicts an instruction sequence B, and B is followed by a sequence C which is not dependent on B.
– Thus C is executed independently of the branch direction.
– Therefore, instructions in C can start to retire before B.

The Intel P5 and P6 family

Year  Type                 Transistors (x1000)  Technology (µm)  Clock (MHz)  Issue  Word format  L1 cache         L2 cache
1993  Pentium              3100                 0.8              66           2      32-bit       2 x 8 kB
1994  Pentium              3200                 0.6              75-100       2      32-bit       2 x 8 kB
1995  Pentium              3200                 0.6/0.35         120-133      2      32-bit       2 x 8 kB
1996  Pentium              3300                 0.35             150-166      2      32-bit       2 x 8 kB
1997  Pentium MMX          4500                 0.35             200-233      2      32-bit       2 x 16 kB
1998  Mobile Pentium MMX   4500                 0.25             200-233      2      32-bit       2 x 16 kB
1995  PentiumPro           5500                 0.35             150-200      3      32-bit       2 x 8 kB         256/512 kB
1997  PentiumPro           5500                 0.35             200          3      32-bit       2 x 8 kB         1 MB
1998  Intel Celeron        7500                 0.25             266-300      3      32-bit       2 x 16 kB        --
1998  Intel Celeron        19000                0.25             300-333      3      32-bit       2 x 16 kB        128 kB
1997  Pentium II           7000                 0.25             233-450      3      32-bit       2 x 16 kB        256 kB/512 kB
1998  Mobile Pentium II    7000                 0.25             300          3      32-bit       2 x 16 kB        256 kB/512 kB
1998  Pentium II Xeon      7000                 0.25             400-450      3      32-bit       2 x 16 kB        512 kB/1 MB
1999  Pentium II Xeon      7000                 0.25             450          3      32-bit       2 x 16 kB        512 kB/2 MB
1999  Pentium III          8200                 0.25             450-1000     3      32-bit       2 x 16 kB        512 kB
1999  Pentium III Xeon     8200                 0.25             500-1000     3      32-bit       2 x 16 kB        512 kB
2000  Pentium 4            42000                0.18             1500         3      32-bit       8 kB / 12k µOps  256 kB

Notes: the Pentium and Pentium MMX rows form the P5 family; PentiumPro through Pentium III Xeon form the P6 family; the Pentium 4 uses the NetBurst microarchitecture. The transistor count of the 1998 Celeron (19000k) includes the L2 cache.

Micro-dataflow in PentiumPro 1995

... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ...

R.P. Colwell, R.L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.

PentiumPro and Pentium II/III

The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of P6 family.

This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.

The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.

A wide instruction window using an instruction pool. Optimized scheduling requires the fundamental “execute” phase to be

replaced by decoupled “issue/execute” and “retire” phases. This allows instructions to be started in any order but always be retired in the original program order.

Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.

Pentium® Pro Processor and Pentium II/III Microarchitecture

Pentium II/III

[Figure: Pentium II/III block diagram: external bus and L2 cache behind the bus interface unit; instruction fetch unit (with I-cache) and branch target buffer; instruction decode unit with microcode instruction sequencer and register alias table; reservation station unit feeding the functional units; memory interface unit, memory reorder buffer and D-cache unit; reorder buffer & retirement register file.]

Pentium II/III: The in-order section

The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.

The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.

Branch prediction: – two-level adaptive scheme of Yeh and Patt,– BTB contains 512 entries, maintains branch history information and the

predicted branch target address. – Branch misprediction penalty: at least 11 cycles, on average 15 cycles

The instruction decoder unit (IDU) is composed of three separate decoders

Pentium II/III: The in-order section (continued)

A decoder breaks the IA-32 instruction down into µops, each comprised of an opcode, two source operands and one destination operand. These µops are of fixed length.
– Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
– some instructions are decoded into one to four µops (by the general decoder),
– more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
The µops are sent to the register alias table (RAT) where register renaming is performed, i.e., the logical IA-32 based register references are converted into references to physical registers.
Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).

The fetch/decode unit

[Figure: (a) in-order section: I-cache and instruction fetch unit with Next_IP and branch target buffer, followed by the instruction decode unit, microcode instruction sequencer and register alias table; (b) instruction decoder unit (IDU): an alignment stage feeding two simple decoders and one general decoder, which emit µop1, µop2 and µop3 from the IA-32 instructions.]

The out-of-order execute section

When the µops flow into the ROB, they effectively take a place in program order.
µops also go to the RSU which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
After completion the result goes to two different places, RSU and ROB.
The RSU has five ports and can issue at a peak rate of 5 µops each cycle.

Latencies and throughput for Pentium II/III FUs

RSU Port  FU                          Latency  Throughput
0         Integer arithmetic/logical  1        1
0         Shift                       1        1
0         Integer mul                 4        1
0         Floating-point add          3        1
0         Floating-point mul          5        0.5
0         Floating-point div          long     nonpipelined
0         MMX arithmetic/logical      1        1
0         MMX mul                     3        1
1         Integer arithmetic/logical  1        1
1         MMX arithmetic/logical      1        1
1         MMX shift                   1        1
2         Load                        3        1
3         Store address               3        1
4         Store data                  1        1

Issue/Execute Unit

[Figure: Issue/Execute unit: the reservation station unit (connected to/from the reorder buffer) issues over five ports; port 0: integer, floating-point and MMX functional units; port 1: integer, MMX and jump functional units; port 2: load functional unit; ports 3 and 4: store functional units.]

The in-order retire section

A µop can be retired
– if its execution is completed,
– if it is its turn in program order,
– and if no interrupt, trap, or misprediction occurred.
Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
Three µops per clock cycle can be retired.

Retire unit

[Figure: retire unit: the reservation station unit and the memory interface unit are connected to/from the reorder buffer and the retirement register file, with a path to/from the D-cache.]

The Pentium II/III pipeline

[Figure: the Pentium II/III pipeline: (a) in-order front end: BTB0/BTB1 (BTB access), IFU0-IFU2 (I-cache access, fetch and predecode), IDU0/IDU1 (decode), RAT (register renaming) and reorder buffer read; (b) out-of-order part: RSU (reservation station, issue over ports 0-4), execution and completion, reorder buffer write-back; (c) retirement: reorder buffer read, retirement into the RRF.]

Pentium® Pro processor basic execution environment

[Figure: basic execution environment: eight 32-bit general purpose registers, six 16-bit segment registers, the 32-bit EFLAGS register and the EIP (instruction pointer) register, plus an address space of 0 to 2^32-1 bytes, which can be flat or segmented.]

Application programming registers

Pentium III

Pentium II/III summary and offspring

Pentium III in 1999, initially at 450 MHz (0.25 micron technology), former name Katmai

two 32 kB caches, faster floating-point performance

Coppermine is a shrink of Pentium III down to 0.18 micron.

Pentium 4

Was announced for mid-2000 under the code name Willamette
native IA-32 processor with Pentium III processor core, running at 1.5 GHz
42 million transistors
0.18 µm
20 pipeline stages (integer pipeline), IF and ID not included
trace execution cache (TEC) for the decoded µOps
NetBurst micro-architecture

Pentium 4 features

Rapid Execution Engine:

Intel: “Arithmetic Logic Units (ALUs) run at twice the processor frequency” Fact: Two ALUs, running at processor frequency connected with a multiplexer

running at twice the processor frequency

Hyper Pipelined Technology:

Twenty-stage pipeline to enable high clock rates Frequency headroom and performance scalability

Advanced dynamic execution

Very deep, out-of-order, speculative execution engine– Up to 126 instructions in flight (3 times larger than the Pentium III

processor)– Up to 48 loads and 24 stores in pipeline (2 times larger than the Pentium III

processor)

Branch prediction– based on µOPs– 4K entry branch target array (8 times larger than the Pentium III processor)– new algorithm (not specified), reduces mispredictions compared to gshare

of the P6 generation about one third

First level caches

12k µOP Execution Trace Cache (~100 k)

Execution Trace Cache that removes decoder latency from main execution loops

Execution Trace Cache integrates path of program execution flow into a single line

Low latency 8 kByte data cache with 2 cycle latency

Second level caches

Included on the die

Size: 256 kB

Full-speed, unified 8-way 2nd-level on-die Advance Transfer Cache

256-bit data bus to the level 2 cache

Delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency)

Bandwidth and performance increases with processor frequency

NetBurst microarchitecture

Streaming SIMD extensions 2 (SSE2) technology

SSE2 Extends MMX and SSE technology with the addition of 144 new instructions, which include support for:– 128-bit SIMD integer arithmetic operations.– 128-bit SIMD double precision floating point operations.– Cache and memory management operations.

Further enhances and accelerates video, speech, encryption, image and photo processing.

400 MHz Intel NetBurst microarchitecture system bus

Provides 3.2 GB/s throughput (3 times faster than the Pentium III processor).

Quad-pumped 100MHz scalable bus clock to achieve 400 MHz effective speed.

Split-transaction, deeply pipelined.

128-byte lines with 64-byte accesses.

Pentium 4 data types

Pentium 4

Pentium 4 offspring

Foster
– Pentium 4 with external L3 cache and DDR-SDRAM support
– provided for servers
– clock rate 1.7 - 2 GHz
– to be launched in Q2/2001
Northwood
– 0.13 µm technology
– new 478-pin socket


VLIW (very long instruction word):

Compiler packs a fixed number of instructions into a single VLIW instruction. The instructions within a VLIW instruction are issued and executed in parallel.
Example: high-end signal processors (TMS320C6201)

EPIC (explicit parallel instruction computing):

Evolution of VLIW.
Example: Intel's IA-64, exemplified by the Itanium processor

VLIW or EPIC

VLIW

VLIW (very long instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.

All operations specified within a VLIW instruction must be independent of one another.

Some of the key issues of a (V)LIW processor:– (very) long instruction word (up to 1 024 bits per instruction),– each instruction consists of multiple independent parallel operations,– each operation requires a statically known number of cycles to complete,– a central controller that issues a long instruction word every cycle,– multiple FUs connected through a global shared register file.

VLIW and superscalar

Sequential stream of long instruction words. Instructions scheduled statically by the compiler. Number of simultaneously issued instructions is fixed during compile-time. Instruction issue is less complicated than in a superscalar processor. Disadvantage: VLIW processors cannot react on dynamic events,

e.g. cache misses, with the same flexibility as superscalars.
The number of instructions in a VLIW instruction word is usually fixed.
Padding VLIW instructions with no-ops is needed in case the full issue bandwidth is not met. This increases code size. More recent VLIW architectures use a denser code format which allows the no-ops to be removed.

VLIW processors take advantage of spatial parallelism.

EPIC: a paradigm shift

Superscalar RISC solution

– Based on sequential execution semantics

– Compiler’s role is limited by the instruction set architecture

– Superscalar hardware identifies and exploits parallelism

EPIC solution – (the evolution of VLIW)

– Based on parallel execution semantics

– EPIC ISA enhancements support static parallelization

– Compiler takes greater responsibility for exploiting parallelism

– Compiler / hardware collaboration often resembles superscalar

EPIC: a paradigm shift

Advantages of pursuing EPIC architectures

– Make wide issue & deep latency less expensive in hardware

– Allow processor parallelism to scale with additional VLSI density

Architect the processor to do well with in-order execution

– Enhance the ISA to allow static parallelization

– Use compiler technology to parallelize program

– However, a purely static VLIW is not appropriate for general-purpose use

The fusion of VLIW and superscalar techniques

Superscalars need improved support for static parallelization– Static scheduling– Limited support for predicated execution

VLIWs need improved support for dynamic parallelization– Caches introduce dynamically changing memory latency– Compatibility: issue width and latency may change with new hardware – Application requirements - e.g. object oriented programming with dynamic

binding.
EPIC processors exhibit features derived from both

– Interlock & out-of-order execution hardware are compatible with EPIC (but not required!)

– EPIC processors can use dynamic translation to parallelize in software

Many EPIC features are taken from VLIWs

Minisupercomputer products stimulated VLIW research (FPS, Multiflow, Cydrome)

Minisupercomputers were specialized, costly, and short-lived

Traditional VLIWs not suited to general purpose computing

VLIW resurgence in single chip DSP & media processors

Minisupercomputers exaggerated forward-looking challenges:

Long latency

Wide issue

Large number of architected registers

Compile-time scheduling to exploit exotic amounts of parallelism

EPIC exploits many VLIW techniques

Shortcomings of early VLIWs

Expensive multi-chip implementations

No data cache

Poor "scalar" performance

No strategy for object code compatibility

EPIC design challenges

Develop architectures applicable to general-purpose computing

– Find substantial parallelism in ”difficult to parallelize” scalar programs

– Provide compatibility across hardware generations– Support emerging applications (e.g. multimedia)

Compiler must find or create sufficient ILP

Combine the best attributes of VLIW & superscalar RISC (incorporated best concepts from all available sources)

Scale architectures for modern single-chip implementation

EPIC Processors, Intel's IA-64 ISA and Itanium

Joint R&D project by Hewlett-Packard and Intel (announced in June 1994) This resulted in explicitly parallel instruction computing (EPIC) design style:

– specifying ILP explicitly in the machine code, that is, the parallelism is encoded directly into the instructions similarly to VLIW;

– a fully predicated instruction set;– an inherently scalable instruction set (i.e., the ability to scale to a lot of

FUs);– many registers;– speculative execution of load instructions

IA-64 Architecture

Unique architecture features & enhancements– Explicit parallelism and templates – Predication, speculation, memory support, and others– Floating-point and multimedia architecture

IA-64 resources available to applications– Large, application visible register set– Rotating registers, register stack, register stack engine

IA-32 & PA-RISC compatibility models

Today's architecture challenges

Performance barriers :– Memory latency– Branches – Loop pipelining and call / return overhead

Headroom constraints :– Hardware-based instruction scheduling

• Unable to efficiently schedule parallel execution– Resource constrained

• Too few registers• Unable to fully utilize multiple execution units

Scalability limitations :– Memory addressing efficiency

Intel's IA-64 ISA

Intel 64-bit Architecture (IA-64) register model:– 128 64-bit general purpose registers GR0-GR127

to hold values for integer and multimedia computations• each register has one additional NaT (Not a Thing) bit to indicate

whether the value stored is valid, – 128 82-bit floating-point registers FR0-FR127

• registers f0 and f1 are read-only with values +0.0 and +1.0,
– 64 1-bit predicate registers PR0-PR63
• the first register PR0 is read-only and always reads 1 (true)
– 8 64-bit branch registers BR0-BR7 to specify the target addresses of

indirect branches

IA-64's large register file

[Figure: IA-64's large register file: integer registers GR0-GR127 (64 bits plus a NaT bit; GR0-GR31 static, GR32-GR127 stacked and rotating), floating-point registers FR0-FR127 (82 bits; FR0-FR31 static, FR32-FR127 rotating), predicate registers PR0-PR63 (1 bit; PR0-PR15 static, PR16-PR63 rotating) and branch registers BR0-BR7 (64 bits).]

Intel's IA-64 ISA

– IA-64 instructions are 41-bit (previously stated 40 bit) long and consist of • op-code, • predicate field (6 bits),• two source register addresses (7 bits each), • destination register address (7 bits), and • special fields (includes integer and floating-point arithmetic).

– The 6-bit predicate field in each IA-64 instruction refers to a set of 64 predicate registers.

– 6 types of instructions:
• A: Integer ALU (I-unit or M-unit)
• I: Non-ALU integer (I-unit)
• M: Memory (M-unit)
• B: Branch (B-unit)
• F: Floating-point (F-unit)
• L: Long Immediate (I-unit)

– IA-64 instructions are packed by compiler into bundles.

IA-64 bundles

A bundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64 instructions along with a so-called 5-bit template that contains instruction grouping information.

IA-64 does not insert no-op instructions to fill slots in the bundles. The template explicitly indicates (ADAG):

– first 4 bits: types of instructions– last bit (stop bit): whether the bundle can be executed in parallel with the

next bundle – (previous literature): whether the instructions in the bundle can be

executed in parallel or if one or more must be executed serially (no more in ADAG description)

Bundled instructions don't have to be in their original program order, and they can even represent entirely different paths of a branch.

Also, the compiler can mix dependent and independent instructions together in a bundle, because the template keeps track of which is which.
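The bundle layout can be illustrated with a small C fragment that pulls the 5-bit template and the three 41-bit instruction slots out of a 128-bit word; the bit positions follow the description above, and the sample bundle value is an arbitrary assumption.

/* Hypothetical decomposition of a 128-bit bundle into its fields. */
#include <stdint.h>
#include <stdio.h>

struct bundle {
    uint64_t lo, hi;     /* 128 bits as two 64-bit halves (bit 0 in lo) */
};

/* Extract a bit field [pos, pos+len) that may straddle the two halves. */
static uint64_t get_field(struct bundle b, int pos, int len)
{
    uint64_t v;
    if (pos + len <= 64)
        v = b.lo >> pos;
    else if (pos >= 64)
        v = b.hi >> (pos - 64);
    else
        v = (b.lo >> pos) | (b.hi << (64 - pos));
    return v & ((1ULL << len) - 1);
}

int main(void)
{
    struct bundle b = { .lo = 0x1234567890ABCDEFULL, .hi = 0x0FEDCBA987654321ULL };

    uint64_t tmpl  = get_field(b, 0, 5);     /* 5-bit template            */
    uint64_t inst0 = get_field(b, 5, 41);    /* three 41-bit instructions */
    uint64_t inst1 = get_field(b, 46, 41);
    uint64_t inst2 = get_field(b, 87, 41);

    printf("template %llu, slots %011llx %011llx %011llx\n",
           (unsigned long long)tmpl,
           (unsigned long long)inst0, (unsigned long long)inst1,
           (unsigned long long)inst2);
    return 0;
}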

IA-64: Explicitly parallel architecture

IA-64 template specifies – The type of operation for each instruction

• MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB– Intra-bundle relationship

• M / MI or MI / I– Inter-bundle relationship

Most common combinations covered by templates– Headroom for additional templates

Simplifies hardware requirements Scales compatibly to future generations

[Figure: a 128-bit bundle consists of a 5-bit template and three 41-bit instructions (instruction 0, 1, 2); M = memory, I = integer, F = floating-point, L = long immediate, B = branch; example template MMI: memory (M), memory (M), integer (I).]

IA-64 scalability

A single bundle containing three instructions corresponds to a set of three FUs.

If an IA-64 processor had n sets of three FUs each, then using the template information it would be possible to chain the bundles to create an instruction word of n bundles in length.

This is the way to provide scalability of IA-64 to any number of FUs.

Predication in IA-64 ISA

Branch prediction: paying a heavy penalty in lost cycles if mispredicted.
IA-64 compilers use predication to remove the penalties caused by mispredicted branches and by the need to fetch from noncontiguous target addresses when jumping over blocks of code beyond branches.

When the compiler finds a branch statement it marks all the instructions that represent each path of the branch with a unique identifier called a predicate.

IA-64 defines a 6-bit field (predicate register address) in each instruction to store this predicate.

64 unique predicates available at one time. Instructions that share a particular branch path will share the same predicate.

IA-64 also defines an advanced branch prediction mechanism for branches which cannot be removed.

If-then-else statement

Predication in IA-64 ISA

At run time, the CPU scans the templates, picks out the independent instructions, and issues them in parallel to the FUs.

Predicated branch: the processor executes the code for every possible branch outcome.

In spite of the fact that the processor has probably executed some instructions from both possible paths, none of the (possible) results is stored yet.

To do this, the processor checks the predicate register of each of these instructions.
– If the predicate register contains a 1, the instruction is on the TRUE path (i.e., the valid path), so the processor retires the instruction and stores the result.
– If the register contains a 0, the instruction is invalid, so the processor discards the result.
A minimal sketch of these semantics is given below.
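The following C fragment is a hedged sketch of these semantics: both paths of an if-then-else are executed, but a result is committed only when the guarding predicate holds 1. It models the idea only and is not IA-64 code.

/* Hypothetical model of predicated execution. */
#include <stdbool.h>
#include <stdio.h>

static bool pr[64];     /* predicate registers, pr[0] always true */
static int  r[32];      /* a few general registers                */

/* "(p) rd = a + b": executed unconditionally, committed only if pr[p]. */
static void padd(int p, int *rd, int a, int b)
{
    int result = a + b;        /* work is done on both paths         */
    if (pr[p])
        *rd = result;          /* ...but only the valid path commits */
}

int main(void)
{
    pr[0] = true;
    r[1] = 5; r[2] = 7;

    /* if (r1 < r2) r3 = r1 + 10; else r3 = r2 + 20;
     * A compare sets a pair of complementary predicates p1/p2. */
    pr[1] =  (r[1] < r[2]);
    pr[2] = !(r[1] < r[2]);
    padd(1, &r[3], r[1], 10);   /* then-path, guarded by p1 */
    padd(2, &r[3], r[2], 20);   /* else-path, guarded by p2 */

    printf("r3 = %d\n", r[3]);  /* prints 15 */
    return 0;
}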

Speculative loading

Load data from memory well before the program needs it, and thus to effectively minimize the impact of memory latency.

Speculative loading is a combination of compile-time and run-time optimizations (compiler-controlled speculation).
The compiler looks for any instructions that will need data from memory and, whenever possible, hoists the load to an earlier point in the instruction stream, ahead of the instruction that will actually use the data.

Today's superscalar processors: – load can be hoisted up to the first branch instruction which represents a

barrier Speculative loading combined with predication gives the compiler more

flexibility to reorder instructions and to shift loads above branches.

Speculative loading - "control speculation"

Speculative loading

speculative load instruction ld.s
speculative check instruction chk.s

The compiler:– inserts the matching check immediately before the particular instruction

that will use the data,– rearranges the surrounding instructions so that the processor can issue

them in parallel. At run-time:

– the processor encounters the ld.s instruction first and tries to retrieve the data from the memory.

– ld.s performs memory fetch and exception detection (e.g., checks the validity of the address).

– If an exception is detected, ld.s does not deliver the exception.– Instead, ld.s only marks the target register (by setting a token bit).

Speculative loading - "data speculation"

The mechanism can also be used to move a load above a store even if it is not known whether the load and the store reference overlapping memory locations.
ld.a   advanced load
...
chk.a  check, then use the data

Speculative loading/checking

Exception delivery is the responsibility of the matching chk.s instruction. – When encountered, chk.s calls the operating system routine if the target

register is marked (i.e, if the corresponding token bit is set), and does nothing otherwise.

Whether the chk.s instruction will be encountered may depend on the outcome of the branch instruction. Thus, it may happen that an exception detected by ld.s is never

delivered.

Speculative loading with ld.s/chk.s machine level instructions resembles the TRY/CATCH statements in some high-level programming languages (e.g., Java).
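A minimal C sketch of this deferred-exception (token) behaviour, with ld_s and chk_s as stand-ins for the ld.s and chk.s instructions and a NULL pointer standing in for a faulting address; it models the idea, not the IA-64 implementation.

/* Hypothetical model of control speculation with a NaT-style token. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct reg { long value; bool nat; };   /* value + "Not a Thing" token */

/* Speculative load: on a bad address, set the token instead of faulting. */
static void ld_s(struct reg *dst, const long *addr)
{
    if (addr == NULL) {            /* stand-in for "exception detected" */
        dst->nat = true;
        return;
    }
    dst->value = *addr;
    dst->nat = false;
}

/* Check: deliver the deferred exception only if the value is really used. */
static void chk_s(const struct reg *r)
{
    if (r->nat) {
        fprintf(stderr, "deferred exception delivered by chk.s\n");
        exit(1);                   /* stand-in for the OS recovery routine */
    }
}

int main(void)
{
    long x = 42;
    struct reg r1;

    ld_s(&r1, &x);     /* hoisted above the branch by the compiler */
    if (x > 0) {       /* only this path actually uses the value   */
        chk_s(&r1);
        printf("loaded %ld\n", r1.value);
    }
    return 0;
}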

Software pipelining via rotating registers

Software pipelining - improves performance by overlapping execution of different software loops - execute more loops in the same amount of time

Traditional architectures need complex software loop unrolling for pipelining
– results in code expansion --> increases cache misses --> reduces performance
IA-64 utilizes rotating registers to achieve software pipelining
– avoids code expansion --> reduces cache misses --> higher performance
[Figure: sequential loop execution versus software-pipelined loop execution over time.]

SW pipelining by modulo scheduling

(Cydrome) Specialized branch and rotating registers eliminate code replication

Initiation Interval (II)


Minimum Initiation Interval (MII) = MAX(Resource-constrained MII, Recurrence-constrained MII)

SW pipelining by register rotation

Rotating registers– Floating-point: f32-f127– General-purpose: g32-g127; can be set by an alloc imm instruction– Predicate: p16-p63

Additional registers needed:– Current frame marker CFM: describes state of the general register stack plus

three register rename base values used in register rotation:• rr.pr (6 bit)• rr.fr (7 bit)• rr.gr (7 bit)

– within the “Application registers”:• loop count LC (64 bit register): decremented by counted-loop-type

branches• Epilog count EC (6-bit registers): for counting the epilog stages

SW pipelining by register rotation - Counted loop example

L1: ld4 r4 = [r5],4 ;; // cycle 0, load postinc 4

add r7 = r4,r9 ;; // cycle 2

st4 [r6] = r7,4 // cycle 3 store postinc 4

br.loop L1 ;;

All instructions from iteration X are executed before iteration X+1

Assume store from iteration x is independent from load from iteration x+1:

conceptual view of a single sw pipelined iteration:

Stage 1: (p16) ld4 r4 = [r5],4

Stage 2: (p17) ------------- // empty stage

Stage 3: (p18) add r7 = r4,r9

Stage 4: (p19) st4 [r6] = r7,4        // the ;; separates instruction groups

SW pipelining by register rotation - Counted loop example

Stage 1: (p16) ld4 r4 = [r5],4

Stage 2: (p17) ------------- // empty stage

Stage 3: (p18) add r7 = r4,r9

Stage 4: (p19) st4 [r6] = r7,4

is translated to:

mov lc = 199 // LC = loop count - 1

mov ec = 4 // EC = epilog stages + 1

mov pr.rot = 1<<16 ;; // PR16 = 1, rest = 0

L1:

(p16) ld4 r32 = [r5],4 // Cycle 0

(p18) add r35 = r34,r9 // Cycle 0

(p19) st4 [r6] = r36,4 // Cycle 0

br.ctop L1 ;; // Cycle 0

SW pipelining by register rotation - Optimizations and limitations

Register rotation removes the requirement that kernel loops be unrolled to allow software renaming of the registers.

Speculation can further increase loop performance by removing dependence barriers.

Technique works also for while loops. Works also with predicated instructions (instead of assigning stage

predicates). Also possible for multiple-exit loops (epilog get more complicated). Limitation:

– Loops with very small trip counts may decrease performance when pipelined.
– Not desirable to pipeline a floating-point loop that contains a function call (the number of fp registers used is not known, and it may be hard to find empty slots for the instructions needed to save and restore the caller-saved floating-point registers across the function call).

IA-64 register stack

Traditional register stacks:
– eliminate the need for save / restore by reserving fixed blocks in the register file for procedures,
– however, fixed blocks waste resources.
IA-64 register stack:
– IA-64 is able to reserve variable block sizes,
– no wasted resources.
[Figure: procedures A-D mapped onto fixed-size register blocks (traditional register stacks) versus variable-size register stack frames (IA-64).]

IA-64 support for procedure calls

Subset of general registers are organized as a logically infinite set of stack frames that are allocated from a finite pool of physical registers

Stacked registers are GR32 up to a user-configurable maximum of GR127.
A called procedure specifies the size of its new stack frame using the alloc instruction.
Output registers of the caller are overlapped with the input registers of the called procedure.

Register Stack Engine:
– management of the register stack by hardware
– moves contents of physical registers between the general register file and memory
– provides a programming model that looks like an unlimited register stack
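
A rough conceptual model of the overlapping frames can be written in plain C (this is only a sketch of the idea; the frame sizes, values and helper names are invented for illustration and do not correspond to real alloc operands):

#include <stdio.h>

#define PHYS_REGS 96                 /* stacked registers GR32..GR127 */
static long phys[PHYS_REGS];         /* finite pool of physical registers */

/* A frame is a window onto the physical pool: 'locals' registers followed
   by 'outputs' registers. */
typedef struct { int base; int locals; int outputs; } Frame;

/* "alloc" for the callee: its frame starts where the caller's outputs
   begin, so the caller's outputs become the callee's inputs with no copy. */
static Frame alloc_frame(Frame caller, int locals, int outputs)
{
    Frame f = { caller.base + caller.locals, locals, outputs };
    return f;
}

int main(void)
{
    Frame caller = { 0, 4, 2 };                  /* caller: 4 locals, 2 outputs */
    phys[caller.base + caller.locals + 0] = 7;   /* caller writes its out0 */
    phys[caller.base + caller.locals + 1] = 9;   /* caller writes its out1 */

    Frame callee = alloc_frame(caller, 3, 1);    /* callee: 3 locals, 1 output */
    /* The callee's first registers are the caller's outputs: */
    printf("callee sees %ld and %ld\n", phys[callee.base], phys[callee.base + 1]);
    return 0;
}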


167

Full binary IA-32 instruction compatibility

[Figure: the IA-32 and IA-64 instruction sets execute on the same IA-64 hardware. A branch to IA-32 switches the processor into IA-32 mode; a jump to IA-64, or intercepts, exceptions and interrupts, switch it back to IA-64 mode. Both modes share the system resources, execution units and registers.]

• IA-32 instructions supported through shared hardware resources
• Performance similar to volume IA-32 processors


168

Full binary compatibility for PA-RISC

Transparency:
– Dynamic object code translator in HP-UX automatically converts PA-RISC code to native IA-64 code
– Translated code is preserved for later reuse

Correctness:
– Has passed the same tests as the PA-8500

Performance:
– Close PA-RISC to IA-64 instruction mapping
– Translation on average takes 1-2% of the time; native instruction execution takes 98-99%
– Optimization done for wide instructions, predication, speculation, large register sets, etc.
– PA-RISC optimizations carry over to IA-64


169

Delivery of streaming media

Audio and video functions regularly perform the same operation on arrays of data values
– IA-64 manages its resources to execute these functions efficiently
  • Able to manage general registers as 8x8, 4x16, or 2x32-bit elements (see the sketch below)
  • Multimedia operands/results reside in general registers

IA-64 accelerates compression / decompression algorithms
– Parallel ALU, multiply, shifts
– Pack/unpack: converts between different element sizes

Fully compatible with
– IA-32 MMX technology,
– Streaming SIMD Extensions, and
– PA-RISC MAX2
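
As a rough illustration of treating one 64-bit general register as several smaller elements, here is a plain-C SWAR (SIMD-within-a-register) sketch; it is not IA-64 intrinsic or assembly code, and the function and variable names are made up:

#include <stdint.h>
#include <stdio.h>

/* Lane-wise add of four 16-bit elements packed into 64-bit words, the kind
   of operation the IA-64 parallel ALU performs in a single instruction. */
static uint64_t padd4_16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        r |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
    }
    return r;
}

int main(void)
{
    uint64_t a = 0x0001000200030004ULL;
    uint64_t b = 0x0010002000300040ULL;
    printf("%016llx\n", (unsigned long long)padd4_16(a, b)); /* 0011002200330044 */
    return 0;
}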


170

IA-64 3D graphics capabilities

Many geometric calculations (transforms and lighting) use 32-bit floating-point numbers

IA-64 configures registers for maximum 32-bit floating-point performance
– Floating-point registers treated as 2x32-bit single-precision registers
– Able to execute fast divide
– Achieves up to 2X performance boost on 32-bit floating-point data

Full support for Pentium® III processor Streaming SIMD Extensions (SSE)


171

IA-64 for scientific analysis

Variety of software optimizations supported
– Load double pair: doubles bandwidth between L1 and registers
– Full predication and speculation support
  • NaT value to propagate deferred exceptions
  • Alternate IEEE flag sets allow preserving architectural flags
– Software pipelining for large loop calculations

High precision & range internal format: 82 bits
– Mixed operations supported: single, double, extended, and 82-bit
– Interfaces easily with memory formats
  • Simple promotion/demotion on loads/stores
– Iterative calculations converge faster
– Ability to handle numbers much larger than the RISC competition without overflow


172

IA-64 Floating-Point Architecture

128 registers
– Allows parallel execution of multiple floating-point operations

Simultaneous Multiply-Accumulate (FMAC)
– 3-input, 1-output operation: a * b + c = d
– Shorter latency than independent multiply and add
– Greater internal precision and a single rounding error

[Figure: 128-entry floating-point register file (82-bit floating-point numbers) with multiple read ports and multiple write ports, feeding two FMAC units (FMAC #1, FMAC #2) that each compute A × B + C = D, connected to memory.]
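
To see why the single rounding of an FMAC matters, the following small C program contrasts a fused multiply-add with a separately rounded multiply and add; C99's fma() is used here only as a stand-in for the IA-64 FMAC, and the operand values are chosen purely for illustration:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 0x1p-27;          /* chosen so that a*b needs extra precision */
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    double fused    = fma(a, b, c);    /* one rounding: keeps the tiny -2^-54 term */
    double separate = a * b + c;       /* a*b is rounded to 1.0 first, term is lost */

    printf("fused    = %.17g\n", fused);     /* about -5.55e-17 */
    printf("separate = %.17g\n", separate);  /* 0 */
    return 0;
}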


173

Memory support for high performance technical computing

Scientific analysis, 3D graphics and other technical workloads tend to be predictable & memory bound

IA-64 data pre-fetching operations allow fast access to critical information
– Reduces memory latency impact

IA-64 is able to specify cache allocation (a source-level sketch follows below)
– Cache hints from load / store operations allow data to be placed at a specific cache level
– Efficient use of caches, efficient use of bandwidth
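
As a source-level analogue of such prefetching and cache control, here is a hedged C sketch: GCC's __builtin_prefetch stands in for IA-64's own prefetch and locality hints, and the prefetch distance of 64 elements is an arbitrary illustrative choice:

#include <stddef.h>

/* Sum an array while prefetching data a fixed distance ahead of its use,
   so the loads tend to hit the cache instead of stalling on memory. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 0);  /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}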


174

IA server/workstation roadmap

[Figure: IA server/workstation roadmap, performance versus time ('98-'03) and process technology (0.25µ, 0.18µ, 0.13µ). IA-32 line: Pentium II Xeon, Pentium III Xeon, Foster, future IA-32 processors. IA-64 line: Itanium, McKinley, then Madison (performance) and Deerfield (price/performance).]


175

Itanium

64-bit processor, not in the Pentium, Pentium Pro, Pentium II/III line
Targeted at servers with moderate to large numbers of processors
Full compatibility with Intel's IA-32 ISA
EPIC (explicitly parallel instruction computing) is applied
6-wide issue (two 3-instruction EPIC bundles)
10-stage pipeline
4 integer, 4 multimedia, 2 load/store, 3 branch, 2 extended floating-point, and 2 single-precision floating-point units
Multi-level branch prediction besides predication
16 KB 4-way set-associative D- and I-caches
96 KB 6-way set-associative L2 cache
4 MB L3 cache (on package)
800 MHz, 0.18 micron process (at beginning of 2001)
Shipments end of 1999 or mid-2000 or ??


176

Conceptual view of Itanium


177

Itanium processor core pipeline

ROT: instruction rotation

Pipelined access of the large register file:
– WDL: word line decode
– REG: register read

DET: exception detection (~ retire stage)


178

Itanium processor


179

Itanium die plot


180

Itanium vs. Willamette (P4)

Itanium announced at 800 MHz; the P4 announced at 1.2 GHz. The P4 may be faster running IA-32 code than the Itanium running IA-64 code.

Itanium probably won't compete with contemporary IA-32 processors, but Intel will complete the Itanium design anyway.

Intel hopes for the Itanium successor McKinley, which will be out only one year later.