
Lecture 9: Instruction Level Parallelism

Topics
  Review of Appendix C
  Dynamic Scheduling: Scoreboarding, Tomasulo

Readings: Chapter 3

October 8, 2014
CSCE 513 Computer Architecture


Overview

Last Time
  Dynamic Scheduling: MIPS R4000 pipeline / scoreboard

New
  Stalls in pipeline diagrams revisited
  Scoreboard review
  Dealing with control hazards in the 5-stage pipeline

References: Appendix A.8, Chapter 3


Review: Figure 3.1 – ILP techniques references


Figure C.27 – Forwarding Paths to ALU


Figure C.24 – Stalls with forwarding


Figure C.28 – Branch hazards

Figure C.28: The stall from branch hazards can be reduced by moving the zero test and branch-target calculation into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes 1 cycle from the 3-cycle stall for branches. The first change is to move both the branch-target address calculation and the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase, using either the branch-target address computed during ID or the incremented PC computed during IF. In comparison, Figure C.22 obtained the branch-target address from the EX/MEM register and wrote the result during the MEM clock cycle. As mentioned in Figure C.22, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle.


Figure C.29 – Revised IF, ID

Pipeline fetch and decode stages for branch instructions:

Fetch
  IF/ID.IR  <- Mem[PC]
  IF/ID.NPC, PC <- (if ((IF/ID.opcode == branch) & (Regs[IF/ID.IR[rs]] op 0))
                      {IF/ID.NPC + sign-extend(IF/ID.IR[Imm] << 2)}
                    else {PC + 4})

Decode
  ID/EX.A   <- Regs[IF/ID.IR[rs]]
  ID/EX.B   <- Regs[IF/ID.IR[rt]]
  ID/EX.IR  <- IF/ID.IR
  ID/EX.Imm <- (IF/ID.IR[16])^16 ## IF/ID.IR[16..31]    ; sign-extend the 16-bit immediate


Figure C.14 – Branch delay slots


C.4 What Makes Pipelining Hard to Implement? Exceptions!

  I/O device request
  Invoking an operating system service from a user program
  Tracing instruction execution
  Breakpoint (programmer-requested interrupt)
  Integer arithmetic overflow
  FP arithmetic anomaly
  Page fault (not in main memory)
  Misaligned memory accesses (if alignment is required)
  Memory protection violation
  Using an undefined or unimplemented instruction
  Hardware malfunctions


Introduction

Pipelining became a universal technique in 1985
  Overlaps execution of instructions
  Exploits "instruction-level parallelism"

Beyond this, there are two main approaches:
  Hardware-based dynamic approaches
    Used in server and desktop processors
    Not used as extensively in PMD processors
  Compiler-based static approaches
    Not as successful outside of scientific applications


Instruction-Level Parallelism

When exploiting instruction-level parallelism, the goal is to minimize CPI

  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Parallelism within a basic block is limited
  Typical size of basic block = 3-6 instructions
  Must optimize across branches
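As a quick arithmetic check of the CPI equation above, here is a minimal C sketch; the stall components are illustrative assumptions, not measurements:

#include <stdio.h>

/* Pipeline CPI = ideal CPI + structural + data-hazard + control stalls,
 * each stall term measured in average stall cycles per instruction.
 * The values below are made-up example numbers. */
int main(void) {
    double ideal_cpi   = 1.0;   /* one instruction per cycle with no stalls */
    double structural  = 0.05;  /* assumed stalls per instruction */
    double data_hazard = 0.20;
    double control     = 0.15;

    double pipeline_cpi = ideal_cpi + structural + data_hazard + control;
    printf("Pipeline CPI = %.2f\n", pipeline_cpi);              /* 1.40 */

    /* Appendix C's idealized speedup estimate: pipeline depth / pipeline CPI,
     * assuming a 5-stage pipeline. */
    printf("Approx. speedup over unpipelined = %.2f\n", 5.0 / pipeline_cpi);
    return 0;
}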


Data Dependence

Loop-Level Parallelism
  Unroll loop statically or dynamically
  Use SIMD (vector processors and GPUs)

Challenges: data dependence
  Instruction j is data dependent on instruction i if
    » instruction i produces a result that may be used by instruction j, or
    » instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

Dependent instructions cannot be executed simultaneously
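A tiny illustration of the definition above, using C statements as stand-ins for instructions (the variables and values are arbitrary):

#include <stdio.h>

int main(void) {
    int b = 2, c = 3;
    int a = b + c;    /* instruction i: produces a                              */
    int d = a * 2;    /* instruction j: data dependent on i (uses a)            */
    int e = d - b;    /* instruction k: dependent on j, hence transitively on i */
    printf("%d %d %d\n", a, d, e);  /* the chain i -> j -> k cannot execute simultaneously */
    return 0;
}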


Data Dependence

Dependencies are a property of programs

Pipeline organization determines if a dependence is detected and if it causes a stall

Data dependence conveys:
  Possibility of a hazard
  Order in which results must be calculated
  Upper bound on exploitable instruction-level parallelism

Dependencies that flow through memory locations are difficult to detect


Name Dependence

Two instructions use the same name but there is no flow of information
  Not a true data dependence, but a problem when reordering instructions

  Antidependence: instruction j writes a register or memory location that instruction i reads
    Initial ordering (i before j) must be preserved

  Output dependence: instruction i and instruction j write the same register or memory location
    Ordering must be preserved

To resolve, use renaming techniques


Other Factors

Data Hazards
  Read after write (RAW)
  Write after write (WAW)
  Write after read (WAR)

Control Dependence
  Ordering of instruction i with respect to a branch instruction
    An instruction control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
    An instruction not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch


Examples

Example 1: the OR instruction is dependent on DADDU and DSUBU

      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
L:    …
      OR    R7,R1,R8

Example 2: assume R4 isn't used after skip
  Possible to move DSUBU before the branch

      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9


Compiler Techniques for Exposing ILP

Pipeline scheduling
  Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction

Example:
  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;


Pipeline Stalls

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              ;(assume integer load latency is 1)
      BNE    R1,R2,Loop


Pipeline Scheduling

Scheduled code:

Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop


Loop Unrolling

Loop unrolling (not yet scheduled)
  Unroll by a factor of 4 (assume # elements is divisible by 4)
  Eliminate unnecessary instructions

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop


Note: compare the number of live registers vs. the original loop
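For comparison, here is the same unrolling expressed at the C level; like the slide, this sketch assumes the element count is divisible by 4:

/* Original:  for (i = 999; i >= 0; i = i - 1)  x[i] = x[i] + s;
 * Unrolled by a factor of 4, assuming the element count (1000) is
 * divisible by 4, mirroring the MIPS version above. */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 999; i >= 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}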


Loop Unrolling / Pipeline Scheduling

Pipeline schedule the unrolled loop:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop


Strip Mining

Unknown number of loop iterations?
  Number of iterations = n
  Goal: make k copies of the loop body
  Generate a pair of loops (see the sketch below):
    First executes n mod k times
    Second executes n / k times
  "Strip mining"
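A minimal C sketch of strip mining for the same array loop when n is not known to be a multiple of the unroll factor (k = 4 here; the function name is illustrative):

/* Strip mining: the first loop runs the n mod k leftover iterations with
 * the original body; the second loop executes n / k times with the body
 * unrolled by k. */
void add_scalar_strip_mined(double *x, double s, int n) {
    const int k = 4;
    int i = 0;

    /* First loop: n mod k iterations of the original body */
    for (; i < n % k; i++)
        x[i] += s;

    /* Second loop: n / k iterations, body unrolled by k */
    for (; i < n; i += k) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}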


Static Branch Prediction

Static branch prediction, in general, is a prediction that uses information gathered before the execution of the program.

  Always taken (MIPS-X, Stanford)
  Always not taken (Motorola MC88000)

Another way is to run the program with a profiler and some training input data, and predict the branch direction that was most frequently taken according to profiling. This prediction can be indicated by a hint bit in the branch opcode.


Dynamic Branch Prediction

Dynamic branch prediction, or just branch prediction, uses information about taken or not-taken branches gathered at run time to predict the outcome of a branch.


2-Bit Branch Predictor (Figure C.18)

N-bit predictors


2-Bit Saturating Counter Branch Predictor

http://en.wikipedia.org/wiki/Branch_predictor


Branch History Table Indexing

Instruction address:  x31 x30 x29 x28 … x12 | x11 … x2 | 0 0
  x1 = 0 and x0 = 0 — why? (instructions are word aligned, so the two low bits are always zero)
  The low-order bits above them (x11 … x2) index the prediction table

Prediction: what was done last time
  2-bit / n-bit prediction
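A minimal C sketch of a branch-history table of 2-bit saturating counters, indexed by the branch address with the two always-zero low bits dropped; the table size and names are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 4096                 /* illustrative power-of-two table size */

static uint8_t bht[BHT_ENTRIES];         /* 2-bit counters: 0,1 = not taken; 2,3 = taken */

/* Drop the two always-zero low bits of the PC, keep the next low-order bits. */
static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;      /* weakly or strongly taken */
}

void bht_update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }  /* saturate at 3 (strongly taken)     */
    else        { if (*c > 0) (*c)--; }  /* saturate at 0 (strongly not taken) */
}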


Correlating Branch Predictors

if (a == 2)
    a = 0;
if (b == 2)
    b = 0;
if (a != b) {
    …


(m, n) Predictors

The last m branches are used to predict
  One of 2^m n-bit predictors is selected
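A minimal C sketch of an (m, n) predictor with m = 2 and n = 2: a global history of the last two branch outcomes selects one of the 2^m 2-bit counters kept for each table entry (table size and names are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 1024                          /* illustrative, indexed by low PC bits */
#define M 2                                   /* number of global history bits        */

static uint8_t counters[ENTRIES][1 << M];     /* 2^M two-bit counters per entry  */
static unsigned history;                      /* outcomes of the last M branches */

static unsigned entry(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

bool corr_predict(uint32_t pc) {
    return counters[entry(pc)][history] >= 2;
}

void corr_update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[entry(pc)][history];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);  /* shift in outcome */
}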


Branch Prediction

Basic 2-bit predictor:
  For each branch:
    Predict taken or not taken
    If the prediction is wrong two consecutive times, change the prediction

Correlating predictor:
  Multiple 2-bit predictors for each branch
  One for each possible combination of outcomes of the preceding n branches

Local predictor:
  Multiple 2-bit predictors for each branch
  One for each possible combination of outcomes for the last n occurrences of this branch

Tournament predictor:
  Combine correlating predictor with local predictor


Branch Prediction Performance

(figure: branch predictor performance)


Dynamic Scheduling

Rearrange the order of instructions to reduce stalls while maintaining data flow

Advantages:
  Compiler doesn't need to have knowledge of the microarchitecture
  Handles cases where dependencies are unknown at compile time

Disadvantages:
  Substantial increase in hardware complexity
  Complicates exceptions


Dynamic Scheduling

Dynamic scheduling implies:
  Out-of-order execution
  Out-of-order completion

Creates the possibility for WAR and WAW hazards

Tomasulo's Approach
  Tracks when operands are available
  Introduces register renaming in hardware
    Minimizes WAW and WAR hazards


Register Renaming

Example:

      DIV.D  F0,F2,F4
      ADD.D  F6,F0,F8
      S.D    F6,0(R1)
      SUB.D  F8,F10,F14
      MUL.D  F6,F10,F8

+ name dependence with F6: ADD.D and MUL.D both write F6 (output dependence)
+ antidependences: ADD.D reads F8, which SUB.D later writes; S.D reads F6, which MUL.D later writes


Register Renaming

Example:

      DIV.D  F0,F2,F4
      ADD.D  S,F0,F8
      S.D    S,0(R1)
      SUB.D  T,F10,F14
      MUL.D  F6,F10,T

Now only RAW hazards remain, which can be strictly ordered


Register Renaming

Register renaming is provided by reservation stations (RS)

  Contains:
    The instruction
    Buffered operand values (when available)
    Reservation station number of the instruction providing the operand values

  An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
  Pending instructions designate the RS to which they will send their output
    Result values are broadcast on a result bus, called the common data bus (CDB)
  Only the last output updates the register file
  As instructions are issued, the register specifiers are renamed with the reservation station
  May be more reservation stations than registers
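A minimal C sketch of a reservation-station entry with the fields described above, together with the CDB broadcast that fills in waiting operands. The Vj/Vk/Qj/Qk field names follow the usual textbook convention; this is an illustrative fragment, not a full Tomasulo implementation:

#include <stdbool.h>

typedef struct {
    bool   busy;      /* entry in use?                                         */
    int    op;        /* operation to perform (e.g., ADD, MUL)                 */
    double Vj, Vk;    /* buffered operand values, valid when Qj/Qk == 0        */
    int    Qj, Qk;    /* RS number producing each operand; 0 = value available */
} ReservationStation;

/* Broadcast a result on the common data bus (CDB): every waiting
 * reservation station that named producer_rs captures the value. */
void cdb_broadcast(ReservationStation *rs, int n, int producer_rs, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == producer_rs) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == producer_rs) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
}

/* An entry may begin execution once both operands have arrived. */
bool ready_to_execute(const ReservationStation *r) {
    return r->busy && r->Qj == 0 && r->Qk == 0;
}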


Tomasulo's Algorithm

Load and store buffers
  Contain data and addresses, act like reservation stations

Top-level design:


Tomasulo's Algorithm

Three steps:

  Issue
    Get the next instruction from the FIFO queue
    If there is an available RS, issue the instruction to the RS, with operand values if available
    If operand values are not available, stall the instruction

  Execute
    When an operand becomes available, store it in any reservation stations waiting for it
    When all operands are ready, execute the instruction
    Loads and stores are maintained in program order through the effective address
    No instruction is allowed to initiate execution until all branches that precede it in program order have completed

  Write result
    Write the result on the CDB into reservation stations and store buffers
      (Stores must wait until both address and value are received)


Example


Hardware-Based Speculation

Execute instructions along predicted execution paths, but only commit the results if the prediction was correct

Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative

Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  i.e., updating state or taking an exception


Reorder Buffer

Reorder buffer – holds the result of an instruction between completion and commit

Four fields:
  Instruction type: branch / store / register operation
  Destination field: register number
  Value field: output value
  Ready field: completed execution?

Modify reservation stations:
  Operand source is now the reorder buffer instead of the functional unit
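A minimal C sketch of a reorder-buffer entry with the four fields listed above; the enum values, buffer size, and commit check are illustrative assumptions:

#include <stdbool.h>

typedef enum { INSTR_BRANCH, INSTR_STORE, INSTR_REGISTER } InstrType;

typedef struct {
    InstrType type;   /* instruction type: branch / store / register operation */
    int       dest;   /* destination register number (or store address slot)   */
    double    value;  /* output value, valid once execution completes          */
    bool      ready;  /* has the instruction completed execution?              */
} ROBEntry;

/* Entries commit in program order from the head of a circular buffer;
 * only a ready head entry may update architectural state. */
typedef struct {
    ROBEntry entry[64];      /* illustrative ROB size */
    int head, tail, count;
} ReorderBuffer;

bool can_commit(const ReorderBuffer *rob) {
    return rob->count > 0 && rob->entry[rob->head].ready;
}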


Reorder Buffer

Register values and memory values are not written until an instruction commits

On misprediction:
  Speculated entries in the ROB are cleared

Exceptions:
  Not recognized until the instruction is ready to commit


Multiple Issue and Static Scheduling

To achieve CPI < 1, need to complete multiple instructions per clock

Solutions:
  Statically scheduled superscalar processors
  VLIW (very long instruction word) processors
  Dynamically scheduled superscalar processors


Multiple Issue


VLIW Processors

Package multiple operations into one instruction

Example VLIW processor:
  One integer instruction (or branch)
  Two independent floating-point operations
  Two independent memory references

Must be enough parallelism in code to fill the available slots
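One way to picture the example format above is as a fixed five-slot bundle; the struct below is a hypothetical encoding for illustration only, not an actual instruction format:

#include <stdint.h>

/* Hypothetical five-slot VLIW bundle matching the example processor:
 * one integer/branch operation, two FP operations, two memory references.
 * Slots the compiler cannot fill must hold no-ops. */
typedef struct {
    uint32_t int_or_branch;   /* slot 1: integer ALU operation or branch */
    uint32_t fp_op[2];        /* slots 2-3: independent FP operations    */
    uint32_t mem_op[2];       /* slots 4-5: independent loads/stores     */
} VLIWBundle;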


VLIW Processors

Disadvantages:
  Statically finding parallelism
  Code size
  No hazard detection hardware
  Binary code compatibility


Dynamic Scheduling, Multiple Issue, and Speculation

Modern microarchitectures:
  Dynamic scheduling + multiple issue + speculation

Two approaches:
  Assign reservation stations and update the pipeline control table in half clock cycles
    Only supports 2 instructions/clock
  Design logic to handle any possible dependencies between the instructions
  Hybrid approaches

Issue logic can become the bottleneck



Overview of Design


Multiple Issue

Limit the number of instructions of a given class that can be issued in a "bundle"
  i.e., one FP, one integer, one load, one store

Examine all the dependencies among the instructions in the bundle

If dependencies exist in the bundle, encode them in the reservation stations

Also need multiple completion/commit


Example

Loop: LD     R2,0(R1)      ;R2 = array element
      DADDIU R2,R2,#1      ;increment R2
      SD     R2,0(R1)      ;store result
      DADDIU R1,R1,#8      ;increment pointer
      BNE    R2,R3,LOOP    ;branch if not last element


Example (No Speculation)


Dual Issue / Dual Commit Example


Branch-Target Buffer

Need high instruction bandwidth!
  Branch-target buffers: a next-PC prediction buffer, indexed by the current PC
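A minimal C sketch of a branch-target buffer: a small table indexed by the current PC whose hit supplies a predicted next PC during instruction fetch (size, tagging, and names are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512                       /* illustrative power-of-two size */

typedef struct {
    bool     valid;
    uint32_t tag;      /* full PC of the branch owning this entry */
    uint32_t target;   /* predicted next PC (the branch target)   */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/* Look up the current PC during IF; on a hit, fetch from the predicted
 * target next cycle instead of PC + 4. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
    BTBEntry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (e->valid && e->tag == pc) { *next_pc = e->target; return true; }
    *next_pc = pc + 4;                        /* miss: predict fall-through */
    return false;
}

/* On a taken branch, install or refresh the entry. */
void btb_update(uint32_t pc, uint32_t target) {
    BTBEntry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    e->valid = true; e->tag = pc; e->target = target;
}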


Reorder Buffer


Review of Data Hazards

Assume instruction i comes before instruction j

• Instruction j is data dependent on i if
  • i produces a result that j uses, or
  • j depends on k and k depends on i

• Name dependence – two instructions use the same register
  • Antidependence: j writes a register that i reads
  • Output dependence: both i and j write the same register

• Hazards
  • RAW
  • WAW
  • WAR