ch 13 95 [read-only] [compatibility mode]

William Stallings William Stallings Computer Organization

d A hit tand Architecture

Chapter 13Instruction Level ParallelismInstruction Level Parallelismand Superscalar Processors

What is Superscalar?What is Superscalar?

aCommon instructions (arithmetic, load/store, conditional branch) can be initiated and

t d i d d tlexecuted independentlyaEqually applicable to RISC & CISCaI ti ll RISCaIn practice usually RISC

Why Superscalar?Why Superscalar?

aMost operations are on scalar quantities (see RISC notes)aImprove these operations to get an overall

improvement

General Superscalar OrganizationOrganization

SuperpipelinedSuperpipelined

aMany pipeline stages need less than half a clock cycleaDouble internal clock speed gets two tasks per

external clock cycleaS l ll ll l f t h taSuperscalar allows parallel fetch execute

Superscalar vSuperpipelineSuperpipeline

LimitationsLimitationsaInstruction level parallelismaInstruction level parallelismaCompiler based optimisationaHardware techniquesaHardware techniquesaLimited by`True data dependency`True data dependency`Procedural dependency`Resource conflicts`Output dependency`Antidependency

True Data DependencyTrue Data Dependency

aADD r1, r2 (r1 := r1+r2;)aMOVE r3,r1 (r3 := r1;)aCan fetch and decode second instruction in

parallel with firstaCan NOT execute second instruction until first is

finished

Procedural DependencyProcedural Dependency

aCan not execute instructions after a branch in parallel with instructions before a branchaAlso, if instruction length is not fixed,

instructions have to be decoded to find out how many fetches are neededmany fetches are neededaThis prevents simultaneous fetches

Resource ConflictResource Conflict

aTwo or more instructions requiring access to the same resource at the same time

h`e.g. two arithmetic instructionsaCan duplicate resources` h t ith ti it`e.g. have two arithmetic units

D d iDependencies

Design IssuesDesign Issues

aInstruction level parallelismÌnstructions in a sequence are independentÈxecution can be overlapped`Governed by data and procedural dependency

aMachine Pa allelismaMachine ParallelismÀbility to take advantage of instruction level

parallelismparallelism`Governed by number of parallel pipelines

Instruction Issue PolicyInstruction Issue Policy

aOrder in which instructions are fetchedaOrder in which instructions are executedaOrder in which instructions change registers and

memory

In-Order Issue In Order CompletionIn-Order Completion

aIssue instructions in the order they occuraNot very efficientaMay fetch >1 instructionaInstructions must stall if necessary

In-Order Issue In-Order Completion (Diagram)Completion (Diagram)

In-Order Issue Out of Order CompletionOut-of-Order Completion

aOutput dependency`R3:= R3 + R5; (I1)`R4:= R3 + 1; (I2)`R3:= R5 + 1; (I3)Ì2 depends on result of I1 data dependencyÌ2 depends on result of I1 - data dependencyÌf I3 completes before I1, the result from I1 will be

wrong - output (read-write) dependencyg p ( ) p y

In-Order Issue Out-of-Order Completion (Diagram)Completion (Diagram)

Out-of-Order IssueOut of Order CompletionOut-of-Order Completion

aDecouple decode pipeline from execution pipelineaCan continue to fetch and decode until this

pipeline is fullaWh f ti l it b il blaWhen a functional unit becomes available an

instruction can be executedaSi i t ti h b d d daSince instructions have been decoded, processor

can look ahead

Out-of-Order Issue Out-of-Order Completion (Diagram)Completion (Diagram)

AntidependencyAntidependency

aWrite-write dependency`R3:=R3 + R5; (I1)`R4:=R3 + 1; (I2)`R3:=R5 + 1; (I3)`R7:=R3 + R4; (I4)`R7:=R3 + R4; (I4)`I3 can not complete before I2 starts as I2 needs a

value in R3 and I3 changes R3g

Register RenamingRegister Renaming

aOutput and antidependencies occur because register contents may not reflect the correct

d i f thordering from the programaMay result in a pipeline stallaR i t ll t d d i llaRegisters allocated dynamically`i.e. registers are not specifically named

Register Renaming exampleRegister Renaming exampleaR3b:=R3a + R5a (I1)aR3b: R3a + R5a (I1)aR4b:=R3b + 1 (I2)aR3c:=R5a + 1 (I3)aR3c:=R5a + 1 (I3)aR7b:=R3c + R4b (I4)aWithout subscript refers to logical register inaWithout subscript refers to logical register in

instructionaWith subscript is hardware register allocatedaWith subscript is hardware register allocatedaNote R3a R3b R3c

Machine ParallelismMachine Parallelism

aDuplication of ResourcesaOut of order issueaRenamingaNot worth duplication functions without register

renamingaNeed instruction window large enough (more

than 8)

Branch PredictionBranch Prediction

a80486 fetches both next sequential instruction after branch and branch target instructionaGives two cycle delay if branch taken

RISC Delayed BranchRISC - Delayed Branch

aCalculate result of branch before unusable instructions pre-fetchedaAlways execute single instruction immediately

following branchaK i li f ll hil f t hi i t tiaKeeps pipeline full while fetching new instruction

streamaN t d f laNot as good for superscalar`Multiple instructions need to execute in delay slot`Instruction dependence problems`Instruction dependence problems

aRevert to branch prediction

Superscalar ExecutionSuperscalar Execution

Superscalar ImplementationSuperscalar Implementation

aSimultaneously fetch multiple instructionsaLogic to determine true dependencies involving

register valuesaMechanisms to communicate these valuesaMechanisms to initiate multiple instructions in

parallelaResources for parallel execution of multiple

instructionsaM h i f i i iaMechanisms for committing process state in

correct order

Required ReadingRequired Reading

aStallings chapter 13aManufacturers web sitesaIMPACT web site`research on predicated execution

ch 13 95 [read-only] [compatibility mode]

Documents

order completionadecouple

memoryinorder issue

necessaryinorder issue

result of i1 data dependency

r3 r5 i1

r1 r2amove r3

riscwhy superscalar

instruction length