cs 352h: computer systems architecturefussell/courses/cs352h/lectures/10...university of texas at...

25
University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell CS 352H: Computer Systems Architecture Topic 10: Instruction Level Parallelism (ILP) October 6 - 8, 2009

Upload: others

Post on 21-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell

CS 352H: Computer Systems Architecture

Topic 10: Instruction Level Parallelism (ILP)

October 6 - 8, 2009

Page 2: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2

Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions inparallelTo increase ILP

Deeper pipelineLess work per stage ⇒ shorter clock cycle

Multiple issueReplicate pipeline stages ⇒ multiple pipelinesStart multiple instructions per clock cycleCPI < 1, so use Instructions Per Cycle (IPC)E.g., 4GHz 4-way multiple-issue

16 BIPS, peak CPI = 0.25, peak IPC = 4But dependencies reduce this in practice

Page 3: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 3

Multiple Issue

Static multiple issueCompiler groups instructions to be issued togetherPackages them into “issue slots”Compiler detects and avoids hazards

Dynamic multiple issueCPU examines instruction stream and chooses instructions to issueeach cycleCompiler can help by reordering instructionsCPU resolves hazards using advanced techniques at runtime

Page 4: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 4

Speculation

“Guess” what to do with an instructionStart operation as soon as possibleCheck whether guess was right

If so, complete the operationIf not, roll-back and do the right thing

Common to static and dynamic multiple issueExamples

Speculate on branch outcomeRoll back if path taken is different

Speculate on loadRoll back if location is updated

Page 5: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 5

Compiler/Hardware Speculation

Compiler can reorder instructionse.g., move load before branchCan include “fix-up” instructions to recover from incorrect guess

Hardware can look ahead for instructions to executeBuffer results until it determines they are actually neededFlush buffers on incorrect speculation

Page 6: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 6

Speculation and Exceptions

What if exception occurs on a speculatively executedinstruction?

e.g., speculative load before null-pointer checkStatic speculation

Can add ISA support for deferring exceptionsDynamic speculation

Can buffer exceptions until instruction completion (which may notoccur)

Page 7: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 7

Static Multiple Issue

Compiler groups instructions into “issue packets”Group of instructions that can be issued on a single cycleDetermined by pipeline resources required

Think of an issue packet as a very long instructionSpecifies multiple concurrent operations⇒ Very Long Instruction Word (VLIW)

Page 8: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 8

Scheduling Static Multiple Issue

Compiler must remove some/all hazardsReorder instructions into issue packetsNo dependencies with a packetPossibly some dependencies between packets

Varies between ISAs; compiler must know!Pad with nop if necessary

Page 9: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 9

MIPS with Static Dual Issue

Two-issue packetsOne ALU/branch instructionOne load/store instruction64-bit aligned

ALU/branch, then load/storePad an unused instruction with nop

n + 20n + 16n + 12n + 8n + 4nAddress

WBMEM

EXIDIFLoad/storeWBME

MEXIDIFALU/branch

WBMEM

EXIDIFLoad/storeWBME

MEXIDIFALU/branch

WBMEM

EXIDIFLoad/storeWBME

MEXIDIFALU/branchPipeline StagesInstruction

type

Page 10: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 10

MIPS with Static Dual Issue

Page 11: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 11

Hazards in the Dual-Issue MIPS

More instructions executing in parallelEX data hazard

Forwarding avoided stalls with single-issueNow can’t use ALU result in load/store in same packet

add $t0, $s0, $s1load $s2, 0($t0)Split into two packets, effectively a stall

Load-use hazardStill one cycle use latency, but now two instructions

More aggressive scheduling required

Page 12: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 12

Scheduling Example

Schedule this for dual-issue MIPS

Loop: lw $t0, 0($s1) # $t0=array element addu $t0, $t0, $s2 # add scalar in $s2 sw $t0, 0($s1) # store result addi $s1, $s1,–4 # decrement pointer bne $s1, $zero, Loop # branch $s1!=0

4sw $t0, 4($s1)bne $s1, $zero, Loop3nopaddu $t0, $t0, $s22nopaddi $s1, $s1,–41lw $t0, 0($s1)nopLoop

:

cycleLoad/storeALU/branch

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Page 13: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 13

Loop Unrolling

Replicate loop body to expose more parallelismReduces loop-control overhead

Use different registers per replicationCalled “register renaming”Avoid loop-carried “anti-dependencies”

Store followed by a load of the same registerAka “name dependence”

Reuse of a register name

Page 14: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 14

Loop Unrolling Example

IPC = 14/8 = 1.75Closer to 2, but at cost of registers and code size

3lw $t2, 8($s1)addu $t0, $t0, $s24lw $t3, 4($s1)addu $t1, $t1, $s25sw $t0, 16($s1)addu $t2, $t2, $s26sw $t1, 12($s1)addu $t3, $t4, $s2

8sw $t3, 4($s1)bne $s1, $zero, Loop7sw $t2, 8($s1)nop

2lw $t1, 12($s1)nop1lw $t0, 0($s1)addi $s1, $s1,–16Loop

:

cycleLoad/storeALU/branch

Page 15: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 15

Dynamic Multiple Issue

“Superscalar” processorsCPU decides whether to issue 0, 1, 2, … each cycle

Avoiding structural and data hazards

Avoids the need for compiler schedulingThough it may still helpCode semantics ensured by the CPU

Page 16: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 16

Dynamic Pipeline Scheduling

Allow the CPU to execute instructions out of order toavoid stalls

But commit result to registers in order

Examplelw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20Can start sub while addu is waiting for lw

Page 17: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 17

Dynamically Scheduled CPU

Results also sentto any waiting

reservationstations

Reorders buffer forregister writes Can supply

operands forissued instructions

Preservesdependencies

Hold pendingoperands

Page 18: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 18

Register Renaming

Reservation stations and reorder buffer effectively provideregister renamingOn instruction issue to reservation station

If operand is available in register file or reorder bufferCopied to reservation stationNo longer required in the register; can be overwritten

If operand is not yet availableIt will be provided to the reservation station by a function unitRegister update may not be required

Page 19: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 19

Speculation

Predict branch and continue issuingDon’t commit until branch outcome determined

Load speculationAvoid load and cache miss delay

Predict the effective addressPredict loaded valueLoad before completing outstanding storesBypass stored values to load unit

Don’t commit load until speculation cleared

Page 20: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 20

Why Do Dynamic Scheduling?

Why not just let the compiler schedule code?Not all stalls are predicable

e.g., cache misses

Can’t always schedule around branchesBranch outcome is dynamically determined

Different implementations of an ISA have differentlatencies and hazards

Page 21: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 21

Does Multiple Issue Work?

Yes, but not as much as we’d likePrograms have real dependencies that limit ILPSome dependencies are hard to eliminate

e.g., pointer aliasing

Some parallelism is hard to exposeLimited window size during instruction issue

Memory delays and limited bandwidthHard to keep pipelines full

Speculation can help if done well

Page 22: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 22

Power Efficiency

Complexity of dynamic scheduling and speculationsrequires powerMultiple simpler cores may be better

70W8No161200MHz

2005UltraSparcT1

90W1No4141950MHz

2003UltraSparcIII

75W2Yes4142930MHz

2006Core103W1Yes3313600MH

z2004P4 Prescott

75W1Yes3222000MHz

2001P4Willamette

29W1Yes310200MHz1997PentiumPro

10W1No2566MHz1993Pentium5W1No1525MHz1989i486

PowerCoresOut-of-order/

Speculation

Issuewidth

Pipeline

Stages

ClockRate

YearMicroprocessor

Page 23: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 23

The Opteron X4 Microarchitecture

72 physicalregisters

Page 24: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 24

The Opteron X4 Pipeline Flow

For integer operations

FP is 5 stages longerUp to 106 RISC-ops in progress

BottlenecksComplex instructions with long dependenciesBranch mispredictionsMemory access delays

Page 25: CS 352H: Computer Systems Architecturefussell/courses/cs352h/lectures/10...University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Instruction-Level

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 25

Concluding Remarks

Multiple issue and dynamic scheduling (ILP)Dependencies limit achievable parallelismComplexity leads to the power wall