computer architecture spring 2016 - nanjing university · computer architecture spring 2016 shuai...

42
Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University Lecture 04: Pipelining [Slides adapted from CSE 502 Stony Brook University]

Upload: others

Post on 25-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Computer Architecture

Spring 2016

Shuai Wang

Department of Computer Science and Technology

Nanjing University

Lecture 04: Pipelining

[Slides adapted from CSE 502 Stony Brook University]

Page 2: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Before there was pipelining…

• Single-cycle control: hardwired

– Low CPI (1)

– Long clock period (to accommodate slowest instruction)

• Multi-cycle control: micro-programmed

– Short clock period

– High CPI

• Can we have both low CPI and short clock period?

Single-cycle

Multi-cycle

insn0.(fetch,decode,exec) insn1.(fetch,decode,exec)

insn0.decinsn0.fetch insn1.decinsn1.fetchinsn0.exec insn1.exec

timetime

Page 3: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipelining

• Start with multi-cycle design

• When insn0 goes from stage 1 to stage 2

… insn1 starts stage 1

• Each instruction passes through all stages

… but instructions enter and leave at faster rate

Multi-cycle insn0.decinsn0.fetch insn1.decinsn1.fetchinsn0.exec insn1.exec

timetime

Pipelinedinsn0.execinsn0.decinsn0.fetch

insn1.decinsn1.fetch insn1.exec

insn2.decinsn2.fetch insn2.exec

Can have as many insns in flight as there are stages

Page 4: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Processor Pipeline Review

I-cacheRegFile

PC

+4

D-cacheALU

Fetch Decode Memory

(Write-back)

Execute

Page 5: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 1: Fetch

• Fetch an instruction from memory every cycle

– Use PC to index memory

– Increment PC (assume no branches for now)

• Write state to the pipeline register (IF/ID)

– The next stage will read this pipeline register

Page 6: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 1: Fetch Diagram

Inst

ruct

ion

bits

IF / IDPipeline register

PC

InstructionCache

en

en

1

+

MUX

PC

+ 1

target

Page 7: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 2: Decode

• Decodes opcode bits

– Set up Control signals for later stages

• Read input operands from register file

– Specified by decoded instruction bits

• Write state to the pipeline register (ID/EX)

– Opcode

– Register contents

– PC+1 (even though decode didn’t use it)

– Control signals (from insn) for opcode and destReg

Page 8: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 2: Decode Diagram

ID / EXPipeline register

regA

cont

ents

regB

cont

entsRegister File

regAregB

en

Inst

ruct

ion

bits

IF / IDPipeline register

PC

+ 1

PC

+ 1

Con

trol

sign

als

destReg

data

target

Page 9: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 3: Execute

• Perform ALU operations

– Calculate result of instruction

• Control signals select operation

• Contents of regA used as one input

• Either regB or constant offset (from insn) used as second input

– Calculate PC-relative branch target

• PC+1+(constant offset)

• Write state to the pipeline register (EX/Mem)

– ALU result, contents of regB, and PC+1+offset

– Control signals (from insn) for opcode and destReg

Page 10: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 3: Execute Diagram

ID / EXPipeline register

regA

cont

ents

regB

cont

ents

ALU

resu

lt

EX/MemPipeline register

PC

+ 1

Con

trol

sign

als

Con

trol

sign

als

PC

+1

+of

fset

+

regB

cont

ents

ALUM

UX

destRegdata

target

Page 11: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 4: Memory

• Perform data cache access

– ALU result contains address for LD or ST

– Opcode bits control R/W and enable signals

• Write state to the pipeline register (Mem/WB)

– ALU result and Loaded data

– Control signals (from insn) for opcode and destReg

Page 12: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 4: Memory Diagram

ALU

resu

lt

Mem/WBPipeline register

ALU

resu

lt

EX/MemPipeline register

Con

trol

sign

als

PC

+1

+of

fset

regB

cont

ents

Load

edda

ta

Data Cache

en R/W

in_data

in_addr

Con

trol

sign

als

destRegdata

target

Page 13: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 5: Write-back

• Writing result to register file (if required)

– Write Loaded data to destReg for LD

– Write ALU result to destReg for arithmetic insn

– Opcode bits control register write enable signal

Page 14: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Stage 5: Write-back Diagram

ALU

resu

lt

Mem/WBPipeline register

Con

trol

sign

als

Load

edda

ta

MUX

data

destRegMUX

Page 15: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Putting It All Together

PC InstCache

Reg

iste

r fil

eMUX

1

DataCache

MUX

IF/ID ID/EX EX/Mem Mem/WB

MUX

op

dest

offset

valB

valA

PC+1PC+1target

ALUresult

op

dest

valB

op

dest

ALUresult

mdata

eq?instruction

0

R2R3R4R5

R1

R6

R0

R7

regAregB

datadest

MUX

Page 16: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipelining Idealism

• Uniform Sub-operations

– Operation can partitioned into uniform-latency sub-ops

• Repetition of Identical Operations

– Same ops performed on many different inputs

• Repetition of Independent Operations

– All repetitions of op are mutually independent

Page 17: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline Realism

• Uniform Sub-operations … NOT!

– Balance pipeline stages

• Stage quantization to yield balanced stages

• Minimize internal fragmentation (left-over time near end of cycle)

• Repetition of Identical Operations … NOT!

– Unifying instruction types

• Coalescing instruction types into one “multi-function” pipe

• Minimize external fragmentation (idle stages to match length)

• Repetition of Independent Operations … NOT!

– Resolve data and resource hazards

• Inter-instruction dependency detection and resolution

Pipelining is expensive

Page 18: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

The Generic Instruction Pipeline

Instruction Fetch

Instruction Decode

Operand Fetch

Instruction Execute

Write-back

IFIF

IDID

OFOF

EXEX

WBWB

Page 19: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Balancing Pipeline StagesTIF= 6 units

TID= 2 units

TID= 9 units

TEX= 5 units

TOS= 9 units

Without pipeliningTcyc≈

TIF+TID+TOF+TEX+TOS

= 31

PipelinedTcyc ≈ max{TIF, TID, TOF,

TEX, TOS}

= 9

Speedup= 31 / 9

IFIF

IDID

OFOF

EXEX

WBWB

Can we do better?

Page 20: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Balancing Pipeline Stages (1/2)

• Two methods for stage quantization

– Merge multiple sub-ops into one

– Divide sub-ops into smaller pieces

• Recent/Current trends

– Deeper pipelines (more and more stages)

– Multiple different pipelines/sub-pipelines

– Pipelining of memory accesses

Page 21: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Balancing Pipeline Stages (2/2)Coarser-Grained Machine Cycle:

4 machine cyc / instruction

Finer-Grained Machine Cycle:

11 machine cyc /instruction

TIF&ID= 8 units

TOF= 9 units

TEX= 5 units

TOS= 9 units

IFIFIDID

OFOF

WBWB

EXEX# stages = 11Tcyc= 3 units

IFIF

IFIFIDID

OFOF

OFOF

OFOF

EXEXEXEX

WBWB

WBWB

WBWB

# stages = 4Tcyc= 9 units

Page 22: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline Examples

IFIF

RDRD

ALUALU

MEMMEM

WBWB

IF

ID

OF

EX

WB

PC GENPC GEN

Cache ReadCache Read

Cache ReadCache Read

DecodeDecode

Read REGRead REG

Addr GENAddr GEN

Cache ReadCache Read

Cache ReadCache Read

EX 1EX 1

EX 2EX 2

Check ResultCheck Result

Write ResultWrite Result

WB

EX

OF

ID

IFMIPS R2000/R3000

AMDAHL 470V/7

Page 23: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Instruction Dependencies (1/2)

• Data Dependence

– Read-After-Write (RAW) (only true dependence)

• Read must wait until earlier write finishes

– Anti-Dependence (WAR)

• Write must wait until earlier read finishes (avoid clobbering)

– Output Dependence (WAW)

• Earlier write can’t overwrite later write

• Control Dependence (a.k.a. Procedural Dependence)

– Branch condition must execute before branch target

– Instructions after branch cannot run before branch

Page 24: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Instruction Dependencies (1/2)# for (;(j<high)&&(array[j]<array[low]);++j);

bge j, high, $36mul $15, j, 4addu $24, array, $15lw $25, 0($24)mul $13, low, 4addu $14, array, $13lw $15, 0($14)bge $25, $15, $36

$35:addu j, j, 1. . .

$36:addu $11, $11, -1. . .

Real code has lots of dependencies

Page 25: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Hardware Dependency Analysis

• Processor must handle

– Register Data Dependencies (same register)

• RAW, WAW, WAR

– Memory Data Dependencies (same address)

• RAW, WAW, WAR

– Control Dependencies

Page 26: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline Terminology

• Pipeline Hazards

– Potential violations of program dependencies

– Must ensure program dependencies are not violated

• Hazard Resolution

– Static method: performed at compile time in software

– Dynamic method: performed at runtime using hardware

– Two options: Stall (costs perf.) or Forward (costs hw.)

• Pipeline Interlock

– Hardware mechanism for dynamic hazard resolution

– Must detect and enforce dependencies at runtime

Page 27: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline: Steady State

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

IFIF IDID RDRD

IFIF IDID

IFIF

t0 t1 t2 t3 t4 t5

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 28: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline: Data Hazard

t0 t1 t2 t3 t4 t5

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

IFIF IDID RDRD

IFIF IDID

IFIF

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 29: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Option 1: Stall on Data Hazard

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID Stalled in Stalled in RDRD ALUALU MEMMEM WBWB

IFIF Stalled in Stalled in IDID RDRD ALUALU MEMMEM WBWB

Stalled in Stalled in IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

t0 t1 t2 t3 t4 t5

RDRD

IDID

IFIF

IFIF IDID RDRD

IFIF IDID

IFIF

Instj

Instj+1

Instj+2

Instj+3

Instj+4

Page 30: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Option 2: Forwarding Paths (1/3)

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

IFIF IDID RDRD

IFIF IDID

IFIF

t0 t1 t2 t3 t4 t5

Many possible pathsInstj

Instj+1

Instj+2

Instj+3

Instj+4

MEMMEM ALUALU Requires stalling even with forwarding paths

Page 31: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Option 2: Forwarding Paths (2/3)

IFIF IDIDRegister FileRegister File

src1

src2

ALUALU

MEMMEM

dest

WBWB

Page 32: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Option 2: Forwarding Paths (3/3)

Deeper pipeline mayDeeper pipeline mayrequire additionalrequire additionalforwarding pathsforwarding paths

IFIFRegister FileRegister File

src1

src2

ALUALU

MEMMEM

dest

==

==

WBWB

==

IDID

Page 33: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline: Control Hazard

t0 t1 t2 t3 t4 t5

Insti

Insti+1

Insti+2

Insti+3

Insti+4

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

IFIF IDID RDRD

IFIF IDID

IFIF

Page 34: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline: Stall on Control Hazard

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM

IFIF IDID RDRD ALUALU

IFIF IDID RDRD

IFIF IDID

IFIF

t0 t1 t2 t3 t4 t5

Insti

Insti+1

Insti+2

Insti+3

Insti+4

Stalled in Stalled in IFIF

Page 35: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pipeline: Prediction for Control Hazards

t0 t1 t2 t3 t4 t5

Insti

Insti+1

Insti+2

Insti+3

Insti+4

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU MEMMEM WBWB

IFIF IDID RDRD ALUALU nopnop nopnop

IFIF IDID RDRD nopnop nopnop

IFIF IDID nopnop nopnop

IFIF IDID RDRD

IFIF IDID

IFIF

New Insti+2

New Insti+3

New Insti+4

Speculative State Cleared

Fetch Resteered

Page 36: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Going Beyond Scalar

• Scalar pipeline limited to CPI ≥ 1.0

– Can never run more than 1 insn. per cycle

• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)

– Superscalar means executing multiple insns. in parallel

Page 37: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Architectures for Instruction Parallelism

• Scalar pipeline (baseline)

– Instruction/overlap parallelism = D

– Operation Latency = 1

– Peak IPC = 1.0

DD

Suc

cess

ive

Inst

ruct

ions

Time in cycles1 2 3 4 5 6 7 8 9 10 11 12

D different instructions overlapped

Page 38: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Superscalar Machine

• Superscalar (pipelined) Execution

– Instruction parallelism = D x N

– Operation Latency = 1

– Peak IPC = N per cycle

Suc

cess

ive

Inst

ruct

ions

Time in cycles1 2 3 4 5 6 7 8 9 10 11 12

N

D x N different instructions overlapped

Page 39: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Superscalar Example: Pentium

PrefetchPrefetch

Decode1Decode1

Decode2Decode2 Decode2Decode2

ExecuteExecute ExecuteExecute

WritebackWritebackWritebackWriteback

4× 32-byte buffers

Decode up to 2 insts

Read operands, Addr comp

Asymmetric pipes

u-pipe v-pipeshift

rotatesome FP

jmp, jcc,call,fxch

bothmov, lea,

simple ALU,push/poptest/cmp

Page 40: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Pentium Hazards & Stalls• “Pairing Rules” (when can’t two insns exec?)

– Read/flow dependence

• mov eax, 8

• mov [ebp], eax

– Output dependence

• mov eax, 8

• mov eax, [ebp]

– Partial register stalls

• mov al, 1

• mov ah, 0

– Function unit rules

• Some instructions can never be paired

– MUL, DIV, PUSHA, MOVS, some FP

Page 41: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

Limitations of In-Order Pipelines

• If the machine parallelism is increased

– … dependencies reduce performance

– CPI of in-order pipelines degrades sharply

• As N approaches avg. distance between dependent instructions

• Forwarding is no longer effective

– Must stall often

In-order pipelines are rarely full

Page 42: Computer Architecture Spring 2016 - Nanjing University · Computer Architecture Spring 2016 Shuai Wang Department of Computer Science and Technology Nanjing University ... insn0.fetch

The In-Order N-Instruction Limit

• On average, parent-child separation is about ± 5

insn.

– (Franklin and Sohi ’92)Ex. Superscalar degree N = 4

Any dependencybetween theseinstructions willcause a stall

Dependent insnmust be N = 4

instructions away

Average of 5 means there are many Average of 5 means there are many cases when the separation is < 4cases when the separation is < 4……

each of these limits parallelismeach of these limits parallelism

Reasonable in-order superscalar is effectively N=2