real-world pipelines: car washes idea divide process into independent stages move objects through...

Real-World Pipelines: Car Washes

Idea Divide process into independent

stages Move objects through stages in

sequence At any instant, multiple objects

being processed

Sequential Parallel

Pipelined

Computational Example

System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Clock cycle must be at least 320 ps

Combinationallogic

Reg

300 ps 20 ps

Clock

Delay = 320 psThroughput = 3.12 GOPS

Throughput = 1 operation

320 ps 1 ns

1000 ps 3.12 GOPSx

3-Stage Pipelined Version

System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A.

Begin new operation every 120 ps Performance impact

Latency increases: 360 ps from start to finish (was 320) Speedup is less than 3x – why?

Reg

Clock

Comb.logic

A

Reg

Comb.logic

B

Reg

Comb.logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps


Pipelining Unpipelined

Cannot start new operation until previous one completes 3-way pipelined

Up to 3 operations in process simultaneously

Time

OP1

OP2

OP3

Time

A B C

A B C

A B C

OP1

OP2

OP3

Operating a Pipeline

Time

OP1

OP2

OP3

A B C

A B C

A B C

0 120 240 360 480 640

Clock

Reg

Clock

Comb.logic

A

Reg

Comb.logic

B

Reg

Comb.logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

239

Reg

Clock

Comb.logic

A

Reg

Comb.logic

B

Reg

Comb.logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

241

Reg

Reg

Reg

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Comb.logic

A

Comb.logic

B

Comb.logic

C

Clock

300

Reg

Clock

Comb.logic

A

Reg

Comb.logic

B

Reg

Comb.logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

359

Real-World Pipelines: Laundry Assuming you’ve got:

One washer (takes 30 minutes)

One drier (takes 40 minutes)

One “folder” (takes 20 minutes)

It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take?

The slow way

If each load is done sequentially it takes 6 hours

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Time

8

Laundry Pipelining Start each load as soon as possible: overlap loads

Pipelined laundry takes 3.5 hours

6 PM 7 8 9 10 11 Midnight

Time

30 40 40 40 40 20

Limitations: Nonuniform Delays

Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages

Reg

Clock

Reg

Comb.logic

B

Reg

Comb.logic

C

50 ps 20 ps 150 ps 20 ps 100 ps 20 ps


Comb.logic

A

Time

OP1

OP2

OP3

A B C

A B C

A B C

Limitations: Register Overhead

As pipeline deepens, overhead of loading registers becomes more significant

Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57%

High speeds of modern processor designs obtained through deep pipelining: high overhead

Delay = 420 ps, Throughput = 14.29 GOPSClock

Reg

Comb.logic

50 ps 20 ps

Reg

Comb.logic

50 ps 20 ps

Reg

Comb.logic

50 ps 20 ps

Reg

Comb.logic

50 ps 20 ps

Reg

Comb.logic

50 ps 20 ps

Reg

Comb.logic

50 ps 20 ps

Data Dependencies

Each operation depends on result from preceding one Not a problem for SEQ implementation

Clock

Combinationallogic

Reg

Time

OP1

OP2

OP3

Limitations: Data Hazards

Result does not feed back around in time for next operation Unless special action is taken, results will be incorrect!

Reg

Clock

Comb.logic

A

Reg

Comb.logic

B

Reg

Comb.logic

C

Time

OP1

OP2

OP3

A B C

A B C

A B C

OP4 A B C

Review: Arithmetic and Logical Operations

Refer to generically as “OPl” Encodings differ only by

“function code” Low-order 4 bits in first

instruction word All set condition codes as side

effect

addl rA, rB 6 0 rA rB

subl rA, rB 6 1 rA rB

andl rA, rB 6 2 rA rB

xorl rA, rB 6 3 rA rB

Add

Subtract (rA from rB)

And

Exclusive-Or

Instruction Code Function Code

Review: Move Operations

Similar to the IA32 movl instruction Simpler format for memory addresses Separated into different instructions to simplify hardware

implementation

rrmovl rA, rB 2 0 rA rB Register --> Register

Immediate --> Registerirmovl V, rB 3 0 F rB V

Register --> Memoryrmmovl rA, D(rB) 4 0 rA rB D

Memory --> Registermrmovl D(rB), rA 5 0 rA rB D

Data Dependencies in Y86 Code

Result from one instruction used as operand for another Read-after-write (RAW) dependency

Very common in actual programs Must make sure our pipeline handles these properly

Essential: results must be correct Desirable: performance impact should be minimized

1 irmovl $50, %eax

2 addl %eax , %ebx

3 mrmovl 100( %ebx ), %edx

SEQ Hardware Stages occur in sequence One operation in process at a

time

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCC ALUALU

Datamemory

Datamemory

NewPC

rB

dstE dstM

ALUA

ALUB

Addr

srcA srcB

read

ALUfun.

Fetch

Decode

Execute

Memory

Write back

data out

Registerfile

Registerfile

A B M

E

Cnd

dstE dstM srcA srcB

icode ifun rA

PC

valC valP

valBvalA

Data

valE

valM

PC

newPC

Stat

dmem_error

instr_valid

imem_error

stat

write

Mem.control

SEQ+ Hardware Still sequential implementation Reorder PC stage to put at

beginning PC stage

Task is to select PC for current instruction

Based on results computed by previous instruction

Processor state PC is no longer stored in

register PC is determined from other

stored information Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCC ALUALU

Datamemory

Datamemory

PC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Write back

data out

Registerfile

Registerfile

A BM

E

Cnd

dstE dstM srcA srcB

icode ifun rA

pCnd pValM pValC pValP pIcode

PC

valC valP

valBvalA

Data

valE

valM

PC

Stat

dmem_errorstat

instr_valid

imem_error

Adding Pipeline Registers

Instructionmemory

Instructionmemory

PC increment

PC increment

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode, ifunrA, rB

valC

Registerfile

Registerfile

A B

M

E

PC

valP

srcA, srcBdstE, dstM

valA, valB

aluA, aluB

Cnd

valE

Addr, Data

valM

PC

valE, valM

newPC

PC increment

PC increment

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

Registerfile

RegisterfileA B M

E

d_srcA, d_srcB

valA, valB

aluA, aluB

CndvalE

Addr, Data

valM

W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM

icode, ifun,rA, rB, valC

E

M

W

F

D

valP

f_pc

predPC

Instructionmemory

Instructionmemory

M_icode, M_Cnd, M_valA

Pipeline Stages Fetch

Select current PC Read instruction Compute incremented PC

Decode Read program registers

Execute Perform ALU operation

Memory Read or write data memory

Write back Update register file

PC increment

PC increment

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

Registerfile

RegisterfileA B M

E

d_srcA, d_srcB

valA, valB

aluA, aluB

CndvalE

Addr, Data

valM



E

M

W

F

D

valP

f_pc

predPC

Instructionmemory

Instructionmemory


PIPE- Hardware Pipeline registers hold

intermediate values from instruction execution

Note naming conventions Forward (upward) paths

Values flow from one stage to next

Values cannot skip stages; look at

valC in decode dstE and dstM in execute,

memory

E

M

W

F

D

Instructionmemory

Instructionmemory

PCincrement

PCincrement

Registerfile

Registerfile

ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstMSelectA

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

icode

data out

data in

A BM

E

M_valA

W_valM

W_valE

M_valA

W_valM

d_rvalA

f_pc

PredictPC

valE valM dstE dstM

Cndicode valE valA dstE dstM

icode ifun valC valA valB dstE dstMsrcA srcB

valC valPicode ifun rA

predPC

CCCC

d_srcBd_srcA

e_Cnd

M_Cnd

stat

stat

stat

stat

statimem_error

instr_valid

stat

dstE

dmem_errorm_stat

W_stat

M_stat

E_stat

D_stat

f_stat

Writeback stat

Stat

PIPE- Example

At clock cycle 4, what are the values of

d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM

Highlight the datapath in D, E, M stages.

E

M

W

F

D

Instructionmemory

Instructionmemory

PCincrement

PCincrement

Registerfile

Registerfile

ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstMSelectA

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

icode

data out

data in

A BM

E

M_valA

W_valM

W_valE

M_valA

W_valM

d_rvalA

f_pc

PredictPC

valE valM dstE dstM




predPC

CCCC

d_srcBd_srcA

e_Cnd

M_Cnd

stat

stat

stat

stat

statimem_error

instr_valid

stat

dstE

dmem_errorm_stat

W_stat

M_stat

E_stat

D_stat

f_stat

Writeback stat

Stat

1: mrmovl 4(%ebx), %ecx2: addl %eax, %edx3: addl %ecx, %edx

Problems:Feedback Paths

Predicted PC Guess value of next PC

Branch information Jump taken/not-taken Fall-through or target

address Return address

Read from memory Register updates

To register file write ports

E

M

W

F

D

Instructionmemory

Instructionmemory

PCincrement

PCincrement

Registerfile

Registerfile

ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstMSelectA

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

icode

data out

data in

A BM

E

M_valA

W_valM

W_valE

M_valA

W_valM

d_rvalA

f_pc

PredictPC

valE valM dstE dstM




predPC

CCCC

d_srcBd_srcA

e_Cnd

M_Cnd

stat

stat

stat

stat

statimem_error

instr_valid

stat

dstE

dmem_errorm_stat

W_stat

M_stat

E_stat

D_stat

f_stat

Writeback stat

Stat

Predicting the PC

Pipeline timing requires next instruction fetch to begin one cycle after current instruction is fetched Not enough time to determine next

instruction in all cases Can be done for all but conditional

jumps, return Solution for conditional jumps: guess

outcome + continue Recover later if prediction was

incorrectPC

incrementPC

increment

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

Registerfile

RegisterfileA B M

E

d_srcA, d_srcB

valA, valB

aluA, aluB

CndvalE

Addr, Data

valM



E

M

W

F

D

valP

f_pc

predPC

Instructionmemory

Instructionmemory


Simple Prediction Strategy Instructions that don’t transfer control

Next PC is always valP (no guess required) Call and unconditional jumps

Next PC is always valC (no guess required) Conditional jumps

Could predict next PC to be valC (branch target) Correct only if branch is taken (~60% of time)

Could predict next PC to be valP (sequential successor) Correct only if branch not taken (~40% of time)

Could predict taken if forward, not-taken if backward Correct ~65% of time

Return instruction Don’t try to predict (Why? What would be required?)

Branch Misprediction Example

Should execute only first 7 instructions Our predictor will guess the branch will be taken What must be done to recover?

0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $2, %edx # Target (Should not execute) 0x017: irmovl $3, %ebx # Should not execute 0x01d: irmovl $4, %edx # Should not execute

Handling Misprediction: Desired Behavior

Detection Branch is in execute, and it was not taken Two incorrect instructions follow branch in pipe

Action On next cycle, replace mispredicted instructions with “bubbles”

0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020: .pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer

Return Example

Without special handling we would fetch three more instructions after ret before we know return address

Instead, we’ll freeze the pipe when ret is fetched

0x026: ret F D E M

Wbubble F D E M

W

bubble F D E M W

bubble F D E M W

0x00b: irmovl $5,%esi # Return F D E M W

# demo-retb

F D E M W

FvalC f 5rB f %esi

FvalC f 5rB f %esi

W

valM = 0x0b

W

valM = 0x0b

•••

Handling Return: Desired Behavior

Detection ret is in decode, execute or

memory Address of next instruction

unavailable Action

Fill pipe with bubbles until ret’s memory access completes

Pipeline Operation

irmovl $1,%eax #I1

1 2 3 4 5 6 7 8 9

F D E M

Wirmovl $2,%ecx #I2 F D E M

W

irmovl $3,%edx #I3 F D E M W

irmovl $4,%ebx #I4 F D E M W

halt #I5 F D E M W

Cycle 5

W

I1

M

I2

E

I3

D

I4

F

I5

Problems with PIPE-: Example with 3 Nops

0x000: irmovl $10,%edx

1 2 3 4 5 6 7 8 9

F D E M WF D E M W

0x006: irmovl $3,%eax F D E M WF D E M W

0x00c: nop F D E M WF D E M W

0x00d: nop F D E M WF D E M W

0x00e: nop F D E M WF D E M W

0x00f: addl %edx,%eax F D E M WF D E M W

10

W

R[%eax] f 3

W

R[%eax] f 3

D

valA f R[%edx] = 10valB f R[%eax] = 3

D


# demo-h3.ys

Cycle 6

11

0x011: halt F D E M WF D E M W

Cycle 7

Example: 2 Nops


1 2 3 4 5 6 7 8 9

F D E M WF D E M W

0x006: irmovl $3,%eax F D E M WF D E M W


0x00d: nop F D E M WF D E M W

0x00e: addl %edx,%eax F D E M WF D E M W

0x010: halt F D E M WF D E M W

10# demo-h2.ys

W

R[%eax] f 3

D

valA f R[ %edx] = 10valB f R[ %eax] = 0

•••

W

R[%eax] f 3

W

R[%eax] f 3

D

valA fR[% edx] = 10valB fR[% eax] = 0

D

valA fR[% edx] = 10valB fR[% eax] = 0

•••

Cycle 6

Error

Example: 1 Nop0x000: irmovl $10,%edx

1 2 3 4 5 6 7 8 9

F D E M

W0x006: irmovl $3,%eax F D E M

W


0x00d: addl %edx,%eax F D E M WF D E M W

0x00f: halt F D E M WF D E M W

# demo-h1.ys

W

R[%edx] f 10

W

R[%edx] f 10

D


D


•••

Cycle 5

Error

MM_valE = 3M_dstE = %eax

Example: 0 Nops


1 2 3 4 5 6 7 8

F D E M

W0x006: irmovl $3,%eax F D E M

W

F D E M W0x00c: addl % edx,% eax

F D E M W0x00e: halt

# demo-h0.ys

E

D


D


Cycle 4

Error

MM_valE = 10M_dstE = %edx

e_valE f 0 + 3 = 3 E_dstE = %eax

Hazard Solution 1: Stalling

If instruction that reads register follows too closely after instruction that writes same register, delay reading instruction

Delay or stall always takes place in decode stage Following instruction also delayed in fetch, but just one “stall cycle” tallied

Stall = dynamically injecting nop (bubble) into execute stage Hazard we get wrong answer without special action


1 2 3 4 5 6 7 8 9

F D E M W

0x006: irmovl $3,%eax F D E M W

0x00c: nop F D E M W

bubble

F

E M W

0x00e: addl %edx,%eax D D E M W

0x010: halt F D E M W

10# demo-h2.ys

F

F D E M W0x00d: nop

11

Stalls

Detection Write is pending for a src

register srcA or srcB (not 0xF)

matches dstE or dstM in execute, memory, or write-back stages

Action Stall instruction in decode

E

M

W

F

D

Instructionmemory

Instructionmemory

PCincrement

PCincrement

Registerfile

Registerfile

ALUALU

Datamemory

Datamemory

SelectPC

rB

dstE dstMSelectA

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

icode

data out

data in

A BM

E

M_valA

W_valM

W_valE

M_valA

W_valM

d_rvalA

f_pc

PredictPC

valE valM dstE dstM




predPC

CCCC

d_srcBd_srcA

e_Cnd

M_Cnd

stat

stat

stat

stat

statimem_error

instr_valid

stat

dstE

dmem_errorm_stat

W_stat

M_stat

E_stat

D_stat

f_stat

Writeback stat

Stat

1: addl %eax, %edx2: mrmovl 4(%ebx), %ecx3: addl %ecx, %edx

Detecting Stall Condition


1 2 3 4 5 6 7 8 9

F D E M W



F0x00e: addl %edx,%eax D E M W

0x010: halt D E M W

10# demo-h2.ys

F

F D E M W0x00d: nop

11

Cycle 6

W

D

•••

W_dstE = %eaxW_valE = 3

srcA = %edxsrcB = %eax

Detecting Stall Condition


1 2 3 4 5 6 7 8 9

F D E M W



bubble

F

E M W

0x00e: addl %edx,%eax D D E M W

0x010: halt F D E M W

10# demo-h2.ys

F

F D E M W0x00d: nop

11

Cycle 6

W

D

•••

W_dstE = %eaxW_valE = 3


Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W

0x00c: addl %edx,%eax D D E M W

0x00e: halt F D E M W

10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4

DsrcA = %edxsrcB = %eax

Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W



10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4

EE_dstE = %eax


Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W



10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4

•••D


EE_dstE = %eax


Cycle 5

Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W



10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4

•••

MM_dstE = %eax


EE_dstE = %eax


Cycle 5

Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W



10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4 •••


•••

MM_dstE = %eax


EE_dstE = %eax


Cycle 5

Cycle 6

Multi-cycle Stalls


1 2 3 4 5 6 7 8 9

F D E M W


bubble

F

E M W

bubble

D

E M W



10# demo-h0.ys

F F

D

F

E M W bubble

11

Cycle 4 •••

WW_dstE = %eax


•••

MM_dstE = %eax


EE_dstE = %eax


Cycle 5

Cycle 6

Stalling: Another View

Operational rules: Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage

Like dynamically generated nop’s Move through stage-by-stage as regular instructions


0x006: irmovl $3,%eax

0x00c: addl %edx,%eax

Cycle 4

0x00e: halt




# demo-h0.ys

0x00e: halt



bubble


Cycle 5

0x00e: halt


bubble


bubble

Cycle 6

0x00e: halt

bubble

bubble


bubble

Cycle 7

0x00e: halt

bubble

bubble

Cycle 8


0x00e: halt

Write BackMemoryExecuteDecode

Fetch

Summary Pipeline Hazards

Data Hazards Solution #1: Stalling

Control Hazard jmp, ret Solution #1: Simple Prediction

real-world pipelines: car washes idea divide process into independent stages move objects through...

Documents

regreg regreg regreg

ps20 ps regreg comb

logic b regreg comb

ps20 ps slide

regreg clock comb

ps combinational logic

logic b comb

gopsclock regreg comb