TRANSCRIPT
Lecture 5
Pipelining of Processors
Computer Architecture
Lecturer: Irfan Ali
Characterizing Pipelines
1) Hardware or software implementation – pipelining can be implemented in either software or hardware.
2) Large or small scale – stations in a pipeline can range from simplistic to powerful, and a pipeline can range in length from short to long.
3) Synchronous or asynchronous flow – a synchronous pipeline operates like an assembly line: at a given time, each station is processing some amount of information. An asynchronous pipeline allows a station to forward information at any time.
4) Buffered or unbuffered flow – one stage of the pipeline either sends data directly to the next, or a buffer is placed between each pair of stages.
5) Finite chunks or continuous bit streams – the digital information that passes through a pipeline can consist of a sequence of small data items or an arbitrarily long bit stream.
6) Automatic or manual data feed – some implementations of pipelines use a separate mechanism to move information, and other implementations require each stage to participate in moving information.
What is Pipelining?
• Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of the parallelism that exists among the actions needed to execute an instruction. Today, pipelining is the key implementation technique used to make fast CPUs.
• A technique used in advanced microprocessors, where the microprocessor begins executing a second instruction before the first has been completed.
• A pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages.
• With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed.
Pipelining: It's Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads.
• If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; loads A–D run back to back, each taking 30 + 40 + 20 minutes]
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads.
[Figure: timeline from 6 PM to midnight; loads A–D overlap, with intervals of 30, 40, 40, 40, 40, and 20 minutes along the time axis]
Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
• Pipeline rate is limited by the slowest pipeline stage.
• Multiple tasks operate simultaneously.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce speedup.
• Time to "fill" the pipeline and time to "drain" it reduces speedup.
[Figure: the pipelined laundry timeline again, 6 PM to about 9:30 PM]
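The laundry numbers above can be checked with a short sketch (the helper names are made up; the formula assumes each stage starts as soon as its input load is ready):

```python
def sequential_time(stage_times, n_tasks):
    # every task runs all stages to completion before the next starts
    return n_tasks * sum(stage_times)

def pipelined_time(stage_times, n_tasks):
    # first task fills the pipe; each later task finishes one
    # slowest-stage interval after the previous one
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)

wash, dry, fold = 30, 40, 20                        # minutes, from the slides
print(sequential_time([wash, dry, fold], 4) / 60)   # 6.0 hours
print(pipelined_time([wash, dry, fold], 4) / 60)    # 3.5 hours
```

Note how the 40-minute dryer, the slowest stage, sets the rate: each extra load adds 40 minutes, not 90.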
How Pipelines Work
• The pipeline is divided into segments, and each segment can execute its operation concurrently with the other segments. Once a segment completes an operation, it passes the result to the next segment in the pipeline and fetches the next operation from the preceding segment.
Before there was pipelining…
• Single-cycle control: hardwired
– Low CPI (1)
– Long clock period (to accommodate the slowest instruction)
• Multi-cycle control: micro-programmed
– Short clock period
– High CPI
Single-cycle (time runs left to right): insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec)
Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
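The tradeoff can be quantified with the relation time = instruction count × CPI × clock period. A minimal sketch with illustrative, made-up numbers (the CPI and clock values are assumptions, not measurements):

```python
def exec_time_ns(insn_count, cpi, clock_period_ns):
    # execution time = instructions x cycles/instruction x time/cycle
    return insn_count * cpi * clock_period_ns

N = 1_000_000
# single-cycle: CPI of 1, but a long clock sized for the slowest instruction
single_cycle = exec_time_ns(N, 1.0, 10.0)
# multi-cycle: shorter clock, but several cycles per instruction
multi_cycle = exec_time_ns(N, 4.0, 2.0)
print(single_cycle, multi_cycle)  # 10000000.0 8000000.0
```

Neither design wins outright: which is faster depends on how the CPI penalty compares to the clock-period gain, which is exactly the tension pipelining resolves.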
Example
[Figure: four sample instructions, executed linearly – each instruction must finish before the next begins]
Four Pipelined Instructions
Cycle:   1   2   3   4   5   6   7   8
Insn 1:  IF  ID  EX  M   W
Insn 2:      IF  ID  EX  M   W
Insn 3:          IF  ID  EX  M   W
Insn 4:              IF  ID  EX  M   W
The first instruction completes after 5 cycles; each subsequent instruction completes 1 cycle later (5, 1, 1, 1).
Instructions in the pipeline stages
[Figure: snapshot of instructions occupying the pipeline stages]
Pipelining
Multi-cycle (time runs left to right):
insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Pipelined:
cycle 1: insn0.fetch
cycle 2: insn0.dec, insn1.fetch
cycle 3: insn0.exec, insn1.dec, insn2.fetch
cycle 4: insn1.exec, insn2.dec
cycle 5: insn2.exec
• Start with the multi-cycle design
• When insn0 goes from stage 1 to stage 2, insn1 starts stage 1
• Each instruction passes through all stages, but instructions enter and leave at a faster rate
• Can have as many insns in flight as there are stages
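The overlapped schedule above can be generated programmatically. A small sketch, assuming the three-stage pipe from the figure (`schedule` and `STAGES` are hypothetical names):

```python
STAGES = ["fetch", "dec", "exec"]

def schedule(n_insns):
    # instruction i enters the pipe at cycle i and advances one stage per cycle
    n_cycles = n_insns + len(STAGES) - 1
    table = []
    for cycle in range(n_cycles):
        active = [(i, STAGES[cycle - i])
                  for i in range(n_insns)
                  if 0 <= cycle - i < len(STAGES)]
        table.append(active)
    return table

for cycle, active in enumerate(schedule(3), start=1):
    print(f"cycle {cycle}:", ", ".join(f"insn{i}.{s}" for i, s in active))
```

Running it prints the five-cycle schedule shown above, confirming that three instructions (the same number as stages) can be in flight at once in cycle 3.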
CPI of Various Microarchitectures
[Table: CPI comparison of the microarchitectures above]
Processor Pipeline Review
[Figure: five-stage datapath – Fetch (PC, +4 adder, I-cache (instruction cache)), Decode (register file), Execute (ALU), Memory (D-cache (data cache)), Write-back]
Instruction Pipeline
• To implement pipelining, a designer divides a processor's datapath into sections (stages) and places pipeline latches (also called buffers) between each pair of stages.
[Figure: unpipelined processor]
[Figure: pipelined five-stage processor]
A Simple Implementation of a RISC Instruction Set
• Every instruction in this RISC subset can be implemented in at most 5 clock cycles. The 5 clock cycles are as follows:
Instruction Fetch (IF)
• The Instruction Fetch (IF) stage is responsible for obtaining the requested instruction from memory. The instruction and the program counter (which is incremented to point to the next instruction) are stored in the IF/ID pipeline register as temporary storage so that they may be used in the next stage at the start of the next clock cycle.
• Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC.
Stage 1: Fetch
• Fetch an instruction from memory every cycle
– Use PC to index memory
– Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
– The next stage will read this pipeline register
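The fetch steps above can be sketched behaviorally (all names are hypothetical; instruction memory is word-addressed here, so the PC increments by 1 as in the diagram rather than by 4):

```python
def fetch(pc, imem, branch_taken=False, branch_target=0):
    insn = imem[pc]                      # use PC to index instruction memory
    # MUX: next PC is either PC+1 or a branch target computed downstream
    next_pc = branch_target if branch_taken else pc + 1
    if_id = {"insn": insn, "pc_plus_1": pc + 1}  # write IF/ID pipeline register
    return if_id, next_pc

imem = ["add r3,r1,r2", "lw r4,0(r3)", "sw r4,4(r3)"]
if_id, pc = fetch(0, imem)
print(if_id["insn"], pc)  # add r3,r1,r2 1
```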
Stage 1: Fetch Diagram
[Figure: the PC indexes the Instruction Cache; an adder computes PC+1, and a MUX selects between PC+1 and the branch target coming back from Decode; the instruction bits and PC+1 are written to the IF/ID pipeline register]
Instruction Decode
• The Register Fetch (REG) and Instruction Decode (ID) stage is responsible for decoding the instruction and sending out the various control lines to the other parts of the processor. The instruction is sent to the control unit, where it is decoded, and the registers are fetched from the register file.
Stage 2: Decode
• Decodes opcode bits
– Set up Control signals for later stages
• Read input operands from register file
– Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
– Opcode
– Register contents
– PC+1 (even though decode didn’t use it)
– Control signals (from insn) for opcode and destReg
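Continuing the behavioral sketch, the decode steps above might look like this (a hypothetical instruction format – a tuple of opcode, regA, regB, destination):

```python
def decode(if_id, regfile):
    op, regA, regB, dest = if_id["insn"]   # split the instruction fields
    return {
        "op": op,                          # opcode drives later control signals
        "valA": regfile[regA],             # regA contents
        "valB": regfile[regB],             # regB contents
        "dest": dest,
        "pc_plus_1": if_id["pc_plus_1"],   # carried along for later stages
    }

regs = [0, 5, 7, 0, 0, 0, 0, 0]            # R0..R7
id_ex = decode({"insn": ("add", 1, 2, 3), "pc_plus_1": 1}, regs)
print(id_ex["valA"], id_ex["valB"])        # 5 7
```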
Stage 2: Decode Diagram
[Figure: instruction bits from the IF/ID pipeline register select regA and regB in the Register File; regA contents, regB contents, PC+1, and control signals are written to the ID/EX pipeline register; the write-back stage drives destReg and data back into the register file]
Execution
• The Effective Address/Execution (EX) stage is where any calculations are performed. The main component in this stage is the ALU, which provides arithmetic and logical capabilities.
• The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type:
■ Memory reference – the ALU adds the base register and the offset to form the effective address.
■ Register-register ALU instruction – the ALU performs the operation specified by the ALU opcode on the values read from the register file.
■ Register-immediate ALU instruction – the ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.
• In a load-store architecture, the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address and perform an operation on the data.
Stage 3: Execute
• Perform ALU operations
– Calculate result of instruction
• Control signals select operation
• Contents of regA used as one input
• Either regB or constant offset (from insn) used as second input
– Calculate PC-relative branch target
• PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
– ALU result, contents of regB, and PC+1+offset
– Control signals (from insn) for opcode and destReg
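The three ALU cases above can be sketched in the same behavioral style (same hypothetical record format as before; `offset` doubles as the sign-extended immediate):

```python
def execute(id_ex):
    op, offset = id_ex["op"], id_ex.get("offset", 0)
    if op in ("lw", "sw"):                 # memory reference: base + offset
        alu = id_ex["valA"] + offset
    elif op == "addi":                     # register-immediate ALU op
        alu = id_ex["valA"] + offset
    else:                                  # register-register ALU op (add)
        alu = id_ex["valA"] + id_ex["valB"]
    return {
        "op": op,
        "alu": alu,
        "valB": id_ex["valB"],             # kept in case of a store
        "dest": id_ex["dest"],
        "target": id_ex["pc_plus_1"] + offset,  # PC-relative branch target
    }

ex_mem = execute({"op": "add", "valA": 5, "valB": 7,
                  "dest": 3, "pc_plus_1": 1})
print(ex_mem["alu"])                       # 12
```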
Stage 3: Execute Diagram
[Figure: regA contents and, via a MUX, either regB contents or the sign-extended offset feed the ALU; the ALU result, regB contents, the PC+1+offset branch target, and control signals are written to the EX/Mem pipeline register]
Memory and IO
• The Memory Access and IO (MEM) stage is responsible for storing and loading values to and from memory. It is also responsible for input to and output from the processor. If the current instruction is not a memory or IO instruction, the result from the ALU is passed through to the write-back stage.
• If the instruction is a load, the memory does a read using the effective address computed in the previous cycle. If it is a store, the memory writes the data from the second register read from the register file, using the effective address.
Stage 4: Memory
• Perform data cache access
– ALU result contains address for LD or ST
– Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
– ALU result and Loaded data
– Control signals (from insn) for opcode and destReg
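A behavioral sketch of the memory stage, in the same hypothetical style (the data cache is modeled as a plain dict keyed by address):

```python
def memory(ex_mem, dcache):
    loaded = None
    if ex_mem["op"] == "lw":               # read at the effective address
        loaded = dcache[ex_mem["alu"]]
    elif ex_mem["op"] == "sw":             # write regB contents to memory
        dcache[ex_mem["alu"]] = ex_mem["valB"]
    # non-memory instructions pass the ALU result straight through
    return {"op": ex_mem["op"], "alu": ex_mem["alu"],
            "loaded": loaded, "dest": ex_mem["dest"]}

dcache = {12: 99}
mem_wb = memory({"op": "lw", "alu": 12, "valB": 0, "dest": 4}, dcache)
print(mem_wb["loaded"])                    # 99
```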
Stage 4: Memory Diagram
ALU
resu
l
t
Mem/WB
Pipeline register
ALU
resu
l
t
EX/Mem
Pipeline register
Contr
ol
signals
PC+1
+off
se
t
regB
conte
nt
s Loade
d
data
Contr
ol
signals
Execute
Wri
te-
back
in_data
in_addr
Data Cache
en R/W
destRegdata
target
30
Write Back
• The Write Back (WB) stage is responsible for writing the result of a calculation, memory access, or input into the register file.
• Register-register ALU instruction or load instruction: write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
Stage 5: Write-back
• Writing result to register file (if required)
– Write Loaded data to destReg for LD
– Write ALU result to destReg for arithmetic insn
– Opcode bits control register write enable signal
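The write-back selection above amounts to a MUX plus a write enable, which closes out the behavioral sketch (same hypothetical record format):

```python
def write_back(mem_wb, regfile):
    # MUX: loaded data for a load, ALU result for an arithmetic insn;
    # stores write nothing (register write enable stays deasserted)
    if mem_wb["op"] == "lw":
        regfile[mem_wb["dest"]] = mem_wb["loaded"]
    elif mem_wb["op"] in ("add", "addi"):
        regfile[mem_wb["dest"]] = mem_wb["alu"]

regs = [0] * 8
write_back({"op": "add", "alu": 12, "loaded": None, "dest": 3}, regs)
print(regs[3])                             # 12
```

Chaining fetch, decode, execute, memory, and write_back, with a pipeline register dict between each pair, gives a complete (if simplified) model of the five-stage datapath.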
Stage 5: Write-back Diagram
[Figure: a MUX selects between the loaded data and the ALU result held in the Mem/WB pipeline register; the selected data and destReg are driven back to the register file in the Decode stage]
Putting It All Together
[Figure: the complete five-stage datapath – the PC and instruction cache feed the IF/ID register; the register file (R0–R7, read ports regA/regB) and the decoded fields (op, dest, offset, valA, valB, PC+1) fill the ID/EX register; MUXes and the ALU produce the ALU result, branch target, and an eq? comparison into the EX/Mem register; the data cache produces mdata into the Mem/WB register; a final MUX writes data back to dest in the register file]