cs2071 new notes 3


Upload: intelinsideoc

Post on 29-Jan-2016


CMageshKumar_AP_AIHT CS2071_Computer Architecture

ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103 DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING

CS2071 COMPUTER ARCHITECTURE

Faculty Name: C.MAGESHKUMAR Class: IV EIE A&B Semester: VII

UNIT III –DATA PATH AND CONTROL

CONTENT

I. Instruction Execution Steps
   1. A Small Set of Instructions
   2. The Instruction Execution Unit
   3. A Single-Cycle Data Path
   4. Branching and Jumping
   5. Deriving the Control Signals
   6. Performance of the Single-Cycle Design

II. Control Unit Synthesis
   7. A Multicycle Implementation
   8. Choosing the Clock Cycle
   9. The Control State Machine
   10. Performance of the Multicycle Design

III. Microprogramming

IV. Pipelining
   11. Pipelining Concepts
   12. Pipeline Stalls or Bubbles
   13. Pipeline Timing and Performance
   14. Pipelined Data Path Design
   15. Pipelined Control
   16. Optimal Pipelining

V. Pipeline Performance
   17. Data Dependencies and Hazards
   18. Data Forwarding
   19. Pipeline Branch Hazards
   20. Delayed Branch and Branch Prediction
   21. Advanced Pipelining


I. INSTRUCTION EXECUTION STEPS

1. A SMALL SET OF INSTRUCTIONS

MiniMIPS instruction set – 40 instructions

MicroMIPS instruction set – 22 instructions

The instructions in the table below can be divided into 5 categories:

1. Seven (7) R-format ALU instructions (add, sub, slt, and, or, xor, nor)

2. Six (6) I-format ALU instructions (lui, addi, slti, andi, ori, xori)

3. Two (2) I-format memory access instructions (lw, sw)

4. Three (3) I-format conditional branch instructions (bltz, beq, bne)

5. Four (4) unconditional jump instructions (j, jr, jal, syscall)

Fig.1. MICROMIPS INSTRUCTION FORMATS

Execution sequence of MicroMIPS instructions:

The seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) share the following execution sequence:

1. Read out the contents of source registers 'rs' and 'rt' and forward them to the ALU as inputs.
2. Tell the ALU which operation to perform by means of the appropriate control signals.
3. Write the ALU output into destination register 'rd'.

Five of the six I-format ALU instructions (addi, slti, andi, ori, xori) share the following execution sequence:

1. Read out the content of source register 'rs' and forward it, together with the immediate value from the instruction, to the ALU as inputs.
2. Tell the ALU which operation to perform by means of the appropriate control signals.
3. Write the ALU output into destination register 'rt'.

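Field layout of the 32-bit instruction word (inst), bits numbered 31 down to 0:

Format  Bits 31-26   Bits 25-21   Bits 20-16   Bits 15-11   Bits 10-6    Bits 5-0
R       op (6 bits)  rs (5 bits)  rt (5 bits)  rd (5 bits)  sh (5 bits)  fn (6 bits)
I       op (6 bits)  rs (5 bits)  rt (5 bits)  imm: operand / offset, 16 bits (bits 15-0)
J       op (6 bits)  jta: jump target address, 26 bits (bits 25-0)

op = opcode; rs = source 1 or base; rt = source 2 or destination; rd = destination; sh = unused; fn = opcode extension.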
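The field boundaries above can be illustrated with a small sketch. This is only an illustration, not part of the course text: the function name `decode` and the dictionary layout are assumptions, and the example word uses the standard MIPS encoding of add (op = 0, fn = 32).

```python
def decode(inst: int) -> dict:
    """Split a 32-bit instruction word into its R-, I- and J-format fields."""
    return {
        "op":  (inst >> 26) & 0x3F,   # bits 31-26, opcode
        "rs":  (inst >> 21) & 0x1F,   # bits 25-21, source 1 or base
        "rt":  (inst >> 16) & 0x1F,   # bits 20-16, source 2 or destination
        "rd":  (inst >> 11) & 0x1F,   # bits 15-11, destination (R format)
        "sh":  (inst >> 6) & 0x1F,    # bits 10-6, unused in MicroMIPS
        "fn":  inst & 0x3F,           # bits 5-0, opcode extension (R format)
        "imm": inst & 0xFFFF,         # bits 15-0, operand / offset (I format)
        "jta": inst & 0x3FFFFFF,      # bits 25-0, jump target address (J format)
    }

# add $8, $9, $10 assembled by hand: op=0, rs=9, rt=10, rd=8, sh=0, fn=32
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 32
fields = decode(word)
```

All fields are extracted for every word; which of them are meaningful depends on the format, which the control unit works out from 'op' (and 'fn' for R-format instructions).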


The remaining I-format ALU instruction (lui) has the following execution sequence:

1. Take the immediate value from the instruction and forward it to the ALU as input.
2. Tell the ALU which operation to perform by means of the appropriate control signals.
3. Write the ALU output into destination register 'rt'.

The two I-format memory access instructions (lw, sw) share the following execution sequence:

1. Read out the content of 'rs'.
2. Add the value read out from 'rs' to the immediate value in the instruction to form a memory address.
3. Read from / write into memory at the specified address.
4. For 'lw', place the word read out from memory into 'rt'.

The three I-format conditional branch instructions (bltz, beq, bne) and the four unconditional jump instructions (j, jr, jal, syscall) are executed as follows:

1. Read out the content of source register 'rs' and the immediate value and forward them to the ALU as inputs.
2. Tell the ALU which operation to perform by means of the appropriate control signals.
3. The branch target address is specified by an offset relative to the incremented program counter value, (PC) + 4.
4. To branch back to the previous instruction, the offset supplied in the immediate field of the instruction is -2, which gives the branch target address (PC) + 4 - (2 * 4) = (PC) - 4.
5. For 'beq' and 'bne', the contents of 'rs' and 'rt' are compared to determine whether the branch condition is satisfied.
6. For 'bltz', the branch decision is based on the sign bit of the content of 'rs'.
7. For the four jump instructions (j, jr, jal, syscall):
   - The PC is unconditionally modified so that the next instruction is fetched from the jump target address.
   - The jump target address comes from the instruction itself (j, jal), is read out from register 'rs' (jr), or is a known constant associated with the location of an operating system routine (syscall).
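The address arithmetic in the steps above can be checked with a short sketch. The function names are illustrative. The branch formula follows the notes; the way the j/jal target is assembled (upper 4 bits of the incremented PC, then jta, then two zero bits) is the usual MIPS convention and is an assumption here, since the notes do not spell it out.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc: int, imm: int) -> int:
    """Branch target = (PC) + 4 + 4 * offset, where offset is a signed word count."""
    return (pc + 4 + 4 * sign_extend16(imm)) & MASK32

def jump_target(pc: int, jta: int) -> int:
    """Assumed MIPS-style j/jal target: top 4 bits of (PC)+4, then jta, then 00."""
    return ((pc + 4) & 0xF0000000) | ((jta & 0x3FFFFFF) << 2)

# Offset -2 (0xFFFE as a 16-bit field) branches back to the previous instruction.
back_one = branch_target(0x1000, 0xFFFE)
```

With (PC) = 0x1000 and offset -2, the target is 0x1000 + 4 - 8 = 0x0FFC, i.e. (PC) - 4, as stated in step 4.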

2. THE INSTRUCTION EXECUTION UNIT

The step-by-step execution of all 22 MicroMIPS instructions can be followed on the block diagram in Fig.2:

1. Beginning at the left end, the content of the program counter (PC) is supplied to the instruction cache and an instruction word is read out from the specified location.
2. On every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
3. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
4. Once an instruction has been read out from the instruction cache, its fields are separated and dispatched to the appropriate places. Example: the 'op' and 'fn' fields go to the control unit; 'rs', 'rt' and 'rd' go to the register file.
5. The upper input of the ALU always comes from register 'rs'; the lower input comes from 'rt' or from the immediate value of the instruction.
6. As the data from the register file pass through the ALU, the specified operation is performed and the result appears at the ALU output.
7. For arithmetic and logic instructions, the ALU output bypasses the data cache and runs through the feedback line to be stored in the destination register of the register file ('rd' for R-format, 'rt' for I-format instructions).
8. For memory access instructions, the ALU output is treated as the data address for writing into / reading from the data cache.
9. Data cache: for many instructions the ALU output is stored in a register, so the data cache is bypassed. For 'lw' and 'sw' the data cache is accessed: for 'sw' the content of 'rt' is written into the data cache, and for 'lw' the data cache output is sent to the register file.
10. In one clock cycle, the contents of any 2 of the 32 registers (normally 'rs' and 'rt') are read out through the read ports; at the same time, the ALU output is stored in a register via the write port.
11. The flip-flops that make up the registers are edge-triggered, so reading and writing the same register in a single clock cycle does not cause any problem.


12. For 'beq' and 'bne', the contents of 'rs' and 'rt' are compared to determine whether the branch condition is satisfied. The comparison is performed in the 'next address' block.
13. For 'bltz', the branch decision is based on the sign bit of the content of 'rs' rather than on a comparison of two register contents. This too is performed by the 'next address' block.
14. The 'next address' block also chooses the jump target address under the guidance of the control unit.
15. The jump target address comes from the instruction itself ('j', 'jal') or is read out from register 'rs' ('jr').
16. The middle part, comprising the program counter, instruction cache, register file, ALU and data cache, is known as the data path.

Fig.2. Abstract view of the instruction execution unit for MicroMIPS.

3. A SINGLE-CYCLE DATA PATH

1. The middle part, comprising the program counter, instruction cache, register file, ALU and data cache, is known as the data path.
2. The data path shown above executes one instruction per clock cycle, hence the name 'single-cycle data path'.
3. Single-cycle design: clock rate 125 MHz, CPI = 1.
4. Three multiplexers are used in the data path:
   1. At the input side of the register file
   2. At the lower input of the ALU
   3. At the outputs of the ALU and data cache
5. Multiplexer 1 (at the input side of the register file):
   i. This multiplexer allows 'rt', 'rd' or '$31' to be used as the index of the destination register into which the result will be written.
   ii. The logic signal 'RegDst', supplied by the control unit, directs the selection of 'rt', 'rd' or '$31'.
   iii. 'RegDst' control signal values and the corresponding selections:

        S.no  RegDst  Selection
        1     00      rt
        2     01      rd
        3     10      $31

   iv. 'RegWrite' is asserted by the control unit to write into the register file.
6. Registers 'rs' and 'rt' are read out for every instruction, even when they are not needed, so there is no read control signal.

(Labels in Fig.2: PC, instruction cache, next-address block, register file, ALU, data cache, control; fields op, fn, jta, imm, inst, rs, rt, rd. Instruction groups shown: 12 A/L, lui, lw, sw; bltz, jr; beq, bne; j, jal; syscall; 22 instructions in all. Separate instruction and data caches: Harvard architecture.)


7. The instruction cache likewise receives no read control signal, since an instruction is read out in every cycle.
8. Multiplexer 2 (at the lower input of the ALU):
   i. By asserting or deasserting the 'ALUSrc' control signal, the control unit chooses either the content of 'rt' or the 32-bit sign-extended version of the 16-bit immediate operand as the second ALU input.
      1. If 'ALUSrc' = 0 (deasserted), the content of 'rt' is used as the lower ALU input.
      2. If 'ALUSrc' = 1 (asserted), the 32-bit sign-extended version of the 16-bit immediate operand is used as the lower ALU input.
   ii. Sign extension of the immediate operand is performed by the 'SE' block.
9. Multiplexer 3 (at the outputs of the ALU and data cache): the control signal used here is 'RegInSrc'.

        S.no  RegInSrc  Selection
        1     00        Data cache output
        2     01        ALU output
        3     10        Incremented PC value coming from the next-address block

10. On every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
11. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
12. As the data from the register file pass through the ALU, the operation specified by the 'ALUFunc' signal is performed and the result appears at the ALU output.
13. For arithmetic and logic instructions, the ALU output bypasses the data cache and runs through the feedback line to be stored in the destination register of the register file.
14. For memory access instructions, the ALU output is treated as the data address for writing into ('DataWrite' signal) / reading from ('DataRead' signal) the data cache.
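The three multiplexers described above can be modelled as plain selection functions. This is a behavioural sketch only; the function and parameter names are made up for illustration, and Python integers stand in for 32-bit values (so a sign-extended negative immediate appears as a negative number).

```python
def sign_extend16(imm: int) -> int:
    """SE block: interpret the 16-bit immediate as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def dest_register(reg_dst: int, rt: int, rd: int) -> int:
    """Multiplexer 1: RegDst = 0 selects rt, 1 selects rd, 2 selects $31."""
    return (rt, rd, 31)[reg_dst]

def alu_lower_input(alu_src: int, rt_value: int, imm: int) -> int:
    """Multiplexer 2: ALUSrc = 0 selects (rt), 1 selects the sign-extended immediate."""
    return sign_extend16(imm) if alu_src else rt_value

def register_input(reg_in_src: int, data_out: int, alu_out: int, incr_pc: int) -> int:
    """Multiplexer 3: RegInSrc = 0 data cache output, 1 ALU output, 2 incremented PC."""
    return (data_out, alu_out, incr_pc)[reg_in_src]
```

For example, an R-format add would use RegDst = 1, ALUSrc = 0 and RegInSrc = 1, while 'lw' would use RegDst = 0, ALUSrc = 1 and RegInSrc = 0.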

(Figure: single-cycle data path. Blocks: PC, next-address block, instruction cache, register file, SE, ALU, data cache. Control signal labels: Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc; ALU overflow output ALUOvfl; SE extends 16 bits to 32 bits.)


4. BRANCHING AND JUMPING:

(Refer page no. 249,250 in text book B.Parhami)

5. DERIVING THE CONTROL SIGNALS:

(Refer page no. 250-253 in text book B.Parhami)

Control signals for the single-cycle MicroMIPS implementation.

6. PERFORMANCE OF THE SINGLE-CYCLE DESIGN

(Refer page no. 253-255 in text book B.Parhami)

II. CONTROL UNIT SYNTHESIS

7. MULTICYCLE IMPLEMENTATION:

Fig.3. Single-cycle versus multicycle instruction execution.

With multicycle design, a subset of actions required for an instruction is performed in one clock cycle.

Hence the clock cycle can be made much shorter, with several cycles needed to execute a single instruction.

Advantages of multicycle implementation are greater speed and economy

(Labels in Fig.3: Instr 1 to Instr 4 on two clock timelines; time needed versus time allotted for each instruction; instruction lengths of 3, 3, 4 and 5 cycles in the multicycle case; time saved.)


MULTICYCLE DATA PATH:

Fig.4. Abstract view of a multicycle instruction execution unit for MicroMIPS.

1. The data path in the block diagram above executes one instruction in every 3 to 5 clock cycles, hence the name "multicycle data path".
2. Multicycle design: clock rate 500 MHz, CPI approx. 4.
3. Cache block = instruction cache + data cache.
4. Every instruction completes within 5 cycles (refer to the control state machine).
5. When a word is read from the cache block, it must be held in a register for use in subsequent cycles.
6. The reason for having 2 registers, the "instruction register" and the "data register", between the cache and the register file is that once an instruction has been read out, it must be kept for all the remaining cycles of its execution so that the control signals can be generated appropriately.
7. A second register is therefore needed for the data read out by 'lw'.
8. Three other registers, 'x', 'y' and 'z', serve the same purpose of holding information between cycles.
9. Note that all registers except the program counter and the instruction register are loaded in every clock cycle.
10. Instruction fetch cycle: the execution of every instruction starts the same way. In the first cycle, the content of the PC is used to access the cache and the retrieved word is placed in the instruction register.
11. In the second clock cycle, the instruction is decoded and registers 'rs' and 'rt' are accessed.
12. If the instruction is one of the four jump instructions (j, jr, jal, syscall), its execution terminates in the 3rd cycle by simply writing the appropriate address into the PC.
13. If it is a branch instruction (beq, bne, bltz), the branch condition is checked and the appropriate value is written into the PC in the 3rd cycle.
14. All other instructions proceed to the 4th cycle and are completed there.
15. 'lw' requires a 5th cycle to write the data retrieved from the cache into a register.
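The cycle counts above, together with the quoted clock rates, allow a rough comparison of the two designs. The sketch below is only indicative: the instruction mix is a hypothetical one chosen for illustration, not a figure from the notes or the textbook.

```python
# Cycles per instruction class in the multicycle design (from the list above).
CYCLES = {"jump": 3, "branch": 3, "alu": 4, "store": 4, "load": 5}

def average_cpi(mix: dict) -> float:
    """Weighted average CPI for a mix given as {class: fraction}, fractions summing to 1."""
    return sum(fraction * CYCLES[cls] for cls, fraction in mix.items())

def time_per_instruction_ns(clock_mhz: float, cpi: float) -> float:
    """Average instruction time = CPI * clock period (period in ns = 1000 / MHz)."""
    return cpi * 1000.0 / clock_mhz

# Hypothetical instruction mix (an assumption for illustration only).
mix = {"alu": 0.50, "load": 0.20, "store": 0.10, "branch": 0.15, "jump": 0.05}

cpi = average_cpi(mix)
single_cycle_ns = time_per_instruction_ns(125, 1)   # 125 MHz, CPI = 1
multicycle_ns = time_per_instruction_ns(500, cpi)   # 500 MHz, CPI from the mix
```

For this particular mix the CPI works out to about 4, so both designs average roughly 8 ns per instruction. A mix with fewer loads and more branches or jumps would favour the multicycle design.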

FOR DETAILED CONTROL SIGNAL AND MUX EXPLANATION REFER PAGE NO. 260, 261 IN B.PARHAMI BOOK

(Labels in Fig.4: PC, cache (address, data), instruction register, data register, register file, x register, y register, z register, ALU, control; fields op, fn, jta, imm, rs, rt, rd.)


8. CHOOSING THE CLOCK CYCLE

(Refer page no. 262 in text book B.Parhami)


9. THE CONTROL STATE MACHINE

CONTROL STATE MACHINE for MULTICYCLE MicroMIPS

The control unit must distinguish between the (up to) 5 cycles of the multicycle design and must additionally perform different operations depending on the instruction.

The control state diagram depicts the control states and the state transitions.

The control state machine carries the required information along by moving from state to state. It is set to state 0 when program execution begins.

It then moves from state to state until one instruction has been completed, at which point it returns to state 0 to begin the execution of the next instruction.

The control state sequences for the various MicroMIPS instruction classes are as follows:

Instruction class   State sequence
ALU-type            0, 1, 7, 8
Load word           0, 1, 2, 3, 4
Store word          0, 1, 2, 6
Jump / branch       0, 1, 5

In each state except states 5 and 7, the control signals are uniquely determined by the state; in states 5 and 7 they also depend on the instruction being executed.

Information regarding the current control state and the instruction being executed is supplied by decoders.

The control signals can then be derived from the control state machine diagram and the decoder diagram.
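The state sequences listed above can be replayed with a few lines of code. This is a simplified sketch: a real control unit picks the next state from the opcode at run time, whereas here the whole sequence is looked up per instruction class, and the class names are illustrative.

```python
# State sequences per instruction class, as listed in the notes.
STATE_SEQUENCES = {
    "alu":         [0, 1, 7, 8],
    "load":        [0, 1, 2, 3, 4],
    "store":       [0, 1, 2, 6],
    "jump_branch": [0, 1, 5],
}

def run(instruction_classes):
    """Return the control states visited while executing the given instructions in order."""
    trace = []
    for cls in instruction_classes:
        trace.extend(STATE_SEQUENCES[cls])  # each instruction starts again at state 0
    return trace

trace = run(["load", "alu", "jump_branch"])
total_cycles = len(trace)  # one clock cycle per state
```

Because one state is occupied per clock cycle, the length of each sequence is the cycle count of that instruction class: 4 for ALU-type and store, 5 for load, 3 for jump / branch.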

Examples of control signal expressions (∨ = OR, ∧ = AND, ′ = complement):

Certain control signals depend only on the control state:

ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
RegWrite = ControlSt4 ∨ ControlSt8

Auxiliary signals identifying instruction classes:

addsubInst = addInst ∨ subInst ∨ addiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst

Logic expressions for ALU control signals:

AddSub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst
FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
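The expressions above can be written as executable Boolean logic. The sketch assumes that terms listed side by side are ORed, that a state term placed in front of a parenthesised group is ANDed with it, and that ControlSt7 is complemented in FnClass1 (so that the ALU defaults to add/subtract outside state 7). These operator choices are reconstructions and should be checked against the textbook. The function name and calling convention are illustrative.

```python
def alu_control(state: int, inst: str) -> dict:
    """Evaluate the control expressions for a control state (0..8) and a mnemonic."""
    def st(n: int) -> bool:
        return state == n  # ControlSt<n> is asserted only in state n

    addsub = inst in {"add", "sub", "addi"}
    logic = inst in {"and", "or", "xor", "nor", "andi", "ori", "xori"}

    return {
        "ALUSrcX":  st(2) or st(5) or st(7),
        "RegWrite": st(4) or st(8),
        "AddSub":   st(5) or (st(7) and inst == "sub"),
        "FnClass1": (not st(7)) or addsub or logic,   # assumed complement on ControlSt7
        "FnClass0": st(7) and (logic or inst in {"slt", "slti"}),
        "LogicFn1": st(7) and inst in {"xor", "xori", "nor"},
        "LogicFn0": st(7) and inst in {"or", "ori", "nor"},
    }

signals = alu_control(7, "nor")
```

In state 7 with 'nor', both LogicFn bits and both FnClass bits come out asserted; with 'slt' only FnClass0 is asserted.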


Decoders

op decoder (6-bit 'op' field in, one output line per opcode):

op    Output
0     RtypeInst
1     bltzInst
2     jInst
3     jalInst
4     beqInst
5     bneInst
8     addiInst
10    sltiInst
12    andiInst
13    oriInst
14    xoriInst
15    luiInst
35    lwInst
43    swInst

fn decoder (6-bit 'fn' field in, used for R-type instructions):

fn    Output
8     jrInst
12    syscallInst
32    addInst
34    subInst
36    andInst
37    orInst
38    xorInst
39    norInst
42    sltInst

st decoder (4-bit state 'st' in): outputs 0 to 8 give ControlSt0 to ControlSt8; outputs 9 to 15 carry no signal name.

10. PERFORMANCE OF THE MULTICYCLE DESIGN

(Refer page no. 266 in text book B.Parhami)


III. MICROPROGRAMMING

The control state machine resembles a program that has instructions (states), branching and loops. Such a hardware program is called a microprogram and its basic steps are microinstructions.

A microinstruction is a single instruction in microcode. It is the most elementary instruction in the computer, such as moving the contents of a register to the arithmetic logic unit (ALU).

It takes several microinstructions to carry out one complex machine instruction (CISC).

Also called "micro-ops" or "µops", microinstructions differ within the same computer family and even within the same vendor.

Microprogrammed control is a control mechanism that generates control signals by using a memory called control storage (CS), which contains the control signals.

Although microprogrammed control seems advantageous for CISC machines, since CISC requires systematic development of sophisticated control signals, there is no intrinsic difference between hardwired and microprogrammed control.

Microprogramming is a method of control unit design in which the control signal selection and sequencing information is stored in a ROM or RAM called the control store or control memory.

A microprogrammed control unit is a general approach to implementing the control unit: the control signals are generated by a program similar to a machine language program.

Instead of implementing the control state machine in custom hardware, we can store microinstructions in the locations of a control ROM, fetching and executing a sequence of microinstructions for each machine language instruction.

Each microinstruction defines a step in the execution of a machine language instruction.

Advantages of ROM-based implementation of control:
o Simpler hardware
o More regular
o Less dependent on instruction-set architecture details
o The same hardware can be used for a different purpose by modifying the ROM contents

Microprogramming: designing a suitable sequence of microinstructions to realize a particular instruction set architecture is called microprogramming.

Microprogrammable machine: if the microprogram is easily modifiable, even by the user, the machine is called a microprogrammable machine.

Microinstruction format:
o The microinstruction is 23 bits wide. Except for the sequence control bits, each bit corresponds one-to-one to a control signal of the multicycle data path.
o The 2-bit sequence control field controls microinstruction sequencing in the same way that "PC control" affects the sequencing of machine language instructions.

Microprogrammed control unit: the diagram of the microprogrammed control unit for MicroMIPS shows 4 options (selected by a MUX) for choosing the next microinstruction.
o Option 0: advance to the next microinstruction in sequence by incrementing the microprogram counter.
o Options 1 & 2: allow branching to occur depending on the opcode field of the machine instruction being executed.
o Option 3: go to microinstruction 0, corresponding to state 0 (refer to the control state machine). This initiates the fetch phase for the next machine instruction.

Dispatch table 1: corresponds to the multiway branch in going from cycle 2 to cycle 3.

Dispatch table 2: implements the branch between cycles 3 and 4 (refer to the control state machine).
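The four sequencing options can be sketched as a next-address function for the microprogram counter. The dispatch table contents below are hypothetical: they simply reuse the control state numbers as microinstruction addresses for a few opcodes, which is not necessarily how the textbook lays out its microprogram.

```python
def next_micro_pc(micro_pc: int, seq_ctrl: int, op: int,
                  dispatch1: dict, dispatch2: dict) -> int:
    """Select the next microinstruction address from the 2-bit sequence control field."""
    if seq_ctrl == 0:
        return micro_pc + 1      # option 0: next microinstruction in sequence
    if seq_ctrl == 1:
        return dispatch1[op]     # option 1: multiway branch on opcode (cycle 2 to 3)
    if seq_ctrl == 2:
        return dispatch2[op]     # option 2: branch on opcode (cycle 3 to 4)
    return 0                     # option 3: back to microinstruction 0 (fetch)

# Hypothetical tables: addresses chosen equal to the control state numbers.
DISPATCH1 = {0: 7, 35: 2, 43: 2, 4: 5, 2: 5}   # R-type, lw, sw, beq, j
DISPATCH2 = {35: 3, 43: 6}                      # lw, sw

# Walk through the microinstructions of a load word (op = 35).
addr = 0
lw_path = [addr]
for seq in (0, 1, 2, 0):
    addr = next_micro_pc(addr, seq, 35, DISPATCH1, DISPATCH2)
    lw_path.append(addr)
```

With these tables the load visits addresses 0, 1, 2, 3, 4, matching the state sequence given for load word, after which option 3 returns the sequencer to address 0.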


23-BIT MICROINSTRUCTION FORMAT FOR MICROMIPS.

Microprogrammed control unit for MicroMIPS

(For detailed explanation with microprogram example please Refer page no. 269 - 271 in text

book B.Parhami)

Fields of the microinstruction, by group:

Group              Fields
PC control         JumpAddr, PCSrc, PCWrite
Cache control      InstData, MemRead, MemWrite, IRWrite
Register control   RegWrite, RegDst, RegInSrc
ALU inputs         ALUSrcX, ALUSrcY
ALU function       AddSub, LogicFn, FnType
Sequence control   2-bit sequencing field

(Labels in the control unit diagram: microprogram memory or PLA with address input and data output; microinstruction register; MicroPC with incrementer; dispatch tables 1 and 2, addressed by op from the instruction register; multiplexer inputs 0 to 3 selected by sequence control; control signals to data path.)


IV. PIPELINING

11. PIPELINING CONCEPTS

2 strategies for achieving greater performance:

Strategy 1: Multiple-instruction-issue or superscalar organization: use multiple independent data paths that can accept several instructions that are read out at once.

Strategy 2: Pipelined or super-pipelined organization: overlap the execution of several instructions in the single-cycle design, starting the next instruction before the previous one has finished.

Pipelining:

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages.

Each stage completes a part of an instruction, in parallel with the other stages. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

Pipelining does not decrease the execution time of an individual instruction. Instead, it increases instruction throughput.

The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.

The 5 instruction execution steps / stages in the MicroMIPS pipeline (each step takes 1-2 ns):

1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback

Pipelined Instruction Execution (Pipelining in the MicroMIPS instruction execution process.)

In a task-time diagram, the stages of each task are horizontally aligned and their positions along the horizontal axis represent the timing of their execution.

In a space-time diagram, the vertical axis represents the stages in the pipeline (the space dimension) and the boxes representing the various stages of a task are diagonally aligned.

Ideally, a 'q-stage' pipeline can increase instruction execution throughput by a factor of 'q'. In practice this is not quite achieved, because of the following:

o Effects of pipeline start-up and drainage
o Wastage due to unequal stage delays
o Time overhead of saving stage results in registers
o Safety margin in the clock period necessitated by clock skew
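The effect of start-up and drainage alone can be quantified with a short sketch. It assumes an idealized pipeline with equal stage delays, no stalls and no latching overhead, so it captures only the first item in the list above; the function names are illustrative.

```python
def pipeline_cycles(num_tasks: int, num_stages: int) -> int:
    """Cycles to finish num_tasks on an ideal pipeline: fill the pipe once, then one task per cycle."""
    return num_stages + num_tasks - 1

def speedup(num_tasks: int, num_stages: int) -> float:
    """Speedup over running each task through all stages with no overlap."""
    return (num_tasks * num_stages) / pipeline_cycles(num_tasks, num_stages)

cycles_7_tasks = pipeline_cycles(7, 5)   # 5-stage pipeline, 7 instructions
gain_7_tasks = speedup(7, 5)
gain_many = speedup(10_000, 5)
```

For 7 instructions on 5 stages the pipeline needs 5 + 7 - 1 = 11 cycles instead of 35, a speedup of about 3.2 rather than 5. As the number of instructions grows, the speedup approaches the stage count q.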

(Figure: five instructions, Instr 1 to Instr 5 (task dimension), each passing through instruction cache, register file, ALU, data cache and register file in successive cycles, Cycle 1 to Cycle 9 (time dimension). Each instruction starts one cycle after the previous one.)


Fig. Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).

12. PIPELINE STALLS OR BUBBLES

Data dependency in a pipeline: "execution of one instruction depending on the completion of a previous instruction".

A data dependency in the pipeline can cause pipeline stalls, which diminish performance.

Types of data dependency:

o Read-after-compute: register access after updating it with a computed value.
o Read-after-load: register access after updating it with data from memory.

An example of read-after-compute is shown in the diagram below, where the 3rd instruction uses the value that the 2nd instruction writes into register $8, and the 4th instruction needs the result that the 3rd instruction places in register $9. Note that the write into register $8 is completed in cycle 6, so the new value can be read from register $8 beginning with cycle 7. The 3rd instruction, however, reads out registers $8 and $2 in cycle 4. This data dependency problem can be solved by bubble insertion or by data forwarding.

BUBBLE INSERTION:

First detect the type of data dependency.

Bubble insertion: inserting a redundant and harmless instruction (such as adding 0 to a register or shifting a register by 0 bits) before the dependent instruction. Such an instruction is called a "no-op" (no-operation) instruction. Because no-ops perform no useful task but still occupy memory and pipeline slots, they resemble bubbles in a water pipe.

Insertion of bubbles in a pipeline

o reduces throughput
o hurts performance when more than 2 bubbles are inserted.

So bubble insertion should be minimized. This can be done by moving a useful instruction of the program between the data-dependent instructions instead of inserting bubbles.
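The number of bubbles needed without data forwarding follows from the stage timing. The sketch below assumes the 5-stage pipeline of these notes: an instruction reads its registers one cycle after its fetch and writes its result four cycles after its fetch. The flag `same_cycle_write_read` models the optional assumption that a register can be updated and read within one cycle. All names are illustrative.

```python
REG_READ_OFFSET = 1    # stage 2 (register read) happens 1 cycle after fetch
WRITEBACK_OFFSET = 4   # stage 5 (register writeback) happens 4 cycles after fetch

def bubbles_needed(distance: int, same_cycle_write_read: bool = False) -> int:
    """Bubbles required, without forwarding, between a producer and a consumer.

    distance is how many instructions later the consumer is issued
    (1 means it immediately follows the producer).
    """
    # Earliest cycle, relative to the producer's fetch, in which the new value can be read.
    earliest_read = WRITEBACK_OFFSET + (0 if same_cycle_write_read else 1)
    # Cycle in which the consumer would read its registers if no bubble were inserted.
    actual_read = distance + REG_READ_OFFSET
    return max(0, earliest_read - actual_read)

adjacent = bubbles_needed(1)
adjacent_fast_regfile = bubbles_needed(1, same_cycle_write_read=True)
```

For back-to-back dependent instructions this gives three bubbles, or two if the register file can be written and read in the same cycle. Moving independent instructions in between increases the distance and reduces the bubble count accordingly.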

DATA FORWARDING:

Data forwarding is the bypassing of the ALU output of the 1st instruction directly to the ALU input where the 2nd instruction needs it, without waiting for that value to be stored first. See the diagrams below.

Control dependency:

When a conditional branch is executed, the location of the next instruction depends on whether the branch condition is satisfied. Since branch conditions are based on testing register contents, the branch condition is resolved at the end of the 2nd pipeline stage. Therefore a bubble is required after every conditional branch instruction.

(a) Task-time diagram (b) Space-time diagram. Both diagrams cover cycles 1 to 11 for instructions 1 to 7 on pipeline stages 1 to 5, with the start-up region and the drainage region marked.

f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback


Read-after-write data dependency and its possible resolution through data forwarding.

Instruction sequence in the figure (cycles 1 to 8):

   $5 = $6 + $7
   $8 = $8 + $6
   $9 = $8 + $2
   sw $9, 0($3)

Figure annotations: "Writes into $8", "Reads from $8", "Data forwarding". Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency; two bubbles, if we assume that a register can be updated and read from in one cycle.

Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.

Instruction sequence in the figure (cycles 1 to 8):

   sw $6, . . .
   lw $8, . . .
   Insert bubble?
   $9 = $8 + $2

Figure annotations: "Insert bubble?", "Reorder?". Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.


Control dependency due to conditional branch.

Instruction sequence in the figure (cycles 1 to 8):

   $6 = $3 + $5
   beq $1, $2, . . .
   Insert bubble?
   $9 = $8 + $2

Figure annotations: "Insert bubble?", "Reorder? (delayed branch)", "Assume branch resolved here", "Here would need 1-2 more bubbles".

13. PIPELINE TIMING AND PERFORMANCE (Refer page no. 284 in text book B.Parhami)

14. PIPELINED DATA PATH DESIGN (Refer page no. 285-286 for detailed description of each stage in text book B.Parhami)

The pipelined data path for MicroMIPS is obtained by inserting latches or registers in the single-cycle data path.

The 5 pipeline stages are

1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback

(Figure: pipelined data path, Stage 1 to Stage 5. Blocks: PC, Incr, next-address block, instruction cache, register file, SE, ALU, data cache. Signal labels: NextPC, IncrPC, SeqInst, RetAddr, Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, ALUOvfl, DataRead, DataWrite, RegInSrc.)

15. PIPELINED CONTROL (Refer page no. 289 in text book B.Parhami)

16. OPTIMAL PIPELINING (Refer page no. 291 in text book B.Parhami)


V. PIPELINE PERFORMANCE

17. DATA DEPENDENCIES AND HAZARDS

Data dependency in pipeline : “Execution of one instruction depending on completion of a previous

instruction” or “the phenomenon of one instruction requiring data generated by previous instruction is called

data dependency”

The generated data may reside in a register or memory location where the subsequent instruction expects to

find the value.

In the diagram below, each of the 2nd through 5th instructions reads register $2, which is written by the 1st instruction.

o The 5th instruction reads register $2 after the 1st instruction has completed its register writeback, so it poses no problem.

o The 4th instruction needs the new content of register $2 in the same cycle in which the 1st instruction writes it back, which is only a minor problem.

o The 2nd and 3rd instructions need the content of register $2 before the 1st instruction has written it back. This is the major problem caused by data dependency.

Data dependencies in a pipeline can cause pipeline stalls, which diminish performance.
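The timing argument above can be checked with a small sketch. It assumes the five-stage pipeline of Section 14 with register access in stage 2 and register writeback in stage 5, one instruction issued per cycle and no stalls; the function names are hypothetical.

```python
REG_READ_STAGE = 2   # stage 2: instruction decode and register access
WRITEBACK_STAGE = 5  # stage 5: register writeback


def cycle_of(instr_number, stage):
    """Cycle (1-based) in which instruction `instr_number` (1-based)
    occupies `stage` (1-based), assuming no stalls."""
    return instr_number + stage - 1


def read_timing(reader, writer=1):
    """Say whether `reader` reads the register file before, in the same
    cycle as, or after the writeback performed by `writer`."""
    write_cycle = cycle_of(writer, WRITEBACK_STAGE)
    read_cycle = cycle_of(reader, REG_READ_STAGE)
    if read_cycle > write_cycle:
        return "after"
    if read_cycle == write_cycle:
        return "same cycle"
    return "before"


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"instruction {n} reads $2 {read_timing(n)} the writeback")
```

Under these assumptions the 1st instruction writes back in cycle 5, while the 2nd to 5th instructions read their registers in cycles 3, 4, 5 and 6, which matches the three cases listed above.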

Types of data dependency:

o Read-after-compute: register access after updating it with a computed value. This dependency exists when one instruction updates a register with a computed value and a subsequent instruction uses the content of that register as an operand.

o Read-after-load: register access after updating it with data from memory. This dependency arises when one instruction loads a new value from memory into a register and a subsequent instruction uses the content of that register as an operand.
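The two dependency types can be told apart mechanically from the producing instruction. The sketch below is an illustration only: the `Instr` record and the `classify` function are hypothetical, only `lw` is treated as a load, and special cases such as a hard-wired zero register are ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Simplified instruction record (hypothetical, for illustration)."""
    op: str               # mnemonic, e.g. "sub", "add", "lw"
    dest: Optional[str]   # register written, or None if none is written
    srcs: Tuple[str, ...]  # registers read as operands


def classify(producer, consumer):
    """Return the type of data dependency of `consumer` on `producer`,
    or None if `consumer` does not read the register `producer` writes."""
    if producer.dest is None or producer.dest not in consumer.srcs:
        return None
    if producer.op == "lw":
        return "read-after-load"
    return "read-after-compute"


if __name__ == "__main__":
    sub = Instr("sub", "$2", ("$1", "$3"))    # $2 = $1 - $3
    load = Instr("lw", "$2", ("$12",))        # lw $2,4($12)
    user = Instr("add", "$4", ("$2", "$5"))   # reads $2
    print(classify(sub, user))
    print(classify(load, user))
```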

[Figure: Data dependency in a pipeline (Cycle 1 to Cycle 9). The instruction $2 = $1 - $3 is followed by four instructions that read register $2; each instruction passes through Instr cache, Reg file, ALU, Data cache, Reg file.]


(SINCE THE BELOW TOPICS ARE CLEAR AND READABLE IN THE BOOK, PLEASE REFER PAGE NO. 298-308 IN TEXT BOOK B.PARHAMI)

18. DATA FORWARDING:

Resolving Data Dependencies via Forwarding: When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.

Certain Data Dependencies Lead to Bubbles: When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
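The two cases above follow from when a result becomes available and when the next instruction needs it. The sketch below is an illustration under stated assumptions: full forwarding into the ALU inputs, an ALU result available at the end of stage 3, a loaded value available at the end of stage 4, and the consumer needing its operand at the start of its own stage 3. The names are hypothetical.

```python
ALU_STAGE = 3  # stage in which a consumer needs its operand
# Stage at whose end the producer's result exists (assumed values).
READY_STAGE = {"alu": 3, "load": 4}


def bubbles_with_forwarding(producer_kind, distance):
    """Bubbles needed between a producer and a consumer that is
    `distance` instructions later (1 = immediately following),
    assuming results are forwarded as soon as they exist."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    ready_stage = READY_STAGE[producer_kind]
    # The result is usable in the cycle after the producer leaves
    # `ready_stage`; the consumer reaches the ALU `distance` cycles
    # after the producer did.
    return max(0, ready_stage - (ALU_STAGE - 1) - distance)


if __name__ == "__main__":
    print(bubbles_with_forwarding("alu", 1))   # read-after-compute
    print(bubbles_with_forwarding("load", 1))  # read-after-load
```

With these assumptions an ALU result never needs a bubble, while a load followed immediately by a user of the loaded register needs one bubble. If one unrelated instruction separates them, the bubble is not needed.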

[Figure: Forwarding example (Cycle 1 to Cycle 9). The instruction $2 = $1 - $3 is followed by three instructions that read register $2.]

[Figure: Load example (Cycle 1 to Cycle 9). The instruction lw $2,4($12) is followed by three instructions that read register $2.]


19. PIPELINE BRANCH HAZARDS

Software-based solutions

Compiler inserts a “no-op” after every branch (simple, but wasteful)

Branch is redefined to take effect after the instruction that follows it

Branch delay slot(s) are filled with useful instructions via reordering

Hardware-based solutions

Mechanism similar to data hazard detector to flush the pipeline

Constitutes a rudimentary form of branch prediction:

o Always predict that the branch is not taken, flush if mistaken

o More elaborate branch prediction strategies possible
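The cost of the "always predict not taken" strategy can be estimated by counting taken branches, since each of them forces a flush. This is a minimal sketch: the number of cycles lost per flush depends on where the branch is resolved, so it is left as a parameter, and the function name and sample trace are hypothetical.

```python
def not_taken_flush_cycles(outcomes, flush_cycles=1):
    """Cycles lost when every branch is predicted not taken.

    `outcomes` is a sequence of booleans, True meaning the branch was
    actually taken. `flush_cycles` is the assumed number of pipeline
    slots discarded per misprediction."""
    mispredictions = sum(1 for taken in outcomes if taken)
    return mispredictions * flush_cycles


if __name__ == "__main__":
    trace = [True, False, True, True]  # made-up branch outcomes
    print(not_taken_flush_cycles(trace, flush_cycles=1))
```

Branches that fall through cost nothing under this strategy, which is why it works best when most branches are not taken.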

20. DELAYED BRANCH AND BRANCH PREDICTION

Predicting whether a branch will be taken

Always predict that the branch will not be taken

Use program context to decide (backward branch is likely taken, forward branch is likely not taken)

Allow programmer or compiler to supply clues

Decide based on past history (maintain a small history table); to be discussed later

Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines

Problem with this approach: Each branch in a loop entails two mispredictions:

1. Once in first iteration (loop is repeated, but the history indicates exit from loop)

2. Once in last iteration (when loop is terminated, but history indicates repetition)
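The two mispredictions per loop can be reproduced with a short simulation. The sketch assumes a single history bit that simply remembers the last outcome of the branch, and a loop-closing branch that is taken on every iteration except the last; the function names are hypothetical.

```python
def loop_branch_outcomes(iterations, executions):
    """Outcomes of a loop-closing branch: taken on every iteration
    except the last, repeated for each execution of the loop."""
    one_pass = [True] * (iterations - 1) + [False]
    return one_pass * executions


def mispredictions_one_bit(outcomes, initial_prediction=False):
    """Count mispredictions when the prediction is always the most
    recent outcome (one history bit)."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember only the last outcome
    return misses


if __name__ == "__main__":
    outcomes = loop_branch_outcomes(iterations=10, executions=3)
    print(mispredictions_one_bit(outcomes))
```

With the history bit initially indicating "not taken" (as it would after an earlier exit from the loop), each execution of the loop costs exactly two mispredictions regardless of the iteration count.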


Other branch prediction algorithms:

Hardware Implementation of Branch Prediction

The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches
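The direct-mapped lookup described above can be sketched in software. This is an illustration, not the circuit from the text book: the table size, the dropping of two byte-offset bits, and the use of a 2-bit saturating counter as the history are all assumptions, and the class and method names are hypothetical.

```python
class BranchTable:
    """Direct-mapped table of recent branches (illustrative sketch)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        # Assumes word-aligned instructions: drop the two byte-offset
        # bits, then use the low-order bits of the PC as the index.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict(self, pc, incremented_pc):
        """Return the predicted next PC for the instruction at `pc`."""
        entry = self.entries[self._index(pc)]
        # The stored branch address must match (like a cache tag) and
        # the history must say "taken" (counter value 2 or 3).
        if entry and entry["branch_addr"] == pc and entry["counter"] >= 2:
            return entry["target"]
        return incremented_pc

    def update(self, pc, taken, target):
        """Record the actual outcome of the branch at `pc`."""
        i = self._index(pc)
        entry = self.entries[i]
        if entry is None or entry["branch_addr"] != pc:
            # New branch, or a different branch mapping to this entry.
            self.entries[i] = {"branch_addr": pc, "target": target,
                               "counter": 2 if taken else 1}
            return
        step = 1 if taken else -1
        entry["counter"] = min(3, max(0, entry["counter"] + step))
        entry["target"] = target


if __name__ == "__main__":
    table = BranchTable()
    table.update(0x40, taken=True, target=0x10)
    print(hex(table.predict(0x40, incremented_pc=0x44)))
```

Because two history bits are kept, a single "not taken" outcome after repeated "taken" outcomes does not flip the prediction, which avoids one of the two mispredictions per loop seen with a single history bit.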

[Figure: Four-state branch prediction schemes (three variants). States: "Predict taken", "Predict taken again", "Predict not taken", "Predict not taken again"; transitions are labeled "Taken" and "Not taken".]

[Figure: Hardware elements for branch prediction. A table holds the addresses of recent branch instructions, their target addresses and history bit(s); low-order bits from the PC are used as the index; the read-out table entry is compared (=) with the PC, and logic selects the next PC from the incremented PC or the stored target address.]


21. ADVANCED PIPELINING (Refer page no. 306-308 in text book B.Parhami)

The Three Hardware Designs for MicroMIPS

[Figure: Block diagrams of the three MicroMIPS hardware designs, labeled as follows]

Single-cycle: 125 MHz, CPI = 1

Multicycle: 500 MHz, CPI ≈ 4

Pipelined (Stage 1 to Stage 5): 500 MHz, CPI ≈ 1.1
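The clock rates and CPI values labeled on the three designs (125 MHz with CPI = 1, 500 MHz with CPI of about 4, and 500 MHz with CPI of about 1.1) can be turned into an average time per instruction, which is CPI divided by the clock rate. The sketch below is a rough comparison only, since the multicycle and pipelined CPI values are approximate; the function and dictionary names are hypothetical.

```python
def time_per_instruction_ns(clock_mhz, cpi):
    """Average time per instruction in nanoseconds."""
    cycle_time_ns = 1000.0 / clock_mhz  # 1 / f, converted to ns
    return cpi * cycle_time_ns


# (clock rate in MHz, CPI) as labeled on the three designs;
# the multicycle and pipelined CPI values are approximate.
DESIGNS = {
    "single-cycle": (125, 1.0),
    "multicycle": (500, 4.0),
    "pipelined": (500, 1.1),
}

if __name__ == "__main__":
    for name, (mhz, cpi) in DESIGNS.items():
        print(f"{name}: {time_per_instruction_ns(mhz, cpi):.1f} ns")
```

With these figures the single-cycle and multicycle designs both average about 8 ns per instruction, while the pipelined design averages about 2.2 ns.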