CMageshKumar_AP_AIHT CS2071_Computer Architecture
ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103 DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING
CS2071 COMPUTER ARCHITECTURE
Faculty Name: C.MAGESHKUMAR Class: IV EIE A&B Semester: VII
UNIT III –DATA PATH AND CONTROL
CONTENT
I. Instruction Execution Steps
1. A Small Set of Instructions
2. The Instruction Execution Unit
3. A Single-Cycle Data Path
4. Branching and Jumping
5. Deriving the Control Signals
6. Performance of the Single-Cycle Design
II. Control Unit Synthesis
7. A Multicycle Implementation
8. Choosing the Clock Cycle
9. The Control State Machine
10. Performance of the Multicycle Design
III. Microprogramming
IV. Pipelining
11. Pipelining Concepts
12. Pipeline Stalls or Bubbles
13. Pipeline Timing and Performance
14. Pipelined Data Path Design
15. Pipelined Control
16. Optimal Pipelining
V. Pipeline Performance
17. Data Dependencies and Hazards
18. Data Forwarding
19. Pipeline Branch Hazards
20. Delayed Branch and Branch Prediction
21. Advanced Pipelining
I. INSTRUCTION EXECUTION STEPS
1. A SMALL SET OF INSTRUCTIONS
MiniMIPS instruction set – 40 instructions
MicroMIPS instruction set – 22 instructions
The instructions in the table below can be divided into 5 categories:
1. Seven (7) R-format ALU instructions (add, sub, slt, and, or, xor, nor)
2. Six (6) I-format ALU instructions (lui, addi, slti, andi, ori, xori)
3. Two (2) I-format memory access instructions (lw, sw)
4. Three (3) I-format conditional branch instructions (bltz, beq, bne)
5. Four (4) unconditional jump instructions (j, jr, jal, syscall)
Fig.1. MICROMIPS INSTRUCTION FORMATS
Execution sequence of MicroMIPS instructions:
The seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) have the following common execution sequence:
1. Read out the contents of source registers 'rs' and 'rt' and forward them to the ALU as inputs
2. Tell the ALU which operation to perform by means of the appropriate control signal
3. Write the output of the ALU into destination register 'rd'
Five of the six I-format ALU instructions (addi, slti, andi, ori, xori) have the following common execution sequence:
1. Read out the content of source register 'rs' and forward it, together with the immediate value from the instruction, to the ALU as inputs
2. Tell the ALU which operation to perform by means of the appropriate control signal
3. Write the output of the ALU into destination register 'rt'
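Fig.1 field layout (inst = instruction, 32 bits, bit positions 31 to 0):
R-format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | rd (5 bits, 15-11) | sh (5 bits, 10-6) | fn (6 bits, 5-0)
I-format: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (operand / offset, 16 bits, 15-0)
J-format: op (6 bits) | jta (jump target address, 26 bits, 25-0)
Field roles: op = opcode; rs = source 1 or base; rt = source 2 or destination; rd = destination; sh = unused; fn = opcode extension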
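The field boundaries of Fig.1 can be tried out with a short Python sketch. The function name `decode_fields` and the example word are illustrative only; the bit positions follow the boundaries 31, 25, 20, 15, 10, 5, 0 of the figure, and the example assumes the standard MIPS encoding of add (op = 0, fn = 32).

```python
def decode_fields(inst):
    """Split a 32-bit instruction word into the fields named in Fig.1."""
    return {
        "op": (inst >> 26) & 0x3F,    # bits 31-26, opcode
        "rs": (inst >> 21) & 0x1F,    # bits 25-21, source 1 or base
        "rt": (inst >> 16) & 0x1F,    # bits 20-16, source 2 or destination
        "rd": (inst >> 11) & 0x1F,    # bits 15-11, destination (R-format)
        "sh": (inst >> 6) & 0x1F,     # bits 10-6, unused in MicroMIPS
        "fn": inst & 0x3F,            # bits 5-0, opcode extension (R-format)
        "imm": inst & 0xFFFF,         # bits 15-0, operand / offset (I-format)
        "jta": inst & 0x3FFFFFF,      # bits 25-0, jump target address (J-format)
    }


# Example word for "add $8, $9, $10" (assumed encoding: op=0, rs=9, rt=10, rd=8, fn=32)
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 32
fields = decode_fields(word)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["fn"])
```

All fields are extracted for every word; which of them are meaningful depends on the format (R, I or J) selected by the opcode.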
The remaining I-format ALU instruction (lui) has the following execution sequence:
1. Read out the immediate value from the instruction and forward it to the ALU as input
2. Tell the ALU which operation to perform by means of the appropriate control signal
3. Write the output of the ALU into destination register 'rt'
The two I-format memory access instructions (lw, sw) have the following common execution sequence:
1. Read out the content of 'rs'
2. Add the number read out from 'rs' to the immediate value in the instruction to form a memory address
3. Read from / write into memory at the specified address.
4. In the case of the 'lw' instruction, place the word read out from memory into 'rt'
The three I-format conditional branch instructions (bltz, beq, bne) and the four unconditional jump
instructions (j, jr, jal, syscall) have the following common execution sequence:
1. Read out the content of source register 'rs' (and 'rt' where needed) and the immediate value from the instruction
2. Tell the ALU which operation to perform by means of the appropriate control signal
3. The branch target address is specified by an offset relative to the incremented program counter value ((PC)+4)
4. To branch back to the previous instruction, the offset value supplied in the immediate field of the instruction is -2,
which gives the branch target address (PC)+4-(2*4) = (PC)-4
5. For the 'beq' and 'bne' instructions, the contents of 'rs' and 'rt' are compared to determine whether the branch condition is
satisfied.
6. For 'bltz', the branch decision is based on the sign bit of the content of 'rs'.
7. For the 4 jump instructions (j, jr, jal, syscall):
The PC is unconditionally modified so that the next instruction is fetched from the jump target address.
The jump target address comes from the instruction itself (j, jal), is read out from register 'rs' (jr), or is a known
constant associated with the location of an operating system routine (syscall)
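The branch address arithmetic in steps 3 to 6 can be checked with a minimal Python sketch. The function names and the example PC value are illustrative, and register contents are modelled as ordinary signed Python integers rather than 32-bit words.

```python
def branch_target(pc, offset):
    """Branch target = incremented PC plus the word offset scaled to bytes."""
    return (pc + 4) + 4 * offset


def branch_taken(kind, rs_val, rt_val=0):
    """Decide whether a conditional branch is taken."""
    if kind == "beq":
        return rs_val == rt_val
    if kind == "bne":
        return rs_val != rt_val
    if kind == "bltz":
        return rs_val < 0   # equivalent to testing the sign bit of (rs)
    raise ValueError("not a MicroMIPS conditional branch: " + kind)


pc = 0x00400010                      # hypothetical address of the branch instruction
print(hex(branch_target(pc, -2)))    # offset -2 lands on the previous instruction, (PC)-4
```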
2. THE INSTRUCTION EXECUTION UNIT
Step by step execution of all 22 MicroMIPS instructions can be followed in the block diagram below:
1. Beginning at the left end, the content of the program counter (PC) is supplied to the instruction cache and an
instruction word is read out from the specified location.
2. With every clock tick, a new address is loaded into the program counter, causing a new instruction to
appear at the output of the instruction cache after a short access delay.
3. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the
operation to be performed).
4. Once an instruction has been read out from the instruction cache, its fields are separated and dispatched
to the appropriate places.
Example: the 'op' and 'fn' fields go to the control unit; 'rs', 'rt' and 'rd' go to the register file.
5. The upper input of the ALU always comes from register 'rs' and the lower input of the ALU comes from 'rt' or from the immediate
value of the instruction.
6. As the data from the register file pass through the ALU, the specified operation is performed and the result
appears at the ALU output.
7. In the case of arithmetic and logic instructions, the output of the ALU bypasses the data cache and runs through
the feedback line to be stored in the destination register ('rd' or 'rt') of the register file.
8. In the case of memory access instructions, the ALU output is treated as the data address for writing into / reading
from the data cache.
9. Data cache: for many instructions the output of the ALU is stored in a register, so the data cache is bypassed.
For 'lw' and 'sw' instructions the data cache is accessed: the content of 'rt' is written into the data cache for 'sw',
and the data cache output is sent to the register file for 'lw'.
10. In one clock cycle, the contents of any 2 of the 32 registers (mostly 'rs' and 'rt') are read out through the read
ports. At the same time, the output of the ALU is stored in a register via the write port.
11. The flip-flops that make up the registers are edge-triggered, so reading from and writing into the same register in a single
clock cycle does not cause any problem.
12. For the 'beq' and 'bne' instructions, the contents of 'rs' and 'rt' are compared to determine whether the branch
condition is satisfied. The comparison is performed in the 'next address' block.
13. In the case of 'bltz', the branch decision is based on the sign bit of the content of 'rs' rather than on a comparison of two
register contents. This is also performed by the 'next address' block.
14. The 'next address' block also chooses the jump target address under the guidance of the control unit.
15. The jump target address comes from the instruction itself ('j', 'jal') or is read out from register 'rs' ('jr').
16. The middle part comprising the program counter, instruction cache, register file, ALU and data cache is known as the
data path.
Fig.2. Abstract view of the instruction execution unit for MicroMIPS.
3. A SINGLE-CYCLE DATA PATH
1. The middle part comprising the program counter, instruction cache, register file, ALU and data cache is known as the
data path.
2. The data path shown above is capable of executing one instruction per clock cycle, hence the name 'single-cycle
data path'.
3. Single-cycle design: clock rate 125 MHz and CPI 1
4. There are 3 multiplexers in the data path:
1. At the input side of the register file
2. At the lower input of the ALU
3. At the output of the ALU and data cache.
5. Multiplexer 1 (at the input side of the register file):
i. This multiplexer allows 'rt', 'rd' or '$31' to be used as the index of the destination register into which
the result will be written.
ii. The logic signal 'RegDst', supplied by the control unit, directs the selection of 'rt', 'rd' or '$31'.
iii. 'RegDst' control signal values and corresponding selections
S.no Control signal Selection
1 00 rt
2 01 rd
3 10 $31
iv. 'RegWrite' is asserted by the control unit to write into the register file.
6. Registers 'rs' and 'rt' are read out for every instruction even when they are not needed, so there is no read control signal.
[Fig.2 block labels: PC, Instr cache, Reg file, ALU, Data cache, Next addr, Control. The 22 instructions are grouped as: 12 A/L, lui, lw, sw; beq, bne; bltz, jr; j, jal; syscall. Separate instruction and data caches make this a Harvard architecture.]
7. The instruction cache block also receives no control signal for reading, since an instruction is
read out in every cycle.
8. Multiplexer 2 (at the lower input of the ALU):
i. By asserting / deasserting the 'ALUSrc' control signal, the control unit chooses, through this multiplexer,
either the content of 'rt' or the 32-bit sign-extended version of the 16-bit immediate
operand as the second ALU input.
1. If 'ALUSrc' = 0 (deasserted), the content of 'rt' is used as the lower ALU input
2. If 'ALUSrc' = 1 (asserted), the 32-bit sign-extended version of the
16-bit immediate operand is used as the lower ALU input.
ii. Sign extension of the immediate operand is performed by the 'SE' block.
9. Multiplexer 3 (at the output of the ALU and data cache): the control signal used here is 'RegInSrc'
S.no Control signal Selection
1 00 Data cache output
2 01 ALU output
3 10 Incremented PC value coming from the next-address block
10. With every clock tick, a new address is loaded into the program counter, causing a new instruction to
appear at the output of the instruction cache after a short access delay.
11. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the
operation to be performed).
12. As the data from the register file pass through the ALU, the operation specified by the ALUFunc signal is performed and
the result appears at the ALU output.
13. In the case of arithmetic and logic instructions, the output of the ALU bypasses the data cache and runs through
the feedback line to be stored in the destination register ('rd' or 'rt') of the register file.
14. In the case of memory access instructions, the ALU output is treated as the data address for writing into
(DataWrite signal) / reading from (DataRead signal) the data cache.
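The three multiplexers and the 'SE' block described above can be modelled with a few lines of Python. This is only a sketch of the selection rules listed in the two tables; the function names are illustrative and values are treated as unsigned 32-bit integers.

```python
def sign_extend16(imm):
    """SE block: extend a 16-bit immediate to a 32-bit two's-complement pattern."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm


def dest_register(reg_dst, rt, rd):
    """Multiplexer 1: RegDst selects the destination register index."""
    return {0b00: rt, 0b01: rd, 0b10: 31}[reg_dst]


def alu_lower_input(alu_src, rt_val, imm):
    """Multiplexer 2: ALUSrc selects (rt) or the sign-extended immediate."""
    return sign_extend16(imm) if alu_src == 1 else rt_val


def register_input(reg_in_src, data_out, alu_out, incr_pc):
    """Multiplexer 3: RegInSrc selects what is written into the register file."""
    return {0b00: data_out, 0b01: alu_out, 0b10: incr_pc}[reg_in_src]


print(hex(sign_extend16(0xFFFE)))        # 16-bit pattern of -2 becomes the 32-bit pattern of -2
print(dest_register(0b10, rt=9, rd=8))   # RegDst = 10 selects $31
```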
[Figure: single-cycle MicroMIPS data path. Blocks: PC, Instr cache, Next addr, Reg file, SE, ALU, Data cache. Control signals: Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc. Status output: ALUOvfl.]
4. BRANCHING AND JUMPING:
(Refer page no. 249,250 in text book B.Parhami)
5. DERIVING THE CONTROL SIGNALS:
(Refer page no. 250-253 in text book B.Parhami)
Control signals for the single-cycle MicroMIPS implementation.
6. PERFORMANCE OF THE SINGLE-CYCLE DESIGN
(Refer page no. 253-255 in text book B.Parhami)
II. CONTROL UNIT SYNTHESIS
7. MULTICYCLE IMPLEMENTATION:
Fig.3. Single-cycle versus multicycle instruction execution.
With multicycle design, a subset of actions required for an instruction is performed in one clock cycle.
Hence the clock cycle can be made much shorter, with several cycles needed to execute a single instruction.
Advantages of multicycle implementation are greater speed and economy
[Fig.3 labels: time needed versus time allotted for Instr 1 to Instr 4 under each clocking scheme; in the multicycle case the instructions take between 3 and 5 cycles each and the difference shows up as time saved.]
MULTICYCLE DATA PATH:
Fig.4. Abstract view of a multicycle instruction execution unit for MicroMIPS.
1. The data path in the above block diagram executes one instruction every 3-5 clock cycles,
hence the name "multicycle data path".
2. Multicycle design: clock rate 500 MHz and CPI approx. 4
3. Cache block = instruction cache + data cache.
4. Every instruction completes within 5 cycles (refer control state machine)
5. When a word is read from the cache block, it must be held in a register for use in subsequent cycles.
6. There are 2 registers, the "Instruction register" and the "Data register", between the cache and the register file.
Once the instruction is read out, it must be kept for all the remaining cycles of its execution in order to
generate the control signals appropriately.
7. So a second register is needed for the data read out by 'lw'.
8. Three other registers, namely 'x', 'y' and 'z', serve the same purpose of holding information between
cycles.
9. Note that, except for the program counter and the instruction register, all registers are loaded in every clock
cycle.
10. Instruction fetch cycle: execution of every instruction starts the same way in the first cycle. The content of the PC is
used to access the cache and the retrieved word is placed in the instruction register. This is known as the instruction
fetch cycle.
11. In the second clock cycle, the instruction is decoded and the registers 'rs' and 'rt' are accessed.
12. If the instruction is one of the four jump instructions (j, jr, jal, syscall), its execution terminates in the 3rd
cycle by simply writing the appropriate address into the PC.
13. If it is a branch instruction (beq, bne, bltz), the branch condition is checked and the appropriate value is
written into the PC in the 3rd cycle.
14. All other instructions proceed to the 4th cycle and, except for 'lw', are completed in it.
15. The 'lw' instruction requires a 5th cycle to write the data retrieved from the cache into a register.
FOR DETAILED CONTROL SIGNAL AND MUX EXPLANATION REFER PAGE NO. 260, 261 IN
B.PARHAMI BOOK
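The cycle counts in items 12 to 15 (3 for jumps and branches, 4 for ALU and store instructions, 5 for load) can be combined with an instruction mix to estimate the average CPI. The sketch below is only an illustration: the mix in `MIX` is a hypothetical assumption, not a figure taken from these notes, and the clock rates are the 125 MHz and 500 MHz values quoted for the two designs.

```python
# Cycles per instruction class in the multicycle design (items 12-15 above).
CYCLES = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

# Hypothetical instruction mix (fractions summing to 1); an assumption for illustration.
MIX = {"alu": 0.44, "load": 0.24, "store": 0.12, "branch": 0.18, "jump": 0.02}


def average_cpi(cycles, mix):
    """Weighted average of cycle counts over the instruction mix."""
    return sum(cycles[k] * mix[k] for k in mix)


def mips_rating(clock_mhz, cpi):
    """Millions of instructions per second = clock rate in MHz divided by CPI."""
    return clock_mhz / cpi


cpi = average_cpi(CYCLES, MIX)
print(round(cpi, 2))                    # average CPI, close to the "approx. 4" quoted above
print(round(mips_rating(500, cpi), 1))  # multicycle design at 500 MHz
print(mips_rating(125, 1))              # single-cycle design at 125 MHz, CPI 1
```

Under this assumed mix the two designs come out at nearly the same instruction rate, which is why the choice between them rests on the actual mix and on hardware cost.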
[Fig.4 block labels: PC, Cache, Inst Reg, Data Reg, Reg file, x Reg, y Reg, ALU, z Reg, Control.]
8. CHOOSING THE CLOCK CYCLE
(Refer page no. 262 in text book B.Parhami)
9. THE CONTROL STATE MACHINE
CONTROL STATE MACHINE for MULTICYCLE MicroMIPS
The control unit must distinguish between the 5 cycles of the multicycle design and additionally be able to perform
different operations depending on the instruction.
The control state machine diagram depicts the control states and state transitions.
The control state machine carries the required information along by moving from state to state. It
is set to state 0 when program execution begins.
It then moves from state to state until one instruction has been completed, at which point it returns to state 0 to
begin the execution of another instruction.
The control state sequences for the various MicroMIPS instruction classes are as follows:
ALU-type: 0, 1, 7, 8
Load word: 0, 1, 2, 3, 4
Store word: 0, 1, 2, 6
Jump / branch: 0, 1, 5
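The state sequences above can be written down as a small table and stepped through in Python. This is a minimal sketch; the names `STATE_SEQUENCES`, `cycles_needed` and `next_state` are illustrative.

```python
# Control state sequences per instruction class, as listed above.
STATE_SEQUENCES = {
    "ALU-type": [0, 1, 7, 8],
    "Load word": [0, 1, 2, 3, 4],
    "Store word": [0, 1, 2, 6],
    "Jump / branch": [0, 1, 5],
}


def cycles_needed(instr_class):
    """One clock cycle is spent in each control state."""
    return len(STATE_SEQUENCES[instr_class])


def next_state(instr_class, state):
    """Follow the sequence; after the last state, return to state 0."""
    seq = STATE_SEQUENCES[instr_class]
    i = seq.index(state)
    return seq[i + 1] if i + 1 < len(seq) else 0


for name in STATE_SEQUENCES:
    print(name, cycles_needed(name))
```

The lengths of the sequences reproduce the 3, 4 and 5 cycle counts of the multicycle data path.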
In every state except states 5 and 7, the control signals are uniquely determined by the control state.
Information regarding the current control state and the instruction being executed is supplied by decoders.
The control signals can be easily determined by using the control state machine diagram and the decoder diagram.
Examples of control signals that depend only on the control state (∨ = OR, ∧ = AND, ′ = complement):
ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
RegWrite = ControlSt4 ∨ ControlSt8
Auxiliary signals identifying instruction classes
addsubInst = addInst ∨ subInst ∨ addiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
Logic expressions for ALU control signals
AddSub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst
FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
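A minimal Python sketch of the logic expressions above. It assumes that the terms listed for a signal are ORed together, that ControlSt7 gates the instruction-dependent terms with AND, and that FnClass1 uses the complement of ControlSt7 (so that the ALU defaults to addition outside state 7); these operator choices are assumptions wherever the notes are ambiguous. The function name and the use of mnemonic strings for the decoded instruction are illustrative.

```python
def control_signals(state, inst):
    """Evaluate the example control signals for a control state (0-8) and an
    instruction mnemonic such as "add", "sub", "slt", "ori" or "lw"."""
    st = {n: state == n for n in range(9)}          # one-hot ControlSt0..ControlSt8
    addsub = inst in ("add", "sub", "addi")
    logic = inst in ("and", "or", "xor", "nor", "andi", "ori", "xori")
    return {
        "ALUSrcX": st[2] or st[5] or st[7],
        "RegWrite": st[4] or st[8],
        "AddSub": st[5] or (st[7] and inst == "sub"),
        "FnClass1": (not st[7]) or addsub or logic,  # assumed complement of ControlSt7
        "FnClass0": st[7] and (logic or inst in ("slt", "slti")),
        "LogicFn1": st[7] and inst in ("xor", "xori", "nor"),
        "LogicFn0": st[7] and inst in ("or", "ori", "nor"),
    }


print(control_signals(7, "sub")["AddSub"])
print(control_signals(4, "lw")["RegWrite"])
```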
Decoders
[Figure: instruction decoders for MicroMIPS. The 6-bit op and fn fields and the 4-bit state st are each decoded into one-hot signals.]
op decoder outputs: 0 RtypeInst, 1 bltzInst, 2 jInst, 3 jalInst, 4 beqInst, 5 bneInst, 8 addiInst, 10 sltiInst, 12 andiInst, 13 oriInst, 14 xoriInst, 15 luiInst, 35 lwInst, 43 swInst
fn decoder outputs (used when RtypeInst is asserted): 8 jrInst, 12 syscallInst, 32 addInst, 34 subInst, 36 andInst, 37 orInst, 38 xorInst, 39 norInst, 42 sltInst
st decoder outputs: ControlSt0 to ControlSt8 for states 0 to 8
10. PERFORMANCE OF THE MULTICYCLE DESIGN
(Refer page no. 266 in text book B.Parhami)
III. MICROPROGRAMMING
The control state machine resembles a program that has instructions (states), branching, and loops. Such
a hardware program is called a microprogram and its basic steps are microinstructions.
Microinstruction: a single instruction in microcode. It is the most elementary instruction in the computer, such as
moving the contents of a register to the arithmetic logic unit (ALU).
It takes several microinstructions to carry out one complex machine instruction (CISC).
Also called a "micro-op" or "µop", microinstructions differ even within the same computer family and
the same vendor.
Microprogrammed control is a control mechanism that generates control signals by using a memory
called control storage (CS), which contains the control signals.
Although microprogrammed control seems to be advantageous for CISC machines, since CISC
requires systematic development of sophisticated control signals, there is no intrinsic difference
between the 2 control mechanisms (hardwired and microprogrammed).
Microprogramming is a method of control unit design in which the control signal selection and
sequencing information is stored in ROMs or RAMs called the control store or control memory.
A microprogrammed control unit is a general approach to implementing the control unit. Here the
control signals are generated by a program similar to machine language programs.
Instead of implementing the control state machine in custom hardware, we can store microinstructions
in locations of a control ROM, fetching and executing a sequence of microinstructions for each machine
language instruction.
Each microinstruction defines a step in the execution of a machine language instruction.
Advantages of ROM-based implementation of control
o Simple hardware
o More regular
o Less dependent on instruction-set architecture details
o The same hardware can be used for different purposes by modifying the ROM contents
Microprogramming: designing a suitable sequence of microinstructions to realize a particular
instruction set architecture is called microprogramming.
Microprogrammable machine: if the microprogram is easily modifiable, even by the user, then the
machine is called a microprogrammable machine.
Microinstruction format:
o 23-bit microinstruction format. Apart from the sequence control bits, each bit corresponds one to one
with a control signal of the multicycle data path.
o The 2-bit sequence control field allows control of microinstruction sequencing in the same
way that "PC control" affects the sequencing of machine language instructions.
Microprogrammed control unit: the microprogrammed control unit diagram for MicroMIPS shows 4
options (MUX) for choosing the next microinstruction.
o Option 0: advance to the next microinstruction in sequence by incrementing the
microprogram counter
o Options 1 & 2: allow branching to occur depending on the opcode field of the machine
instruction being executed.
o Option 3: go to microinstruction 0, corresponding to state 0 (refer control
state machine). This initiates the fetch phase for the next machine instruction.
Dispatch table 1: corresponds to the multiway branch in going from cycle 2 to cycle 3
Dispatch table 2: implements the branch between cycles 3 & 4 (refer control state machine)
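The four sequencing options can be sketched in Python as follows. This is only an illustration: it assumes that microinstruction addresses coincide with the control state numbers 0 to 8, and the contents of `SEQ_CTRL`, `DISPATCH1` and `DISPATCH2` are hypothetical tables built from the state sequences of the control state machine (only the opcodes 0 = R-type, 2 = j, 35 = lw and 43 = sw are filled in).

```python
def next_micro_pc(seq_ctrl, micro_pc, op, dispatch1, dispatch2):
    """Select the next microinstruction address from the 2-bit sequence control field."""
    if seq_ctrl == 0:
        return micro_pc + 1      # option 0: next microinstruction in sequence
    if seq_ctrl == 1:
        return dispatch1[op]     # option 1: multiway branch, cycle 2 to cycle 3
    if seq_ctrl == 2:
        return dispatch2[op]     # option 2: branch between cycles 3 and 4
    return 0                     # option 3: back to microinstruction 0 (fetch)


# Hypothetical sequence control field stored with each microinstruction.
SEQ_CTRL = {0: 0, 1: 1, 2: 2, 3: 0, 4: 3, 5: 3, 6: 3, 7: 0, 8: 3}
# Hypothetical dispatch tables indexed by opcode.
DISPATCH1 = {0: 7, 2: 5, 35: 2, 43: 2}
DISPATCH2 = {35: 3, 43: 6}


def trace(op):
    """List the microinstruction addresses visited for one machine instruction."""
    pc, visited = 0, []
    while True:
        visited.append(pc)
        pc = next_micro_pc(SEQ_CTRL[pc], pc, op, DISPATCH1, DISPATCH2)
        if pc == 0:
            return visited


print(trace(35))   # lw follows 0, 1, 2, 3, 4
print(trace(43))   # sw follows 0, 1, 2, 6
```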
23-BIT MICROINSTRUCTION FORMAT FOR MICROMIPS.
Microprogrammed control unit for MicroMIPS
(For detailed explanation with microprogram example please Refer page no. 269 - 271 in text
book B.Parhami)
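[Figure: microinstruction fields. PC control: JumpAddr, PCSrc, PCWrite. Cache control: InstData, MemRead, MemWrite, IRWrite. Register control: RegWrite, RegDst, RegInSrc. ALU inputs: ALUSrcX, ALUSrcY. ALU function: AddSub, LogicFn, FnType. Sequence control.]
[Figure: microprogrammed control unit. The MicroPC addresses the microprogram memory or PLA; its output is held in the microinstruction register and supplies the control signals to the data path. The sequence control field selects the next MicroPC value through a multiplexer with inputs 0 (incremented MicroPC), 1 (dispatch table 1), 2 (dispatch table 2) and 3 (address 0); the dispatch tables are indexed by op from the instruction register.]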
IV. PIPELINING
11. PIPELINING CONCEPTS
2 strategies for achieving greater performance:
Strategy 1: Multiple-instruction-issue or superscalar organization: use multiple independent data paths that can
accept several instructions that are read out at once.
Strategy 2: Pipelined or super-pipelined organization: overlap the execution of several instructions in the single-
cycle design, starting the next instruction before the previous instruction has finished.
Pipelining:
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The
computer pipeline is divided into stages.
Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form
a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Pipelining does not decrease the execution time of an individual instruction. Instead, it increases instruction
throughput.
The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.
5 instruction execution steps / stages in the MicroMIPS pipeline:
Each step takes 1-2 ns.
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback
Pipelined Instruction Execution (Pipelining in the MicroMIPS instruction execution process.)
In a task-time diagram, the stages of each task are horizontally aligned and their positions along the horizontal
axis represent the timing of their execution.
In a space-time diagram, the vertical axis represents the stages in the pipeline (the space dimension) and the boxes
representing the various stages of a task are diagonally aligned.
Ideally, a q-stage pipeline can increase instruction execution throughput by a factor of q. In practice this is
not quite achieved because of the following:
o Effects of pipeline start-up and drainage
o Wastage due to unequal stage delays.
o Time overhead of saving stage results in registers
o Safety margin in clock period necessitated by clock skew.
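The effect of start-up and drainage on the ideal factor of q can be illustrated with a short Python sketch. It assumes equal stage delays, no stalls, and that the unpipelined design spends q stage-times on each instruction; the function names are illustrative.

```python
def pipeline_cycles(q, n):
    """Cycles for n instructions on a q-stage pipeline without stalls:
    q cycles for the first instruction, then one more cycle per instruction."""
    return q + n - 1


def speedup(q, n):
    """Speedup over executing the n instructions one after another (q cycles each)."""
    return (q * n) / pipeline_cycles(q, n)


print(pipeline_cycles(5, 7))          # 5 stages, 7 instructions: 11 cycles
print(round(speedup(5, 7), 2))        # well below the ideal factor of 5
print(round(speedup(5, 1000), 2))     # approaches 5 for long instruction streams
```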
[Figure: five instructions (Instr 1 to Instr 5) overlapped over cycles 1 to 9, each passing through Instr cache, Reg file, ALU, Data cache and Reg file in successive cycles. Horizontal axis: time dimension; vertical axis: task dimension.]
Fig. Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).
12. PIPELINE STALLS OR BUBBLES
Data dependency in a pipeline: "Execution of one instruction depending on completion of a previous
instruction".
Data dependencies in a pipeline can cause pipeline stalls, which diminish performance.
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value.
o Read-after-load: register access after updating it with data from memory
An example of read-after-compute is shown in the diagram below, where the 3rd instruction uses the value that
the 2nd instruction writes into register $8, and the 4th instruction needs the result of the 3rd instruction in register
$9. Note that the write operation into register $8 is completed in cycle 6. Hence, reading the new value from
register $8 is possible beginning with cycle 7. The 3rd instruction, however, reads out registers $8 and $2 in cycle 4. The
data dependency problem can be solved by bubble insertion or by data forwarding.
BUBBLE INSERTION:
First detect the type of data dependency.
Bubble insertion: inserting a redundant and harmless instruction (such as adding 0 to a register or
shifting a register by 0 bits) before the next instruction. Such an instruction is called a "no-op" (no-operation)
instruction. Since no-ops perform no useful task but still use memory, they resemble bubbles in a
water pipe.
Insertion of bubbles in a pipeline implies
o reduced throughput
o a noticeable loss of performance when more than 2 bubbles are inserted.
So bubble insertion should be minimized. It can be minimized by relocating a useful instruction of the
program between the data-dependent instructions instead of inserting bubbles.
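The number of bubbles needed without data forwarding can be estimated with a small Python sketch. It assumes the 5-stage pipeline of these notes, with register read in stage 2 and register writeback in stage 5; the function name, the `distance` parameter (1 for the immediately following instruction) and the `same_cycle_read` flag are illustrative.

```python
def bubbles_needed(distance, same_cycle_read=False, write_stage=5, read_stage=2):
    """Bubbles required, without forwarding, between a producer instruction and
    a consumer that follows it `distance` instructions later.

    The producer writes its register in `write_stage`; the consumer reads its
    registers in `read_stage`. If the register file cannot be written and read
    in the same cycle, the read must wait one extra cycle.
    """
    gap = write_stage - read_stage
    if not same_cycle_read:
        gap += 1
    return max(0, gap - distance)


print(bubbles_needed(1))                        # immediately following instruction
print(bubbles_needed(1, same_cycle_read=True))  # register updated and read in one cycle
print(bubbles_needed(4))                        # far enough apart, no bubble
```

Placing useful, independent instructions between the producer and the consumer increases `distance` and so reduces the number of bubbles.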
DATA FORWARDING:
Data forwarding: bypassing the ALU output of the 1st instruction directly to the ALU input where the 2nd
instruction needs it, without first storing the output value of the 1st instruction.
Please see the diagrams below for a clear understanding.
Control dependency:
When a conditional branch is executed, the location of the next instruction depends on whether the branch
condition is satisfied. Since branch instructions are based on testing register contents, the branch condition is
resolved at the end of the 2nd pipeline stage. Therefore a bubble is required after every conditional branch instruction.
[Figure: (a) Task-time diagram: instructions 1 to 7 against cycles 1 to 11. (b) Space-time diagram: pipeline stages 1 to 5 against cycles 1 to 11, with the start-up region and the drainage region marked.]
f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback
Read-after-write data dependency and its possible resolution through data forwarding.
[Figure: the sequence $5 = $6 + $7; $8 = $8 + $6; $9 = $8 + $2; sw $9, 0($3) over cycles 1 to 8, with the data forwarding path marked.]
[Figure: five instructions over cycles 1 to 9; one writes into $8 and a later one reads from $8, separated by bubbles. Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency; two bubbles, if we assume that a register can be updated and read from in one cycle.]
Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.
[Figure: the sequence sw $6, . . .; lw $8, . . .; $9 = $8 + $2 over cycles 1 to 8 (insert bubble? reorder?). Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.]
Control dependency due to conditional branch.
13. PIPELINE TIMING AND PERFORMANCE (Refer page no. 284 in text book B.Parhami)
14. PIPELINED DATA PATH DESIGN (Refer page no. 285-286 for detailed description of each stage in
text book B.Parhami)
The pipelined datapath for MicroMIPS is obtained by inserting latches or registers in single-cycle data path.
The 5 pipeline stages are
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback
15. PIPELINED CONTROL (Refer page no. 289 in text book B.Parhami)
16. OPTIMAL PIPELINING (Refer page no. 291 in text book B.Parhami)
[Figure: control dependency. The sequence $6 = $3 + $5; beq $1, $2, . . .; $9 = $8 + $2 over cycles 1 to 8 (insert bubble? reorder, i.e. delayed branch?). One point is marked "Assume branch resolved here"; a later point is marked "Here would need 1-2 more bubbles".]
[Figure: pipelined MicroMIPS data path, Stage 1 to Stage 5. Blocks: PC, Instr cache, Next addr, Reg file, SE, ALU, Data cache. Signals: SeqInst, Br&Jump, IncrPC, NextPC, RetAddr, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc, ALUOvfl.]
V. PIPELINE PERFORMANCE
17. DATA DEPENDENCIES AND HAZARDS
Data dependency in a pipeline: "the execution of one instruction depends on the completion of a previous instruction", or "the phenomenon of one instruction requiring data generated by a previous instruction is called data dependency".
The generated data may reside in a register or memory location where the subsequent instruction expects to find the value.
In the diagram below, each of the 2nd through 5th instructions reads a register ($2) that is written by the 1st instruction.
o The 5th instruction needs the content of register $2 after the 1st instruction has completed its register writeback, so there is no problem.
o The 4th instruction needs the new content of register $2 in the same cycle in which the 1st instruction writes it, which is a minor problem.
o The 2nd and 3rd instructions need the new content of register $2 before the 1st instruction has written it back. This is the major data dependency problem.
Data dependencies in a pipeline can cause pipeline stalls, which diminish performance.
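The timing behind these cases can be checked with a small Python sketch. It assumes the five-stage pipeline of these notes, with register access in stage 2 and register writeback in stage 5, and one instruction issued per cycle. The function names and verdict strings are illustrative only.

```python
REG_READ_STAGE = 2   # "Instruction Decode and register access"
WRITEBACK_STAGE = 5  # "Register writeback"

def stage_cycle(instr_number, stage):
    """Clock cycle in which instruction instr_number (1-based) occupies the given stage."""
    return instr_number + stage - 1

def classify_readers(num_instructions=5):
    """Compare each later instruction's register read with instruction 1's writeback."""
    wb_cycle = stage_cycle(1, WRITEBACK_STAGE)
    verdicts = {}
    for k in range(2, num_instructions + 1):
        read_cycle = stage_cycle(k, REG_READ_STAGE)
        if read_cycle < wb_cycle:
            verdicts[k] = "reads too early"
        elif read_cycle == wb_cycle:
            verdicts[k] = "reads in the writeback cycle"
        else:
            verdicts[k] = "reads after writeback"
    return verdicts

print(classify_readers())
```

Instruction 1 writes back in cycle 5, while instructions 2 to 5 read their registers in cycles 3, 4, 5 and 6 respectively.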
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value. This dependency exists when one instruction updates a register with a computed value and a subsequent instruction uses the content of that register as an operand.
o Read-after-load: register access after updating it with data from memory. This dependency arises when one instruction loads a new value from memory into a register and a subsequent instruction uses the content of that register as an operand.
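The two dependency types can be told apart mechanically. The Python sketch below is an illustration under simple assumptions: each instruction is written as a hypothetical tuple (opcode, destination register, source registers), and lw is the only load instruction considered.

```python
LOADS = {"lw"}  # the only load in the instruction set used in these notes

def dependency_kind(producer, consumer):
    """Return the dependency type of consumer on producer, or None if there is none."""
    op, dest, _ = producer
    _, _, sources = consumer
    if dest is None or dest not in sources:
        return None
    return "read-after-load" if op in LOADS else "read-after-compute"

sub = ("sub", "$2", ("$1", "$3"))   # $2 = $1 - $3
load = ("lw", "$2", ("$12",))       # lw $2,4($12)
use = ("add", "$4", ("$2", "$5"))   # reads $2

print(dependency_kind(sub, use))
print(dependency_kind(load, use))
```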
[Figure: data dependency in a pipeline, cycles 1-9. The instruction "$2 = $1 - $3" is followed by four instructions that read register $2.]
(SINCE THE TOPICS BELOW ARE CLEAR AND READABLE IN THE BOOK, PLEASE REFER TO PAGE NO. 298-308 IN TEXT BOOK B.PARHAMI)
18. DATA FORWARDING:
Resolving Data Dependencies via Forwarding: When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Certain Data Dependencies Lead to Bubbles: When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
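A rough Python sketch of the number of bubbles implied by these two rules. It assumes the five-stage pipeline of these notes and, for the case without forwarding, a register file that can be written and then read within the same cycle. The function name and the parameter distance (how many instructions after the producer the consumer sits) are illustrative.

```python
def bubbles_needed(producer_op, distance, forwarding=True):
    """Bubbles required before a consumer placed `distance` instructions after the producer."""
    if forwarding:
        # An ALU result can always be forwarded; a loaded value arrives one cycle
        # too late only for the instruction immediately after the load.
        return 1 if (producer_op == "lw" and distance == 1) else 0
    # Without forwarding the consumer's register read (stage 2) must not occur
    # before the producer's writeback (stage 5): 3 stages apart, minus the distance.
    return max(0, 3 - distance)

print(bubbles_needed("sub", 1))                    # ALU result, forwarded
print(bubbles_needed("lw", 1))                     # load followed immediately by use
print(bubbles_needed("sub", 1, forwarding=False))  # no forwarding hardware
```

The no-forwarding branch is included only to show what forwarding saves; it is not a description of any specific hardware in the text book.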
[Figure: resolving a data dependency via forwarding, cycles 1-9. The instruction "$2 = $1 - $3" is followed by three instructions that read register $2.]
[Figure: a data dependency that leads to a bubble, cycles 1-9. The instruction "lw $2,4($12)" is followed by three instructions that read register $2.]
19. PIPELINE BRANCH HAZARDS
Software-based solutions
Compiler inserts a “no-op” after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction:
o Always predict that the branch is not taken, flush if mistaken
o More elaborate branch prediction strategies possible
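The cost of the "always predict not taken, flush if mistaken" strategy can be estimated with a short Python sketch. The flush penalty per mispredicted branch is an assumed parameter here (one cycle by default); the actual value depends on the stage in which the branch is resolved.

```python
def flush_cycles(outcomes, penalty=1):
    """Cycles lost when every branch is predicted not taken.

    outcomes: iterable of booleans, True meaning the branch was actually taken.
    penalty:  assumed number of flushed cycles per misprediction.
    """
    mispredictions = sum(1 for taken in outcomes if taken)
    return mispredictions * penalty

# Hypothetical trace: three branches, two of them taken.
print(flush_cycles([True, False, True]))
```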
20. DELAYED BRANCH AND BRANCH PREDICTION
Predicting whether a branch will be taken
Always predict that the branch will not be taken
Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
Allow programmer or compiler to supply clues
Decide based on past history (maintain a small history table); to be discussed later
Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
Problem with this approach: each branch in a loop entails two mispredictions:
1. Once in the first iteration (the loop is repeated, but the history indicates exit from the loop)
2. Once in the last iteration (the loop is terminated, but the history indicates repetition)
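The two mispredictions per loop can be reproduced with a small Python sketch. It assumes the history is a single bit holding the branch's most recent outcome, and a loop-closing branch that is taken on every iteration except the last. For comparison it also includes a standard two-bit saturating counter, which is one common four-state scheme and not necessarily the exact scheme drawn in the text book. All function names are illustrative.

```python
def loop_outcomes(iterations, executions):
    """Outcomes of a loop-closing branch: taken until the final iteration of each execution."""
    return ([True] * (iterations - 1) + [False]) * executions

def mispredictions_one_bit(outcomes, last_taken=False):
    """Predict that the branch repeats its most recent outcome (one history bit)."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses

def mispredictions_two_bit(outcomes, counter=0):
    """Two-bit saturating counter (0..3); predict taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

trace = loop_outcomes(iterations=5, executions=3)
print(mispredictions_one_bit(trace))  # two misses per execution of the loop
print(mispredictions_two_bit(trace))  # one miss per execution once warmed up
```

With the one-bit history, each execution of the loop costs two mispredictions; with the two-bit counter, only the loop exit is mispredicted once the counter has warmed up.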
Other branch prediction algorithms:
[Figure: three four-state branch prediction schemes. States: "Predict taken", "Predict taken again", "Predict not taken", "Predict not taken again"; transitions are labelled "Taken" and "Not taken".]
Hardware Implementation of Branch Prediction
[Figure: branch prediction hardware. A table holds the addresses of recent branch instructions, their target addresses and history bit(s). The low-order bits of the PC are used as the index, the read-out table entry is compared (=) with the address from the PC, and logic selects the next PC from the incremented PC or the stored target address.]
The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches.
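The direct-mapped lookup can be sketched in Python as follows. This is an illustration under stated assumptions, not the text book's design: the table has 16 entries, instructions are 4 bytes and word aligned (so the two lowest PC bits are skipped when indexing), the whole branch address is stored for the comparison, and the history is a single bit. The class and method names are hypothetical.

```python
class BranchTable:
    """Direct-mapped table of recent branches: (branch address, target, history bit)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        # Low-order bits of the word address select the entry, as in a direct-mapped cache.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def next_pc(self, pc):
        """Predicted next PC: the stored target on a matching, taken entry; else PC + 4."""
        entry = self.entries[self._index(pc)]
        if entry is not None:
            branch_addr, target, predict_taken = entry
            if branch_addr == pc and predict_taken:  # the compare (=) step
                return target
        return pc + 4

    def update(self, pc, target, taken):
        """Record the actual outcome of the branch at pc, replacing any older entry."""
        self.entries[self._index(pc)] = (pc, target, taken)

table = BranchTable()
table.update(0x100, 0x80, taken=True)
print(hex(table.next_pc(0x100)))  # matching entry, predicted taken
print(hex(table.next_pc(0x140)))  # same index, different address: falls through
```

As in a direct-mapped cache, two branches whose addresses share the same low-order bits compete for one entry, which is why the stored address must be compared before the target is used.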
21. ADVANCED PIPELINING (Refer page no. 306-308 in text book B.Parhami)
The Three Hardware Designs for MicroMIPS
[Figure: the three MicroMIPS data paths (single-cycle, multicycle and five-stage pipelined) shown side by side.]
Single-cycle: 125 MHz, CPI = 1
Multicycle: 500 MHz, CPI ≈ 4
Pipelined: 500 MHz, CPI ≈ 1.1
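The clock rates and CPI values quoted for the three designs (125 MHz at CPI = 1, 500 MHz at CPI of about 4, and 500 MHz at CPI of about 1.1) can be compared using execution time = instruction count x CPI / clock rate. The Python sketch below does this for a hypothetical program of one million instructions; the instruction count is an assumption chosen only for illustration.

```python
# (clock rate in Hz, cycles per instruction) as quoted for each design
DESIGNS = {
    "single-cycle": (125e6, 1.0),
    "multicycle": (500e6, 4.0),
    "pipelined": (500e6, 1.1),
}

def execution_time(instructions, clock_hz, cpi):
    """Execution time in seconds: instruction count x CPI / clock rate."""
    return instructions * cpi / clock_hz

INSTRUCTIONS = 1_000_000  # hypothetical program size
times = {name: execution_time(INSTRUCTIONS, hz, cpi) for name, (hz, cpi) in DESIGNS.items()}

for name, seconds in times.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Under these figures the single-cycle and multicycle designs take the same time (8 ms), while the pipelined design takes about 2.2 ms, a speedup of roughly 3.6.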