advanced computer architectures – part 2.2
DESCRIPTION
Part 2.2 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B)
TRANSCRIPT
![Page 1: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/1.jpg)
Advanced Computer
Architectures
– HB49 –
Part 2.2
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
![Page 2: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/2.jpg)
© V. De Florio
KULeuven 2002
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.2/2
Course contents
• Basic Concepts
Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
![Page 3: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/3.jpg)
Computer Design IS
• IS Classification
• Role of the compilers
DLX
![Page 4: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/4.jpg)
IS DLX Architecture
• An example RISC architecture designed
by Patterson and Hennessy
• Simple register-register (load-store)
instruction set
• Designed for efficiency
From HW viewpoint
From compiler viewpoint
• Useful as an example of good IS design
![Page 5: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/5.jpg)
IS DLX Architecture
• Registers:
32 registers called R0 = 0, R1, …, R31
32 single-precision floating point registers or
16 double-precision floating point registers F0,
F2, …, F30
• Data types
Like in C: 1 byte, 2 byte, 4 byte integers and 4
byte and 8 byte floats
• Addressing modes: just 2
Immediate (example: Add R4, #3)
Displacement (example: Add R4, 100(R1))
Both use 16-bit fields
Register deferred is displacement with a zero offset: Add R4, 0(R1)
Absolute is displacement from R0 (which is always 0): Add R4, 100(R0)
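A minimal Python sketch of why displacement addressing is enough to obtain the register deferred and absolute forms. The helper name `effective_address` and the value stored in R1 are made up for illustration; only the rule "address = displacement + base register" and "R0 = 0" come from the slide.

```python
# Hypothetical register file: R0 always reads as 0, as stated on the slide.
regs = [0] * 32
regs[1] = 0x2000  # illustrative content of R1 (an assumption)


def effective_address(displacement, base_reg):
    """Displacement addressing: displacement + contents of the base register."""
    return displacement + regs[base_reg]


print(hex(effective_address(100, 1)))  # displacement:      100(R1)
print(hex(effective_address(0, 1)))    # register deferred:   0(R1)
print(hex(effective_address(100, 0)))  # absolute:          100(R0)
```

With a zero displacement the address is just the register content; with R0 as base the address is just the displacement.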
![Page 6: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/6.jpg)
IS DLX Architecture
• Big endian
• DLX instruction format
Just two addressing modes, easy to encode in the
opcode
All instructions have the same length and start
with a 6 bit opcode
easier decoding algorithm
faster processing
shorter cycle is possible
• Layout: P&H p.99
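To illustrate why a fixed-length format with a leading 6-bit opcode is easy to decode, here is a small Python sketch. It assumes an immediate-type layout of 6-bit opcode, two 5-bit register fields and a 16-bit immediate; the function name and the opcode value used in the example are assumptions, so check the layout in P&H before relying on it.

```python
def decode_i_type(word):
    """Split a 32-bit word into (opcode, rs1, rd, immediate).

    Assumed layout, most significant bits first:
    opcode (6 bits) | rs1 (5 bits) | rd (5 bits) | immediate (16 bits, signed)
    """
    opcode = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:       # sign-extend the 16-bit field
        imm -= 0x10000
    return opcode, rs1, rd, imm


# Illustrative word: the opcode value 0x23 is arbitrary, not taken from the slides.
word = (0x23 << 26) | (9 << 21) | (1 << 16) | 50
print(decode_i_type(word))
```

Every field sits at a fixed bit position, so decoding needs only shifts and masks, which is the kind of simplicity the slide refers to.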
![Page 7: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/7.jpg)
IS DLX Architecture
• Mnemonics: L=load S=store followed by
B=byte H=half word W=word
F=float D=double
Examples
LB R1, 50(R9)
SF 50(R0), F2
• ADD…(Arithmetic op’s),
• SL... (shift left, logical op’s),
• J…, B… (jump and branch op’s)
![Page 8: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/8.jpg)
IS DLX Architecture
• How good is the DLX architecture?
• DLX is a RISC architecture
• What’s a RISC architecture, and what’s
the difference between a RISC and a non-
RISC architecture?
![Page 9: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/9.jpg)
IS DLX Architecture
• CISC = complex IS architecture
Architecture of the ’70s
Axioms:
(1) the IS must be easy to program with
(2) the IS must be easy to compile for
IS not too far away from a HLL
IS includes high level constructs
Loop instructions vs. gotos
Complex CALL instructions preserving the
register file
Case/switch instructions
Large set of addressing modes
All addressing modes are available with all the
instructions
Key requirement of the ’70s: Minimize code size
![Page 10: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/10.jpg)
IS DLX Architecture
• Why?
• Because, in the ’70s, RAM memories were
1000 times smaller than today
• Code space was a key factor
![Page 11: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/11.jpg)
IS DLX Architecture
• RISC = reduced IS architecture
Key architecture today
Axioms:
(1) the IS must be simple,
(2) easy to implement in HW,
(3) should match with clever design solutions
(e.g., pipelining)
(4) should be a good target for today's
optimising compilers
![Page 12: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/12.jpg)
IS DLX Architecture
• RISC = reduced IS architecture
Simple instructions
A few simple addressing modes
Fixed-length instructions
“Many” general purpose registers
Key goal: Help the machine go fast
In general, RISCs increase the number of
instructions executed (IC)…
Recall: $CPU_{TIME}(p) = \dfrac{IC(p) \times CPI(p)}{\text{clock rate}}$
…but at the same time they decrease CPI
The decrease rate of CPI is higher than the
increase rate of IC ⇒ shorter $CPU_{TIME}$
![Page 13: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/13.jpg)
IS DLX Architecture
![Page 14: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/14.jpg)
IS DLX Architecture
• Clock cycles: assumed to be the same
• Results:
• $IC_{MIPS} \approx 2 \times IC_{VAX}$
• $CPI_{MIPS} \approx CPI_{VAX} / 6$
• The performance of the MIPS M2000 is
about 3 times the performance of the VAX
8700
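The factor of about 3 follows directly from the CPU time formula. The Python sketch below uses normalised, illustrative numbers for the VAX (only the ratios 2 and 1/6 and the equal clock come from the slide).

```python
def cpu_time(ic, cpi, clock_rate):
    """CPU time = instruction count x cycles per instruction / clock rate."""
    return ic * cpi / clock_rate


# Normalised, illustrative VAX figures (assumptions, not measurements).
ic_vax, cpi_vax, clock_rate = 1.0, 6.0, 1.0

t_vax = cpu_time(ic_vax, cpi_vax, clock_rate)
t_mips = cpu_time(2 * ic_vax, cpi_vax / 6, clock_rate)  # twice the IC, one sixth of the CPI

ratio = t_vax / t_mips
print(ratio)  # about 3: MIPS needs one third of the time
```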
![Page 15: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/15.jpg)
Computer Design
• Quantitative assessments
• Instruction sets
Pipelining
• Parallelism
![Page 16: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/16.jpg)
Pipelining
• Pipelining =
“an implementation technique whereby
multiple instructions are overlapped in
execution” (P&H)
• An assembly line:
Different steps (pipe stages) …
are completing different parts …
of different instructions …
in parallel
![Page 17: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/17.jpg)
Pipelining
° Four persons (A, B, C, and D) have to perform a certain job on 4 sets of items. The job consists of 4 phases.
° Phase 1 (washing) takes 30’
° Phase 2 (drying), another 30’
° Phase 3 (packaging), another 30’
° Phase 4 (delivering) also takes 30’
![Page 18: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/18.jpg)
Pipelining
Doing the job sequentially takes 8 hours
[Figure: timeline from 6 PM to 2 AM; rows A, B, C, D; sixteen 30’ slots laid end to end]
![Page 19: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/19.jpg)
Pipelining
The whole job is now finished in just 3.5 hours
Key idea: one starts a new phase as soon as possible
[Figure: timeline from 6 PM to 2 AM; rows A, B, C, D; seven 30’ slots, with the rows overlapping]
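The 8 hours and 3.5 hours can be checked with a few lines of Python. The function names are made up for this sketch; the numbers (4 sets of items, 4 phases, 30’ per phase) are those of the example.

```python
PHASE_MINUTES = 30
N_PHASES = 4
N_ITEM_SETS = 4


def sequential_minutes(item_sets, phases, phase_len):
    """Each set of items goes through all phases before the next one starts."""
    return item_sets * phases * phase_len


def pipelined_minutes(item_sets, phases, phase_len):
    """The first set needs `phases` slots; each further set adds one slot."""
    return (phases + item_sets - 1) * phase_len


print(sequential_minutes(N_ITEM_SETS, N_PHASES, PHASE_MINUTES) / 60)  # hours, sequential
print(pipelined_minutes(N_ITEM_SETS, N_PHASES, PHASE_MINUTES) / 60)   # hours, pipelined
```

The pipelined time grows by only one phase per extra set of items, which is why the gain increases when there is more work to do.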
![Page 20: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/20.jpg)
Pipelining
What if they had more jobs to do?
Between 7.30 and 8pm, each person is busy
[Figure: timeline starting at 6 PM; rows A, B, C, D; overlapping 30’ slots]
![Page 21: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/21.jpg)
Pipelining
Between 7.30 and 9.30pm, a whole job is completed every 30’
During that period, each worker is permanently at work…
…but a new input must arrive within 30’
[Figure: timeline starting at 6 PM; rows A, B, C, D; overlapping 30’ slots continuing as further jobs arrive]
![Page 22: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/22.jpg)
Pipelining
• Important issues in this example
• Each phase has the same complexity
Each phase takes the same amount of
time!
• In the sequential processing example, the
requirement was: a new input must be
ready for processing every four phases
• Now, a new input must be available every
phase time!
Whatever delivers the input needs to
be four times as fast
One gets more from the system, but
one also asks more of it
![Page 23: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/23.jpg)
Pipelining
• Also in the execution of, e.g., DLX
instructions, we distinguish a number of
distinct phases – we call them cycles,
because each one takes one clock cycle
time
• In DLX, an instruction is completed in at
most five cycles
• A number of special purpose registers are
used for this:
PC (program counter) = address of the instruction to be executed
IR (instruction register) = instruction to be executed = *(PC)
NPC (next program counter), etc.
![Page 24: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/24.jpg)
Memory and special purpose registers in DLX
[Figure: memory words at addresses 100 to 118 (e.g., 52 71 73 10), holding a program that includes BEQ R1, R3, eq3; BEQ R1, R5, eq5; BGT R1, #0, positive; beside it, the special purpose registers PC, IR, NPC, IMM, ALUOUT, COND, LMD, TMP1, TMP2]
![Page 25: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/25.jpg)
Executing DLX Instructions:
Phase 1: Instruction Fetch (IF)
[Figure: same memory and registers; PC = 00 00 01 00, and the word it points to, 52 71 73 10, is loaded into IR]
![Page 26: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/26.jpg)
Executing DLX Instructions:
Phase 1: Instruction Fetch (IF)
[Figure: same memory and registers; PC = 00 00 01 00, IR = 52 71 73 10, and PC + 4 = 00 00 01 04 is stored into NPC]
![Page 27: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/27.jpg)
Executing DLX Instructions:
Phase 2: Instruction Decode and
Register Fetch (ID)
[Figure: same memory and registers; PC = 00 00 01 00, IR = 52 71 73 10, NPC = 00 00 01 04; the contents of R1 and R3, written (R1) and (R3), and the value 00 00 00 10 are loaded into the special purpose registers]
![Page 28: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/28.jpg)
Executing DLX Instructions:
Phase 3: Execution (EX, branch)
[Figure: same memory and registers; an adder produces 00 00 01 14 (00 00 01 04 + 00 00 00 10)]
![Page 29: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/29.jpg)
Executing DLX Instructions:
Phase 3: Execution (EX, branch)
[Figure: same memory and registers; the sum 00 00 01 14 is kept, and the comparison (R1) == (R3) is evaluated]
![Page 30: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/30.jpg)
Executing DLX Instructions:
Phase 3: Execution
• An instruction only enters an active phase
when it reaches state EX
• At that point, the instruction is said to
have issued or to have committed
• The machine state is only changed when
an instruction has committed
![Page 31: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/31.jpg)
Executing DLX Instructions:
Phase 4: Memory access/branch
completion (MEM, branch)
[Figure: same memory and registers; COND = ((R1) == (R3)); PC now holds 00 00 01 14, i.e., address 114 in memory]
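The four phases of the branch example can be retraced with a short Python sketch. It is a simplification under stated assumptions: the contents of R1 and R3 are invented (chosen equal so that the branch is taken), the instruction word is not actually decoded, and the mapping of values to TMP1, TMP2, IMM and ALUOUT follows the register-transfer notation used later in these slides.

```python
# Word shown on the slides at address 100 (hex); it is stored but not decoded here.
memory = {0x100: 0x52717310}
regs = {1: 7, 3: 7}  # hypothetical contents of R1 and R3

state = {"PC": 0x100}

# Phase 1 (IF): fetch the instruction and compute the next sequential address.
state["IR"] = memory[state["PC"]]
state["NPC"] = state["PC"] + 4

# Phase 2 (ID): read the source registers and the immediate field.
state["TMP1"] = regs[1]
state["TMP2"] = regs[3]
state["IMM"] = 0x10  # offset of label eq3, as shown on the slide

# Phase 3 (EX, branch): compute the branch target and evaluate the condition.
state["ALUOUT"] = state["NPC"] + state["IMM"]
state["COND"] = state["TMP1"] == state["TMP2"]

# Phase 4 (MEM, branch): update PC with the target if the branch is taken.
state["PC"] = state["ALUOUT"] if state["COND"] else state["NPC"]

print(hex(state["PC"]))
```

With different values in R1 and R3 the condition is false and PC simply becomes NPC.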
![Page 32: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/32.jpg)
Executing DLX Instructions
• DLX branch instructions have only 4
phases
• The fifth phase is the write-back (WB), in
which registers are loaded with an
output from the ALU (ALUOUT) or from
LMD (see P&H Chapter 3)
• For instance, when the instruction is
LW R1, 100(R0)
phases 3 – 5 are as follows:
3. ALUOUT ← TMP1 + IMM /* i.e., R0 + 100 */
4. LMD ← Mem[ALUOUT]
5. R1 ← LMD
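Phases 3 to 5 of the load can be written out as three assignments. In this Python sketch the memory content at address 100 is a made-up value, and memory is modelled as a plain dictionary.

```python
regs = [0] * 32            # R0 is 0
memory = {100: 0xCAFE}     # hypothetical word stored at address 100

# Values prepared by phase 2 (ID) for LW R1, 100(R0)
tmp1 = regs[0]
imm = 100

aluout = tmp1 + imm        # phase 3 (EX):  ALUOUT <- TMP1 + IMM
lmd = memory[aluout]       # phase 4 (MEM): LMD    <- Mem[ALUOUT]
regs[1] = lmd              # phase 5 (WB):  R1     <- LMD

print(hex(regs[1]))
```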
![Page 33: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/33.jpg)
Pipelined
[Figure: cache/memory feeding a fetch unit, a decode unit and an execute unit connected to the register file; instructions 1 to 6 advance through Fetch, Decode, Execute and Write back, one stage per cycle]
          T1  T2  T3  T4  T5  T6
Instr 1   F1  D1  E1  W1
Instr 2       F2  D2  E2  W2
Instr 3           F3  D3  E3  W3
Instr 4               F4  D4  E4
Instr 5                   F5  D5
Instr 6                       F6
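The staircase pattern of the table can be generated mechanically: instruction i is in stage (t − i) during cycle t. The following Python sketch (function name chosen for illustration) rebuilds the six-cycle chart for the four stages F, D, E, W.

```python
STAGES = ["F", "D", "E", "W"]  # Fetch, Decode, Execute, Write back


def pipeline_chart(n_instr, n_cycles):
    """Return one row per instruction; each row has one cell per cycle."""
    rows = []
    for i in range(n_instr):
        row = []
        for t in range(n_cycles):
            stage = t - i  # instruction i enters the pipeline in cycle i
            if 0 <= stage < len(STAGES):
                row.append(f"{STAGES[stage]}{i + 1}")
            else:
                row.append("")
        rows.append(row)
    return rows


chart = pipeline_chart(6, 6)
for number, row in enumerate(chart, start=1):
    print(f"Instr {number}", *(f"{cell:>3}" for cell in row))
```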
![Page 34: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/34.jpg)
Pipelining
• With respect to a non-pipelined machine,
the memory system must deliver n times
the bandwidth (n being the number of
pipeline stages)
• In pipelined operation, n instructions are
concurrently being processed: on average
n memory accesses per clock cycle
This worsens the memory bottleneck: even
apart from technological advances, this
architectural modification increases the
number of memory accesses per clock cycle
![Page 35: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/35.jpg)
Pipelining
• In DLX, each instruction takes 5 clock
cycles to complete…
• …but during each clock cycle, the HW
initiates a new instruction and is
executing some part of 5 different
instructions
![Page 36: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/36.jpg)
Pipelining
• Clearly pipelining increases the
complexity of the HW
Each stage involves a set of HW resources; we
need to guarantee that the same HW resource
be scheduled for execution in at most one
pipeline stage
When the pipeline is in steady state, in each
cycle the register file is accessed twice:
in ID (for reading),
in WB (for writing)
Each clock cycle, we need to perform two
reads and one write
We need to guarantee consistent operation
even when we read from and write to, e.g., the
same register
![Page 37: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/37.jpg)
Pipelining
In order to realize the pipeline, values and
control information must “move through” the
pipeline from one stage to the next
Special registers, called pipeline registers or
pipeline latches, convey that information
This is because, instead of having, e.g., a single
NPC register, we need to have
NPC’, NPC’’, NPC’’’…
representing the values of NPC during the
different stages of different instructions
For instance,
bwIDandEX.NPC ← bwIFandID.NPC
![Page 38: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/38.jpg)
Pipelining
Stage: actions and pipeline registers
IF: bwIFandID.IR ← *PC
    if (bwEXandMEM.COND == TRUE)
        bwIFandID.NPC ← bwEXandMEM.NPC
    else
        bwIFandID.NPC ← PC + 4
ID: bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
    bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
52 71 73 10 = BEQ R1, R3, eq3
![Page 39: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/39.jpg)
Pipelining
Stage: actions and pipeline registers
IF: bwIFandID.IR ← *PC
    if (bwEXandMEM.COND == TRUE)
        bwIFandID.NPC ← bwEXandMEM.NPC
    else
        bwIFandID.NPC ← PC + 4
ID: bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
    bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
    bwIDandEX.NPC ← bwIFandID.NPC    (new reg ← old reg)
    bwIDandEX.IR ← bwIFandID.IR
![Page 40: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/40.jpg)
Pipelining
Stage: actions and pipeline registers
IF: bwIFandID.IR ← *PC
    if (bwEXandMEM.COND == TRUE)
        bwIFandID.NPC ← bwEXandMEM.NPC
    else
        bwIFandID.NPC ← PC + 4
ID: bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
    bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
    bwIDandEX.NPC ← bwIFandID.NPC
    bwIDandEX.IR ← bwIFandID.IR
    bwIDandEX.IMM ← bwIFandID.IR[3]
52 71 73 10 = BEQ R1, R3, eq3
![Page 41: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/41.jpg)
Pipelining
Stage: actions and pipeline registers
EX: bwEXandMEM.ALUOUT ← bwIDandEX.NPC + bwIDandEX.IMM
    bwEXandMEM.COND ← bwIDandEX.TMP1 rel bwIDandEX.TMP2
…and so forth (see P&H, p.136)
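A rough Python sketch of how values travel through the pipeline registers for the BEQ example. It models each pipeline register as a dictionary and each stage as a function. Several things are assumptions made for illustration: the register contents, the pre-decoded fields passed to `stage_id` (the word is not really decoded), the use of `==` for `rel`, and the omission of the branch redirect in IF.

```python
def stage_if(pc, memory):
    """IF: fill bwIFandID (sequential case only; branch redirect omitted)."""
    return {"IR": memory[pc], "NPC": pc + 4}


def stage_id(bw_if_id, regs, fields):
    """ID: read registers and copy NPC and IR into bwIDandEX."""
    rs1, rs2, imm = fields  # hypothetical pre-decoded IR fields
    return {
        "TMP1": regs[rs1],
        "TMP2": regs[rs2],
        "NPC": bw_if_id["NPC"],  # new reg <- old reg
        "IR": bw_if_id["IR"],
        "IMM": imm,
    }


def stage_ex(bw_id_ex):
    """EX: compute the branch target and the condition into bwEXandMEM."""
    return {
        "ALUOUT": bw_id_ex["NPC"] + bw_id_ex["IMM"],
        "COND": bw_id_ex["TMP1"] == bw_id_ex["TMP2"],  # rel is == for BEQ
    }


memory = {0x100: 0x52717310}
regs = {1: 7, 3: 7}  # illustrative register contents

bw_if_id = stage_if(0x100, memory)
bw_id_ex = stage_id(bw_if_id, regs, (1, 3, 0x10))
bw_ex_mem = stage_ex(bw_id_ex)
print(hex(bw_ex_mem["ALUOUT"]), bw_ex_mem["COND"])
```

Because every stage reads only the register in front of it and writes only the one behind it, a different instruction can occupy each stage in the same cycle.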
![Page 42: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/42.jpg)
Pipelining
• More registers are required ⇒ a more
complex design is to be carried out
• More complex algorithm ⇒ takes more
time to complete
![Page 43: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/43.jpg)
Pipelining
• Indeed, implementing an instruction
pipeline
increases the instruction throughput
(average number of instructions
completed in one time unit)…
…though it slightly
increases the execution time of each
instruction
Overhead for controlling the pipeline
Overhead for avoiding “hazards” (to be
discussed later on)
![Page 44: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/44.jpg)
Pipelining
Quantitative measurements
• Let U be an unpipelined machine
• Clock cycle of U = ccU = 10 ns
• Cycle distribution of U is as follows:
ALU instructions (40%) take 4 cycles
Branches (20%) take 4 cycles
Memory operations (40%) take 5 cycles
• P = pipelined version of U
• Clock cycle of P = ccP = 11 ns
(overhead: 1 ns per cycle)
• How fast is P w.r.t. U?
(Assumption: continuous flow is available,
no pipeline stalls...)
![Page 45: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/45.jpg)
Pipelining
Quantitative measurements
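• Average Instruction Execution Time = T
• $T_U = cc_U \times \text{average CPI} = 10\,\text{ns} \times ((40\% + 20\%) \times 4 + 40\% \times 5) = 10\,\text{ns} \times 4.4 = 44\,\text{ns}$
(ALU and BRANCH instructions take 4 cycles; MEM takes 5 cycles)
• $T_P = cc_P \times \text{average CPI} = cc_P \times 1 = 11\,\text{ns}$
• Speedup $= T_U / T_P = 44\,\text{ns} / 11\,\text{ns} = 4$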
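The same calculation as a small Python sketch, using only the figures given on the previous slide (the variable names are chosen for illustration).

```python
cc_u = 10  # ns, clock cycle of the unpipelined machine U
cc_p = 11  # ns, clock cycle of the pipelined machine P (1 ns overhead)

# (fraction of instructions, cycles per instruction) on U
mix = [
    (0.40, 4),  # ALU instructions
    (0.20, 4),  # branches
    (0.40, 5),  # memory operations
]

avg_cpi_u = sum(fraction * cycles for fraction, cycles in mix)
t_u = cc_u * avg_cpi_u  # average instruction time on U
t_p = cc_p * 1          # ideal pipeline: one instruction completes per cycle

speedup = t_u / t_p
print(round(avg_cpi_u, 2), round(t_u, 2), round(speedup, 2))
```

The speedup stays below the number of stages because of the 1 ns per-cycle overhead and because not every instruction needed all five cycles on U.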
![Page 46: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/46.jpg)
Pipelining Hazards
• Ideally, pipelines should continuously
“crunch” instructions without being
interrupted
• This way, the speedup is maximum
• In reality there exist three classes of
impediments that prevent the next
instruction from being executed:
Structural Hazards
Data Hazards
Control hazards
to be described in what follows
![Page 47: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/47.jpg)
Pipelining Hazards
• Hazards are a problem because they
force the pipeline to stall (see later)
• Later on we will show some techniques
for hazard prevention
![Page 48: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/48.jpg)
Pipelining Structural Hazards
• Structural Hazards are resource conflicts
• Not every combination of instructions is
allowed
because not every functional unit is fully
pipelined
Or because of other resource conflicts
A problem of cost-effectiveness
Consequence: a stall (“bubble”) floats
through the pipeline
![Page 49: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/49.jpg)
Pipelining Structural Hazards
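[Figure: pipeline diagram over cycles 1 to 8 for LOAD, Instr2, Instr3, Instr4; two memory accesses (labelled Mem), one of them belonging to Instr4, fall in the same cycle]
If the machine has just one memory port, this is
a structural hazard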
![Page 50: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/50.jpg)
Pipelining Structural Hazards
[Figure: pipeline diagram over cycles 1 to 8 for LOAD, Instr2, Instr3, Instr4; a bubble is inserted before Instr4]
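The bubble can be reproduced with a small scheduling sketch in Python. It assumes a five-stage pipeline (IF, ID, EX, MEM, WB), a single memory port shared by instruction fetch and data access, and that only instructions named "LOAD" use the port in their MEM stage; the function name is made up.

```python
def fetch_cycles(program):
    """Return the cycle in which each instruction is fetched.

    A fetch is delayed by one cycle (a bubble) whenever the single memory
    port is already taken by the MEM stage of an earlier LOAD.
    """
    busy = set()  # cycles in which the port is used by a MEM stage
    cycles = []
    cycle = 1
    for instr in program:
        while cycle in busy:
            cycle += 1           # bubble: the fetch has to wait
        cycles.append(cycle)
        if instr == "LOAD":
            busy.add(cycle + 3)  # MEM is the fourth stage
        cycle += 1
    return cycles


print(fetch_cycles(["LOAD", "Instr2", "Instr3", "Instr4"]))
```

Instr4 would normally be fetched in cycle 4, exactly when LOAD accesses memory, so its fetch slips to cycle 5.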
![Page 51: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/51.jpg)
Pipelining Structural Hazards
• One of the key principles of computer design:
make the common case fast,
and the rare case correct
• If a particular structural hazard
does not occur very frequently,
it may not be worth the cost to avoid it
![Page 52: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/52.jpg)
Pipelining Structural Hazards
• Avoiding a conflict has a cost due to the
extra redundancy,
but also a cost due to
extra control
• Compare for instance fig. 3.1 and fig.3.4
of P&H
• One must be careful so that this overhead
does not trigger a need for a longer clock
cycle ⇒ lower clock rate
Recall: $CPU_{TIME}(p) = \dfrac{IC(p) \times CPI(p)}{\text{clock rate}}$
![Page 53: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/53.jpg)
Pipelining Data Hazards
• Pipelining overlaps the execution of a set
of instructions
• Data Hazards are hazards due to
data dependencies between these
overlapped executions
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
ADD requires 5 cycles to complete!
SUB may use the wrong value!
![Page 54: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/54.jpg)
Pipelining Data Hazards
• Pipelining overlaps the execution of a set
of instructions
• Data Hazards are hazards due to
data dependencies between these
overlapped executions
ADD R1, R2, R3
SUB R4, R5, R1
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
ADD requires 5 cycles to complete!
SUB, AND, and OR require R1 sooner
XOR is “far” enough
![Page 55: Advanced Computer Architectures – Part 2.2](https://reader034.vdocuments.us/reader034/viewer/2022052410/554a1053b4c9058c5d8b4a37/html5/thumbnails/55.jpg)
Pipelining Data Hazards
[Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11. The uses of R1 by SUB, AND and OR are marked DATA HAZARDS; the use by XOR is marked NOT A DATA HAZARD]
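The classification in the figure can be reproduced with a simple timing check. This Python sketch assumes a five-stage pipeline without forwarding in which instruction i (counting from 0) is fetched in cycle i + 1, reads its source registers in ID (cycle i + 2) and writes its result in WB (cycle i + 5); a read that happens no later than the producer's write-back cycle is flagged. It ignores the case of a register being overwritten by an intermediate instruction.

```python
program = [
    ("ADD", "R1", ["R2", "R3"]),
    ("SUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR", "R8", ["R1", "R9"]),
    ("XOR", "R10", ["R1", "R11"]),
]


def data_hazards(instructions):
    """Return (producer, consumer, register) triples that are data hazards."""
    hazards = []
    for i, (producer, dest, _) in enumerate(instructions):
        write_cycle = i + 5                     # WB of the producer
        for j in range(i + 1, len(instructions)):
            consumer, _, sources = instructions[j]
            read_cycle = j + 2                  # ID of the consumer
            if dest in sources and read_cycle <= write_cycle:
                hazards.append((producer, consumer, dest))
    return hazards


for hazard in data_hazards(program):
    print(hazard)
```

XOR reads R1 in cycle 6, one cycle after ADD has written it, so it is not reported.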
Pipelining Minimizing or
Avoiding Data Hazards
• Let us consider again ADD R1, R2, R3
• “ADD requires 5 cycles to complete”
means
“the sum of R2 and R3 will be stored into
R1 only at the 5th cycle”
Why should we wait for this to happen?
Forwarding: using a pipeline register that
holds the right value
SUB R4, R1, R5
becomes
SUB R4, bwEXandMEM.ALUOUT, R5
(bwEXandMEM.ALUOUT: the ALU output held in the pipeline register between the EX and MEM stages)
Pipelining Minimizing or
Avoiding Data Hazards
• How is forwarding realized?
• By propagating the result of the ALU
directly to an input latch of the ALU
• A custom circuit selects the right value to
be input to the ALU: the named register or
the propagated value
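The selection circuit described above can be modelled as a small function. This is a hedged sketch, not the DLX datapath itself: the names `alu_operand`, `ex_mem`, and `mem_wb` are hypothetical, and each pending result is represented as a `(destination register, value)` pair.

```python
# Minimal sketch of the operand selection done by forwarding logic.

def alu_operand(src_reg, reg_file, ex_mem=None, mem_wb=None):
    """Pick the value for one ALU input.

    ex_mem / mem_wb are hypothetical (dest_reg, value) pairs held in the
    pipeline registers after EX and after MEM; None means nothing is pending.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]          # most recent result, not yet written back
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]          # older result, still on its way to WB
    return reg_file[src_reg]      # no pending write: use the named register

# Hypothetical register contents; R1 still holds a stale value.
regs = {"R1": 0, "R2": 7, "R3": 5, "R5": 20}
pending = ("R1", regs["R2"] + regs["R3"])      # ADD R1, R2, R3 has just left EX
r1 = alu_operand("R1", regs, ex_mem=pending)   # SUB R4, R1, R5 gets the new R1
print(r1 - regs["R5"])
```

The priority order (newest pending result first) matters when two in-flight instructions write the same register.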
Pipelining Minimizing or
Avoiding Data Hazards
• Sometimes forwarding can be avoided by
very simple techniques
• For instance, let us assume that, during
each cycle,
writes into the register file occur in the
first half of the cycle, while
reads occur in the second half
[Figure: a clock cycle split into a write half (W) followed by a read half (R)]
Pipelining Minimizing or
Avoiding Data Hazards
[Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11]
3, 4: Forwarding
5: Forwarding avoidance
Pipelining Classification of
Data Hazards
• Let $(I_k)_{1 \le k \le IC(p)}$ be the ordered series of
instructions executed during the run of
program p
• Let $i < j$ be two integers, $1 \le i < j \le IC(p)$
• So $I_i$ occurs before $I_j$
• Let us represent predicate
“instruction i writes in memory location v”
as $I_i \rightarrow v$
• Let us represent predicate
“instruction i reads from location v”
as $I_i \leftarrow v$
Pipelining Classification of
Data Hazards
1. RAW HAZARD (Read-After-Write hazard)
Order in time (t): first $I_i \rightarrow v$ ($I_i$ writes v), then $I_j \leftarrow v$ ($I_j$ reads v)
• RAW data dependency on an operand
that needs first to be written by $I_i$, and
then read by $I_j$
• If, due to pipelining,
$I_j$ reads v before $I_i$ writes it,
a RAW hazard occurs: $I_j$ erroneously
gets a stale value
Pipelining Classification of
Data Hazards
2. WAW HAZARD (Write-After-Write hazard)
Order in time (t): first $I_i \rightarrow v$ ($I_i$ writes v), then $I_j \rightarrow v$ ($I_j$ writes v)
• WAW data dependency on an operand
that must be written in a certain order
while it is written in the wrong one
• If, due to pipelining,
$I_j$ writes v before $I_i$ writes it,
a WAW hazard occurs: the wrong value
gets stored in v
Pipelining Classification of
Data Hazards
• WAW hazards may happen in pipelines
in which the write-back stage can occur at
different positions
Cycle:          1    2    3    4     5     6
LW R1, 0(R2)    IF   ID   EX   MEM1  MEM2  WB
ADD R1,R2,R3         IF   ID   EX    WB
(ADD writes R1 in cycle 5, before LW writes it in cycle 6)
• This cannot happen with instruction sets
such as, e.g., DLX, where
each instruction takes
the same number of cycles
• Less tricky design ⇒ less complexity to
handle ⇒ fewer pitfalls
Pipelining Classification of
Data Hazards
3. WAR HAZARD (Write-After-Read hazard)
Order in time (t): first $I_i \leftarrow v$ ($I_i$ reads v), then $I_j \rightarrow v$ ($I_j$ writes v)
• WAR data dependency on an operand
that needs first to be read by $I_i$, and then
written by $I_j$
• If, due to pipelining,
$I_j$ writes v before $I_i$ reads it,
a WAR hazard occurs: the wrong value is
read from v
• $I_i$ erroneously gets the NEW value of v,
the one produced by $I_j$
Pipelining Classification of
Data Hazards
• This cannot happen with instruction sets
such as, e.g., DLX, where
all reads are early (ID stage) and
all writes are late (WB stage)
• WAR hazards occur when there are
instructions that write results early in the
instruction pipeline, as well as
instructions that read a source late in the
pipeline
• For instance, this may happen with the
autoincrement addressing mode
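The three classes of dependence can be recognised mechanically from the read and write sets of two instructions. The sketch below is illustrative only: the function `dependences` and the representation of an instruction as a `(reads, writes)` pair of register-name sets are assumptions made for this example. Note that it reports dependences; whether a dependence turns into a hazard depends on the pipeline, as discussed above.

```python
# Minimal sketch: classify the data dependences between two instructions,
# where `first` precedes `second` in program order.

def dependences(first, second):
    """Each instruction is a (reads, writes) pair of sets of register names."""
    reads_1, writes_1 = first
    reads_2, writes_2 = second
    found = set()
    if writes_1 & reads_2:
        found.add("RAW")   # second reads what first writes
    if writes_1 & writes_2:
        found.add("WAW")   # both write the same location
    if reads_1 & writes_2:
        found.add("WAR")   # second overwrites what first still has to read
    return found

add = ({"R2", "R3"}, {"R1"})       # ADD R1, R2, R3
sub = ({"R1", "R5"}, {"R4"})       # SUB R4, R1, R5
lw = ({"R2"}, {"R1"})              # LW R1, 0(R2)

print(dependences(add, sub))
print(dependences(lw, add))
```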
Pipelining Hazards
[Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9]
• In some cases, forwarding and subcycling
can prevent a stall
Pipelining Hazards
[Figure: pipeline diagram over cycles 1 to 8 for LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, with the annotation “IMPOSSIBLE!”]
• In some cases, forwarding and subcycling
cannot prevent a stall
Pipelining Hazards
[Figure: pipeline diagram over cycles 1 to 8 for LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, with three slots marked “bubble”]
• A special HW, called the pipeline
interlock, detects the hazard and stalls
the pipeline until the hazard is cleared
Pipelining Hazards
• Pipeline interlock penalty:
one or more clock cycles
• Consequences:
the CPI for the stalled instruction
increases by the length of the stall
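The length of the stall and its effect on CPI can be sketched as follows. The timing model is an assumption made for illustration: full forwarding, an ALU result ready at the end of EX, a loaded value ready at the end of MEM, and a consumer that needs its operand when its own EX stage starts. The names `stall_cycles` and `effective_cpi`, and the 20% figure in the example, are hypothetical.

```python
# Minimal sketch of the interlock decision and of its cost in CPI.

def stall_cycles(producer, distance):
    """Bubbles inserted before a consumer issued `distance` slots after `producer`.

    The producer issues in cycle 1; 'alu' results are ready at the end of
    cycle 3 (EX), 'load' results at the end of cycle 4 (MEM).
    """
    ready_at_end_of = {"alu": 3, "load": 4}[producer]
    # Without stalls the consumer is in EX during cycle 3 + distance,
    # so its operand must be ready by the end of cycle 2 + distance.
    needed_by_end_of = 2 + distance
    return max(0, ready_at_end_of - needed_by_end_of)

def effective_cpi(ideal_cpi, stall_cycles_per_instruction):
    """The CPI grows by the average number of stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instruction

print(stall_cycles("alu", 1), stall_cycles("load", 1), stall_cycles("load", 2))
# Hypothetical workload: 20% of instructions use a value loaded just before them
print(effective_cpi(1.0, 0.20 * stall_cycles("load", 1)))
```

Under this model a load followed immediately by a use of its result is the only case that needs a bubble, which matches the LW example in the preceding slides.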
Pipelining Pipeline Scheduling
• Classical solution: pipeline scheduling
• The compiler re-arranges the instructions
in order to (try to) avoid stalls
• Example: the compiler tries to avoid
generating code like
LW x, …
INSTR …, x
that is, a load followed by the immediate
use of the load destination register
Pipelining Pipeline Scheduling
1. Generate DLX code for the expressions
a = b + c
d = e – f
Basic block:
LW R1, b
LW R2, c
ADD R3, R1, R2
SW a, R3
LW R4, e
LW R5, f
SUB R6, R4, R5
SW d, R6
Pipelining Pipeline Scheduling
2. We make a graph of the dependences
among the instructions and we order the
instructions so as to minimize the stalls
Original order:
LW R1, b
LW R2, c
ADD R3, R1, R2
SW a, R3
LW R4, e
LW R5, f
SUB R6, R4, R5
SW d, R6
Scheduled order:
LW R1, b
LW R2, c
LW R4, e
ADD R3, R1, R2
LW R5, f
SW a, R3
SUB R6, R4, R5
SW d, R6
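The benefit of the reordering can be checked by counting load-use stalls in both sequences. This is a simplified sketch: it assumes that the only stall is the one-cycle bubble caused by using a register loaded by the immediately preceding LW, and the tuple encoding `(opcode, destination, sources)` and the function name `load_use_stalls` are choices made for this example.

```python
# Minimal sketch: count load-use stalls before and after pipeline scheduling.

def load_use_stalls(program):
    """program: list of (opcode, dest, sources) tuples.

    Counts one stall whenever an instruction uses the register loaded by
    the LW that immediately precedes it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

original = [
    ("LW", "R1", ["b"]),
    ("LW", "R2", ["c"]),
    ("ADD", "R3", ["R1", "R2"]),
    ("SW", None, ["a", "R3"]),
    ("LW", "R4", ["e"]),
    ("LW", "R5", ["f"]),
    ("SUB", "R6", ["R4", "R5"]),
    ("SW", None, ["d", "R6"]),
]
# Same instructions in the scheduled order shown on the slide
scheduled = [original[i] for i in (0, 1, 4, 2, 5, 3, 6, 7)]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

With this model the original order incurs two stalls (ADD after LW R2, SUB after LW R5) and the scheduled order incurs none.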
Pipelining Control Hazards
• Control hazards are hazards due to the
execution of branches
• Let us call
TAKEN BRANCH
a branch that sets the PC to its target
address
• Let us call
UNTAKEN BRANCH
a branch that does not force the PC to be
set; as far as PC is concerned, it behaves
like a NOP
Pipelining Control Hazards
• The problem with branches is that their
nature is only known at run-time
• Simplest method to deal with branches:
as soon as we detect a branch,
we stall the pipeline
• What exactly does “as soon as” mean?
DLX Branch
1: IF (1/2)
[Figure: memory at addresses 100 to 118 holding the code
BEQ R1, R3, eq3
BEQ R1, R5, eq5
BGT R1, #0, positive
together with the registers PC, IR, NPC, IMM, ALUOUT, COND, LMD, TMP1, TMP2. PC holds 00 00 01 00; the word at that address, 52 71 73 10, is copied into IR.]
DLX Branch:
1: IF (2/2)
[Figure: same memory and register layout. PC holds 00 00 01 00 and IR holds 52 71 73 10; PC + 4 = 00 00 01 04 is stored into NPC.]
At this point, we’ve just
fetched an instruction; but
we don’t know yet WHICH ONE!
DLX Branch
2: ID
[Figure: same memory and register layout. IR holds 52 71 73 10 and NPC holds 00 00 01 04; the figure also shows the register contents (R1) and (R3) and the value 00 00 00 10.]
At this point, we’ve decoded the
instruction and found that it’s
indeed a branch
DLX Branch
3: EX (1/2)
[Figure: same memory and register layout. An adder (+) produces the value 00 00 01 14 (00 00 01 04 + 00 00 00 10).]
Here we get the next PC of
the taken branch
DLX Branch
3: EX (2/2)
[Figure: same memory and register layout. The value 00 00 01 14 is available, and a comparator (=) evaluates (R1) == (R3).]
Only at this point do we know the
nature of the branch:
brnch = (cond)? Taken:Untaken;
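The walk-through of the branch can be condensed into a short sketch. It assumes that the value 00 00 00 10 shown in the figures is the immediate of the BEQ instruction and that the taken address is NPC plus that immediate; the function name `dlx_branch` and the register contents used in the example are hypothetical.

```python
# Minimal sketch of the steps a BEQ goes through, using the values in the figures.

def dlx_branch(pc, imm, a, b):
    """Return the next PC for a BEQ that compares register values a and b."""
    npc = pc + 4          # IF: next sequential PC
    target = npc + imm    # EX (1/2): address used if the branch is taken
    cond = (a == b)       # EX (2/2): only now is the branch outcome known
    return target if cond else npc

# PC = 0x100 and immediate 0x10 as in the figures; register values are made up
print(hex(dlx_branch(0x100, 0x10, a=3, b=3)))   # taken
print(hex(dlx_branch(0x100, 0x10, a=3, b=4)))   # untaken
```

Both `target` and `cond` depend on work done in EX, which is why a pipeline that simply waits for them pays several stall cycles.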
Pipelining Control Hazards
• The problem with branches is that their
nature is only known at run-time
• Simplest method to deal with branches:
as soon as we detect a branch,
we stall the pipeline
1. “As soon as” means
after the IF stage, during stage ID
IF: first stall
2. Then we need to reach the EX stage
to know the address where to branch to
ID: second stall
3. The nature of a branch is revealed at the
end of EX, in MEM
EX: third stall
• At this point, the pipeline restarts
Pipelining Control Hazards
• With a 30% branch frequency and an ideal
CPI of 1, three clock cycles of penalty
means that the machine only achieves
about HALF the ideal speedup from
pipelining
• What can we do to reduce the three cycle
penalty?
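The "about half" claim follows from a one-line calculation, sketched here with the figures from the slide (30% branch frequency, ideal CPI of 1, three cycles of penalty). The function name `pipelined_cpi` is a hypothetical helper.

```python
# Minimal sketch: effect of a fixed branch penalty on CPI and on speedup.

def pipelined_cpi(ideal_cpi, branch_freq, branch_penalty):
    """Average CPI when every branch costs `branch_penalty` stall cycles."""
    return ideal_cpi + branch_freq * branch_penalty

cpi = pipelined_cpi(1.0, 0.30, 3)
fraction_of_ideal_speedup = 1.0 / cpi   # ideal CPI divided by actual CPI

print(round(cpi, 2), round(fraction_of_ideal_speedup, 2))
```

The CPI rises from 1 to 1.9, so the machine reaches roughly 53% of the ideal speedup.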
Pipelining Control Hazards
1. Uncover the nature of the branch earlier
in the pipeline:
in DLX, this means adding a test to the ID
stage
2. Compute the taken PC earlier:
at the cost of an additional adder,
we can anticipate the addition that gives
the taken PC
3. (For untaken branches): do not repeat
the IF stage
• These strategies can reduce the branch
penalty to one clock cycle
Pipelining Control Hazards
• How to deal with branch penalties
• Four simple compile-time schemes
Static, fixed, per-branch predictions
Compile-time guesses
• Simplest: freezing or flushing the pipeline
Penalty: one clock cycle
• Predict not taken:
The HW continues as if the branch was not
taken (next IR = *(PC + 4))
If the branch is taken, the fetched instruction
is invalidated (turned into a NOP)
Penalty: no penalty if untaken,
one cycle if taken
Pipelining Control Hazards
Predict not taken
Cycle:              1     2     3     4     5     6     7     8     9
Untaken branch      IF    ID    EX    MEM   WB
i + 1                     IF    ID    EX    MEM   WB
i + 2                           IF    ID    EX    MEM   WB
i + 3                                 IF    ID    EX    MEM   WB
i + 4                                       IF    ID    EX    MEM   WB

Cycle:              1     2     3     4     5     6     7     8     9
Taken branch        IF    ID    EX    MEM   WB
i + 1                     IF    idle  idle  idle  idle
Branch target                   IF    ID    EX    MEM   WB
Branch target + 1                     IF    ID    EX    MEM   WB
Branch target + 2                           IF    ID    EX    MEM   WB
Pipelining Control Hazards
• Predict taken:
Hypothesis: the taken branch address is known
very early, long before the outcome of the
branch is known
The HW assumes the branch is taken
Penalty: no penalty if taken,
one cycle if untaken
Because of loops, taken branches are more
frequent than untaken branches
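The one-cycle penalties quoted for these schemes can be compared for a given fraction of taken branches. This is a sketch under the slides' own simplification (a misprediction costs exactly one cycle); the function name `average_branch_penalty`, the scheme labels, and the 67% taken fraction used in the example are illustrative assumptions.

```python
# Minimal sketch: average penalty per branch for three simple schemes.

def average_branch_penalty(scheme, taken_fraction):
    """Expected stall cycles per branch, assuming a 1-cycle misprediction cost."""
    if scheme == "freeze":
        return 1.0                      # always pay one cycle
    if scheme == "predict_not_taken":
        return taken_fraction           # pay only when the branch is taken
    if scheme == "predict_taken":
        return 1.0 - taken_fraction     # pay only when the branch is untaken
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("freeze", "predict_not_taken", "predict_taken"):
    print(scheme, round(average_branch_penalty(scheme, 0.67), 2))
```

When most branches are taken, predict-taken has the lowest average penalty, provided the taken address really is known early enough.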
Pipelining Control Hazards
• Delayed branch
• Hypothesis: a branch implies a delay that
adds up to the time required to execute n
instructions
• The branch delay slot is then filled in
with instructions that are executed
whatever the outcome of the
branch test
• In DLX, n = 1
Pipelining Control Hazards
Delayed branch
Cycle:              1     2     3     4     5     6     7     8     9
Untaken branch      IF    ID    EX    MEM   WB
Branch delay              IF    ID    EX    MEM   WB
i + 2                           IF    ID    EX    MEM   WB
i + 3                                 IF    ID    EX    MEM   WB
i + 4                                       IF    ID    EX    MEM   WB

Cycle:              1     2     3     4     5     6     7     8     9
Taken branch        IF    ID    EX    MEM   WB
Branch delay              IF    ID    EX    MEM   WB
Branch target                   IF    ID    EX    MEM   WB
Branch target + 1                     IF    ID    EX    MEM   WB
Branch target + 2                           IF    ID    EX    MEM   WB
Pipelining Control Hazards
Slot schedule
• Problem: how to schedule
the branch-delay slot
• Three ways
• Best choice: an independent instruction
from before the branch
Before scheduling:
INSTR1
INSTR2
IF TEST THEN
Delay slot
…
INSTR N
After scheduling (INSTR2 moved into the delay slot):
INSTR1
IF TEST THEN
INSTR2
…
INSTR N
• Penalty: none
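Whether an instruction from before the branch is "independent" can be tested with a simple check. The sketch below is deliberately simplified and rests on an assumption: the candidate is the instruction immediately preceding the branch, so the only thing to verify is that the branch test does not read the candidate's result. The function name `can_fill_from_before` and the example instructions are hypothetical.

```python
# Minimal sketch: may the instruction just before a branch move into its delay slot?

def can_fill_from_before(candidate, branch_sources):
    """candidate: (dest, sources) of the instruction right before the branch.

    Moving it after the branch is safe only if the branch condition does not
    depend on the register the candidate writes.
    """
    dest, _sources = candidate
    return dest not in branch_sources

branch_sources = {"R2", "R3"}                 # e.g. BEQ R2, R3, Skip
independent = ("R1", {"R4", "R5"})            # e.g. ADD R1, R4, R5
dependent = ("R2", {"R4", "R5"})              # e.g. ADD R2, R4, R5

print(can_fill_from_before(independent, branch_sources))
print(can_fill_from_before(dependent, branch_sources))
```

A real compiler would also have to check dependences against any instructions the candidate is moved across; that is left out here.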
Pipelining Control Hazards
Slot schedule
• If the best choice is not possible, e.g.,
due to a dependency, then one may
choose among the following two
methods:
1. From target :
If it is not possible to select an
independent instruction from before the
branch (a sure one!), then you must
guess: If the chance that the branch is
taken is felt as higher, then you fill the
delay slot with an instruction from the
target of the branch
Pipelining Control Hazards
Slot schedule
• From target
[Figure: the sequence INSTR1, INSTR2, IF TEST THEN, delay slot, … before and after scheduling; after scheduling, the delay slot holds INSTR 1]
• Penalty: none if the branch is a taken one,
1 clock cycle if it’s untaken
• Assumption: no side effect from
executing INSTR 1 when branch is
mispredicted (no undo required!)
Pipelining Control Hazards
Slot schedule
2. From fall through :
If it is not possible to select an
independent instruction from before the
branch (a sure one!), and if the chance
that the branch is not taken is felt as
higher, then you fill the delay slot with
the instruction at PC+4
[Figure: the sequence INSTR1, INSTR2, IF TEST THEN, delay slot, …, INSTR N before and after filling the delay slot from the fall-through path]
Pipelining Control Hazards
Slot schedule
• Again, the instruction selected to be
placed in the delay slot must be side
effect free
• That instruction must be such that
no undo is required if the branch goes in
the unexpected direction
BEQ R2, R3, Skip
LW R1, #100
. . .
Skip: LW R1, #200
. . .
The second load
overwrites the
first one
Pipelining Control Hazards
• The LW example is clearly an ideal one. In
reality, it is very difficult to select an
instruction for the delay slot
• Furthermore, these schemes are compile-
time predictions that may be found to be
false at run-time
Pipelining Control Hazards
• Improvements are possible :
Cancelling branches :
the branch instructions include a
prediction bit (taken vs. untaken).
If the prediction turns out to be wrong, the branch
instruction “cancels” the instruction in
the delay slot by writing the NOP bit(s)
• This makes it easier to select
instructions for the delay slot:
the side-effect free requirement can be
relaxed
Pipelining Control Hazards
• The use of delayed and cancelling
branches resulted in no penalty 70% of
the time, on average, with 10 programs of
the SPECint92 benchmarks (5 int., 5 f.p.)
• Delayed branches have an extra cost:
an interrupt may also occur during the
execution of the instruction in the branch
delay slot (BDSI).
• If the branch was taken, then both the
address of the BDSI and that of the
branch target need to be preserved and
restored when the interrupt has been
served
Pipelining Control Hazards
• The longer the pipeline, the more pipeline
stages are required
(1) to uncover the current branch target
address and
(2) to tell the nature of the current branch
• In DLX, one clock cycle (very small)
• In R4000, it is 3 clock cycles (1) and 1
clock cycle (2)
Pipelining Static Branch
Prediction & Compiler Support
• The effectiveness of delayed branch
depends on whether our guess is correct
• Static branch prediction: predicting the
outcome of a branch at compile time
(vs. dynamic prediction: prediction based
on runtime program behaviour)
• Static prediction method 1:
observing and analysing the program
behaviour
• Static prediction method 2:
using profile information collected from
earlier runs of the program
Pipelining Static Branch
Prediction & Compiler Support
• Static prediction method 1:
observing and analysing the program
behaviour
• Observations (10 SPECint92 benchmark
programs) show that most branches are
taken
On average, 62% in integer programs, 70% in
f.p. programs (about 67% overall)
Of taken branches, backward branches are at
least 1.5 times as frequent as forward branches
Loop unrolling is a reason for this
Pipelining Static Branch
Prediction & Compiler Support
• Simplest method: predict-as-taken (1.1)
• In our benchmark, a minority of these
predictions is wrong (34%)
• Note: On the average!
Worst misprediction is 59%, best is 9%
(in the worst case, predict-as-untaken
would give better performance!)
Pipelining Static Branch
Prediction & Compiler Support
• Method 1.2:
predict-bw-as-taken
predict-fw-as-untaken
• For some programs and compilers,
n (fw branches) < 50%, i.e., fewer than half of the forward branches are taken
• In this case only, M1.2 is better than M1.1
• This is not true for the 10 SPECint92
programs and in most cases
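Method 1.2 needs nothing more than a comparison of two addresses. The sketch below shows the rule and measures it on a small trace; the trace is entirely made up for illustration, and the names `predict_taken` and `misprediction_rate` are hypothetical.

```python
# Minimal sketch of "predict backward as taken, forward as untaken".

def predict_taken(branch_pc, target_pc):
    """Backward branches (target before the branch) are predicted taken."""
    return target_pc < branch_pc

def misprediction_rate(trace):
    """trace: list of (branch_pc, target_pc, actually_taken) records."""
    wrong = sum(1 for pc, target, taken in trace
                if predict_taken(pc, target) != taken)
    return wrong / len(trace)

# Hypothetical trace: a loop-closing backward branch, taken 9 times out of 10,
# and a forward branch, taken 2 times out of 5.
trace = ([(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
         + [(0x130, 0x150, True)] * 2 + [(0x130, 0x150, False)] * 3)

print(misprediction_rate(trace))
```

On this made-up trace the rule mispredicts 3 of 15 branches; a forward branch that is mostly taken would make it do worse than plain predict-as-taken.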
Pipelining Static Branch
Prediction & Compiler Support
• Static prediction method 2:
using profile information collected from
earlier runs of the program
• You see what happened in the past and
consider this as a good model for the
future
• Per branch prediction
• Key observation and principle: “often,”
a given branch has a high-probability
behaviour
A privileged attribute
It is most likely a taken or an untaken branch
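Profile-based prediction can be sketched as a majority vote per branch over an earlier run. The profile data below is invented for the example, and the function name `build_predictions` and the tie-breaking rule (ties count as taken) are assumptions, not something the slides specify.

```python
# Minimal sketch: derive a fixed per-branch prediction from profile data.

def build_predictions(profile):
    """profile: iterable of (branch_pc, taken) pairs from an earlier run.

    Returns {branch_pc: True if the branch was taken at least half the time}.
    """
    counts = {}
    for pc, taken in profile:
        taken_count, total = counts.get(pc, (0, 0))
        counts[pc] = (taken_count + int(taken), total + 1)
    return {pc: 2 * taken_count >= total
            for pc, (taken_count, total) in counts.items()}

# Hypothetical profile: branch at 0x120 mostly taken, branch at 0x130 mostly untaken
profile = ([(0x120, True)] * 9 + [(0x120, False)]
           + [(0x130, True)] * 2 + [(0x130, False)] * 3)

predictions = build_predictions(profile)
print(predictions)
```

The resulting table is fixed at compile time, so it only helps when later runs behave like the profiled one.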
Pipelining Static Branch
Prediction & Compiler Support
• Average # of instructions between
mispredictions: 20 (predict-as-taken) vs. 110 (profile-based)
• St.dev: 27 vs. 85 (very large)
Performance of the DLX Integer
Pipelining System
• Assumptions:
No misses
No clock overhead
Basic delayed branch + cancelling delayed
branch (1 cycle delay each)
• Results:
(Exercising five SPECint92 programs:)
9% – 23% of the instructions cause a 1 cycle
loss
Performance of the DLX Integer
Pipelining System
[Figure: stalls per program, with colors distinguishing branch stalls from load stalls]
• DLX average CPI : 1.11
• Speedup(5 SPECint92 prgs) = 5/1.11 ≈ 4.5
Pipelining Exceptions
• An exception is an event that is triggered
at run time due to the interaction with the
environment and results in a (temporary
or permanent) suspension of the current
application so as to handle the event
• Examples:
A key has been pressed (interrupt)
The user invokes a service of the OS
A breakpoint is encountered
A division-by-zero condition is encountered
An overflow or underflow condition
A NaN float
Misalignments
Access to protected or non existing memory
areas
Power failures…
Pipelining Exceptions
• What happens to the pipeline when
an exception takes place?
• With pipelining, instructions are no longer
“atomic”
• An instruction is further subdivided into
“stages”
• The instruction is only completed at the
end of the last stage
• If an interrupt occurs in the middle of a
committed instruction, the result may be
a half-finished instruction
Pipelining Exceptions
• Interrupt
An external event asks for immediate attention
(service) by raising an input line (the INT line)
The main program is interrupted wherever it is
A jump is made to the interrupt service routine
(ISR)
After processing the ISR, the main program
resumes where it was broken off
• A pipeline (or machine) is said to be
restartable if it can handle an exception
(e.g. an interrupt), save the state, and
restart without affecting the execution of
the program being interrupted
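The interrupt sequence and the restartable property can be illustrated with a minimal Python sketch. It is a toy model, not DLX: the `Machine` class, the single `addi` operation and the choice to take the interrupt only between instructions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy machine: a program counter and four registers (hypothetical)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)
    saved: list = field(default_factory=list)  # stack of saved states

    def step(self, program):
        """Execute one instruction; only ("addi", reg, constant) is modelled."""
        op, dst, val = program[self.pc]
        if op == "addi":
            self.regs[dst] += val
        self.pc += 1

    def interrupt(self, isr):
        """Save the state, run the ISR, then restore the state."""
        self.saved.append((self.pc, list(self.regs)))
        isr(self)  # the ISR may freely change pc and registers
        self.pc, self.regs = self.saved.pop()


def run(program, interrupt_at=None, isr=None):
    """Run `program`; optionally raise one interrupt just before `interrupt_at`."""
    m = Machine()
    while m.pc < len(program):
        if isr is not None and m.pc == interrupt_at:
            m.interrupt(isr)
            interrupt_at = None  # the interrupt is delivered only once
        m.step(program)
    return m.regs


def clobbering_isr(m):
    """An ISR that overwrites a register and the program counter."""
    m.regs[0] = 99
    m.pc = 0


program = [("addi", 0, 1), ("addi", 1, 2), ("addi", 0, 3)]
# The interrupted run ends in the same state as the undisturbed one.
assert run(program, interrupt_at=1, isr=clobbering_isr) == run(program)
```

Because the state is saved before the ISR runs and restored afterwards, the main program cannot tell that it was interrupted, which is what restartable means here.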
Pipelining Exceptions
• Precise exceptions: a property of a
pipelined machine such that
instructions just before the faulting
instruction are completed and
instructions after it can be
restarted from scratch
• Precise exceptions often imply a huge
performance penalty
• The IBM PowerPC and others adopt two
modes:
Precise exceptions mode (slow, for debugging)
Performance mode (imprecise, fast)
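As a rough illustration of the precise-exceptions property, the following Python sketch simulates a five-stage in-order pipeline. It is a simplification under stated assumptions, not the DLX design: faults are raised only in the EX stage (division by zero), results are written to the registers only in WB, and the instruction format `(dst, src1, op, src2)`, the register names and the function `run_pipeline` are all hypothetical.

```python
def run_pipeline(program, regs):
    """Simulate IF, ID, EX, MEM, WB; return (registers, faulting pc or None)."""
    regs = dict(regs)
    pipe = [None] * 5  # pipe[0..4] hold the instructions in IF, ID, EX, MEM, WB
    next_pc = 0

    while next_pc < len(program) or any(pipe):
        # WB: the oldest instruction commits its result to the registers.
        if pipe[4]:
            regs[pipe[4]["dst"]] = pipe[4]["result"]

        # Advance every instruction by one stage and fetch a new one.
        fetched = None
        if next_pc < len(program):
            dst, a, op, b = program[next_pc]
            fetched = {"pc": next_pc, "dst": dst, "a": a, "op": op, "b": b}
            next_pc += 1
        pipe = [fetched] + pipe[:4]

        # EX: compute the result, forwarding from older in-flight instructions.
        ins = pipe[2]
        if ins is None:
            continue

        def read(reg):
            for older in (pipe[3], pipe[4]):  # most recent producer first
                if older and older["dst"] == reg:
                    return older["result"]
            return regs[reg]

        x, y = read(ins["a"]), read(ins["b"])
        if ins["op"] == "div" and y == 0:
            # Precise exception: older instructions (WB first, then MEM) are
            # completed; the faulting and younger instructions are discarded.
            for older in (pipe[4], pipe[3]):
                if older:
                    regs[older["dst"]] = older["result"]
            return regs, ins["pc"]
        ins["result"] = x // y if ins["op"] == "div" else x + y

    return regs, None


program = [
    ("r1", "r0", "add", "r0"),
    ("r2", "r1", "add", "r1"),
    ("r3", "r2", "div", "r9"),  # faults while r9 == 0
    ("r4", "r1", "add", "r1"),
]
start = {"r0": 1, "r1": 0, "r2": 0, "r3": 0, "r4": 0, "r9": 0}

state, fault_pc = run_pipeline(program, start)
state["r9"] = 2  # the handler repairs the cause of the fault
final, _ = run_pipeline(program[fault_pc:], state)  # restart from scratch
```

In this model the restart is cheap because nothing younger than the faulting instruction has touched the registers. Real machines pay for the property when instructions complete out of order or take many cycles, which is one reason some designs offer a faster imprecise mode.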
Pipelining Exceptions
• In the DLX integer pipeline no instruction
updates the machine state before the end
of the MEM stage
• This makes realising precise exceptions
very easy
• The instructions that follow the faulting
one in the pipeline have not committed yet
• This is not true, e.g., for instructions
using the autodecrement addressing mode
of the VAX, which cause the update of
registers in the middle of the execution
of an instruction
Pipelining Exceptions
• If such an instruction is aborted due to an
exception, the machine state is
left altered
• Machines with these instructions often
have the ability to back out any state
change before the instruction has
committed
• If an exception occurs, the machine uses
this feature to reset the state of the
machine to its value before the
interrupted instruction started
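The back-out mechanism can be sketched in Python with an undo log. This is only an illustration of the idea, not a description of how the VAX implements it: the `Machine` class, the `PageFault` exception, the register names and the `load_autodecrement` operation are all hypothetical.

```python
class PageFault(Exception):
    """Raised when an address is not mapped (hypothetical exception)."""


class Machine:
    def __init__(self, regs, memory):
        self.regs = dict(regs)
        self.memory = dict(memory)  # address -> value; a missing key is unmapped
        self.undo_log = []          # (register, old value) pairs

    def write_reg(self, name, value):
        """Update a register, remembering the value it had before."""
        self.undo_log.append((name, self.regs[name]))
        self.regs[name] = value

    def load_autodecrement(self, dst, ptr, step=4):
        """dst <- memory[ptr - step], with ptr updated mid-instruction."""
        self.undo_log.clear()
        try:
            self.write_reg(ptr, self.regs[ptr] - step)  # state changes here...
            addr = self.regs[ptr]
            if addr not in self.memory:                 # ...and may fault here
                raise PageFault(addr)
            self.write_reg(dst, self.memory[addr])
        except PageFault:
            # Back out every change made since the instruction started.
            for name, old in reversed(self.undo_log):
                self.regs[name] = old
            self.undo_log.clear()
            raise
        self.undo_log.clear()  # the instruction has committed


m = Machine({"r1": 8, "r2": 0}, {0: 42})
try:
    m.load_autodecrement("r2", "r1")  # address 4 is unmapped
except PageFault:
    m.memory[4] = 7                   # the handler maps the missing location
    m.load_autodecrement("r2", "r1")  # the restarted instruction sees r1 == 8 again
```

Without the rollback, the retried instruction would decrement the pointer a second time and load from the wrong address.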
Pipelining Exceptions
• On the VAX and the 360 family, special
instructions use the general-purpose
registers as working storage
• In such machines, general-purpose registers
are always saved on exception and restored
after the exception
• The state of partially completed
instructions lies in these registers, so
saving and restoring them makes the
exceptions precise
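A small Python sketch can show why keeping the working state in the registers is enough. It models a block-copy instruction in the spirit of the string instructions on those machines. The function `movc`, the register names `src`, `dst` and `len`, and the `budget` parameter that simulates the arrival of an interrupt are assumptions for illustration, not the real instruction semantics.

```python
def movc(regs, memory, budget=None):
    """Copy regs["len"] cells from regs["src"] to regs["dst"].

    All progress is kept in `regs`. Returns True when the copy is complete and
    False when it was interrupted after `budget` cells.
    """
    while regs["len"] > 0:
        if budget is not None:
            if budget == 0:
                return False  # interrupted: regs describe the remaining work
            budget -= 1
        memory[regs["dst"]] = memory[regs["src"]]
        regs["src"] += 1
        regs["dst"] += 1
        regs["len"] -= 1
    return True


memory = list(range(10)) + [0] * 10
regs = {"src": 0, "dst": 10, "len": 10}

done = movc(regs, memory, budget=4)  # an interrupt arrives after 4 cells
saved = dict(regs)                   # registers are saved on the exception
regs = {"src": -1, "dst": -1, "len": -1}  # the handler reuses the registers
regs = dict(saved)                   # registers are restored afterwards
done = movc(regs, memory)            # re-executing the instruction finishes the copy
```

Re-executing the instruction with the restored registers continues from the first cell not yet copied, so no extra machinery is needed to record how far the instruction had progressed.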