Lecture 1: Introduction. Instruction Level Parallelism & Processor Architectures
Instruction Level Parallelism (ILP)
Simultaneous execution of multiple instructions.
do {
    Swap = 0;
    for (I = 0; I < Last; I++) {
        if (Tab[I] > Tab[I+1]) {
            Temp = Tab[I];
            Tab[I] = Tab[I+1];
            Tab[I+1] = Temp;
            Swap = 1;
        }
    }
} while (Swap);
Barriers to detecting ILP
Control dependences
• Arise due to conditional branches
Data dependences
• Register dependences
• Memory dependences
Branches
j = 0;
*q = false;
while ((*q == false) && (j != 8)) {
    j = j + 1;
    *q = false;
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j + 7] == true)) {
        x[i] = j;
        b[j] = false;
        a[i+j] = false;
        c[i-j + 7] = false;
        ...
[Figure: control-flow graph of the code above, with branch nodes while ((*q ..., if (b[j]), if (a[i+j]), if (c[i-j+7]), if ( …, and the block x[i] = j; ...]
Frequent Branches
Sequence of branch instructions in the dynamic stream separated by at most one non-branch instruction.
[Figure: bar chart of Dynamic Branches [%] (axis 0 to 70) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and INT.]
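The definition above can be turned into a small trace counter. This is a minimal sketch, assuming the dynamic stream is given as an array of flags and that a branch counts as "frequent" when another branch lies at most two positions away (that is, with at most one non-branch instruction between them). The name `count_frequent_branches` and the exact counting rule are illustrative assumptions, not taken from the lecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* is_branch[i] is true when the i-th dynamic instruction is a branch.
 * Returns how many branches have a neighbouring branch separated from
 * them by at most one non-branch instruction (distance 1 or 2). */
size_t count_frequent_branches(const bool *is_branch, size_t n)
{
    assert(is_branch != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_branch[i])
            continue;
        /* Look one and two positions back, then one and two ahead. */
        bool near_prev = (i >= 1 && is_branch[i - 1]) ||
                         (i >= 2 && is_branch[i - 2]);
        bool near_next = (i + 1 < n && is_branch[i + 1]) ||
                         (i + 2 < n && is_branch[i + 2]);
        if (near_prev || near_next)
            count++;
    }
    return count;
}
```

Dividing the result by the total number of branches in the trace gives a percentage of the kind plotted on the slide.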
Branch Prediction Accuracy of gshare
[Figure: bar chart of Prediction Accuracy [%] (axis 0 to 100) for go, m88ksim, gcc, xlisp, perl, and vortex.]
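For readers unfamiliar with gshare: it indexes a table of 2-bit saturating counters with the XOR of the branch address and a global branch history register. The sketch below shows that idea only. The table size (`GSHARE_BITS`), the initial counter value, and all names are assumptions made for illustration, not the configuration measured in the chart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GSHARE_BITS 12                     /* assumed: 4096 counters */
#define GSHARE_SIZE (1u << GSHARE_BITS)

typedef struct {
    uint8_t  counters[GSHARE_SIZE];  /* 2-bit saturating counters, 0..3 */
    uint32_t history;                /* global history, newest outcome in bit 0 */
} Gshare;

void gshare_init(Gshare *g)
{
    assert(g != NULL);
    for (uint32_t i = 0; i < GSHARE_SIZE; i++)
        g->counters[i] = 1;          /* start weakly not-taken */
    g->history = 0;
}

/* XOR the (word-aligned) branch address with the global history. */
static uint32_t gshare_index(const Gshare *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) & (GSHARE_SIZE - 1);
}

bool gshare_predict(const Gshare *g, uint32_t pc)
{
    return g->counters[gshare_index(g, pc)] >= 2;   /* 2 or 3 means taken */
}

void gshare_update(Gshare *g, uint32_t pc, bool taken)
{
    uint32_t i = gshare_index(g, pc);
    if (taken && g->counters[i] < 3)
        g->counters[i]++;
    if (!taken && g->counters[i] > 0)
        g->counters[i]--;
    /* Shift the outcome into the history, keeping GSHARE_BITS bits. */
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```

A simulator would call `gshare_predict` at fetch and `gshare_update` once the branch resolves; accuracy is the fraction of branches for which the two agree.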
Memory Dependences
Reordering of memory instructions, loads and stores, is not always possible.
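Example 1, original order:
    Store R1, addr
    Load  R2, addr’
    Add   R1, R2
Reordered (valid if addr != addr’):
    Load  R2, addr’
    Store R1, addr
    Add   R1, R2
Example 2, original order:
    Store R5, addr
    Store R2, addr’
    Load  R1, addr’
    Add   R1, R3
Reordered (valid if addr != addr’):
    Store R2, addr’
    Load  R1, addr’
    Store R5, addr
    Add   R1, R3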
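The condition addr != addr’ can be written as a small legality check. This is a sketch under simplifying assumptions: addresses are fully known, all accesses have the same size, and two loads never conflict. The types `MemOp` and `MemKind` and the functions `can_reorder` and `can_hoist` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MEM_LOAD, MEM_STORE } MemKind;

typedef struct {
    MemKind   kind;
    uintptr_t addr;   /* assumed fully resolved, same access size everywhere */
} MemOp;

/* True when `later` may be moved ahead of `earlier`. */
bool can_reorder(MemOp earlier, MemOp later)
{
    if (earlier.kind == MEM_LOAD && later.kind == MEM_LOAD)
        return true;                      /* two loads never conflict */
    return earlier.addr != later.addr;    /* a store is involved: addresses must differ */
}

/* True when ops[idx] may be moved to the front of the sequence,
 * i.e. it can pass every memory operation that precedes it. */
bool can_hoist(const MemOp *ops, size_t n, size_t idx)
{
    assert(ops != NULL && idx < n);
    for (size_t i = 0; i < idx; i++)
        if (!can_reorder(ops[i], ops[idx]))
            return false;
    return true;
}
```

In hardware the difficulty is that the addresses are often not yet known when the reordering decision has to be made, which is why disambiguation has to predict or speculate.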
Memory Disambiguation
[Figure: instructions per cycle (axis 0 to 20) versus issue width (8, 16, 32), comparing perfect and simple memory disambiguation.]
Value-based Store-set disambiguator
[Figure: IPC (axis 0 to 20) versus issue width (8, 16, 32), comparing perfect and value-based disambiguation.]
Register Dependences
• True data dependences
• False data dependences
With false dependences (R1 reused):
    Add R2, R3
    Load R2, .
    Add R1, R2
    Load R1, ..
    Sub R1, R2
    Load R1, .
    Load R3, .
After renaming (the Load/Sub pair uses R4 instead of R1):
    Add R2, R3
    Load R2, .
    Add R1, R2
    Load R4, ..
    Sub R4, R2
    Load R1, .
    Load R3, .
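Renaming of the kind shown above can be sketched as a map from architectural to physical registers: sources read the current mapping and every destination receives a fresh physical register, which removes WAR and WAW dependences while keeping true dependences. The sketch assumes an unlimited supply of physical registers (no free list) and a two-source instruction format. `Instr`, `NUM_ARCH_REGS`, and `rename_registers` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ARCH_REGS 8   /* assumed number of architectural registers */

typedef struct {
    int dst;    /* destination register */
    int src1;   /* source registers, -1 when unused */
    int src2;
} Instr;

/* Renames the sequence in place. Physical registers 0..NUM_ARCH_REGS-1
 * hold the initial architectural state; new ones are numbered upward.
 * Returns the total number of physical registers in use afterwards. */
int rename_registers(Instr *code, size_t n)
{
    assert(code != NULL || n == 0);
    int map[NUM_ARCH_REGS];
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = r;
    int next_phys = NUM_ARCH_REGS;

    for (size_t i = 0; i < n; i++) {
        Instr *in = &code[i];
        assert(in->dst >= 0 && in->dst < NUM_ARCH_REGS);
        assert(in->src1 >= -1 && in->src1 < NUM_ARCH_REGS);
        assert(in->src2 >= -1 && in->src2 < NUM_ARCH_REGS);
        /* Sources read the most recent mapping (true dependences kept). */
        if (in->src1 >= 0) in->src1 = map[in->src1];
        if (in->src2 >= 0) in->src2 = map[in->src2];
        /* The destination gets a fresh register (false dependences removed). */
        map[in->dst] = next_phys;
        in->dst = next_phys++;
    }
    return next_phys;
}
```

For example, the sequence Load R1; Sub R1, R2; Load R1; Add R1, R2 (with Sub and Add reading and writing R1) ends up with the two Load/use pairs in disjoint physical registers, so the second pair no longer has to wait for the first.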
Window Size vs ILP (issue width = 16)
[Figure: instructions per cycle (axis 3 to 9) versus window size (8, 16, 32, 64, 128, 256, 512, 1024).]
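The effect of window size on ILP can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not the simulator behind the chart: every instruction has unit latency, branches and memory are ignored, the window is the contiguous run of `window` instructions starting at the oldest unissued one, and up to `width` ready instructions issue per cycle. `TraceOp` and `schedule_cycles` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int dep1;   /* index of an earlier producer instruction, -1 if none */
    int dep2;
} TraceOp;

/* A dependence is satisfied once its producer issued in an earlier cycle. */
static bool dep_ready(const size_t *issued_at, int dep, size_t cycle)
{
    return dep < 0 || (issued_at[dep] != 0 && issued_at[dep] < cycle);
}

/* Returns the number of cycles needed to issue all n instructions. */
size_t schedule_cycles(const TraceOp *trace, size_t n, size_t window, size_t width)
{
    assert(window > 0 && width > 0);
    if (n == 0)
        return 0;
    size_t *issued_at = calloc(n, sizeof *issued_at);   /* 0 = not issued yet */
    assert(issued_at != NULL);

    size_t head = 0;    /* oldest instruction that has not issued */
    size_t cycle = 0;
    while (head < n) {
        cycle++;
        size_t issued = 0;
        size_t end = (head + window < n) ? head + window : n;
        for (size_t i = head; i < end && issued < width; i++) {
            if (issued_at[i] != 0)
                continue;
            assert(trace[i].dep1 < (int)i && trace[i].dep2 < (int)i);
            if (dep_ready(issued_at, trace[i].dep1, cycle) &&
                dep_ready(issued_at, trace[i].dep2, cycle)) {
                issued_at[i] = cycle;
                issued++;
            }
        }
        while (head < n && issued_at[head] != 0)
            head++;     /* slide the window past issued instructions */
    }
    free(issued_at);
    return cycle;
}
```

IPC is then `n` divided by the returned cycle count. With two independent two-instruction chains and `width = 2`, a window of 1 needs 4 cycles, a window of 2 needs 3, and a window of 4 needs 2, which mirrors the trend in the chart: a larger window exposes more independent work.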
Parallelism Study - ILP in SPEC95
[Figure: instructions per cycle (axis 0 to 30) versus window size (8, 32, 128, 512, 2048, 8192) for 8-issue, 16-issue, 32-issue, and 64-issue configurations.]
Conclusions
• There is ample parallelism to scale the issue width.
• Very large instruction windows must be implemented.
• A highly accurate memory disambiguation mechanism is required.
• Highly accurate branch prediction must be performed.
• Register dependences should be avoided.
Processors
• Pipelined
• Advanced Pipelining
• Superscalars
• Very Long Instruction Word (VLIW)
• Multiprocessors/Multicores
Pipelined Processors
In-order, overlapped execution of instructions. E.g., a 5-stage pipeline: instruction fetch (F), decode and register operand fetch (D), execute (E), memory operand fetch (M), and write-back of results (WB).
F  D  E  M  WB
   F  D  E  M  WB
      F  D  E  M  WB
MIPS R4000 has an 8-stage pipeline.
Causes of Pipeline Delays
Data dependences
• RAW hazards: handled by register bypass and by code reordering in the compiler.
Register hazards
• WAW hazards: instructions may reach the WB stage out of order.
• No WAR hazards.
Branch delays
• The compiler fills branch delay slots vs. the hardware performs branch prediction.
Structural hazards
• Due to nonpipelined units.
• Register writes when multiple instructions reach the WB stage at the same time (issue vs. retire rate).
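To make the RAW case concrete, the sketch below counts load-use stalls. It assumes the classic 5-stage pipeline with full bypassing, where the only remaining RAW stall is one cycle when an instruction uses the result of the load immediately before it. That pipeline model, `PipeInstr`, and `load_use_stalls` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool is_load;
    int  dst;    /* destination register, -1 if none */
    int  src1;   /* source registers, -1 if unused */
    int  src2;
} PipeInstr;

/* Counts one stall cycle for every instruction that reads the
 * destination of the load directly in front of it. */
size_t load_use_stalls(const PipeInstr *code, size_t n)
{
    assert(code != NULL || n == 0);
    size_t stalls = 0;
    for (size_t i = 1; i < n; i++) {
        const PipeInstr *prev = &code[i - 1];
        if (prev->is_load && prev->dst >= 0 &&
            (code[i].src1 == prev->dst || code[i].src2 == prev->dst))
            stalls++;
    }
    return stalls;
}
```

Moving an independent instruction between the load and its consumer brings the count to zero, which is the compiler code reordering mentioned above.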
Advanced Pipelining
In-order issue but out-of-order execution
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14
Execute SUBD before ADDD
Dynamic scheduling: scoreboard, Tomasulo’s algorithm
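Why SUBD may run before ADDD can be checked mechanically by classifying the hazards between each pair of instructions. The sketch assumes a three-operand format (destination, source 1, source 2) with registers given by number; `FpInstr` and `hazards` are illustrative names.

```c
#include <assert.h>

enum { HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

typedef struct {
    int dst;
    int src1;
    int src2;
} FpInstr;

/* Returns a bitmask of the hazards that `later` has on `earlier`. */
int hazards(FpInstr earlier, FpInstr later)
{
    assert(earlier.dst >= 0 && later.dst >= 0);
    int h = 0;
    if (later.src1 == earlier.dst || later.src2 == earlier.dst)
        h |= HAZ_RAW;   /* later reads what earlier writes */
    if (later.dst == earlier.src1 || later.dst == earlier.src2)
        h |= HAZ_WAR;   /* later overwrites what earlier reads */
    if (later.dst == earlier.dst)
        h |= HAZ_WAW;   /* both write the same register */
    return h;
}
```

For the three instructions above, ADDD has a RAW hazard on DIVD (F0), SUBD has no hazard on DIVD, and SUBD has only a WAR hazard on ADDD (F8). SUBD can therefore execute while ADDD waits for the divide, provided ADDD still reads the old value of F8.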
Superscalar Processors
• Multiple instructions can be issued in each cycle.
• Speculative Execution is incorporated (commit or discard results).
AMD-K7 is a 9-issue superscalar.
F  D  E  M  WB
F  D  E  M  WB
F  D  E  M  WB
F  D  E  M  WB
F  D  E  M  WB
F  D  E  M  WB
PowerPC is a 4-issue superscalar.
VLIW
• Each long instruction contains multiple operations that are executed in parallel.
• Compiler performs speculation and recovery.
F  D  (E E E E)  WB
F  D  (E E E E)  WB
Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits).
Itanium: 128-bit instruction bundle holding 3 operations (41 bits each) and a 5-bit template.
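The Itanium format can be illustrated by packing and unpacking a bundle. This sketch assumes the IA-64 layout of a 5-bit template in the lowest bits followed by three 41-bit slots (5 + 3 × 41 = 128), and represents the bundle as two 64-bit halves. `Bundle`, `DecodedBundle`, `encode_bundle`, and `decode_bundle` are hypothetical names, and the slot contents are treated as opaque bit fields.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BITS 41
#define SLOT_MASK ((UINT64_C(1) << SLOT_BITS) - 1)

typedef struct {
    uint64_t lo;   /* bundle bits 0..63 */
    uint64_t hi;   /* bundle bits 64..127 */
} Bundle;

typedef struct {
    uint8_t  tmpl;      /* 5-bit template */
    uint64_t slot[3];   /* three 41-bit operations */
} DecodedBundle;

Bundle encode_bundle(uint8_t tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    assert(tmpl < 32 && s0 <= SLOT_MASK && s1 <= SLOT_MASK && s2 <= SLOT_MASK);
    Bundle b;
    /* Slot 1 straddles the halves: 18 bits in lo, 23 bits in hi. */
    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

DecodedBundle decode_bundle(Bundle b)
{
    DecodedBundle d;
    d.tmpl    = (uint8_t)(b.lo & 0x1F);                     /* bits 0..4    */
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;                    /* bits 5..45   */
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46..86  */
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;                   /* bits 87..127 */
    return d;
}
```

The template tells the hardware which kind of execution unit each slot targets, so no decoding of the slots themselves is needed to route them.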
Control Dependences - Instruction Window
Superscalar
Hardware branch prediction guides fetching of instructions to fill up the processor’s instruction window. Instructions are issued from the window as they become ready, that is, out-of-order execution is possible.
VLIW
Programs are first profiled. The compiler uses the profiles to trace out likely paths. A trace is a software instruction window. Instruction reordering is performed by the compiler within the trace.
Data Dependences - Exploiting ILP
Superscalar
Memory dependences: HW load-store disambiguation techniques are used to enable out-of-order execution.
False register dependences: Avoided using register renaming.
True data dependences: Must be honored. Value prediction allows out-of-order execution of dependent instructions.
VLIW
Memory dependences: Detected by the compiler using dependency analysis or address profiling.
False data dependences: Avoided by the compiler through renaming (memory) and register allocation.
True data dependences: Strictly followed. Reordering is possible with HW support.
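Value prediction, mentioned above for superscalars, can be illustrated with a last-value predictor: it guesses that an instruction will produce the same result as the previous time it executed, so dependent instructions can start early and are squashed if the guess was wrong. The lecture does not say which predictor is meant; last-value prediction is swapped in here as the simplest example, and the table size and all names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VP_ENTRIES 256   /* assumed table size */

typedef struct {
    uint64_t last[VP_ENTRIES];    /* last result seen per table entry */
    bool     valid[VP_ENTRIES];   /* entry has been trained at least once */
} ValuePredictor;

void vp_init(ValuePredictor *vp)
{
    assert(vp != NULL);
    for (size_t i = 0; i < VP_ENTRIES; i++) {
        vp->last[i] = 0;
        vp->valid[i] = false;
    }
}

/* Direct-mapped index from the (word-aligned) instruction address. */
static size_t vp_index(uint32_t pc)
{
    return (pc >> 2) % VP_ENTRIES;
}

/* Writes the predicted result to *value and returns true when a
 * prediction is available for the instruction at `pc`. */
bool vp_predict(const ValuePredictor *vp, uint32_t pc, uint64_t *value)
{
    assert(vp != NULL && value != NULL);
    size_t i = vp_index(pc);
    if (!vp->valid[i])
        return false;
    *value = vp->last[i];
    return true;
}

/* Records the actual result once the instruction completes. */
void vp_train(ValuePredictor *vp, uint32_t pc, uint64_t actual)
{
    assert(vp != NULL);
    size_t i = vp_index(pc);
    vp->last[i] = actual;
    vp->valid[i] = true;
}
```

A consumer that ran with a predicted value must be re-executed when the actual result differs, so the scheme pays off only for instructions whose results repeat often.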