Evolution of the ILP Processing
Dezső Sima
Fall 2007
(Ver. 2.0) Dezső Sima, 2007
Foreword
The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark subsequent evolutionary phases of microprocessors.
On the other hand the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and to more advanced branch processing. The desire to further increase per cycle performance of first generation superscalars called for avoiding their issue bottleneck by the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.
With the main dimensions of parallelism more or less exhausted in second generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to raise performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.
Structure
1. Paradigms of ILP-processing
2. Introduction of temporal parallelism
3. Introduction of issue parallelism
 3.1. VLIW processing
 3.2. Superscalar processing
4. Introduction of data parallelism
5. The main road of evolution
6. Outlook
1. Paradigms of ILP-processing
Figure 1.1: Evolution of computer classes
[The chart traces the computer classes over 1950-2000: mainframes (UNIVAC, /360, /370, /390, z/900), minicomputers (PDP-8, PDP-11, VAX), supercomputers (ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E) and microcomputers (4004, Altair, 8080/8088, 80286, 80386, 80486, Pentium, PII, PIII, P4), the microcomputer class splitting into server/workstation (RS/6000, Xeon), desktop PC and value PC (Celeron).]
1.1. Introduction (1)
Figure 1.2: The integer performance of Intel's x86 line of processors
[The chart plots SPECint92 performance on a logarithmic scale (0.2 to 10000) against the years 1979-2005: from the 8088/5 and 8088/8 through the 80286/10-12, 386/16-33, 486/25-100, Pentium/66-200, Pentium Pro/200, PII/300-450, PIII/500-1000 and P4/1500-3200 (Northwood B, Prescott 1M/2M cores). Performance grows at a rate of roughly 100×/10 years, then levels off.]
1.1. Introduction (2)
1.2. Paradigms of ILP-processing (1)

Paradigms of ILP processing:
• Temporal parallelism → pipeline processors
• Issue parallelism
 – Static dependency resolution → VLIW processors
 – Dynamic dependency resolution → superscalar processors
• Data parallelism → SIMD extension

VLIW processing (VLIW: Very Long Instruction Word): the processor fetches (FE) instructions that contain only independent sub-instructions; dependencies are resolved statically, by the compiler.

Superscalar processing: the processor fetches dependent instructions; dependencies are resolved dynamically, in hardware.
1.2. Paradigms of ILP-processing (2)

Figure 1.3: The emergence of ILP-paradigms and processor types
[The figure shows the succession: sequential processing → temporal parallelism (pipeline processors, ~'85) → issue parallelism (~'90: superscalar processors with dynamic dependency resolution; VLIW, later EPIC, processors with static dependency resolution) → data parallelism (~'95-'00: superscalar processors with SIMD extension).]
1.3. Performance potential of ILP-processors (1)

Absolute performance (ideal vs. real case):
• Sequential processing: Pa = fc · (1/CPI), with CPI > 1
• Pipeline processing: Pa = fc · (1/CPI), with CPI → 1
• VLIW/superscalar: Pa = fc · IPC, with IPC > 1
• SIMD extension: Pa = fc · IPC · OPI, with OPI > 1

1.3. Performance potential of ILP-processors (2)

Performance components of ILP-processors:

Pa = fc · IPC · OPI = fc · IPCeff, with IPCeff = IPC · OPI

• Clock frequency fc: depends on technology and μarchitecture
• Per-cycle efficiency IPCeff: depends on the ISA, μarchitecture, system architecture, OS, compiler and application; it combines temporal parallelism, issue parallelism, data parallelism and the efficiency of speculative execution
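The decomposition above can be sketched numerically; the sample clock rates and per-cycle values below are hypothetical, chosen only to illustrate the four cases:

```python
# Absolute performance following the decomposition Pa = fc * IPC * OPI.
# All processor parameters below are hypothetical sample values.

def absolute_performance(f_c_hz, ipc, opi):
    """Operations/second = clock rate * instructions/cycle * operations/instruction."""
    return f_c_hz * ipc * opi

# Sequential processor: CPI > 1, i.e. IPC < 1; no SIMD (OPI = 1)
p_seq = absolute_performance(50e6, 1 / 4, 1)   # 12.5 MOPS

# Pipeline processor: CPI -> 1
p_pipe = absolute_performance(50e6, 1, 1)      # 50 MOPS

# Superscalar: IPC > 1
p_ss = absolute_performance(200e6, 2, 1)       # 400 MOPS

# Superscalar with SIMD extension: OPI > 1
p_simd = absolute_performance(200e6, 2, 4)     # 1600 MOPS
```

Each evolutionary step improves exactly one factor of the product, which is why the paradigms compose rather than compete.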
2. Introduction of temporal parallelism
(F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle)

Types of temporal parallelism in ILP processors:
• Sequential processing: the phases F D E W of instruction i finish before instruction i+1 starts (early mainframes)
• Prefetching: overlapping the fetch phase with the further phases (mainframes: Stretch, 1961; microprocessors: i80286, 1982; M68020, 1985)
• Pipelined EUs: overlapping the execute phases E1 E2 E3 of consecutive instructions (IBM 360/91, 1967; CDC 7600, 1969)
• Pipeline processors: overlapping all phases F D E W of consecutive instructions (Atlas, 1963; IBM 360/91, 1967; microprocessors: R2000, 1988; i80386, 1985; M68030, 1988)

2.1. Introduction (1)

Figure 2.1: Implementation alternatives of temporal parallelism
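The gain from overlapping all phases can be illustrated with a simple cycle-count model (an idealized sketch assuming k stages and no stalls; not taken from the slides):

```python
# Cycle counts for n instructions on a k-stage pipeline vs. sequential
# execution (idealized: no hazards, one stage per clock cycle).

def sequential_cycles(n, k=4):
    return n * k                 # each instruction runs F, D, E, W alone

def pipelined_cycles(n, k=4):
    return k + (n - 1)           # after the fill phase, one instruction completes per cycle

n = 100
print(sequential_cycles(n))      # 400
print(pipelined_cycles(n))       # 103
print(sequential_cycles(n) / pipelined_cycles(n))  # ~3.9, approaching k = 4
```

For long instruction streams the speedup approaches the stage count k, which is exactly the "temporal parallelism" exploited by pipeline processors.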
2.1. Introduction (2)
Figure 2.2: The appearance of pipeline processors
[The figure plots the pipeline (scalar) processors per family over 1980-1992: x86 (80286, 80386, 80486), M68000 (68020, 68030, 68040) and MIPS R (R2000, R3000, R6000, R4000).]
2.2. Processing bottlenecks evoked and their resolution
• The scarcity of memory bandwidth (2.2.2)
• The problem of branch processing (2.2.3)

2.2.1. Overview
2.2.2. The scarcity of memory bandwidth (1)
Sequential processing → pipeline processing: more instructions and data need to be fetched per cycle → larger memory bandwidth is required.
2.2.2. The scarcity of memory bandwidth (2)
Figure 2.3: Introduction of caches
[The figure marks, over 1980-1992, the transition from pipeline (scalar) processors without caches (80286, 68020, R2000) to pipeline (scalar) processors with caches (80386, 80486; 68030, 68040; R3000, R6000, R4000), with cache configurations ranging from C(0,1/4) up to C(16) and C(8,8).]
C(n): universal cache (size in kB); C(n/m): instruction/data cache (sizes in kB)
2.2.3. The problem of branch processing (1)

(E.g. in case of conditional branches)

Figure 2.4: Processing of a conditional branch on a 4-stage pipeline
[A conditional branch bc fetched in cycle i passes through F, D, E, W; the branch address is calculated during decode and the condition is checked (branch!) during execute, so the fetch of the branch target instruction bti can begin only in cycle i+4. The sequentially fetched instructions i+1, i+2 are wasted if the branch is taken.]
2.2.3. The problem of branch processing (2)

Figure 2.5: Principle of branch prediction in case of a conditional branch
[At each conditional branch the processor guesses the outcome and continues fetching along the guessed path through the subsequent basic blocks; once the condition is resolved, the guessed path is either approved or discarded. Instructions other than conditional branches do not interrupt fetching.]
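The slides do not name a concrete prediction scheme; as an illustration, a widely used dynamic predictor is the 2-bit saturating counter, sketched here:

```python
# A 2-bit saturating counter, a classic dynamic branch prediction scheme
# (shown as a representative example; the slides do not specify one).
# States 0-1 predict "not taken", states 2-3 predict "taken".

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, one exit (not taken), taken 8 times again.
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "/", len(outcomes))  # 16 / 17
```

The two-bit hysteresis is what makes the single loop exit cost only one misprediction instead of two on re-entry.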
2.2.3. The problem of branch processing (3)
Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors
[The figure extends Figure 2.3: among the (scalar) pipeline processors of 1980-1992 (x86: 80286-80486; M68000: 68020-68040; MIPS R: R2000-R4000, with their cache configurations), it marks the models that already perform speculative execution of branches.]
2.3. Generations of pipeline processors (1)

Generation                   Cache   Speculative branch processing
1. generation pipelined      no      no
1.5. generation pipelined    yes     no
2. generation pipelined      yes     yes
2.3. Generations of pipeline processors (2)
Figure 2.7: Generations of pipeline processors
[The figure classifies the pipeline processors of 1980-1992 (x86: 80286-80486; M68000: 68020-68040; MIPS R: R2000-R4000) as 1. generation (no cache, no speculative branch processing), 1.5. generation (cache, no speculative branch processing) and 2. generation (cache, speculative branch processing).]
2. generation pipeline processors already exhaust the available temporal parallelism
2.4. Exhausting the available temporal parallelism
3. Introduction of issue parallelism
Pipeline processing →
• VLIW (EPIC) instruction issue: static dependency resolution (3.2)
• Superscalar instruction issue: dynamic dependency resolution (3.3)

3.1. Options to implement issue parallelism
3.2. VLIW processing (1)
Figure 3.1: Principle of VLIW processing
[Memory/cache supplies VLIW instructions with independent sub-instructions (static dependency resolution) to the VLIW processor, which feeds them directly to its ~10-30 execution units (EUs).]
3.2. VLIW processing (2)
VLIW: Very Long Instruction Word
• Term coined in 1983 (Fisher)
• Length of sub-instructions: ~32 bit
• Instruction length: ~n · 32 bit (n: number of execution units (EUs))
• Requires a complex VLIW compiler: static dependency resolution with parallel optimization
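The compiler's job of static dependency resolution can be caricatured with a toy greedy packer; the register-naming scheme and the packing rule are illustrative assumptions, not a real VLIW scheduler:

```python
# Toy sketch of static dependency resolution: a "compiler" packs
# instructions into VLIW words of n_slots slots, keeping two instructions
# in the same word only if neither reads a register written earlier in
# that word. Real VLIW schedulers also model EU latencies etc.

def pack_vliw(instrs, n_slots):
    """instrs: list of (dest, src1, src2) register names. Greedy packing."""
    words, current, written = [], [], set()
    for dest, s1, s2 in instrs:
        depends = s1 in written or s2 in written
        if depends or len(current) == n_slots:
            words.append(current)          # close the current long word
            current, written = [], set()
        current.append((dest, s1, s2))
        written.add(dest)
    if current:
        words.append(current)
    return words

prog = [("r1", "r8", "r9"),   # r1 <- r8 op r9
        ("r2", "r8", "r7"),   # independent of the first
        ("r3", "r1", "r2"),   # depends on both -> starts a new word
        ("r4", "r8", "r8")]   # independent of r3, shares its word
words = pack_vliw(prog, n_slots=4)
print(len(words))  # 2
```

Note that the second long word is only half filled: this is exactly the "partially filled VLIW instructions" drawback discussed below.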
3.2. VLIW processing (3)
Figure 3.2: Experimental and commercially available VLIW processors
The term ‘VLIW’
Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4)
Benefits of static dependency resolution:
• Earlier appearance
• Either higher fc or larger ILP
• Less complex processors
3.2. VLIW processing (5)
Drawbacks of static dependency resolution:
• The compiler uses technology dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions
• Completely new ISA:
 – new compilers, OS
 – rewriting of applications
 – achieving the critical mass to convince the market
3.2. VLIW processing (6)
Drawbacks of static dependency resolution (cont.):
• VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized
3.2. VLIW processing (7)
Commercial VLIW processors:
• Trace (Multiflow, 1987)
• Cydra-5 (Cydrome, 1989)
Within a few years both firms went bankrupt; their developers moved to HP and IBM and became initiators/developers of EPIC processors.
3.2. VLIW processing (8)
VLIW → EPIC: integration of SIMD instructions and advanced superscalar features
• 1994: Intel and HP announced their cooperation
• 1997: the term EPIC was coined
• 2001: IA-64 Itanium
3.3. Superscalar processing
3.3.1. Introduction (1)
Main attributes of superscalar processing:
• Pipeline processing extended with superscalar instruction issue
• Dynamic dependency resolution
• Compatible ISA

3.3.1. Introduction (2)
Figure 3.3: Experimental superscalar processors (Source: Sima et al., ACA, Addison-Wesley, 1997)
[Chart: over 1987-1996, for the RISC lines Intel 960 (960KA/KB), M 88000 (MC 88100), HP PA (PA 7000), SPARC (MicroSparc), MIPS R (R 4000), Am 29000 (29040), IBM Power, DEC Alpha and PowerPC, and the CISC lines Intel x86 (i486), M 68000 (M 68040), Gmicro (Gmicro/100p), AMD K5 and Cyrix M1, the chart marks the first superscalar models with their issue width in parentheses: 960CA (3), Power1 (4) = RS/6000, 21064 (2), MC 88110 (2), PA 7100 (2), SuperSparc (3), R 8000 (4), PPC 601 (3), PPC 603 (3), 29000 sup (4), Pentium (2), M 68060 (2), Gmicro/500 (2), K5 (4), M1 (2).]
3.3.1. Introduction (3)
Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)
• Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
• Core: static branch prediction
• Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
• Examples: Pentium, PA 7100, Alpha 21064
Consistency of processor features (1)

3.3.2. Attributes of first generation superscalars (2)

Dynamic instruction frequencies in general purpose applications (Wall 1989; Lam, Wilson 1992):
• FX instructions ~40 %
• Load instructions ~30 %
• Store instructions ~10 %
• Branches ~20 %
• FP instructions ~1-5 %

Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle
(Source: Sima et al., ACA, Addison-Wesley, 1997)

Reasonable core width: 2-3 instructions/cycle

Required EUs (each L/S instruction generates an address calculation as well):
• FX: ~0.8 · (2-3) = 1.6-2.4 → 2-3 FX EUs
• L/S: ~0.4 · (2-3) = 0.8-1.2 → 1 L/S EU
• Branch: ~0.2 · (2-3) = 0.4-0.6 → 1 B EU
• FP: ~(0.01-0.05) · (2-3) → 1 FP EU

Required number of data cache ports: np ~ 0.4 · (2-3) = 0.8-1.2 → single-port data caches
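The arithmetic above can be restated as a small computation; the mix values are the ones quoted from Wall 1989:

```python
# The slide's execution-unit calculation restated. Each load/store also
# needs an FX address calculation, hence FX work = FX + load + store
# fractions of the issue width.

mix = {"fx": 0.4, "load": 0.3, "store": 0.1, "branch": 0.2}
issue_width = 3  # first-generation superscalars: 2-3 instructions/cycle

fx_work = (mix["fx"] + mix["load"] + mix["store"]) * issue_width  # ~2.4 -> 2-3 FX EUs
ls_work = (mix["load"] + mix["store"]) * issue_width              # ~1.2 -> 1 L/S EU
br_work = mix["branch"] * issue_width                             # ~0.6 -> 1 branch EU
cache_ports = ls_work                                             # ~1.2 -> single-ported L1 barely suffices

print(round(fx_work, 2), round(ls_work, 2), round(br_work, 2))
```

Rerunning the same computation with `issue_width = 5` reproduces the second-generation figures given later (3-4 FX EUs, 2 L/S EUs, dual-ported caches).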
Consistency of processor features (2)

3.3.2. Attributes of first generation superscalars (3)

The issue bottleneck

Figure 3.5: The principle of direct issue
(a) Simplified structure of the microarchitecture assuming direct issue: I-cache → I-buffer → instruction window (3 entries) → decode, check, issue → EUs
(b) The issue process: in each cycle the executable instructions at the head of the instruction window are issued; dependent instructions block instruction issue, so younger instructions cannot issue even if they are executable themselves (window i1 i2 i3 in cycle Ci, i2 i3 in Ci+1, i4 i5 i6 in Ci+2).
3.3.3. The bottleneck evoked and its resolution (1)
[I-cache → I-buffer → decode/issue → shelving buffers → dependency checking/issue → EUs. Instructions are dispatched without checking for dependences to the shelving buffers (reservation stations); shelved instructions that are not dependent are issued for execution to the EUs.]

3.3.3. The bottleneck evoked and its resolution (2)

Figure 3.6: Principle of the buffered (out of order) issue

Eliminating the issue bottleneck
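The difference between direct (blocking) issue and buffered (out-of-order) issue can be sketched with a toy cycle-count simulation; single-cycle latencies and the window mechanics are simplifying assumptions:

```python
# Toy comparison of direct (in-order, blocking) issue vs. buffered
# (out-of-order) issue. Each instruction is (name, set of producers);
# it may issue once its producers issued in an earlier cycle.
# Idealized single-cycle latency, illustrative only.

def issue_cycles(instrs, width, out_of_order):
    started, pending, cycle = set(), list(instrs), 0
    while pending:
        cycle += 1
        snapshot = set(started)          # results visible at cycle start
        issued = []
        for name, deps in pending:
            if len(issued) == width:
                break
            if deps <= snapshot:
                issued.append(name)
            elif not out_of_order:
                break                    # dependent instruction blocks all younger ones
        started.update(issued)
        pending = [(n, d) for n, d in pending if n not in issued]
    return cycle

prog = [("i1", set()), ("i2", {"i1"}), ("i3", set()), ("i4", set())]
print(issue_cycles(prog, 2, out_of_order=False))  # 3
print(issue_cycles(prog, 2, out_of_order=True))   # 2
```

With direct issue, the dependent i2 stalls the independent i3 and i4 behind it; with shelving, i3 issues alongside i1 in the first cycle.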
3.3.3. The bottleneck evoked and its resolution (3)
First generation ("narrow") superscalars → second generation ("wide") superscalars: elimination of the issue bottleneck and, in addition, widening of the processing width of all subsystems of the core.
3.3.4. Attributes of second generation superscalars (1)
First generation "narrow" superscalars:
• Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
• Core: static branch prediction
• Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
• Examples: Pentium, PA 7100, Alpha 21064

Second generation "wide" superscalars:
• Width: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
• Core: buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
• Caches: dual-ported, non-blocking L1 data caches; direct attached off-chip L2 caches
• Examples: Pentium Pro, K6, PA 8000, Alpha 21264
Consistency of processor features (1)
Dynamic instruction frequencies in general purpose applications (Wall 1990):
• FX instructions ~40 %
• Load instructions ~30 %
• Store instructions ~10 %
• Branches ~20 %
• FP instructions ~1-5 %

3.3.4. Attributes of second generation superscalars (2)

Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle
(Source: Sima et al., ACA, Addison-Wesley, 1997; Wall: Limits of ILP, WRL TN-15, Dec. 1990)

Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue

Reasonable core width: 4-5 instructions/cycle

Required EUs (each L/S instruction generates an address calculation as well):
• FX: ~0.8 · (4-5) = 3.2-4 → 3-4 FX EUs
• L/S: ~0.4 · (4-5) = 1.6-2 → 2 L/S EUs
• Branch: ~0.2 · (4-5) = 0.8-1 → 1 B EU
• FP: ~(0.01-0.05) · (4-5) → 1 FP EU

Required number of data cache ports: np ~ 0.4 · (4-5) = 1.6-2 → dual-port data caches
Consistency of processor features (2)
3.3.4. Attributes of second generation superscalars (3)
In general purpose applications, 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.
3.3.5. Exhausting the issue parallelism
4. Introduction of data parallelism
4.1. Overview (1)
Figure 4.1: Implementation alternatives of data parallelism

Possible approaches to introduce data parallelism:
• Dual-operation instructions (e.g. i: a·b+c, i.e. OPIn = 2): dedicated use, OPI > 1
• SIMD ISA-extension (for general use):
 – FX-SIMD (MM-support): 2/4/8/16/32 operations per instruction
 – FP-SIMD (3D-support): 2/4 operations per instruction

OPIn: number of operations per instruction; OPI: average number of operations per instruction

4.1. Overview (2)

Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
[SIMD instructions (FX/FP), i.e. multiple operations within a single instruction, are added either as a superscalar extension or as an EPIC extension to superscalar issue.]
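The principle of an FX-SIMD instruction, several packed sub-word operations per instruction, can be modeled in a few lines; the 4×8-bit lane layout is an illustrative choice, not the format of any specific ISA such as MMX:

```python
# Model of an FX-SIMD instruction: one operation applied to several packed
# sub-words. A 32-bit word holds four 8-bit lanes, added independently
# with wrap-around per lane (illustrative model only).

def simd_add8(a, b):
    """Add two 32-bit words as 4 independent 8-bit lanes (modulo 256)."""
    result = 0
    for lane in range(4):
        la = (a >> (8 * lane)) & 0xFF
        lb = (b >> (8 * lane)) & 0xFF
        result |= ((la + lb) & 0xFF) << (8 * lane)
    return result

a = 0x01020304
b = 0x10203040
print(hex(simd_add8(a, b)))  # 0x11223344

# Carries do not propagate across lane boundaries:
print(hex(simd_add8(0x000000FF, 0x00000001)))  # 0x0
```

One such instruction performs four operations, i.e. OPIn = 4 in the notation above, which is where the OPI factor of the performance formula comes from.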
4.2. The appeareance of SIMD instructions in superscalars (1)
Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars
[The chart traces, over 1990-2003, the RISC lines (Compaq/DEC Alpha 21064-21364, Motorola MC 88000, the PowerPC Alliance PPC 601-620, G3, G4, Power3, IBM Power1-Power4+, Sun/Hal SPARC SuperSparc-UltraSparc-3-Cu and Sparc64, MIPS R 8000-R16000, HP PA 7100-8700) and the CISC lines (Intel Pentium to Pentium 4 with HT, AMD/NexGen Nx586, K5-K7, Opteron, Cyrix/VIA M1, MII), marking when each line gained multimedia support (FX-SIMD, e.g. Pentium/MMX, K6) and 3D support (FP-SIMD, e.g. Pentium III, K6-2).]
Intel's and AMD's ISA extensions: MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional
2.5. and 3. generation superscalars (1)

• Second generation superscalars + FX SIMD (MM) → 2.5. generation superscalars
• Second generation superscalars + FX SIMD + FP SIMD (MM+3D) → 3. generation superscalars

2.5. and 3. generation superscalars (2)
Figure 4.4: The emergence of 2.5. and 3. generation superscalars
[The chart repeats the processor timeline of Figure 4.3 (the same RISC and CISC lines, 1990-2003), now identifying the models with multimedia support (FX-SIMD) as 2.5. generation and those with additional 3D support (FP-SIMD) as 3. generation superscalars.]
Bottlenecks evoked by third generation superscalars
System architecture (memory, display) → AGP bus, on-chip L2 caches
Features of the superscalar generations:

First generation ("thin superscalars"):
• Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
• Core: unbuffered issue, no renaming, no predecoding, static branch prediction, no ROB
• Caches: single-ported data caches; blocking L1 data caches or nonblocking caches with up to a single pending cache miss allowed; off-chip L2 caches attached via the processor bus
• ISA: no MM/3D support
• Examples: Alpha 21064, PA 7100, PowerPC 601, SuperSparc, Pentium

Second generation ("wide superscalars"):
• Width: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
• Core: buffered issue (shelving), renaming, predecoding, dynamic branch prediction, ROB
• Caches: dual-ported, nonblocking L1 data caches with multiple cache misses allowed; off-chip direct coupled L2 caches
• Examples: Alpha 21264, PA 8000, Power2, PowerPC 604, PowerPC 620, UltraSparc I, II, Pentium Pro

2.5. generation ("wide superscalars with MM support"):
• As the second generation, extended with FX-SIMD instructions
• Examples: Pentium II, K6

Third generation ("wide superscalars with MM/3D support"):
• ISA: FX- and FP-SIMD instructions
• Caches: on-chip L2 caches
• Examples: Power 4, Pentium III (0.18 μ), Pentium 4, Athlon (model 4), Athlon MP (model 6)

From generation to generation, performance rises together with hardware complexity, the required memory bandwidth and the required branch prediction accuracy.

4.3. Overview of superscalar processor generations
In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.
4.4. Exhausting the performance potential of data parallelism
4.5. The introduction of SIMD instructions in EPIC (VLIW) processors
• VLIW architectures/processors did not support SIMD instructions
• EPIC architectures/processors inherently support SIMD instructions (like the IA-64 ISA or the processors of the Itanium family)
5. Summing up the main road of evolution
a. Evolutionary scenario (superscalar approach, the main road):
 introduction and increase of temporal parallelism → introduction and increase of issue parallelism → introduction and increase of data parallelism

b. Radical scenario (VLIW/EPIC approach):
 introduction of VLIW processing → introduction of data parallelism (EPIC)

5.1. Main evolution scenarios
Traditional von Neumann processors
→ + temporal parallelism: pipeline processors (~1985/88)
→ + issue parallelism: superscalar processors (~1990/93)
→ + data parallelism: superscalar processors with SIMD extension (~1994/00)

Figure 5.1: The three cycles of the main road of processor evolution
[The figure plots the extent of operation-level parallelism and the level of hardware redundancy against time t, rising from sequential processing through the three ILP cycles.]

5.2. Main road of processor evolution (1)
Each cycle i (i = 1..3: the introduction of temporal, issue and data parallelism) proceeds as follows:
• introduction of a particular dimension of parallelism
• processing bottleneck(s) arise
• elimination of the bottleneck(s) evoked, by introducing appropriate techniques
• as a consequence, the parallelism available in the given dimension becomes exhausted
• further performance increase is achievable only by introducing a new dimension of parallelism

5.2. The main road of evolution (2)

Figure 5.2: Three main cycles of the main road
Figure 5.3: New techniques introduced in the three main cycles of processor evolution
Traditional sequential processing
↓ (~1985/88) Introduction of temporal parallelism → pipeline processors
 • 1. generation → 1.5. generation: caches
 • 2. generation: branch prediction
↓ (~1990/93) Introduction of issue parallelism → superscalar processors
 • 1. generation → 2. generation: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual ported data caches, nonblocking L1 data caches with multiple cache misses allowed, off-chip direct coupled L2 caches (advanced memory subsystem, advanced branch processing)
↓ (~1994/97) Introduction of data parallelism → superscalars with SIMD (ISA) extension
 • 2.5. generation: FX SIMD extension; extension of the system architecture (AGP, on-chip L2, ...)
 • 3. generation: FP SIMD extension, ...

5.2. Main road of the evolution (3)
Figure 5.4: Memory bandwidth and hardware complexity vs. rising processor performance
[From ~1985 to ~2000, rising performance drives up both the required memory bandwidth and the hardware complexity.]

5.2. Main road of evolution (4)

Figure 5.5: Branch prediction accuracy vs. rising clock rates
[From ~1985 to ~2000, rising fc increases the number of pipeline stages and, with it, the required accuracy of branch prediction.]

5.2. Main road of evolution (5)
6. Outlook: introduction of thread level par.
6. Outlook: the introduction of thread level parallelism (1)
Granularity of parallelism:
• ILP (instruction-level parallelism): a single thread (instruction flow)
• TP (thread-level parallelism): multiple threads
6. Outlook: the introduction of thread level parallelism (2)
Where can multiple threads come from?
• from the same application: multitasking, multithreading
• from different applications: multiprogramming
6. Outlook: the introduction of thread level parallelism (3)
Basic implementation alternatives of thread level parallelism:
• two or more cores placed on the same chip: CMP (Chip Multiprocessing; cf. SMP: Symmetric Multiprocessing): Core + Core sharing L2/L3 on chip, backed by L3/memory
• a multithreaded core: SMT (Simultaneous Multithreading; HT: Hyperthreading at Intel): a single SMT core with L2/L3, backed by L3/memory
6. Outlook: the introduction of thread level parallelism (4)
[The figure contrasts a (four-way) superscalar, whose issue slots are filled from a single thread, with a multithreaded superscalar (four-way, two threads), whose issue slots are filled from Thread 1 and Thread 2 simultaneously (SMT: Simultaneous Multithreading; HT: Hyperthreading (Intel)).]