Advanced Computer Architectures – Part 2.1

DESCRIPTION
Part 2.1 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B).

TRANSCRIPT
Advanced Computer
Architectures
– HB49 –
Part 2.1
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/2
Course contents
• Basic Concepts
Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
Computer Design
Quantitative assessments
• Instruction sets
• Pipelining
Computer design
• First part of the course: a survey of
computer history
• Key aspect of this history:
In the last 60 years computers have
experienced a formidable growth in
performance and a huge decrease in costs
A €1000 PC today provides its user with more
performance, memory, and disk space than a $1M
mainframe of the Sixties
Computer design
• How was this possible?
• Through
Advances in computer technology
Advances in computer design
Computer design
• The tasks of a computer designer:
Determine key attributes for a new machine
E.g., design a machine that maximizes
performance while keeping costs under control
Aspects:
Instruction set design
Functional organization
Logic design
Implementation
(To be defined later)
Significant improvements
• First 25 years:
From both technology and design
• From the Seventies:
Mainly from IC technology
Main concern = compatibility with the past
(to save investments)
Compatibility at the ML (machine language) level
No room for design improvements
20-30% per year for mainframes and minis
• Late Seventies: advent of the µP (microprocessor)
Higher rate (35% per year)
Significant improvements: the µP
• The µP
Mass production → lower costs
Significant changes in the computer
marketplace
Higher-level language compatibility (no need
for object code compatibility)
Availability of standard, vendor-independent
OSs (fewer risks and costs in producing a new
architecture)
allowed the development of a new concept:
RISC architectures
Significant improvements: RISC
RISC architectures
Designed in the Eighties, on the market ca.‘85
Since then, a 50% improvement per year
[Figure: performance (0–300) vs. year, 1987–1995, for the Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, and Sun UltraSparc; two trend lines, 1.35X/yr and 1.54X/yr.]
Technology Trends
[Figure: relative performance of microprocessors, minicomputers, mainframes, and supercomputers vs. year, 1965–2000 (log scale, 0.1–1000).]
Computer design
• The µP allowed a 50% yearly performance
increase. How was that possible?
Enhanced capability for users
IBM Power 2 (1993) vs. Cray Y-MP (1988):
the fastest supercomputer of 1988 has approx.
the same performance as the fastest 1993
workstation
Price: 1/10
Computers became more and more µP-based
Mainframes were disappearing or becoming
based on off-the-shelf µPs
Computer design
• Big consequence
No more market urge for
object code compatibility
Freedom from compatibility with old designs
Renaissance in computer design
Again, significant improvements from both
technology and design
50% yearly performance growth!
Computer design
• The highest-performance µP in ’95 is
mainly a result of design improvements
(1-to-5)
• In this section we focus on the design
techniques that made this possible
Performance
• What are the aspects to be taken into
account in order to reach a higher
performance?
• How to choose between different
alternatives?
Amdahl’s law
Quantitative assessment
Amdahl’s law
• Speed-up:

  S = (Execution time for the entire task without using the enhancement) / (Execution time for the entire task using the enhancement when possible)

• Amdahl’s law on speed-up: the speed-up
depends on the fraction of time
that may be affected by the enhancement
Amdahl’s law
Let us call F the fraction of time
affected by the enhancement
For instance, F = 0.40 means that the original program would
benefit from the enhancement for 40% of its execution time
What do we gain by introducing
the enhancement?

  Exec-time_NEW = Exec-time_OLD × ((1 − F) + F / S_ENH)

where S_ENH is the speedup in the enhanced mode. Hence,

  S = Exec-time_OLD / Exec-time_NEW = 1 / ((1 − F) + F / S_ENH)
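Amdahl's law is easy to turn into a few lines of Python (a minimal sketch; the function names are my own, not part of the course material):

```python
def amdahl_speedup(f, s_enh):
    """Overall speedup when a fraction f of execution time
    is accelerated by a factor s_enh (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / s_enh)

def max_speedup(f):
    """Limit of the overall speedup as s_enh grows without bound."""
    return 1.0 / (1.0 - f)

# F = 0.40, enhancement 10 times faster:
s = amdahl_speedup(0.40, 10)   # 1 / (0.60 + 0.04) = 1.5625
```

Note how even an infinitely fast enhancement cannot push the overall speedup beyond `max_speedup(0.40)` ≈ 1.67.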
Amdahl’s law
[Figure: overall speedup vs. S_ENH for F = 40%: S_ENH grows, but S_OVER does not keep pace.]
Amdahl’s law
• Law of diminishing returns:
the incremental improvement in speedup
gained by an additional improvement in the
performance of just a portion of the
computation
diminishes as improvements are added

  lim_{S_ENH → ∞} S = lim_{S_ENH → ∞} 1 / ((1 − F) + F / S_ENH) = 1 / (1 − F) = S_MAX
Amdahl’s law
To reach a maximum speedup of 3,
F must be at least 66%: since S_MAX = 1 / (1 − F), we need F ≥ 1 − 1/3 ≈ 0.66
Amdahl’s law…
• “…can serve as a guide to how much an
enhancement will improve performance
and how to distribute resources to
improve cost/performance.
• The goal, clearly, is to spend resources
proportional to where time is spent.’’
Amdahl’s law
• Example 1 (p.30 P&H)
A method allows an improvement by a factor of 10
That can be exploited for 40% of the time

  speedup_overall = 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56
Amdahl’s law
• Example 2 (p.31 P&H)
50% of the instructions of a given benchmark
are floating-point instructions
FPSQR applies to 20% of the same benchmark
Alternative 1: extra hardware makes FPSQR 10
times faster
Alternative 2: all the FP instructions go 2 times
faster

  speedup_FPSQR = 1 / ((1 − 0.2) + 0.2 / 10) = 1 / 0.82 ≈ 1.22
  speedup_FP = 1 / ((1 − 0.5) + 0.5 / 2) = 1 / 0.75 ≈ 1.33
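The two alternatives can be checked in a couple of lines (a standalone sketch; variable names are mine):

```python
def amdahl(f, s):
    """Overall speedup for fraction f accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Alternative 1: FPSQR (20% of time) made 10x faster
faster_fpsqr = amdahl(0.20, 10)   # ≈ 1.22
# Alternative 2: all FP instructions (50% of time) made 2x faster
faster_fp = amdahl(0.50, 2)       # ≈ 1.33
# The broader but smaller enhancement wins.
```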
Quantitative assessment
• CPUTIME(p) = time spent by the CPU to run
program p
• Clock cycle time = t_cc, clock rate = 1 / t_cc
• CPUTIME(p) = #clock cycles × t_cc
  = #clock cycles / clock rate
• E.g.: clock cycle time = 2 ns
  ⇒ clock rate = 500 MHz
• #CC(p) = number of clock cycles spent in
the execution of p
Quantitative assessment
• Instruction count
• IC(c,p) = number of instructions that CPU
c executed during the activity of program
p
• Often the CPU is clear from context, and we simply write IC(p)
Quantitative assessment
• Clock cycles per instruction
• CPI(p) = #CC(p) / IC(p)
  = average number of clock cycles needed
  to execute one instruction of p
Quantitative assessment
• CPUTIME(p) =
  = #clock cycles × clock cycle time
  = #CC(p) × t_cc
  = IC(p) × CPI(p) × t_cc
  = IC(p) × CPI(p) / clock rate

We can influence the performance of a given
program p by optimizing the three key
variables IC(p), CPI(p), and clock rate.
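The identity above can be sketched as a small helper (illustrative only; the sample figures below — 10⁹ executed instructions, CPI = 2, a 500 MHz clock — are hypothetical):

```python
def cpu_time(ic, cpi, clock_rate_hz):
    """CPUTIME(p) = IC(p) * CPI(p) / clock rate, in seconds."""
    return ic * cpi / clock_rate_hz

# 10**9 instructions, CPI = 2, 500 MHz clock (t_cc = 2 ns):
t = cpu_time(1e9, 2.0, 500e6)   # 4.0 seconds
```

Halving any one of the three factors halves the run time, which is why all three are targets for optimization.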
Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
Clock rate (the higher, the better)
Clock cycles per instruction (the lower, the
better)
Instruction count (the lower, the better)
Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
Clock rate (HW technology & organization)
Clock cycles per instruction
(organization & instruction set architecture)
Instruction count
(instruction set architecture &
compiler technology)
• Note: technologies are not independent of
each other!
Quantitative assessment

CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
Quantitative assessment
• Decades-long challenge: optimizing
  CPUTIME(p) = IC(p) × CPI(p) / clock rate
• This is a function of p!
• The choice of benchmarks is
important
Quantitative assessment
• Which methods to use?
  CPUTIME(p) = IC(p) × CPI(p) / clock rate
• Method 1: increasing the clock rate
(note: independent of p!)
• Method 2: those trying to decrease
IC(p)
• Method 3: those trying to decrease
CPI(p)
• Each factor is equally important
• Some methods are more effective than others
Quantitative assessment:
how to calculate CPI?

  CPI = Σ_{i=1..n} (CPI_i × IC_i) / Instruction count = Σ_{i=1..n} CPI_i × (IC_i / Instruction count)

IC_i = number of times instruction i is
executed by p
CPI_i = average number of clock cycles
for instruction i

CPI_i needs to be measured and not just
read from a table in the Reference
Manual!
That is, we need to take into account
the memory access time! (Cache
misses do count… a lot)
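The weighted sum can be written directly (a sketch; the instruction mix below is invented purely for illustration):

```python
def weighted_cpi(mix):
    """mix: iterable of (ic_i, cpi_i) pairs.
    Returns sum(cpi_i * ic_i) / sum(ic_i)."""
    total_ic = sum(ic for ic, _ in mix)
    return sum(ic * cpi for ic, cpi in mix) / total_ic

# Hypothetical mix: 50 ALU ops @ 1 cycle, 30 loads @ 2.5 cycles
# (measured, including cache-miss stalls), 20 branches @ 2 cycles
cpi = weighted_cpi([(50, 1.0), (30, 2.5), (20, 2.0)])   # 1.65
```

Note the measured 2.5 cycles for loads: reading the nominal latency from the reference manual would miss the stall cycles.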
Quantitative assessment
• Example 3: 2 alternatives for a
conditional branch instruction
A: a CMP that sets a condition code (Z bit)
followed by a JZ
B: a single instruction to do CMP and JZ
LD R1, 0 LD R1, 0
L: INC R1 L: INC R1
CMP R1, 5 JRZ R1, 5, L
JZ L RET
RET
Arch. A Arch. B
We assume that JZ and JRZ take 2 cycles,
all the other instructions take 1 cycle
Quantitative assessment
LD R1, 0          LD R1, 0
L: INC R1         L: INC R1
CMP R1, 5         JRZ R1,5,L
JZ L              RET
RET
Arch. A           Arch. B
• 20% of the instructions are c.jumps
(instructions such as JZ or JRZ)
• 80% are other instructions
• On A, for each c.jump there is a CMP: on
A, 20% are c.jumps and 20% are CMPs
• 60% are other instructions
Because of the extra complexity in B, the
clock of A is faster (CT_B = 1.25 × CT_A)
Quantitative assessment
(n_i = IC_i / IC denotes the fraction of instructions of type i)
• CPI_A = Σ_i (IC_i × cycles_i) / IC_A
        = n_BR,A × cycles_BR + n_other,A × cycles_other
        = 20% × 2 + 80% × 1 = 1.2
• CPU_A = IC_A × CPI_A × CT_A = IC_A × 1.2 × CT_A
• CPI_B = Σ_i (IC_i × cycles_i) / IC_B
        = n_BR,B × cycles_BR + n_other,B × cycles_other
Quantitative assessment
• Now, on B:
One spares 20% of the instructions (the extra
CMPs), hence:
  n_BR,B = 20 / (100 − 20) = 0.25 (25%)
Furthermore, IC_B = 0.8 × IC_A
• Hence CPI_B = 0.25 × 2 + 0.75 × 1 = 1.25
• CPU_B = IC_B × CPI_B × CT_B
        = 0.8 IC_A × 1.25 × 1.25 CT_A
So CPU_B = 1.25 × IC_A × CT_A
   CPU_A = 1.2 × IC_A × CT_A
So A is faster
(for which p?)
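The whole comparison fits in a few lines (a standalone recomputation of the example's figures, with quantities normalized to IC_A = CT_A = 1):

```python
# Architecture A: 20% c.jumps (2 cycles), 80% other incl. CMPs (1 cycle)
cpi_a = 0.20 * 2 + 0.80 * 1            # 1.2
# Architecture B: the CMPs disappear, so IC_B = 0.8 * IC_A and
# branches become 0.20 / 0.80 = 25% of B's instructions
cpi_b = 0.25 * 2 + 0.75 * 1            # 1.25

ic_a, ct_a = 1.0, 1.0                  # normalized
ic_b, ct_b = 0.8 * ic_a, 1.25 * ct_a   # B executes fewer, slower cycles

cpu_a = ic_a * cpi_a * ct_a            # 1.2
cpu_b = ic_b * cpi_b * ct_b            # 0.8 * 1.25 * 1.25 = 1.25
# cpu_a < cpu_b: A is faster for this instruction mix
```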
Performance
• A straightforward enhancement is given
by increasing the clock rate
• The entire program benefits
• Also, independent of the particular
program
• Dependent on the efficiency of the
compiler etc.
Clock Frequency Growth Rate
• 30% per year
[Figure: clock rate in MHz (log scale, 0.1–1,000) vs. year, 1970–2005, for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, and R10000.]
Transistor Count Growth Rate
• 100 million transistors on chip in early year 2000.
• Transistor count grows much faster than clock rate
[Figure: transistors per chip (log scale, 1,000–100,000,000) vs. year, 1970–2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
Performance
• Another important factor for performance
is given by
Memory accesses
I/O (disk accesses)
Memory
• Semiconductor DRAM technology
Density: increase of 60% per year
(quadruples in 3 years)
Cycle time: improves far more slowly!
Capacity Speed
Logic 2x in 3 years 2x in 3 years
DRAM 4x in 3 years 1.4x in 10 years
Disk 2x in 3 years 1.4x in 10 years
Speed increases of memory and I/O have not
kept pace with processor speed increases.
Memory
[Figure: DRAM capacity in bits per chip (log scale, 1,000–1,000,000,000) vs. year, 1970–2000.]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
Basic definitions
1. Bandwidth: the rate at which data can be
transferred. Bandwidth is typically measured in
bytes per second.
2. Block size: the amount of data transferred per
request. Block size is typically measured in bytes.
3. Latency: the time between making a request (e.g.
to read or write a block of data) and completing the
request. Latency is typically measured in seconds.
4. Throughput: The number of requests that can be
completed per unit time. Throughput is typically
measured in requests per second.
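Definitions 1–4 combine in the usual way: the time to complete one request is the latency plus the block size divided by the bandwidth, and throughput is its reciprocal (a sketch with made-up device figures):

```python
def request_time(latency_s, block_bytes, bandwidth_bytes_per_s):
    """Time to complete one request: fixed latency plus block transfer."""
    return latency_s + block_bytes / bandwidth_bytes_per_s

# Hypothetical device: 10 ms latency, 10 MB/s bandwidth, 4 KB blocks
t = request_time(0.010, 4096, 10e6)
throughput = 1.0 / t   # requests completed per second
```

With these figures the latency dominates: even though the transfer itself takes only ~0.4 ms, the device completes fewer than 100 requests per second.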
Memory
• DRAM: main memory of all computers
Commodity chip industry: no company has a >20% share
Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)
• Capacity: 4X / 3 years (60%/year)
Moore’s Law
• MB/$: + 25%/year
• Latency: – 7%/year,
Bandwidth: + 20%/year (so far)
source: www.pricewatch.com, 5/21/98
SIMM = single in-line memory module, a small circuit board that
can hold a group of memory chips. Measured in bytes vs bits.
32-bit path to memory
DIMM = dual in-line memory module. 64-bit path to memory
Processor Limit: DRAM Gap
[Figure: performance (log scale, 1–1000) vs. year, 1980–2000: CPU performance grows 60%/yr (“Moore’s Law”), DRAM 7%/yr; the processor-memory performance gap grows 50% / year.]
Memory Summary
• DRAM:
rapid improvements in capacity, MB/$, bandwidth;
slow improvement in latency
Processor-memory interface
is a bottleneck to delivered bandwidth
Disk Components
Disk Components: Platters
• Platters: the recording surfaces.
i. 1 to 8 inches in diameter (2.5 to 20 cm).
ii. Stacked on a spindle: typical disks have 1-12
platters.
iii. Data can be stored on one or both surfaces.
iv. Spindle and platters rotate at 3600 - 10000 rpm
(60-165 Hz).
v. Recording density depends on applying a
magnetic film with few defects.
vi. Rotation rate limited by bearings and power
consumption.
Disk Components: Heads
• Heads: write and read data to and from platters.
i. Data stored as presence or absence of
magnetization.
ii. Head “floats” on air-film that rotates with the disk.
Bernoulli effect pulls head toward disk but not into
it. A dust particle can cause a “head crash” where
the disk surface is scratched and any data on it is
lost.
iii. Disk heads are manufactured using thin film
technology. Advancing technology allows smaller
heads and therefore more closely spaced tracks
and bits.
Disk Components: Actuators
• Actuators: move heads radially over the platters.
i. Actuator arm needs to be light to move quickly.
ii. Actuator arm needs to be stiff to prevent flexing.
iii. Smaller platters allow shorter arms: therefore
lighter and stiffer.
iv. Actuators limited by
• power of actuator motor and
• weight and strength of actuator components
Disks: Data Layout
• Each surface consists of concentric rings called
tracks
• The set of tracks that are at the same relative
position on each surface form a cylinder
• Each track is divided into sectors. Data is written to
and read from the disk a whole sector at a time
cylinder
Three Components of Disk Access Time
1. Seek time: the time to move the heads to the
desired cylinder
Advertised to be 8 to 12 ms. May be lower in real life
2. Rotational latency: the time for the desired sector
to arrive under the head
4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
3. Transfer time: the time to read the data from the
disk and send it over the I/O bus to the processor
2 to 12 MB per second
Response time = Queueing time + Controller time + Device service time
[Diagram: Proc → Queue → IOC → Device]
Hard Disks
Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Transfer Time
Order-of-magnitude times for 4 KB transfers:
Average seek: 8 ms or less; rotate: 4.2 ms @ 7200 rpm; transfer: 1 ms @ 7200 rpm
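The latency formula above can be sketched as follows (the average rotational delay is taken as half a revolution; the sample figures echo the order-of-magnitude times, and the 1 ms controller time is an assumption of mine):

```python
def disk_latency_ms(queue_ms, ctrl_ms, seek_ms, rpm, xfer_ms):
    """Queueing + controller + seek + average rotation + transfer, in ms.
    The average rotational delay is half a revolution."""
    rotation_ms = 0.5 * 60_000.0 / rpm   # half a revolution, in ms
    return queue_ms + ctrl_ms + seek_ms + rotation_ms + xfer_ms

# 7200 rpm disk, 8 ms average seek, 1 ms transfer, 1 ms controller, empty queue:
t = disk_latency_ms(0.0, 1.0, 8.0, 7200, 1.0)   # ~14.2 ms
```

At 7200 rpm a half revolution is 30000/7200 ≈ 4.17 ms, matching the 4.2 ms figure on the slide.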
Hard Disks
• Capacity: + 60%/year (2X / 1.5 yrs)
• Transfer rate (BW): + 40%/year (2X / 2.0 yrs)
• Rotation + Seek time: – 8%/year (1/2 in 10 yrs)
• MB/$: > 60%/year (2X / <1.5 yrs)

Latency = (Queueing Time + Controller Time + Seek Time + Rotation Time)  [per access]
          + Size / Bandwidth  [per byte]

source: Ed Grochowski, 1996,
“IBM leadership in disk drive technology”;
www.storage.ibm.com/storage/technolo/grochows/grocho01.htm
Hard disks
1973: 1.7 Mbit/sq. in, 140 MBytes
1979: 7.7 Mbit/sq. in, 2,300 MBytes
Hard Disks
[Figure: areal density in Mbit/sq. in (log scale, 1–10,000) vs. year, 1970–2000.]
1989: 63 Mbit/sq. in, 60,000 MBytes
1997: 1450 Mbit/sq. in, 1600 MBytes
1997: 3090 Mbit/sq. in, 8100 MBytes
Hard Disks
• Continued advance in capacity (60%/yr)
and bandwidth (40%/yr.)
• Slow improvement in seek, rotation
(8%/yr)
• Time to read whole disk
Year Sequentially Randomly
1990 4 minutes 6 hours
2000 12 minutes 1 week
Memory/Disk Summary
• Memory:
DRAM rapid improvements in capacity, MB/$,
bandwidth; slow improvement in latency
• Disk:
Continued advance in capacity, cost/bit,
bandwidth; slow improvement in seek,
rotation
• Huge gap between CPU and external
memories
• How to address this problem?
• Classical way: memory hierarchies
Memory hierarchies
• Axiom of HW designer: smaller is faster
Larger memories => larger signal delay
More levels are required to encode addresses
In a smaller memory the designer can use more
power per cell => shorter access times
• Crucial features for performance
Huge bandwidth (in MB/sec.)
Short access times
• Principle of locality
The data most recently used is very likely to be
accessed again in the near future (temporal l.)
Memory cells close to the most recently used one
are likely to be accessed in the near future (spatial)
• Combining the above with Amdahl’s law, the
“best” enhancement is using hierarchies of
memories
Typical memory hierarchy (`95)

CPU → Registers → Cache → Memory → I/O devices
Size:  200 B | 64 KB | 32 MB  | 2 GB
Speed: 5 ns  | 10 ns | 100 ns | 5 ms
(memory bus between cache and memory; I/O bus between memory and the I/O devices)
Memory hierarchies
[Figure: map of the topics ahead — Instruction Set Architecture; Pipelining and Instruction-Level Parallelism (pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP; addressing, protection, exception handling); Memory Hierarchy (L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; emerging technologies, interleaving, bus protocols, VLSI); Input/Output and Storage (disks, WORM, tape, RAID).]
Memory hierarchies
• Registers: smallest and fastest memory
• Size: less than 1KB
• Access time: 2-5 ns
• Bandwidth: 4000-32000 MB/sec
• Managed by the compiler (or the
assembly programmer)
register int a;
• Special purpose vs. general purpose
• Monolithic or double-shaped
Rx = Rl + Rh
• Backed in cache
• Implemented via custom memory with
multiple ports
Memory hierarchies
• Cache = small, fast memory located close
to the CPU
• The cache holds the most recently
accessed code or data
Managed by HW
No way to say “put these data in cache” at the SW level
New research: cache-conscious data
structures
• Size: less than 4 MB
• Access time: 3-10 ns
• Bandwidth: 800-5000 MB/sec
• Backed in main memory
• Implemented with (on- or off-chip) CMOS
SRAM
• Cache terminology: cache hit, cache
miss, cache block
Cache hit: the CPU has been able to find in
cache the requested data
Cache miss: the opposite of a cache hit — the requested data is not in cache
Cache block: the fixed-size buffer used to load
a portion of memory into the cache
• A cache miss blocks the CPU until the
corresponding memory block gets cached
Memory hierarchies
Memory hierarchies
• Virtual memory: same principles behind
the use of cache, but implemented
between main memory and disk storage
• At any point in time, not all the data
referenced by p need to be in main
memory
• Address space is partitioned into fixed-
size blocks: pages
• A page is either in memory or on disk
• When CPU references an item within a
page
if ( Check-if-in-cache() == CACHE_MISS )
if ( Check-if-in-memory() == MEM_MISS)
PageFault(); // Loads page in memory
CPU doesn’t stall – switches to other tasks
Cache performance
• Example: speedup using a cache
Cache 10 times faster than main memory
Cache is used 90% of the cases

  speedup_overall = 1 / ((1 − 0.9) + 0.9 / 10) = 1 / 0.19 ≈ 5.3
Cache performance
CPUtime = (CPU clock cycles + memory
stall cycles) × clock cycle time

Memory stall cycles = #(misses) × miss penalty
  = IC × #(misses per instruction) × miss penalty
  = IC × #(memory references per instr.) × miss rate × miss penalty
Cache performance
• Example (P&H, p.43)
A computer has CPI = 2 when all data is in cache
Memory access is only required by load and
store instructions (40% of the total)
Miss penalty = 25 clock cycles
Cache miss rate = 2%
? How fast would the machine be if no
cache miss ever occurred?
  CPU_all-hit = (CPU clock cycles + memory stall cycles) × clock cycle time
              = (IC × CPI + 0) × clock cycle time
              = IC × 2 × clock cycle time
Cache performance
? How fast is the machine when
cache misses do occur?
1. Compute the memory stall cycles (msc):
   msc = IC × memory references per instruction × miss rate × miss penalty
       = IC × (1 + 0.4) × 0.02 × 25    (1 instruction access + 0.4 data accesses)
       = IC × 0.7
2. Compute total performance:
   CPU_cache = (CPU clock cycles + msc) × clock cycle time
             = (IC × 2 + IC × 0.7) × clock cycle time
             = 2.7 × IC × clock cycle time
Hence the machine with no cache misses would be 2.7 / 2 = 1.35 times faster.
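The example can be recomputed in a few lines (a standalone sketch using the example's figures, normalized per instruction):

```python
cpi = 2.0                        # CPI when everything hits the cache
mem_refs_per_instr = 1 + 0.4     # 1 instruction fetch + 0.4 data accesses
miss_rate, miss_penalty = 0.02, 25

# Memory stall cycles per instruction:
msc = mem_refs_per_instr * miss_rate * miss_penalty   # 0.7
cpu_all_hit = cpi                # 2.0 cycles/instr with no misses
cpu_cache = cpi + msc            # 2.7 cycles/instr with misses
slowdown = cpu_cache / cpu_all_hit   # 1.35
```

A 2% miss rate thus costs 35% of the machine's performance, which is why cache behavior dominates CPI measurements.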
Computer Design
• Quantitative assessments
Instruction sets
• Pipelining
Computer design
• Instruction-set architecture:
The architecture of the machine level
The boundary between SW and HW
• Organization:
High level aspects: memory system, bus
structure, internal CPU design
• Hardware:
The specifics of a machine: detailed logic
design, packaging technology…
• Architecture = I + O + H
Instruction Sets
• IS = Instruction sets = The architecture of
the machine language
• IS Classification
• Roles of the compilers
• DLX
Computer Design IS
IS Classification
• Role of the compilers
• DLX
Computer Design IS
IS Classification
• Key: type of internal storage in the CPU
• Three main classes
Stack architectures
Accumulator architectures
General-purpose register architectures
Computer Design IS
IS Classification Stack A.
• Stack architecture:
• Operands are implicitly referred to
• Top two items on the system stack
• Example: C = A + B
1. PUSH A A
2. PUSH B B
3. ADD
ADD = PUSH (POP + POP)
Computer Design IS
IS Classification Stack A.
• Stack architecture:
• Operands are implicitly referred to
• Top two items on the system stack
• Example: C = A + B
1. PUSH A A
2. PUSH B
3. ADD
ADD = PUSH (POP + POP)
ADD = PUSH (B + POP)
Computer Design IS
IS Classification Stack A.
• Stack architecture:
• Operands are implicitly referred to
• Top two items on the system stack
• Example: C = A + B
1. PUSH A
2. PUSH B
3. ADD
ADD = PUSH (POP + POP)
ADD = PUSH (B + POP)
ADD = PUSH (B + A)
B+A
Computer Design IS
IS Classification Stack A.
• Stack architecture:
• Operands are implicitly referred to
• Top two items on the system stack
• Example: C = A + B
1. PUSH A
2. PUSH B
3. ADD
C = TOP STACK = A+B
4. POP C
An example: the ARIEL virtual machine (Part 1, Slides 91 –)
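The PUSH/ADD/POP sequence can be mimicked with a toy stack-machine interpreter (purely illustrative; the instruction encoding is my own, not ARIEL's):

```python
def run(program, memory):
    """Tiny stack-machine interpreter: PUSH x, ADD, POP x.
    ADD pops the top two operands and pushes their sum."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])
        elif op == "ADD":                 # ADD = PUSH (POP + POP)
            stack.append(stack.pop() + stack.pop())
        elif op == "POP":
            memory[args[0]] = stack.pop()
    return memory

# C = A + B, exactly as on the slides:
mem = {"A": 2, "B": 3, "C": 0}
run([("PUSH", "A"), ("PUSH", "B"), ("ADD",), ("POP", "C")], mem)
# mem["C"] is now 5
```

Note that the operands of ADD are never named in the program: they are implicit in the stack, which is the defining property of this architecture class.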
Computer Design IS
IS Classification Accumulator A.
• Accumulator Architectures
• A special register (the accumulator)
plays the role of an implicit argument
• Example: C = A + B
1. LOAD A ; let Acml = A
2. ADD B ; let Acml = Acml + B
3. STORE C ; let C = Acml
Computer Design IS
IS Classification Register A.
• General-purpose Register Architecture
• Explicit operands only
• Either registers or memory locations
• Two flavors:
Register-memory architectures (RMA)
Register-register architectures (RRA)
• Example: C = A + B
RMA: Load R1, A
Add R1, B ; in C notation: R1 += B
Store C, R1
RRA: Load R1, A
Load R2, B
Add R3, R1, R2
Store C, R3
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/84
Computer Design IS
IS Classification RRA
• Some old machines used stack or
accumulator architectures
For instance, the T800 (stack) and the
6502/6510 (accumulator)
• Today the de facto standard is RRA
Regs are fast
Regs are easier for compiler writers to use
They do not impose an evaluation order
(associativity) on expressions; stacks do!
Regs can hold variables
register int I;
for (I = 0; I < 1000000; I++)
{ do_sth(I); … }
Using regs you don’t need a memory address
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/85
Computer Design IS
IS Classification Register A.
• RRA: no memory operands
All instructions are similar in size, so they
take a similar number of clocks to execute
(a very useful property… see later)
No side effects
Higher instruction count
• RMA: one memory operand
One load can be spared
A register operand is destroyed ( R += B )
Clocks per instruction vary with operand
location
• Memory-memory:
Compact
Large variation of work per instruction
Large variation in instruction size
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/86
Computer Design IS
Memory addressing
• How is memory organized?
• What does it mean, e.g., read memory at
address 512?
• What do we read?
Bytes, half words, words, double words
• How are consecutive bytes stored in a
word? (Assumption: word is 4 bytes)
Little endian: &word = &LSB
Big endian: &word = &MSB
XDR routines are needed to exchange data
(&word denotes the address of word)
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/87
A memory model for didactics
• Memory can be thought of as a finite, long
array of cells, each of size 1 byte
0 1 2 3 4 5 6 7 …
• Each cell has a label, called address, and
a content, i.e. the byte stored into it
• Think of a chest of drawers, with a label
on each drawer and possibly something
in it
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/88
A memory model for didactics
(Figure: a chest of four drawers labeled with addresses
1 to 4; each drawer has an address and a content)
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/89
• The character * has a special meaning
• It refers to the contents of a cell
A memory model for didactics
• For instance:
*(1)
This expression means we're inspecting the contents
of cell 1 (we open the drawer and see what's in it)
© V. De Florio
KULeuven 2003
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.1/90
• The character * has a special meaning
• It refers to the contents of a cell
A memory model for didactics
• For instance:
*(1) = 5
Used as the target of an assignment, this means we're
writing new contents into cell 1 (we open the drawer
and change its contents)
A memory model for didactics
• Memory is (often) byte addressable,
though it is organized into small groups of
bytes: the machine word
• A common size for the machine word is 4
bytes (32 bits)
• Two possible organizations for the bytes
in a word
Little endian
Big endian
Little endian versus Big endian
(Figure: two 4-byte words, stored at addresses 0-3 and 4-7)
Big endian (Motorola): the MSB sits at the lowest address
of the word; the bytes appear in the order MSB … LSB
Little endian (Intel): the LSB sits at the lowest address
of the word; the bytes appear in the order LSB … MSB
Little endian versus Big endian
Big endian (Motorola) stores the MSB at the lowest address
of the word; little endian (Intel) stores the LSB there
Problem: communication between the two
The same bytes in memory are interpreted differently:
bytes 00 00 00 01 (hex) are the word 1 on a big endian
machine, but 0x01000000 = 16777216 on a little endian one
bytes 10 00 00 00 (hex) are 0x10000000 = 268435456 on a
big endian machine, but 0x10 = 16 on a little endian one
So the stored bytes are the same; it is their interpretation
as words that differs
Computer Design IS
Memory addressing
• Alignment is mandatory on some
machines
Object O; int t = sizeof(O);
ALIGNED(O) means: &O modulo t is 0
(“access to O is aligned”)
For instance, if access to integers (4 bytes) is
aligned, then an integer can only be stored at
addresses divisible by 4
Alignment is sometimes required because it
avoids hardware complications
Aligned access is also faster
Computer Design IS
Memory addressing
• Addressing modes: ways to specify the
address of an object in memory
• An addressing mode can specify
A constant
A register
A memory location
In what follows,
A += B means A = A + B
*(x) means “the contents of memory at
address x”
x++ means “at the end, let x = x + 1”
--x means “at the beginning, let x = x – 1”
Rx means register x
Computer Design IS
Memory addressing
Mode           Example              Meaning
Register       Add R4, R3           R4 += R3
Immediate      Add R4, #3           R4 += 3
Displacement   Add R4, 100(R1)      R4 += *(100 + R1)
Indirect       Add R4, (R1)         R4 += *(R1)
Indexed        Add R4, (R1 + R2)    R4 += *(R1 + R2)
Absolute       Add R4, (100)        R4 += *(100)
Deferred       Add R4, @(R3)        R4 += *(*(R3))
Autoincrement  Add R4, (R3)+        R4 += *(R3), then R3++
Autodecrement  Add R4, -(R2)        R2--, then R4 += *(R2)
Scaled         Add R4, 100(R2)[R3]  R4 += *(100 + R2 + R3 * d)
d = size of the addressed data (1, 2, 4, 8, or 16)
Computer Design IS
Memory addressing
• Addressing modes can reduce IC
• Complex addressing modes increase the
complexity of the hardware and can thus
increase CPI
• Displacement, immediate and deferred
represent b/w 75% and 99% of addressing
modes (experiments done with TeX,
spice, and gcc)
• IC(p) = number of instructions that the CPU executed
during the activity of program p
• CPI(p) = clock cycles per instruction = #CC(p) / IC(p),
the average number of clock cycles needed to execute
one instruction of p
Computer Design IS
Operations
• Arithmetical and logical (add, and, sub...)
• Data transfer (move, store)
• Control (br, jmp, call, ret, iret…)
• System (virtual memory mngt…)
• Floating point (add, mul, …)
• Decimal (decimal add, decimal mul…)
• String (str move, str cmp, str search)
• Graphics (pixel operations)
• Benchmarks show that often a small set
of simple instructions accounts for
something like 95% of instructions executed
(see Fig. 2.11, P&H p.81)
Computer Design IS
Operations
• Control Flow Instructions
Branch (conditional change)
Jump (unconditional change)
Procedure calls
Procedure returns
• Most of the comparisons in conditional
branches are simple “==” or “!=” tests against 0!
• In some cases, the address to go to
is only known at run-time
“Return” uses a stack
Switch statements
Dynamic libraries
Computer Design IS
Operands
• When we say, e.g.,
“Add R1, #5”
do we work with bytes? Half-words?
Words?
• How do we specify the type of the
operand?
1. Classical method: the type of operand is
part of the opcode
• Add family is coded as ffff…fffvv
where f are fixed bits and v are bits
that specify the type
Computer Design IS
Operands and types
• 1011010100010000 = Add float words
1011010100010001 = Add words
1011010100010010 = Add half-words
1011010100010011 = Add bytes
• Example: Add family =
10110101000100vv
• Old fashioned method:
operand = data + tag
• Tag describes a type
• Tag is interpreted by HW
• Operation is chosen accordingly
Computer Design IS
Operands and types
• Which types to support?
• Old fashioned solution: all (bytes,
half-words, words, f.p., double words,
double precision f.p., …)
• Current trend: Only operations on items
greater than or equal to 32 bits
• On the DEC Alpha one needs multiple
instructions to access objects smaller
than 32 bits
Computer Design IS
Operands and types
• Floating point numbers:
IEEE standard 754
• In the early ’80s, each manufacturer had
its own f.p. representation
• Sometimes string operations are available
(strcmp, strcpy…)
• Sometimes BCD is used to code numbers
Four bits are used to code a decimal digit
A byte codes two decimal digits
Functions for “packing” and “unpacking” are
required
It is unclear if this will stay in the future
Computer Design IS
• IS Classification
Role of the compilers
• DLX
Computer Design IS
Role of the compiler
• In the past, the role of Assembly language
was crucial
• Architectural decisions aimed at easing
assembly language programming
• Now, the user interface is a high level
language (C, C++, Java…)
• The user interfaces the machine via the
HLL, though the machine actually
executes some lower level code
• This lower level code is produced by a
compiler
The role of the compiler is fundamental
The IS architecture needs to take the
compiler into strong account
Computer Design IS
Role of the compiler
• Goals of the compiler writer
Correctness
Performance
…Fast compilation, debugging support, …
• Strategy for writing a compiler
Use a number of “passes”
From high level structures down to
lower levels, until machine level
This way complexity is decomposed into
smaller blocks
…but optimizing becomes more difficult
Computer Design IS
Role of the compiler
(Diagram: the passes of a typical compiler, with their
dependencies and functions)
Pass            Dependencies   Function
Front-end       D(language)    Language → common intermediate form
HL Opt          D(language)    Loop transformations, function inlining…
Global Opt      D(machine)     Register allocation…
Code generator  D(machine)     Instruction selection, machine-dep. opt.
Computer Design IS
Role of the compiler
• HL Optimizations: source-level
optimizations (code → code’)
• Local optimizations: basic block
optimizations
• Global optimizations: loop optimization
and basic blocks optimizations
• Machine-dependent optimization: using
low level architectural knowledge
• Basic Block = a straight-line code fragment
(no branches in or out, except at entry and exit)
Computer Design IS
Role of the compiler
• Compilers have different optimization
levels
-O1 .. -On
• Optimization can have a big impact on
instruction count, and thus on performance
Computer Design IS
Role of the compiler
• In some cases, though, optimization may
be counterproductive!
• This happens because there might be
conflicts between local and global
optimization tasks
• Example:
a = sqrt(x*x + y*y) + f()… ;
b = sqrt(x*x + y*y) + g()…;
SAME EXPRESSION
• Idea:
tmp = sqrt(x*x + y*y);
a = tmp + f() …;
b = tmp + g() …;
Computer Design IS
Role of the compiler
• Effective, but only if tmp can be stored in
a register
• No register available → tmp is kept in memory
→ possible cache misses … bad performance
• Problem is
When the compiler performs, e.g., code
transformations like in the example, it does not
know whether a register will actually be
available
This will only become clear later (at global
optimization level)
• (Phase ordering problem)
Computer Design IS
Role of the compiler
• Key resource is the register file
• “Intelligent” register allocation
techniques are a must
• Current solution: graph coloring (the graph's
nodes are the candidates for allocation to a
register; edges join candidates that interfere)
• NP-complete, though effective heuristic
algorithms exist
Computer Design IS
Role of the compiler
• A special class of compilers – Algorithm-
driven software generation
FFTW approach: Software generation system
based on symbolic computation
Written in Objective Caml
Sort of FFT compiler that generates optimal C
code via symbolic computing
Possible future steps (project works, theses…):
Extending the approach going down to code
generation for, e.g., the TI ‘C67 DSP and other
VLIW CPUs
Exam of 16 Jan 2002
• A program is composed of three classes of
instructions: i1 (integer instructions), i2 (load-
store instructions), and i3 (floating point
instructions)
• The three classes are responsible for r1 = 60%,
r2 = 30% and r3 = 10% of the overall execution time,
respectively
• You can choose between three levels of
optimisation on your computer: O1, O2, and O3:
O1 optimises i1, O2 optimises i2, and O3 optimises i3
• The corresponding enhancements would be
e1 = 2, e2 = 3, e3 = 10
• Suppose you can only choose one of the three
levels of optimisation. Which one would you
choose? Justify your choice
Solution
• r1 = 60%, e1 = 2
r2 = 30%, e2 = 3
r3 = 10%, e3 = 10
• S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
• s1 = 1 / (0.4 + 0.6/2) = 1.42857
s2 = 1 / (0.7 + 0.3/3) = 1.25
s3 = 1 / (0.9 + 0.1/10) = 1.0989
• s1 is the largest speedup, so choose O1