CSCE 930 Advanced Computer Architecture
Lecture 1
Evaluate Computer Architectures
Dr. Jun Wang
Computer Architecture Trends
[Figure 1.1 (H&P): growth in microprocessor performance, 35% per year]
Technology Trends
• Smaller feature sizes – higher speed, density
[Figure: minimum feature size vs. year, 1972–2000, from 10 um (4004) down to 130 nm (Pentium 4)]
• Density is increased by 77 times
Technology Trends
• Larger chips
  – Trend is toward more RAM, less logic per chip
  – Historically 2x per generation; leveling off?
  – McKinley has large on-chip caches
  => larger wafers to reduce fabrication costs
[Figure: die sizes from the 4004 through the 8086, 286, 486, Pentium, Pentium 4, and McKinley (IA-64)]
Moore’s Law
• Number of transistors doubles every 18 months (amended to 24 months)
• Combination of both greater density and larger chips
[Figure: thousands of transistors (log scale, 1 to 10,000) vs. year, 1972–2000, for the 4004 through the Pentium 4]
Tech. Trends, contd.
• More, faster, cheaper transistors have fed an application demand for higher performance
  – 1970s – serial, 1-bit integer microprocessors
  – 1980s – pipelined 32-bit RISC
    • ISA simplicity allows processor on chip
  – 1990s – large, superscalar processors, even for CISC
  – 2000s – multiprocessors on a chip
...
Pipelining and Branch Prediction
• Two basic ways of increasing performance
• Pipelining:
[Figure: five-stage pipeline IF | ID | EX | ME | WB with latches between stages, driven by a common clock]
• Branch Prediction
  – Speculate on branch outcome to avoid waiting
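The branch-prediction bullet can be made concrete with a small software model. This is a sketch (not from the lecture) of a 2-bit saturating-counter predictor, one classic hardware scheme for speculating on branch outcomes; the branch address 0x400 and the outcome pattern are invented for illustration.

```python
class TwoBitPredictor:
    """Per-branch 2-bit counters: states 0,1 predict not-taken; 2,3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch PC -> counter state in [0, 3]

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2  # True means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        # Saturate at 0 and 3 so one anomalous outcome does not flip
        # a strongly biased branch.
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

predictor = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8  # a loop branch with one exit
hits = 0
for taken in outcomes:
    if predictor.predict(0x400) == taken:
        hits += 1
    predictor.update(0x400, taken)
# The single loop-exit mispredict costs only one wrong guess
# afterwards, because the counter only drops from 3 to 2.
```

The 2-bit (rather than 1-bit) state is the design point: a single loop exit does not cause two mispredictions on re-entry.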
Tech. Trend: Memory Sizes
• Memories have grown very dense
  – Feeding application demand for large, complex software
[Figure: bits per DRAM chip vs. year, 1980–2000: 64 Kb, 256 Kb, 1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb]
Tech. Trend: Memory Speeds
• Main memory speeds have not kept up with processor speeds
[Figure: memory and processor cycle times vs. year, 1980–2000, on a 20 ns to 260 ns scale; the memory-to-processor cycle-time ratio grows from 1.3:1 in 1980 to 90+:1 in 2000]
Memory Hierarchies
• Gap between processor and memory performance has led to widespread use of memory hierarchies
  – 1960s: no caches, no virtual memory
  – 1970s: shared I- & D-cache, 32-bit virtual memory
  – 1980s: split I- and D-caches
  – 1990s: two-level caches, 64-bit virtual memory
  – 2000s: multi-level caches, both on- and off-chip
Memory Hierarchies
[Figure: the hierarchy from small/fast to large/slow: registers and L1 cache inside the processor; L2 cache, L3 cache, and main memory in the memory system]
I/O: a Key System Component
• I/O has evolved into a major distinguishing feature of computer systems
  – 1960s: disk, tape, punch cards, tty; batch processing
  – 1970s: character-oriented displays
  – 1980s: video displays, audio, increasing disk sizes, beginning networking
  – 1990s: 3D graphics; networking a fundamental element; high-quality audio
  – 2000s: real-time video, immersion…
I/O Systems
• A hierarchy that divides bandwidth
[Figure: processor and DRAM on a local bus; a high-speed I/O bus with a LAN interface, frame buffer/monitor, and expansion slots; a slow-speed I/O bus behind a controller, serving hard drives, CD-ROM, and floppy]
• Data rates
  – Memory: 100 MHz, 8 bytes => 800 MB/s (peak)
  – PCI: 33 MHz, 4 bytes wide => 132 MB/s (peak)
  – SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes) => 80 MB/s (peak)
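The peak data rates above are simply transfer clock rate times bus width; a quick sanity check of the slide's numbers:

```python
def peak_mb_per_s(clock_mhz, bytes_wide):
    # Peak bandwidth = transfers per microsecond (MHz) x bytes per transfer.
    return clock_mhz * bytes_wide

assert peak_mb_per_s(100, 8) == 800   # memory bus
assert peak_mb_per_s(33, 4) == 132    # PCI
assert peak_mb_per_s(40, 2) == 80     # Ultra2 Wide SCSI
```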
Multiprocessors
• Multiprocessors have been available for decades…
  – 1960s: small MPs
  – 1970s: small MPs
    • Dream of automatic parallelization
  – 1980s: small MPs; emergence of servers
    • Dream of automatic parallelization
  – 1990s: expanding MPs
    • Very large MPPs failed
    • Dream of automatic parallelization fading
  – 2000s: wide-spread MPs; on-chip multithreading
    • Many applications have independent threads
    • Programmers write applications to be parallel in the first place
[Figure: processors (P), each with a cache (C) and memory (M), connected by an interconnection network]
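The last point, programmers writing applications to be parallel in the first place, can be sketched with independent per-chunk tasks. The checksum function and chunk sizes here are invented for illustration; any shared-nothing decomposition works the same way.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    # Independent per-chunk work: no shared state, so chunks can run
    # on separate processors with no coordination beyond the final join.
    return sum(chunk) % 65521

data = list(range(100_000))
chunks = [data[i:i + 25_000] for i in range(0, len(data), 25_000)]

# Explicitly parallel: one task per chunk, results combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(checksum, chunks))

serial = [checksum(c) for c in chunks]
assert partials == serial  # same answer, regardless of scheduling
```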
Evaluating Computer Architectures
Computation Science
• Computation is synthetic
  – Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world
  – Very different from the natural sciences
    » When one discovers a fact about nature, it is a contribution, no matter how small
    » Creating something new alone does not establish a contribution
  – Anyone can create something new in a synthetic field
  – Rather, one must show that the creation is better
What Does “Better” Mean?
• “Better” can mean many things
  – Solves a problem in less time (faster)
  – Solves a larger class of problems (more powerful)
  – Uses resources more efficiently (cheaper)
  – Is less prone to errors (more reliable)
  – Is easier to manage/program (lower human cost)
Amdahl's Law
• Speedup due to enhancement E:

  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

• Find how much speedup comes from some enhancement E
• Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
• Defines the speedup that can be gained by using a special feature
Amdahl’s Law

  ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
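The overall-speedup formula translates directly into code. A minimal sketch (the function name and example numbers are mine, not from the slides):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the old execution time
    is accelerated by a factor S, per the formula above."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 50% of the task by 10x yields well under 2x overall:
s = amdahl_speedup(0.5, 10)   # 1 / (0.5 + 0.05) = 1.818...
```

The unenhanced fraction dominates: even with speedup_enhanced approaching infinity, the overall speedup is capped at 1 / (1 - fraction_enhanced).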
• The “better” property is not simply an observation
  – Rather, the research will postulate that a new idea
    • an architecture, algorithm, protocol, data structure, methodology, language, optimization or model, etc.
  – will lead to a “better” result
  – Making the connection between the idea and the improvement is as important as quantifying the size of the improvement
• The contribution is the idea, and is generally a component of a larger computational system.
How to Evaluate Architecture Ideas
• Measuring/observing/analyzing real systems
  – Accurate results
  – Need a working system
    » Too expensive to evaluate architecture/system ideas
• Analytic models
  – Fast & easy analysis of relations
  – T_program = NumOfInst x (T_cpu + T_m x (1 - CacheHit))
  – Allow extrapolation to ridiculous parameters, e.g. thousands of processors
  – Sometimes infeasible to obtain accuracy (e.g. modeling caches)
  – To obtain reasonable accuracy, the models may become very complex (e.g. modeling network contention)
  – Queuing theory is a commonly used technique
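The analytic model above can be evaluated directly. A sketch with invented illustrative numbers (1 ns of CPU time per instruction, 100 ns memory penalty, 98% cache hit rate):

```python
def t_program(num_inst, t_cpu, t_mem, cache_hit):
    # The slide's model: every instruction costs t_cpu, plus a memory
    # penalty t_mem weighted by the miss ratio (1 - cache_hit).
    return num_inst * (t_cpu + t_mem * (1.0 - cache_hit))

# Illustrative numbers: 1e9 instructions, 1 ns CPU time per
# instruction, 100 ns memory penalty, 98% cache hit rate.
runtime_ns = t_program(1e9, 1.0, 100.0, 0.98)
# Even a 2% miss ratio triples the runtime (1 ns + 2 ns penalty
# per instruction), which is the memory-wall point made earlier.
```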
• Simulation
  – The most popular method in computer architecture and systems research
  – Mimics the architecture/system using software
  – Very flexible: nearly unlimited evaluation
  – Prototyping of non-existing machines possible
  – Evaluation of design options (design space exploration) cheap & flexible
  – Requires some sort of validation
  – Can be VERY slow
• Tradeoff between accuracy and computational intensity
  – Low level of abstraction => slow (e.g. simulating at the level of gates)
  – High level of abstraction => fast (e.g. only simulating processor, cache and memory components)
• The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated
Three Simulation Techniques
• Profile-based, static modeling
  – Simplest and least costly
  – Uses hardware counters on the chip or instrumented execution (such as Beowulf Linux cluster pgprof, SGI perfex, and Alpha ATOM)
• Trace-driven
  – A more sophisticated technique
  – How it works (ex.: modeling memory system performance):
    • Collect traces generated by ATOM
    • Trace format: instruction address executed, data address accessed
    • Build the memory hierarchy model
    • Feed the trace into the simulation model and analyze results
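The trace-driven steps above can be sketched as a toy simulator: a hand-made address trace stands in for ATOM output, and a tiny direct-mapped cache stands in for the memory hierarchy model. All sizes and addresses below are invented for illustration.

```python
# Toy trace-driven simulation: feed address records from a trace
# into a direct-mapped cache model and report the hit ratio.

LINE_SIZE = 64       # bytes per cache line
NUM_LINES = 4        # deliberately tiny so the trace shows misses

def simulate(trace):
    tags = [None] * NUM_LINES           # one tag per direct-mapped line
    hits = 0
    for addr in trace:
        block = addr // LINE_SIZE       # which memory block
        index = block % NUM_LINES       # which cache line it maps to
        if tags[index] == block:
            hits += 1                   # hit: block already resident
        else:
            tags[index] = block         # miss: fill the line
    return hits / len(trace)

# Made-up data-address trace: a small loop touching two arrays,
# plus one conflicting access (512 maps to the same line as 0).
trace = [0, 8, 64, 72, 0, 8, 64, 72, 512, 0]
hit_ratio = simulate(trace)
```

The same loop structure scales to real traces; only the model (associativity, levels, replacement policy) grows.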
Using pgprof
1. Compile: pgcc -Mprof=func prg.cc
2. Run the code: produces a profile data file called pgprof.out
3. View the execution profile: pgprof pgprof.out
Using Perfex
• Usage:
    perfex [-e num] [-y] program [program args]
  – -e num: count only event type num
  – -y: generate a “cost report”
• Example
    perfex -e 41 -e 13 -y a.out

  EVENT #  Event                       Events Counted
  41       Floating point OPs retired  25292884493
  13       L2 cache lines loaded       223490870

  Statistics:
  MFLOPS                      29.175907
  Main memory/L2 bandwidth    8.249655 MB/s
• Execution-driven
  – The most accurate and most costly
  – Trace-driven simulation cannot capture the interaction between the memory system and the processor
  – Detailed simulation of the memory system and the processor pipeline is done simultaneously, by actually executing the program on top of a simulation framework such as Simics, SimOS, or SimpleScalar
Measuring by Means of Benchmarks
• Micro-benchmarks (e.g. instruction latencies, file system throughput)
• Application benchmarks: general system behavior (e.g. SPEC2000 or SPLASH-2)
• Only limited evaluation possible (e.g. limited system support for measurement)
• The machine must be available
• Benchmark suites: collections of kernels and real benchmark programs, lessening the weakness of any one benchmark by the presence of others
Summarize Results
• Weighted Arithmetic Mean Execution Time: sum_i (W_i * T_i)
  – Sum the products of weighting factors and execution times, reflecting the individual frequency of each workload
  – W_i = 1 / (Time_i * sum_{j=1..n} (1 / Time_j))
• Geometric Mean Execution Time: (prod_i (T_i / N_i))^(1/n)
  – Normalize execution times to a reference machine and take the average of normalized execution times
  – Used by SPEC
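Both summary statistics are easy to compute; a sketch (function names are mine). The equalizing weights from the formula above make every workload contribute the same amount to the weighted mean:

```python
import math

def weighted_arith_mean(times, weights):
    # Sum of W_i * T_i, as in the slide.
    return sum(w * t for w, t in zip(weights, times))

def equalizing_weights(times):
    # The slide's W_i = 1 / (Time_i * sum_j(1/Time_j)): weights chosen
    # so each workload contributes equally to the weighted mean.
    total_inv = sum(1.0 / t for t in times)
    return [1.0 / (t * total_inv) for t in times]

def geometric_mean(ratios):
    # nth root of the product of normalized execution times.
    return math.prod(ratios) ** (1.0 / len(ratios))

times = [10.0, 0.1]                  # two programs, wildly different
w = equalizing_weights(times)
wam = weighted_arith_mean(times, w)  # each program contributes equally
gm = geometric_mean([10.0, 0.1])     # normalized ratios cancel to 1.0
```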
A Report Example (P&H figure 1.17)

                   Normalized to A       Normalized to B       Normalized to C
                   A     B      C        A      B     C        A      B     C
  Program P1       1.0   10.0   20.0     0.1    1.0   2.0      0.05   0.5   1.0
  Program P2       1.0   0.1    0.02     10.0   1.0   0.2      50.0   5.0   1.0
  Arithmetic mean  1.0   5.05   10.01    5.05   1.0   1.1      25.03  2.75  1.0
  Geometric mean   1.0   1.0    0.63     1.0    1.0   0.63     1.58   1.58  1.0
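The table's entries can be reproduced from the raw execution times (P1: 1, 10, 20 and P2: 1, 0.1, 0.02 on machines A, B, C). A sketch showing why the geometric mean gives consistent machine ratios under any normalization, while the arithmetic mean does not:

```python
import math

times = {"A": [1.0, 1.0], "B": [10.0, 0.1], "C": [20.0, 0.02]}  # P1, P2

def normalized_means(ref):
    out = {}
    for machine, (p1, p2) in times.items():
        r1, r2 = p1 / times[ref][0], p2 / times[ref][1]
        out[machine] = (round((r1 + r2) / 2, 2),        # arithmetic mean
                        round(math.sqrt(r1 * r2), 2))   # geometric mean
    return out

to_a = normalized_means("A")
# The arithmetic mean changes ranking with the reference machine,
# but C's geometric mean is 0.63 whether we normalize to A or to B.
```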