mamas – computer architecture 234367
DESCRIPTION
Alex Gontmakher. MAMAS – Computer Architecture 234367. Some of the slides were taken from: (1) Lihu Rapoport (2) Randy Katz and (3) Patterson. General course information. Grade: 20% exercises (most likely 5 assignments), 80% final exam.
introduction© Avi Mendelson, 3/2005 1
MAMAS – Computer Architecture 234367
Alex Gontmakher
Some of the slides were taken from:
(1) Lihu Rapoport (2) Randy Katz and (3) Patterson
General course information
Grade: 20% exercises (most likely 5 assignments), 80% final exam.
Textbooks:
– Computer Architecture: A Quantitative Approach, Hennessy & Patterson (preferably 3rd edition)
– Computer Organization and Design: The Hardware/Software Interface, Patterson & Hennessy
Other course information: see the course web site.
Notes on the course structure
The lectures and the tutorials are coordinated; one cannot be understood without the other.
Submitting exercises
– In pairs or individually
– Printed work (not handwritten)
– If copying is detected (in exercises or in the exam), everyone involved receives a final grade of 0 (zero)
In weeks in which one of the teaching assistants is absent and/or one of the tutorials falls on a holiday (while the other tutorials that week take place as usual), the missed tutorial will be made up as soon as possible, but students are advised to join another tutorial.
The course material is not identical to what was taught in previous semesters. Therefore:
– If material was not taught (in a lecture or a tutorial), it will not appear in the exam
– New material is included in the exam
The slides mix English and Hebrew because they were borrowed from different sources.
Before we start
The paradigm (Patterson)
Every Computer Scientist should master the “AAA”
Architecture Algorithms Applications
Computer Architecture
The goal of Computer Architecture
To build “cost effective systems”
– How do we calculate the cost of a system?
– How do we evaluate the effectiveness of the system?
To optimize the system
– What are the optimization points?
Fact: most computer systems still use the Von Neumann principle of operation, even though internally they are very different from the computers of that time.
Why Computer Architecture?
– We, computer architects, were lucky enough to have a real impact on computer technology
– We need to understand the hardware trends
– Most of our work is in the fields of performance evaluation and the algorithms that are implemented by the hardware
Introduction
Computer System Structure
[Figure: block diagram of a computer system. Labeled blocks: CPU, Cache, CPU bus, Bridge, Memory, Mem bus, I/O bus, Graphic adapter, Video buffer, SCSI/IDE adapter, SCSI bus, Hard disk, LAN adapter, LAN, USB hub, Keyboard, Mouse, Scanner. The diagram is divided into “North” and “South” parts.]
Computer systems: how it looks in real systems
“North” – CPU + memory subsystem
I/O slots
Class Focus
Performance – how to achieve it and how to measure it
CPU
– CPU design, pipeline, hazards
– Out-of-order and speculative execution
Memory Hierarchy
– Main memory
– Cache
– Virtual memory
PC Architecture
– Disks, I/O
Advanced topics
– Software optimizations
We will not focus on
– Low-level hardware details
– Parallel and distributed systems (although we mention some of their basic technologies)
Trends in Computer Technologies
Technology Trends
         Capacity         Speed
Logic    2x in 3 years    2x in 3 years
DRAM     4x in 3 years    1.4x in 10 years
Disk     2x in 3 years    1.4x in 10 years
CPU Performance Trends
Logic speed: 2x per 3 years
Logic capacity: 2x per 3 years
Computing capacity (speed × capacity): 4x per 3 years
BUT:
– Only if we could keep all the transistors busy all the time
– Actual: 3.3x per 3 years
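The compounding behind the CPU performance figures can be checked with a few lines of Python. This is only a sketch: the helper name `annual_factor` is introduced here for illustration, and the inputs are the per-3-year factors quoted on the slide.

```python
def annual_factor(factor: float, years: float) -> float:
    """Convert a growth factor over `years` years into a per-year factor."""
    return factor ** (1.0 / years)


logic_speed = 2.0     # 2x per 3 years (from the slide)
logic_capacity = 2.0  # 2x per 3 years (from the slide)

# Ideal computing capacity: the speed and capacity gains multiply.
ideal = logic_speed * logic_capacity  # 4x per 3 years
actual = 3.3                          # the "actual" figure quoted on the slide

print(f"ideal : {ideal:.1f}x per 3 years, {annual_factor(ideal, 3):.2f}x per year")
print(f"actual: {actual:.1f}x per 3 years, {annual_factor(actual, 3):.2f}x per year")
```

On a per-year basis the ideal 4x corresponds to roughly 1.59x and the actual 3.3x to roughly 1.49x.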
Technology and Computer Architecture
[Figure: SPECint92/MHz (vertical axis, 0 to 2.0) versus clock rate in MHz (horizontal axis, 50 to 400) for Alpha (21064, 21164), x86 (Pentium, Pentium Pro) and PowerPC processors. Annotations: “Speed demons”, “SPECInt92 = 100”.]
Source: ISCA 95, p. 174
Can it last forever, or are new challenges coming?
[Figure, panel “Power”: maximum power (W, logarithmic scale from 1 to 100) versus process (1.5, 1.0, 0.8, 0.6, 0.35, 0.25, 0.18 microns) for the 386, 486, Pentium, Pentium MMX, Pentium Pro and Pentium II.]
[Figure, panel “Power density”: power density (Watts/cm², logarithmic scale from 1 to 1000) for the i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III processors, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface.]
Considerations in computer design
Architecture & Microarchitecture
Architecture (ISA – Instruction Set Architecture): the collection of features of a processor (or a system) as they are seen by the “user”
– User: a binary executable running on the processor, or
– an assembly-level programmer
Microarchitecture (µarch, uarch): the collection of features, or the way of implementation, of a processor (or a system) that do not affect the user
Architecture & Microarchitecture Elements
Architecture:
– Register data width (8/16/32)
– Instruction set
– Addressing modes
– Addressing methods (segmentation, paging, etc.)
Microarchitecture:
– Physical memory size
– Cache size and structure
– Number of execution units, number of execution pipelines
– Branch prediction
– TLB
Timing is considered µarch (though it is user visible!)
Processors with the same architecture may have different µarch
Compatibility
Backward compatibility
– New hardware can run existing software
– Example: Pentium 4 can run software originally written for Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
– New software can run on existing hardware
– Example: new software written with MMX™ must still run on older Pentium processors which do not support MMX™
– Less important than backward compatibility
New ideas: architecture independent
– JIT – just-in-time compiler: Java and .NET
– Binary translation
How to compare between different systems?
Benchmarks – Programs for Evaluating Processor Performance
Toy benchmarks
– 10-100 line programs
– e.g., sieve, puzzle, quicksort
Synthetic benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
Real programs
– e.g., gcc, spice
SPEC: System Performance Evaluation Cooperative
– SPECint (8 integer programs)
– and SPECfp (10 floating point programs)
CPI – to compare systems with the same instruction set architecture (ISA)
The CPU is synchronous: it works according to a clock signal.
– Clock cycle is measured in nsec (10^-9 of a second).
– Clock rate (= 1/clock cycle) is measured in MHz (10^6 cycles/second).
CPI – cycles per instruction
– Average #cycles per instruction (in a given program)
– IPC (= 1/CPI): instructions per cycle
Clock rate is mainly affected by technology, CPI by the architecture
CPI breakdown: how many cycles (on average) the program spends on different causes, e.g., executing, memory, I/O, etc.
$CPI = \frac{\#\text{cycles required to execute the program}}{\#\text{instructions executed in the program}}$
CPI (cont.)
CPI_i – #cycles to execute a given type of instruction
– e.g., CPI_add = 1, CPI_mul = 3
– Independent of the program
Calculating the CPI of a program
– IC_i – #times an instruction of type i was executed in the program
– IC – #instructions executed in the program: $IC = \sum_{i=1}^{n} IC_i$
– F_i – relative frequency of instructions of type i: $F_i = IC_i / IC$
– Ncyc – #cycles required to execute the program: $N_{cyc} = \sum_{i=1}^{n} CPI_i \times IC_i$
– CPI: $CPI = \frac{N_{cyc}}{IC} = \frac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$
– This calculation does not take into account other delays such as memory and I/O
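The CPI calculation can be sketched in Python as follows. The per-type values CPI_add = 1 and CPI_mul = 3 come from the slide; the instruction counts and the function name `program_cpi` are hypothetical, chosen only to make the arithmetic concrete.

```python
def program_cpi(cpi_by_type, count_by_type):
    """Average CPI of a program: Ncyc / IC, ignoring memory and I/O delays."""
    ic = sum(count_by_type.values())                       # IC = sum of IC_i
    ncyc = sum(cpi_by_type[t] * n                          # Ncyc = sum of CPI_i * IC_i
               for t, n in count_by_type.items())
    return ncyc / ic


cpi_by_type = {"add": 1, "mul": 3}       # CPI_add and CPI_mul from the slide
count_by_type = {"add": 80, "mul": 20}   # hypothetical instruction counts (IC_i)

cpi = program_cpi(cpi_by_type, count_by_type)

# Same result through the relative frequencies F_i = IC_i / IC.
ic = sum(count_by_type.values())
cpi_from_freq = sum(cpi_by_type[t] * (n / ic) for t, n in count_by_type.items())

print(f"CPI = {cpi:.2f}, IPC = {1 / cpi:.2f}")
```

With these assumed counts the program needs 140 cycles for 100 instructions, so CPI = 1.4 by either route.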
CPU Time
CPU Time – the time required by the CPU to execute a given program:
CPU Time = clock cycle × #cyc = clock cycle × CPI × IC
Our goal: minimize CPU Time
– Minimize clock cycle: more MHz (process, circuit, µarch)
– Minimize CPI: µarch (e.g., more execution units)
– Minimize IC: architecture (e.g., MMX™ technology)
Speedup due to enhancement E:
$Speedup(E) = \frac{ExTime\ \text{w/o}\ E}{ExTime\ \text{w/}\ E} = \frac{Performance\ \text{w/}\ E}{Performance\ \text{w/o}\ E}$
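A small Python sketch of the CPU Time formula and of the speedup due to an enhancement. All numbers here (a 2 ns clock cycle, one million instructions, CPI dropping from 1.4 to 1.0) are hypothetical, and the function names are introduced only for this illustration.

```python
def cpu_time_ns(clock_cycle_ns, cpi, ic):
    """CPU Time = clock cycle x CPI x IC, in nanoseconds."""
    return clock_cycle_ns * cpi * ic


def speedup(time_without_e, time_with_e):
    """Speedup(E) = ExTime without E / ExTime with E."""
    return time_without_e / time_with_e


# Hypothetical baseline: 2 ns cycle (500 MHz), CPI 1.4, one million instructions.
t_old = cpu_time_ns(clock_cycle_ns=2.0, cpi=1.4, ic=1_000_000)

# Hypothetical enhancement: extra execution units lower the CPI to 1.0.
t_new = cpu_time_ns(clock_cycle_ns=2.0, cpi=1.0, ic=1_000_000)

print(f"old: {t_old / 1e6:.1f} ms, new: {t_new / 1e6:.1f} ms, "
      f"speedup: {speedup(t_old, t_new):.2f}")
```

Because only the CPI changes in this example, the speedup equals the CPI ratio, 1.4.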
Amdahl’s Law
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then, with Fraction_enhanced = F and Speedup_enhanced = S:
$ExTime_{new} = ExTime_{old} \times \left[ (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]$
$Speedup_{overall} = \frac{ExTime_{old}}{ExTime_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$
Amdahl’s Law: Example
• Floating point instructions are improved to run 2x faster, but only 10% of the actual instructions are FP
$ExTime_{new} = ExTime_{old} \times (0.9 + 0.1/2) = 0.95 \times ExTime_{old}$
$Speedup_{overall} = \frac{1}{0.95} = 1.053$
Corollary: Make The Common Case Fast
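Amdahl’s Law is easy to experiment with in code. The sketch below reproduces the floating point example from the slide; the function name `amdahl_speedup` and the input checks are additions made for this illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is accelerated by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Slide example: FP instructions run 2x faster, but are only 10% of the work.
print(round(amdahl_speedup(0.10, 2.0), 3))           # 1.053

# Upper bound: even an infinitely fast FP unit gives at most 1 / (1 - 0.1).
print(round(amdahl_speedup(0.10, float("inf")), 3))  # 1.111
```

The second call shows why the common case matters: the unenhanced 90% caps the overall speedup at about 1.11 no matter how fast the enhanced part becomes.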
Amdahl's law: How speedup starts
Amdahl's law: Where speedup ends
Instruction Set Design
[Figure: the instruction set shown as the layer between software and hardware.]
The ISA is what the user and the compiler see
The ISA is what the hardware needs to implement
Why is the ISA important?
Code size
– Long instructions may take more time to be fetched
– Larger code requires more memory (important in small devices, e.g., cell phones)
Number of instructions (IC)
– Reducing IC reduces execution time (assuming the same CPI and frequency)
Code “simplicity”
– Simple HW implementation, which leads to higher frequency and lower power
– Code optimization can be applied better to “simple code”
The impact of the ISA
RISC vs CISC
CISC Processors
CISC – Complex Instruction Set Computer
The idea: a high-level machine language
Characteristics
– Many instruction types, with many addressing modes
– Some of the instructions are complex: they perform complex tasks and require many cycles
– ALU operations directly on memory; usually uses a limited number of registers
– Variable-length instructions: common instructions get short codes, which saves code length
Example: x86
CISC Drawbacks
Compilers do not take advantage of the complex instructions and the complex indexing methods
Implementing complex instructions and complex addressing modes
– complicates the processor
– slows down the simple, common instructions
– contradicts the corollary of Amdahl’s law: Make The Common Case Fast
Variable-length instructions are a real pain in the neck:
– It is difficult to decode a few instructions in parallel: as long as an instruction is not decoded, its length is unknown, so it is unknown where the instruction ends and where the next instruction starts
– An instruction may not fit the “right behavior” of the memory hierarchy (to be discussed in later lectures)
Examples: VAX, x86 (!?!)
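The decoding problem with variable-length instructions can be illustrated with a toy model in Python. The encoding below is entirely hypothetical (the first byte of every instruction holds its total length in bytes) and does not correspond to x86 or VAX; it only shows that instruction boundaries must be found one after another, whereas a fixed-length encoding gives every boundary up front.

```python
def instruction_starts_variable(code: bytes) -> list:
    """Find instruction start offsets in a toy variable-length encoding.

    Hypothetical format: the first byte of each instruction is its total
    length in bytes. Each start depends on decoding the previous instruction.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]
        if length == 0:
            raise ValueError(f"invalid length byte at offset {pc}")
        pc += length  # the next start is known only after this decode step
    return starts


def instruction_starts_fixed(code: bytes, width: int = 4) -> list:
    """With fixed-length instructions all start offsets are known immediately."""
    return list(range(0, len(code), width))


# Four toy instructions of lengths 2, 3, 1 and 4 bytes.
code = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 1, 2, 3])
print(instruction_starts_variable(code))      # [0, 2, 5, 6]
print(instruction_starts_fixed(bytes(12)))    # [0, 4, 8]
```

In the fixed-length case several decoders could each be handed an offset in the same cycle; in the variable-length loop each iteration has to wait for the previous one.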
RISC Processors
RISC – Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
– A small instruction set, with only a few instruction formats
– Simple instructions: they execute simple tasks and require a single cycle (with pipeline)
– A few indexing methods
– ALU operations on registers only: memory is accessed using Load and Store instructions only; many orthogonal registers; three-address machine: Add dst, src1, src2
– Fixed-length instructions
Examples: MIPS™, SPARC™, Alpha™, PowerPC™
RISC Processors (Cont.)
A simple architecture leads to a simple micro-architecture
– Simple, small and fast control logic
– Simpler to design and validate
– Room for on-die caches: instruction cache + data cache, which parallelizes data and instruction access
– Shorter time-to-market
Using a smart compiler
– Better pipeline usage
– Better register allocation
Existing RISC processors are not “pure” RISC
– e.g., they support division, which takes many cycles
RISC and Amdahl’s Law (Example)
Compared to the CISC architecture:
– 10% of the static code, which executes 90% of the dynamic instructions, has the same CPI
– 90% of the static code, which is only 10% of the dynamic instructions, has a CPI that increases by 60%
– The number of instructions executed increases by 50%
– The speed of the processor is doubled (this was true at the time the RISC processors were invented)
We get
$\frac{CPI_{new}}{CPI_{old}} = 0.9 \times 1 + 0.1 \times 1.6 = 1.06$
And then, with clock denoting the clock cycle time:
$Speedup_{overall} = \frac{CPU\ Time_{old}}{CPU\ Time_{new}} = \frac{clock_{old}}{clock_{new}} \times \frac{CPI_{old}}{CPI_{new}} \times \frac{IC_{old}}{IC_{new}} = \frac{2}{1.06 \times 1.5} = 1.26$
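The arithmetic of this example can be re-checked with a short Python sketch. The ratios are the ones stated on the slide; the variable names are chosen here for readability, and `clock_ratio` is taken to mean old clock cycle time divided by new clock cycle time.

```python
# Ratios of the RISC design relative to the CISC design (from the slide).
dynamic_same_cpi = 0.9    # 90% of dynamic instructions keep the same CPI
dynamic_slower = 0.1      # 10% of dynamic instructions have a higher CPI
cpi_increase = 1.6        # the CPI of that 10% grows by 60%

cpi_ratio = dynamic_same_cpi * 1.0 + dynamic_slower * cpi_increase  # CPI_new / CPI_old
ic_ratio = 1.5            # IC_new / IC_old: 50% more instructions
clock_ratio = 2.0         # clock_old / clock_new: the processor is twice as fast

# CPU Time = clock cycle x CPI x IC, so the speedup is the ratio of the products.
speedup_overall = clock_ratio / (cpi_ratio * ic_ratio)

print(f"CPI_new / CPI_old = {cpi_ratio:.2f}")
print(f"overall speedup   = {speedup_overall:.2f}")
```

The CPI ratio comes out at 1.06 and the overall speedup at about 1.26: the doubled clock more than pays for the extra instructions and the slightly higher CPI.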
So, what is better, RISC or CISC?
Today CISC architectures (x86) run as fast as RISC (or even faster)
The main reasons are:
– CISC instructions are translated into RISC-like instructions (ucode)
– CISC architectures use a “RISC-like engine”
We will discuss this kind of solution later in this course.
Virtual machines (Java)
Machine-independent ISA
– Can be run on different architectures
– Each architecture has an emulation (virtual machine) that forms a “system within the system”
The code can be compiled to native code “on the fly”
– This process is called JIT: Just-In-Time
.NET allows combining different formats of code
– e.g., different programming languages
Pros
– Portability, flexibility
Cons
– Efficiency
– The JIT can apply only very basic optimizations
backup
Integrated Circuits Costs
$IC\ cost = \frac{Die\ cost + Testing\ cost + Packaging\ cost}{Final\ test\ yield}$
$Die\ cost = \frac{Wafer\ cost}{Dies\ per\ wafer \times Die\ yield}$
$Dies\ per\ wafer = \frac{\pi \times (Wafer\_diam / 2)^2}{Die\_Area} - \frac{\pi \times Wafer\_diam}{\sqrt{2 \times Die\_Area}} - Test\ dies$
$Die\ yield = Wafer\ yield \times \left\{ 1 + \frac{Defects\_per\_unit\_area \times Die\_Area}{\alpha} \right\}^{-\alpha}$
Die cost goes roughly with (die area)^4
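The cost formulas on this slide translate directly into code. The sketch below follows the Hennessy & Patterson form of the die yield model, in which the exponent parameter α appears; the default `alpha=3.0` is an assumption, since no value survives in the slide text, and the function names are introduced here. The usage lines take the dies per wafer and yield of the Pentium row in the Real World Examples table as given inputs.

```python
import math


def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Wafer area over die area, minus dies lost at the edge and test dies."""
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
            - test_dies)


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=3.0):
    """Fraction of good dies; alpha is an assumed process parameter."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost, dies, yield_fraction):
    """Die cost = wafer cost / (dies per wafer x die yield)."""
    return wafer_cost / (dies * yield_fraction)


# Pentium row of the table: $1500 wafer, 40 dies per wafer, 9% yield.
print(round(die_cost(1500, 40, 0.09)))   # 417
# 386DX row: $900 wafer, 360 dies per wafer, 71% yield.
print(round(die_cost(900, 360, 0.71)))   # 4
```

Both results agree with the die costs listed in the table, which suggests the table was produced with this same die cost formula.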
Real World Examples
Chip         Metal   Line   Wafer   Defects  Area    Dies/  Yield  Die
             layers  width  cost    /cm²     (mm²)   wafer         cost
386DX        2       0.90   $900    1.0      43      360    71%    $4
486DX2       3       0.80   $1200   1.0      81      181    54%    $12
PowerPC 601  4       0.80   $1700   1.3      121     115    28%    $53
HP PA 7100   3       0.80   $1300   1.0      196     66     27%    $73
DEC Alpha    3       0.70   $1500   1.2      234     53     19%    $149
SuperSPARC   3       0.70   $1700   1.6      256     48     13%    $272
Pentium      3       0.80   $1500   1.5      296     40     9%     $417
– From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15