chapter 1
TRANSCRIPT
CPE 408340 Computer Organization
Chapter 1: Computer Abstractions and Technology
Sa'ed R. Abed [Computer Engineering Department, Hashemite University]
[Adapted from Otmane Ait Mohamed's slides & Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]
Course Administration
Instructor: Sa'ed Rasmi Abed
Instructor's e-mail: [email protected]
Office Hours: Mon, Wed: 9:00 - 10:00 or by appointment
Lecture Time: Mon, Wed: 12:30 - 2:00
Text: Required: Computer Organization and Design, 4th Edition, Patterson and Hennessy, © 2008. Optional: Computer Organization and Architecture: Designing for Performance, 7th Edition, William Stallings, Prentice Hall, July 2005.
Slides: PDF on the course web page (Moodle system)
Course Content
Content:
Principles of computer architecture: CPU datapath and control unit design (single-issue pipelined, superscalar, VLIW), memory hierarchies and design, I/O organization and design, advanced processor design (multiprocessors).
Course goals:
To learn the organizational paradigms that determine the capabilities and performance of computer systems. To understand the interactions between the computer's architecture and its software so that future software designers (compiler writers, operating system designers, database programmers, ...) can achieve the best cost-performance trade-offs, and so that future architects understand the effects of their design choices on software applications.
Course prerequisites:
CPE 408330: Assembly Language and Microprocessor Systems.
What You Should Know
Basic logic design & machine organization:
logic minimization, FSMs, component design
processor, memory, I/O
The organizational paradigms that determine the capabilities and performance of computer systems.
How to create, assemble, run, and debug programs in an assembly language:
MIPS preferred.
The memory hierarchy and how to interface it to a computer.
Course Structure
Design-focused class:
• Various homework assignments throughout the semester
Lectures:
Computer Abstractions and Technology
Instructions: Language of the Computer
Arithmetic for Computers
Review and First Exam
The Processor
Review and Second Exam
Exploiting Memory Hierarchy
Review and Final Exam
Chapter 1 (2 weeks): Sec. 1.1 to 1.4
Chapter 2 (2 1/2 weeks): Sec. 2.1 to 2.7 & 2.10
Chapter 3 (1 1/2 weeks): Sec. 3.1 to 3.4, plus 1/2 week
Chapter 4 (4 weeks): Sec. 4.1 to 4.9, plus 1/2 week
Chapter 5 (3 weeks): Sec. 5.1 to 5.3 & 5.5, plus 1/2 week
Grading Information
Grade determinants:
• First Exam ~20% (Monday, March 12th)
• Second Exam ~25% (Monday, April 16th)
• Final Exam ~50% (TBD)
• Class participation & pop quizzes ~5%
Let me know about any exam conflicts ASAP.
Ethics and Professionalism
Ethics:
Disciplined dealing with moral duty. Moral principles or practice. A system of right behavior.
Professionalism:
The conduct, aims, or qualities that characterize a professional person.
What characterizes a “professional”?
a professional accepts responsibility fully – does not blame others for failure.
a professional is reliable - gets the job done on time.
a professional is competent - gets the correct answer.
a professional works independently – finds out what he/she does not know.
a professional follows up on all the details.
a professional has high standards of ethical behavior – does not lie or cheat.
a professional does not steal the work of others and present it as his own.
What characterizes a “professional”?
a professional is respectful to others.
a professional does not offer excuses in lieu of completed work.
a professional is resourceful.
a professional has initiative.
a professional succeeds in spite of obstacles and road blocks.
a professional has justifiable self-confidence.
The Student is the Product of our Engineering School
We are an accredited engineering school: our product is engineering professionals.
Employers expect our graduates to behave like professionals.
Employers seek the qualities of a professional in job interviews.
Professionalism must start in the first semester and be part of every course over four years.
The Student is the Product of our Engineering School
Every student must learn to "think like an engineer":
o accept responsibility for his/her own learning
o follow up on lecture material and homework
o learn problem-solving skills, not just how to solve each specific homework problem
o build a body of knowledge integrated over four years of courses
We all want HU’s excellent reputation to be reinforced so that employers will hire our graduates!
By the architecture of a system, I mean the complete and detailed specification of the user interface. ... As Blaauw has said, "Where architecture tells what happens, implementation tells how it is made to happen."
The Mythical Man-Month, Brooks, p. 45
Moore’s Law
In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).
Amazingly visionary: the million-transistor/chip barrier was crossed in the 1980s.
2300 transistors, 1 MHz clock (Intel 4004) - 1971
16 million transistors (UltraSPARC III)
42 million transistors, 2 GHz clock (Intel Xeon) - 2001
55 million transistors, 3 GHz, 130nm technology, 250mm2 die (Intel Pentium 4) - 2004
140 million transistors (HP PA-8500)
Where is the Market?
[Bar chart: millions of computers sold per year, 1998 through 2002, broken down into Embedded, Desktop, and Servers segments; the plotted values range from 290 million to 1354 million.]
Processor Performance Increase
[Line chart: performance (SPECint, log scale from 1 to 10,000) versus year (1987 through 2003) for processors including the SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, and Intel Pentium 4/3000.]
Growth Capacity of DRAM Chips
K = 1024 (2^10). In recent years the growth rate has slowed to 2x every 2 years.
The Evolution of Computer Hardware
When was the first transistor invented?
Modern-day electronics began with the invention in 1947 of the transfer resistor (the bipolar transistor) by Bardeen et al. at Bell Laboratories.
The Evolution of Computer Hardware
When was the first IC (integrated circuit) invented?
In 1958 the IC was “born” when Jack Kilby at Texas Instruments successfully interconnected, by hand, several transistors, resistors and capacitors on a single substrate
The Underlying Technologies
Year   Technology                    Relative Performance/Unit Cost
1951   Vacuum tube                   1
1965   Transistor                    35
1975   Integrated circuit (IC)       900
1995   Very large scale IC (VLSI)    2,400,000
2005   Submicron VLSI                6,200,000,000
What if technology in the transportation industry advanced at the same rate?
The PowerPC 750
Introduced in 1999
3.65M transistors
366 MHz clock rate
40 mm2 die size
250nm (0.25micron) technology
Technology Outlook
High-volume manufacturing:

Year                        2004  2006  2008  2010  2012  2014  2016  2018
Technology node (nm)        90    65    45    32    22    16    11    8
Integration capacity (BT)   2     4     8     16    32    64    128   256

Delay = CV/I scaling: 0.7, then ~0.7, then >0.7 (delay scaling will slow down)
Energy/logic-op scaling: >0.35, then >0.5, then >0.5 (energy scaling will slow down)
Bulk planar CMOS: high probability early, low probability later; alternates (3G etc.): low probability early, high probability later
Variability: medium, then high, then very high
ILD (K): ~3, then <3, reducing slowly towards 2 to 2.5
RC delay: 1 in every generation
Metal layers: 6-7, then 7-8, then 8-9 (0.5 to 1 layer per generation)
Impacts of Advancing Technology
Processor:
logic capacity: increases about 30% per year
performance: 2x every 1.5 years
Memory:
DRAM capacity: 4x every 3 years, now 2x every 2 years
memory speed: 1.5x every 10 years
cost per bit: decreases about 25% per year
Disk:
capacity: increases about 60% per year

ClockCycle = 1 / ClockRate
500 MHz ClockRate = 2 nsec ClockCycle
1 GHz ClockRate = 1 nsec ClockCycle
4 GHz ClockRate = 250 psec ClockCycle
Computer Organization and Design
This course is all about how computers work.
But what do we mean by a computer?
Different types: embedded, laptop, desktop, server
Different uses: automobiles, graphics, finance, genomics, ...
Different manufacturers: Intel, AMD, IBM, HP, Apple, Sony, Sun, ...
Different underlying technologies and different costs!
Best way to learn:
Focus on a specific instance and learn how it works, while learning general principles and historical perspectives.
Example Machine Organization
Workstation design target:
25% of cost on processor
25% of cost on memory (minimum memory size)
Rest on I/O devices, power supplies, box
[Diagram: the Computer comprises the CPU (Control and Datapath), Memory, and Devices (Input, Output).]
Embedded Computers in Your Car
Why Learn this Stuff?
You want to call yourself a "computer scientist/engineer".
You want to build HW/SW people use (so you need to deliver performance at low cost).
You need to make a purchasing decision or offer "expert" advice.
Both hardware and software affect performance:
The algorithm determines the number of source-level statements.
The language/compiler/architecture determine the number of machine-level instructions (Chapters 1, 2, and 3).
The processor/memory determine how fast machine-level instructions are executed (Chapters 4 and 5).
What is a Computer?
Components:
processor (datapath, control)
input (mouse, keyboard)
output (display, printer)
memory (cache (SRAM), main memory (DRAM), disk drive, CD/DVD)
network
Our primary focus: the processor (datapath and control).
It is implemented using millions of transistors and impossible to understand by looking at each transistor. We need abstraction!
Major Components of a Computer
PC Motherboard Closeup
Inside the Pentium 4 Processor Chip
Below the Program
High-level language program (in C):

swap (int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

C compiler (one-to-many: one high-level statement becomes several instructions)

Assembly language program (for MIPS):

swap: sll $2, $5, 2
      add $2, $4, $2
      lw  $15, 0($2)
      lw  $16, 4($2)
      sw  $16, 0($2)
      sw  $15, 4($2)
      jr  $31

assembler (one-to-one)

Machine (object) code (for MIPS):
000000 00000 00101 00010 00010 000000
000000 00100 00010 00010 00000 100000
. . .
Advantages of Higher-Level Languages
Higher-level languages:
Allow the programmer to think in a more natural language suited to the intended use (Fortran for scientific computation, Cobol for business programming, Lisp for symbol manipulation, Java for web programming, ...)
Improve programmer productivity: more understandable code that is easier to debug and validate
Improve program maintainability
Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine)
Enabled the emergence of optimizing compilers that produce very efficient assembly code optimized for the target machine
As a result, very little programming is done today at the assembler level.
Machine OrganizationCapabilities and performance characteristics of the principal Functional Units (FUs)
e.g., register file, ALU, multiplexors, memories, ...
The ways those FUs are interconnected
e.g., buses
Logic and means by which information flow between FUs is controlled
The machine’s Instruction Set Architecture (ISA)
Register Transfer Level (RTL) machine description
Instruction Set Architecture (ISA)ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.
“... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”
– Amdahl, Blaauw, and Brooks, 1964
The ISA enables implementations of varying cost and performance to run identical software.
ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.
ISA Type Sales
[Bar chart: millions of processors sold per year, 1998 through 2002, broken down by ISA: Other, SPARC, Hitachi SH, PowerPC, Motorola 68K, MIPS, IA-32, ARM. The original slide is a PowerPoint "comic" bar chart with approximate values (see text for correct values).]
Major Components of a Computer
[Diagram: Processor (Control, Datapath), Memory, and Devices (Input, Output, Network).]
Below the Program
High-level language program (in C):
swap (int v[], int k) . . .

C compiler

Assembly language program (for MIPS):
swap: sll $2, $5, 2
      add $2, $4, $2
      lw  $15, 0($2)
      lw  $16, 4($2)
      sw  $16, 0($2)
      sw  $15, 4($2)
      jr  $31

assembler

Machine (object) code (for MIPS):
000000 00000 00101 00010 00010 000000
000000 00100 00010 00010 00000 100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 00000 00000 001000
Input Device Inputs Object Code
[Diagram: an input device loads the object code for swap into memory. The computer comprises the Processor (Control, Datapath), Memory, and Devices (Input, Output, Network).]
Object Code Stored in Memory
[Diagram: the machine instructions of swap now reside in memory, ready to be executed.]
Processor Fetches an Instruction
The processor fetches an instruction from memory.
Control Decodes the Instruction
Control decodes the instruction (000000 00100 00010 00010 00000 100000) to determine what to execute.
Datapath Executes the Instruction
The datapath executes the instruction (000000 00100 00010 00010 00000 100000) as directed by control: the contents of Reg #4 are ADDed to the contents of Reg #2, and the result is put in Reg #2.
What Happens Next?
The processor fetches the next instruction from memory, and the fetch-decode-execute cycle repeats.
How does it know which location in memory to fetch from next?
Processor Organization
Control needs to have circuitry to:
Decide which is the next instruction and input it from memory
Decode the instruction
Issue signals that control the way information flows between datapath components
Control what operations the datapath's functional units perform
Datapath needs to have circuitry to:
Execute instructions: functional units (e.g., adder) and storage locations (e.g., register file)
Interconnect the functional units so that the instructions can be executed as required
Load data from and store data to memory
What location does it load from and store to?
Output Data Stored in Memory
At program completion the data to be output resides in memory.
Output Device Outputs Data
[Diagram: the output device reads the result data from memory.]
The Instruction Set Architecture (ISA)
[Diagram: software above, hardware below, separated by the instruction set architecture.]
The interface description separating the software and hardware.
The MIPS ISA
Instruction categories:
Load/Store
Computational
Jump and Branch
Floating Point (coprocessor)
Memory Management
Special

Registers: R0 - R31, plus PC, HI, and LO

3 instruction formats, all 32 bits wide:
R-format: OP | rs | rt | rd | sa | funct
I-format: OP | rs | rt | immediate
J-format: OP | jump target

Q: How many are already familiar with the MIPS ISA?
How Do the Pieces Fit Together?
[Layered diagram, top to bottom: Applications; Operating System; Compiler and Firmware; Instruction Set Architecture; Datapath & Control with the Processor, I/O system, Memory system, and network; Digital Design; Circuit Design.]
Coordination of many levels of abstraction,
under a rapidly changing set of forces.
Design, measurement, and evaluation.
Performance Metrics
Purchasing perspective:
given a collection of machines, which has the
- best performance?
- least cost?
- best cost/performance?
Design perspective:
faced with design options, which has the
- best performance improvement?
- least cost?
- best cost/performance?
Both require:
a basis for comparison
a metric for evaluation
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors.
Which of these airplanes has the best performance?
Airplane           Passengers   Range (mi)   Speed (mph)
Boeing 737-100     101          630          598
Boeing 747         470          4150         610
BAC/Sud Concorde   132          4000         1350
Douglas DC-8-50    146          8720         544

How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8?
Computer Performance: TIME, TIME, TIME
Response time (latency):
- How long does it take for my job to run?
- How long does it take to execute a job?
- How long must I wait for the database query?
Throughput:
- How many jobs can the machine run at once?
- What is the average execution rate?
- How much work is getting done?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?
Execution Time
Elapsed time:
counts everything (disk and memory accesses, I/O, etc.)
a useful number, but often not good for comparison purposes
CPU time:
doesn't count I/O or time spent running other programs
can be broken up into system time and user time
Our focus: user CPU time, the time spent executing the lines of code that are "in" our program.
Book's Definition of Performance
For some program running on machine X,
PerformanceX = 1 / Execution timeX
"X is n times faster than Y":
PerformanceX / PerformanceY = n
Problem:
machine A runs a program in 20 seconds
machine B runs the same program in 25 seconds
How much faster is A than B?
Defining (Speed) Performance
Normally we are interested in reducing:
Response time (aka execution time): the time between the start and the completion of a task
- Important to individual users
Thus, to maximize performance, we need to minimize execution time.
Throughput: the total amount of work done in a given time
- Important to data center managers
Decreasing response time almost always improves throughput.

performanceX = 1 / execution_timeX

If X is n times faster than Y, then
performanceX / performanceY = execution_timeY / execution_timeX = n
Performance Factors
We want to distinguish elapsed time from the time spent on our task.
CPU execution time (CPU time): the time the CPU spends working on a task.
It does not include time waiting for I/O or running other programs.

CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate

We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Review: Machine Clock Rate
Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period):
CC = 1 / CR

10 nsec clock cycle  => 100 MHz clock rate
5 nsec clock cycle   => 200 MHz clock rate
2 nsec clock cycle   => 500 MHz clock rate
1 nsec clock cycle   => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
Clock Cycles
Instead of reporting execution time in seconds, we often use cycles.
Clock "ticks" indicate when to start activities (one abstraction):
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

seconds/program = cycles/program x seconds/cycle

A 4 GHz clock has a cycle time of 1 / (4 x 10^9 cycles/sec) = 250 x 10^-12 sec = 250 picoseconds (ps).
How to Improve Performance

seconds/program = cycles/program x seconds/cycle

So, to improve performance (everything else being equal) you can either (increase or decrease?):
________ the # of required cycles for a program, or
________ the clock cycle time or, said another way,
________ the clock rate.
How many cycles are required for a program?
Could assume that the number of cycles equals the number of instructions.
[Timeline: 1st instruction, 2nd instruction, 3rd instruction, 4th, 5th, 6th, ... laid out one per clock tick.]
This assumption is incorrect: different instructions take different amounts of time on different machines.
Why? Hint: remember that these are machine instructions, not lines of C code.
Different numbers of cycles for different instructions
Multiplication takes more time than addition.
Floating point operations take longer than integer ones.
Accessing memory takes more time than accessing registers.
Important point: changing the cycle time often changes the number of cycles required for various instructions.
CSE431 L01 Introduction.71 Irwin, PSU, 2005
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute.
One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute.
A way to compare two different implementations of the same ISA.

# CPU clock cycles for a program = # instructions for a program x average clock cycles per instruction

CPI for an instruction class:
Instruction class   A   B   C
CPI                 1   2   3
Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging:

Overall effective CPI = Σ (CPIi x ICi), summed over i = 1 to n

where ICi is the count (percentage) of the number of instructions of class i executed, CPIi is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes.
The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs.
THE Performance Equation
Our basic performance equation is then:

CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = Instruction_count x CPI / clock_rate

These equations separate the three key factors that affect performance:
We can measure the CPU execution time by running the program.
The clock rate is usually given.
We can measure the overall instruction count by using profilers/simulators without knowing all of the implementation details.
CPI varies by instruction type and ISA implementation, for which we must know the implementation details.
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                         Instruction_count   CPI   clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X
Technology                                         X
A Simple Example

Op       Freq   CPIi   Freq x CPIi
ALU      50%    1      .5
Load     20%    5      1.0
Store    10%    3      .3
Branch   20%    2      .4
                Σ  =   2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Load CPI becomes 2, so Σ = .5 + .4 + .3 + .4 = 1.6
CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster.

How does this compare with using branch prediction to save a cycle off the branch time?
Branch CPI becomes 1, so Σ = .5 + 1.0 + .3 + .2 = 2.0
CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster.

What if two ALU instructions could be executed at once?
Effective ALU CPI becomes 0.5, so Σ = .25 + 1.0 + .3 + .4 = 1.95
CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster.
Comparing and Summarizing Performance
The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)).
How do we summarize the performance of a benchmark set with a single number?
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):

AM = (1/n) Σ Timei, summed over i = 1 to n

where Timei is the execution time for the ith program of a total of n programs in the workload.
A smaller mean indicates a smaller average execution time and thus improved performance.
Remember
Performance is specific to a particular program or set of programs.
Total execution time is a consistent summary of performance.
For a given architecture, performance increases come from:
increases in clock rate (without adverse CPI effects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or instruction count
algorithm/language choices that affect instruction count
Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance.
Summary: Evaluating ISAs
Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?
Static metrics:
How many bytes does the program occupy in memory?
Dynamic metrics:
How many instructions are executed? How many bytes does the processor fetch to execute the program?
How many clocks are required per instruction?
How "lean" a clock is practical?
Best metric: time to execute the program!
Time depends on the instruction count, CPI, and cycle time, which in turn depend on the instruction set, the processor organization, and compilation techniques.
Next Lecture and Reminders
Next lecture:
Instructions: Language of the Computer
- Reading assignment: Chapter 2