Module 1 Overview: Introduction to
Computer Architecture and Organization
Architecture & Organization 1
Architecture is those attributes visible to the programmer: instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques.
e.g. Is there a multiply instruction?
Organization is how architectural features are implemented: control signals, interfaces, memory technology.
e.g. Is there a hardware multiply unit, or is multiplication done by repeated addition?
Architecture & Organization 2
All members of the Intel x86 family share the same basic architecture.
The IBM System/370 family shares the same basic architecture.
This gives code compatibility, at least backwards.
Organization differs between different versions.
Structure & Function
Structure is the way in which components relate to each other.
Function is the operation of individual components as part of the structure.
Function
All basic computer functions are:
Data processing–process data in a variety of forms and requirements
Data storage–short- and long-term data storage for retrieval and update
Data movement–move data between the computer and the outside world; devices serve as source and destination of data–the process is known as I/O
Control–control the processing, movement and storage of data using instructions
Functional View
Operations (a) Data movement device operation
Operations (b) Data storage device operation – read and write
Operation (c) Processing from/to storage
Operation (d) Processing from storage to I/O
Structure - Top Level - 1
[Diagram: the Computer comprises Main Memory, Input/Output, Systems Interconnection and the Central Processing Unit; Peripherals and Communication lines connect externally]
Structure - Top Level - 2
Central Processing Unit (CPU) – controls operation of the computer and performs data processing functions
Main memory – stores data
I/O – moves data between the computer and the external environment
System interconnection – provides communication between CPU, main memory and I/O
Structure - The CPU - 1
[Diagram: the Computer comprises I/O, Memory, CPU and the System Bus; the CPU comprises the Arithmetic and Logic Unit, Control Unit, Internal CPU Interconnection and Registers]
Structure - The CPU - 2
Control unit (CU)–controls the operation of the CPU
Arithmetic and logic unit (ALU)–performs data processing functions
Registers–provide internal storage to the CPU
CPU interconnection–provides communication between the control unit (CU), ALU and registers
Structure - The Control Unit
[Diagram: the CPU comprises the ALU, Registers, Internal Bus and Control Unit; the Control Unit comprises Control Memory, Control Unit Registers and Decoders, and Sequencing Logic]
Implementation of control unit – microprogrammed implementation
Module 1 Overview: Von Neumann Machine and Computer
Evolution
First Generation – Vacuum Tubes
ENIAC - background
Electronic Numerical Integrator And Computer
Eckert and Mauchly, University of Pennsylvania
Trajectory tables for weapons (range and trajectory)
Started 1943, finished 1946
Too late for the war effort, but used to determine the feasibility of the hydrogen bomb
Used until 1955
ENIAC - diagram
ENIAC - details
Decimal (not binary)
20 accumulators of 10 digits
Programmed manually by switches
18,000 vacuum tubes
30 tons
15,000 square feet
140 kW power consumption
5,000 additions per second
Von Neumann Machine/Turing
Stored Program concept
Main memory storing programs and data, which could be changed, making programming easier
ALU operating on binary data
Control unit interpreting instructions from memory and executing them
Input and output (I/O) equipment operated by the control unit
Princeton Institute for Advanced Studies
IAS computer
Completed 1952
Structure of von Neumann machine
Von Neumann Architecture
Data and instructions are stored in a single read-write memory
The contents of this memory are addressable by location
Execution occurs in a sequential fashion from one instruction to the next
1000 x 40 bit words
Binary number – number word
2 x 20 bit instructions – instruction word
IAS – details-1
IAS – details-2
Set of registers (storage in CPU)
Memory Buffer Register–contains a word to be stored in memory or sent to the I/O unit, or receives a word from memory or from the I/O unit
Memory Address Register–specifies the address in memory for the MBR
Instruction Register–contains the opcode
Instruction Buffer Register–temporary storage for an instruction
Program Counter–contains the address of the next instruction to be fetched from memory
Accumulator and Multiplier Quotient–temporary storage for operands and results of ALU operations
Structure of IAS – detail-3
Control unit– fetches instructions from memory and executes them one by one
Commercial Computers
1947 - Eckert-Mauchly Computer Corporation–to manufacture computers commercially
UNIVAC I (Universal Automatic Computer)–first commercial computer
US Bureau of the Census 1950 calculations
Became part of Sperry-Rand Corporation
Late 1950s - UNIVAC II
Faster, more memory
IBM
Punched-card processing equipment
1953 - the 701
IBM’s first stored program computer
Scientific calculations
1955 - the 702
Business applications
Led to 700/7000 series
Second Generation: Transistors
Replaced vacuum tubes
Smaller, cheaper, less heat dissipation
Solid state device, made from silicon (sand)
Invented 1947 at Bell Labs
William Shockley et al.
Transistor Based Computers
Second generation machines
More complex arithmetic and logic unit (ALU) and control unit (CU)
Use of high level programming languages
NCR & RCA produced small transistor machines
IBM 7000
DEC - 1957
Produced PDP-1
Transistor Based Computers
The Third Generation: Integrated Circuits
Microelectronics
Literally - “small electronics”
A computer is made up of gates, memory cells and interconnections
Data storage–provided by memory cells
Data processing–provided by gates
Data movement–the paths between components that are used to move data
Control–the paths between components that carry control signals
These can be manufactured on a semiconductor, e.g. a silicon wafer
Integrated Circuits
Early integrated circuits–known as small scale integration (SSI)
Moore’s Law
Increased density of components on chip–refer chart
Gordon Moore – co-founder of Intel
Number of transistors on a chip will double every year–refer chart
Since the 1970s development has slowed a little: the number of transistors now doubles every 18 months
Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
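The 18-month doubling rule can be sketched numerically. The starting count and doubling period below are assumptions for illustration, not measured data:

```python
# Sketch of Moore's Law growth, assuming a doubling every 18 months.
# The starting count is illustrative, not a historical measurement.
def transistors(years_elapsed, start_count=2300, months_per_doubling=18):
    """Project transistor count after the given number of years."""
    doublings = (years_elapsed * 12) / months_per_doubling
    return start_count * 2 ** doublings

# Ten years of 18-month doublings is 2^(120/18), about a 100x increase.
print(round(transistors(10) / transistors(0)))  # 102
```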
Growth in CPU Transistor Count
IBM 360 series
1964
Replaced (& not compatible with) 7000 series
First planned “family” of computers
Similar or identical instruction sets
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e. more terminals)
Increased memory size
Increased cost
Multiplexed switch structure-multiplexor
IBM 360 - images
DEC PDP-8
1964
First minicomputer (after miniskirt!)
Did not need an air conditioned room
Small enough to sit on a lab bench
$16,000 ($100k++ for an IBM 360)
Embedded applications & OEM–integrated into a total system with other manufacturers’ equipment
DEC - PDP-8 Bus Structure
Universal bus structure (Omnibus)–separate signal paths to carry control, data and address signals
Common bus structure–controlled by the CPU
Later Generations: Semiconductor Memory
1950s to 1960s–memory made of rings of ferromagnetic material called cores–fast but expensive, bulky, and destructive read
Semiconductor memory–by Fairchild, 1970
Size of a single core, i.e. 1 bit of magnetic core storage
Holds 256 bits
Non-destructive read
Much faster than core
Capacity approximately doubles each year
Generations of Computer
Vacuum tube - 1946-1957
Transistor - 1958-1964
Small scale integration (SSI) - 1965 on: up to 100 devices on a chip
Medium scale integration (MSI) - to 1971: 100-3,000 devices on a chip
Large scale integration (LSI) - 1971-1977: 3,000-100,000 devices on a chip
Very large scale integration (VLSI) - 1978-1991: 100,000-100,000,000 devices on a chip
Ultra large scale integration (ULSI) - 1991 on: over 100,000,000 devices on a chip
Microprocessors
Intel 1971 - 4004
First microprocessor
All CPU components on a single chip
4 bit design
Followed in 1972 by the 8008
8 bit
Both designed for specific applications
1974 - 8080
Intel’s first general purpose microprocessor
Designed to be the CPU of a general purpose microcomputer
Pentium Evolution - 1
8080
First general purpose microprocessor
8 bit data path
Used in first personal computer – Altair
8086
Much more powerful; 16 bit; instruction cache prefetches a few instructions
8088 (8 bit external bus) used in first IBM PC
80286
16 MB memory addressable
80386
First 32 bit design
Support for multitasking–running multiple programs at the same time
Pentium Evolution - 2
80486
Sophisticated powerful cache and instruction pipelining
Built-in maths co-processor
Pentium
Superscalar technique–multiple instructions executed in parallel
Pentium Pro
Increased superscalar organization
Aggressive register renaming
Branch prediction
Data flow analysis
Speculative execution
Pentium Evolution - 3
Pentium II
MMX technology: graphics, video & audio processing
Pentium III
Additional floating point instructions for 3D graphics
Pentium 4
Note Arabic rather than Roman numerals
Further floating point and multimedia enhancements
Itanium
64 bit, see chapter 15
Itanium 2
Hardware enhancements to increase speed
See Intel web pages for detailed information on processors
Module 1 Computer System: Designing and
Understanding Performance
Designing for Performance: Microprocessor Speed
The microprocessor achieves its full speed potential only if it is fed a constant stream of data and instructions. Techniques used:
Branch prediction–the processor looks ahead in the instruction code and predicts which branches (groups of instructions) are likely to be processed next
Data flow analysis–the processor analyzes which instructions depend on each other’s results or data, to create an optimized schedule of instructions
Speculative execution–using branch prediction and data flow analysis, the processor speculatively executes instructions ahead of their appearance in the program, holding the results in temporary locations
Performance Balance – Processor and Memory
Performance balance–an adjusting of organization and architecture to compensate for the mismatch in capabilities of various components (e.g. processor vs. memory)
Processor speed has increased
Memory capacity has increased
Memory speed lags behind processor speed
Performance Balance – Processor and Memory (Performance Gap)
Performance Balance – Processor and Memory - Solutions
Increase the number of bits retrieved at one time
Make DRAM “wider” rather than “deeper”
Change the DRAM interface
Include cache in the DRAM chip
Reduce frequency of memory access
More complex and efficient cache between processor and memory
Cache on chip/processor
Increase interconnection bandwidth between processor and memory
High speed buses
Hierarchy of buses
Performance Balance: I/O Devices
Peripherals with intensive I/O demands–refer chart
Large data throughput demands–refer chart
Processors can handle this I/O processing, but the problem is moving data between processor and devices
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations
Performance Balance: I/O Devices
Key is Balance: designers must consider 2 factors:
The rate at which performance is changing in the various technology areas (processor, buses, memory, peripherals) differs greatly from one type of element to another
New applications and new peripheral devices constantly change the nature of the demand on the system in terms of typical instruction profile and data access patterns
Improvements in Chip Organization and Architecture Increase hardware speed of processor
Fundamentally due to shrinking logic gate size More gates, packed more tightly, increasing
clock rate Propagation time for signals reduced
Increase size and speed of caches Dedicating part of processor chip
Cache access times drop significantly Change processor organization and architecture
Increase effective speed of execution Parallelism
Problems with Clock Speed and Logic Density
Power
Power density increases with density of logic and clock speed
Dissipating heat
RC delay
Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
Delay increases as the RC product increases
Wire interconnects thinner, increasing resistance
Wires closer together, increasing capacitance
Memory latency
Memory speeds lag processor speeds
Solution: more emphasis on organizational and architectural approaches
Intel Microprocessor Performance
Approach 1: Increased Cache Capacity Typically two or three levels of cache
between processor and main memory (L1, L2, L3)
Chip density increased More cache memory on chip
Faster cache access
Pentium chip devoted about 10% of chip area to cache
Pentium 4 devotes about 50%
Approach 2: More Complex Execution Logic
Enable parallel execution of instructions
Two approaches introduced:
Pipelining
Superscalar (covered later on)
Diminishing Returns from Approach 1 and Approach 2 Internal organization of processors very
complex Can get a great deal of parallelism Further significant increases likely to be
relatively modest Benefits from cache are reaching limit Increasing clock rate runs into power
dissipation problem Some fundamental physical limits are being
reached
New Approach – Multiple Cores Multiple processors on single chip
With large shared cache Within a processor, increase in performance
proportional to square root of increase in complexity If software can use multiple processors, doubling
number of processors almost doubles performance So, use two simpler processors on the chip rather
than one more complex processor With two processors, larger caches are justified
Power consumption of memory logic less than processing logic
Example: IBM POWER4 Two cores based on PowerPC
POWER4 Chip Organization
Module 1 Computer System: Designing and
Understanding Performance
Introduction
Hardware performance is often key to the effectiveness of an entire system of hardware and software
For different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant in determining overall performance
Understanding how best to measure performance, and the limitations of performance measures, is important when selecting a computer system
To understand the issues of assessing performance:
Why does a piece of software perform as it does?
Why can one instruction set be implemented to perform better than another?
How does some hardware feature affect performance?
Defining Performance - 1
How do we say one computer has better performance than another?
Performance based on speed: to take a single passenger from one point to another in the least time – Concorde
Performance based on passenger throughput: to transport 450 passengers from one point to another – 747
Response Time and Throughput - 2
As an individual computer user, you are interested in reducing response time (or execution time)–the time between the start and completion of a task
Data center managers are often interested in increasing throughput–the total amount of work done in a given time
Example: what is improved with the following changes?
Replacing the processor in a computer with a faster version – will improve response time and throughput
Adding additional processors to a system that uses multiple processors for separate tasks (e.g., searching the web) – will improve throughput
Performance and Execution Time - 3
For a computer X, performance is the reciprocal of execution time: Performance(X) = 1 / Execution time(X)
Relative Performance - 4
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
We say A is n times faster than B:
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = n
Performance ratio: 15 / 10 = 1.5
We can say A is 1.5 times faster than B, and B is 1.5 times slower than A
To avoid the potential confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time”
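The ratio above can be expressed as a tiny helper; the function name is my own choice:

```python
def relative_performance(exec_time_a, exec_time_b):
    """Return n such that A is n times faster than B
    (performance ratio = Execution time(B) / Execution time(A))."""
    return exec_time_b / exec_time_a

# Slide example: A runs the program in 10 s, B in 15 s.
n = relative_performance(10, 15)
print(n)  # 1.5 -> A is 1.5 times faster than B
```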
Measuring Performance
Time – the measure of computer performance
Definitions of time: wall-clock time, response time, or elapsed time
CPU execution time (or CPU time)–the time the CPU spends computing for this task, not including time spent waiting for I/O or running other programs
User CPU time vs. system CPU time
Clock cycle time (e.g., 0.25 ns) vs. clock rate (e.g., 4 GHz)
Different applications are sensitive to different aspects of the performance of a computer system
Many applications, especially those running on servers, depend as much on I/O performance as on CPU performance, so total elapsed time measured by a wall clock is the measure of interest
In some application environments, the user may care about throughput, response time, or a complex combination of the two (e.g., maximum throughput with a worst-case response time)
CPU Execution Time
A simple formula relates the most basic metrics (clock cycles and clock cycle time) to CPU time:
CPU time = CPU clock cycles × Clock cycle time = CPU clock cycles / Clock rate
Improving Performance
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We want computer B to run this program in 6 seconds, but B requires 1.2 times as many clock cycles as computer A for this program. What is computer B’s clock rate?
Clock cycles for the program on A:
CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / (4 × 10^9 Hz)
CPU Clock Cycles(A) = 40 × 10^9 cycles
CPU time for B:
CPU Time(B) = 1.2 × CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 × CPU Clock Cycles(A) / Clock Rate(B)
Clock Rate(B) = 1.2 × 40 × 10^9 cycles / 6 seconds
Clock Rate(B) = 48 × 10^9 cycles / 6 seconds
Clock Rate(B) = 8 × 10^9 cycles / second
Clock Rate(B) = 8 GHz
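The derivation above can be checked numerically; the helper below is a sketch with names of my own choosing:

```python
# Given A's time and clock rate, and that B needs 1.2x as many cycles
# but must finish in 6 s, find the clock rate B requires.
def required_clock_rate(time_a, rate_a, cycle_ratio, target_time_b):
    cycles_a = time_a * rate_a            # CPU Clock Cycles(A)
    cycles_b = cycle_ratio * cycles_a     # CPU Clock Cycles(B)
    return cycles_b / target_time_b       # Clock Rate(B)

rate_b = required_clock_rate(10, 4e9, 1.2, 6)
print(rate_b / 1e9)  # 8.0 -> 8 GHz, matching the slide
```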
Clock Cycles Per Instruction (CPI)
The execution time depends on the number of instructions in a program and the average time per instruction:
CPU clock cycles = Instructions for a program × Average clock cycles per instruction
Clock cycles per instruction (CPI)–the average number of clock cycles each instruction takes to execute
CPI provides one way of comparing two different implementations of the same instruction set architecture
Using Performance Equation-1
Suppose we have two implementations of the same instruction set architecture, running the same program. Which computer is faster, and by how much?
Computer A: clock cycle time = 250 ps and CPI = 2.0
Computer B: clock cycle time = 500 ps and CPI = 1.2
Let I = number of instructions for the program; find the number of clock cycles for A and B:
CPU Clock Cycles(A) = I × CPI(A) = I × 2.0
CPU Clock Cycles(B) = I × CPI(B) = I × 1.2
Compute CPU time for A and B:
CPU Time(A) = CPU Clock Cycles(A) × Clock Cycle Time(A) = I × 2.0 × 250 ps = I × 500 ps
CPU Time(B) = CPU Clock Cycles(B) × Clock Cycle Time(B) = I × 1.2 × 500 ps = I × 600 ps
Using Performance Equation-2
Clearly A is faster. The amount faster is the ratio of execution times:
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = (I × 600 ps) / (I × 500 ps) = 1.2
We can conclude that A is 1.2 times faster than B for this program
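The two-implementation example above can be reproduced in a short script; the function name and the choice of picoseconds as the unit are my own:

```python
# CPU time = instruction count x CPI x clock cycle time (in ps here).
def cpu_time(instr_count, cpi, cycle_time_ps):
    return instr_count * cpi * cycle_time_ps

I = 1  # the common instruction count cancels in the ratio
time_a = cpu_time(I, 2.0, 250)  # computer A: I x 500 ps
time_b = cpu_time(I, 1.2, 500)  # computer B: I x 600 ps
print(time_b / time_a)  # 1.2 -> A is 1.2 times faster
```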
Basic Performance Equation
Basic performance equation: CPU time = Instruction count × CPI × Clock cycle time
Instruction count can be measured by using software tools that profile the execution, or by using a simulator of the architecture
Hardware counters, which are included on many processors, can alternatively be used to record a variety of measurements:
Number of instructions executed
Average CPI
Sources of performance loss
Basic Components of Performance
Measuring the CPI
Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using their individual clock cycle counts:
CPU Clock Cycles = Σ (CPI_i × C_i), summed for i = 1 to n, where
C_i = count of the number of instructions of class i executed
CPI_i = average number of cycles per instruction for that instruction class
n = number of instruction classes
Remember that the overall CPI for a program will depend on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution
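The class-based summation can be sketched as follows; the example counts and per-class CPIs are illustrative values:

```python
# CPU clock cycles = sum over instruction classes of CPI_i * C_i.
def total_cycles(classes):
    """classes: list of (count_i, cpi_i) pairs, one per instruction class."""
    return sum(count * cpi for count, cpi in classes)

# e.g. three classes with CPIs 1, 2 and 3:
cycles = total_cycles([(2, 1), (1, 2), (2, 3)])
print(cycles)  # 10
```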
Affecting Factors for the CPU Performance
Comparing Code Segments-1
A compiler designer is trying to decide between 2 code sequences for a particular computer.
Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence?
Comparing Code Segments-2
Sequence 1: Instruction Count(1) = 2 + 1 + 2 = 5 instructions
Sequence 2: Instruction Count(2) = 4 + 1 + 1 = 6 instructions
CPU Clock Cycles(1) = (2 × 1) + (1 × 2) + (2 × 3) = 10 cycles
CPU Clock Cycles(2) = (4 × 1) + (1 × 2) + (1 × 3) = 9 cycles
So code sequence 2 is faster, even though it executes 1 extra instruction; since sequence 2 uses fewer clock cycles, it must have a lower CPI
CPI = CPU Clock Cycles / Instruction Count
CPI(1) = CPU Clock Cycles(1) / Instruction Count(1) = 10/5 = 2
CPI(2) = CPU Clock Cycles(2) / Instruction Count(2) = 9/6 = 1.5
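The whole comparison can be recomputed in a few lines; the function and variable names are my own:

```python
# Compare two code sequences by instruction count, total cycles, and CPI.
def analyze(counts, cpis):
    instr = sum(counts)
    cycles = sum(c * p for c, p in zip(counts, cpis))
    return instr, cycles, cycles / instr  # (instruction count, cycles, CPI)

cpis = [1, 2, 3]                 # cycles per instruction for classes A, B, C
seq1 = analyze([2, 1, 2], cpis)  # (5, 10, 2.0)
seq2 = analyze([4, 1, 1], cpis)  # (6, 9, 1.5)
print(seq1, seq2)  # sequence 2 uses fewer cycles despite one more instruction
```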
Module 1 Computer System: Computer
Components and Bus Interconnection Structure
Von Neumann Architecture
Data and instructions are stored in a single read-write memory
The contents of this memory are addressable by location
Execution occurs in a sequential fashion from one instruction to the next
Program Concept Hardwired program-connecting/combining
various logic components to store data and perform arithmetic and logic operations
Hardwired systems are inflexible General purpose hardware can do different
tasks, given correct control signals Instead of re-wiring, supply a new set of
control signals
What is a program? A sequence of steps For each step, an arithmetic or logical
operation is done For each operation, a different/new set of
control signals is needed For each operation a unique code is
provided e.g. ADD, MOVE
A hardware segment accepts the code and issues the control signals
Hardware and Software Approaches
Components The Control Unit and the Arithmetic and
Logic Unit constitute the Central Processing Unit
Data and instructions need to get into the system and results out Input/output
Temporary storage of code and results is needed Main memory
Computer Components:Top Level View
Interconnection Structures All the units (processor, memory
and I/O components) must be connected – interconnection structure
Different type of connection for different type of unit Memory Input/Output CPU
Computer Modules
Memory Connection Receives and sends data Receives addresses (of locations) Receives control signals
Read Write Timing
Input/Output Connection(1) Similar to memory from computer’s
viewpoint Output
Receive data from computer Send data to peripheral
Input Receive data from peripheral Send data to computer
Input/Output Connection(2) Receive control signals from
computer Send control signals to peripherals
e.g. spin disk Receive addresses from computer
e.g. port number to identify peripheral Send interrupt signals (control)
CPU Connection Reads instruction and data Writes out data (after processing) Sends control signals to other units Receives (& acts on) interrupts
Bus Interconnection
There are a number of possible interconnection systems
Single and multiple bus structures are most common
e.g. Control/Address/Data bus (PC)
e.g. Unibus (DEC PDP)
What is a Bus? A communication pathway
connecting two or more devices Usually broadcast/shared medium Often grouped
A number of channels in one bus e.g. 32 bit data bus is 32 separate
single bit channels Power lines may not be shown
Bus Interconnection Scheme
Data Bus Carries data
Remember that there is no difference between “data” and “instruction” at this level
Width is a key determinant of performance 8, 16, 32, 64 bit
Address Bus
Identifies the source or destination of data
e.g. the CPU needs to read an instruction (data) from a given location in memory
Bus width determines maximum memory capacity of the system
e.g. the 8080 has a 16 bit address bus giving a 64K address space
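The capacity claim is simple arithmetic: a w-bit address bus can name 2^w distinct locations, sketched below:

```python
# An address bus of width w bits can address 2**w distinct locations.
def address_space(bus_width_bits):
    return 2 ** bus_width_bits

print(address_space(16))  # 65536 -> the 8080's 64K address space
```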
Control Bus
Control and timing information for the data and address lines. Typical control lines:
Memory read/write signal
Interrupt request
Clock signals
Bus grant/request
Big and Yellow? What do buses look like?
Parallel lines on circuit boards Ribbon cables Strip connectors on mother boards
e.g. PCI Sets of wires
Physical Realization of Bus Architecture
Single Bus Problems
Lots of devices on one bus leads to:
Propagation delays
Long data paths mean that co-ordination of bus use can adversely affect performance
A bottleneck if aggregate data transfer approaches bus capacity
Most systems use multiple buses to overcome these problems
Traditional (ISA) (with cache)
High Performance Bus
PCI Bus
Peripheral Component Interconnect
Intel released it to the public domain
32 or 64 bit (32 or 64 data lines)
High bandwidth, processor independent bus
PCI Bus