module 1 overview: introduction to computer architecture and organization

Module 1Overview: Introduction to

Computer Architecture and Organization

Architecture & Organization 1 Architecture is those attributes visible to the

programmer Instruction set, number of bits used for data

representation, I/O mechanisms, addressing techniques.

e.g. Is there a multiply instruction? Organization is how architecture features are

implemented Control signals, interfaces, memory technology. e.g. Is there a hardware multiply unit or is it

done by repeated addition?

Architecture & Organization 2 All Intel x86 family share the same

basic architecture The IBM System/370 family share

the same basic architecture This gives code compatibility

At least backwards Organization differs between

different versions

Structure & Function Structure is the way in which

components relate to each other Function is the operation of

individual components as part of the structure

Function All basic computer functions are:

Data processing–process data in variety of forms and requirements

Data storage–short and long term data storage for retrieval and update

Data movement–move data between computer and outside world. Devices that serve as source and destination of data–the process is known as I/O

Control–control of process, move and store data using instruction.

Functional View

Operations (a) Data movement device operation

Operations (b) Data storage device operation – read and write

Operation (c) Processing from/to storage

Operation (d)Processing from storage to I/O

Structure - Top Level - 1

Computer

Main Memory

InputOutput

SystemsInterconnection

Peripherals

Communicationlines

CentralProcessing Unit

Computer

Structure - Top Level - 2 Central Processing Unit (CPU) – controls

operation of the computer and performs data processing functions

Main memory – stores data I/O – moves data between computer

and external environment System interconnection – provides

communication between CPU, main memory and I/O

Structure - The CPU - 1

Computer Arithmeticand Login Unit

ControlUnit

Internal CPUInterconnection

Registers

CPU

I/O

Memory

SystemBus

CPU

Structure - The CPU - 2 Control unit (CU)–control the operation

of CPU Arithmetic and logic unit (ALU)-performs

data processing functions Registers-provides internal storage to

the CPU CPU Interconnection-provides

communication between control unit (CU), ALU and registers

Structure - The Control Unit

CPU

ControlMemory

Control Unit Registers and Decoders

SequencingLogin

ControlUnit

ALU

Registers

InternalBus

Control Unit

Implementation of control unit – micro programmed implementation

Module 1Overview: Von Neumann Machine and Computer

Evolution

First Generation – Vacuum TubesENIAC - background Electronic Numerical Integrator And Computer Eckert and Mauchly University of Pennsylvania Trajectory tables for weapons (range and

trajectory) Started 1943 Finished 1946

Too late for war effort, but used determine the feasibility of hydrogen bomb

Used until 1955

ENIAC - diagram

ENIAC - details Decimal (not binary) 20 accumulators of 10 digits Programmed manually by switches 18,000 vacuum tubes 30 tons 15,000 square feet 140 kW power consumption 5,000 additions per second

Von Neumann Machine/Turing Stored Program concept Main memory storing programs and data- That could

be changed and programming will become easier ALU operating on binary data Control unit interpreting instructions from memory

and executing Input and output (I/O) equipment operated by control

unit Princeton Institute for Advanced Studies

IAS computer Completed 1952

Structure of von Neumann machine

Von Neumann Architecture Data and instruction are stored in

a single read-write memory The contents of this memory is

addressable by location Execution occurs in a sequential

fashion from one instruction to the next

1000 x 40 bit wordsBinary number – number word

2 x 20 bit instructions – instruction word1

IAS – details-1

IAS – details-2

Set of registers (storage in CPU) Memory Buffer Register–contains word to be

stored in memory or sent to I/O unit, receive a word from memory of from I/O unit

Memory Address Register–specify the address in memory for MBR

Instruction Register-contains opcode Instruction Buffer Register-temporary storage for

instruction Program Counter-contains next instruction to be

fetched from memory Accumulator and Multiplier Quotient-temporary

storage for operands and result of ALU operations

Structure of IAS – detail-3

Control unit– fetches instructions from memory and executes them one by one

Commercial Computers 1947 - Eckert-Mauchly Computer

Corporation-manufacture computer commercially

UNIVAC I (Universal Automatic Computer)-first commercial computer

US Bureau of Census 1950 calculations Became part of Sperry-Rand Corporation Late 1950s - UNIVAC II

Faster More memory

IBM Punched-card processing equipment 1953 - the 701

IBM’s first stored program computer Scientific calculations

1955 - the 702 Business applications

Lead to 700/7000 series

Second Generation: Transistors Replaced vacuum tubes Smaller Cheaper Less heat dissipation Solid State device Made from Silicon (Sand) Invented 1947 at Bell Labs William Shockley et al.

Transistor Based Computers Second generation machines More complex arithmetic and logic

unit(ALU)and control unit(CU) Use of high level programming languages NCR & RCA produced small transistor

machines IBM 7000 DEC - 1957

Produced PDP-1

Transistors Based Computers

The Third Generation: Integrated CircuitsMicroelectronics Literally - “small electronics” A computer is made up of gates, memory cells

and interconnections Data storage-provided by memory cells Data processing-provided by gates Data movement-the paths between components that

are used to move data Control-the paths between components that carry

control singnals These can be manufactured on a

semiconductor e.g. silicon wafer

Integrated Circuits

Early integrated circuits- know as small scale integration (SSI)

Moore’s Law Increased density of components on chip-refer chart Gordon Moore – co-founder of Intel Number of transistors on a chip will double every

year-refer chart Since 1970’s development has slowed a little

Number of transistors doubles every 18 months Cost of a chip has remained almost unchanged Higher packing density means shorter electrical

paths, giving higher performance Smaller size gives increased flexibility Reduced power and cooling requirements Fewer interconnections increases reliability

Growth in CPU Transistor Count

IBM 360 series 1964 Replaced (& not compatible with) 7000 series First planned “family” of computers

Similar or identical instruction sets Similar or identical O/S Increasing speed Increasing number of I/O ports (i.e. more

terminals) Increased memory size Increased cost

Multiplexed switch structure-multiplexor

IBM 360 - images

DEC PDP-8 1964 First minicomputer (after

miniskirt!) Did not need air conditioned

room Small enough to sit on a lab

bench $16,000

$100k++ for IBM 360 Embedded applications &

OEM-integrate into a total system with other manufacturers

DEC - PDP-8 Bus Structure

Universal bus structure (Omnibus)-separate signals paths to carry control, data and address signalsCommon bus structure- control by CPU

Later Generations:Semiconductor Memory 1950s to 1960s-memory made of ring of

ferromagnetic material called cores-fast but expensive, bulky and destructive read

Semiconductor memory-By Fairchild 1970 Size of a single core

i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core Capacity approximately doubles each year

Generations of Computer Vacuum tube - 1946-1957 Vacuum tube Transistor - 1958-1964 Transistor Small scale integration SSI - 1965 on

Up to 100 devices on a chip Medium scale integration MSI - to 1971

100-3,000 devices on a chip Large scale integration LSI - 1971-1977

3,000 - 100,000 devices on a chip Very large scale integration VLSI - 1978 -1991

100,000 - 100,000,000 devices on a chip Ultra large scale integration ULSI – 1991 -

Over 100,000,000 devices on a chip

Microprocessors Intel 1971 - 4004

First microprocessor All CPU components on a single chip 4 bit design

Followed in 1972 by 8008 8 bit Both designed for specific applications

1974 - 8080 Intel’s first general purpose microprocessor Design to be CPU of a general purpose

microcomputer

Pentium Evolution - 1 8080

first general purpose microprocessor 8 bit data path Used in first personal computer – Altair

8086 much more powerful 16 bit instruction cache, prefetch few instructions 8088 (8 bit external bus) used in first IBM PC

80286 16 MB memory addressable

80386 First 32 bit design Support for multitasking- run multiple programs at the same

time

Pentium Evolution - 2 80486

sophisticated powerful cache and instruction pipelining

built in maths co-processor Pentium

Superscalar technique-multiple instructions executed in parallel

Pentium Pro Increased superscalar organization Aggressive register renaming branch prediction data flow analysis speculative execution

Pentium Evolution - 3 Pentium II

MMX technology graphics, video & audio processing

Pentium III Additional floating point instructions for 3D graphics

Pentium 4 Note Arabic rather than Roman numerals Further floating point and multimedia enhancements

Itanium 64 bit see chapter 15

Itanium 2 Hardware enhancements to increase speed

See Intel web pages for detailed information on processors

Module 1Computer System: Designing and

Understanding Performance

Designing for PerformanceMicroprocessor Speed

Achieve full potential of speed if microprocessor is fed constant data and instructions. Techniques used:

Branch prediction-processor looks ahead in the instruction code and predicts which branches (group of instructions) are likely to be processed next

Data flow analysis-processor analyzes instructions which are dependent on each other’s result or data to create an optimized schedule of instructions

Speculative execution-use branch prediction and data flow analysis, processor speculatively executes instructions ahead of their program execution, holding the result in temporary locations.

Performance Balance – Processor and Memory Performance balance- an adjusting

of organization and architecture to balance the mismatch capabilities of various components (etc processor vs. memory)

Processor speed increased Memory capacity increased Memory speed lags behind

processor speed

Performance Balance – Processor and Memory (Performance Gap)

Performance Balance – Processor and Memory - Solutions Increase number of bits retrieved at one time

Make DRAM “wider” rather than “deeper” Change DRAM interface

Include cache in DRAM chip Reduce frequency of memory access

More complex and efficient cache between processor and memory

Cache on chip/processor Increase interconnection bandwidth between processor

and memory High speed buses Hierarchy of buses

Performance Balance: I/O Devices Peripherals with intensive I/O demands-refer chart Large data throughput demands-refer chart Processors can handle this I/O process, but the

problem is moving data between processor and devices

Solutions: Caching Buffering Higher-speed interconnection buses More elaborate bus structures Multiple-processor configurations

Performance Balance: I/O Devices

Key is Balance : Designers 2 factors:

The rate at which performance is changing in the various technology areas (processor, busses, memory, peripherals) differs greatly from one type of element to another

New applications and new peripheral devices constantly change the nature of demand on the system in term of typical instruction profile and the data access patterns

Improvements in Chip Organization and Architecture Increase hardware speed of processor

Fundamentally due to shrinking logic gate size More gates, packed more tightly, increasing

clock rate Propagation time for signals reduced

Increase size and speed of caches Dedicating part of processor chip

Cache access times drop significantly Change processor organization and architecture

Increase effective speed of execution Parallelism

Problems with Clock Speed and Logic Density Power

Power density increases with density of logic and clock speed Dissipating heat

RC delay Speed at which electrons flow limited by resistance and

capacitance of metal wires connecting them Delay increases as RC product increases Wire interconnects thinner, increasing resistance Wires closer together, increasing capacitance

Memory latency Memory speeds lag processor speeds

Solution: More emphasis on organizational and architectural

approaches

Intel Microprocessor Performance

Approach 1: Increased Cache Capacity Typically two or three levels of cache

between processor and main memory (L1, L2, L3)

Chip density increased More cache memory on chip

Faster cache access

Pentium chip devoted about 10% of chip area to cache

Pentium 4 devotes about 50%

Approach 2: More Complex Execution Logic

Enable parallel execution of instructions

Two approaches introduced: Pipelining Superscalar(covered later on)

Diminishing Returns from Approach 1 and Approach 2 Internal organization of processors very

complex Can get a great deal of parallelism Further significant increases likely to be

relatively modest Benefits from cache are reaching limit Increasing clock rate runs into power

dissipation problem Some fundamental physical limits are being

reached

New Approach – Multiple Cores Multiple processors on single chip

With large shared cache Within a processor, increase in performance

proportional to square root of increase in complexity If software can use multiple processors, doubling

number of processors almost doubles performance So, use two simpler processors on the chip rather

than one more complex processor With two processors, larger caches are justified

Power consumption of memory logic less than processing logic

Example: IBM POWER4 Two cores based on PowerPC

POWER4 Chip Organization

Module 1Computer System: Designing and

Understanding Performance

Introduction Hardware performance is often key to the effectiveness of

an entire system of hardware and software For different types of applications, different performance

metrics may by appropriate, and different aspects of a computer systems may be the most significant in determining overall performance

Understanding how best to measure performance and limitations of performance is important when selecting a computer system

To understand the issues of assessing performance. Why a piece of software performs as it does? Why one instruction set can be implemented to perform better than

another? How some hardware feature affects performance?

Defining Performance - 1

•How do we say one computer has better performance than another?•Peformance based on speed

•To take a single passenger from one point to another in the least time – Concorde •Performance based on passenger throughput

•To transport 450 passengers from one point to another - 747

Response Time and Throughput - 2

As an individual computer user, you are interested in reducing response time (or execution time)

The time between the start and completion of a task Data center managers are often interested in increasing

throughput The total amount of work done in a given time

Example: What is improved with the following changes? Replacing the processor in a computer with a faster version – will

improve response time and throughput Adding additional processors to a system that uses multiple

processors for separate tasks (e.g., searching the web) – will improve throughtput

Performance and Execution Time - 3

Relative Performance - 4 If computer A runs a program in 10 seconds and computer B runs the

same program in 15 seconds, how much faster is A than B? We know A is n times faster than B

Performance(A) = Execution time(B) = nPerformance(B) Execution time(B)

Performance ratio15 =1.510

We can say A is 1.5 times faster than B B is 1.5 times slower than A

To avoid the potential confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time”

Measuring Performance Time – measure of computer performance Definition of time - Wall-clock time, response time, or elapsed time CPU execution time (or CPU time)

The time the CPU spends computing for this task and does not include time spent waiting for I/O or running other programs

User CPU time vs. system CPU time Clock cycles (e.g., 0.25 ns) vs. clock rate (e.g., 4 GHz)

Different applications are sensitive to different aspects of the performance of a computer system

Many applications, especially those running on servers, depend as much on I/O performance and total elapsed time measured by a wall clock is of interest

In some application environments, the user may care about throughput, response time, or a complex combination of the two (e.g., maximum throughput with a worst-case response time)

CPU Execution Time A simple formula relates the most

basic metrics (i.e., clock cycles and clock cycle time) to CPU time

Improving Performance Our favorite program runs in 10 seconds on computer A, which has a

4 GHz clock. If a computer B will run this program in 6 seconds given that computer B requires 1.2 times as many clock cycles as computer A for this program. What is computer B’s clock rate?

Clock Cycles for program A

CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)

10 s = CPU Clock Cycles(A) / 4 GHz

10 s = CPU Clock Cycles(A) / 4 X 10*9 Hz

CPU Clock Cycles(A) = 40 x 10*9 cycles

CPU Time(B)

CPU Time(B) = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)

6 s = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)

Clock Rate (B) = 1.2 X 40 X 10*9 cycles / 6 seconds

Clock Rate (B) = 48 X 10*9 cycles / 6 seconds

Clock Rate (B) = 8 X 10*9 cycles / seconds

Clock Rate (B) = 8 GHz

Clock Cycles Per Instruction (CPI) The execution time must depend on the number of

instructions in a program and the average time per instruction CPU clock cycles = Instructions for a program ×

Average clock cycles per instruction

Clock cycles per instruction (CPI) Average number of clock cycles each instruction

takes to execute CPI can provide one way of comparing two

different implementations of the same instruction set architecture

Using Performance Equation-1 Suppose we have two implementations of the same instruction set

architecture and for the same program. Which computer is faster and by how much?

Computer A: clock cycle time=250 ps and CPI=2.0 Computer B: clock cycle time=500 ps and CPI=1.2

Say I = number of instructions for the program, find number of clock cycles for A and B

CPU Clock Cycles(A) = I X CPI(A)CPU Clock Cycles(A) = I X 2.0CPU Clock Cycles(B) = I X CPI(B)CPU Clock Cycles(B) = I X 1.2

Compute CPU Time for A and B CPU Time(A) = CPU Clock Cycles(A) X Clock Cycle Time(A) CPU Time(A) = I X 2.0 X 250 ps = I X 500 ps CPU Time(B) = CPU Clock Cycles(B) X Clock Cycle Time(B) CPU Time(B) = I X 1.2 X 500 ps = I X 600 ps

Using Performance Equation-2 Clearly A is faster. The amount

faster is the ratio of execution time.Performance(A) = Execution time(B) = I X 600 ps = 1.2 timesPerformance(B) Execution time(B) I X 500 ps

We can conclude, A is 1.2 times faster than B for this program

Basic Peformance Equation Basic performance equation

Instruction count can be measured by using software tools that profile the execution or by using a simulator of the architecture

Hardware counters, which are included on many processors, can be used alternatively to record a variety of measurements

Number of instructions executed Average CPI Sources of performance loss

Basis Components of Performance

Measuring the CPI Sometimes it is possible to compute the CPU clock cycles by looking at the different types of

instructions and using their individual clock cycle counts

CPIi = count of the number of instructions of class i executed CPIi = average number of cycles per instruction for that instruction class n = number of instruction classes

Remember that overall CPI for a program will depend on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution

Affecting Factors for the CPU Performance

Comparing Code Segments-1A compiler designer is trying to decide between 2 code sequences for a particular computer.

Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence?

Comparing Code Segments-2 Sequence 1 (Instruction Count(1)) 2+1+2=5 instructions Sequence 2 (Instruction Count(2)) 4+1+1=6 instructions

CPU Clock Cycles(1)= (2X1)+(1X2)+(2X3) = 10 cycles CPU Clock Cycles(2)= (4X1)+(1X2)+(1X3) = 9 cycles

So code Sequence 2 faster, even though it executes 1 extra instruction Code Sequence 2 uses fewer clock cycles, must have lower CPI

CPI = CPU Clock Cycles/Instruction Count CPI(1) = CPU Clock Cycles(1)/Instruction Count(1) = 10/5 = 2 CPI(2) = CPU Clock Cycles(2)/Instruction Count(2) = 9/6 = 1.5

Module 1Computer System: Computer

Components and Bus Interconnection Structure

Von Neumann Architecture Data and instruction are stored in

a single read-write memory The contents of this memory is

addressable by location Execution occurs in a sequential

fashion from one instruction to the next

Program Concept Hardwired program-connecting/combining

various logic components to store data and perform arithmetic and logic operations

Hardwired systems are inflexible General purpose hardware can do different

tasks, given correct control signals Instead of re-wiring, supply a new set of

control signals

What is a program? A sequence of steps For each step, an arithmetic or logical

operation is done For each operation, a different/new set of

control signals is needed For each operation a unique code is

provided e.g. ADD, MOVE

A hardware segment accepts the code and issues the control signals

Hardware and Software Approaches

Components The Control Unit and the Arithmetic and

Logic Unit constitute the Central Processing Unit

Data and instructions need to get into the system and results out Input/output

Temporary storage of code and results is needed Main memory

Computer Components:Top Level View

Interconnection Structures All the units (processor, memory

and I/O components) must be connected – interconnection structure

Different type of connection for different type of unit Memory Input/Output CPU

Computer Modules

Memory Connection Receives and sends data Receives addresses (of locations) Receives control signals

Read Write Timing

Input/Output Connection(1) Similar to memory from computer’s

viewpoint Output

Receive data from computer Send data to peripheral

Input Receive data from peripheral Send data to computer

Input/Output Connection(2) Receive control signals from

computer Send control signals to peripherals

e.g. spin disk Receive addresses from computer

e.g. port number to identify peripheral Send interrupt signals (control)

CPU Connection Reads instruction and data Writes out data (after processing) Sends control signals to other units Receives (& acts on) interrupts

Buses Interconnection There are a number of possible

interconnection systems Single and multiple BUS structures

are most common e.g. Control/Address/Data bus (PC) e.g. Unibus (DEC-PDP)

What is a Bus? A communication pathway

connecting two or more devices Usually broadcast/shared medium Often grouped

A number of channels in one bus e.g. 32 bit data bus is 32 separate

single bit channels Power lines may not be shown

Bus Interconnection Scheme

Data Bus Carries data

Remember that there is no difference between “data” and “instruction” at this level

Width is a key determinant of performance 8, 16, 32, 64 bit

Address bus Identify the source or destination of

data e.g. CPU needs to read an instruction

(data) from a given location in memory

Bus width determines maximum memory capacity of system e.g. 8080 has 16 bit address bus giving

64k address space

Control Bus Control and timing information for

data and address line- Typical control lines:

Memory read/write signal Interrupt request Clock signals Bus grant/request

Big and Yellow? What do buses look like?

Parallel lines on circuit boards Ribbon cables Strip connectors on mother boards

e.g. PCI Sets of wires

Physical Realization of Bus Architecture

Single Bus Problems Lots of devices on one bus leads to:

Propagation delays Long data paths mean that co-ordination of

bus use can adversely affect performance If aggregate data transfer approaches bus

capacity

Most systems use multiple buses to overcome these problems

Traditional (ISA)(with cache)

High Performance Bus

PCI Bus Peripheral Component

Interconnection Intel released to public domain 32 or 64 bit 64 data lines High bandwidth, processor

independent bus

PCI Bus

module 1 overview: introduction to computer architecture and organization

Documents

data processingprocess

data representation

store data

long term data storage

requirements data storageshort

updatedata movementmove

computer architecture

control unit cucontrol