Computer Organization & Assembly Language Programming · 2017-02-13
TRANSCRIPT
1
Computer Organization &
Assembly Language Programming
CSE 2312
Lecture 4 Processor
2
Metric Units
The principal metric prefixes.
3
Measuring Computer and CPU Performance
• Elapsed time
– Total response time, including all aspects, such as processing, I/O, OS overhead, and idle time
– Determines system performance
• CPU time
– Time the CPU spends processing a given job (time spent executing this program's instructions)
– Discounts I/O time and other jobs' shares
– User CPU time + system CPU time
– Different programs are affected differently by CPU and system performance
4
CPU Clock
• Every action is driven by a clock in the CPU
• Clock time = 1/Frequency
– 1 MHz clock → clock time = 10⁻⁶ seconds
– 1 GHz clock → clock time = 10⁻⁹ seconds
5
How Long Does an Instruction Take?
• Digital logic is controlled by a clock
• Clock period: duration of a clock cycle
– e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
• Clock frequency (rate): cycles per second
– e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
Figure: the clock signal in cycles — within each clock period, data transfer and computation occur, then the state is updated.
6
Predicting CPU Time
• Ideal: Only need to know number of instructions
• Reality: Some instructions take longer than others
CPU Time = Instructions × Clock Cycle Time = Instructions / Clock Rate

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

where Instructions / Program is the Instruction Count and Clock cycles / Instruction is the Cycles per instruction.
7
Instruction Count and Cycles Per Instruction
• IC is determined by the program, the ISA, and the compiler
• CPI is determined by the CPU and other factors
– Different instructions have different CPIs
– Average CPI is affected by the instruction mix
Clock Cycles = Instruction Count (IC) × Cycles per Instruction (CPI)

CPU Time = IC × CPI × Clock Cycle Time = (IC × CPI) / Clock Rate
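The equation above can be sketched as a small helper. The function name and the example values (1 billion instructions, average CPI of 2.0, a 4 GHz clock) are hypothetical, chosen only to illustrate the formula:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU Time = IC x CPI / Clock Rate (equivalently IC x CPI x cycle time)."""
    return instruction_count * cpi / clock_rate_hz

# 1 billion instructions, average CPI 2.0, 4 GHz clock:
print(cpu_time(1e9, 2.0, 4.0e9))  # 0.5 seconds
```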
8
Improving CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

Clock Cycles = Σ_{i=1}^{n} (CPI_i × Instruction Count_i)

Average CPI = Clock Cycles / Instruction Count = Σ_{i=1}^{n} CPI_i × (Instruction Count_i / Instruction Count)

where Instruction Count_i / Instruction Count is the relative frequency of instruction class i. Improving one factor is usually a tradeoff against another.
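The weighted-average CPI can be computed directly from the instruction mix; the function name below is my own, and the example values are the class CPIs and counts from the next slide:

```python
def average_cpi(cpis, counts):
    """Average CPI = sum(CPI_i x IC_i) / total IC, i.e. each class CPI
    weighted by its relative frequency in the instruction mix."""
    total_ic = sum(counts)
    return sum(c * n for c, n in zip(cpis, counts)) / total_ic

# Classes A, B, C with CPIs 1, 2, 3 and counts 2, 1, 2:
print(average_cpi([1, 2, 3], [2, 1, 2]))  # 2.0
```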
9
Compiler Matters!
• Suppose the compiler has two choices: it can use 5 or 6 instructions, as described below
• Which is better?
Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  4   1   1
• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
– Clock Cycles = 4×1 + 1×2 + 1×3 = 9
– Avg. CPI = 9/6 = 1.5
Sequence 2 has lower average CPI, so it is better.
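The comparison on this slide can be checked numerically. The class CPIs and per-sequence instruction counts come from the table above; the helper name is my own:

```python
def clock_cycles(cpis, counts):
    """Total cycles = sum over classes of CPI_i x IC_i."""
    return sum(c * n for c, n in zip(cpis, counts))

cpi_by_class = [1, 2, 3]            # classes A, B, C
seq1, seq2 = [2, 1, 2], [4, 1, 1]   # instruction counts per class

cycles1 = clock_cycles(cpi_by_class, seq1)  # 10 cycles for IC = 5 (avg CPI 2.0)
cycles2 = clock_cycles(cpi_by_class, seq2)  # 9 cycles for IC = 6 (avg CPI 1.5)
# Sequence 2 executes more instructions but fewer total cycles, so it is faster.
print(cycles1, cycles2)  # 10 9
```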
10
Comparing Performance
• Performance = 1 / Execution Time
• “X is n times faster than Y”
• Example: time taken to run a program
– 10s on A, 15s on B… how much faster is A?
– Execution TimeB / Execution TimeA = 15s / 10s = 1.5
– So A is 1.5 times faster than B
n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X
11
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = 500 ps × I
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = 600 ps × I

CPU Time_B / CPU Time_A = (600 ps × I) / (500 ps × I) = 1.2

A is faster… by this much (1.2×).
12
CPU Example
• Computer A:
– 2 GHz clock, 10 s CPU time
• Let's design Computer B
– Aim for 6 s CPU time
– Can use a faster clock, but doing so costs 1.2× as many clock cycles
• How fast must the new clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹

Clock Rate_B = (1.2 × 20×10⁹) / 6 s = 24×10⁹ / 6 s = 4 GHz
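The slide's calculation can be reproduced with a small helper; the function name is my own:

```python
def required_clock_rate(time_a_s, rate_a_hz, cycle_factor, target_time_s):
    """Clock Rate_B = (cycle_factor x Clock Cycles_A) / CPU Time_B,
    where Clock Cycles_A = CPU Time_A x Clock Rate_A."""
    cycles_a = time_a_s * rate_a_hz
    return cycle_factor * cycles_a / target_time_s

# Computer A: 10 s at 2 GHz; Computer B: 1.2x the cycles, 6 s target.
print(required_clock_rate(10, 2e9, 1.2, 6))  # 4e9 Hz, i.e. 4 GHz
```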
13
Time for a Program
• CPU executes various instructions
• A program has a number of instructions. How many?
– Depends on the program and compiler
• Each instruction takes a number of CPU cycles. How many?
– Depends on the Instruction Set Architecture (ISA)
– ISA: learned in this course
• Each cycle has a fixed time based on CPU and bus speed
– Depends on the hardware organization
– Computer architecture: learned in this course
14
CPU Performance Equation
15
Performance Summary
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
16
How to Improve Performance?
We must lower execution time!
• Algorithm
– Determines the number of operations executed
• Programming language, compiler, architecture
– Determine the number of machine instructions executed per operation (IC)
• Processor and memory system
– Determine how fast instructions are executed (CPI)
• I/O system (including OS)
– Determines how fast I/O operations are executed
17
Amdahl’s Law
• Improving one aspect of a computer won’t give a proportional improvement in overall performance
• Especially true of multicore computers
• So make the common case fast!
T_improved = T_affected / improvement factor + T_unaffected
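A minimal sketch of Amdahl's Law as stated above; the 100 s job split into 80 s affected / 20 s unaffected is a made-up example:

```python
def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement_factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected

# A 10x speedup on 80 s of a 100 s job still leaves the other 20 s untouched:
print(amdahl_time(80, 20, 10))  # 28.0 s total -- only ~3.6x overall, not 10x
```

This is why improving one aspect rarely gives a proportional overall improvement: the unaffected part dominates as the affected part shrinks.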
18
Exercise 1
• Problem
– There are 3 classes of instructions: A, B, C. Suppose the compiler has two choices, Sequence 1 and Sequence 2, as described below:
• Which one is better? Why?
Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  3   1   1
• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 5
– Clock Cycles = 3×1 + 1×2 + 1×3 = 8
– Avg. CPI = 8/5 = 1.6
Sequence 2 has lower average CPI, so it is better.
19
Exercise 2
• Problem:
– There are two computers, A and B.
– Computer A: Cycle Time = 250 ps, CPI = 2.0
– Computer B: Cycle Time = 400 ps, CPI = 1.5
– If they have the same ISA, which computer is faster?
– How many times faster is it than the other?
• Answer:
– We know that CPU Time = IC × CPI × Cycle Time
– Therefore, CPU Time(A) = IC × 2.0 × 250 ps = 500 ps × IC
– CPU Time(B) = IC × 1.5 × 400 ps = 600 ps × IC
– So A is (600/500) = 1.2 times faster.
20
Exercise 3
• Problem:
– Computer A has a 2 GHz clock. It takes 10 s of CPU time to finish a given task.
– We want to design Computer B to finish the same task within 5 s of CPU time.
– The clock cycle count for Computer B is 2 times that of Computer A.
– What clock rate should Computer B have?
• Answer:
Clock Rate_B = Clock Cycles_B / CPU Time_B = (2 × Clock Cycles_A) / 5 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹

Clock Rate_B = (2 × 20×10⁹) / 5 s = 40×10⁹ / 5 s = 8 GHz
21
Central Processing Unit (CPU)
The organization of a simple computer with one CPU and two I/O devices
22
Basic Elements
Other devices: Cache; Virtual Memory Support (MMU); ….
23
Processor
• CPU
– Brain of the computer: executes programs stored in main memory by fetching instructions, examining them, and executing them one after another
• Bus
– Connects different components
– Parallel wires for transmitting address, data, and control signals
– Can be external to the CPU (connecting to memory, I/O) or internal
• Control Unit
– Fetching instructions from main memory and determining their types
• Arithmetic Logic Unit (ALU)
– Performing arithmetic operations, such as addition, boolean operations
• Registers
– High speed memory used to store temporary results and control information
– Program Counter (PC): points to the next instruction to be fetched
– Instruction Register (IR): holds the instruction currently being executed
24
CPU Organization
The data path of a typical von Neumann machine.
• Instructions:
– Register-Memory: memory words are fetched into registers
– Register-Register
• Data Path Cycle
– The process of running two operands through the ALU and storing the result
– Defines what the machine can do
– The faster the data path cycle is, the faster the computer runs
25
Arithmetic Logic Unit (ALU)
Figure: ALU with operand inputs A and B, an operation-code input, result output C, and status-code outputs c, n, z, v.
• Conducts different calculations
– +, -, ×, /
– and, or, xor, not
– shift, …
• Variants
– Integer, floating point, double precision
– High-performance CPUs have multiple ALUs!
• Input
– Operands: A, B
– Operation code: obtained from the encoded instruction
• Output
– Result: C
– Status codes, usually:
  c - carry out from +, -, ×, shift
  n - result is negative
  z - result is zero
  v - result overflowed
26
Instruction Execution Steps
• Fetch-decode-execute
– Fetch the next instruction from memory into the instruction register
– Change the program counter to point to the following instruction
– Determine the type of instruction just fetched
– If the instruction uses a word in memory, determine where it is
– Fetch the word, if needed, into a CPU register
– Execute the instruction
– Go to step 1 to begin executing the following instruction
• Central to the operation of all computers
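The steps above can be sketched as a tiny interpreter loop. This is not the Java interpreter of Figure 2-3; it is a minimal Python sketch over a made-up two-operation ISA (the opcodes, memory layout, and accumulator are all hypothetical):

```python
def run(memory):
    """Fetch-decode-execute loop over a hypothetical (opcode, operand) ISA."""
    pc, acc = 0, 0                    # program counter, accumulator
    while True:
        opcode, operand = memory[pc]  # fetch into the "instruction register"
        pc += 1                       # point the PC at the following instruction
        if opcode == "ADD":           # decode, then execute
            acc += operand
        elif opcode == "HALT":
            return acc

program = [("ADD", 3), ("ADD", 4), ("HALT", 0)]
print(run(program))  # 7
```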
Figure 2-3. An interpreter for a simple computer (written in Java).
29
Interpreting Instructions
• Interpreter
– A program that fetches, examines, and executes the instructions of another program
– Can write a program to imitate the function of a CPU
– Main advantage: the ability to design a simple processor to support a rich set of instructions.
• Benefits (simple computer with interpreted instructions)
– The ability to fix incorrectly implemented instructions or make up for design deficiencies in the basic hardware
– The opportunity to add new instructions at minimal cost even after delivery of the machine
– Structured design that permits efficient development, testing and documenting of complex instructions
30
RISC vs. CISC
• Semantic gap between
– What the machine can do
– What high-level programming languages require
• Reduced Instruction Set Computer (Lego building example)
– Did not use interpretation
– Did not have to be backward compatible with existing products
– Small number of instructions, around 50
• Key to designing RISC instructions
– Instructions should be able to be issued quickly
– How long an instruction actually takes matters less than how many can be started per second
• Complex Instruction Set Computer
– Around 200-300 instructions; DEC VAX and IBM mainframes
• Intel (486 and up)
– A RISC core executes the simplest (most common) instructions
– The more complicated instructions are interpreted in the usual CISC way
31
RISC Design Principles for Modern Computers
• Instructions directly executed by hardware
– Eliminating a level of interpretation provides high speed for most instructions
– Using CISC instructions with interpretation for less frequently occurring instructions is acceptable
• Maximize the rate at which instructions are issued
– Parallelism can play a major role in improving performance
• Instructions should be easy to decode
– A critical limit on the rate of issue is decoding individual instructions to determine what resources they need
– The fewer different instruction formats, the better
• Only loads and stores should reference memory
– Accessing memory can take a long time
– All other instructions should operate only on registers
• Provide plenty of registers
– Running out of registers leads to flushing them back to memory
– Memory access is slow
32
Instruction-Level Parallelism
• A five-stage pipeline
– The state of each stage as a function of time; nine clock cycles are illustrated
33
Pipelining
• A five-stage pipeline
– Suppose 2 ns for the cycle time
– It takes 10 ns for an instruction to progress all the way through the five-stage pipeline
– So the machine runs at 100 MIPS?
– The actual rate is 500 MIPS
• Pipelining
– Allows a tradeoff between latency and processor bandwidth
– Latency: how long it takes to execute an instruction
– Processor bandwidth: how many MIPS the CPU delivers
• Example
– Suppose a complex instruction takes 10 ns under perfect conditions. How many pipeline stages should we design to guarantee 500 MIPS?
– Each stage: 1 / 500 MIPS = 2 ns
– 10 ns / 2 ns = 5 stages
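The latency-vs-bandwidth arithmetic above can be checked with a small helper; the function name is my own:

```python
def mips(stage_time_ns, stages, pipelined=True):
    """Throughput in MIPS. With a full pipeline, one instruction completes
    every stage time; without it, one completes every stages x stage time.
    1 instruction per t ns = 1000/t million instructions per second."""
    time_per_instr_ns = stage_time_ns if pipelined else stage_time_ns * stages
    return 1000 / time_per_instr_ns

print(mips(2, 5))                    # 500.0 MIPS with the pipeline full
print(mips(2, 5, pipelined=False))   # 100.0 MIPS without pipelining
```

Note that pipelining raises bandwidth (500 vs. 100 MIPS) while each individual instruction's latency stays at 10 ns.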
34
Superscalar Architectures (1)
• Dual five-stage pipelines with a common instruction fetch unit
– Fetches pairs of instructions together and puts each one into its own pipeline
– The two instructions must not conflict over resource usage
– Neither may depend on the result of the other
35
Superscalar Architectures (2)
A superscalar processor with five functional units.
• Implicit idea
– The S3 stage can issue instructions considerably faster than the S4 stage is able to execute them
36
Processor-Level Parallelism (1)
• An array processor (SIMD)
– A large number of identical processors perform the same sequence of instructions on different sets of data
– Different from a standard von Neumann machine
37
Processor-Level Parallelism (2)
• A single-bus multiprocessor
– Example: locating the white ball in a picture
• A multicomputer with local memories.
38
Processor-Level Parallelism (3)
• Multiple computers (loosely coupled)
– Easier to build
• Multiple processors (tightly coupled)
– Easier to program
39
Exercise
• Ex 1: TRUE OR FALSE, Why?
– The Data Path Cycle defines what the computer can do. The longer the data path cycle is, the faster the computer runs.
– Answer: F
– Reason: The Data Path Cycle does define what the computer can do, but it is the shorter/faster the data path cycle is, the faster the computer runs.
40
Exercise
• Ex 2: What are the design principles for modern computers?
– (a) Instructions directly executed by hardware
– (b) Minimize rate at which instructions are issued
– (c) Instructions should be easy to decode
– (d) Only loads, stores should reference memory
– (e) Provide plenty of registers
– Answer: [a, c, d, e]
41
Exercise
• Ex 3: The following diagram gives the organization of a simple computer with one CPU and two I/O devices.
– Is it correct? If not, please correct it in the diagram.
• Solution
– Incorrect.
– In place of Disk, it should be Register.
– In place of Register, it should be Disk.
42
Exercise
• Ex 4: Consider a computer using pipelining, where each instruction has 5 stages and each stage takes 2 ns.
– What is the maximum number of MIPS this machine is capable of with the 5-stage pipeline?
– What is the maximum number of MIPS this machine is capable of in the absence of pipelining?
• Solution
– With pipelining: 1 / 2 ns = 500 MIPS
– Without pipelining: 500 / 5 = 100 MIPS, or 1 / (5 × 2 ns) = 100 MIPS