Computer Organization & Assembly Language Programming · 2017-02-13
TRANSCRIPT
1
Computer Organization &
Assembly Language Programming
CSE 2312
Lecture 4 Processor
2
Metric Units
The principal metric prefixes.
3
Measuring Computer and CPU Performance
• Elapsed time
– Total response time, including all aspects, such as processing, I/O, OS overhead, and idle time
– Determines system performance
• CPU time
– Time the CPU spends processing a given job (time spent executing this program's instructions)
– Discounts I/O time and other jobs' shares
– User CPU time + system CPU time
– Different programs are affected differently by CPU and system performance
4
CPU Clock
• Every action is driven by a clock in the CPU
• Clock time = 1/Frequency
– 1 MHz clock → clock time = 10⁻⁶ seconds
– 1 GHz clock → clock time = 10⁻⁹ seconds
5
How Long Does an Instruction Take?
• Digital logic is controlled by a clock
• Clock period: duration of a clock cycle
– e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
• Clock frequency (rate): cycles per second
– e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
Figure: the clock signal in cycles — within each clock period, data transfer and computation occur, then the state is updated.
6
Predicting CPU Time
• Ideal: Only need to know number of instructions
• Reality: Some instructions take longer than others
CPU Time = Instructions × Clock Cycle Time = Instructions / Clock Rate

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

where Instructions / Program is the Instruction Count and Clock cycles / Instruction is the Cycles per instruction.
7
Instruction Count and Cycles Per Instruction
• IC is determined by the program, the ISA, and the compiler
• CPI is determined by the CPU and other factors
– Different instructions have different CPIs
– Average CPI is affected by the instruction mix
Clock Cycles = Instruction Count (IC) × Cycles per Instruction (CPI)

CPU Time = IC × CPI × Clock Cycle Time = (IC × CPI) / Clock Rate
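The equation above can be sketched as a small helper. The function name and the example values (1 billion instructions, average CPI of 2.0, a 4 GHz clock) are hypothetical, chosen only to illustrate the formula:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU Time = IC x CPI / Clock Rate (equivalently IC x CPI x cycle time)."""
    return instruction_count * cpi / clock_rate_hz

# 1 billion instructions, average CPI 2.0, 4 GHz clock:
print(cpu_time(1e9, 2.0, 4.0e9))  # 0.5 seconds
```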
8
Improving CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

Clock Cycles = Σ_{i=1}^{n} (CPI_i × Instruction Count_i)

Average CPI = Clock Cycles / Instruction Count = Σ_{i=1}^{n} CPI_i × (Instruction Count_i / Instruction Count)

where Instruction Count_i / Instruction Count is the relative frequency of instruction class i. Improving one factor is usually a tradeoff against another.
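The weighted-average CPI can be computed directly from the instruction mix; the function name below is my own, and the example values are the class CPIs and counts from the next slide:

```python
def average_cpi(cpis, counts):
    """Average CPI = sum(CPI_i x IC_i) / total IC, i.e. each class CPI
    weighted by its relative frequency in the instruction mix."""
    total_ic = sum(counts)
    return sum(c * n for c, n in zip(cpis, counts)) / total_ic

# Classes A, B, C with CPIs 1, 2, 3 and counts 2, 1, 2:
print(average_cpi([1, 2, 3], [2, 1, 2]))  # 2.0
```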
9
Compiler Matters!
• Suppose the compiler has two choices: it can use 5 or 6 instructions, as described below
• Which is better?
Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  4   1   1
• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
– Clock Cycles = 4×1 + 1×2 + 1×3 = 9
– Avg. CPI = 9/6 = 1.5
Sequence 2 has lower average CPI, so it is better.
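The comparison on this slide can be checked numerically. The class CPIs and per-sequence instruction counts come from the table above; the helper name is my own:

```python
def clock_cycles(cpis, counts):
    """Total cycles = sum over classes of CPI_i x IC_i."""
    return sum(c * n for c, n in zip(cpis, counts))

cpi_by_class = [1, 2, 3]            # classes A, B, C
seq1, seq2 = [2, 1, 2], [4, 1, 1]   # instruction counts per class

cycles1 = clock_cycles(cpi_by_class, seq1)  # 10 cycles for IC = 5 (avg CPI 2.0)
cycles2 = clock_cycles(cpi_by_class, seq2)  # 9 cycles for IC = 6 (avg CPI 1.5)
# Sequence 2 executes more instructions but fewer total cycles, so it is faster.
print(cycles1, cycles2)  # 10 9
```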
10
Comparing Performance
• Performance = 1 / Execution Time
• “X is n times faster than Y”
• Example: time taken to run a program
– 10s on A, 15s on B… how much faster is A?
– Execution TimeB / Execution TimeA = 15s / 10s = 1.5
– So A is 1.5 times faster than B
n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X
11
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = 500 ps × I
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = 600 ps × I

CPU Time_B / CPU Time_A = (600 ps × I) / (500 ps × I) = 1.2

A is faster… by this much (1.2×).
12
CPU Example
• Computer A:
– 2 GHz clock, 10 s CPU time
• Let's design Computer B
– Aim for 6 s CPU time
– Can use a faster clock, but doing so costs 1.2× as many clock cycles
• How fast must the new clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹

Clock Rate_B = (1.2 × 20×10⁹) / 6 s = 24×10⁹ / 6 s = 4 GHz
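The slide's calculation can be reproduced with a small helper; the function name is my own:

```python
def required_clock_rate(time_a_s, rate_a_hz, cycle_factor, target_time_s):
    """Clock Rate_B = (cycle_factor x Clock Cycles_A) / CPU Time_B,
    where Clock Cycles_A = CPU Time_A x Clock Rate_A."""
    cycles_a = time_a_s * rate_a_hz
    return cycle_factor * cycles_a / target_time_s

# Computer A: 10 s at 2 GHz; Computer B: 1.2x the cycles, 6 s target.
print(required_clock_rate(10, 2e9, 1.2, 6))  # 4e9 Hz, i.e. 4 GHz
```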
13
Time for a Program
• CPU executes various instructions
• A program has a number of instructions. How many?
– Depends on the program and compiler
• Each instruction takes a number of CPU cycles. How many?
– Depends on the Instruction Set Architecture (ISA)
– ISA: learned in this course
• Each cycle has a fixed time based on CPU and bus speed
– Depends on the hardware organization
– Computer architecture: learned in this course
14
CPU Performance Equation
15
Performance Summary
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
16
How to Improve Performance?
We must lower execution time!
• Algorithm
– Determines the number of operations executed
• Programming language, compiler, architecture
– Determine the number of machine instructions executed per operation (IC)
• Processor and memory system
– Determine how fast instructions are executed (CPI)
• I/O system (including OS)
– Determines how fast I/O operations are executed
17
Amdahl’s Law
• Improving one aspect of a computer won’t give a proportional improvement in overall performance
• Especially true of multicore computers
• So make the common case fast!
T_improved = T_affected / improvement factor + T_unaffected
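A minimal sketch of Amdahl's Law as stated above; the 100 s job split into 80 s affected / 20 s unaffected is a made-up example:

```python
def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement_factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected

# A 10x speedup on 80 s of a 100 s job still leaves the other 20 s untouched:
print(amdahl_time(80, 20, 10))  # 28.0 s total -- only ~3.6x overall, not 10x
```

This is why improving one aspect rarely gives a proportional overall improvement: the unaffected part dominates as the affected part shrinks.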
18
Exercise 1
• Problem
– There are 3 classes of instructions: A, B, C. Suppose the compiler has two choices, Sequence 1 and Sequence 2, as described below:
• Which one is better? Why?
Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  3   1   1
• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 5
– Clock Cycles = 3×1 + 1×2 + 1×3 = 8
– Avg. CPI = 8/5 = 1.6
Sequence 2 has lower average CPI, so it is better.
19
Exercise 2
• Problem:
– There are two computers, A and B.
– Computer A: Cycle Time = 250 ps, CPI = 2.0
– Computer B: Cycle Time = 400 ps, CPI = 1.5
– If they have the same ISA, which computer is faster?
– How many times faster is it than the other?
• Answer:
– We know that CPU Time = IC × CPI × Cycle Time
– Therefore, CPU Time(A) = IC × 2.0 × 250 ps = 500 ps × IC
– CPU Time(B) = IC × 1.5 × 400 ps = 600 ps × IC
– So A is (600/500) = 1.2 times faster.
20
Exercise 3
• Problem:
– Computer A has a 2 GHz clock. It takes 10 s of CPU time to finish a given task.
– We want to design Computer B to finish the same task within 5 s of CPU time.
– The clock cycle count for Computer B is 2 times that of Computer A.
– What clock rate should Computer B have?
• Answer:
Clock Rate_B = Clock Cycles_B / CPU Time_B = (2 × Clock Cycles_A) / 5 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹

Clock Rate_B = (2 × 20×10⁹) / 5 s = 40×10⁹ / 5 s = 8 GHz
21
Central Processing Unit (CPU)
The organization of a simple computer with one CPU and two I/O devices
22
Basic Elements
Other devices: Cache; Virtual Memory Support (MMU); ….
23
Processor
• CPU
– Brain of the computer: executes programs stored in main memory by fetching instructions, examining them, and executing them one after another
• Bus
– Connects different components
– Parallel wires for transmitting address, data, and control signals
– Can be external to the CPU (connecting to memory, I/O) or internal
• Control Unit
– Fetching instructions from main memory and determining their types
• Arithmetic Logic Unit (ALU)
– Performing arithmetic operations, such as addition, boolean operations
• Registers
– High speed memory used to store temporary results and control information
– Program Counter (PC): points to the next instruction to be fetched
– Instruction Register (IR): holds the instruction currently being executed
24
CPU Organization
The data path of a typical von Neumann machine.
• Instructions:
– Register-Memory: memory words are fetched into registers
– Register-Register
• Data Path Cycle
– The process of running two operands through the ALU and storing the result
– Defines what the machine can do
– The faster the data path cycle is, the faster the computer runs
25
Arithmetic Logic Unit (ALU)
Figure: ALU with operand inputs A and B, an operation-code input, result output C, and status-code outputs c, n, z, v.
• Conducts different calculations
– +, -, ×, /
– and, or, xor, not
– shift, …
• Variants
– Integer, floating point, double precision
– High-performance CPUs have multiple ALUs!
• Input
– Operands: A, B
– Operation code: obtained from the encoded instruction
• Output
– Result: C
– Status codes, usually:
  c - carry out from +, -, ×, shift
  n - result is negative
  z - result is zero
  v - result overflowed
26
Instruction Execution Steps
• Fetch-decode-execute
– Fetch the next instruction from memory into the instruction register
– Change the program counter to point to the following instruction
– Determine the type of instruction just fetched
– If the instruction uses a word in memory, determine where it is
– Fetch the word, if needed, into a CPU register
– Execute the instruction
– Go to step 1 to begin executing the following instruction
• Central to the operation of all computers
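The steps above can be sketched as a tiny interpreter loop. This is not the Java interpreter of Figure 2-3; it is a minimal Python sketch over a made-up two-operation ISA (the opcodes, memory layout, and accumulator are all hypothetical):

```python
def run(memory):
    """Fetch-decode-execute loop over a hypothetical (opcode, operand) ISA."""
    pc, acc = 0, 0                    # program counter, accumulator
    while True:
        opcode, operand = memory[pc]  # fetch into the "instruction register"
        pc += 1                       # point the PC at the following instruction
        if opcode == "ADD":           # decode, then execute
            acc += operand
        elif opcode == "HALT":
            return acc

program = [("ADD", 3), ("ADD", 4), ("HALT", 0)]
print(run(program))  # 7
```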
Figure 2-3. An interpreter for a simple computer (written in Java).
29
Interpreting Instructions
• Interpreter
– A program that fetches, examines, and executes the instructions of another program
– Can write a program to imitate the function of a CPU
– Main advantage: the ability to design a simple processor to support a rich set of instructions.
• Benefits (simple computer with interpreted instructions)
– The ability to fix incorrectly implemented instructions or make up for design deficiencies in the basic hardware
– The opportunity to add new instructions at minimal cost even after delivery of the machine
– Structured design that permits efficient development, testing and documenting of complex instructions
30
RISC vs. CISC
• Semantic gap between
– What the machine can do
– What high-level programming languages require
• Reduced Instruction Set Computer (Lego building example)
– Did not use interpretation
– Did not have to be backward compatible with existing products
– Small number of instructions, around 50
• Key to designing RISC instructions
– Instructions should be able to be issued quickly
– How long an instruction actually takes matters less than how many can be started per second
• Complex Instruction Set Computer
– Around 200-300 instructions; DEC VAX and IBM mainframes
• Intel (486 and up)
– A RISC core executes the simplest (most common) instructions
– The more complicated instructions are interpreted in the usual CISC way
31
RISC Design Principles for Modern Computers
• Instructions directly executed by hardware
– Eliminating a level of interpretation provides high speed for most instructions
– Using CISC instructions with interpretation for less frequently occurring instructions is acceptable
• Maximize the rate at which instructions are issued
– Parallelism can play a major role in improving performance
• Instructions should be easy to decode
– A critical limit on the rate of issue is decoding individual instructions to determine what resources they need
– The fewer different instruction formats, the better
• Only loads and stores should reference memory
– Accessing memory can take a long time
– All other instructions should operate only on registers
• Provide plenty of registers
– Running out of registers leads to flushing them back to memory
– Memory access is slow
32
Instruction-Level Parallelism
• A five-stage pipeline
– The state of each stage as a function of time; nine clock cycles are illustrated
33
Pipelining
• A five-stage pipeline
– Suppose 2 ns for the cycle time
– It takes 10 ns for an instruction to progress all the way through the five-stage pipeline
– So the machine runs at 100 MIPS?
– The actual rate is 500 MIPS
• Pipelining
– Allows a tradeoff between latency and processor bandwidth
– Latency: how long it takes to execute an instruction
– Processor bandwidth: how many MIPS the CPU delivers
• Example
– Suppose a complex instruction takes 10 ns under perfect conditions. How many pipeline stages should we design to guarantee 500 MIPS?
– Each stage: 1 / 500 MIPS = 2 ns
– 10 ns / 2 ns = 5 stages
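The latency-vs-bandwidth arithmetic above can be checked with a small helper; the function name is my own:

```python
def mips(stage_time_ns, stages, pipelined=True):
    """Throughput in MIPS. With a full pipeline, one instruction completes
    every stage time; without it, one completes every stages x stage time.
    1 instruction per t ns = 1000/t million instructions per second."""
    time_per_instr_ns = stage_time_ns if pipelined else stage_time_ns * stages
    return 1000 / time_per_instr_ns

print(mips(2, 5))                    # 500.0 MIPS with the pipeline full
print(mips(2, 5, pipelined=False))   # 100.0 MIPS without pipelining
```

Note that pipelining raises bandwidth (500 vs. 100 MIPS) while each individual instruction's latency stays at 10 ns.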
34
Superscalar Architectures (1)
• Dual five-stage pipelines with a common instruction fetch unit
– Fetches pairs of instructions together and puts each one into its own pipeline
– The two instructions must not conflict over resource usage
– Neither may depend on the result of the other
35
Superscalar Architectures (2)
A superscalar processor with five functional units.
• Implicit idea
– The S3 stage can issue instructions considerably faster than the S4 stage is able to execute them
36
Processor-Level Parallelism (1)
• An array processor (SIMD)
– A large number of identical processors perform the same sequence of instructions on different sets of data
– Different from a standard von Neumann machine
37
Processor-Level Parallelism (2)
• A single-bus multiprocessor
– Example: locating the white ball in a picture
• A multicomputer with local memories.
38
Processor-Level Parallelism (3)
• Multiple computers (loosely coupled)
– Easier to build
• Multiple processors (tightly coupled)
– Easier to program
39
Exercise
• Ex 1: TRUE OR FALSE, Why?
– The Data Path Cycle defines what the computer can do. The longer the data path cycle is, the faster the computer runs.
– Answer: F
– Reason: The Data Path Cycle does define what the computer can do, but it is the shorter/faster the data path cycle is, the faster the computer runs.
40
Exercise
• Ex 2: What are the design principles for modern computers?
– (a) Instructions directly executed by hardware
– (b) Minimize rate at which instructions are issued
– (c) Instructions should be easy to decode
– (d) Only loads, stores should reference memory
– (e) Provide plenty of registers
– Answer: [a, c, d, e]
41
Exercise
• Ex 3: The following diagram gives the organization of a simple computer with one CPU and two I/O devices.
– Is it correct? If not, please correct it in the diagram.
• Solution
– Incorrect.
– In place of Disk, it should be Register.
– In place of Register, it should be Disk.
42
Exercise
• Ex 4: Consider a computer using pipelining, where each instruction has 5 stages and each stage takes 2 ns.
– What is the maximum number of MIPS this machine is capable of with the 5-stage pipeline?
– What is the maximum number of MIPS this machine is capable of in the absence of pipelining?
• Solution
– With pipelining: 1 / 2 ns = 500 MIPS
– Without pipelining: 500 / 5 = 100 MIPS, or 1 / (5 × 2 ns) = 100 MIPS