CSE502 Computer Architecture

• Textbook: Computer Architecture: A Quantitative Approach (3rd edition)
• http://www.cs.sunysb.edu/~cse502
• Introduction (Chap 1)
• Instruction-level parallelism (Chap 3 & 4)
• Memory system design (Chap 5)
• Multiprocessor (Chap 6)
• Storage System (Chap 7)
• Clusters (Chap 8)

Requirements

• 2 midterms (30%), 4 homeworks (30%), one final project (40%)

• Important dates:
  – Midterm 1: 10/17; Midterm 2: 11/30
  – HW1: 9/26; HW2: 10/12; HW3: 11/2; HW4: 11/21
  – Final project proposal due: 10/3
  – Final project presentation: 12/12-14
  – Final project due: 5PM on 12/22

Class Project

• Transition from undergrad to grad student

• Learn to take initiative: pick topic, design/implement, evaluate, write up
• Give oral presentation
• Write a conference-quality report
• 4 weeks work full time for 2 people
• Opportunity to do “research in the small” to help make transition from good student to research colleague

• Focus: exploiting architectural features through novel systems software

Class Project Topics-I

1. Fast system call using hyper-threading
2. Exploiting System Management Mode for whole-system check-pointing and restart
3. Debugger Support for GPU
4. Graphics Engine Resource Management
5. Novel application of trusted computing hardware
6. Novel application of virtualization hardware
7. Array bound checking using debug register hardware
8. Decoupled architecture for file prefetching

Class Project Topics-II

9. Feather-weight virtual machine for Linux
10. Binary interpretation for self-modifying code
11. Software protection through polymorphic code generation
12. Clock synchronization using Wi-Fi beacon
13. Trace-driven parallel network simulation
14. Scalable comprehensive network packet analysis engine
15. Applying virtual LAN for low-latency message logging

Technology Change

• Performance
  – Technology Advances
    » CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
  – Computer architecture advances improve low-end
    » RISC, superscalar, RAID, …
• Price: Lower costs due to …
  – Simpler development
    » CMOS VLSI: smaller systems, fewer components
  – Higher volumes
    » CMOS VLSI: same dev. cost 10,000 vs. 10,000,000 units
  – Lower margins by class of computer, due to fewer services
• Function
  – Rise of networking/local interconnection technology

Technology Trends: Microprocessor Capacity

[Figure: transistors per chip versus year, 1970 to 2000, log scale from 1,000 to 100,000,000; plotted parts: i4004, i8080, i8086, i80286, i80386, i80486, Pentium]

CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs

• Pentium 4: 55 million
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million

Moore’s Law

Memory Capacity (Single Chip DRAM)

[Figure: bits per DRAM chip versus year, 1970 to 2000, log scale from 1,000 to 1,000,000,000]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
2003   1024        60 ns

Technology dramatic change

• Processor
  – logic capacity: about 30% per year
  – clock rate: about 20% per year
• Memory
  – DRAM capacity: about 60% per year (4x every 3 years)
  – Memory speed: about 10% per year
  – Cost per bit: improves about 25% per year
• Disk
  – capacity: about 60% per year
  – Total use of data: 100% per 9 months!
• Network Bandwidth
  – Bandwidth increasing more than 100% per year!

Processor Performance (1.35X before, 1.55X now)

[Figure: processor performance by year, 1987 to 1997, vertical axis from 0 to 1200, trend annotated 1.54X/yr; plotted machines: DEC Alpha 21164/600, DEC Alpha 5/500, DEC Alpha 5/300, DEC Alpha 4/266, IBM POWER 100, DEC AXP/500, HP 9000/750, Sun-4/260, IBM RS/6000, MIPS M/120, MIPS M/2000]

Computer Architecture Is …

the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.

Amdahl, Blaauw, and Brooks, 1964


Computer Architecture Course Evolution

• 1950s to 1960s: Computer Arithmetic
• 1970s to mid 1980s: Instruction Set Design, especially ISA appropriate for compilers

• 1990s: Design of CPU, memory system, I/O system, Multiprocessors, Networks

• 2010s: Self adapting systems? Self organizing structures? DNA Systems/Quantum Computing?

• Our focus: how to exploit cool architectural features through novel systems software

Instruction Set Architecture (ISA)

[Figure: the instruction set drawn as the interface between software and hardware]

Evolution of Instruction Sets

• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
  – High-level Language Based (B5000 1963)
  – Concept of a Family (IBM 360 1964)
• General Purpose Register Machines
  – Complex Instruction Sets (Vax, Intel 432 1977-80)
  – Load/Store Architecture (CDC 6600, Cray 1 1963-76)
    » RISC (Mips, Sparc, HP-PA, IBM RS6000, . . . 1987)

Interface Design

A good interface:

• Lasts through many implementations (portability, compatibility)

• Is used in many different ways (generality)

• Provides convenient functionality to higher levels

• Permits an efficient implementation at lower levels

[Figure: one Interface persisting over time, with several uses above it and successive implementations (imp 1, imp 2, imp 3) below it]

Virtualization: One of the lessons of RISC

• Integrated Systems Approach
  – What really matters is the functioning of the complete system, i.e. hardware, runtime system, compiler, and operating system
  – In networking, this is called the “End to End argument”
  – Programmers care about high-level languages, debuggers, source-level object-oriented programming

• Computer architecture is not just about transistors, individual instructions, or particular implementations

• Original RISC projects replaced complex instructions with a compiler + simple instructions

• Logical Extension => Genetically adaptive runtime systems enhanced by dynamic compilation running on reconfigurable hardware? Perhaps.

Computer Architecture Topics

[Figure: topic map for a single-processor system. Labels: Instruction Set Architecture; Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation; Addressing, Protection, Exception Handling; L1 Cache; L2 Cache; DRAM; Disks, WORM, Tape; Coherence, Bandwidth, Latency; Emerging Technologies, Interleaving, Bus protocols; RAID; VLSI; Input/Output and Storage; Memory Hierarchy; Pipelining and Instruction Level Parallelism; Network Communication; Other Processors]

Sample Organization: It’s all about communication

[Figure: Pentium III Chipset. Labels: Proc; Caches; Busses; Memory; I/O Devices: Controllers, adapters; Disks, Displays, Keyboards; Networks]

Computer Architecture Topics

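[Figure: topic map for multiprocessors. Labels: processor (P) and memory (M) nodes attached to an Interconnection Network through switches (S); Topologies, Routing, Bandwidth, Latency, Reliability; Network Interfaces; Shared Memory, Message Passing, Data Parallelism; Processor-Memory-Switch; Multiprocessors; Networks and Interconnections]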

Measurement Tools

• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power consumption
• Simulation (many levels)
  – ISA, RTL, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles

Performance Metrics

• Bandwidth/Throughput
  – Multithreading
• Response time/Latency
  – Optimization at every level
• Throughput ==? Latency
  – Low-latency low-throughput
  – High-latency high-throughput
• Power consumption
• Security
• Robustness

Definitions

• Performance is in units of things per sec
  – bigger is better
• If we are primarily concerned with response time
  – $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
• "X is n times faster than Y" means

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$

Amdahl’s Law

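$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}}$$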
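Amdahl's Law is easy to check numerically. The sketch below is only an illustration: the function names and the example numbers (40% of the time enhanced by a factor of 10) are assumptions, not values from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the old execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced: float) -> float:
    """Limit of the overall speedup as speedup_enhanced grows without bound."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Hypothetical numbers: 40% of the time is enhanced by a factor of 10.
    print(amdahl_speedup(0.4, 10.0))
    print(max_speedup(0.4))
```

With these numbers the overall speedup is 1.5625, while no enhancement of that 40% can do better than about 1.67.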

Metrics of Performance

[Figure: levels of a computer system paired with the performance metric used at each level]
• Levels: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins
• Metrics: Answers per month; Operations per second; (millions) of Instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s; Megabytes per second; Cycles per second (clock rate)

Computer Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What affects each factor (inst count, CPI, cycle time):

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X

Cycles Per Instruction (Throughput)

“Average Cycles per Instruction”

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$$

“Instruction Frequency”

$$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \qquad \text{where } F_j = \frac{I_j}{\text{Instruction Count}}$$

Example: Calculating CPI bottom up

Base Machine (Reg / Reg), typical mix of instruction types in program

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
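The bottom-up calculation can be sketched in a few lines. The helper name `cpi_bottom_up` and the dictionary layout are illustrative assumptions; the frequencies and cycle counts are the ones in the base-machine mix.

```python
def cpi_bottom_up(mix):
    """mix maps an instruction class to (frequency, cycles per instruction)."""
    # Each class contributes frequency * cycles to the average CPI.
    contributions = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    cpi = sum(contributions.values())
    # Share of execution time spent in each class.
    time_share = {op: c / cpi for op, c in contributions.items()}
    return cpi, contributions, time_share


BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

if __name__ == "__main__":
    cpi, contributions, time_share = cpi_bottom_up(BASE_MIX)
    print(f"CPI = {cpi:.2f}")
    for op in BASE_MIX:
        print(f"{op:6s} CPI(i) = {contributions[op]:.1f}  time = {time_share[op]:.0%}")
```

The same function applies to any other mix, as long as the frequencies sum to 1.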

Example: Branch Stall Impact

• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, Stall 3 cycles on 30%

Op       Freq   Cycles   CPI(i)   (% Time)
Other    70%    1        .7       (37%)
Branch   30%    4        1.2      (63%)

new CPI = 1.9

• New machine is 1/1.9 = 0.52 times faster (i.e. slow!)

SPEC: System Performance Evaluation Cooperative

• First Round 1989
  – 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
  – SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
    » Compiler Flags unlimited. March 93 of DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
  – new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  – “benchmarks useful for 3 years”
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
• Fourth Round 2000: 26 apps
  – analysis and simulation programs
  – Compression: bzip2, gzip
  – Integrated circuit layout, ray tracing, lots of others

Performance Evaluation

• “For better or worse, benchmarks shape a field”
• Good products created when have:
  – Good benchmarks
  – Good ways to summarize performance
• Since sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales; sales almost always wins!

• Execution time is the measure of computer performance!

Integrated Circuits Costs

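$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die}$$

$$\text{Die Yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha}$$

Die Cost goes roughly with die area$^4$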
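The cost model can be turned into a small calculator. This is a sketch under stated assumptions: the function names are illustrative, the yield formula is the textbook form with exponent parameter `alpha`, and every number in the example (30 cm wafer, 0.49 cm² die, 0.6 defects per cm², alpha of 4, $5000 wafer) is hypothetical.

```python
import math


def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Wafer area over die area, minus dies lost along the edge and test dies."""
    gross = math.pi * (wafer_diam_cm / 2.0) ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2.0 * die_area_cm2)
    return gross - edge_loss - test_dies


def die_yield(wafer_yield, defect_density, die_area_cm2, alpha):
    """Fraction of good dies; alpha is a process-complexity parameter (assumed)."""
    return wafer_yield * (1.0 + defect_density * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost, dies, yield_fraction):
    """Wafer cost spread over the good dies only."""
    return wafer_cost / (dies * yield_fraction)


if __name__ == "__main__":
    dies = dies_per_wafer(30.0, 0.49)          # hypothetical wafer and die
    good = die_yield(1.0, 0.6, 0.49, 4.0)      # hypothetical defect density
    print(dies, good, die_cost(5000.0, dies, good))
```

Growing the die area both lowers the die count and lowers the yield, which is why die cost rises much faster than linearly with area.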

A "Typical" RISC

• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Delayed branch

see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

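Example: MIPS (DLX)

Format               Fields (bit positions, 31 = most significant)
Register-Register    Op (31-26), Rs1 (25-21), Rs2 (20-16), Rd (15-11), Opx (10-0; the figure also marks bits 6 and 5)
Register-Immediate   Op (31-26), Rs1 (25-21), Rd (20-16), immediate (15-0)
Branch               Op (31-26), Rs1 (25-21), Rs2/Opx (20-16), immediate (15-0)
Jump / Call          Op (31-26), target (25-0)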
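A fixed 32-bit format makes field extraction a matter of shifts and masks. The sketch below handles only the Register-Immediate layout (Op in bits 31-26, Rs1 in 25-21, Rd in 20-16, immediate in 15-0). The function names and the opcode value used in the example are assumptions chosen for illustration.

```python
def sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as a two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def encode_reg_imm(op: int, rs1: int, rd: int, imm: int) -> int:
    """Pack the four fields into one 32-bit instruction word."""
    return ((op & 0x3F) << 26) | ((rs1 & 0x1F) << 21) | ((rd & 0x1F) << 16) | (imm & 0xFFFF)


def decode_reg_imm(word: int) -> dict:
    """Unpack a 32-bit word laid out in the Register-Immediate format."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rd": (word >> 16) & 0x1F,
        "imm": sign_extend16(word & 0xFFFF),
    }


if __name__ == "__main__":
    # 0x23 is a made-up opcode for this example, not a documented encoding.
    word = encode_reg_imm(0x23, rs1=2, rd=1, imm=-4)
    print(hex(word), decode_reg_imm(word))
```

Because the register fields sit at the same bit positions in every format, hardware can read the register file in parallel with decoding the opcode.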

5 Steps of DLX Datapath (Figure 3.1, Page 130)

[Figure: unpipelined DLX datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labels: Memory, Adder (+4), Next SEQ PC, Next PC, Address, Inst, RS1, RS2, RD, Imm, Reg File, Sign Extend, MUX, Zero?, ALU, Data Memory, LMD, WB Data]

5 Steps of DLX Datapath (Figure 3.4, Page 134)

[Figure: pipelined DLX datapath. Stages: Instruction Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: Memory, Adder (+4), Next SEQ PC, Next PC, Address, RS1, RS2, Imm, RD, Reg File, Sign Extend, MUX, Zero?, ALU, Data Memory, WB Data]

• Data stationary control
  – local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure 3.3, Page 133)

[Figure: four instructions in program order (vertical axis: Instr. Order) against time in clock cycles (Cycle 1 to Cycle 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

Pipeline Stages vs. Throughput

• Hazard
• Register overhead
• Clock skew

Pipelining is not quite that easy!

• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)

– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

One Memory Port/Structural Hazards (Figure 3.6, Page 142)

[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order against time in clock cycles; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
One Memory Port/Structural Hazards

[Figure: the same diagram with a stall inserted: Load, Instr 1, Instr 2, Stall (a row of five Bubbles), Instr 3]

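Speed Up Equation for Pipelining

$$\text{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average stall cycles per Inst}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$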

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory (“Harvard Architecture”)

• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate

• Ideal CPI = 1 for both• Loads are 40% of instructions executed

SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)

= Pipeline Depth

SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05))

= (Pipeline Depth/1.4) x 1.05

= 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
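The comparison can be reproduced with the pipelining speedup equation. In this sketch the function name is illustrative and the pipeline depth of 5 is an assumption; the depth cancels in the ratio, so any positive value gives the same answer.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is unpipelined cycle time divided by pipelined cycle time.
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5             # assumed; cancels out of the A/B ratio
LOAD_FRACTION = 0.4   # loads are 40% of instructions executed

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
# Machine B: each load stalls one cycle, but the clock is 1.05 times faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=LOAD_FRACTION * 1, clock_ratio=1.05)

if __name__ == "__main__":
    print(speedup_a, speedup_b, speedup_a / speedup_b)
```

The ratio comes out at about 1.33, matching the slide: the 5% clock advantage does not make up for stalling on 40% of instructions.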

Data Hazard on R1 (Figure 3.9, page 147)

[Figure: pipeline diagram, stages IF, ID/RF, EX, MEM, WB (drawn as Ifetch, Reg, ALU, DMem, Reg) against time in clock cycles, for the following instructions in program order]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards

• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

• Write After Write (WAW): InstrJ writes operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
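The three cases reduce to comparing the register each instruction writes with the registers the other reads or writes. The sketch below only detects the dependences between two instructions; whether a dependence becomes a real hazard depends on the pipeline. The `Instr` type and the `dependences` function are illustrative names, not part of any tool.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Dependences of the later instruction j on the earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes (true dependence)
    if j.dest in i.srcs:
        found.add("WAR")  # j overwrites what i still has to read (anti-dependence)
    if i.dest == j.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    print(dependences(add, Instr("sub", "r4", ("r1", "r3"))))
    print(dependences(Instr("sub", "r4", ("r1", "r3")), add))
    print(dependences(Instr("sub", "r1", ("r4", "r3")), add))
```

The three calls correspond to the I/J pairs in the RAW, WAR, and WAW examples, in that order.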

Forwarding to Avoid Data Hazard

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) against time in clock cycles for the following instructions in program order]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

HW Change for Forwarding (Figure 3.20, Page 161)

[Figure: forwarding hardware. Labels: pipeline registers ID/EX, EX/MEM, MEM/WR; Registers; NextPC; Immediate; three muxes; ALU; Data Memory]

Data Hazard Even with Forwarding

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) against time in clock cycles for the following instructions in program order]

lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

Data Hazard Even with Forwarding (Figure 3.13, Page 154)

[Figure: the same sequence (lw r1, 0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9) against time in clock cycles, with a Bubble inserted in each of the three instructions after the lw]

Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f in memory.

Slow code:

    LW Rb,b
    LW Rc,c
    ADD Ra,Rb,Rc
    SW a,Ra
    LW Re,e
    LW Rf,f
    SUB Rd,Re,Rf
    SW d,Rd

Fast code:

    LW Rb,b
    LW Rc,c
    LW Re,e
    ADD Ra,Rb,Rc
    LW Rf,f
    SW a,Ra
    SUB Rd,Re,Rf
    SW d,Rd
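The benefit of the reordering can be counted mechanically. The sketch below assumes a 5-stage pipeline with forwarding, where the only remaining stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tiny parser handles only the `LW`, `SW`, and three-operand forms used in this example, and it conservatively treats a store's data register as a use.

```python
def parse(line):
    """Return (opcode, destination register or None, source registers)."""
    op, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    if op == "SW":   # SW addr,Rsrc reads Rsrc and writes no register
        return op, None, [regs[1]]
    if op == "LW":   # LW Rdest,addr writes Rdest
        return op, regs[0], []
    return op, regs[0], regs[1:]   # e.g. ADD Rdest,Rsrc1,Rsrc2


def load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded value right away."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "LW" and prev_dest in srcs:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


SLOW = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
FAST = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

if __name__ == "__main__":
    print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under these assumptions the slow ordering stalls twice (at the ADD and at the SUB) and the fast ordering does not stall at all, for the same eight instructions.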

Control Hazard on Branches: Three Stage Stall

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the following instructions]

10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

Branch Stall Impact

• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!

• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• DLX branch tests if register = 0 or ≠ 0
• DLX Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

Pipelined DLX Datapath (Figure 3.22, page 163)

[Figure: pipelined DLX datapath. Stages: Instruction Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: Memory, Adder (+4), a second Adder, Next SEQ PC, Next PC, Address, RS1, RS2, Imm, RD, Reg File, Sign Extend, MUX, Zero?, ALU, Data Memory, WB Data]

• Data stationary control
  – local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% DLX branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% DLX branches taken on average
  – But haven’t calculated branch target address in DLX
    » DLX still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor1
      sequential successor2
      ........
      sequential successorn      (successors 1 to n: branch delay of length n)
      branch target if taken

  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – DLX uses this

Delayed Branch

• Where to get instructions to fill branch delay slot?
  – Before branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Evaluating Branch Alternatives

Scheduling scheme    Branch penalty   CPI    speedup v. unpipelined   speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
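The CPI column can be reproduced from the speedup formula. This sketch makes two assumptions that the slide does not state outright: the pipeline depth is 5, and the predict-not-taken scheme pays its 1-cycle penalty only on the 65% of branches that change the PC. Values computed this way can differ from the table in the last digit because the table appears to round at intermediate steps.

```python
BRANCH_FREQUENCY = 0.14   # conditional and unconditional branches
TAKEN_FRACTION = 0.65     # fraction of branches that change the PC
PIPELINE_DEPTH = 5        # assumed

# Average penalty in cycles per branch for each scheme.
EFFECTIVE_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # assumed: penalty only when taken
    "Delayed branch": 0.5,
}


def branch_cpi(penalty):
    """CPI with an ideal CPI of 1 plus the average branch stall cycles."""
    return 1.0 + BRANCH_FREQUENCY * penalty


def speedup_vs_unpipelined(penalty):
    return PIPELINE_DEPTH / branch_cpi(penalty)


if __name__ == "__main__":
    stall_cpi = branch_cpi(EFFECTIVE_PENALTY["Stall pipeline"])
    for name, penalty in EFFECTIVE_PENALTY.items():
        cpi = branch_cpi(penalty)
        print(f"{name:18s} CPI={cpi:.2f} "
              f"vs unpipelined={speedup_vs_unpipelined(penalty):.2f} "
              f"vs stall={stall_cpi / cpi:.2f}")
```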

Changes in the flow of instructions make pipelining difficult

• Must avoid adding too much overhead in pipeline startup and drain.
• Branches and Jumps cause fast alteration of PC. Things that get in the way:
  – Instructions take time to decode, introducing delay slots.
  – The next PC takes time to compute
  – For conditional branches, the branch direction takes time to compute.

• Interrupts and Exceptions also cause problems

– Must make decisions about when to interrupt flow of instructions

– Must preserve sufficient pipeline state to resume execution

Jumps and Calls (JAL) (unconditional branches)

• Even though we know that we will change PC, still require delay slot because of:
  – Instruction Decode -- Pretty hard and fast
  – PC Computation -- Could fix with absolute jumps/calls (not necessarily a good solution)
• Basically, there is a decision being made, which takes time.
• This suggests single delay slot:
  – I.e. next instruction after jump or JAL is always executed

[Figure: pipelined DLX datapath with support for jumps and calls. Stages: Instruction Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: Memory, Adder (+4), a second Adder, Next PC, Address, Opcode, Branch?, Return PC (Addr + 8), RS1, RS2, Imm, RD, Reg File, Sign Extend, MUX, ALU, Data Memory, WB Data]

Summary of Pipelining Basics

• Hazards limit performance
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation & PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
• Interrupts, Instruction Set, FP make pipelining harder
• Compilers reduce cost of data and control hazards
  – Load delay slots
  – Branch delay slots
  – Branch prediction

• Today: Longer pipelines (R4000) => Better branch prediction, more instruction parallelism?

Chapter 3: Advanced Pipelining

Problem: “Fetch” unit

[Figure: Instruction Fetch with Branch Prediction sends a Stream of Instructions To Execute to the Out-Of-Order Execution Unit, which returns Correctness Feedback On Branch Results]

• Instruction fetch decoupled from execution
• Often issue logic (+ rename) included with Fetch

Branches must be resolved quickly for loop overlap!

• In our loop-unrolling example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap!

Loop: LD    F0,0(R1)
      MULTD F4,F0,F2
      SD    F4,0(R1)
      SUBI  R1,R1,#8
      BNEZ  R1,Loop

• What happens if branch depends on result of multd??

– We completely lose all of our advantages!

– Need to be able to “predict” branch outcome.

– If we were to predict that branch was taken, this would be right most of the time.

• Problem much worse for superscalar machines!

Prediction: Branches, Dependencies, Data

• Prediction has become essential to getting good performance from scalar instruction streams.
• We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions:
  – At what point does computation become a probabilistic operation + verification?
  – We are pretty close with control hazards already…
• Why does prediction work?
  – Underlying algorithm has regularities.
  – Data that is being operated on has regularities.
  – Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems.
• Prediction ⇒ Compressible information streams?

Dynamic Branch Prediction

• Is dynamic branch prediction better than static branch prediction?
  – Seems to be. Still some debate to this effect
  – Josh Fisher had good paper on “Predicting Conditional Branch Directions from Previous Runs of a Program.” ASPLOS ‘92. In general, good results if allowed to run program for lots of data sets.
    » How would this information be stored for later use?
    » Still some difference between best possible static prediction (using a run to predict itself) and weighted average over many different data sets
  – Paper by Young et al., “A Comparative Analysis of Schemes for Correlated Branch Prediction” notices that there are a small number of important branches in programs which have dynamic behavior.

Need Address at Same Time as Prediction

• Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)
  – Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p. 273)
• Return instruction addresses predicted with stack

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC column of the buffer; the matching entry supplies the Predicted PC and a predict taken or untaken bit]

Dynamic Branch Prediction

• Prediction could be “Static” (at compile time) or “Dynamic” (at runtime)
  – For our example, if we were to statically predict “taken”, we would only be wrong once each pass through loop
  – Static information passed through bits in opcode
• Is dynamic branch prediction better than static branch prediction?
  – Seems to be. Still some debate to this effect
  – Today, lots of hardware being devoted to dynamic branch predictors.
• Does branch prediction make sense for 5-stage, in-order pipeline? What about 8-stage pipeline?
  – Perhaps: eliminate branch delay slots
  – Then predict branches

Branch History Table

• BHT is a table of “Predictors”
  – Usually 2-bit, saturating counters
  – Indexed by PC address of Branch – without tags
• In Fetch stage of branch:
  – BTB identifies branch
  – Predictor from BHT used to make prediction
• When branch completes
  – Update corresponding Predictor

[Figure: the Branch PC indexes a Branch History Table holding Predictor 0, Predictor 1, … Predictor 7]

Dynamic Branch Prediction (standard technologies)

• Combine Branch Target Buffer and History Tables
  – Branch Target Buffer (BTB): identify branches and hold taken addresses
    » Trick: identify branch before fetching instruction!
    » Must be careful not to misidentify branches or destinations
  – Branch History Table makes prediction
    » Can be complex prediction mechanisms with long history
    » No address check: Can be good, can be bad (aliasing)
• Simple 1-bit BHT: keep last direction of branch
• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  – End of loop case, when it exits instead of looping as before
  – First time through loop on next time through code, when it predicts exit instead of looping
• Performance = ƒ(accuracy, cost of misprediction)
  – Misprediction ⇒ Flush Reorder Buffer
• Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process

Dynamic Branch Prediction (Jim Smith, 1981)

[Figure: state diagram of the 2-bit scheme with four states, two labeled Predict Taken and two labeled Predict Not Taken; transitions are labeled T (taken) and NT (not taken)]
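A branch history table of 2-bit saturating counters can be sketched as follows. This is an illustration, not a model of any particular processor: the class name, the table size, the initial counter value of 0, and the index function (word-aligned PC modulo table size, no tags) are all assumptions. It uses the saturating-counter variant; the state diagram in the figure may wire its transitions slightly differently.

```python
class TwoBitBHT:
    """Untagged table of 2-bit saturating counters (0..3); 2 or 3 predicts taken."""

    def __init__(self, entries=4096, initial=0):
        self.entries = entries
        self.counters = [initial] * entries

    def _index(self, pc):
        # Drop the two low bits of a word-aligned PC; no tag check, so
        # different branches can alias to the same counter.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Predict, then train, for each outcome of the branch at pc."""
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong


if __name__ == "__main__":
    bht = TwoBitBHT(entries=8)
    loop = [True] * 9 + [False]   # loop branch: taken 9 times, then exits
    print(count_mispredictions(bht, 0x40, loop))   # first pass, cold counters
    print(count_mispredictions(bht, 0x40, loop))   # later pass, warmed up
```

Once warmed up, the counter mispredicts only the loop exit, one misprediction per pass instead of the two that a 1-bit scheme incurs.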

BHT Accuracy

• Mispredict because either:
  – Wrong guess for that branch
  – Got branch history of wrong branch when index the table
• 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 about as good as infinite table (in Alpha 21164)

Correlating Branches

• Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch
• Two possibilities; Current branch depends on:
  – Last m most recently executed branches anywhere in program. Produces a “GA” (for “global adaptive”) in the Yeh and Patt classification (e.g. GAg)
  – Last m most recent outcomes of same branch. Produces a “PA” (for “per-address adaptive”) in same classification (e.g. PAg)
• Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  – A single history table shared by all branches (appends a “g” at end), indexed by history value.
  – Address is used along with history to select table entry (appends a “p” at end of classification)
  – If only portion of address used, often appends an “s” to indicate “set-indexed” tables (i.e. GAs)

Correlating Branches

• For instance, consider global history, set-indexed BHT. That gives us a GAs history table.

(2,2) GAs predictor
  – First 2 means that we keep two bits of history
  – Second means that we have 2 bit counters in each slot.
  – Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction
  – Note that the original two-bit counter solution would be a (0,2) GAs predictor
  – Note also that aliasing is possible here...

[Figure: the Branch address selects a row of 2-bits per branch predictors; the 2-bit global branch history register selects one Prediction within the row; each slot is a 2-bit counter]
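A (2,2) predictor of this kind can be sketched by giving every table row one 2-bit counter per global-history pattern. The class name, table size, index function, and initial counter value of 1 are assumptions made for the example. Setting `history_bits=0` degenerates to the plain 2-bit counter table, which the example uses as a baseline.

```python
class CorrelatingPredictor:
    """(m, 2) predictor: m bits of global history select among 2-bit counters."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0   # global history register, newest outcome in bit 0
        self.table = [[1] * (1 << history_bits) for _ in range(entries)]

    def _row(self, pc):
        # Set-indexed by part of the branch address, so aliasing is possible.
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        counter = row[self.history]
        # Update only the counter selected by the current history.
        row[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


if __name__ == "__main__":
    alternating = [True, False] * 50   # a branch that alternates T, NT, T, NT, ...
    print(count_mispredictions(CorrelatingPredictor(history_bits=2), 0x80, alternating))
    print(count_mispredictions(CorrelatingPredictor(history_bits=0), 0x80, alternating))
```

On a strictly alternating branch, the history-indexed counters learn the pattern after a couple of outcomes, while the history-free table keeps flipping its single counter and mispredicts every time.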

Comparison

Tournament Predictor

• Multiple predictors per branch, typically two, one global and the other local

• Use a 2-bit saturating counter to determine which predictor’s outcome to use

– The counter is incremented when the prediction of one predictor is correct and that of the other predictor is incorrect; it is decremented in the reverse direction

• Slight improvement over correlating predictors

Misprediction Ratio Comparison

Predicated Execution

• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
  – If false, then neither store result nor cause exception
  – Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  – IA-64: 64 1-bit condition fields selected so conditional execution of any instruction
  – This transformation is called “if-conversion”
• Drawbacks to conditional instructions
  – Still takes a clock even if “annulled”
  – Stall if condition evaluated late
  – Complex conditions reduce effectiveness; condition result becomes known late in pipeline

Dynamic Branch Prediction Summary

• Prediction becoming important part of scalar execution.

– Prediction is exploiting “information compressibility” in execution

• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with next branch.
  – Either different branches (GA)
  – Or different executions of same branches (PA).

• Branch Target Buffer: include branch address & prediction

• Predicated Execution can reduce number of branches, number of mispredicted branches

CSE502 Projects

• 2 Persons per group working on a project
• Project proposal due on 10/3 (2% of final grade)
  – Members of the group
  – Essentially Introduction section of your final report
    » What is the research question?
    » What is the expected deliverable?
    » What is the general approach?

Can we use HW to get CPI closer to 1?

• Why in HW at run time?
  – Works when can’t know real dependence at compile time
  – Compiler simpler
  – Code for one machine runs well on another
• Key idea: Allow instructions behind stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
• Out-of-order execution => out-of-order completion.

Problems?

• How do we prevent WAR and WAW hazards?
• How do we deal with variable latency?
  – Forwarding for RAW hazards harder.

Clock Cycle Number: 1 to 17. Each instruction is fetched one cycle after the one above it; stages are listed one per clock cycle.

Instruction        Stages
LD F6,34(R2)       IF ID EX MEM WB
LD F2,45(R3)       IF ID EX MEM WB
MULTD F0,F2,F4     IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB
SUBD F8,F6,F2      IF ID A1 A2 MEM WB
DIVD F10,F0,F6     IF ID stall stall stall stall stall stall stall stall stall D1 D2
ADDD F6,F8,F2      IF ID A1 A2 MEM WB

Hazards marked in the figure: RAW, WAR

Dynamic hardware techniques for out-of-order execution

• HW exploitation of ILP
  – Works when can’t know dependence at compile time.
  – Code for one machine runs well on another
• Scoreboard (à la CDC 6600 in 1963)
  – Centralized control structure
  – No register renaming, no forwarding
  – Pipeline stalls for WAR and WAW hazards.
  – Are these fundamental limitations??? (No)
• Reservation stations (à la IBM 360/91 in 1966)
  – Distributed control structures
  – Implicit renaming of registers (dispatched pointers)
  – WAR and WAW hazards eliminated by register renaming
  – Results broadcast to all reservation stations for RAW

Scoreboard: a bookkeeping technique

• Out-of-order execution divides ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
• Scoreboards date to CDC6600 in 1963
• Instructions execute whenever not dependent on previous instructions and no hazards.
• CDC 6600: In order issue, out-of-order execution, out-of-order commit (or completion)
  – No forwarding!
  – Imprecise interrupt/exception model for now

Scoreboard Architecture (CDC 6600)

[Figure: block diagram with Registers, Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), Memory, and the SCOREBOARD.]

Scoreboard Implications

• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
– Stall writeback until registers have been read
– Read registers only during Read Operands stage
• Solution for WAW:
– Detect hazard and stall issue of new instruction until other instruction completes
• No register renaming!
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages

Four Stages of Scoreboard Control

• Issue: decode instructions & check for structural hazards (ID1)
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
• Read operands: wait until no data hazards, then read operands (ID2)
– All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data.
– No forwarding of data in this model!

Four Stages of Scoreboard Control

• Execution: operate on operands (EX)
– The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
• Write result: finish execution (WB)
– Stall until no WAR hazards with previous instructions:

Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

CDC 6600 scoreboard would stall the write result of SUBD until ADDD has read its operands (ADDD still needs the old value of F8)

Three Parts of the Scoreboard

• Instruction status: which of 4 steps the instruction is in
• Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
– Busy: Indicates whether the unit is busy or not
– Op: Operation to perform in the unit (e.g., + or –)
– Fi: Destination register
– Fj, Fk: Source-register numbers
– Qj, Qk: Functional units producing source registers Fj, Fk
– Rj, Rk: Flags indicating when Fj, Fk are ready
• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register

Scoreboard Example

Instruction status:
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result
LD     F6     34+   R2
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Functional unit status:
Time   Name      Busy   Op   Fi (dest)   Fj (S1)   Fk (S2)   Qj (FU)   Qk (FU)   Rj (Fj?)   Rk (Fk?)
       Integer   No
       Mult1     No
       Mult2     No
       Add       No
       Divide    No

Register result status:
Clock   F0   F2   F4   F6   F8   F10   F12   ...   F30
FU

Detailed Scoreboard Pipeline Control

Instruction status: Issue
– Wait until: not Busy(FU) and not Result(D)
– Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← ‘D’; Fj(FU) ← ‘S1’; Fk(FU) ← ‘S2’; Qj ← Result(‘S1’); Qk ← Result(‘S2’); Rj ← not Qj; Rk ← not Qk; Result(‘D’) ← FU

Instruction status: Read operands
– Wait until: Rj and Rk
– Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
– Wait until: functional unit done

Instruction status: Write result
– Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
– Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
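The wait conditions and bookkeeping above translate fairly directly into code. The sketch below is an illustration under assumptions, not the CDC 6600 logic itself: the class and method names (`FUStatus`, `Scoreboard`, `can_issue`, and so on) are invented here, execution latency is not modelled, and a cleared Qj/Qk is represented as `None`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class FUStatus:
    """The 9 functional-unit status fields named on the slides."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing fj / fk (None = no producer)
    qk: Optional[str] = None
    rj: bool = False           # fj / fk ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names: Iterable[str]):
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in fu_names}
        self.result: Dict[str, str] = {}  # register -> FU that will write it

    def can_issue(self, fu: str, dest: str) -> bool:
        # Not Busy(FU) and not Result(D): no structural hazard, no WAW hazard.
        return not self.fu[fu].busy and dest not in self.result

    def issue(self, fu: str, op: str, dest: str, s1: str, s2: str) -> None:
        assert self.can_issue(fu, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(True, op, dest, s1, s2, qj, qk,
                               qj is None, qk is None)
        self.result[dest] = fu

    def can_read_operands(self, fu: str) -> bool:
        return self.fu[fu].rj and self.fu[fu].rk

    def read_operands(self, fu: str) -> None:
        assert self.can_read_operands(fu)
        self.fu[fu].rj = self.fu[fu].rk = False

    def can_write_result(self, fu: str) -> bool:
        # WAR check: no unit may still be waiting to read the old value of Fi.
        fi = self.fu[fu].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, fu: str) -> None:
        assert self.can_write_result(fu)
        for f in self.fu.values():
            if f.qj == fu:
                f.qj, f.rj = None, True
            if f.qk == fu:
                f.qk, f.rk = None, True
        del self.result[self.fu[fu].fi]
        self.fu[fu] = FUStatus()
```

Issuing MULTD, DIVD, and ADDD from the running example shows the same WAR stall as the slides: ADDD cannot write F6 until DIVD has read its operands, which in turn waits for MULTD to write F0.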

Scoreboard Example: Cycle 1

Instruction status:
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result
LD     F6     34+   R2    1
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Functional unit status:
Name      Busy   Op     Fi   Fj   Fk   Qj       Qk       Rj   Rk
Integer   Yes    Load   F6        R2                          Yes
Mult1     No
Mult2     No
Add       No
Divide    No

Register result status, clock 1: F6 = Integer


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




2 FU Integer

• Issue 2nd LD?


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F6 R2 NoMult1 NoMult2 NoAdd NoDivide No


3 FU Integer

• Issue MULT?


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




4 FU Integer


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




5 FU Integer


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6MULTD F0 F2 F4 6SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide No


6 FU Mult1 Integer


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7

MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F2 R3 NoMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide No


7 FU Mult1 Integer Add

• Read multiply operands?

Scoreboard Example: Cycle 8a

(First half of clock cycle)

Instruction status:
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result
LD     F6     34+   R2    1       2           3           4
LD     F2     45+   R3    5       6           7
MULTD  F0     F2    F4    6
SUBD   F8     F6    F2    7
DIVD   F10    F0    F6    8
ADDD   F6     F8    F2

Functional unit status:
Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
Integer   Yes    Load   F2         R3                             No
Mult1     Yes    Mult   F0    F2   F4   Integer             No    Yes
Mult2     No
Add       Yes    Sub    F8    F6   F2             Integer   Yes   No
Divide    Yes    Div    F10   F0   F6   Mult1               No    Yes

Register result status, clock 8: F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide

Scoreboard Example: Cycle 8b

(Second half of clock cycle)

Instruction status:
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result
LD     F6     34+   R2    1       2           3           4
LD     F2     45+   R3    5       6           7           8
MULTD  F0     F2    F4    6
SUBD   F8     F6    F2    7
DIVD   F10    F0    F6    8
ADDD   F6     F8    F2

Functional unit status:
Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
Integer   No
Mult1     Yes    Mult   F0    F2   F4                       Yes   Yes
Mult2     No
Add       Yes    Sub    F8    F6   F2                       Yes   Yes
Divide    Yes    Div    F10   F0   F6   Mult1               No    Yes

Register result status, clock 8: F0 = Mult1, F8 = Add, F10 = Divide


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4

LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2


Integer No10 Mult1 Yes Mult F0 F2 F4 Yes Yes

Mult2 No2 Add Yes Sub F8 F6 F2 Yes Yes

Divide Yes Div F10 F0 F6 Mult1 No Yes



• Read operands for MULT & SUB? Issue ADDD?

Note Remaining


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2


Integer No9 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No1 Add Yes Sub F8 F6 F2 No No





Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11DIVD F10 F0 F6 8ADDD F6 F8 F2



Mult2 No0 Add Yes Sub F8 F6 F2 No No





Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2



Mult2 NoAdd NoDivide Yes Div F10 F0 F6 Mult1 No Yes


12 FU Mult1 Divide

• Read operands for DIVD?


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13



Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes




Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14



Mult2 No2 Add Yes Add F6 F8 F2 Yes Yes





Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9

SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14



Mult2 No1 Add Yes Add F6 F8 F2 No No





Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16



Mult2 No0 Add Yes Add F6 F8 F2 No No








Mult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Mult1 No Yes



• Why not write result of ADD???

WAR Hazard!


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16







Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16


Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Yes Yes


20 FU Add Divide


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16


Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Yes Yes


21 FU Add Divide

• WAR Hazard is now gone...


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16 22


Integer NoMult1 NoMult2 NoAdd No

39 Divide Yes Div F10 F0 F6 No No


22 FU Divide


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61ADDD F6 F8 F2 13 14 16 22


Integer NoMult1 NoMult2 NoAdd No

0 Divide Yes Div F10 F0 F6 No No


61 FU Divide


Instruction status: Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDD F6 F8 F2 13 14 16 22




62 FU

Review: Scoreboard Example: Cycle 62

Instruction status:
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result
LD     F6     34+   R2    1       2           3           4
LD     F2     45+   R3    5       6           7           8
MULTD  F0     F2    F4    6       9           19          20
SUBD   F8     F6    F2    7       9           11          12
DIVD   F10    F0    F6    8       21          61          62
ADDD   F6     F8    F2    13      14          16          22

Register result status, clock 62: no functional unit has a pending write

• In-order issue; out-of-order execute & commit

CDC 6600 Scoreboard

• Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
– No forwarding hardware
– Limited to instructions in basic block (small window)
– Small number of functional units (structural hazards), especially integer/load store units
– Do not issue on structural hazards
– Wait for WAR hazards
– Prevent WAW hazards

Another Dynamic Algorithm: Tomasulo Algorithm

• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
– IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
– IBM has 4 FP registers vs. 8 in CDC 6600
– IBM has memory-register ops
• Why Study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Algorithm vs. Scoreboard

• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
– Avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Tomasulo Organization

[Figure: FP Op Queue and FP Registers feed the Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers). Load Buffers (Load1 to Load6) take data from memory, Store Buffers send data to memory, and the Common Data Bus (CDB) carries results.]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Values of source operands
– Store buffers have a V field: the result to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)
– Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
– Store buffers only have Qi for RS producing result

Busy: Indicates reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
– If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
– When both operands ready then execute (RAW); if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
– Write on Common Data Bus to all awaiting units; mark reservation station available

• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
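The issue and write-result stages, together with the "come from" behaviour of the Common Data Bus, can be sketched as follows. This is an illustrative model under assumptions rather than the IBM 360/91 design: the names `RS`, `Tomasulo`, `issue`, and `broadcast` are hypothetical, execution timing is left out, and the caller supplies the computed value when a station broadcasts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class RS:
    """One reservation station with the fields named on the slides."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # producing stations (None = value is ready)
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, stations: Iterable[str], regs: Dict[str, float]):
        self.rs: Dict[str, RS] = {n: RS(n) for n in stations}
        self.regs = dict(regs)
        self.reg_status: Dict[str, str] = {}  # register -> producing station

    def _read(self, reg: str) -> Tuple[Optional[float], Optional[str]]:
        producer = self.reg_status.get(reg)
        if producer is not None:
            return None, producer      # rename: remember the tag, not the name
        return self.regs[reg], None    # value available: copy it now

    def issue(self, station: str, op: str, dest: str, s1: str, s2: str) -> bool:
        rs = self.rs[station]
        if rs.busy:
            return False               # structural hazard: do not issue
        rs.busy, rs.op = True, op
        rs.vj, rs.qj = self._read(s1)
        rs.vk, rs.qk = self._read(s2)
        self.reg_status[dest] = station
        return True

    def broadcast(self, station: str, value: float) -> None:
        """Write result: every waiter matching the source tag grabs the value."""
        for rs in self.rs.values():
            if rs.qj == station:
                rs.vj, rs.qj = value, None
            if rs.qk == station:
                rs.vk, rs.qk = value, None
        for reg, producer in list(self.reg_status.items()):
            if producer == station:
                self.regs[reg] = value
                del self.reg_status[reg]
        self.rs[station] = RS(station)  # mark the station available
```

Because DIVD copies the value of F6 into Vk at issue time, a later ADDD that writes F6 can complete first without disturbing it, which is how renaming removes the WAR stall seen in the scoreboard.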

Tomasulo Example

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD     F6     34+   R2
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers:
Name    Busy   Address
Load1   No
Load2   No
Load3   No

Reservation Stations:
Time   Name    Busy   Op   Vj (S1)   Vk (S2)   Qj (RS)   Qk (RS)
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status, clock 0: no register has a pending write

Tomasulo Example Cycle 1

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD     F6     34+   R2    1
LD     F2     45+   R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers:
Name    Busy   Address
Load1   Yes    34+R2
Load2   No
Load3   No

Register result status, clock 1: F6 = Load1


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




2 FU Load2 Load1

Note: Unlike 6600, can have multiple loads outstanding


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


3 FU Mult1 Load2 Load1

• Note: register names are removed (“renamed”) in Reservation Stations; MULTD has already issued here (the scoreboard did not issue it until cycle 6)

• Load1 completing; what is waiting for Load1?


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2


Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


4 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2


2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1


5 FU Mult1 M(A2) M(A1) Add1 Mult2


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6


1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No



6 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here vs. scoreboard?


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6


0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No



7 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 completing; what is waiting for it?


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6


Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1


8 FU Mult1 M(A2) Add2 (M-M) Mult2


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6







Tomasulo Example Cycle 10

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10







• Add2 completing; what is waiting for it?


Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 No



11 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here vs. scoreboard?
• All quick instructions complete in this cycle!


Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11







Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)


16 FU M*F4 M(A2) (M-M+M)(M-M) Mult2


Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11







Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11






• Mult2 is completing; what is waiting for it?


Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)


56 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution and completion.

Compare to Scoreboard Cycle 62

Instruction status:
                          Scoreboard                                     Tomasulo
Instruction   j     k     Issue   Read Oper   Exec Comp   Write Result   Issue   Exec Comp   Write Result
LD     F6     34+   R2    1       2           3           4              1       3           4
LD     F2     45+   R3    5       6           7           8              2       4           5
MULTD  F0     F2    F4    6       9           19          20             3       15          16
SUBD   F8     F6    F2    7       9           11          12             4       7           8
DIVD   F10    F0    F6    8       21          61          62             5       56          57
ADDD   F6     F8    F2    13      14          16          22             6       10          11

• Why take longer on scoreboard/6600?
• Structural Hazards
• Lack of forwarding

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

                     Tomasulo (IBM 360/91)             Scoreboard (CDC 6600)
Functional units     Pipelined Functional Units        Multiple Functional Units
                     (6 load, 3 store, 3 +, 2 x/÷)     (1 load/store, 1 +, 2 x, 1 ÷)
Window size          ≤ 14 instructions                 ≤ 5 instructions
Structural hazard    No issue                          same
WAR                  renaming avoids                   stall completion
WAW                  renaming avoids                   stall issue
Results              Broadcast results from FU         Write/read registers
Control              reservation stations              central scoreboard

Tomasulo Drawbacks

• Complexity
– delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Multiple CDBs => more FU logic for parallel assoc stores

Review: Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
– If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
– When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
– Write on Common Data Bus to all awaiting units; mark reservation station available

• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast

HW support for precise interrupts

• Concept of Reorder Buffer (ROB):
– Holds instructions in FIFO order, exactly as they were issued
» Each ROB entry contains PC, dest reg, result, exception status
– When instructions complete, results placed into ROB
» Supplies operands to other instructions between execution complete & commit => more registers, like RS
» Tag results with ROB buffer number instead of reservation station
– Instructions commit => values at head of ROB placed in registers
– As a result, easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: FP Op Queue, FP Regs, Reorder Buffer, and Res Stations in front of two FP Adders, with a commit path.]

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
– If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
– When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
– Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
– When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
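The commit step is what keeps the architectural state precise: results may arrive out of order, but they reach the registers only from the head of the buffer. A minimal sketch under assumptions follows; `ROBEntry`, `ReorderBuffer`, and the `mispredicted` flag are names invented for this illustration, and stores to memory are not modelled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class ROBEntry:
    dest: Optional[str]            # architectural register, None for branches
    value: Optional[float] = None
    done: bool = False
    mispredicted: bool = False     # only meaningful for branch entries


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries: Deque[ROBEntry] = deque()  # left end = oldest

    def allocate(self, dest: Optional[str]) -> Optional[ROBEntry]:
        """Issue: take a slot in program order, or None if the buffer is full."""
        if len(self.entries) == self.size:
            return None
        entry = ROBEntry(dest)
        self.entries.append(entry)
        return entry

    def commit(self, regs: Dict[str, float]) -> bool:
        """Retire the head entry if its result is present; report success."""
        if not self.entries or not self.entries[0].done:
            return False
        head = self.entries.popleft()
        if head.mispredicted:
            self.entries.clear()   # flush everything younger than the branch
        elif head.dest is not None:
            regs[head.dest] = head.value
        return True
```

A younger instruction that finishes early simply waits in the buffer, and anything allocated after a mispredicted branch is discarded without ever touching `regs`.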

What are the hardware complexities with reorder buffer (ROB)?

• How do you find the latest version of a register?
– As specified by Smith paper, need associative comparison network
– Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
• Need as many ports on ROB as register file

[Figure: Reorder Buffer with a compare network, FP Op Queue, FP Regs, and Res Stations in front of two FP Adders. Reorder Table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter.]

Tomasulo With Reorder buffer:

[Figure sequence: Tomasulo datapath with a Reorder Buffer (FP Op Queue, Registers, reservation stations for the FP adders and FP multipliers, load buffers from memory, store path to memory). Each reorder buffer entry has Dest, instruction, and Done? fields. Over successive frames the buffer fills as follows, oldest entry first:]

Entry   Dest   Instruction
ROB1    F0     LD   F0,10(R2)
ROB2    F10    ADDD F10,F4,F0
ROB3    F2     DIVD F2,F10,F6
ROB4    --     BNE  F2,<…>
ROB5    F4     LD   F4,0(R3)
ROB6    F0     ADDD F0,F4,F6
ROB7    --     ST   0(R3),F4

[Reservation stations hold operands renamed to ROB tags, e.g. "2 ADDD R(F4),ROB1", "3 DIVD ROB2,R(F6)", and "6 ADDD ROB5,R(F6)"; load buffer entries are "1 10+R2" and "6 0+R3". In the later frames LD F4,0(R3) is marked done (Y) with value M[10], the waiting ADDD becomes "6 ADDD M[10],R(F6)" and is then shown as Ex with result <val2>, while LD F0,10(R2), ADDD F10,F4,F0, and DIVD F2,F10,F6 at the old end of the buffer are still marked N.]

What about memory hazards???

Memory Disambiguation: Sorting out RAW Hazards in memory

• Question: Given a load that follows a store in program order, are the two related?
– (Alternatively: is there a RAW hazard between the store and the load?)

Eg: st 0(R2),R5
    ld R6,0(R3)

• Can we go ahead and start the load early?
– Store address could be delayed for a long time by some calculation that leads to R2 (divide?).
– We might want to issue/begin execution of both operations in same cycle.
– Today: Answer is that we are not allowed to start load until we know that address 0(R2) ≠ 0(R3)
– Next Week: We might guess at whether or not they are dependent (called “dependence speculation”) and use reorder buffer to fix up if we are wrong.

Hardware Support for Memory Disambiguation

• Need buffer to keep track of all outstanding stores to memory, in program order.
– Keep track of address (when it becomes available) and value (when it becomes available)
– FIFO ordering: will retire stores from this buffer in program order
• When issuing a load, record current head of store queue (know which stores are ahead of you).
• When have address for load, check store queue:
– If any store prior to load is waiting for its address, stall load.
– If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
» store value available => return value
» store value not available => return ROB number of source
– Otherwise, send out request to memory
• Actual stores commit in order, so no worry about WAR/WAW hazards through memory.
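The store-queue check described above can be written as a small decision function. This is a sketch under assumptions: `StoreEntry`, `resolve_load`, and the action strings are hypothetical names, addresses are plain integers, and when several earlier stores match the load address the youngest one is chosen, a detail the slide does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class StoreEntry:
    rob: int                       # ROB number of the store
    addr: Optional[int] = None     # None until the address is computed
    value: Optional[float] = None  # None until the data is available


def resolve_load(load_addr: int,
                 older_stores: List[StoreEntry]) -> Tuple[str, Union[int, float, None]]:
    """Decide what a load may do, given the stores ahead of it (oldest first)."""
    # Any earlier store with an unknown address might alias: stall the load.
    if any(s.addr is None for s in older_stores):
        return ("stall", None)
    # Assumption: the youngest matching store holds the value the load must see.
    for s in reversed(older_stores):
        if s.addr == load_addr:
            if s.value is not None:
                return ("forward", s.value)   # store value available
            return ("wait_for_rob", s.rob)    # wait on the producing ROB entry
    return ("memory", load_addr)              # no conflict: go to memory


stores = [StoreEntry(rob=1, addr=100, value=1.5), StoreEntry(rob=3, addr=None)]
print(resolve_load(200, stores))
```

With the second store's address still unknown, the load at address 200 has to stall, matching the first rule in the list.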

Memory Disambiguation:

[Figure: Tomasulo datapath with Reorder Buffer (FP Op Queue, Registers, load path from memory, store path to memory). Reorder buffer contents recoverable from the figure, oldest first: ST 0(R3),F4 (value <val 1>, done Y); LD F0,32(R2) (done N); ST 10(R3),F5 (value R[F5], done N); LD F4,10(R3) (done N). Load buffer entries: "2 32+R2" and "4 ROB3", i.e. the load from 10(R3) is tagged with ROB3, the earlier store to the same address.]

Does it make sense to rename memory addresses?

• Renaming is about transforming the names of storage locations:
– I.e. register names: F0 => P10, F0 => P2, ….
– Data addresses…?
• Consider:

loop: <inst1>
      st r2, 0(r3)
      <inst2>
      ld r6, 0(r3)
      <really slow instruction>
      beq loop

• In order to overlap iterations, need to:
– Recognize that there are names being reused
– Bypass the store/load sequence internal to pipeline
• Often done by memory disambiguation unit

Explicit Register Renaming

• Make use of a physical register file that is larger than number of registers specified by ISA
• Keep a translation table:
– ISA register => physical register mapping
– When register is written, replace table entry with new register from freelist.
– Physical register becomes free when not being used by any instructions in progress.

[Figure: Fetch, Decode/Rename, Execute pipeline with a Rename Table.]

Explicit register renaming: R10000 Freelist Management

[Figure: reorder buffer (Done?, Oldest, Newest) with the current map table and freelist.]
Current Map Table: P0 P2 P4 P6 P8 P10 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P32 P34 P36 P38 P60 P62

• Physical register file larger than ISA register file
• On issue, each instruction that modifies a register is allocated new physical register from freelist
• Used on: R10000, Alpha 21264, HP PA8000


[Figure sequence: reorder buffer, current map table, and freelist as instructions are renamed. Each reorder buffer entry records the architectural destination, the physical register it previously mapped to, the renamed instruction, and a Done? flag. Recoverable contents, oldest first:]

Arch dest   Old phys   Renamed instruction
F0          P0         LD   P32,10(R2)
F10         P10        ADDD P34,P4,P32
F2          P2         DIVD P36,P34,P6
--          --         BNE  P36,<…>
F4          P4         LD   P38,0(R3)
F0          P32        ADDD P40,P38,P6
--          --         ST   0(R3),P40

Freelist as shown in each frame:
– After LD P32,10(R2): P34 P36 P38 P40 P60 P62
– After ADDD P34,P4,P32: P36 P38 P40 P42 P60 P62
– After DIVD P36,P34,P6 and BNE P36,<…>: P38 P40 P44 P48 P60 P62 (also saved as "Checkpoint at BNE instruction")
– After LD P38,0(R3), ADDD P40,P38,P6, and ST 0(R3),P40: P42 P44 P48 P50 P0 P10
– After restoring the checkpoint: P38 P40 P44 P48 P0 P10

• Note that physical register P0 is “dead” (or not “live”) past the point of this load.
– When we go to commit the load, we free up

Error fixed by restoring map table and merging freelist
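Restoring the map table and merging the freelist can be sketched in a few lines. The model below is an assumption-laden illustration, not the R10000 implementation: `RenameMap` and its methods are hypothetical names, the checkpoint is a plain copy of both structures, and "merging" is taken to mean the checkpointed freelist plus any registers freed by commits since the checkpoint.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Snapshot = Tuple[Dict[str, str], List[str]]


class RenameMap:
    def __init__(self, arch_regs: List[str], phys_regs: List[str]):
        # The first len(arch_regs) physical registers hold the initial mapping.
        self.table: Dict[str, str] = dict(zip(arch_regs, phys_regs))
        self.free: Deque[str] = deque(phys_regs[len(arch_regs):])

    def rename_dest(self, arch: str) -> Tuple[str, str]:
        """Allocate a new physical register; return (new, old) mappings."""
        new = self.free.popleft()
        old = self.table[arch]
        self.table[arch] = new
        return new, old

    def release(self, phys: str) -> None:
        """Called at commit: the overwritten mapping is no longer needed."""
        self.free.append(phys)

    def checkpoint(self) -> Snapshot:
        return dict(self.table), list(self.free)

    def restore(self, snap: Snapshot) -> None:
        table, free_then = snap
        self.table = dict(table)
        # Merge: registers free at the checkpoint, plus those freed since then.
        freed_since = [p for p in self.free if p not in free_then]
        self.free = deque(free_then + freed_since)


m = RenameMap(["F0", "F2", "F4"], ["P0", "P2", "P4", "P6", "P8", "P10"])
_, old_f0 = m.rename_dest("F0")   # load before the branch: F0 -> P6
snap = m.checkpoint()             # checkpoint at the branch
m.rename_dest("F4")               # wrong-path instruction takes P8
m.release(old_f0)                 # the load commits, freeing P0
m.restore(snap)                   # branch was mispredicted
print(m.table, list(m.free))
```

After the restore, the wrong-path register P8 is free again and P0, released by the committed load, stays on the freelist.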

Advantages of Explicit Renaming

• Decouples renaming from scheduling:
– Pipeline can be exactly like “standard” DLX pipeline (perhaps with multiple operations issued per cycle)
– Or, pipeline could be Tomasulo-like or a scoreboard, etc.
– Standard forwarding or bypassing could be used
• Allows data to be fetched from single register file
– No need to bypass values from reorder buffer
– This can be important for balancing pipeline
• Many processors use a variant of this technique:
– R10000, Alpha 21264, HP PA8000
• Another way to get precise interrupt points:
– All that needs to be “undone” for precise break point is to undo the table mappings
– Provides an interesting mix between reorder buffer and future file
» Results are written immediately back to register file
» Register names are “freed” in program order (by ROB)

Summary #1

• HW exploiting ILP
– Works when dependences can’t be known at compile time
– Code for one machine runs well on another
• Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
– Enables out-of-order execution => out-of-order completion
– ID stage checked both for structural & data dependencies
– Original version didn’t handle forwarding
– No automatic register renaming

Summary #2

• Reservation stations: renaming to larger set of registers + buffering source operands
– Prevents registers as bottleneck
– Avoids WAR, WAW hazards of Scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Helps cache misses as well
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Chapter 4: ILP

Instruction Level Parallelism

• High speed execution based on instruction level parallelism (ILP): potential of short instruction sequences to execute in parallel
• High-speed microprocessors exploit ILP by:
1) Pipelined execution: overlap instructions
2) Out-of-order execution (commit in-order)
3) Multiple issue: issue and execute multiple instructions per clock cycle
4) Vector instructions: many independent ops specified with a single instruction
• Memory accesses for high-speed microprocessor?
– Data Cache possibly multiported, multiple levels

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)

– IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates

– Joint HP/Intel agreement in 1999/2000?
– Intel Architecture-64 (IA-64) 64-bit address
» Style: “Explicitly Parallel Instruction Computer (EPIC)”
– New SUN MAJIC Architecture: VLIW for Java

• Vector Processing: Explicit coding of independent loops as operations on large vectors of numbers

– Multimedia instructions being added to many processors

• Anticipated success led to use of Instructions Per Clock cycle (IPC) vs. CPI

Review: Multiple Issue Challenges
• If more instructions issue at same time, greater difficulty of decode and issue:
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
– Register file: need 2x reads and 1x writes/cycle
– Rename logic: must be able to rename same register multiple times in one cycle! For instance, consider 4-way issue:

add r1, r2, r3      add p11, p4, p7
sub r4, r1, r2      sub p22, p11, p4
lw  r1, 4(r4)       lw  p23, 4(p22)
add r5, r1, r2      add p12, p23, p4

Imagine doing this transformation in a single cycle!
– Result buses: Need to complete multiple instructions/cycle
» So, need multiple buses with associated matching logic at every reservation station.
» Or, need multiple forwarding paths
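
The renaming of the 4-way issue group can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `rename` function, the map-table layout, and the free-list order are hypothetical, and the free list is seeded so that the output reproduces the physical register numbers shown on the slide. The point it makes is that each instruction's sources must see the mappings created by earlier instructions in the same group, which is what makes doing all four in one cycle hard.

```python
def rename(group, map_table, free_list):
    """Rename one issue group in program order.

    group:     list of (opcode, dest, [sources]) using architectural names
    map_table: dict architectural register -> physical register (updated in place)
    free_list: list of unused physical registers (consumed from the front)
    """
    renamed = []
    for opcode, dest, sources in group:
        # Sources read the newest mapping, including ones made earlier in this group.
        phys_sources = [map_table[s] for s in sources]
        # Every destination gets a fresh physical register, which removes WAR/WAW hazards.
        phys_dest = free_list.pop(0)
        map_table[dest] = phys_dest
        renamed.append((opcode, phys_dest, phys_sources))
    return renamed


# Assumed starting state: r2 -> p4 and r3 -> p7 come from the slide; the rest is made up.
map_table = {"r1": "p1", "r2": "p4", "r3": "p7", "r4": "p2", "r5": "p3"}
free_list = ["p11", "p22", "p23", "p12"]  # order chosen to match the slide

group = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r2"]),
    ("lw",  "r1", ["r4"]),        # lw r1, 4(r4): only the base register is renamed
    ("add", "r5", ["r1", "r2"]),
]

renamed = rename(group, map_table, free_list)
for opcode, dest, sources in renamed:
    print(opcode, dest, sources)
```

A sequential loop like this is easy in software; in hardware the same result has to come from parallel comparators that check every source against every earlier destination in the group.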

First Pentium-4: Willamette

[Figures: die photo and heat sink]

Pentium-4 Pipeline

• Microprocessor Report: August 2000
– 20 Pipeline Stages!
– Drive Wire Delay!
– Trace-Cache: caching paths through the code for quick decoding.
– Renaming: similar to Tomasulo architecture
– Branch and DATA prediction!

Pentium (Original 586)

Pentium-II (and III) (Original 686)

Where is the P4 “Decode”?
• On Hit (no decode!):
– Trace Cache holds ops
– Renamed/Scheduled
• Hit on complex ops:
– Trace Cache only holds pointer to decode ROM
• On Miss (decode):
– Must decode x86 instructions into ops
– Potentially slow!

VLIW: Very Long Instruction Word

• Each “instruction” has explicit coding for multiple operations
– In EPIC, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

Intel/HP “Explicitly Parallel Instruction Computer (EPIC)”

• 3 Instructions in 128 bit “groups”; field determines if instructions dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence > 3 instr
• 128 integer registers + 128 floating point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
• IA-64: instruction set architecture; EPIC is type
– VLIW = EPIC?
• Itanium™ is name of first implementation (2000/2001?)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process

Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)

[Die photo: core processor die with labeled blocks (FPU, IA-32 Control, Instr. Fetch & Decode, Cache, TLB, Integer Units, IA-64 Control, Bus) next to 4 x 1MB L3 cache]

Itanium™ Machine Characteristics (Copyright: Intel at Hotchips ’00)

Package:                 Organic Land Grid Array
Process:                 0.18u CMOS, 6 metal layer
Transistor Count:        25.4M CPU; 295M L3
Frequency:               800 MHz
System Bus:              2.1 GB/sec; 4-way Glueless MP; scalable to large (512+ proc) systems
L3 Cache:                4MB, 4-way s.a., BW of 12.8 GB/sec
L2/L1 Cache:             Dual ported 96K Unified & 16KD; 16KI
L2/L1 Latency:           6 / 2 clocks
Virtual Memory Support:  64 entry ITLB, 32/96 2-level DTLB, VHPT
Machine Width:           6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
FP Compute Bandwidth:    3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP Bandwidth:  4 DP (8 SP) operands/clock
Registers:               14 ported 128 GR & 128 FR; 64 Predicates
Speculation:             32 entry ALAT, Exception Deferral
Branch Prediction:       Multilevel 4-stage Prediction Hierarchy

Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)

Architecture features programmed by compiler:
• Branch Hints
• Memory Hints
• Explicit Parallelism
• Register Stack & Rotation
• Predication
• Data & Control Speculation

Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue Control: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units; 2 FMACs (4 for SSE); 2 LD/ST units; 32 entry ALAT
• Memory Subsystem: Three levels of cache: L1, L2, L3
• Speculation Deferral Management

10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)

Front End
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine

Operand Delivery
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies

Execution
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• Nat/Exception/Retirement

Pipeline stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)

Limits to Multi-Issue Machines

• Inherent limitations of ILP
– 1 branch in 5: How to keep a 5-way VLIW busy?
– Latencies of units: many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of independent operations to keep all pipelines busy.
• Difficulties in building HW
– Easy: More instruction bandwidth
– Easy: Duplicate FUs to get parallel execution
– Hard: Increase ports to Register File (bandwidth)
» VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
– Harder: Increase ports to memory (bandwidth)
– Decoding Superscalar and impact on clock rate, pipeline depth?

Cost-performance of simple vs. OOO

MIPS MPUs                R5000       R10000         10k/5k
Clock Rate               200 MHz     195 MHz        1.0x
On-Chip Caches           32K/32K     32K/32K        1.0x
Instructions/Cycle       1 (+ FP)    4              4.0x
Pipe stages              5           5-7            1.2x
Model                    In-order    Out-of-order   ---
Die Size (mm2)           84          298            3.5x
– without cache, TLB     32          205            6.3x
Development (man yr.)    60          300            5.0x
SPECint_base95           5.7         8.8            1.6x

Advanced Pipelining and Instruction Level Parallelism (ILP)

• ILP: Overlap execution of unrelated instructions
• gcc 17% control transfer
– 5 instructions + 1 branch
– Beyond single block to get more instruction level parallelism
• Loop level parallelism one opportunity
– First SW, then HW approaches
• DLX Floating Point as example
– Measurements suggest R4000 performance FP execution has room for improvement

Can we make CPI closer to 1?
• Let’s assume full pipelining:
– If we have a 4-cycle latency, then we need 3 instructions between a producing instruction and its use:

multf $F0,$F2,$F4
delay-1
delay-2
delay-3
addf $F6,$F10,$F0

[Figure: pipeline diagram (Fetch, Decode, Ex1, Ex2, Ex3, Ex4, WB) for multf, delay1, delay2, delay3, addf, marking the earliest forwarding point for 4-cycle instructions and for 1-cycle instructions]

FP Loop: Where are the Hazards?

Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,8    ;decrement pointer 8B (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?

FP Loop Showing Stalls

• 9 clocks: Rewrite code to minimize stalls?

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

1 Loop: LD F0,0(R1) ;F0=vector element

2 stall

3 ADDD F4,F0,F2 ;add scalar in F2

4 stall

5 stall

6 SD 0(R1),F4 ;store result

7 SUBI R1,R1,8 ;decrement pointer 8B (DW)

8 BNEZ R1,Loop ;branch R1!=zero

9 stall ;delayed branch slot

Revised FP Loop Minimizing Stalls

6 clocks: Unroll loop 4 times to make code faster?

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

1 Loop: LD F0,0(R1)

2 stall

3 ADDD F4,F0,F2

4 SUBI R1,R1,8

5 BNEZ R1,Loop ;delayed branch

6 SD 8(R1),F4 ;altered when move past SUBI

Swap BNEZ and SD by changing address of SD

Unroll Loop Four Times (straightforward way)

Rewrite loop to minimize stalls?

1  Loop: LD   F0,0(R1)      ;1 cycle stall follows each LD
2        ADDD F4,F0,F2      ;2 cycles stall follow each ADDD
3        SD   0(R1),F4      ;drop SUBI & BNEZ
4        LD   F6,-8(R1)
5        ADDD F8,F6,F2
6        SD   -8(R1),F8     ;drop SUBI & BNEZ
7        LD   F10,-16(R1)
8        ADDD F12,F10,F2
9        SD   -16(R1),F12   ;drop SUBI & BNEZ
10       LD   F14,-24(R1)
11       ADDD F16,F14,F2
12       SD   -24(R1),F16
13       SUBI R1,R1,#32     ;alter to 4*8
14       BNEZ R1,LOOP
15       NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes R1 is multiple of 4

Unrolled Loop That Minimizes Stalls

• What assumptions made when moved code?

– OK to move store past SUBI even though changes register

– OK to move loads before stores: get right data?

– When is it safe for compiler to do such changes?

1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16     ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
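
The cycle counts quoted for the four versions of the FP loop (9, 6, 27, and 14 clocks) can be checked with a small in-order, single-issue model. This is a sketch under simplifying assumptions, not a pipeline simulator: `count_clocks`, the tuple encoding of instructions, and the helper names are hypothetical; only RAW dependences through registers are modeled, using the latency table from the slides; and an unfilled branch delay slot costs one extra clock.

```python
# Stall cycles between a producer and a dependent consumer (from the latency table).
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
}


def count_clocks(program, delay_slot_filled):
    """Clocks for one pass over `program`, a list of (kind, dest, sources)."""
    last_writer = {}  # register -> (issue clock, kind) of its most recent producer
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1  # in-order, one instruction per clock at best
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                issue = max(issue, produced_at + 1 + stall)
        cycle = issue
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    # An empty delay slot (NOP or stall after the branch) costs one more clock.
    return cycle if delay_slot_filled else cycle + 1


def ld(f):
    return ("load", f, ["R1"])


def addd(d, s):
    return ("fpalu", d, [s, "F2"])


def sd(f):
    return ("store", None, [f, "R1"])


SUBI = ("int", "R1", ["R1"])
BNEZ = ("branch", None, ["R1"])
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]  # (loaded, summed)

original = [ld("F0"), addd("F4", "F0"), sd("F4"), SUBI, BNEZ]
scheduled = [ld("F0"), addd("F4", "F0"), SUBI, BNEZ, sd("F4")]
unrolled = [op for s, d in PAIRS for op in (ld(s), addd(d, s), sd(d))] + [SUBI, BNEZ]
unrolled_scheduled = (
    [ld(s) for s, _ in PAIRS]
    + [addd(d, s) for s, d in PAIRS]
    + [sd(d) for _, d in PAIRS[:3]]
    + [SUBI, BNEZ, sd("F16")]
)

print(count_clocks(original, delay_slot_filled=False))           # 9
print(count_clocks(scheduled, delay_slot_filled=True))           # 6
print(count_clocks(unrolled, delay_slot_filled=False))           # 27
print(count_clocks(unrolled_scheduled, delay_slot_filled=True))  # 14
```

The model ignores store offsets, so it does not check that moving SD past SUBI requires changing the displacement; that correctness condition is what the "What assumptions made when moved code?" questions are about.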

Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair

Type               Pipe Stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

• 1 cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot

Loop Unrolling in Superscalar

Integer instruction      FP instruction       Clock cycle
Loop: LD F0,0(R1)                             1
      LD F6,-8(R1)                            2
      LD F10,-16(R1)     ADDD F4,F0,F2        3
      LD F14,-24(R1)     ADDD F8,F6,F2        4
      LD F18,-32(R1)     ADDD F12,F10,F2      5
      SD 0(R1),F4        ADDD F16,F14,F2      6
      SD -8(R1),F8       ADDD F20,F18,F2      7
      SD -16(R1),F12                          8
      SD -24(R1),F16                          9
      SUBI R1,R1,#40                          10
      BNEZ R1,LOOP                            11
      SD 8(R1),F20                            12    ; 8-40 = -32

• Unrolled 5 times to avoid delays (+1 due to SS)

• 12 clocks, or 2.4 clocks per iteration (1.5X)

Loop Unrolling in VLIW

Memory            Memory            FP                 FP                 Int. op/          Clock
reference 1       reference 2       operation 1        op. 2              branch
LD F0,0(R1)       LD F6,-8(R1)                                                              1
LD F10,-16(R1)    LD F14,-24(R1)                                                            2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                    ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                         6
SD -16(R1),F12    SD -24(R1),F16                                                            7
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#56    8
SD 8(R1),F28                                                              BNEZ R1,LOOP      9

• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: Need more registers in VLIW (15 vs. 6 in SS)
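
The utilization figures for the unrolled VLIW loop follow from counting operations and slots. The short calculation below is a sketch; `vliw_utilization` is a hypothetical helper, and the operation count (7 loads, 7 adds, 7 stores, 1 SUBI, 1 BNEZ) is read off the schedule above. The exact values come out slightly above the rounded 2.5 ops per clock and 50% quoted on the slide.

```python
def vliw_utilization(num_ops, num_clocks, slots_per_instruction):
    """Return (operations per clock, fraction of issue slots that hold an operation)."""
    ops_per_clock = num_ops / num_clocks
    efficiency = num_ops / (num_clocks * slots_per_instruction)
    return ops_per_clock, efficiency


# 7 LD + 7 ADDD + 7 SD + SUBI + BNEZ, scheduled into 9 instructions of 5 slots each.
ops = 7 + 7 + 7 + 1 + 1
ops_per_clock, efficiency = vliw_utilization(ops, num_clocks=9, slots_per_instruction=5)
clocks_per_result = 9 / 7

print(f"{ops} ops, {ops_per_clock:.2f} ops/clock, {efficiency:.0%} of slots used")
print(f"{clocks_per_result:.1f} clocks per result")
```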

Another possibility: Software Pipelining

• Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)

[Figure: iterations 0 through 4 of the original loop overlapped in time; one software-pipelined iteration takes one instruction from each]

Software Pipelining Example

Before: Unrolled 3 times
1  LD   F0,0(R1)
2  ADDD F4,F0,F2
3  SD   0(R1),F4
4  LD   F6,-8(R1)
5  ADDD F8,F6,F2
6  SD   -8(R1),F8
7  LD   F10,-16(R1)
8  ADDD F12,F10,F2
9  SD   -16(R1),F12
10 SUBI R1,R1,#24
11 BNEZ R1,LOOP

After: Software Pipelined
1  SD   0(R1),F4     ; Stores M[i]
2  ADDD F4,F0,F2     ; Adds to M[i-1]
3  LD   F0,-16(R1)   ; Loads M[i-2]
4  SUBI R1,R1,#8
5  BNEZ R1,LOOP

• Symbolic Loop Unrolling
– Maximize result-use distance
– Less code space than unrolling
– Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

[Figure: overlapped ops vs. time for the software-pipelined loop (SW Pipeline) and for the unrolled loop (Loop Unrolled)]

5 cycles per iteration
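
What the software-pipelined loop body leaves out is the start-up and wind-down code around it. The sketch below mimics the transformation in Python on the same "add a scalar to every element" loop; it is an illustration under stated assumptions, with hypothetical function names, and the variables `f0` and `f4` standing in for registers F0 and F4. The steady-state loop stores the result for element i, adds for element i-1, and loads element i-2, exactly as in the "After" listing.

```python
def add_scalar_plain(x, s):
    """Original loop: load, add, store, walking from the last element down."""
    out = list(x)
    for i in range(len(out) - 1, -1, -1):
        f0 = out[i]      # LD
        f4 = f0 + s      # ADDD
        out[i] = f4      # SD
    return out


def add_scalar_software_pipelined(x, s):
    """Same computation with each loop body mixing three original iterations."""
    out = list(x)
    n = len(out)
    if n < 2:
        return add_scalar_plain(x, s)  # too short to pipeline

    # Prologue (fill): start the first two iterations.
    f0 = out[n - 1]          # LD for element n-1
    f4 = f0 + s              # ADDD for element n-1
    f0 = out[n - 2]          # LD for element n-2

    # Steady state: one store, one add, one load per trip, from different iterations.
    for i in range(n - 1, 1, -1):
        out[i] = f4          # SD   stores M[i]
        f4 = f0 + s          # ADDD adds to M[i-1]
        f0 = out[i - 2]      # LD   loads M[i-2]

    # Epilogue (drain): finish the last two iterations.
    out[1] = f4
    f4 = f0 + s
    out[0] = f4
    return out


data = [1, 2, 3, 4, 5]
print(add_scalar_software_pipelined(data, 10))
```

The prologue and epilogue run once per loop, which is the "fill & drain pipe only once" point; an unrolled loop pays its start-up stalls on every trip through the unrolled body.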

Software Pipelining with Loop Unrolling in VLIW

Memory            Memory          FP                 FP       Int. op/          Clock
reference 1       reference 2     operation 1        op. 2    branch
LD F0,-48(R1)     ST 0(R1),F4     ADDD F4,F0,F2                                 1
LD F6,-56(R1)     ST -8(R1),F8    ADDD F8,F6,F2               SUBI R1,R1,#24    2
LD F10,-40(R1)    ST 8(R1),F12    ADDD F12,F10,F2             BNEZ R1,LOOP      3

• Software pipelined across 9 iterations of original loop
– In each iteration of above loop, we:
» Store to m,m-8,m-16 (iterations I-3,I-2,I-1)
» Compute for m-24,m-32,m-40 (iterations I,I+1,I+2)
» Load from m-48,m-56,m-64 (iterations I+3,I+4,I+5)
• 9 results in 9 cycles, or 1 clock per iteration
• Average: 3.3 ops per clock, 66% efficiency
• Note: Need fewer registers for software pipelining (only using 7 registers here, was using 15)

Compiler Perspectives on Code Movement

• Compiler concerned about dependencies in program
• Whether or not a HW hazard depends on pipeline
• Try to schedule to avoid hazards that cause performance losses
• (True) Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (“memory disambiguation” problem):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Where are the data dependencies?

1 Loop: LD F0,0(R1)

2 ADDD F4,F0,F2

3 SUBI R1,R1,8

4 BNEZ R1,Loop ;delayed branch

5 SD 8(R1),F4 ;altered when move past SUBI


• Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don’t exchange data
• Antidependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.

Where are the name dependencies?


How can we remove them?

Where are the name dependencies?


Called “register renaming”


• Name Dependencies are Hard to discover for Memory Accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required compiler to know that if R1 doesn’t change then:

0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

There were no dependencies between some loads and stores so they could be moved past each other


• Final kind of dependence called control dependence

• Example

if p1 {S1;};

if p2 {S2;};

S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.


• Two (obvious?) constraints on control dependences:

– An instruction that is control dependent on a branch cannot be moved before the branch.

– An instruction that is not control dependent on a branch cannot be moved to after the branch (or its execution will be controlled by the branch).

• Control dependencies relaxed to get parallelism; get same effect if preserve order of exceptions (address in register checked by branch before use) and data flow (value in register depends on branch)

Where are the control dependencies?

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SD   0(R1),F4
4        SUBI R1,R1,8
5        BEQZ R1,exit
6        LD   F0,0(R1)
7        ADDD F4,F0,F2
8        SD   0(R1),F4
9        SUBI R1,R1,8
10       BEQZ R1,exit
11       LD   F0,0(R1)
12       ADDD F4,F0,F2
13       SD   0(R1),F4
14       SUBI R1,R1,8
15       BEQZ R1,exit
....

When Safe to Unroll Loop?
• Example: Where are data dependencies? (A,B,C distinct & nonoverlapping)

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence”: between iterations

• For our prior example, each iteration was distinct
• Implies that iterations can’t be executed in parallel, Right????

Does a loop-carried dependence mean there is no parallelism???
• Consider:

for (i=0; i< 8; i=i+1) {
    A = A + C[i];    /* S1 */
}

Could compute:

“Cycle 1”: temp0 = C[0] + C[1];
           temp1 = C[2] + C[3];
           temp2 = C[4] + C[5];
           temp3 = C[6] + C[7];
“Cycle 2”: temp4 = temp0 + temp1;
           temp5 = temp2 + temp3;
“Cycle 3”: A = temp4 + temp5;

• Relies on associative nature of “+”.
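
The three-"cycle" schedule is a tree reduction, and it generalizes to any length. The sketch below is illustrative only: `tree_sum` is a hypothetical helper that returns both the sum and the number of levels, where every addition within one level is independent and could run in the same cycle on wide enough hardware. With integers the result matches the sequential loop exactly; with floating point, addition is not exactly associative, so a compiler may only reassociate like this when it is allowed to change rounding.

```python
def tree_sum(values):
    """Pairwise (tree) reduction. Returns (total, number of parallel steps)."""
    level = list(values)
    if not level:
        return 0, 0
    steps = 0
    while len(level) > 1:
        # All additions in this comprehension are independent of one another.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element out passes through to the next level
        level = nxt
        steps += 1
    return level[0], steps


C = [1, 2, 3, 4, 5, 6, 7, 8]
total, steps = tree_sum(C)
print(total, steps)
```

For the 8-element example this takes 3 steps instead of the 8 dependent additions of the original loop.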

Trace Scheduling in VLIW
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
» Find likely sequence of basic blocks (trace) of (statically predicted or profile predicted) long sequence of straight-line code
– Trace Compaction
» Squeeze trace into few VLIW instructions
» Need bookkeeping code in case prediction is wrong
• This is a form of compiler-generated speculation
– Compiler must generate “fixup” code to handle cases in which trace is not the taken branch
– Needs extra registers: undoes bad guess by discarding
• Subtle compiler bugs mean wrong answer vs. poorer performance; no hardware interlocks

Does a loop-carried dependence mean there is no parallelism???

• Consider:

for (i=0; i< 100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• Dependency test (GCD test): for a store to index a*i+b and a load from index c*j+d, if GCD(c,a) does not divide (d-b), there is no loop-carried dependence.
– Example: X[2*I+3] = X[2*I]*5.0
– a = 2; b = 3; c = 2; d = 0; GCD(c,a) = 2 and d-b = -3, so there is no dependence
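
The GCD test is a one-line check. The sketch below is a minimal version under stated assumptions: the function name is hypothetical, the store index is a*i+b and the load index is c*j+d as on the slide, and a and c are not both zero. The test is conservative: a `False` result proves there is no dependence, while `True` only means a dependence cannot be ruled out (loop bounds are ignored).

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """True if a store to X[a*i+b] and a load from X[c*j+d] might touch the same element."""
    return (d - b) % gcd(c, a) == 0


# Slide example: X[2*I+3] = X[2*I]*5.0 writes odd elements and reads even ones.
print(gcd_test_may_depend(a=2, b=3, c=2, d=0))   # False: no dependence

# Hypothetical contrast: X[2*I+2] = X[2*I]*5.0 writes and reads even elements.
print(gcd_test_may_depend(a=2, b=2, c=2, d=0))   # True: dependence possible
```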

Alternative Model: Vector Processing

SCALAR (1 operation): add r3, r1, r2 (r3 = r1 + r2)
VECTOR (N operations): add.vv v3, v1, v2 (v3 = v1 + v2, element by element over the vector length)

• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Properties of Vector Processors

• Each result independent of previous result
=> long pipeline, compiler ensures no dependencies
=> high clock rate
• Vector instructions access memory with known pattern
=> highly interleaved memory
=> amortize memory latency over ≈ 64 elements
=> no (data) caches required! (Do use instruction cache)
• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work (≈ loop)
=> fewer instruction fetches

Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.)

Spec92fp    Operations (Millions)      Instructions (M)
Program     RISC   Vector   R / V      RISC   Vector   R / V
swim256     115    95       1.1x       115    0.8      142x
hydro2d     58     40       1.4x       58     0.8      71x
nasa7       69     41       1.7x       69     2.2      31x
su2cor      51     35       1.4x       51     1.8      29x
tomcatv     15     10       1.4x       15     1.3      11x
wave5       27     25       1.1x       27     7.2      4x
mdljdp2     32     52       0.6x       32     15.8     2x

Vector reduces ops by 1.2X, instructions by 20X

Styles of Vector Architectures

• memory-memory vector processors: all vector operations are memory to memory
• vector-register processors: all vector operations between vector registers (except load and store)
– Vector equivalent of load-store architectures
– Includes all vector machines since late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
– We assume vector-register for rest of lectures

“DLXV” Vector Instructions

Instr. Operands Operation Comment

• ADDV V1,V2,V3 V1=V2+V3 vector + vector

• ADDSV V1,F0,V2 V1=F0+V2 scalar + vector

• MULTV V1,V2,V3 V1=V2xV3 vector x vector

• MULSV V1,F0,V2 V1=F0xV2 scalar x vector

• LV V1,R1 V1=M[R1..R1+63] load, stride=1

• LVWS V1,R1,R2 V1=M[R1..R1+63*R2] load, stride=R2

• LVI V1,R1,V2 V1=M[R1+V2i,i=0..63] indir.("gather")

• CeqV VM,V1,V2 VMASKi = (V1i=V2i)? comp. setmask

• MOV VLR,R1 Vec. Len. Reg. = R1 set vector length

• MOV VM,R1 Vec. Mask = R1 set vector mask

Components of Vector Processor

• Vector Register: fixed length bank holding a single vector
– has at least 2 read and 1 write ports
– typically 8-32 vector registers, each holding 64-128 64-bit elements
• Vector Functional Units (FUs): fully pipelined, start new operation every clock
– typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit
• Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
• Scalar registers: single element for FP scalar or address
• Cross-bar to connect FUs, LSUs, registers

Vector Optimization: Chaining

• Suppose:

MULTV V1,V2,V3
ADDV  V4,V1,V5   ; separate convoy?

• chaining: vector register (V1) is treated not as a single entity but as a group of individual registers, then pipeline forwarding can work on individual elements of a vector
• Flexible chaining: allow vector to chain to any other active vector operation => more read/write ports
• As long as enough HW, increases convoy size

Unchained: MULTV (7 + 64), then ADDV (6 + 64): Total = 141
Chained: MULTV (7 + 64) overlapped with ADDV (6 + 64): Total = 77
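
The two totals in the chaining figure come from a simple timing model, sketched below. The function names are hypothetical, and the reading of the figure's numbers is an assumption: 7 and 6 are taken to be the start-up latencies of the multiply and add pipelines, and 64 the vector length, with one element completing per clock once a pipeline is full.

```python
def unchained_time(startup_mul, startup_add, vector_length):
    """ADDV waits until MULTV has written its whole result vector."""
    return (startup_mul + vector_length) + (startup_add + vector_length)


def chained_time(startup_mul, startup_add, vector_length):
    """ADDV consumes each element of V1 as soon as MULTV produces it."""
    return startup_mul + startup_add + vector_length


print(unchained_time(7, 6, 64))  # 141
print(chained_time(7, 6, 64))    # 77
```

With chaining, the second start-up latency is still paid, but the 64 element times of the two instructions overlap.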

Example Execution of Vector Code

[Figure: scalar unit feeding the vector memory pipeline, vector multiply pipeline, and vector adder pipeline; 8 lanes, vector length 32, chaining]

Vector Optimization: Sparse Matrices

• Suppose:

    do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))

• gather (LVI) operation takes an index vector and fetches data from each address in the index vector

– This produces a “dense” vector in the vector registers

• After these elements are operated on in dense form, the sparse vector can be stored in expanded form by a scatter store (SVI), using the same index vector

• Can't be figured out by compiler since can't know elements distinct, no dependencies

• Use CVI to create index 0, 1xm, 2xm, ..., 63xm
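
The gather, operate, scatter sequence can be mimicked with plain Python lists. This is a sketch with hypothetical function names and 0-based indices (the Fortran loop is 1-based); the comments name the DLXV instruction each step stands for. The second example shows why the compiler cannot vectorize this loop on its own: if the index vector K contains a repeated entry, the vector version no longer matches the sequential loop.

```python
def sparse_update_scalar(A, K, C, M):
    """Sequential loop: A(K(i)) = A(K(i)) + C(M(i))."""
    A = list(A)
    for i in range(len(K)):
        A[K[i]] = A[K[i]] + C[M[i]]
    return A


def sparse_update_vector(A, K, C, M):
    """Vector version: gather both operands, add densely, scatter the result."""
    A = list(A)
    va = [A[k] for k in K]                  # LVI: gather A through index vector K
    vc = [C[m] for m in M]                  # LVI: gather C through index vector M
    vs = [x + y for x, y in zip(va, vc)]    # ADDV on the dense vectors
    for k, v in zip(K, vs):                 # SVI: scatter back through K
        A[k] = v
    return A


A = [10, 20, 30, 40, 50]
C = [1, 2, 3, 4]

# Distinct indices in K: both versions agree.
print(sparse_update_scalar(A, [4, 0, 2], C, [3, 3, 0]))
print(sparse_update_vector(A, [4, 0, 2], C, [3, 3, 0]))

# Repeated index in K: the sequential loop accumulates twice, the vector version does not.
print(sparse_update_scalar(A, [1, 1], C, [0, 1]))
print(sparse_update_vector(A, [1, 1], C, [0, 1]))
```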

Vector Optimization: Conditional Execution

• Suppose:

    do 100 i = 1, 64
        if (A(i) .ne. 0) then
            A(i) = A(i) - B(i)
        endif
100 continue

• vector-mask control takes a Boolean vector: when vector-mask register is loaded from vector test, vector instructions operate only on vector elements whose corresponding entries in the vector-mask register are 1.

• Still requires clock even if result not stored; if still performs operation, what about divide by 0?
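
Vector-mask control for the conditional loop can be sketched as a compare that builds the mask followed by a subtract that only updates the enabled elements. The function name is hypothetical and the lists stand in for vector registers. Note that the loop below touches every element position, which mirrors the slide's point that masked-off elements still take time.

```python
def masked_subtract(A, B):
    """Compute A(i) = A(i) - B(i) only where A(i) != 0, using an explicit mask."""
    mask = [a != 0 for a in A]                    # vector test sets the vector-mask register
    result = [a - b if m else a                   # masked SUBV: disabled elements keep old value
              for a, b, m in zip(A, B, mask)]
    return result, mask


result, mask = masked_subtract([5, 0, 3, 0], [1, 2, 3, 4])
print(result)  # [4, 0, 0, 0]
print(mask)    # [True, False, True, False]
```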

Vector Optimization: Multi-Threaded Vectorization

• Suppose:

    do 100 i = 1, 64
        A(i) = A(i-1) - B(i)
        C(i) = C(i-2) + D(i)
100 continue

• Two ideas:
– Component operations of a vector instruction are issued every Kth cycle
– Multiple vector instructions can time-share a vector functional unit

Limits to ILP
Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:

1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)

[Figure: instruction issues per cycle (IPC) by program]

Program     IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doducd      118.7
tomcatv     150.1

Integer: 18 - 60
FP: 75 - 150

Window Size

Branch Prediction Accuracy

Number of Renaming Registers

More Realistic HW: Alias Impact

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

[Figure: instruction issues per cycle (IPC) by program (gcc, espresso, li, fpppp, doducd, tomcatv) for four alias-analysis models: Perfect; Global/Stack perfect, heap conflicts; Inspection (Inspec. Assem.); None]

Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)

Realistic HW for ‘9X: Window Impact (Figure 4.48, Page 332)

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

[Figure: instruction issues per cycle (IPC) by program (gcc, espresso, li, fpppp, doducd, tomcatv) for window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]

Integer: 6 - 12
FP: 8 - 45

Brainiac vs. Speed Demon (1993)

• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

[Figure: SPECMarks (0 to 900) by benchmark: espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]

Problems with scalar approach to ILP extraction
• Limits to conventional exploitation of ILP:
– pipelined clock rate: at some point, each increase in clock rate has corresponding CPI increase (branches, other hazards)
– branch prediction: branches get in the way of wide issue. They are too unpredictable.
– instruction fetch and decode: at some point, it’s hard to fetch and decode more instructions per clock cycle
– register renaming: Rename logic gets really complicated for many instructions
– cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

Predicated Execution

• Branch Prediction
• Conditional move instruction
– LWC R8, 0(R1), R10
• Full Predicated Instruction
– ADD R3, R1, R2, R4
– Speculative execution
– A conditional branch instruction is turned into a state commit operation

Limits to Multi-Issue Machines

• Limitations specific to either Superscalar or VLIW implementation
– Decode issue in Superscalar: how wide is practical?
– VLIW code size: unroll loops + wasted fields in VLIW
» IA-64 compresses dependent instructions, but still larger
– VLIW lock step => 1 hazard & all instructions stall
» IA-64 not lock step? Dynamic pipeline?
– VLIW & binary compatibility
» IA-64 promises binary compatibility

CPU-DRAM Gap

Question: Who Cares About the Memory Hierarchy?

• 1980: no cache in µproc; 1995 2-level cache on chip (1989 first Intel µproc with a cache on chip)

[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU (µProc, “Moore’s Law”) improves 60%/yr.; DRAM (“Less’ Law?”) improves 7%/yr.; the processor-memory performance gap grows 50% / year]

Generations of Microprocessors

• Time of a full cache miss in instructions executed:

1st Alpha: 340 ns/5.0 ns = 68 clks x 2 or 136

2nd Alpha: 266 ns/3.3 ns = 80 clks x 4 or 320

3rd Alpha: 180 ns/1.7 ns =108 clks x 6 or 648

Caching

• Principle: results of operations that are expensive should be kept around for reuse

• Examples:
– CPU caching
– Forwarding table caching
– File caching
– Web caching
– Query caching
– Computation caching

• Most processor performance improvements in the lastr

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
– Registers a cache on variables
– First-level cache a cache on second-level cache
– Second-level cache a cache on memory
– Memory a cache on disk (virtual memory)
– TLB a cache on page table
– Branch-prediction a cache on prediction information?

[Figure: memory hierarchy, from faster (top) to bigger (bottom): Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, Tape, etc.]

Example: 1 KB Direct Mapped Cache
• For a 2 ** N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2 ** M)

[Figure: a 32-bit block address split into Cache Tag (bits 31-10, example: 0x50), Cache Index (bits 9-5, ex: 0x01), and Byte Select (bits 4-0, ex: 0x00). The cache holds 32 lines (index 0 to 31), each with a Valid Bit, a Cache Tag stored as part of the cache “state”, and 32 bytes of Cache Data (Byte 0 to Byte 31 in line 0, Byte 32 to Byte 63 in line 1, ..., Byte 992 to Byte 1023 in line 31)]
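
The tag / index / byte-select split for this cache can be written out directly. The sketch below uses a hypothetical helper, `split_address`, and assumes 32-bit addresses and power-of-two cache and block sizes; the defaults are the 1 KB cache with 32-byte blocks from the example. The sample address is built from the example field values (tag 0x50, index 0x01, byte select 0x00).

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split an address into (tag, index, byte_select) for a direct mapped cache."""
    offset_bits = block_bytes.bit_length() - 1                   # M: log2(block size)
    index_bits = (cache_bytes // block_bytes).bit_length() - 1   # log2(number of lines)
    byte_select = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)                     # uppermost (32 - N) bits
    return tag, index, byte_select


addr = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(addr)
print(hex(addr), hex(tag), hex(index), hex(byte_select))
```

On a lookup, the index selects one of the 32 lines, and the access hits if that line's valid bit is set and its stored tag equals the address tag.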

Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result

[Figure: two-way set associative cache. The Cache Index selects one entry (Valid, Cache Tag, Cache Data) from each way; each tag is compared with the address tag (Adr Tag); the compare results are ORed to produce Hit and drive the mux select lines (Sel1, Sel0) that pick the Cache Block]

Disadvantage of Set Associative Cache

• N-way Set Associative Cache versus Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.

[Figure: the two-way set associative cache datapath again (two ways, tag compares, OR to Hit, mux selecting the Cache Block)]

Basic Units of Cache

• Cache Line (index)
• Cache Block (tag)
• Cache Sector or Subblock (valid bit)
• S: cache size, A: degree of associativity, B: block size, N: # of cache lines, I: # of index bits

S = B*A*N

N = 2I/B

Example: Device Interrupt (Say, arrival of network message)

User code:
    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
    Hiccup(!)               <- External Interrupt
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2

On the interrupt: PC saved, Disable All Ints, Supervisor Mode

“Interrupt Handler”:
    Raise priority
    Reenable All Ints
    Save registers
    lw   r1,20(r0)
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Restore registers
    Clear current Int
    Disable All Ints
    Restore priority
    RTE

On return: Restore PC, User Mode

Alternative: Polling (again, for arrival of network message)

    Disable Network Intr
    subi r4,r1,#4
    slli r4,r4,#2
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2
    lw   r1,12(r0)          ; Polling Point (check device register)
    beq  r1,no_mess
    lw   r1,20(r0)          ; “Handler”
    lw   r2,0(r1)
    addi r3,r0,#5
    sw   0(r1),r3
    Clear Network Intr
no_mess:

[Figure: the External Interrupt arrives while the user code runs and is serviced only when the polling point is reached]

Polling is faster/slower than Interrupts.

• Polling is faster than interrupts because
– Compiler knows which registers are in use at polling point. Hence, do not need to save and restore registers (or not as many).
– Other interrupt overhead avoided (pipeline flush, trap priorities, etc).

• Polling is slower than interrupts because
– Overhead of polling instructions is incurred regardless of whether or not handler is run. This could add to inner-loop delay.
– Device may have to wait for service for a long time.

• When to use one or the other?
– Multi-axis tradeoff
» Frequent/regular events good for polling, as long as device can be controlled at user level.
» Interrupts good for infrequent/irregular events
» Interrupts good for ensuring regular/predictable service of events.
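The tradeoff above can be made concrete with a rough cost model: polling pays for the check on every loop iteration, while interrupts pay entry/exit overhead only when an event occurs. The sketch below is a back-of-the-envelope illustration. All cycle counts and function names are assumptions chosen for the example, not measurements from any machine.

```python
def polling_cost(iterations, poll_cycles, events, handler_cycles):
    """Polling: the check runs on every iteration; the handler runs once per event."""
    return iterations * poll_cycles + events * handler_cycles


def interrupt_cost(events, overhead_cycles, handler_cycles):
    """Interrupts: save/restore and pipeline-flush overhead is paid once per event."""
    return events * (overhead_cycles + handler_cycles)


def break_even_events(iterations, poll_cycles, overhead_cycles):
    """Event count above which polling becomes cheaper than interrupts."""
    return iterations * poll_cycles / overhead_cycles


ITERATIONS = 1_000_000  # inner-loop iterations (assumed)
POLL = 2                # cycles for the lw/beq polling pair (assumed)
OVERHEAD = 200          # cycles of interrupt entry/exit overhead (assumed)
HANDLER = 20            # cycles of useful handler work (assumed)

for events in (1_000, 100_000):
    p = polling_cost(ITERATIONS, POLL, events, HANDLER)
    i = interrupt_cost(events, OVERHEAD, HANDLER)
    print(events, "events:", "polling cheaper" if p < i else "interrupts cheaper")
```

Under these assumed numbers the break-even point is 10,000 events per million iterations: rarer events favor interrupts, more frequent events favor polling. The model ignores service latency, which is the other axis of the tradeoff.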

Exception/Interrupt classifications

• Exceptions: relevant to the current process
– Faults, arithmetic traps, and synchronous traps
– Invoke software on behalf of the currently executing process

• Interrupts: caused by asynchronous, outside events
– I/O devices requiring service (disk, network)
– Clock interrupts (real time scheduling)

• Machine Checks: caused by serious hardware failure
– Not always restartable
– Indicate that bad things have happened.
» Non-recoverable ECC error
» Machine room fire
» Power outage

A related classification: Synchronous vs. Asynchronous

• Synchronous: means related to the instruction stream, i.e. during the execution of an instruction
– Must stop an instruction that is currently executing
– Page fault on load or store instruction
– Arithmetic exception
– Software Trap Instructions

• Asynchronous: means unrelated to the instruction stream, i.e. caused by an outside event.
– Does not have to disrupt instructions that are already executing
– Interrupts are asynchronous
– Machine checks are asynchronous

• SemiSynchronous (or high-availability interrupts):
– Caused by external event but may have to disrupt current instructions in order to guarantee service

Interrupt controller hardware and mask levels

• Interrupt disable mask may be multi-bit word accessed through some special memory address

• Operating system constructs a hierarchy of masks that reflects some form of interrupt priority.

• For instance:

    Priority   Examples
    0          Software interrupts
    2          Network interrupts
    4          Sound card
    5          Disk interrupt
    6          Real-time clock

– This reflects an order of urgency among interrupts
– For instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.
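A minimal sketch of how such a priority hierarchy decides which interrupts get through, using the levels from the example above. The dictionary keys, the function names, and the rule that an equal-priority source stays masked are assumptions for illustration. A real interrupt controller defines its own mask semantics.

```python
# Priority levels taken from the example table; the key names are hypothetical.
PRIORITY = {
    "software": 0,
    "network": 2,
    "sound": 4,
    "disk": 5,
    "clock": 6,
}


def can_preempt(pending, running_handler):
    """A pending interrupt is delivered only if it outranks the running handler."""
    return PRIORITY[pending] > PRIORITY[running_handler]


def masked_while(running_handler):
    """Sources held off while running_handler executes (equal priority assumed masked)."""
    level = PRIORITY[running_handler]
    return {src for src, pri in PRIORITY.items() if pri <= level}


print(can_preempt("disk", "network"))     # True: disk outranks network
print(can_preempt("network", "disk"))     # False: network waits for the disk handler
print(sorted(masked_while("network")))    # ['network', 'software']
```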

SPARC (and RISC I) had register windows

• On interrupt or procedure call, simply switch to a different set of registers

• Really saves on interrupt overhead
– Interrupts can happen at any point in the execution, so compiler cannot help with knowledge of live registers.
– Conservative handlers must save all registers
– Short handlers might be able to save only a few, but this analysis is complicated

• Not as big a deal with procedure calls
– Original statement by Patterson was that Berkeley didn’t have a compiler team, so they used a hardware solution
– Good compilers can allocate registers across procedure boundaries
– Good compilers know what registers are live at any one time

Supervisor State

• Typically, processors have some amount of state that user programs are not allowed to touch.
– Page mapping hardware/TLB
» TLB prevents one user from accessing memory of another
» TLB protection prevents user from modifying mappings
– Interrupt controllers: user code is prevented from crashing the machine by disabling interrupts, ignoring device interrupts, etc.
– Real-time clock interrupts ensure that users cannot lock up/crash the machine even if they run code that goes into a loop:
» “Preemptive Multitasking” vs “non-preemptive multitasking”

• Access to hardware devices restricted
– Prevents malicious user from stealing network packets
– Prevents user from writing over disk blocks

• Distinction made with at least two levels: USER/SYSTEM (one hardware mode bit)
– x86 architectures actually provide 4 different levels, only two usually used by OS (or only 1 in older Microsoft OSs)

Entry into Supervisor Mode

• Entry into supervisor mode typically happens on interrupts, exceptions, and special trap instructions.

• Entry goes through kernel instructions:
– interrupts, exceptions, and trap instructions change to supervisor mode, then jump (indirectly) through table of instructions in kernel

    intvec: j handle_int0
            j handle_int1
            …
            j handle_fp_except0
            …
            j handle_trap0
            j handle_trap1

– OS “System Calls” are just trap instructions:

    read(fd,buffer,count) => st 20(r0),r1
                             st 24(r0),r2
                             st 28(r0),r3
                             trap $READ

• OS overhead can be a serious concern for achieving fast interrupt behavior.
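The entry sequence above (switch to supervisor mode, jump indirectly through a table, return to the previous mode) can be sketched as a toy model. The class, the handler names, and the log format are hypothetical and stand in for hardware behavior. This is not how any particular kernel is written.

```python
USER, SUPERVISOR = "user", "supervisor"


class ToyCPU:
    """Toy model of mode switching and vectored dispatch (illustrative only)."""

    def __init__(self):
        self.mode = USER
        self.log = []

    def handle_int0(self):
        self.log.append("int0 handled in " + self.mode)

    def handle_trap_read(self):
        self.log.append("read handled in " + self.mode)

    def raise_event(self, cause):
        """Enter supervisor mode, dispatch through the vector table, then return."""
        saved_mode = self.mode
        self.mode = SUPERVISOR
        INTVEC[cause](self)      # indirect jump through the table
        self.mode = saved_mode   # return from exception restores the old mode


# Stand-in for the intvec table: one entry per interrupt, exception, or trap.
INTVEC = {
    "int0": ToyCPU.handle_int0,
    "trap_read": ToyCPU.handle_trap_read,
}

cpu = ToyCPU()
cpu.raise_event("trap_read")     # a system call is just a trap
print(cpu.log, cpu.mode)
```

User code never chooses the handler address directly. It can only name a table entry, which is what keeps entry into supervisor mode controlled.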

Precise Interrupts/Exceptions

• An interrupt or exception is considered precise if there is a single instruction (or interrupt point) for which all instructions before that instruction have committed their state and no following instructions including the interrupting instruction have modified any state.

– This means, effectively, that you can restart execution at the interrupt point and “get the right answer”

– Implicit in our previous example of a device interrupt:
» Interrupt point is at first lw instruction

    add  r1,r2,r3
    subi r4,r1,#4
    slli r4,r4,#2
                        <- External Interrupt: PC saved, Disable All Ints, Supervisor Mode
                           (Int handler runs, then Restore PC, User Mode)
    lw   r2,0(r4)
    lw   r3,4(r4)
    add  r2,r2,r3
    sw   8(r4),r2

Precise interrupt point requires multiple PCs to describe in presence of delayed branches

            addi r4,r3,#4
            sub  r1,r2,r3
    PC:     bne  r1,there
    PC+4:   and  r2,r3,r5
            <other insts>

• Interrupt point before the branch: described as <PC,PC+4>

• Interrupt point in the delay slot: described as
– <PC+4,there> (branch was taken), or
– <PC+4,PC+8> (branch was not taken)
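The PC pairs above can be generated mechanically. The sketch below assumes 4-byte instructions, a single delay slot, and that PC denotes the address of the delayed branch. It also assumes the <PC,PC+4> case is an interrupt point at the branch itself. The function name and the example addresses are hypothetical.

```python
def restart_pcs(branch_pc, in_delay_slot, taken=False, target=None):
    """Return the (next PC, PC after that) pair that describes the interrupt point."""
    if not in_delay_slot:
        # Interrupt point at the branch: the branch runs next, then its delay slot.
        return (branch_pc, branch_pc + 4)
    if taken:
        if target is None:
            raise ValueError("a taken branch needs a target address")
        # Delay slot runs next, then control transfers to the branch target.
        return (branch_pc + 4, target)
    # Delay slot runs next, then execution falls through.
    return (branch_pc + 4, branch_pc + 8)


PC, THERE = 0x1008, 0x2000       # hypothetical addresses
print(restart_pcs(PC, in_delay_slot=False))                           # (PC, PC+4)
print(restart_pcs(PC, in_delay_slot=True, taken=True, target=THERE))  # (PC+4, there)
print(restart_pcs(PC, in_delay_slot=True, taken=False))               # (PC+4, PC+8)
```

A single saved PC cannot distinguish the two delay-slot cases, which is why the hardware has to save a pair.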

Why are precise interrupts desirable?

• Restartability doesn’t require preciseness. However, preciseness makes it a lot easier to restart.

• Simplify the task of the operating system a lot
– Less state needs to be saved away if unloading process.
– Quick to restart (making for fast interrupts)

• Many types of interrupts/exceptions need to be restartable. Easier to figure out what actually happened:
– E.g. TLB faults. Need to fix translation, then restart load/store
– IEEE gradual underflow, illegal operation, etc:

e.g. Suppose you are computing: $f(x) = \frac{\sin(x)}{x}$

Then, for $x \to 0$: $f(0) = \frac{0}{0} \Rightarrow \mathrm{NaN} + \text{illegal\_operation}$

Want to take exception, replace NaN with 1, then restart.
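A software analogy for the sin(x)/x case: in Python, a float division by zero raises ZeroDivisionError instead of producing NaN, so the exception handler below plays the role of the trap handler that substitutes the limit value 1 and lets the computation continue. This illustrates the idea only. It does not model IEEE trap hardware, and the function name is an assumption.

```python
import math


def sinc(x):
    """Compute sin(x)/x, substituting 1 when the division faults at x = 0."""
    try:
        return math.sin(x) / x
    except ZeroDivisionError:
        # Stand-in for the illegal-operation trap: replace the result with
        # the limit value and resume as if the division had produced it.
        return 1.0


print(sinc(0.0))          # 1.0
print(sinc(math.pi / 2))  # about 0.6366
```

The substitution works only because the fault is reported at exactly the faulting operation, with nothing after it already executed. That is the property precise exceptions provide in hardware.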

Precise Exceptions in simple 5-stage pipeline:

• Exceptions may occur at different stages in pipeline (i.e. out of order):
– Arithmetic exceptions occur in execution stage
– TLB faults can occur in instruction fetch or memory stage

• What about interrupts? The doctor’s mandate of “do no harm” applies here: try to interrupt the pipeline as little as possible

• All of this is solved by tagging instructions in pipeline as “cause exception or not” and waiting until end of memory stage to flag exception
– Interrupts become marked NOPs (like bubbles) that are placed into pipeline instead of an instruction.
– Assume that interrupt condition persists in case NOP is flushed
– Clever instruction fetch might start fetching instructions from interrupt vector, but this is complicated by need for supervisor mode switch, saving of one or more PCs, etc

Another look at the exception problem

• Use pipeline to sort this out!
– Pass exception status along with instruction.
– Keep track of PCs for every instruction in pipeline.
– Don’t act on exception until it reaches WB stage

• Handle interrupts through “faulting noop” in IF stage

• When instruction reaches WB stage:
– Save PC → EPC, Interrupt vector addr → PC
– Turn all instructions in earlier stages into noops!

[Figure: pipeline diagram, Program Flow vs. Time, with stages IFetch, Dcd, Exec, Mem, WB; exception sources by stage: Inst TLB fault (IFetch), Bad Inst (Dcd), Overflow (Exec), Data TLB (Mem).]
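Why waiting until WB restores program order can be seen in a small timing model. The sketch assumes an ideal 5-stage pipeline with no stalls, where instruction i enters IF at cycle i and reaches stage s at cycle i + s. The three-instruction program and its fault stages are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def fault_order(program, act_at_wb):
    """Return instruction indices in the order their faults are acted on.

    program is a list of (text, fault_stage) pairs, with fault_stage None
    for instructions that do not fault.
    """
    events = []
    for i, (_text, fault_stage) in enumerate(program):
        if fault_stage is None:
            continue
        stage = "WB" if act_at_wb else fault_stage
        events.append((i + STAGES.index(stage), i))  # (cycle, instruction index)
    return [i for _cycle, i in sorted(events)]


program = [
    ("lw  r2,0(r4)", "MEM"),   # data TLB fault, detected at cycle 0 + 3 = 3
    ("add r2,r2,r3", None),
    ("sw  8(r4),r2", "IF"),    # inst TLB fault, detected at cycle 2 + 0 = 2
]

print(fault_order(program, act_at_wb=False))  # [2, 0]: later instruction faults first
print(fault_order(program, act_at_wb=True))   # [0, 2]: program order restored
```

Acting at detection time would service the store's fault before the earlier load's fault. Deferring to WB means the load is handled first, and the store, still in an earlier stage, is turned into a noop.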

Approximations to precise interrupts

• Hardware has imprecise state at time of interrupt
• Exception handler must figure out how to find a precise PC at which to restart program.
– Done by emulating instructions that may remain in pipeline
– Example: SPARC allows limited parallelism between FP and integer core:

    <float 1>
    <int 1>
    <int 2>
    <int 3>
    <float 2>
    <int 4>
    <int 5>

» Possible that integer instructions #1 - #4 have already executed at the time that the first floating-point instruction gets a recoverable exception
» Interrupt handler code must fix up <float 1>, then emulate both <float 1> and <float 2>
» At that point, precise interrupt point is integer instruction #5

• VAX had string move instructions that could be in the middle of execution at the time a page fault occurred.
• Could be arbitrary processor state that needs to be restored to restart execution.

How to achieve precise interrupts when instructions are executing in arbitrary order?

• Jim Smith’s classic paper (will read next time) discusses several methods for getting precise interrupts:
– In-order instruction completion
– Reorder buffer
– History buffer

• We will discuss these after we see the advantages of out-of-order execution.

Summary

• RISC was about the integrated systems view
– Intelligent tradeoffs in interfaces across compiler, applications, OS, and hardware
– End-to-end viewpoint

• Changes in control flow cause the most trouble with pipelining

• Some pre-decode techniques can transform dynamic decisions into static ones (VLIW-like)

• Interrupts and Exceptions either interrupt the current instruction or happen between instructions

• Machines with precise exceptions provide one single point in the program to restart execution
– All instructions before that point have completed
– No instructions after or including that point have completed

• Hardware techniques exist for precise interrupts even in the face of out-of-order execution
