
High Performance Computer Architecture

Die area and transistor count of selected chips (Mtr = millions of transistors):

• Intel Pentium 4 (11/2000): Willamette core (0.18 µm), 217 mm², 42 Mtr
• Intel Pentium 4 (01/2002): Northwood core (0.13 µm), 145 mm², 55 Mtr
• Intel Pentium M (05/2004): Dothan core (0.09 µm), 84 mm², 140 Mtr
• Intel Core 2 Duo (07/2006): Conroe core (0.065 µm), 143 mm², 291 Mtr
• Intel Core 2 Duo (01/2008): Penryn core (0.045 µm), 107 mm², 410 Mtr
• Intel Core i7 (11/2008): Bloomfield core (0.045 µm), 263 mm², 731 Mtr
• NVIDIA GF100 (09/2009): Fermi 512G (0.040 µm), 467 mm², 3000 Mtr
• Intel Core i7-2920XM (01/2011): Sandy Bridge (0.032 µm), 216 mm², 995 Mtr
• AMD A8-3850 (06/2011): Llano 4C/400G (0.032 µm), 228 mm², 1450 Mtr
• IBM (08/2011): BlueGene/Q, 18 cores (0.045 µm), 360 mm², 1470 Mtr

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01-SL

• IBM Power4 (12/2001): first commercial dual-core chip (0.18 µm), 412 mm², 174 Mtr
• Intel Core i7-3770 (04/2012): Ivy Bridge (0.022 µm), tri-gate, 160 mm², 1400 Mtr
• Apple iPad Air 2 (11/2014): A8X, 11 cores (0.020 µm), 128 mm², 3000 Mtr
• IBM Power8 (12/2014): 12 cores, 8 SMT (0.022 µm), 362 mm², 2100 Mtr
• Xeon E5-2600 v3 (06/2015, 7000$): Haswell-EP, 18 cores, 2 SMT (0.022 µm), 661 mm², 5560 Mtr
• Xeon Phi (06/2015, estimated): Knights Landing, 72 cores, 4 SMT (0.014 µm), 413 mm², 7100 Mtr
• AMD Radeon R9 Fury-X (06/2015): Fiji XT, 2048 GPU cores (0.028 µm), 567 mm², 8900 Mtr

[Die photos drawn with 10 mm and 20 mm scale bars]

Where are High Performance Computers?

• Among users: where you need to make this happen: “I have a limited battery and need to… take a picture, share it with my friends, ...”
• In the Internet infrastructure: where you need to connect anybody with anything
• In electronic devices: every electronic device has a computer inside
• In the datacenters: where you need to store and retrieve YOUR data
• In cars: a car may have as many as 50+ computers (California approved a bill for autonomous vehicles)

Computer Architects
• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)

[Figures: HP ProLiant DL585 G7 server with AMD Opteron 6272; AMD Opteron 6200 architecture, chip, core (“Bulldozer”) and characteristics]

Objectives of this course
• This course constitutes a deeper study of current computers and aims to provide:

• Principles of high-performance microprocessors (superscalar, VLIW)

• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system

• Principles of Multi-Core / Multi-Processor Systems

• Tools for programming Parallel Machines


Course Administration
• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office hours: Monday 16:30/19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2

• Adopted Textbook:
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN 978-0-521-88675-8

• Other Reference Textbooks:
  • Hennessy and Patterson, “Computer Architecture: A Quantitative Approach”, 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1558603433
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0867202041

Rules for exams, dates, slides, tools
• Check out the course website: http://www.dii.unisi.it/~giorgi/teaching/hpca2

Computer Architecture

“The term ARCHITECTURE is used here to describe the set of attributes of a system as these appear to the programmer*, i.e., its conceptual structure and its operation, as distinct from the organization of the networks that manage the flow of data and control, from the logical design and from the physical implementation”

-- Gene Amdahl, IBM Journal of R&D, Apr. 1964

* programmer == system (OS) programmer, engineer, or the compiler


Architecture: an overloaded term
• In the strict sense: the Hardware / Software interface
  • Set of instructions
  • Memory management and protection
  • Interrupts and exceptions (traps)
  • Data formats (for example, IEEE 754 floating point)

• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture (this is the part that Gene Amdahl had excluded)
  • Specifies the functional units and connections
  • Configuration of the pipeline
  • Position and configuration of cache memory

• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, when referring to the HW / SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006


Levels of Computer Architecture

[Figure: layered view of a computer system. Software: Application Programs, Libraries, Operating System (Drivers, Memory Manager, Scheduler). Hardware: Execution Hardware, Memory Translation, System Interconnect (bus), Controllers, I/O devices and Networking, Main Memory. The ISA separates Software from Hardware. The numbers 1-14 mark the interfaces listed below.]

Interfaces:
• 1: User Interface
• 2: API
• 3, 7: ABI
• 4, 5, 6: internal interfaces of the Operating System
• 7, 8: ISA
• 9: Memory architecture
• 10: I/O architecture
• 11, 12: RTL architecture
• 13, 14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level


What technology provides: Moore's Law
• “The number of TRANSISTORS doubles every 18 months” (later revised to "24 months"). This is due to:
  - higher density (transistors / area)
  - availability of bigger chips

[Plot: transistor count (Mtr) over time, DATA FROM SLIDE 1]

Moore's Law is purely PSYCHOLOGICAL!

What the market demands: Applications
• Application Trend
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes → minis → microprocessors → handheld, embedded
  • FROM little TO big memory storage (primary and secondary)
  • FROM single-thread TO multiple-threads
  • FROM standalone TO networked (cloud computing)
  • FROM character-oriented TO multimedia (graphics and sound)
  • FROM personal data TO “BIG DATA”


Main Applications
• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic

• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage

• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical

• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression


App. Trends: Multimedia, Networked, Web-servers
• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.

• Services via the Web and high-performance networks require
  • Many independent threads
  • High-bandwidth communication


MICROPROCESSOR ARCHITECTURE
• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
  • 1970s – Serial CPU, 1-bit for integers
  • 1980s – 32-bit RISC with a pipeline
    - The ISA simplicity allows the integration of the entire processor on a single chip
  • 1990s – bigger CPUs, superscalar
    - Also for CISC
  • 2000s – Multiprocessors on a chip...


Course Structure

1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)


EVALUATING COMPUTERS


POWER

• TOTAL POWER: DYNAMIC + STATIC (LEAKAGE)

$P_{dynamic} = \alpha C V^2 f$

$P_{static} = V I_{sub} \approx V e^{-K V_t / T}$

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATE
  • DYNAMIC POWER ROUGHLY PROPORTIONAL TO $f^3$
  • TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
  • TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!

• STATIC POWER
  • BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS

• POWER/ENERGY ARE CRITICAL PROBLEMS
  • POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
    • OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE, CORRECTNESS AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)
    • EFFECT ON THE SUPPLY OF POWER TO THE CHIP
  • ENERGY (DEPENDS ON POWER AND SPEED)
    • COSTLY; GLOBAL PROBLEM
    • BATTERY OPERATED DEVICES
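
The "replicate 4 times versus clock 4 times faster" comparison can be checked numerically. The following is a minimal sketch, assuming that the supply voltage has to scale linearly with the clock frequency (this is the assumption that makes dynamic power grow roughly as the cube of f). The function name `dynamic_power` and the unit values are hypothetical and chosen only for illustration.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic power: P = alpha * C * V^2 * f (arbitrary units)."""
    return alpha * c * v ** 2 * f

# Baseline core, normalized values (assumed for illustration).
base = dynamic_power(alpha=1.0, c=1.0, v=1.0, f=1.0)

# Option 1: replicate the core 4 times, same voltage and frequency.
four_cores = 4 * base

# Option 2: one core clocked 4 times faster; V is assumed to scale with f.
one_fast_core = dynamic_power(alpha=1.0, c=1.0, v=4.0, f=4.0)

print(four_cores / base)     # 4.0
print(one_fast_core / base)  # 64.0
```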


RELIABILITY

• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C X V
  • IF C AND V DECREASE THEN IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • DEVICE IS STILL OPERATIONAL BUT VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES

• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER
  • DUE TO
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION

• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES


PERFORMANCE METRICS (MEASURE)

• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • “X IS N TIMES FASTER THAN Y” MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE

• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT FOR X IS N TIMES HIGHER THAN Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)

• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE

BENCHMARKING


WHICH PROGRAM TO CHOOSE?

• REAL PROGRAMS:
  • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS

• KERNELS
  • COMPUTATIONALLY INTENSE PIECE OF REAL PROGRAM

• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)

• SYNTHETIC BENCHMARKS (NOT REAL)

• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS:
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS


REPORTING PERFORMANCE FOR A SET OF PROGRAMS

LET Ti BE THE EXECUTION TIME OF PROGRAM i (out of N programs):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

$\bar{T} = \frac{1}{N} \sum_{i=1}^{N} T_i$ OR $\bar{T} = \sum_{i=1}^{N} T_i W_i$

THE PROBLEM HERE IS THAT THE PROGRAMS WITH LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS
• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (let $T_{R,i}$ be the execution time on the reference machine):

$S_i = \frac{T_{R,i}}{T_i}$

• ARITHMETIC MEAN OF SPEEDUPS: $\bar{S} = \frac{1}{N} \sum_{i=1}^{N} S_i$

• HARMONIC MEAN: $\bar{S} = \frac{N}{\sum_{i=1}^{N} \frac{1}{S_i}}$


REPORTING PERFORMANCE FOR A SET OF PROGRAMS

• GEOMETRIC MEAN OF SPEEDUPS: $\bar{S} = \sqrt[N]{\prod_{i=1}^{N} S_i}$
  - MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Example 1 – Quantitative comparison depends on the reference machine

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8              10
Machine 2     1 sec       200 sec     100.5 sec         50.2              5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

Example 2 – contrasting results with arithmetic and harmonic mean

Execution times:

              Program A   Program B
Machine 1     10 sec      100 sec
Machine 2     1 sec       200 sec
Reference 1   100 sec     10000 sec
Reference 2   100 sec     1000 sec

In terms of speedup:

                  Program A   Program B   Arithmetic   Harmonic   Geometric
Wrt Reference 1
  Machine 1       10          100         55           18.2       31.6
  Machine 2       100         50          75           66.7       70.7
Wrt Reference 2
  Machine 1       10          10          10           10         10
  Machine 2       100         5           52.5         9.5        22.4

GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!
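
The numbers of Example 2 can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names (`arithmetic_mean`, `harmonic_mean`, `geometric_mean`, `speedups`) are hypothetical, while the execution times are the ones of the example. It also shows that the ratio of the geometric means of the two machines does not depend on the reference machine.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

# Execution times in seconds for (Program A, Program B), taken from the example.
machines = {"Machine 1": (10, 100), "Machine 2": (1, 200)}
references = {"Reference 1": (100, 10000), "Reference 2": (100, 1000)}

def speedups(ref, machine):
    """Per-program speedup S_i = T_R,i / T_i."""
    return [tr / t for tr, t in zip(ref, machine)]

for ref_name, ref in references.items():
    gm1 = geometric_mean(speedups(ref, machines["Machine 1"]))
    gm2 = geometric_mean(speedups(ref, machines["Machine 2"]))
    # The last column (gm2 / gm1) is the same for both reference machines.
    print(ref_name, round(gm1, 1), round(gm2, 1), round(gm2 / gm1, 3))
```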

FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs (also known as “IRON LAW”)

Texe = IC X CPI X Tc

• IC: DEPENDS ON PROGRAM, COMPILER AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCKS PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME

• WHEN THE PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK)

Texe = (IC X Tc)/IPC
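
As a small worked example of the two forms of the equation, the sketch below computes the execution time once with CPI and once with IPC = 1/CPI. The function names and the numeric values (one million instructions, CPI = 2, a 1 ns clock cycle) are assumptions made only for illustration.

```python
def exec_time(ic, cpi, tc):
    """Iron law: Texe = IC * CPI * Tc (same time unit as tc)."""
    return ic * cpi * tc

def exec_time_ipc(ic, ipc, tc):
    """Equivalent form using IPC: Texe = (IC * Tc) / IPC."""
    return ic * tc / ipc

# Assumed example: 1,000,000 instructions, CPI = 2, Tc = 1 ns.
t_cpi = exec_time(ic=1_000_000, cpi=2, tc=1)        # time in ns
t_ipc = exec_time_ipc(ic=1_000_000, ipc=0.5, tc=1)  # IPC = 1 / CPI

print(t_cpi, t_ipc)  # both forms give the same execution time (2 ms)
```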


AMDAHL’S LAW

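• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S

[Figure: without E, the execution time is split into the parts 1-F and F; after applying the enhancement (with E), the parts are 1-F and F/S]

$T_{exe}(\text{with } E) = T_{exe}(\text{without } E) \times \left[ (1 - F) + \frac{F}{S} \right]$

$Speedup(E) = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1 - F) + \frac{F}{S}}$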


LESSONS FROM AMDAHL’S LAW

1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED

$Speedup(E) < \frac{1}{1 - F}$

• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
  • The difference between SPEEDUP(k+1) and SPEEDUP(k) gets smaller and smaller as S goes from k to k+1

2) OPTIMIZE THE COMMON CASE
• EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)

[Plot for F=0.5: Amdahl's Law, Amdahl's maximum, Remaining Speedup and Marginal Speedup as functions of S]

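PARALLEL SPEEDUP

$S_P = \frac{T_1}{T_P} = \frac{1}{(1 - F) + F/P} = \frac{P}{F + P(1 - F)} < \frac{1}{1 - F}$

[Plot for F=0.95: Amdahl's Law, Amdahl's maximum and ideal speedup versus the number of processors; the Amdahl curve has a “mortar shot” shape]

• OVERALL NOT VERY HOPEFUL
• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??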
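
To get a feeling for how quickly the speedup saturates, Amdahl's formula can be evaluated for a few values of F and of the number of processors. This is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical.

```python
def amdahl_speedup(f, s):
    """Speedup when a fraction f of the execution time is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

def amdahl_limit(f):
    """Upper bound of the speedup, reached only for s -> infinity."""
    return 1.0 / (1.0 - f)

for f in (0.5, 0.95):
    for p in (2, 16, 1024):
        print(f"F={f} P={p}: speedup {amdahl_speedup(f, p):.2f}")
    print(f"F={f}: upper bound {amdahl_limit(f):.2f}")
```

Even with F=0.95 and 1024 processors the speedup stays below the bound of 20.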

GUSTAFSON’S LAW

• REDEFINE SPEEDUP
  • THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• STARTS WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

$T_P = s + p$

  • s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• EXECUTION TIME ON ONE PROCESSOR IS

$T_1 = s + pP$

• Let F = p/(s+p). Then $S_P = \frac{s + pP}{s + p} = \frac{s + p - p + pP}{s + p} = 1 - F + FP = 1 + F(P - 1)$

Gustafson observes that, although a single algorithm/program completes faster only if its parallel portion is dominant (Amdahl), the same algorithm completes faster and faster as we add processors (P), compared to a purely sequential execution that simply repeats the parallel portion (p) P times.

[Plot: SP as a function of F]
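
The two ways of writing Gustafson's scaled speedup can be compared numerically. The sketch below is only illustrative: the function names are hypothetical, and the values s=0.05 and p=0.95 (so that F=0.95) are assumed.

```python
def gustafson_speedup(f, procs):
    """Scaled speedup S_P = 1 + F * (P - 1), with F = p / (s + p)."""
    return 1.0 + f * (procs - 1)

def speedup_from_times(s, p, procs):
    """The same quantity computed as T_1 / T_P = (s + p * P) / (s + p)."""
    return (s + p * procs) / (s + p)

s, p = 0.05, 0.95  # assumed serial and parallel times on the parallel machine
f = p / (s + p)
for procs in (1, 16, 1024):
    print(procs, gustafson_speedup(f, procs), speedup_from_times(s, p, procs))
```

Unlike Amdahl's speedup, this value keeps growing linearly with P because the workload grows with the machine.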



PIPELINING


Pipelining

• Pipelining principles
• Simple Pipeline
• Structural Hazards
• Data Hazards
• Control Hazards


Pipelining principles

• Consider instructions composed of n phases of equal duration
• Let T be the time to execute an instruction
• Without pipelining
  • Latency = T
  • Throughput seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput pipe = n / T
• Speedup = Throughput pipe / Throughput seq = n

[Figure: instructions of duration T, each divided into stages 1, 2, ..., n, overlapped in time]

The (ideal) speedup obtainable from an ideal pipeline is equal to n
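
The speedup of n is a steady-state value. A minimal sketch, assuming no hazards and a stage time of T/n, shows how the speedup for a finite number k of instructions approaches n: the first instruction needs n stage times to fill the pipeline, and every following instruction completes one stage time later. The function names are hypothetical.

```python
def sequential_time(k, t):
    """k instructions executed one after the other, each taking t."""
    return k * t

def pipelined_time(k, n, t):
    """Ideal n-stage pipeline: n stage times to fill, then one result per stage time."""
    return (n + k - 1) * (t / n)

def speedup(k, n, t=1.0):
    return sequential_time(k, t) / pipelined_time(k, n, t)

for k in (1, 10, 1000, 1_000_000):
    print(k, round(speedup(k, n=5), 3))
```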

Implementation of a Simple Pipeline
• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

[Figure: stages F, D, X, M, W separated by latches driven by a common clock]


5-STAGE PIPELINE

INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON’T USE THE STAGE
• NOTE: CONTROL IMPLEMENTATION
  • INSTRUCTION CARRIES CONTROL
  • THIS IS A GENERAL APPROACH: “INSTRUCTION CARRIES ITS BAGGAGE”


Notation

5-stage pipeline

        1  2  3  4  5  6  7  8  9
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W

F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write back


Pipeline Hazards
• Conditions that lead to a malfunction if certain countermeasures are not taken

1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)

2) Data Hazards
• Two instructions use the same data: the accesses must happen in the order defined by the programmer, even if the execution of the instructions partially overlaps (see RAW, WAW, WAR)

3) Control Hazards
• An instruction (branch, jump, call) determines which instructions are executed next, but by the time its outcome is known the pipeline has already fetched the instructions that follow it sequentially, even if the branch is taken


1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle. Example:
  • A load / store uses the same memory that is used by the instruction fetch

i       F  D  X  M  W                 <-- load instruction
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              *  F  D  X  M  W     <-- i-fetch stalls
i+4                    F  D  X  M  . . .


Resolving structural hazards
• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces the performance
  - Used for some rare events

• Pipelining the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g. RAM)

• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap resources (or indivisible)

[Figure: replicated resource placed between a De-mux and a Mux]


Guidelines to reduce the structural hazards
• The structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separate Instruction memory and Data memory)
  • In the same pipeline stage
    - (e.g. I-Fetch in stage F, R/W in stage M)
  • For a single cycle
    - (e.g. HIT in the data or instruction cache)

• Many RISC processor ISAs were designed with this in mind
• Example of a problematic situation:
  • MISS in cache: the pipeline stalls


2) Data Hazards
• Two instructions use the same data: the accesses must happen in the order indicated by the programmer, even if the execution of the instructions partially overlaps. Example:

R1 <- R2 + R3
R2 <- R1 - R7
R1 <- R5 OR R6

Data Hazards -- examples

r1 ?? Read-After-Write (RAW) Hazard

i    add r1, r2, r3   F D X M W          <-- writing r1 (W)
i+1  sub r2, r1, r7     F D X M W        <-- reading r1 (D)

r1 ?? Write-After-Read (WAR) Hazard

i+1  sub r2, r1, r7   F * * * * * D X M W    <-- reading r1 (D)
i+2  or  r1, r5, r6     F D X M W            <-- writing r1 (W)

Note: PURELY HYPOTHETICAL SITUATION; it cannot happen in this pipeline, by construction

r1 ?? Write-After-Write (WAW) Hazard

i    add r1, r2, r3   F * * * D X M W    <-- writing r1 (W)
i+1  sub r2, r1, r7     F D X M W
i+2  or  r1, r5, r6       F D X M W      <-- writing r1 (W)

Note: PURELY HYPOTHETICAL SITUATION; it cannot happen in this pipeline, by construction


Dependency and Hazard

Dependency: a situation in the code that can potentially create hazards

• Read-After-Write (RAW, true dependence)
  • There is a real "data exchange" from one instruction to another

• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers

• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers

• Read-After-Read (RAR)
  • Will not cause problems

HAZARDS
• The dependencies may or may not translate into hazards, depending on the hardware
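
The three kinds of dependence can be detected by comparing the destination and source registers of two instructions in program order. The sketch below is a hypothetical illustration (the `Instr` type and the `dependencies` function are not part of the course material) applied to the three-instruction example used for the data hazards.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def dependencies(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first` (first precedes second in program order)."""
    deps = set()
    if first.dest in second.srcs:
        deps.add("RAW")   # true dependence
    if second.dest in first.srcs:
        deps.add("WAR")   # anti-dependence
    if second.dest == first.dest:
        deps.add("WAW")   # output dependence
    return deps

add = Instr("r1", ("r2", "r3"))   # R1 <- R2 + R3
sub = Instr("r2", ("r1", "r7"))   # R2 <- R1 - R7
or_ = Instr("r1", ("r5", "r6"))   # R1 <- R5 OR R6

print(sorted(dependencies(add, sub)))  # ['RAW', 'WAR']
print(sorted(dependencies(sub, or_)))  # ['WAR']
print(sorted(dependencies(add, or_)))  # ['WAW']
```

Whether each of these dependences becomes a hazard still depends on the pipeline, which this check does not model.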


True Dependence and MIPS Data Hazards
• Read After Write (RAW): the instruction J tries to read an operand before the instruction I writes it

I: add r1,r2,r3
J: sub r4,r1,r3

• Caused by a “dependence” (in the terminology of compiler theory) called “true dependence"
  • The hazard results from a real need of data communication

• In the MIPS processor a True Dependence normally generates a hazard


Anti-Dependence and MIPS Data Hazards
• Write After Read (WAR): the instruction J tries to write an operand before the instruction I reads it

I: sub r4,r1,r3
J: add r1,r2,r3

• Also called "anti-dependence" (in the terminology of compiler theory)

• It results from having reused the name "r1", while another register could easily have been used (*)

• Does not cause a conflict in a 5-stage MIPS pipeline because:
  • All instructions take 5 stages, and
  • the reads from registers occur in stage 2 (D), and
  • all writes are always in stage 5 (W)

(*) This, however, is not always possible in software; see Lesson 2

Output Dependence and MIPS Data Hazards
• Write After Write (WAW): the instruction J tries to write an operand before the instruction I writes it

I: mul r1,r4,r3
J: add r1,r2,r3

• Also called "output dependence" (in the terminology of compiler theory)

• Also in this case, it results from having reused the name "r1", while another register could easily have been used (*)

• Does not cause a conflict in a 5-stage MIPS pipeline because:
  • All instructions take 5 stages, and
  • all writes are always in stage 5 (W)

(*) This, however, is not always possible in software; see Lesson 2

Simple resolution of the RAW hazard

• The hardware detects the RAW hazard, and then...
  • Generates a stall to allow the "producer" instruction to finish

              F D X M W
R1<-R2+R3       F D X M W
R2<-R1-R7         F * * D X M W

+ Cost-effective and simple
- Reduces the performance

NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)


Implementation: stall control network
• Add latches to remember the RS1/RS2/RD register identifiers at each stage

• The stall is detected by making the following comparison:
  if (RS1(D) == RD(X) || RS1(D) == RD(M)) then STALL (generates a stall in F)
• Similarly for RS2

[Figure: Stage D (Register file, outputs A and B), Stage X (Execution unit), Stage M (D-Cache); the Stall Control block compares RS1 of stage D with RD of stages X and M and drives the stall signal]
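
The comparison performed by the stall control network can be sketched in software as follows. This is a minimal model, not the actual hardware: the function name `must_stall` is hypothetical, and `None` is assumed to represent a stage whose instruction does not write any register.

```python
def must_stall(rs1_d, rs2_d, rd_x, rd_m):
    """True if the instruction in stage D reads a register that is still
    being produced by the instruction in stage X or in stage M."""
    pending = {rd for rd in (rd_x, rd_m) if rd is not None}
    return rs1_d in pending or rs2_d in pending

# sub r2, r1, r7 is in D while add r1, r2, r3 is in X: RAW on r1, stall.
print(must_stall("r1", "r7", rd_x="r1", rd_m=None))   # True

# or r1, r5, r6 in D: neither r5 nor r6 is pending, no stall.
print(must_stall("r5", "r6", rd_x="r2", rd_m="r1"))   # False
```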


Inserting Stalls (detail)
• With respect to the stage of the instruction that stalls, it is necessary:
  • On the previous stages: to block all the inter-stage latches (HOLD signal to the flip-flops)
  • On the next stages:
    - To turn off the valid bit associated with the inter-stage latch, so that the "bubble" can proceed through the pipeline without creating problems

[Figure: the STALL SIGNAL drives the HOLD input of the flip-flops of the previous stage and of the stalling stage; the latch towards the next stage gets Valid Bit = 0, while the following one keeps Valid Bit = 1]


Reduction of the RAW stalls
• Bypass/Forward/Short-Circuit network
  • Idea: use the data before it is written in the registers
  + Reduces (potentially avoids) stalls
  - Additional complexity

[Figure: pipeline IF, ID, EX, ME, WB with bypass paths from the later stages back to EX]


Bypass
• Additional Hardware
  • Multiplexers to select the input value of the ALU: either from the Registers or from the Bypass network
  • Hazard detection logic (called interlock) that controls these multiplexers

[Figure: Register file, operand latches, two MUXes driven by the bypass control, Execution Unit (ALU) and Result latch; the bypass path feeds the result back to the MUX inputs]

Network to detect the possibility of Bypass
• Add latches to remember the RS1/RS2/RD register names at each stage (similar to the network for hazard detection)

• E.g. on the input A of the ALU, the network will act like this:
  if RS1(D) == RD(X) then select ALU-OUT(X)
  else if RS1(D) == RD(M) then select D-CACHE-OUT(M)
  else select (A)

  …similarly on the B input…

[Figure: Register file (outputs A and B), two MUXes, Execution Unit (ALU), Result latch and D-Cache; the Bypass Control block compares RS1 with the RD of the following stages and selects among A/B, ALU-OUT(X) and D-CACHE-OUT(M)]
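
The selection rule for the A input of the ALU can be written as a small function. This is an illustrative sketch only: the function and parameter names are hypothetical, and `None` is assumed to mark a stage whose instruction has no destination register.

```python
def select_alu_input_a(rs1_d, rd_x, rd_m, alu_out_x, dcache_out_m, reg_value_a):
    """Pick the value for ALU input A, giving priority to the most recent producer."""
    if rd_x is not None and rs1_d == rd_x:
        return alu_out_x       # the producer is in stage X: bypass ALU-OUT(X)
    if rd_m is not None and rs1_d == rd_m:
        return dcache_out_m    # the producer is in stage M: bypass D-CACHE-OUT(M)
    return reg_value_a         # no pending producer: use the register file value

# r1 is produced by the instruction in X: the stale register value 0 is ignored.
print(select_alu_input_a("r1", rd_x="r1", rd_m="r4",
                         alu_out_x=42, dcache_out_m=7, reg_value_a=0))
```

Checking stage X before stage M matters when both stages write the same register: the instruction in X is the younger one, so its value is the one the consumer must see.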

Interaction between the Stall and Bypass control networks
• The stall logic is aware of the presence of the bypass logic
• The bypass logic is activated independently of the stall conditions

[Figure: Register file (outputs A and B), two MUXes, Execution unit and D-Cache, with both the Bypass Control and the Stall Control blocks comparing RS1 with the RD of the following stages]

Pipeline Scheduling
• Scheduling of instructions at compile-time (reorder the instructions to reduce the stalls caused by load instructions):

BEFORE:
a = b + c;    R1 <- mem(b)
              R2 <- mem(c)
              stall
              R3 <- R1 + R2
              mem(a) <- R3
d = e - f;    R4 <- mem(e)
              R5 <- mem(f)
              stall
              R6 <- R4 - R5
              mem(d) <- R6

AFTER:
R1 <- mem(b)
R2 <- mem(c)
R4 <- mem(e)
R3 <- R1 + R2
R5 <- mem(f)
mem(a) <- R3
R6 <- R4 - R5
mem(d) <- R6
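
The effect of the reordering can be verified by counting the load-use stalls of the two schedules. The sketch below is a simplified model, assuming that a stall occurs only when an instruction reads a register loaded by the instruction immediately before it (one-cycle load-use penalty). The tuple encoding `(dest, srcs, is_load)` and the function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count instructions that read a register loaded by the previous instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_dest, _, prev_is_load = prev
        _, cur_srcs, _ = cur
        if prev_is_load and prev_dest in cur_srcs:
            stalls += 1
    return stalls

def load(dest):
    return (dest, (), True)

def alu(dest, a, b):
    return (dest, (a, b), False)

def store(src):
    return (None, (src,), False)

before = [load("R1"), load("R2"), alu("R3", "R1", "R2"), store("R3"),
          load("R4"), load("R5"), alu("R6", "R4", "R5"), store("R6")]

after = [load("R1"), load("R2"), load("R4"), alu("R3", "R1", "R2"),
         load("R5"), store("R3"), alu("R6", "R4", "R5"), store("R6")]

print(count_load_use_stalls(before))  # 2
print(count_load_use_stalls(after))   # 0
```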


Dynamic Instruction Scheduling(Introduction)


-Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ02-SL di 19-1

Example: limits of the sequential execution
• Multiplication of the elements of two vectors, storing the result in a third vector

      r1 <- looplength
      r2 <- 0
      r4 <- addr(b)          # load
      r5 <- addr(c)          # the pointers
      r6 <- addr(a)          # to the variables

loop: r3 <- mem(r4+r2)       # load b(i)
      r7 <- mem(r5+r2)       # load c(i)
      r7 <- r7 * r3          # b(i) * c(i)
      r1 <- r1 - 1           # decr. the counter
      mem(r6+r2) <- r7       # store a(i)
      r2 <- r2 + 8           # update the index
      P <- loop; r1!=0       # close the loop

“In-Order” execution – standard pipeline

The situation:
      ...
      r7 <- mem(r5+r2)       # load c(i)
      r7 <- r7 * r3          # b(i) * c(i)
      r1 <- r1 - 1           # decr. the counter
      ...

The Problem:
• The Load is launched (“issue” phase)
• The Multiply stalls because of the true dependency (on r7)
• The Subtract stalls because the Multiply stalls (and the next Branch stalls too)
  - Why should the Subtract (needlessly) stall?

• The In-order execution limits the performance


Possible solutions
• Static Scheduling (Software)
  • The compiler re-orders the instructions
  + it can implement “tricks” much more powerful than the (high-level) programmer may know
  + it can make use of simpler (and therefore faster) hardware
  - it requires additional work from the compiler
  - it may not adapt to events (e.g., cache misses) that occur at runtime
  • Adopted in the Intel Itanium and in VLIW processors in general

• Dynamic Scheduling (Hardware) or “Out-Of-Order Issue”
  • The hardware re-orders the instructions
  + it can handle events not known at compile-time
  + the software is less dependent on specific hardware (portability)
  - the hardware is certainly more complex
  • Adopted in almost all superscalar processors


Dynamic Scheduling or “Out-Of-Order” Issue
• RULES for proper operation:
  RU1) The out-of-order issue must respect “True dependences”, i.e., it must not generate RAW hazards
  RU2) The out-of-order issue must avoid “False dependences”, i.e., it must not generate (avoidable) WAR and WAW hazards

A code snippet with false dependencies:
r7 <- mem(r5+r2)     # first instruction (I1)
r7 <- r7 * r3        # second instruction (I2)
r3 <- r6 + 1         # third instruction (I3)

IN-ORDER issue (and execution), over time t: I1 I2 I3
OUT-OF-ORDER issue (and execution), over time t: I1 I3 I2 (in this case resulting in a WRONG execution)

If r3 is changed (out-of-order) by the third instruction (I3), the multiplication (I2) would read an incorrect value from r3

Resolving False Dependencies by SW ?

• The compiler can avoid some false dependencies BUT NOT ALL (some are desired). E.g.:

r1 <- x
if y != 0  r1 <- z
r3 <- r1 + r2

• When the “y!=0” condition is true, there is an output dependence on r1 (and then a true dependence with the next instruction)

• There is no “false dependence” when the branch is not taken

• An instruction may also create dependencies with itself
  - (e.g. it happens in the dynamic scheduling of instructions in loops)


Resolving False Dependencies by HW !

• We can effectively eliminate false dependencies by renaming the registers in hardware

• The register identifiers then have a one-to-one association with the processed values
  • NOT with physical locations

• The hardware can rename registers to remove false dependencies. E.g. (same example as in slide 5):

Original               Renamed
r7 <- mem(r5+r2)       t1 <- mem(r5+r2)
r7 <- r7 * r3          t2 <- t1 * r3
r3 <- r6 + 1           t3 <- r6 + 1

t1 and t2 are the new names of r7; t3 is the new name of r3
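
The renaming of this example can be reproduced with a very small algorithm: every destination gets a fresh name, and every source is replaced by the most recent name of its register. This is a simplified sketch (unlimited fresh names, no freeing of names); the function name `rename` and the `(dest, srcs)` encoding are hypothetical.

```python
from itertools import count

def rename(program):
    """Rename destinations to fresh names t1, t2, ... and update later sources."""
    mapping = {}          # architectural register -> most recent new name
    fresh = count(1)
    renamed = []
    for dest, srcs in program:
        # Sources are looked up BEFORE the destination is renamed,
        # so that "r7 <- r7 * r3" reads the previous name of r7.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = f"t{next(fresh)}"
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("r7", ("r5", "r2")),   # r7 <- mem(r5+r2)
    ("r7", ("r7", "r3")),   # r7 <- r7 * r3
    ("r3", ("r6",)),        # r3 <- r6 + 1
]

for dest, srcs in rename(program):
    print(dest, "<-", srcs)
```

After renaming, the third instruction writes t3 instead of r3, so it can be issued before the multiplication without changing the value that the multiplication reads.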


A first approach: Thornton's Scoreboard
• Implemented in the CDC 6600 (1964)
• The CDC 6600 had 18 non-pipelined functional units
  • 4 FP units: 2 multiply, 1 add, 1 divide
  • 7 Memory units: 5 load, 2 store
  • 7 Integer units: 3 add, 1 shift, 1 logical, ...

• Scoreboard: a centralized control scheme
  • Controls the issue of every instruction
  • Detects hazards

• Implements the "Out-Of-Order issue" but does not use the "renaming" of the registers
  • The critical WAR and WAW hazards may stall the instruction issue

• Mainly of historical interest

[Figure: instruction "stack" and Decode feeding the functional units: Boolean, Shift, Fixed Add, Float Add, Float Mult. (x2), Float Divide, Short Fixed Add (x2)]

IBM 360/91 and Tomasulo’s Algorithm
• "Fast" version of the IBM 360 for scientific programs
  • The chief architect was Gene Amdahl
  • Announced in 1964 and available since 1965

• Tomasulo’s Algorithm
  • Published in 1967 (before the introduction of the cache memory concept)
  • The Floating Point units are pipelined and include:
    - Adder
    - Multiplier (the Division is executed in the Multiplier)

• Dynamic Scheduling in the FP unit
  • Uses an algorithm known to history as “Tomasulo’s Algorithm”
  • First used in the IBM 360/91


Generalized Tomasulo’s Algorithm
• Improves the architecture of a parallel pipeline (i.e., a pipeline with functional units that operate in parallel)

• Extends Tomasulo’s Algorithm to all functional units
  - Not only to the Floating Point units
  - Moreover: it manages the loads/stores like any other instruction

• Introduces the concept of “Reservation Station” (RS)
  • Each Functional Unit (FU) has a set of associated RSs
  • The RSs hold information about the operands that are available and about those that are not yet available because they will be produced later by instructions still in the pipeline
  • The RSs account for “future operands” by associating a “TAG” that tells exactly which functional unit will produce the future value
  • Once the result is produced by the given FU, it is broadcast on a Common Data Bus (CDB) along with the TAG of the FU
  • A matching logic associates the CDB value+tag with the waiting RSs

Generalized Tomasulo’s Algorithm -- Scheme

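[Figure: scheme of the pipeline. Stages: F (I-cache access), D (Decode), P (DISPATCH), I (ISSUE), X1 X2 X3 X4 (execution), C (Complete), followed by the WRITE-BACK to the registers (Regs). Reservation stations: M RS (2 RSs), LS RS (3 RSs), A RS (3 RSs), plus a load queue LQ (3 elements) and a store queue SQ (3 elements). Functional units: M FU (Mult. 1, Mult. 2, Mult. 3, Mult. 4), LS FU (Address Add, D-Cache 1, D-Cache 2), A FU (Integer). The results return to the RSs and to the registers through the Common Data Bus.]

NIP = Next Instruction to disPatch, CIP = Current Instruction to disPatch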

Structure of a Reservation Station (RS)

Fields of a reservation station RSx:

Field:   Busy    Op            Qj       Qk       Vj        Vk        I
Size:    1 bit   e.g. 6 bits   4 bits   4 bits   32 bits   32 bits   16 bits

• Busy: the element is not available
• Op: opcode
• Qj, Qk: source FU designators associated with this instruction at dispatch time
  * if ZERO, the corresponding Vx holds the current value
  * if != ZERO, Qx indicates the FU that WILL produce the value
• Vj, Vk: operand values
• I: immediate value (if specified by the instruction)

Note-1: only one between Vx and Qx is valid at a given time
Note-2: for loads, Vk contains an immediate value (e.g., an offset)

• The Register File fields, in addition to the value Vi, also include the tag Qi that may indicate the RS that will produce such value (when Qi != 0):

Fields of a register Ri: Qi, Vi
• Qi: tag field
• Vi: operand value

Generalized Tomasulo’s Algorithm -- Dispatch
• 3 fundamental steps (**)
  • Dispatch (book: “Issue”)
  • Issue (book: “Execute”)
  • Complete (book: “Write”)

• disPatch
  • Take the next instruction from the Decode unit
  • Locate the first free appropriate RS
  • If there is no appropriate free RS → structural hazard
  • If there is an appropriate free RS → dispatch the instruction
  • Copy the operands (1 or 2, if ready) from the registers to the RS
  • If one or both operands are not ready, write in the RS, instead of the values, the tags that identify the RSs of the producer instructions
  • This corresponds to resolving the false dependencies by "renaming" the registers with the tags (RULE RU2, see slide 5)

** NOTE: the terminology used here is more modern than the one in the Hennessy-Patterson book


Generalized Tomasulo’s Algorithm -- Issue
• Issue
  • If one of the RSs associated with a FU has its operands ready, then the associated instruction is launched (issued) and later executed in the stage(s) that follow(s)
  • If the operands are not yet available, it means that they will become available on the CDB later on
  • In this way, the execution is delayed until the arrival of the operands, automatically resolving the true dependencies (RULE RU1, see slide 5)


Generalized Tomasulo’s Algorithm -- Complete
• Complete
  • If the CDB is available, the FU result is written on the CDB, along with the identifier ('Id') of the producer RS
  • The RSs and registers act as an associative memory with respect to the tag: every RS that contains a tag equal to Id stores the data seen on the CDB
  • The registers are updated in this phase too, with the same mechanism
  • If the CDB is not available, we have a stall (structural hazard)
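The associative update performed in the Complete step can be modeled as follows. This is a minimal sketch under stated assumptions: stations and registers are plain dictionaries, and the helper names `broadcast` and `ready_to_issue` are hypothetical. The values (13 and 11, tags 6 and 7) are those of the worked example in this course.

```python
def broadcast(tag, value, stations, regs):
    """Put (tag, value) on the CDB: every field waiting for `tag` captures `value`."""
    if tag == 0:
        raise ValueError("tag 0 is reserved for 'value ready'")
    for rs in stations:
        if not rs["busy"]:
            continue
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, 0
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, 0
    for reg in regs:
        if reg["q"] == tag:
            reg["v"], reg["q"] = value, 0


def ready_to_issue(rs):
    """An instruction can be issued once both of its tags are zero."""
    return rs["busy"] and rs["qj"] == 0 and rs["qk"] == 0


# The mult in M1 waits for both loads; r7 has already been renamed to tag 4.
m1 = {"id": 4, "busy": True, "op": "mult", "vj": None, "vk": None, "qj": 6, "qk": 7}
regs = {3: {"q": 6, "v": 0}, 7: {"q": 4, "v": 49}}

broadcast(6, 13, [m1], regs.values())   # first load completes
after_first = ready_to_issue(m1)        # still waiting for tag 7
broadcast(7, 11, [m1], regs.values())   # second load completes
```

Note that the second broadcast does not touch r7: its tag is 4 (the mult), so the load result is captured only by the reservation station that needs it.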



Generalized Tomasulo’s Algorithm -- Commit
• Instructions can now complete out-of-order
• The machine does not need a “commit stage”, but we can think of a “safe state” of the machine as if there were a “commit stage” after the “complete stage”
• This “safe state” is useful to know at which cycle the machine can be considered to have completed **in-order** all previously issued instructions
• See exercises such as the 23/06/2005 test to better understand this point

Generalized Tomasulo’s Algorithm -- the RSs
• Reservation Stations
  • Implement a distributed hazard control
    - in the dispatch phase, the tags perform the renaming and avoid WAW and WAR hazards
    - in the issue phase, the tag matching on the CDB produces the necessary delay while waiting for earlier data, thus eliminating RAW hazards (key observation: we delay only the single instruction, NOT the whole pipeline!)
  • In our scheme the RSs are numbered from 1 to 8
    - ‘Zero’ is a reserved tag value indicating that the operand does not come from an RS but is already available in the ‘operand field’
    - The tag takes 4 bits
    - All the "receivers" use the 'Id' field to identify the incoming data on the CDB
  • During Dispatch
    - The tags are created and associated with the values to be produced (initially not yet available), i.e. they substitute the register identifiers


Loads and Stores
• Loads are managed like the other instructions
  • They are processed by the Load/Store FU
  • The address passes through the Load Queue (LQ)
• For store instructions, the EFFECTIVE address calculation phase is separated from the writing-into-memory phase
  • In the Dispatch phase, the address A is sent to a queue associated with the Store FU (called Store Queue or SQ)
  • The data to be written is sent directly into the SQ
  • The Load/Store FU then sends the EFFECTIVE address to the SQ
  • When the data arrives, it is associated, through the tag field, with the effective address of the waiting instruction in the SQ
• Resolving data hazards that happen through memory locations
  • Loads and stores must be completed "in-order"
  • A load address must be compared with all previous addresses in the SQ
    - If there is a match, the load must wait for the completion of the corresponding store (stall)
    - If there is NO MATCH, the load can "go ahead" of the stores in the SQ and complete out-of-order
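The address comparison against the Store Queue can be sketched as below. The function name `load_may_proceed` is hypothetical, and one extra assumption is made that the slides do not state: a store whose effective address is not yet known (`None`) conservatively blocks the load.

```python
def load_may_proceed(load_addr, older_store_addrs):
    """Return True if the load can bypass every older store in the SQ.

    older_store_addrs: effective addresses of the stores that precede the
    load in program order; None means "address not computed yet"
    (assumption: such a store blocks the load).
    """
    for store_addr in older_store_addrs:
        if store_addr is None or store_addr == load_addr:
            return False   # possible memory RAW hazard: stall the load
    return True


# A load from address 1000 can pass a pending store to address 3000,
# but a load from 3000 must wait for that store to complete.
independent = load_may_proceed(1000, [3000])
dependent = load_may_proceed(3000, [3000])
```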


Loads and Stores
• Store Queue (SQ) entry, SQEi: Ai (address value) | Qi (data tag field) | Vi (data value)
• Load Queue (LQ) entry, LQEi: Ai (address value)

[Figure: load/store path. Labels: issue instruction, address add & translation, load addresses, store addresses, Load Queue, Store Queue, address compare, hazard control, store data from CDB, to memory]

http://www.dii.unisi.it/~giorgi/teaching/hpca2

Dynamic Instruction Scheduling (Example)

Example

loop: r3 <- mem(r4+r2)     # load b(i)
      r7 <- mem(r5+r2)     # load c(i)
      r7 <- r7 * r3        # b(i) * c(i)
      r1 <- r1 - 1         # decr. counter
      mem(r6+r2) <- r7     # store a(i)
      r2 <- r2 + 8         # bump index
      P <- loop; r1!=0     # close loop

CYCLE 1

Reg  Q  V
0    -  -
1    0  100
2    0  0
3    6  -
4    0  1000
5    0  2000
6    0  3000
7    0  49

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   -     -     -     -   -   -
A2   2   -     -     -     -   -   -
A3   3   -     -     -     -   -   -
M1   4   -     -     -     -   -   -
M2   5   -     -     -     -   -   -
LS1  6   1     load  1000  0   0   0
LS2  7   -     -     -     -   -   -
LS3  8   -     -     -     -   -   -

CYCLE 2

Reg  Q  V
0    -  -
1    0  100
2    0  0
3    6  -
4    0  1000
5    0  2000
6    0  3000
7    7  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   -     -     -     -   -   -
A2   2   -     -     -     -   -   -
A3   3   -     -     -     -   -   -
M1   4   -     -     -     -   -   -
M2   5   -     -     -     -   -   -
LS1  6   1     load  1000  0   0   0
LS2  7   1     load  2000  0   0   0
LS3  8   -     -     -     -   -   -

CYCLE 3

Reg  Q  V
0    -  -
1    0  100
2    0  0
3    6  -
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   -     -     -     -   -   -
A2   2   -     -     -     -   -   -
A3   3   -     -     -     -   -   -
M1   4   1     mult  -     -   6   7
M2   5   -     -     -     -   -   -
LS1  6   1     load  1000  0   0   0
LS2  7   1     load  2000  0   0   0
LS3  8   -     -     -     -   -   -

- The first load completes: let’s assume that it reads ’13’

CYCLE 4

Reg  Q  V
0    -  -
1    1  -
2    0  0
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   1     sub   100   1   0   0
A2   2   -     -     -     -   -   -
A3   3   -     -     -     -   -   -
M1   4   1     mult  13    -   0   7
M2   5   -     -     -     -   -   -
LS1  6   0     -     -     -   -   -
LS2  7   1     load  2000  0   0   0
LS3  8   -     -     -     -   -   -

- The first load writes on the CDB (the value 13)
- The sub goes in dispatch
- The second load is issued
- The mult can’t be issued until it gets Qj=Qk=0

CYCLE 5

Reg  Q  V
0    -  -
1    1  -
2    0  0
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   1     sub   100   1   0   0
A2   2   -     -     -     -   -   -
A3   3   -     -     -     -   -   -
M1   4   1     mult  13    -   0   7
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   1     load  2000  0   0   0
LS3  8   -     -     -     -   -   -

SQ:  A  Q  V
     -  4  -

- The second load completes: let’s assume it reads ’11’
- The mult waits and the sub is issued
- The store goes in dispatch
  - Simultaneously we allocate one element in the SQ
  - The sub is going to conflict on the CDB with the load, so it will have to wait (conflict on the CDB)

CYCLE 6

Reg  Q  V
0    -  -
1    1  -
2    2  -
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   1     sub   100   1   0   0
A2   2   1     add   0     8   0   0
A3   3   -     -     -     -   -   -
M1   4   1     mult  13    11  0   0
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

SQ:  A     Q  V
     3000  4  -

- The second load writes on the CDB (the value 11)
- The mult is issued, and the sub is waiting for the CDB
- The store is issued: in the SQ it gets the effective address A
  - but it can’t advance as long as its Q != 0
- The add goes in dispatch

CYCLE 7

Reg  Q  V
0    -  -
1    0  99
2    2  -
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0     -     -     -   -   -
A2   2   1     add   0     8   0   0
A3   3   1     brch  99    0   0   0
M1   4   1     mult  13    11  0   0
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

- The mult proceeds and the store waits
- The sub completes and updates R1 (through the CDB) with ’99’
- The add is issued
- The branch goes in dispatch

CYCLE 8

Reg  Q  V
0    -  -
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0     -     -     -   -   -
A2   2   0     -     -     -   -   -
A3   3   1     brch  99    0   0   0
M1   4   1     mult  13    11  0   0
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

- The mult proceeds and the store waits
- The add writes on the CDB
- The branch is issued

CYCLE 9

Reg  Q  V
0    -  -
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    4  -

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0     -     -     -   -   -
A2   2   0     -     -     -   -   -
A3   3   0     -     -     -   -   -
M1   4   1     mult  13    11  0   0
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

- The mult completes and calculates 13*11=143
- The store waits
- The branch completes

CYCLE 10

Reg  Q  V
0    -  -
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    0  143

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0     -     -     -   -   -
A2   2   0     -     -     -   -   -
A3   3   0     -     -     -   -   -
M1   4   0     -     -     -   -   -
M2   5   -     -     -     -   -   -
LS1  6   1     sto   3000  0   0   0
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

SQ:  A     Q  V
     3000  0  143

- The mult writes on the CDB (the value ‘143’)
- The store gets the value ‘143’ and can finally complete

CYCLE 11

Reg  Q  V
0    -  -
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    0  143

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0     -     -     -   -   -
A2   2   0     -     -     -   -   -
A3   3   0     -     -     -   -   -
M1   4   0     -     -     -   -   -
M2   5   -     -     -     -   -   -
LS1  6   0     -     -     -   -   -
LS2  7   0     -     -     -   -   -
LS3  8   -     -     -     -   -   -

Tomasulo: Summary
• Reservation Stations
  • Allow "out-of-order issue" based on the availability of data (e.g., sub and add issued without waiting for the mult)
• Register Renaming (tags)
  + Avoids the WAR and WAW hazards; especially important when few registers are available (as originally in the IBM 360)
  + Realizes a dynamic "loop unrolling"
  - Requires relatively complex logic
• Common Data Bus
  + Simultaneously broadcasts the results to several waiting instructions
  - It’s a "bottleneck", but it can be replicated (of course at the cost of more hardware)
• The scheme does not handle "precise exceptions"


Tomasulo: hazard management summary

Hazard                              Management method
Structural on RS (RS finite)        Stall in the Dispatch stage (*1)
Structural on CDB (CDB occupied)    Stall in the Issue stage (*2)
Structural on FU (FU occupied)      Stall in the Issue stage (*3)
RAW                                 Avoided by using the tags
WAR                                 Avoided by copying operands into the RS at dispatch time
WAW                                 Avoided by using SW Register Renaming

(*1) avoidable with a larger number of RSs
(*2) avoidable with a larger number of CDBs
(*3) avoidable with multiple FUs (or reducible with pipelined FUs)


Reservation Station -- implementation

[Figure: logic diagram of a register and a reservation station. Labels: dispatch (move to res. station), issue (move to functional unit), CDB tag, CDB data, RS No., tag comparators, Qi/Busy (register), Op, Qj, Vj, Qk, Vk, Busy and Full flags with ld/set/clr controls, "=0?" detectors and AND gate producing "ready" to the issue logic, "busy" to the dispatch logic, outputs to the functional unit. Notes in the figure: the k operand is handled like the j operand; the tag arrives 1 cycle before the data.]

General organization of IBM 360/91 pipeline
• “In-order” pipeline with the following stages:
  • I-fetch, decode, address generation
• Floating point decoupled from the Integer (Fixed Point) unit through memory buffers
  • Effective-address generation done in the integer unit
  • A memory pipeline for loading the data

IBM 360/91 -- Floating Point Unit

From: R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM Journal, Jan. 1967, pp. 25-33

Branch prediction (first part)


Stalls due to control hazards (e.g., branch)

[Figure: pipeline timing diagram (stages F D X M W) for instruction 1, instruction 2 (branch, e.g., bne $1, $2, label), instruction 3 and instruction 4, followed by instruction i (at label:), instruction i+1 and instruction i+2. Annotations: the condition $1==?$2 is known only after the X stage; the pipeline must be flushed; instructions are then fetched from the new branch; branch penalty (2 cycles).]

Introduction
• Programs are not linear sequences of instructions
• Programs are FULL of “branch”-like instructions

Example:
    if (i==0)
        k = 1;
    else
        k = 2;

With i =: $1, k =: $2:
          bne  $1, $0, ELSE
    THEN: addi $2, $0, 1
          j    NEXT
    ELSE: addi $2, $0, 2
    NEXT:

• In this case 50% of the instructions are branch/jump instructions (in this example 2 out of 4)
• On average, branch instructions are 15% of the total, i.e., 1 every 6-7 instructions

Structure of programs

Basic-Block
• An instruction sequence with a single entry point AND a single exit point (e.g., an instruction sequence with a branch or jump at the end)
• Each block is about 6-7 instructions on average

Branching
• Branch instructions are “expensive” (pipeline utilization Upipe goes down, memory bandwidth use BWmem goes up, I-cache misses I$miss go up)
  • 2-3 pipeline cycles are wasted
  • 15-30 cycles in the case of wide-issue deeply pipelined processors (e.g., Pentium 4)
  • Memory bandwidth is wasted to fetch useless instructions
  • The instruction cache is likely to miss on the taken branch
• Terminology
  • Branch Penalty: number of cycles wasted because we need to flush the instructions of the not-taken path (which are already in the pipeline)
  • Branch Instruction Address (BIA): address of the branch instruction (in memory)
  • Branch Target Address (BTA): address of the new Program Counter (PC)
  • Branch Taken: control goes to a PC different from PC+(1 instr.)
  • Branch Not Taken: the next instruction below the branch is executed

Example (32-bit instructions)
    0x1000           -----
                     -----
                     -----
    0x100C           BNE R1, R2, LABEL     <- BIA
                     -----                 (BRANCH NOT-TAKEN continues here)
                     -----
                     …
    0x374C (LABEL:)  ADD …                 <- BTA (BRANCH TAKEN continues here)
                     SUB …
                     -----
                     …

Branch cost reduction - Software solutions (1)
• Loop unrolling
  • The loop is replicated k times and the branch instructions between replicas are eliminated
  + Increases the distance between consecutive branches

[Figure: a loop body ending with "b … loop" next to its unrolled version, where the body is replicated (e.g., 3 unrollings) and only the last copy ends with "b … loop".]

Branch cost reduction - Software solutions (2)
• Instruction scheduling/Delay slot
  • After the branch there may be 1 or more delay slots

    loop: instr1
          instr2
          instr3
          b … loop
          instr4        <- DELAY SLOT
          instr5

  behaves like:

    loop: instr1
          instr2
          instr3
          instr4
          b … loop
          instr5

  • “instr4” can be chosen so that it does useful work inside the loop iteration
  + the branch penalty is then hidden!
  + no need to flush instructions from the pipeline (simplified hardware)
  - the software MUST always find something to put as “instr4” (if no instruction is available, a NOP must be used as instr4)
  - Potentially dangerous if not properly managed by the programmer/compiler
  - In practice, the delay slot CANNOT ALWAYS be exploited

Branch cost reduction - Hardware solutions
• Anticipate the target calculation (a dynamically calculated address) and the branch decision (see next slide)
• Branch prediction & speculative execution (few slides ahead)
  • Try to predict WHERE to branch and IF to branch
  • The instruction fetch continues from the predicted address, so the execution becomes speculative
    - the prediction needs to be validated (if it is wrong, which is called a misprediction, we must undo what was done: safe recovery)


Anticipate the decision: passing from 2 to 1 delay slot

[Figure: 5-stage MIPS pipeline datapath. Labels: PC, instruction memory, registers, sign extend, shift left 2, "=" comparator, ALU, data memory, muxes, pipeline registers F/D, D/X, X/M, M/W, control (EX, M, WB), hazard detection unit, forwarding unit, F.Flush. The target calculation (BTA) and the decision (T/N) are highlighted.]

Example: 5-stage pipeline MIPS, BRANCH PENALTY = 1 CYCLE

Branch Prediction
• Idea: try to predict the outcome of the jump
• Choice of branch instructions to predict
  • Preferably: calibrate the predictor depending on the functionality of the branch or the context in which the branch is located
• Objective of Branch Prediction
  • Minimize the branch penalty
  • Maximize instruction throughput
  • Maximize the accuracy of the predictions
• Branch prediction Accuracy: % of correctly predicted branch instructions

  ABP = (no. of correctly predicted branch instr.) / (total no. of branch instr.)

• Elements of action:
  B1) Branch target prediction
  B2) Prediction of the outcome of the branch condition
• Prediction validation
  • It is necessary to insert mechanisms to:
    - Test ‘a posteriori’ if the prediction was right or wrong (misprediction)
    - Restore the pipeline situation as before the prediction (safe recovery)


B1) Branch Target Prediction
• BTB (Branch-Target Buffer) or BTAC (Branch-Target Address Cache)
  • Small cache that
    - In the “tag” field holds the Branch Instruction Address (BIA)
    - In the “data” field holds the PREDICTED Branch Target Address (BTA)
  • Located in the Fetch stage
  • At the first execution of a branch instruction there is a miss, so a static prediction is used
    - A BTB entry is allocated (similarly to what happens in a cache)
  • Then (at a next hit):
    + the (predicted) next instruction address is available even BEFORE FETCHING the branch
    + the BTB automatically keeps the branch instructions that are more recently or more frequently used

[Figure: the PC (Program Counter) accesses the I-cache and, in parallel, the BTB; each BTB entry has a BIA field (“the TAG”) and a BTA field (“the DATA”); on a hit, the predicted target address is used as the new PC.]

Note: depending on the cache type chosen to implement the BTB (direct access, set-associative, fully-associative), the tag field will contain ALL or SOME of the BIA bits
Note2: we can use the hit rate HBTB to characterize the BTB performance

BTB update
• Having predicted the target, the branch is processed anyway
  • After a few cycles we will know the actual (calculated) address of the target. Let's call it “real-BTA” (real Branch Target Address)
• The calculated target address is compared with the BTA stored in the BTB
  • If real-BTA == BTA, no action is necessary
  • If real-BTA != BTA: MISPREDICTION
    - In this case we need to (safe recovery):
      - Undo what was done on the wrong path
      - Jump to the correct target
• After a branch, in every case, we need to update the BTB
  • In the BIA field we put the previous PC (old-PC)
  • In the BTA field we put real-BTA
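The lookup and update rules above can be tried out with a small model. This is a sketch under stated assumptions, not the organization of any real processor: the BTB is direct-mapped, instructions are 32 bits wide (so the two low address bits are dropped), and the class name `BTB` and its parameter `index_bits` are hypothetical.

```python
class BTB:
    """Direct-mapped Branch-Target Buffer: tag = upper BIA bits, data = predicted BTA."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (tag, bta) or None

    def _split(self, bia):
        word = bia >> 2                              # 32-bit instructions
        return word & ((1 << self.index_bits) - 1), word >> self.index_bits

    def lookup(self, pc):
        """Return the predicted target on a hit, None on a miss."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, old_pc, real_bta):
        """After the branch is resolved: BIA field <- old PC, BTA field <- real-BTA."""
        index, tag = self._split(old_pc)
        self.entries[index] = (tag, real_bta)


btb = BTB()
first = btb.lookup(0x100C)          # first execution: miss
btb.update(0x100C, 0x374C)          # branch resolved, entry allocated
second = btb.lookup(0x100C)         # next execution: hit, predicted target
mispredicted = second != 0x374C     # compare with real-BTA for validation
```

With a direct-mapped organization the tag holds only SOME of the BIA bits (the index bits are implicit), which matches the note about the tag field depending on the cache type.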


BTB location in the pipeline

[Figure: front end of the pipeline with PC, instruction memory, F/D and D/X registers, register file (RS1, RS2, RD), sign extend, shift left 2 and "=" comparator (target calculation and decision). The Bpred+BTB block, accessed with the PC, outputs BTB hit, BTA and T/N to the PC mux; the decode stage produces real-BTA and real-T/N.]

Note: for the sake of simplicity, the misprediction and safe recovery logic is NOT represented here
Bpred is the logic block that tries to predict the branch outcome (T/N)


Branch Prediction (second part)

B2) Prediction of the Branch Condition outcome
• Based on the branch condition outcome, we go either to the target address (Taken = T) or to the next instruction (Not-taken = N)
• There are several ways to make this type of prediction
• STATIC NOT-TAKEN (always-not-taken) PREDICTION
  • Assumes that the branch is NOT taken (N)
  + Simple to implement
  + No branch penalty when the prediction is right (the next sequential instructions are already in the pipeline)
  • Used in the processors: Motorola 68020 (1984), VAX-11/780 (1977)
  - Not very efficient (ABP < 40%)

[Figure: BPRED ALWAYS NOT-TAKEN: the block receives the 32-bit PC and its T/N output is fixed to 0.]

Branch condition prediction: static techniques
• STATIC TAKEN (always-taken) PREDICTION
  • Assumes that the branch is always TAKEN (T)
  + Simple to implement
  - 1 cycle branch penalty (the target instructions can be fetched only AFTER recognizing, in the decode stage, that the instruction is a branch)
  + Works well for loops (for an n-iteration loop, at most 1/n of the predictions are wrong)
  - Not very efficient (ABP < 70%)

[Figure: BPRED ALWAYS TAKEN: the block receives the 32-bit PC and its T/N output is fixed to 1.]

Taken-branch Probability

[Figure: probability tree of taken / not taken branches. Benchmarks: SPEC2000.]
• Total probability of taken branches: 60% - 70%
• Branches with offset < 0: taken with probability 90%
• Branches with offset > 0: taken with probability 50%

Static Prediction: Compiler based
• Requires that the branch instructions have a reserved bit that can be set by the compiler in the following way:
  - This bit is set to 1 if TAKEN is considered the most probable outcome
  - This bit is set to 0 if NOT-TAKEN is considered the most probable outcome
• The compiler can decide the value of this bit according to the type of instruction or based on the result of profiling the program
  - This technique is static in the sense that the prediction is fixed for every execution of the corresponding branch instruction (even if the input data changes in a subsequent run)
  - It has been used in the Motorola 88110 and PowerPC 601 [Smith95]


Static Prediction: direction-based (BTFN)
• At compile time, the compiler calculates the OFFSET of the jump (= BTA-LC)
  - Note: the actual PC is not known at compile time, so the Location Counter LC is used
• If OFFSET > 0 then a special branch instruction is chosen
  - e.g., FB or Forward-Branch
  - This will be predicted not-taken at run time
• If OFFSET < 0 then another special branch instruction is chosen
  - e.g., BB or Backward-Branch
  - This will be predicted taken at run time
• This technique is also called Backward-Taken/Forward-Not_taken (BTFN)
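The BTFN rule reduces to a sign test on the offset, as in the following sketch. The function name `btfn_predict` is hypothetical; the slides do not say what happens for a zero offset, so predicting not-taken in that case is an assumption made here.

```python
def btfn_predict(location_counter, bta):
    """Backward-Taken / Forward-Not-taken static prediction.

    Returns True (predict taken) when OFFSET = BTA - LC is negative.
    A zero offset is treated as not-taken (assumption).
    """
    offset = bta - location_counter
    return offset < 0


forward = btfn_predict(0x100C, 0x374C)    # forward branch: predicted not-taken
backward = btfn_predict(0x2000, 0x1FF0)   # backward branch (loop): predicted taken
```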


Static prediction comparison: always-taken, always-not_taken, direction-based

Example: benchmarks SPEC2000

[Figure: bar chart of ABP (from 0 to 1) for the benchmarks dirent, doom, fmath, gcc, go, llong, lswlr, math, printf; series: dir, taken, nottaken.]

Dynamic BP: Bimodal Predictor
• Branches tend to repeat their behavior (T or N)
• We can say that branches are "biased" towards T or N: hence the Bimodal Predictor [Smith81]
• The PHT (Pattern History Table) is indexed by p bits of the BIA (Branch Instruction Address), or works in tandem with the BTB
• Each entry holds the j-bit state of a saturating counter, which records the predominant behavior (pattern) of the branch (T or N)
• The saturating counter typically has two bits
• An FSM (Finite State Machine, e.g., Moore or Mealy) turns the state into the prediction: Taken or Not-taken

[Figure: 2-bit saturating counter with states 0/N, 1/N, 2/T, 3/T and taken / not taken transitions; one state is marked as the start state.]

The states are labeled from 0 to 3, and the prediction is Taken if the state is >= 2. A “Taken” branch (T) increments the counter; a Not-taken branch (N) decrements it.

Updating the Status of the PHT -- phase 2
• Once the actual outcome of the branch is available, the counter is incremented in case of Taken, decremented otherwise

[Figure: the p bits of the BIA select the PHT entry; the FSM combines the j-bit old state with the real T/N outcome and writes the j-bit new state back into the PHT.]
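Both phases of the bimodal predictor (prediction and update) fit in a short class. This is a minimal sketch, assuming 32-bit instructions (the two low BIA bits are dropped before indexing) and an initial counter state of 1; the class name `BimodalPredictor` and the parameters `p` and `init` are hypothetical choices for illustration.

```python
class BimodalPredictor:
    """PHT of 2-bit saturating counters indexed by p bits of the BIA."""

    def __init__(self, p=4, init=1):
        self.mask = (1 << p) - 1
        self.pht = [init] * (1 << p)     # states 0..3; predict Taken if state >= 2

    def _index(self, bia):
        return (bia >> 2) & self.mask    # drop the byte offset of 32-bit instructions

    def predict(self, bia):
        return self.pht[self._index(bia)] >= 2

    def update(self, bia, taken):
        i = self._index(bia)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)   # saturate at 3
        else:
            self.pht[i] = max(0, self.pht[i] - 1)   # saturate at 0


# A loop-closing branch: taken 9 times, then not taken; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
predictor = BimodalPredictor()
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x100C) != taken:
        mispredictions += 1
    predictor.update(0x100C, taken)
```

With these assumptions the branch is mispredicted once while the counter warms up and then once per loop exit: the single not-taken outcome does not flip the prediction for the next execution of the loop.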

DYNAMIC BP based on previous history
• The prediction is based on previous branch outcomes (T / N)
• Parameters
  • How many previous outcomes do we have to keep track of? Number of history bits k
  • Which prediction should be associated with a certain pattern? Algorithm A
• Branch History Shift Register (BHSR)
  • Each BTB entry may have an associated register which holds the k previous branch outcomes (Sj==1 means T, Sj==0 means N)
  • It works as a shift register: for every new outcome of the corresponding branch, the register is updated (in order to hold current information)

[Figure: SHIFT REGISTER which holds the latest k outcomes of a branch. Before the update the BHSR holds S1 S2 S3 … Sk-1 Sk; after the update it holds S2 S3 S4 … Sk Sk+1, and S1 is shifted out.]
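The shift-register behavior can be written as a one-line update. In this sketch the BHSR is an integer whose least significant bit is the most recent outcome; that bit ordering and the function name `bhsr_update` are assumptions made for illustration.

```python
def bhsr_update(bhsr, taken, k):
    """Shift the new outcome (1 = T, 0 = N) into a k-bit history register.

    The oldest outcome falls off the left end; the newest one enters
    at the least significant bit.
    """
    return ((bhsr << 1) | int(taken)) & ((1 << k) - 1)


# k = 3, outcomes T, T, N, T starting from an empty history.
history = 0
for outcome in (True, True, False, True):
    history = bhsr_update(history, outcome, k=3)
```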

Dynamic BP: prediction algorithm
• Prediction Algorithm
  • Starting from a certain value of the BHSR, the algorithm is responsible for generating the prediction (T or N)
  • The typical solution is to use a Finite State Machine (FSM), i.e. a Mealy or Moore FSM, that given a k-bit input produces a 1-bit output

[Figure: (1) PREDICTION: the k bits of the BHSR feed the FSM, which outputs the prediction T (Taken) or N (Not-taken); (2) STATUS UPDATE: the FSM updates the BHSR.]

History based BP
• The BTB is extended to associate a BHSR (Branch History Shift Register) with each entry (branch or set of branches)
• When the PC hits in the BTB, the bits of the BHSR are sent to the FSM

[Figure: the PC accesses the I-cache and the BTB; each entry holds the BIA field (a bits), the BTA field (a bits), which gives the speculative target address, and a k-bit BHSR; the Branch History Table (BHT) is the array of BHSRs; the selected BHSR feeds the FSM, which predicts Taken or Not-taken.]

FSM options: k=1
• With only 1 bit of history (k = 1), a possible FSM is the following
  • It remembers the last direction taken
  • If the prediction is correct, it remains in the current state
  • If the prediction is not correct, it changes state (and outcome)

[Figure: two-state FSM with states T/T and N/N (history/prediction); the arcs are labeled with the actual direction (T or N); one state is marked as the initial state.]
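A quick experiment shows the weakness of a single history bit on loop branches. The sketch below simply predicts "same as last time"; the function name `one_bit_mispredictions` and the initial state (Not-taken) are assumptions.

```python
def one_bit_mispredictions(outcomes, last=False):
    """Count the mispredictions of a 1-bit (last-outcome) predictor."""
    misses = 0
    for taken in outcomes:
        if taken != last:      # the prediction is the previous outcome
            misses += 1
        last = taken           # remember the last direction
    return misses


# Loop branch: taken 9 times, then not taken; the loop is executed twice.
loop_branch = ([True] * 9 + [False]) * 2
misses = one_bit_mispredictions(loop_branch)
```

Each execution of the loop costs two mispredictions: one at the loop exit and one at the first iteration of the next execution, because the exit flipped the stored bit.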

2-bit FSM
• With two history bits (k=2), this is a possible FSM
  • The prediction changes in case of two consecutive mispredictions
  • Otherwise it repeats the same prediction

[Figure: four-state FSM with states TT/T, NT/N, TN/T, NN/N (history/prediction); the arcs are labeled with the actual direction (T or N); one state is marked as the initial state.]

How many history bits? (Lee & A.Smith) [Lee84]
• 26 programs from six different types of workload (WL)
  - Three different machines (IBM 370, DEC PDP-11, CDC 6400)
  - On average, 67.6% of the branches are "taken"
  - Prediction based on the opcode: ABP = 55.2% - 79.8%
  - Prediction that uses 1 bit of history: ABP = 79.7% - 96.5%
  - Prediction that uses 2 bits of history: ABP = 83.4% - 97.5%
  - Prediction that uses 3 bits of history: ABP = 83.5% - 97.7%
  - Prediction that uses 4 bits of history: ABP = 83.7% - 98.1%
  - Prediction that uses 5 bits of history: ABP = 83.9% - 98.2%

2 bits are sufficient for an effective prediction

BTB: test with A=1…256, C=1…4KB, WL=IBM/CPL-mix
• E.g., with a 4-way set-associative BTB (A=4) with 128 sets (C=512B), the hit rate is HBTB = 86.5%
• On a hit A’BP = 93.8%, but among these branches PCH = 4.2% change target (e.g., “case” statements)
• Combining ABP and HBTB
  • ABP = (A’BP - PCH) * HBTB = (0.938 - 0.042) * 0.865 = 0.775, i.e., about 77.5% of the branches are correctly predicted
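The combined accuracy can be checked numerically. The variable names below are hypothetical; the three input values are the ones quoted above.

```python
a_bp_on_hit = 0.938   # A'BP: accuracy of the prediction on a BTB hit
p_ch = 0.042          # PCH: fraction of those branches that change target
h_btb = 0.865         # HBTB: BTB hit rate

# Only branches that hit in the BTB and keep their target are predicted correctly.
a_bp = (a_bp_on_hit - p_ch) * h_btb
print(f"ABP = {a_bp:.3f}")
```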


Type of Algorithm (Optimal FSM) [Nair95]
• Searching for an optimal algorithm for a 2-bit predictor
  • With 2 bits of state there are 2^20 possible FSMs
  • The algorithms have been tested with the SPEC-89 suite on an IBM RS/6000
  • After eliminating the uninteresting cases, 5248 FSMs remain
• Conclusions
  1) Identification of the optimal FSM (that maximizes ABP)
     - The accuracy of these predictors varies from 87.1% to 97.2%
  2) Comparison with the “saturating counter” FSM
     - In three cases the optimal predictor coincides with the saturating counter
     - In the remaining cases, the optimal FSM is very close to the saturating counter

[Figure: 2-bit saturating counter with states 0/N, 1/N, 2/T, 3/T and taken / not taken transitions; one state is marked as the start state.]

The states are labeled from 0 to 3, and the prediction is Taken if the state is >= 2. A “Taken” branch (T) increments the counter; a Not-taken branch (N) decrements it.

Optimal Algorithm for 6 benchmarks

Benchmark   Optimal   Counter   Optimal FSM
spice2g6    97.2      97.0      (state diagram)
doduc       94.3      94.3      Saturating Counter
gcc         89.1      89.1      Saturating Counter
espresso    89.1      89.1      Saturating Counter
li          87.1      86.8      (state diagram)
eqntott     87.9      87.2      (state diagram)

[Figure legend for the state diagrams: branch taken, branch not taken, start state, predict taken, predict not taken.]

• The results are in accordance with other previous studies

Limits of the previous predictors
• The predictions are made by considering the history of a single branch
  • Is there any influence from the other branches?
  • Experimentally, it is found that a given branch IS influenced by the preceding branches
• Previously, the dynamic context of the branch was not considered
  • In particular, the path followed before arriving at a given branch is relevant
  • The prediction algorithm may be adapted depending on the path the execution comes from

More accurate predictions are obtained by taking into account the history of other branches and adapting the algorithm


“Two-level adaptive” Branch Prediction (1)
• Scheme introduced by Yeh and Patt [Yeh91][Yeh92]
  • On 9 SPEC-89 benchmarks (floating point (fp): doduc, fpppp, matrix300, spice2g6, tomcatv; integer (int): eqntott, espresso, gcc, li) with a M88110 simulator, the technique reaches on average ABP=97%, while other techniques reach only 94.4%
• The implementation consists of two sets of tables
  1) To record the history of the outcomes of a given jump (like the other schemes): BHT (Branch History Table), where each branch has its BHSR
  2) To record the behavior on a given history pattern: each branch has a PHT (Pattern History Table) that maintains (for each possible pattern) the current state of the FSM


“Two-level adaptive” Branch Prediction (2)
• Prediction scheme (phase 1)
  - A certain number p of bits of the address of the branch (BIA) is used to select the PHT related to the branch (or set of branches)
  - A number m of bits of the branch address (BIA) is used to select, within the BHT, the BHSR related to the branch (or set of branches); this BHSR is used to index the corresponding PHT, to get the j bits of the current state of the FSM
• The output logic of the FSM produces the prediction

[Figure: the Branch Instruction Address (BIA, a bits) provides m bits to select one of the 2^m k-bit entries of the Branch History Table BHT (array of BHSRs, e.g. content 1 1 1 0 1) and p bits to select one of the 2^p sets of the Pattern History Table (PHT), each with 2^k entries of j bits; the k bits of the BHSR index the selected PHT; the j PHT bits (old state) go to the FSM output logic, which produces the prediction. Constraint: p ≤ m.]

If p == m then each BHSR has a corresponding PHT, which is indexed by the BHSR content
If p and/or m are == a, then the BHSR or PHT information exists for each branch
In general, the selection made through p and/or m can be based not only on the address but also, for example, on the type of opcode: in that case, instead of 2^m, there are s BHSRs

“Two-level adaptive” Branch Prediction (3)
• Predictor Update (phase 2)
  • Done when the real branch target is finally known (you know the target and consequently the direction with certainty)
  • 1a) The BHSR is updated by a left shift, inserting the outcome T or N
  • 1b) The PHT entry is updated with the new current state as determined by the state logic of the FSM (depending on the outcome T or N)

[Figure: same structure as in phase 1 (BHT with 2^m entries of k bits, PHT with 2^p sets of 2^k entries of j bits, p ≤ m). (1a) the branch direction (T or N) is shifted into the selected BHSR; (1b) the FSM state logic combines the j-bit old state with the branch direction and writes the new state into the PHT.]
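The two phases (prediction, then update of BHSR and PHT) can be put together in a compact model. This is an illustrative sketch, not the Yeh and Patt implementation: it assumes 32-bit instructions, 2-bit saturating counters (j = 2) as the FSM, selection of BHSR and PHT by the low bits of the BIA, and an initial counter state of 1. The class name `TwoLevelPredictor` and its parameters are hypothetical; k, m and p have the meaning used above.

```python
class TwoLevelPredictor:
    """2^m BHSRs of k bits (first level), 2^p PHTs of 2^k 2-bit counters (second level)."""

    def __init__(self, k=4, m=0, p=0, init=1):
        self.k, self.m, self.p = k, m, p
        self.bht = [0] * (1 << m)
        self.pht = [[init] * (1 << k) for _ in range(1 << p)]

    @staticmethod
    def _select(bia, bits):
        return (bia >> 2) & ((1 << bits) - 1)    # 0 when bits == 0 (global table)

    def predict(self, bia):
        history = self.bht[self._select(bia, self.m)]
        return self.pht[self._select(bia, self.p)][history] >= 2

    def update(self, bia, taken):
        b = self._select(bia, self.m)
        history = self.bht[b]
        table = self.pht[self._select(bia, self.p)]
        # (1b) new FSM state for the pattern that was used for the prediction
        table[history] = min(3, table[history] + 1) if taken else max(0, table[history] - 1)
        # (1a) left shift of the BHSR, inserting the outcome
        self.bht[b] = ((history << 1) | int(taken)) & ((1 << self.k) - 1)


# An alternating branch (T, N, T, N, ...) predicted with m = 0, p = 0 and k = 2.
predictor = TwoLevelPredictor(k=2)
outcomes = [True, False] * 10
wrong = []
for taken in outcomes:
    wrong.append(predictor.predict(0x100C) != taken)
    predictor.update(0x100C, taken)
mispredictions = sum(wrong)
```

With m = 0 and p = 0 the model has one global BHSR and one global PHT; choosing m > 0 gives per-address histories. In this run only the first few predictions are wrong: once the counters for the patterns 01 and 10 are trained, the alternating branch is predicted correctly, which a single saturating counter per branch cannot do.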

“Two-level adaptive” Branch Prediction (4)
• This scheme is very general; in a specific implementation, choices can be made for 1) m (or s), 2) p and 3) the algorithm

1) Implementation of the BHT (the BHSR always has k bits)
  • G (Global, m=0): one BHSR shared by all BIA addresses
  • P (Per-address or individual, 0<m<=a): each of the 2^m BHSRs has an associated group of addresses identified by the m least significant bits of the BIA
  • S (per-Set, not m, but s): each of the BHSRs has an associated set of branches
2) Implementation of the PHT
  • g (Global, p=0): a single PHT is used for all BIA addresses
  • p (Per-address or individual, p=a): each PHT is dedicated to a single BIA
  • s (Shared, 0<p<a): each PHT is dedicated to a set of BIA addresses
3) Implementation of the algorithm
  • When the state of the FSM is dynamically updated, we say that the algorithm is Adaptive and it is denoted by A

s = number of branch sets; it can be determined by the opcode of the instructions, by a class of branches (identified by the compiler), or by the elements addressed by the least significant bits of the BIA

Possible implementations of 2-level predictors

Name   Description
GAg    Global Adaptive branch prediction using one global pattern history table
GAs    Global Adaptive branch prediction using per-set pattern history tables
GAp    Global Adaptive branch prediction using per-address pattern history tables
PAg    Per-address Adaptive branch prediction using one global pattern history table
PAs    Per-address Adaptive branch prediction using per-set pattern history tables
PAp    Per-address Adaptive branch prediction using per-address pattern history tables
SAg    Per-Set Adaptive branch prediction using one global pattern history table
SAs    Per-Set Adaptive branch prediction using per-set pattern history tables
SAp    Per-Set Adaptive branch prediction using per-address pattern history tables


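2-level schemes with Global history

[Figure: GAg, GAs, GAp. A single k-bit BHSR indexes the PHT(s); in GAs p address bits select the PHT, in GAp the a address bits select it.]
• GA: all branches use the same BHSR to hold the history
• g: all branches use the same PHT
• s: each PHT is shared by a set of branches
• p: each branch has an associated PHT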

2-level schemes with Per-address history

[Figure: PAg, PAs, PAp. m address bits select one of the BHSRs, whose k bits index the PHT(s); in PAs p address bits select the PHT, in PAp the a address bits select it.]
• PA: each branch has its own history; there are 2^m BHSRs
• g: all branches use the same PHT
• s: each PHT is shared by a set of branches
• p: each branch has an associated PHT

2-level schemes with per-Set history

[Figure: SAg, SAs, SAp. s selects one of the BHSRs, whose k bits index a PHT in the PHT group selected by s; in SAs p address bits select the PHT within the group, in SAp the a address bits select it.]
• SA: the BHSRs are shared by a set of branches; there are s such sets
• g: the PHT selected by s is shared by all the branches that belong to the set selected by s
• s: within the PHT group selected by s, a PHT is selected by p; this PHT is then shared by the branches associated with p
• p: the PHT group selected by s has a PHT for each branch

Cost of implementation

Scheme name        BHSR elements   PHT tables   Cost estimation (bits)
GAg (k)            1               1            k + 2^k * j
GAs (k, 2^p)       1               2^p          k + 2^p * 2^k * j
GAp (k)            1               2^a          k + 2^a * 2^k * j
PAg (k)            2^m             1            2^m * k + 2^k * j
PAs (k, 2^p)       2^m             2^p          2^m * k + 2^p * 2^k * j
PAp (k)            2^m             2^a          2^m * k + 2^a * 2^k * j
SAg (k)            s               s x 1        s * k + s x 2^k * j
SAs (k, s x 2^p)   s               s x 2^p      s * k + s x 2^p * 2^k * j
SAp (k)            s               s x 2^a      s * k + s x 2^a * 2^k * j

• The history length is k bits
• s = number of branch sets; a set can be determined by the opcode of the instruction, by a class of branch (identified by the compiler), or by the elements addressed by the least significant bits of the BIA

Cf. [Yeh93-isca]
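The cost formulas can be checked numerically with a small Python sketch. The function name and argument layout are assumptions of this example, and j is taken to be the number of bits of each PHT entry (as in the Alpha 21264 slides, where j = 3 corresponds to a 3-bit counter).

```python
def predictor_cost_bits(scheme, k, j, m=0, p=0, a=0, s=1):
    """Cost in bits of a two-level scheme named like 'GAg', 'PAs', 'SAp'.

    k: history length, j: bits per PHT entry (assumed meaning),
    m: log2(number of BHSRs) for PAx, p / a: log2(number of PHTs) for xAs / xAp,
    s: number of branch sets for SAx.
    """
    history_registers = {"G": 1, "P": 1 << m, "S": s}[scheme[0]]
    pattern_tables = {"g": 1, "s": 1 << p, "p": 1 << a}[scheme[2]]
    if scheme[0] == "S":
        pattern_tables *= s            # SAx: one PHT group per branch set
    return history_registers * k + pattern_tables * (1 << k) * j


if __name__ == "__main__":
    print(predictor_cost_bits("GAg", k=12, j=2))         # global predictor, k=12, j=2
    print(predictor_cost_bits("PAg", k=10, j=3, m=10))   # local predictor, m=k=10, j=3
```

With the Alpha 21264 parameters quoted later in the slides, the two calls give the sizes of its global predictor and of its local predictor (history table plus prediction table).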

Performance of Global history (GAx) scheme

[Figure: average prediction accuracy (y axis, 0.88 to 1) versus p = log2(number of PHTs) (x axis, 0 to 10) for the GAx schemes. Curves: fp and int, each with a 12-, 8- and 4-bit BHSR (k)]

From: [Yeh93]. p = 0: GAg; p ≠ 0: GAs; p → ∞: GAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv

Performance of Per-address history (PAx) scheme


[Figure, first plot: average prediction accuracy (y axis, 0.88 to 1) versus p = log2(number of PHTs) (x axis, 0 to 10) for the PAx schemes. Curves: fp and int, each with a 12-, 8- and 4-bit BHSR (k)]

[Figure, second plot: average prediction accuracy (y axis, 0.88 to 1) versus k = branch history length in bits (x axis, 2 to 18) for the PAx schemes. Curves: fp and int, each with 256 PHTs (p=8), 16 PHTs (p=4) and 1 PHT (p=0)]

From: [Yeh93]. p = 0: PAg; p ≠ 0: PAs; p → ∞: PAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv

Performance of per-Set history (SAx) scheme

[Figure: average prediction accuracy (y axis, 0.88 to 1) versus p = log2(number of PHTs in each set) (x axis, 0 to 10) for the SAx schemes. Curves: fp and int, each with a 12-, 8- and 4-bit BHSR (k)]

From: [Yeh93]. p = 0: SAg; p ≠ 0: SAs; p → ∞: SAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv

Predictor comparison (SPECint95, 8KB)


From: Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press 2008

“Hyb”=Hybrid predictor (McFarling, see next slides)

Predictor comparison (SPECint95, 64KB)


From: Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press 2008

Branch Folding
• Technique that reduces the "misprediction penalty"
• The idea is to "replace" the jump with the target instruction ("branch folding": the jump is folded away and disappears)
• The technique is used in the PowerPC-601 (1993), which analyzes the last 4 instructions in the instruction queue (already fetched)

STANDARD SITUATION: correct prediction, branch penalty = 0

  B      F D X M W
  IP       F D X M W
  IP+1       F D X M W
  IP+2         F D X M W

STANDARD SITUATION: wrong prediction, misprediction branch penalty = 2 (O = pipeline bubble)

  B      F D X M W
  IP       F D O O O
  IP+1       F O O O O
  IT           F D X M W
  IT+1           F D X M W

BRANCH FOLDING APPLIED TO «JUMPS»: if the jump is unconditional it is always taken (S = squashing: the J is abandoned and the fetch continues from the target address (BTA)), branch penalty = 0

  J      F S
  IT       F D X M W
  IT+1       F D X M W
  IT+2         F D X M W

Advanced Branch Folding [Kavi97]
• Insert the Target INstruction (TIN) besides the Branch Target Address (BTA); the TIN is the instruction that would be fetched in case of misprediction
• In this way the penalty can be reduced to 1
• It can become -1 for the jumps!

ADVANCED BRANCH FOLDING APPLIED TO «BRANCH»: conditional branch (S = squashing), branch penalty = 1

  B         F D X M W
  IP          F D O O O
  IP+1          F S
  TIN (IT)        D X M W
  IT+1            F D X M W

ADVANCED BRANCH FOLDING APPLIED TO «JUMP»: unconditional jump (S = squashing), branch penalty = -1

  J         F S
  TIN (IT)    D X M W
  IT+1        F D X M W
  IT+2          F D X M W

Correlation Predictors: gselect and gshare
• To generate the PHT index, Pan and (later) McFarling suggested using both the address of the branch (BIA) and the global history (BHSR)
  - The simplification is NOT to use a BHT with several elements
• gselect [Pan92]
  - Some bits of the BIA are combined with bits of the global history
• gshare [McFarling93]
  - The bits of the BIA are "mixed" (hashed) with those of the global history
  - The "mixing" function is usually the XOR operation

Branch address (BIA)   Global history (BHSR)   PHT index gselect 4/4    PHT index gshare 8/8
                                               = (BIA7-4, BHSR3-0)      = BIA7-0 XOR BHSR7-0
00000000               00000001                00000001                 00000001
00000000               00000000                00000000                 00000000
11111111               00000000                11110000                 11111111
11111111               10000000                11110000                 01111111
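The two index functions can be written in a few lines of Python. This is only a sketch that follows the bit layout of the table above, (BIA7-4, BHSR3-0) for gselect 4/4 and BIA7-0 XOR BHSR7-0 for gshare 8/8; the function names and default widths are assumptions of this example.

```python
def gselect_index(bia, bhsr, m=4, k=4):
    """gselect m/k: m address bits in the upper part, k history bits in the lower part."""
    address_part = ((bia >> k) & ((1 << m) - 1)) << k   # BIA bits k+m-1 .. k
    history_part = bhsr & ((1 << k) - 1)                # BHSR bits k-1 .. 0
    return address_part | history_part


def gshare_index(bia, bhsr, n=8):
    """gshare n/n: XOR of the low n address bits with the low n history bits."""
    return (bia ^ bhsr) & ((1 << n) - 1)


if __name__ == "__main__":
    rows = [(0b00000000, 0b00000001), (0b00000000, 0b00000000),
            (0b11111111, 0b00000000), (0b11111111, 0b10000000)]
    for bia, bhsr in rows:
        print(f"{bia:08b} {bhsr:08b} "
              f"{gselect_index(bia, bhsr):08b} {gshare_index(bia, bhsr):08b}")
```

The last two rows show why gshare can separate cases that gselect maps to the same entry: the history bit 7 is lost by gselect 4/4 but still changes the gshare index.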


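gselect [Pan92]
• m bits from the branch address (BIA) are juxtaposed (concatenated) with k bits of the BHSR
• Very simple scheme: 1 BHSR and 1 small PHT
[Figure: BHSR (k bits, BHSR k-1..0) and BIA (m bits, BIA m-1..0) are concatenated into a (k+m)-bit index (BHSR k-1..0, BIA m-1..0) that selects the prediction from a PHT of 2^(k+m) x j entries]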


gshare [McFarling93]
• m bits of the branch address (BIA) are XOR-ed with the BHSR
• Very simple scheme: 1 BHSR and 1 small PHT
• Used in the Alpha 21264
• Usually k == m
[Figure: BHSR (k bits, BHSR k-1..0) and BIA (m bits, BIA m-1..0) are XOR-ed into an index of max{k, m} bits that selects the prediction from a PHT of 2^max{k, m} x j entries]


Performance of gshare, gselect, GAg

[Figure: predictor accuracy in % (y axis, 84 to 98) versus predictor size in bytes (x axis, 32 to 64K) for gshare, gselect and global]

From: [McFarling93]

“global” refers to the GAg predictor of Yeh and Patt

The benchmarks are the same 9 SPEC-89 used in the work of [Yeh92] and [Yeh93-ics]

Competitive Predictors (“Tournament”)
• Some predictors work well with certain branch types
  - The first proposal to use two competitive predictors is by McFarling [McFarling93] (bimodal + gshare)
• Dynamically select among the predictions of the available predictors
  - Use their history to select a predictor
• Example: Alpha 21264
  - Bpred1 = PAg, Bpred2 = GAg. Total predictor size: 29K bits
  - ABP = 97.4% (SPEC89 average), 99.9% (SPECfp95 average), 99% (SPECint95 average)
[Figure: block diagram with the BIA, a BHT, Bpred 1, Bpred 2, the path history, the predictor selection logic and a MUX that outputs the prediction]


References
• [Smith95] J. E. Smith, S. Weiss, “PowerPC 601 and Alpha 21064: A Tale of Two RISCs”, IEEE Computer, June 1995, pp. 46-48.
• [Smith81] J. E. Smith, “A study of branch prediction strategies”, in Proc. of the 8th Annual Symposium on Computer Architecture, May 1981, pp. 135-148.
• [Lee84] J. K. F. Lee, A. J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design”, IEEE Computer, Jan. 1984, pp. 6-22.
• [Nair95] R. Nair, “Optimal 2-bit branch predictors”, IEEE Transactions on Computers, May 1995, pp. 698-702.
• [McFarling93] S. McFarling, “Combining branch predictors”, Technical Report TN-36, Digital Western Research Laboratory, June 1993.
• [Pan92] S. T. Pan, K. So, J. T. Rahmeh, “Improving the accuracy of dynamic branch prediction using branch correlation”, in Proc. of ASPLOS V, Boston, MA, October 1992, pp. 76-84.
• [Yeh91] T. Yeh, Y. N. Patt, “Two-Level Adaptive Training Branch Prediction”, in Proc. of the International Symposium on Microarchitecture, Dec. 1991, pp. 51-61.
• [Yeh92] T. Yeh, Y. N. Patt, “Alternative implementations of two-level adaptive branch prediction”, in Proc. of the 19th Annual ISCA, May 1992, pp. 124-134.
• [Yeh93] T. Yeh, Y. N. Patt, “A comparison of dynamic branch predictors that use two levels of branch history”, in Proc. of the 20th Annual ISCA, May 1993, pp. 257-266.
• [Kavi97] K. M. Kavi, “Branch folding for conditional branches”, IEEE CS Technical Committee on Computer Architecture (TCCA) Newsletter, Dec. 1997, pp. 4-7.
• [Oklobzidja08] V. G. Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press, 2008.


Multiple predictions
• Branch instructions can be close to one another
• Higher performance can be achieved by fetching multiple instructions
  - In such a case we need to predict multiple targets in the same cycle
• In the next example we try to have two predictions
• We assume that we have a 2-level GAg predictor
  - (GAg has the advantage of not needing a BIA)
• With k bits we select an element of the PHT (primary prediction). With the least significant k-1 bits, we select the next two possible elements of the PHT, and we use the primary prediction to select between the two outcomes (secondary prediction)
[Figure: the global BHSR (k bits) indexes the PHT; the k-bit index selects the FSM that gives the primary prediction, the k-1 least significant bits select two candidate FSMs, and a MUX driven by the primary prediction (0/1) outputs the secondary prediction]
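The double lookup described above can be sketched in Python as follows. The function name and the use of 2-bit counters (values 2 and 3 meaning "taken") are assumptions of this example.

```python
def dual_prediction(pht, bhsr, k):
    """Return (primary, secondary) predictions from a GAg PHT of 2-bit counters.

    pht: list of 2**k counters, bhsr: current global history (k bits).
    """
    mask = (1 << k) - 1
    primary = pht[bhsr & mask] >= 2

    # The history after the first branch is (k-1 low bits of BHSR, first outcome):
    # both candidate entries are read at once, before the first outcome is known.
    base = (bhsr << 1) & mask
    candidates = (pht[base] >= 2, pht[base | 1] >= 2)

    # MUX: the primary prediction selects one of the two candidates
    secondary = candidates[int(primary)]
    return primary, secondary


if __name__ == "__main__":
    pht = [0] * 8
    pht[0b101] = 3      # history 101 -> taken
    pht[0b011] = 3      # history 011 -> taken
    print(dual_prediction(pht, 0b101, k=3))
```

The point of the scheme is that the second table access does not wait for the first prediction: only the final 2-to-1 selection depends on it.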

Loop predictor
• Goal: predicting the outcome of the n-th branch in an n-iteration loop
• A history-based predictor would require a BHSR with k >= n
• The loop predictor consists of a table (similar to the BHT) where each element has three fields: Count, Limit, Prediction (T/N)
  - The limit field stores the number of iterations of a previous loop execution (n)
  - The prediction field stores the corresponding outcomes (T or N)
  - The count field tracks the number of iterations of a new loop instance: it is the current loop count, it starts from zero, it is incremented at each iteration, and it is zeroed once ‘limit’ is reached
  - A comparator detects when we are at the n-th iteration (count = ‘limit’)
• It produces sequences like TTTTTTN TTTTTTN ... or NNNNNNT NNNNNNT ... (each group is n outcomes long)
• It works well only for loops (not for other situations), therefore other generic predictors are needed anyway
• Used in Pentium-M and Pentium-4 [Gochman03]
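A single entry of such a loop predictor can be sketched in Python as below. The class name and the choice of advancing the counter inside `predict` are simplifications assumed for this example; a real design would also train the limit field and validate the prediction.

```python
class LoopPredictorEntry:
    """One table entry with the Count / Limit / Prediction fields."""

    def __init__(self, limit, prediction=True):
        self.limit = limit            # iterations observed in a previous execution (n)
        self.prediction = prediction  # outcome of the first n-1 iterations (True = T)
        self.count = 0                # current loop count, starts from zero

    def predict(self):
        self.count += 1               # incremented at each iteration
        if self.count == self.limit:  # n-th iteration detected
            self.count = 0            # zeroed once 'limit' is reached
            return not self.prediction
        return self.prediction


if __name__ == "__main__":
    entry = LoopPredictorEntry(limit=7)
    print("".join("T" if entry.predict() else "N" for _ in range(14)))
```

With limit = 7 the entry reproduces the TTTTTTN sequence shown on the slide, something a history-based predictor can do only with k >= n.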

Trace Cache [Rotenberg96]
• It captures sequences of basic blocks (dynamic traces)
• It replaces the classical instruction cache
• It is indexed by the BTA (Branch Target Address)
• The elements are basic blocks that are assembled dynamically, while the processor executes the program instructions
• On a trace-cache hit
  - A sequence of several basic blocks is fetched (i.e. containing several “taken branches”)
• Design simplifications
  - The processor does not need to fetch several branch targets
  - No need to predict the basic-block sequence!
  - No need for a multi-ported access
• Example
  [Figure: control-flow graph with the basic blocks B1, B2, B3, B4; the executed path is B1, B3, B4]
  - When the code is executed the first time, the trace cache stores the basic blocks AND their sequence: B1 B3 B4 goes into the trace cache
  - The next time that the beginning of this sequence hits, there is no need to predict it: it is just loaded for execution
• Used in the Pentium-4

Further References
[Alpha99] Compaq Computer Corporation, Alpha 21264 Microprocessor Hardware Reference Manual, 1999.
[Boggs04] Boggs D., et al. The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology. Intel® Technology Journal, Vol 08, Issue 01, February 18, 2004.
[Gochman03] Gochman S., et al. The Intel® Pentium® M Processor: Microarchitecture and Performance. Intel® Technology Journal, Vol 07, Issue 02, May 21, 2003.
[Hennessy02] Hennessy, J. L. and Patterson, D. A. Computer Architecture: a Quantitative Approach. 3rd Edition. Morgan Kaufmann Publishers Inc., 2002.
[Kaeli91] D. R. Kaeli and P. G. Emma. Branch history table prediction of moving target branches due to subroutine returns. In Proc. ISCA-18, pages 34-41, May 1991.
[Shen02] Shen J. P., Lipasti M. Modern Processor Design, McGraw Hill Higher Education; Beta Ed edition (November 1, 2002).
[Yeh93-ics] Yeh, T., Marr, D. T., and Patt, Y. N. 1993. Increasing the instruction fetch rate via multiple branch prediction and a branch address cache. In Proceedings of the 7th International Conference on Supercomputing (Tokyo, Japan, July 19 - 23, 1993).
[Uht95] Uht, A. K., Sindagi, V., Hall, K. Disjoint eager execution: an optimal form of speculative execution. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Dec. 1995).
[Rotenberg96] E. Rotenberg, S. Bennett, and J. E. Smith. Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. 29th International Symposium on Microarchitecture, Dec. 1996.


EXTRA SLIDES


Branch Misprediction Recovery (1)
• Dynamic branch prediction consists of two parts
  - A front part that performs the speculation in the early stages of the pipeline
  - A back part that performs the validation in the later stages of the pipeline
• Speculating on branches
  - While fetching instructions from the predicted path, another branch may be encountered
    - E.g. the predictor suggests “Taken” for branch b1
    - The processor fetches b2 before b1 is resolved
• Solutions
  - Wait until b1 is resolved before predicting b2? Wasted resources…
  - Predict b2 even though b1 is not resolved? This complicates the management of the “recovery” in case of misprediction
[Figure: binary tree of the branches b1, b2, b3 with taken (T) and not-taken (N) edges]


Branch Misprediction Recovery (2)
• How to recover in this case
  - We would like to recover from multiple mispredictions
  - The most complex situation: first branch mispredicted, second one correct
• Example: we are speculating on 3 branches: b1, b2, b3
  - The predictions are highlighted with the dashed line
  - The instructions of every predicted path are resident in the processor
• Idea: every instruction on a given speculative path is assigned a TAG
  - Each speculative path has its own tag (Tag1, Tag2, Tag3)
[Figure: the same branch tree (T/N edges), with the predicted paths labelled Tag 1, Tag 2 and Tag 3]

Branch Misprediction Recovery (3)
• Branch validation
  - When the branch is resolved (direction and target are known)
    - CORRECT PREDICTION: the tag is removed and the instructions on that path become non-speculative
    - WRONG PREDICTION (MISPREDICTION): the wrong path is stopped and removed from the pipeline; all the later speculative paths must be removed too; the correct path is executed from its beginning by inserting it into the pipeline
• Example
  - The second branch was mispredicted
  - All the instructions with Tag2 and Tag3 must be removed
[Figure: the branch tree (T/NT edges) with the paths Tag 2 and Tag 3 removed; the correct path out of the second branch is marked “Restart from here!”]
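The tag-based recovery can be sketched in Python as follows. The representation of the pipeline as a list of (instruction, tag) pairs and the assumption that tags are integers increasing in speculation order (Tag1 < Tag2 < Tag3) are choices made for this example only.

```python
def resolve_branch(pipeline, tag, mispredicted):
    """Apply the outcome of the branch that opened the speculative path `tag`.

    pipeline: list of (instruction, tag) in program order; tag None = non-speculative.
    Returns the new pipeline content.
    """
    if not mispredicted:
        # Correct prediction: the tag is removed, the path becomes non-speculative
        return [(ins, None if t == tag else t) for ins, t in pipeline]
    # Misprediction: remove the wrong path and every later speculative path
    return [(ins, t) for ins, t in pipeline if t is None or t < tag]


if __name__ == "__main__":
    pipeline = [("i0", None), ("i1", 1), ("i2", 2), ("i3", 3)]
    print(resolve_branch(pipeline, 2, mispredicted=True))    # Tag2 and Tag3 removed
    print(resolve_branch(pipeline, 1, mispredicted=False))   # Tag1 becomes non-speculative
```

After a misprediction the fetch would restart from the correct path of the mispredicted branch; that step is outside this sketch.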


Example - PowerPC 604 (3)
• Hit in the BTAC
  - It signals the presence of a branch in the fetch queue
  - The target address read from the BTAC is used in the next cycle
• In the second cycle the BHT is consulted
  - The prediction must be “taken” if there was a BTAC hit in the previous cycle
• What if the two predictions do NOT agree?
  - The BTAC prediction is discarded (it means that the BHT predicted “not-taken”)
  - The fetch continues from the fall-through path
  - The BHT prediction prevails over the BTAC one
• After the branch is resolved, both the BTAC and the BHT must be updated
• Why are both needed?
  - The BTAC is faster: if the prediction is right, no cycle is lost
  - The BHT is more accurate: even if it arrives one cycle later, the prediction can still be useful


Example - PowerPC 604 (4)
• The PowerPC 604 is superscalar (it uses dynamic scheduling)
  - The reservation stations can hold up to 4 branch instructions
  - 2-bit tags are needed to manage the speculation
  - It follows the scheme introduced above
  - In particular, the resources occupied by speculative instructions must be freed in case of misprediction (e.g. the Reorder Buffer, a typical structure of superscalar processors)


Return address predictor
• Some branches change their target address within the same program
  - Such branches are typically “indirect jumps”
  - In particular, these include the procedure returns
  - In SPEC-89, procedure returns are 85% of these jumps
• Prediction of procedure-return instructions
  - The outcome is easy to predict: always taken!
  - The target is not easy to predict: the same procedure can be called from different points of a program
  - Using the BTB to predict the target can lead to mispredictions
  - A small stack that holds the return addresses has been proposed [Kaeli91]
  - At call time, the return address is pushed onto this stack
  - At return time, it is enough to pop from this stack
  - It works as a cache of the most recent return addresses
  - If the stack is large enough, it predicts all the returns
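A return address stack of the kind proposed in [Kaeli91] can be sketched in Python as follows. The class and method names, the default size of 16 entries and the policy of dropping the oldest entry on overflow are assumptions of this example.

```python
class ReturnAddressStack:
    """Small hardware-like stack that predicts the target of procedure returns."""

    def __init__(self, size=16):
        self.size = size
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.size:
            self.stack.pop(0)          # overflow: the oldest return address is lost
        self.stack.append(return_address)

    def predict_return(self):
        # An empty stack gives no prediction (None)
        return self.stack.pop() if self.stack else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    ras.on_call(0x100)                 # main calls f, return address 0x100
    ras.on_call(0x200)                 # f calls g, return address 0x200
    print(hex(ras.predict_return()), hex(ras.predict_return()))
```

The overflow case shows why the stack behaves like a cache of the most recent return addresses: with deep call chains the oldest returns are mispredicted.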


Multiple predictions [Yeh93-ics]
• Idea: predict the following branches even if the previous ones are not resolved
• Make 1 prediction per cycle
• While attempting the following predictions, it can be useful to speculatively update the PHT and the BHT
• The technique works well if the accuracy of the first prediction is high
  - 1st prediction: 96% accuracy =>
  - 2nd prediction: 92.16% accuracy =>
  - 4th prediction: 84.93% accuracy
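The figures 92.16% and 84.93% follow from multiplying the accuracies of consecutive predictions, under the assumption (made explicit here) that each prediction is correct independently with the same probability. A short Python check, with a hypothetical function name:

```python
def nth_prediction_accuracy(first_accuracy, n):
    """Probability that n consecutive predictions are all correct (independence assumed)."""
    return first_accuracy ** n


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, f"{nth_prediction_accuracy(0.96, n):.2%}")
```

The useful accuracy therefore drops geometrically with the speculation depth, which is why a high first-prediction accuracy matters so much.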


Eager Execution [Uht95]
• In a group of 4 instructions, all 4 may be branches
  - A 4-ported BTB is needed (similarly to Yeh's multiple predictor)
• Eager Execution
  - Both the taken and the not-taken paths are executed, without predictions
    - The fetch proceeds from both the taken and the not-taken paths
    - In Uht's work, the fetch is limited to 6 branchings
    - The wrong paths are progressively discarded as the branches are resolved
    - Of course a lot of work is thrown away… but it is fast!
• Disjoint Eager Execution
  - In this case the branch prediction is taken into account while fetching
  - Instructions are fetched only from the predicted paths, up to 6 branchings
  - If the path is wrong, the pipeline is restarted


ALPHA 21264

Branch Prediction case study


Alpha 21264 Processor (Feb. 1998)
• 500-600 MHz, 15x10^6 transistors, 2.2 V, 0.35 μm CMOS
• 64-bit word
• 7-stage pipeline
• 4-way superscalar execution
  - It can fetch and execute up to 4 instructions per cycle
  - The execution is out-of-order
• According to Hennessy and Patterson, it has the most sophisticated predictor implemented up to 2003
  - It is based on the predictor introduced by McFarling in 1993

Note: the Alpha processor was produced by Digital Equipment Corp. (DEC). DEC was acquired by Compaq in 1998. Compaq was acquired by HP in 2002.


Alpha 21264 predictor [Alpha99]
• Tournament Predictor
  - It dynamically chooses between two predictors
    - The “local predictor” (left): equivalent to a PAg scheme
    - The “global predictor” (right): equivalent to a GAg scheme
    - The predictor is selected through a history of the local/global outcomes, fed to the usual 2-bit saturating counter: in practice, another “choice” predictor
[Figure: the program counter indexes the local history table (1024 x 10), whose output indexes the local prediction table (1024 x 3); these two tables form the local predictor. The path history indexes the global prediction table (4096 x 2) and the choice prediction table (4096 x 2). A MUX outputs the branch prediction]


Alpha 21264: Local Predictor (PAg, m=10, k=10, j=3)
• Local History Table (LHT)
  - Equivalent to the BHT of the PAg scheme (k=10, 1024-entry BHT (m=10))
  - It keeps the last 10 outcomes for up to 1024 branches
  - It is indexed by the address of the branch instruction (BIA)
• Local Prediction Table (LPT)
  - Equivalent to the PHT of the PAg scheme (j=3, 1024 entries (k=10))
  - It is indexed by the history element selected in the LHT
  - The FSM is a 3-bit saturating counter
• The LHT and the LPT are updated after the branch is resolved
• It works well for sequences that alternate T and N


Alpha 21264: Global Predictor (GAg, k=12, j=2)
• 4096-entry table
  - It is equivalent to the PHT of the GAg scheme (j=2, 4096-entry PHT (k=12))
  - It is indexed with a 12-bit global history register
• Predictor
  - The FSM is a 2-bit saturating counter
• It works well for branches that are influenced by previous branches
• Example:

  if (x == 10) {       // if this is taken…
      …                // …and x has not changed here…
  }
  if (x % 2 == 0)      // …this branch will be taken too
      …

• A global predictor typically learns and correctly predicts situations of this kind


Alpha 21264 - Total predictor size
• Global Predictor (k=12, j=2)
  - k + 2^k * j = 12 + 4096 x 2 ≈ 8K bits
• Local Predictor
  - Local History Table: 2^m * k = 1024 x 10 = 10K bits
  - Local Prediction Table: 2^k * j = 1024 x 3 = 3K bits
• Choice Predictor
  - k + 2^k * j = 12 + 4096 x 2 ≈ 8K bits
• Total
  - 29K bits
  - ~180,000 transistors
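The total can be re-derived with a few lines of Python. The table sizes are those quoted in the slides (1024 x 10, 1024 x 3, 4096 x 2, 4096 x 2); the 12-bit history register is counted once for the global and once for the choice predictor, following the k + 2^k * j formula, which is an assumption of this sketch. The function name is hypothetical.

```python
def alpha_21264_predictor_bits():
    """Size in bits of each component of the Alpha 21264 tournament predictor."""
    return {
        "global": 12 + (1 << 12) * 2,          # k + 2^k * j with k=12, j=2
        "local_history": (1 << 10) * 10,       # 2^m * k with m=10, k=10
        "local_prediction": (1 << 10) * 3,     # 2^k * j with k=10, j=3
        "choice": 12 + (1 << 12) * 2,          # same organization as the global table
    }


if __name__ == "__main__":
    sizes = alpha_21264_predictor_bits()
    total = sum(sizes.values())
    print(sizes)
    print(total, "bits, about", total // 1024, "K bits")
```

The sum is consistent with the 29K bits reported on the slide.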


Use of the Global versus the Local predictor
• Percentage of branches predicted by the local predictor: number of predictions of the local predictor, normalized to the total number of predictions (local + global)

Benchmark    Fraction of predictions by local predictor (%)
nasa7        98
matrix300    100
tomcatv      94
doduc        90
spice        55
fpppp        76
gcc          72
espresso     63
eqntott      37
li           69

SPEC89 benchmarks, chart from [Hennessy02]

Predictor performance - ABP (branch prediction accuracy)

Benchmark   Profile-based   2-bit counter   Tournament
gcc         88%             70%             94%
espresso    86%             82%             96%
li          88%             77%             98%
fpppp       86%             82%             98%
doduc       95%             84%             97%
tomcatv     99%             99%             100%


Predictor performance: misprediction rate vs. size

[Figure: conditional branch misprediction rate (y axis, 0% to 10%) versus total predictor size in Kbits (x axis, 0 to 128) for the Local, Correlating and Tournament predictors. Benchmark suite: SPEC89]


P6, i.e. PENTIUM PRO, II, III (Nov. 1995)

Branch Prediction case study


Branch Prediction in the P6 [Shen02]
• Branch resolution (taken/not_taken)
  - It is performed in the JEU (Jump Execution Unit)
  - The BTB predicts the target as soon as the IFU (Instruction Fetch Unit) fetches the branch
  - All the addresses are verified by the BAC (Branch Address Calculator) or by the JEU
• Branch Target Buffer (BTB)
  - It operates in the early stages of the pipeline
  - It starts from the address in the IP (Instruction Pointer) and produces a prediction of the outcome and of the target
  - The predicted target address is sent to the IFU for the fetch
• BTB update
  - The BTB is updated as soon as the JEU resolves the branch
  - This may be too late if the next branch arrives in the immediately following instructions
    - The BTB is therefore speculatively updated at prediction time


P6 - Branch Prediction Algorithm
• Based on the 2-level adaptive scheme [Yeh92]
  - First level: history of the branch outcomes
  - Second level: behavior of the branch for a given history pattern
  - Differences with respect to [Yeh92]
    - There is a speculative copy of the BHT that allows predictions to be made before the resolution (and the update) takes place
• For each branch…
  - The BTB keeps k bits of “real” history (called BHR == BHT)
    - Taken/Not-taken for the last k branches
  - The BHT indexes a table of 2^k state elements (Pattern Table, PT == PHT)
    - The associated FSM is the usual saturating counter
• The BTB uses a “semi-local” per-set pattern table
  - Each element has 4 bits of history
  - All the elements of a set use the same pattern table
• Speculative update of the BHR
  - A speculative copy of the BHR is updated with the current prediction
    - This copy is used if a branch arrives before the previous one is resolved
  - The real BHR is updated with the actual outcome after the branch is resolved


Branch Prediction Algorithm - BTB
• If there is no hit in the BTB
  - A static prediction is used: BTFN = Backward Taken, Forward Not-taken
• Return stack
  - The BTB also keeps a 16-entry “return stack” [Kaeli91]
  - This helps to predict the return address of functions


PENTIUM-4 (Nov. 2000), PENTIUM-M (Mar. 2003)

Branch Prediction case study


Pentium4 - Branch prediction [Boggs04]
• The predictor is 8 times larger than that of the P6 (4KB)
  - According to Intel, the most sophisticated prediction scheme as of 2007
  - The exact algorithm has not been disclosed
• The predictor is combined with the Trace Cache
  - The Trace Cache replaces the Instruction Cache
• Components
  - Return Address Stack: 16 entries (see [Kaeli91])
  - Indirect Branch Predictor (see next slide)
  - Loop detector (see previous slides)
• It relies on two ways of predicting the branch (similar to the P6)
  - On a BTB miss, a static BTFN prediction is used
  - In the 90nm version the static prediction is improved (see next slides)
• Improvements introduced in the Pentium-M [Gochman03]
  - A combination of the following is used:
    - Prediction mechanisms already present in the Pentium-4
    - Indirect Branch Predictor
    - Loop detector


Indirect Branch Predictor
• It handles indirect branches, i.e. branches that depend on data
  - That is, the branch target is held in a register
  - They are very frequent in object-oriented programs (Java, C++)
• There are two cases
  - Indirect branches with 1 target (easier to predict)
  - Indirect branches with several targets (e.g. a “case” statement), in which the target depends on the program data
• The predictor distinguishes between these two cases
  - Data-independent
    - Only the PC is used to select the branch target
    - The target is stored in a table indexed by the PC
  - Data-dependent
    - The global history of the outcomes is used to select the branch target
    - The target is stored in a table indexed by the global history


Software hints for branch prediction
• The Pentium4 allows the software to pass hints to the processor
  - The Branch Prediction and trace formation hardware consults this information to improve performance
• Changes to the ISA
  - The branch instructions must be modified to support the hints
  - Prefixes are added to the conditional branches
• The technique is used only when the trace is created
  - After the trace has been created, the software hints are no longer considered


Pentium4 - Static prediction (90nm versions)
• For backward branches:
  - If the branch offset is larger than a certain (empirically found) value, then the branch is unlikely to be at the bottom of a loop (and is predicted Not-taken)
  - Otherwise the (static) Taken prediction is used
• For forward branches:
  - Not-taken is still used (as in BTFN)
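The static rule can be summarized by the Python sketch below. The function name and the threshold value `max_loop_distance` are purely hypothetical: the slides only say that the threshold was found empirically, without giving it.

```python
def static_predict(branch_pc, target_pc, max_loop_distance=1024):
    """Static prediction used on a BTB miss: True = taken, False = not-taken.

    max_loop_distance is a made-up threshold in bytes, used only for illustration.
    """
    if target_pc >= branch_pc:
        return False                    # forward branch: not-taken, as in BTFN
    # Backward branch: taken only if it is close enough to look like a loop bottom
    return (branch_pc - target_pc) <= max_loop_distance


if __name__ == "__main__":
    print(static_predict(0x1000, 0x0FF0))    # short backward branch
    print(static_predict(0x10000, 0x1000))   # far backward branch
    print(static_predict(0x1000, 0x1040))    # forward branch
```

Compared with plain BTFN, only the far backward branches change their static prediction.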


Misprediction rate in the Pentium4
• Comparison between two generations of Intel architectures (130nm vs 90nm)
• The table reports the number of mispredictions per 100 branch instructions for the Pentium-4 architecture at 130 nm and at 90 nm. Data provided by Intel [Boggs04]

SPECint_base2000   130nm   90nm
164.gzip           1.03    1.01
175.vpr            1.32    1.21
176.gcc            0.85    0.70
181.mcf            1.35    1.22
186.crafty         0.72    0.69
197.parser         1.06    0.87
252.eon            0.44    0.39
253.perlbmk        0.62    0.28
254.gap            0.33    0.24
255.vortex         0.08    0.09
256.bzip2          1.19    1.12
300.twolf          1.32    1.23




MORE ON BRANCH PREDICTION


CORE, CORE2, CORE-I7


Example - PowerPC 604 (2)

[Figure: PowerPC 604 block diagram: PC (+4), I-cache, Branch History Table (BHT) and Branch Target Address Cache (BTAC) with their prediction and update paths, decode and dispatch stages with the decode buffer and the dispatch buffer, branch unit, reservation stations (BRN, SFX, SFX, CFX, FPL, LS), execute, re-order buffer, commit]


Introduction to Linux

Roberto Giorgi, Universita’ degli Studi di Siena

Objectives
• High Performance Computers mostly rely on Linux
• Awareness of how to interact with Linux
• Basic commands to interact with the Shell and how to productively use Linux
• File-System structure
• Protection, Sharing
• Advanced commands to control the machine

UNIX
• Before Linux there was UNIX…
• UNIX was born at Bell/AT&T Labs in 1969 through the effort of researchers who needed modern tools to help them in their research projects
• UNIX was born as an Operating System featuring:
  - multiuser: it can differentiate among several users
  - multitasking: it can safely share machine resources (such as CPU, memory, storage, …)

LINUX
• 1991: Linus Torvalds, as a hobby, creates an OS for educational use, with the same capabilities as classical UNIX, and makes it available for free on the web
• 1998: Linux becomes very popular and starts taking market share from Microsoft
• 2014: Linux is the de-facto standard for high performance computers (but also for embedded systems)
  - It is largely supported by big companies like IBM, Oracle, Intel and almost all the other players in the computer world
• Several “distributions” (distros) are available that contain not only the OS but also many other applications, including the source code: the user has the freedom to modify the software

The Shell

[Figure: the Shell at the center, surrounded by Control Commands, Internet, WEB/DB Server, File System, Information Protection and Combined Commands]

User Authentication (Login)
• Every user has the illusion of having the whole machine under his/her complete control (virtual machine)

  Login: jenny
  Password:

  Welcome to Unix!
  $

• "jenny" is the name of the user (username)
• For security reasons the password is never shown
• The welcome message generally identifies the version of the operating system and may show some announcement from the administrator of the machine
• "$" is the prompt of the command interpreter (or shell)

Trivial security measures
• When you type a wrong name or password, the login program shows the following message, but only after the user has entered *both* username and password

  Login incorrect

• This message advises that you did not type correctly:
  - either the username
  - or the password
  - or that both are invalid
• This message does NOT specify whether the error is in the username or in the password
  - This should discourage unauthorized users from trying names or passwords just to discover them and gain access

After the login
• After the login, you are talking to the command interpreter, also known as the shell
• The Shell plays an important role in all the interactions with the Linux Operating System
• When you type a word (in response to the prompt of the shell), the shell interprets that word and starts the consequent actions
• Such actions can be
  - The execution of a program
  - The production of an error message which tells you that the word has not been typed correctly (or that it doesn’t make sense for the OS)

Logout
• To log out, press CTRL-D in response to the shell prompt
  - If CTRL-D does not work, try the exit or the logout command
• The logout command is typically used by certain shell types (e.g., C-Shell), whilst other shells (Bourne Shell, Korn Shell, Bourne Again Shell) use CTRL-D or exit
• If the terminal is inside a graphical user interface
  - The logout procedure, besides disconnecting the shell associated with the terminal, closes the window

Correcting errors
• Since the Shell and many other utilities and tools do not interpret what the user has written until the RETURN key is pressed, it is possible to correct errors and avoid sending a wrong command that would have to be stopped later
• There are three ways to correct typing errors:
  - Delete one character at a time by pressing a specific key: BACKSPACE or CTRL-H
  - Delete the whole line by pressing the «kill key» CTRL-U (this cleans up the shell prompt)
  - Interrupt the shell by pressing the «interrupt key» CTRL-C (this generates a new prompt)
• After pressing RETURN … it’s too late! You will then have to wait for the shell to complete the processing

11

Interrupt the execution of a program

• To interrupt a running program you can press CTRL-C
  • This should be done only as a last resort, since it may leave the program in an indeterminate state (data is not saved)
• When CTRL-C is pressed, the OS sends the interrupt signal (SIGINT) to the processes running in the foreground on that terminal
• The executing program, however, can decide what to do when the interrupt signal is received
  • Some programs terminate immediately
  • Some programs decide to ignore the signal
• When the shell itself receives the interrupt signal, it prints a new prompt and waits again for new user input
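As a minimal sketch of how a program can decide what to do with the interrupt signal, the lines below start two small child shells: one installs a handler with `trap`, the other ignores the signal. The `trap` builtin and the use of `kill -INT` in place of a real CTRL-C are assumptions made for illustration; they are not part of the slide.

```shell
# Child 1 handles SIGINT: the handler prints a message and exits cleanly.
handled=$(sh -c 'trap "echo caught; exit 0" INT; kill -INT $$; echo "not reached"')

# Child 2 ignores SIGINT (empty handler) and simply keeps running.
ignored=$(sh -c 'trap "" INT; kill -INT $$; echo "still running"')

echo "handler case: $handled"
echo "ignore case:  $ignored"
```

Each child sends the signal to itself, so the shell that runs the example is never interrupted.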

The Shell

• The shell is the native interface of Linux/Unix
  • Simple
  • Powerful
• The shell can appear «full screen» or as a window inside a graphical interface (terminal)
  • The shell interacts with the user through the «command prompt»
• Users can use the shell as a programming language that executes commands for the OS
  • Such commands can be typed at the terminal or read from a file (in which case the file is called a «script»)
• The classic shell of AT&T UNIX is the Bourne shell (sh)
  • Today the most popular is the Bourne Again shell (bash)

Shell Commands

• The shell provides you with several commands to:
  • list, copy, move, remove files: ls, cp, mv, rm
  • create, remove directories: mkdir, rmdir
  • change the working directory: cd
• Some other commands are not shell commands, but can be used as if they were:
  • Utilities, e.g., to search for files or for words in a file: find, grep
  • Other general programs (e.g., lynx, gpg, …)
  • Application programs: e.g., <myprogram>
• To obtain help on a command you can type man <command>

Predefined commands and Scripts

• Shell commands typically include predefined programs like cd or logout
• A command can also be a «personalized command», also known as a script
  • A script is a file that contains a sequence of commands
  • Scripts enable users to execute complex tasks as if they were a single command (e.g., installing software, …)
• Scripts are typically developed by system administrators or advanced users for:
  • backup procedures, exchanging files
  • controlling simulations, mining result data
  • …
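A script can be as small as two commands stored in a file. The sketch below writes a hypothetical script called `greet.sh` into a temporary directory and runs it; the file name, its content and the use of `mktemp` are illustrative assumptions.

```shell
workdir=$(mktemp -d)

# Create the script: a plain text file containing a sequence of commands.
cat > "$workdir/greet.sh" <<'EOF'
#!/bin/sh
# greet.sh: hypothetical example script; $1 is the first argument
echo "Hello, $1"
echo "This line comes from the same script"
EOF

# Run the script by passing it to a shell, as if it were a single command.
script_output=$(sh "$workdir/greet.sh" student)
echo "$script_output"

rm -r "$workdir"
```

After `chmod u+x greet.sh` the same file could also be started directly by its pathname.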

Command Syntax

• This is the typical syntax of a command:
command [arg1] [arg2] … [argn] <RETURN>
• One or more spaces or TABs must be present between the elements of a command line
• Arguments arg1, arg2, ... can be:
  • The name of a file (“filename”)
  • A string of text or numbers
  • Other objects on which the command acts
• The square brackets […] mean that the argument is optional
• The «angle» brackets <…> mean that the argument is mandatory
• Options are special arguments that start with «-»
  • Options modify the behaviour of a command, e.g.:
    ls -l example <RETURN>

Executing a command

1) The shell searches the directory list (specified by $PATH) for an executable file with the same name as the command

2) If the shell finds the executable file, it creates a new process

• A Linux process is a program in execution

3) The shell passes the arguments to that process, including the options and the name of the command

4) While the command is executing, the shell waits, sleeping, until the generated process ends

• Later we will see some advanced shell characteristics
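Step 1 can be observed from the shell itself. The sketch below uses the standard `command -v` builtin (an addition not mentioned in the slide) to ask the shell how it would resolve a command name.

```shell
# For an external program, command -v prints the pathname found through $PATH.
ls_path=$(command -v ls)
echo "ls resolves to: $ls_path"

# For a command built into the shell, it prints just the name.
cd_kind=$(command -v cd)
echo "cd resolves to: $cd_kind"

# The directory list that was searched:
echo "PATH is: $PATH"
```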

Editors

• There exist several NON-GRAPHICAL editors, which are useful to interact with the OS in a simple way:
  • vi, the VIsual editor (most popular)
  • joe, an editor (a WordStar emulator)
  • emacs, an editor mostly liked by developers
  • vim, Vi IMproved (an improved version of vi)

• There are also several GUI (Graphical User Interface) editors
  • gedit, xemacs, kedit, kate, openoffice, ...
  • (not covered in this introduction)

• The simplest to use is joe: “joe <filename>”
  • Write your text and, to save, press CTRL-K CTRL-X

File system


Hierarchical Structure of the file system

• The structure of the Linux file system is hierarchical
• This allows users to organize files so that they can easily find what they are looking for
• This structure is called a “tree”; it is made of DIRECTORIES and FILES
• The top node is called the “root” of the tree (ROOT)

Home Directory

• Each user «starts» with a directory called the home directory
• From a directory, users can create all the sub-directories that they wish
• In this way, users can expand the directory structure to match their own needs

[Figure: a HOME DIRECTORY containing FILES and a SUB-DIRECTORY]

Filename (1)

• Each file has a name (filename)
  • Linux can use up to 255 characters for a filename
• For better portability and usability, at most 14 characters are recommended
• To avoid confusion (although it is possible to use almost any character), it is also recommended to preferably use:
  • capital letters “A-Z”, small letters “a-z”, numbers “0-9”, underscore “_”, dots “.” and commas “,”
• The only exception to this naming scheme is the root directory, always indicated by the name “/”
• In the same directory we cannot have two files with the same filename

Filename (2)

• If we share files among users of generic Unix systems (e.g., non-Linux) it is recommended to avoid overly long names
  • They will be easier to type
  • Some Unix versions limit filenames to 14 characters
• The disadvantage of short filenames is that they are typically less descriptive than longer ones

Filename Extension

• A filename extension is the part of the filename that follows the last dot “.”
• Extensions help describe the content of the file
• Extensions can be used freely in order to make file contents easier to understand
• Examples:
  example.txt    a file containing text
  example.html   a file containing a WEB page
  example.c      a file in the C language
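The rule "everything after the last dot" can be checked with the shell's own parameter expansion. This is only a sketch: the filename `report.final.html` is a made-up example, and the `${var##pattern}` and `${var%pattern}` forms are standard shell features that the slide does not mention.

```shell
filename="report.final.html"

# Remove the longest prefix ending in ".", leaving what follows the LAST dot.
extension=${filename##*.}

# Remove the shortest suffix starting with ".", leaving the name without extension.
stem=${filename%.*}

echo "extension: $extension"
echo "stem:      $stem"
```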

Invisible Filenames

• A filename that starts with a dot «.» is called invisible, since the ls command normally does not show it
• The command ls -a shows all files, including the invisible ones
• The configuration files of programs are usually invisible: .login, .cshrc, .logout, .profile
  • Configuration files provide the OS with user-specific information or user-specific application preferences
• Two special entries are also invisible («.» and «..»)
  • . is an alias of the current directory
  • .. is an alias of the parent directory

Note: the file “…” (3 dots) is a legal filename, but it is by no means a reserved name like . or ..
If you see a “…” filename, it may mean that some hacker is trying to trick you!

Absolute Pathnames

• You can specify the pathname of a file by tracing a path from the root directory, through all intermediate directories, down to the file
• All filenames along the path are listed in sequence and separated by the «/» character
• The full pathname starts with the name of the root directory (again «/»)

[Figure: directory tree / → usr → example, giving the pathname /usr/example]

This path is also called absolute since it locates a file uniquely in the file system

Home directory and Working directory

• When you connect to a Linux system, you always start in a working directory
• This initial working directory is called the home directory
  • The alias of the home directory is “~”
• Once you change directory with the cd command, you will be in a possibly different working directory
• The pwd command always prints the pathname of the current working directory

Relative Pathnames

• A relative pathname traces a path from the working directory to a given file
• Every pathname that does not start with «/» is a relative pathname

[Figure: directory tree / → usr → {example, A}, with A → f1]

Working Directory   Pathname of “f1”   Pathname of “example”   Pathname Type
(any)               /usr/A/f1          /usr/example            absolute
/usr                A/f1               example                 relative
/usr/A              f1                 ../example              relative
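The table can be reproduced on a scratch copy of the same tree. In this sketch the tree is built under a temporary directory (so the "absolute" pathnames start with that directory rather than with `/usr`), which is an assumption made to keep the example harmless.

```shell
root=$(mktemp -d)
mkdir -p "$root/usr/A"
echo "content of f1" > "$root/usr/A/f1"
echo "content of example" > "$root/usr/example"

# Working directory usr/A: "example" is reached through the parent directory.
from_relative=$(cd "$root/usr/A" && cat ../example)

# The absolute pathname works from any working directory.
from_absolute=$(cat "$root/usr/example")

# Working directory usr: f1 is reached through A.
f1_relative=$(cd "$root/usr" && cat A/f1)

echo "$from_relative / $from_absolute / $f1_relative"
rm -r "$root"
```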

Important Standard Directories (1)

• / (root) is present in any Linux system
• /home each user has a home directory, which is typically a subdirectory of the /home directory
• /bin and /usr/bin contain standard Linux programs
• /etc contains configuration and administration files for managing the system
  • /etc/passwd contains the list of authorized users
• /usr contains programs that are specific to a system installation
  • /usr/lib contains standard libraries: all programs use «common» functions that are external to the program itself (e.g., the «printf» function to print a message); such functions are the same for any user and are available in these system libraries

Important Standard Directories (2)

• /var files whose size varies during the life of the system, e.g.:
  • /var/log system logs: files that record the events that happen in the system (connections, commands issued by the users, statistics on resource usage, …)
  • /var/spool spooled files: files that are temporarily held in the printer queue, email queue, …
  • /var/lock synchronization files: generated during the execution of a program and deleted once it ends
• /dev all files that represent peripheral devices, like terminals, printers, disks, …
• /tmp many programs use this directory for temporary files
• /export a common directory for files that can be shared across systems

Patterns

• If we need to type several similar filenames we can use wildcards to specify a pattern, e.g.:
  • All files that start with “report”: report*
• A pattern is specified by using special characters (called wildcards) such as “*”, “?”, “[]”
• When a wildcard appears in the command line, the shell expands it into a list of matching filenames of the working directory and passes that list to the program to be launched

The special character ?

• The question mark matches any single character in a filename
$ lpr memo?
• The shell expands memo? into the list of filenames of the working directory whose name is «memo» followed by any single character
$ ls memo?
memo5 memo9 memoa memos

Note: this does not match the filename «memo»

The special character *

• The star character matches any number of characters (including zero characters)
$ ls memo*
• The star cannot match an initial «.», which marks an invisible file
  • To match a filename that starts with a «.» you must explicitly put the «.» at the beginning, followed by the desired pattern

The special character []

• A pair of square brackets (with some characters in between) matches, at the position where the brackets appear, any one of the characters listed between them
• E.g.: memo[17a] matches
memo1 memo7 memoa
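The three wildcards can be compared side by side in a scratch directory. The sketch below assumes a made-up set of files (`memo`, `memo1`, `memo5`, `memo7`, `memoa` and the invisible `.memo_hidden`) and uses `echo` so that the list printed is exactly what the shell expanded.

```shell
demo=$(mktemp -d)
touch "$demo/memo" "$demo/memo1" "$demo/memo5" "$demo/memo7" "$demo/memoa" "$demo/.memo_hidden"

question=$(cd "$demo" && echo memo?)      # exactly one character after "memo"
star=$(cd "$demo" && echo memo*)          # zero or more characters after "memo"
bracket=$(cd "$demo" && echo memo[17a])   # one of the characters 1, 7, a
everything=$(cd "$demo" && echo *)        # the star skips names starting with "."
hidden=$(cd "$demo" && echo .m*)          # the leading "." must be written explicitly

echo "memo?     -> $question"
echo "memo*     -> $star"
echo "memo[17a] -> $bracket"
echo "*         -> $everything"
echo ".m*       -> $hidden"
rm -r "$demo"
```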

Information Protection


Scheme of Access Permissions (1)

• The scheme of access permissions:
  • Regulates the access to your files by other users; you may or may not want to share such files
  • Allows you to keep your confidential files private
• E.g., you can decide that other users:
  • Can read and write a given file that you own
  • Can only read (but not write) a file that you own
  • Can only write (but not read) a file that you own
  • Cannot read or write a file that you own
• You can protect the content of a whole directory to prevent other (unwanted) users from reading it
• Warning: the system administrator has full access to any file (regardless of any permission that the file owner has specified)

Scheme of Access Permissions (2)

• The scheme of access permissions is based on the following three concepts associated with EACH file:
  • User: the owner of the file (one, and only one, of the users)
  • Group: the group of the file (one, and only one, of the system groups)
  • Mask of access permissions: specifies what is allowed and what is forbidden according to these attributes:
    • (r) read permission
    • (w) write permission
    • (x) execution permission
• The mask specifies in compact notation (see next slides) the permissions for the User (u), for the Group (g) and for any Other (o) user (one who is neither the associated User nor a member of the associated Group) of the file

Access Permission MASK

• The access permission mask specifies these 3 attributes (r w x):
  • For the associated User (or owner)
  • For the associated Group (a list of users, who need not own the file)
  • For any Other user (implicitly defined as any user who is not the owner and does not belong to the associated group)
• Hence, in total the mask has to specify 3x3=9 permissions!
• The owner and the administrator are the only two users that can change the access permissions:
  • For the owner
  • For the associated group
  • For the others
• The access permission mask is then represented by a string like: rwx r-x r--
• The dash character « - » specifies that the corresponding permission is not enabled

Visualizing the access permissions

• The command “ls -l” shows all the above attributes of the file

$ ls -l letter.txt check_spell
-rw-r--r-- 1 alex pubs 3355 May 2 10:52 letter.txt
-rwxr-xr-x 2 alex pubs 852 May 5 14:03 check_spell

Fields, from left to right: file/dir type, permissions for the owner User, permissions for the Group, permissions of the Other users, number of hard links to this file, owner User, associated Group, size, date, hour, filename

Directory Access Permissions

• In this case access permissions have a slightly different meaning:
  • r permission to list the directory content (files, subdirs)
  • w permission to add or remove content (files, subdirs)
  • x permission to traverse that directory (to access subdirs)
• Moreover, there is a «tenth» attribute in the permission mask (the first character), which specifies whether this is:
  • - a file
  • d a directory
• E.g.:
$ ls -ld /home/alex/info
drwxr-xr-x 3 alex pubs 512 May 2 10:52 /home/alex/info
(the leading d marks a directory)

Link

• A link is an «alias» of a file that can be created at any point of the file system
  • It is useful to avoid replicating information (e.g., big files)
  • It permits a personalized classification of the files that are in the directory hierarchy

Using ln to create a link

• The ln command creates an additional link to a file
  • It DOES NOT create another copy of the file!
  • It APPEARS like a copy of the file

[Figure: tree / → home → {U1, U2}; U2 contains a and b; U1 contains f2 and the new link f1]

$ cd /home/U1
$ ln -s /home/U2/b f1

I can then refer to the same file by using: /home/U2/b or /home/U1/f1

Checking if a file is a link

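$ cd /home/U1
$ ln -s /home/U2/b f1
$ ls -l f1
lrw-r--r-- 1 alex pubs May 2 10:52 f1 -> /home/U2/b
(the leading l marks a link)

• We can remove a link (like a file) by issuing the rm command
• This type of link is called a «soft link» (or symbolic link), to distinguish it from a «hard link», which is an additional directory entry for the same file (created by ln without -s); hard links to directories are normally not allowed
• Removing a soft link never removes the pointed object. Removing a hard link removes only that name: the file content is released once its last hard link is removed. Soft links are often preferred because they show their target explicitly and can cross file systems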
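The behaviour of a soft link can be verified on a scratch copy of the U1/U2 example. This is a sketch under the assumption that the two directories live in a temporary directory instead of /home; the `-L` and `-f` file tests are standard shell features not mentioned in the slides.

```shell
linkdir=$(mktemp -d)
mkdir "$linkdir/U1" "$linkdir/U2"
echo "shared data" > "$linkdir/U2/b"

# Create the soft link f1 in U1 pointing to b in U2.
ln -s "$linkdir/U2/b" "$linkdir/U1/f1"

# Reading through the link gives the content of the target.
via_link=$(cat "$linkdir/U1/f1")

# -L is true when the pathname is a symbolic link.
if [ -L "$linkdir/U1/f1" ]; then is_link=yes; else is_link=no; fi

# Removing the link leaves the target file untouched.
rm "$linkdir/U1/f1"
if [ -f "$linkdir/U2/b" ]; then target_survives=yes; else target_survives=no; fi

echo "$via_link / link: $is_link / target still there: $target_survives"
rm -r "$linkdir"
```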

Changing the permissions

$ chmod <update_mask> file1

• Changes the permissions associated with file1
• The <update_mask> has the following format (one symbol from each column):

  u  +  r
  g  -  w
  o     x

  • first column: which of the three tuples has to be changed (u = user, g = group, o = others)
  • second column: add (+) or remove (-) the permission
  • third column: permission to be changed (r = read, w = write, x = execute)

• E.g., to enable the group to write file1:
$ chmod g+w file1
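The effect of an update mask can be read back from the first ten characters of `ls -l`. In this sketch the starting permissions are set with the `=` form of chmod (a standard form the slide does not cover), and `cut -c1-10` is an assumption used only to isolate the mask from the rest of the listing.

```shell
permdir=$(mktemp -d)
touch "$permdir/file1"

# Start from a known mask: rw- for the user, r-- for group and others.
chmod u=rw,g=r,o=r "$permdir/file1"
before=$(ls -l "$permdir/file1" | cut -c1-10)

# Enable the group to write file1.
chmod g+w "$permdir/file1"
after=$(ls -l "$permdir/file1" | cut -c1-10)

echo "before: $before"
echo "after:  $after"
rm -r "$permdir"
```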

Control Commands


LINUX commands (1)

• Linux provides a large variety of commands targeted at the common needs of managing the system and of working with more complex applications (databases like MySQL or Oracle, Web, scientific applications, …)
• It is not sufficient to start the program… to make sure that the program did not generate any error during its execution, one has to check the exit code
• At the end of its execution, any command returns a numerical value (called exit code), which tells whether the program finished without errors (exit code = 0) or not (exit code ≠ 0)
• The same command may be implemented differently on different Linux versions or distributions: the online manual (“man”) always provides help for the current implementation
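In the shell the exit code of the last command is available in the special variable `$?` (a standard feature the slide does not name). A minimal sketch, assuming that the path `/no/such/dir` does not exist on the machine:

```shell
# A command that succeeds: exit code 0.
ls / > /dev/null
ok_code=$?

# A command that fails: the exit code is different from 0.
# The if/else form keeps the example safe in scripts that stop on errors.
if ls /no/such/dir 2> /dev/null; then
  fail_code=0
else
  fail_code=$?
fi

echo "successful ls returned $ok_code"
echo "failing ls returned $fail_code"
```

The exact non-zero value depends on the command and on its implementation, so scripts usually test only for zero versus non-zero.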

LINUX commands (2)

• We can group commands into four main categories:
  • File and directory management, e.g.:
    • cp, copy a file
  • Text processing commands, e.g.:
    • grep, search for a given word (or pattern) in a file
  • System status and monitoring commands, e.g.:
    • du, summarize the disk usage
  • Software maintenance commands, e.g.:
    • tar, can be used for the backup and restore of files

File and Directory management (1)

• The commands for managing files enable the user to copy, find, rename, move and remove files

• The commands for managing directories allow us to change the working directory and to remove, create and list the content of directories

File and Directory management (2)

rm      Remove a file or directory (with -r)
cp      Copy files
mv      Move or rename files and directories
cd      Change working directory
pwd     Print the pathname of the working directory
rmdir   Remove a directory
mkdir   Create a directory
ls      Show the directory content
find    Search for a file

The ls command

• The ls command provides information about the directory content

ls [-ACFRabcdefglmnopqrstux] [name]

• For each directory specified in name, ls lists the content of that directory; for each file specified in name, ls repeats the name and the associated information (if any)

• By default, the output is in alphabetical order
• If no argument is specified, the content of the working directory is listed

The cd command

• The cd command changes the working directory

cd [directory]

• If a directory is specified as argument, that directory becomes the working directory; if no argument is provided after cd, the effect is to «move» the working directory to the HOME DIRECTORY

• For the command to succeed, the user who issues it must have the «traverse permission» on all the directories specified in the pathname of the target directory, e.g.:

cd /home/pippo
cd /usr/local/bin

The pwd command

• The pwd command prints the name of the working directory

$ pwd
/home/marco/d1

The cp command

• The cp command generates copies of files

cp <file1> <file2>

copies file1 into file2 (even if the two files are already identical)

cp <files> <directory>

copies one or more files into an existing directory (a common error, if the directory does not exist, is to generate instead a file with filename «directory»)

The rm command

• The rm command removes files. It has the following syntax:

rm [-fri] <filename>

• rm removes all the files that are indicated as command line arguments

• The option -i produces an interactive execution, so that the user is asked to answer yes/no for each file that is specified; it is very useful when the filename is a pattern

• The option -f, on the contrary, forces the removal of the files without any further questions (use with caution!)

• The option -r recursively removes the files and directories in the path that has been specified as filename

The mv command

• The mv command moves or renames files or directories
mv <file1> <file2>
mv <directory1> <directory2>

• mv moves (changes the name of) file1 into file2 (or directory1 into directory2)

• If file2 already exists, it will be removed before file1 takes its place

mv <file> <directory>
• In this latter situation the command moves file into directory

The mkdir command

• The mkdir command creates a directory
mkdir <dirname>

• Note: when a directory is created, it always and automatically contains at least two elements (the directory “.”, alias of the directory itself, and the directory “..”, alias of the parent directory)

The rmdir command

• The rmdir command removes a directory
rmdir <directory>

• For the command to succeed:
  • The directory must be empty
  • The directory must not be «in use» by another user (a different user may have set that directory as working directory)
  • The user who issues the command must have the permissions to remove the directory («w» permission)
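The file and directory commands above can be tried together in a scratch directory. This is only a sketch: the names `docs`, `notes.txt` and `old_notes.txt` are made up, and everything happens under a temporary directory created with `mktemp`.

```shell
sandbox=$(mktemp -d)

mkdir "$sandbox/docs"                               # create a directory
echo "draft" > "$sandbox/notes.txt"
cp "$sandbox/notes.txt" "$sandbox/docs"             # cp <files> <directory>
mv "$sandbox/notes.txt" "$sandbox/old_notes.txt"    # mv used as rename
copied=$(cat "$sandbox/docs/notes.txt")

# rmdir refuses to remove a directory that still has content.
if rmdir "$sandbox/docs" 2> /dev/null; then rmdir_nonempty=succeeded; else rmdir_nonempty=failed; fi

# Once the directory is empty, rmdir succeeds.
rm "$sandbox/docs/notes.txt"
if rmdir "$sandbox/docs"; then rmdir_empty=succeeded; else rmdir_empty=failed; fi

echo "copy contains: $copied"
echo "rmdir on non-empty directory: $rmdir_nonempty"
echo "rmdir on empty directory:     $rmdir_empty"
rm -r "$sandbox"                                    # recursive removal of what is left
```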

The find command

• The find command finds files within the file system and applies a specific action to each file found (the simplest one being just to print its filename)

find <pathname-list> [options]

• E.g.: find files named example (-name example) in the current directory and in its sub-directories; once found, print the pathname of each located file (-print)

find . -name example -print

• Note: this is the simplest form of the find command. More complex actions can be associated in order to process the found files.
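The example can be run on a small scratch tree to see which pathnames find prints. The tree layout below is made up for illustration, and `sort` is added only to make the order of the results predictable.

```shell
tree=$(mktemp -d)
mkdir -p "$tree/a/b"
touch "$tree/example" "$tree/a/example" "$tree/a/b/other"

# Search the current directory (.) and its sub-directories for files named "example".
found=$(cd "$tree" && find . -name example -print | sort)

echo "$found"
rm -r "$tree"
```

Two pathnames are printed, one for each file named example; the file `other` is not reported.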

Commands to check the status of the system

• Checking the system is the first step to verify its health:
  • who    list the names of the connected users
  • du     summarize the disk usage
  • ps     report process statistics
  • kill   terminate a process or send a signal to it
  • stty   set or get terminal options
  • id     print the id and groups of the current user
  • date   show or set the system date and time
  • mail   standard program to send/receive emails

The who command

• The who command lists the users connected to the system

who [-uATHldtasqbrp] [file]

• who may list usernames, terminal line, login time, how long the user has been inactive, name of the last command issued by such user

• The arguments “am i” or “am I” show information about the current user

• With the appropriate options, who can list login, logout, reboot, system clock changes, etc.

The ps command (1)

• The ps command reports process statistics
ps [options]

• ps (without options) prints information about the processes associated with the current terminal
  • In practice, the process status can change while ps itself is running; what is shown is a snapshot of the system during the command execution
  • Some data could be stale since the related process could have already terminated

The ps command (2)

• This command prints the following data:
  • PID (Process ID) a number between 1 and 32768
  • TT (terminal ID) a number which identifies the terminal
  • STAT (state) a string that specifies the process state
  • TIME CPU time (minutes and seconds) used by the process so far
  • COMMAND name of the command associated with this process (issued at the shell prompt); the option -w shows the whole command line

“ID” in general denotes a numerical identifier, i.e., a unique number that identifies a given object in the system

$ ps
PID   TT STAT TIME COMMAND
24059 11 S    0:05 -sh (sh)
24259 11 R    0:02 ps

R is running, S is sleeping in an interruptible wait, D is waiting in uninterruptible disk sleep, Z is zombie, T is traced or stopped (on a signal), and W is paging

The kill command

• The kill command sends a signal to a process, possibly killing it

kill [-signal] <processid>

• Upon reception of the signal the process may terminate (unless it is programmed to ignore the signal or to handle it differently)

• A process can be terminated either by its owner or by the system administrator

• If the argument is a number preceded by the «-» character, then the corresponding signal is sent:
  kill -9 <pid>    this signal cannot be ignored by the process
  kill -0 <pid>    a non-kill: it just queries whether the process exists
  kill -15 <pid>   a «polite kill»: the process is asked to terminate
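The three forms of kill can be tried on a harmless process. In this sketch a `sleep 60` is started in background as the victim; the special variable `$!` (PID of the last background command) and the `wait` builtin are standard shell features used here as assumptions, since the slide does not mention them.

```shell
# Start a long-running process in background and remember its PID.
sleep 60 &
bg_pid=$!

# kill -0 sends no signal: it only reports whether the process exists.
if kill -0 "$bg_pid" 2> /dev/null; then alive_before=yes; else alive_before=no; fi

# kill -15 asks the process to terminate; sleep does not handle it, so it ends.
kill -15 "$bg_pid"
wait "$bg_pid" 2> /dev/null || true    # collect the terminated process

if kill -0 "$bg_pid" 2> /dev/null; then alive_after=yes; else alive_after=no; fi

echo "alive before: $alive_before, alive after: $alive_after"
```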

The stty command

• The stty command sets or gets the terminal options (in particular with relation to the «standard input», i.e., the keyboard)

stty [-a] [-g] [options]

• stty (without arguments) prints the main current terminal settings
• stty -a shows ALL current terminal settings

The du command

• The du command summarizes the disk space usage (by default printed in kilobytes) of a given file or directory

du [-afrsu] [name]

• du (without arguments) prints the disk usage of the current working directory

• Disk usage should be monitored: once the disk space is exhausted, the system may lock up

Combined Commands


Standard input and Standard output

• Standard input = input of a command/process
• Standard output = output of a command/process
• The shell directs the standard output to a file which represents your terminal (or window)
  • e.g. /dev/tty0
• In the same way, the shell gets the standard input from the «keyboard» device

[Figure: a command with its input, output and error streams]

Redirecting Standard Output

• The output redirection character (>) instructs the shell to redirect the output of a command into a file instead of the standard output

command [arguments] > filename

• command is any executable program and filename is the name of an ordinary file to which the shell redirects the output
• Redirect output with caution: if the file already exists, its content will be overwritten

[Figure: the output stream of the command goes to a File]

Redirecting the Standard Input

• The input redirection character (<) instructs the shell to take the input of the command from the specified file instead of the standard input

command [arguments] < filename

[Figure: the input stream of the command comes from a File]

Appending Standard Output to a File

• The output append operator (>>) allows us to add the output produced by the given command to an existing file, without modifying any pre-existing information in that file

• This also provides a convenient way to concatenate two files into one (by using the cat command, which dumps the content of the listed files)

$ cat pear >> fruit_list
$ cat orange >> fruit_list
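The difference between >, >> and < shows up clearly when the same file is written three times. A minimal sketch, assuming a scratch file named `fruit_list` inside a temporary directory:

```shell
io=$(mktemp -d)

echo "pear" > "$io/fruit_list"       # > creates (or overwrites) the file
echo "orange" >> "$io/fruit_list"    # >> appends, keeping the existing line

# < feeds the file to wc as its standard input.
line_count=$(wc -l < "$io/fruit_list" | tr -d ' ')

echo "kiwi" > "$io/fruit_list"       # > again: the previous content is lost
after_overwrite=$(cat "$io/fruit_list")

echo "lines after append: $line_count"
echo "content after overwrite: $after_overwrite"
rm -r "$io"
```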

Pipelining commands

• In Linux, a pipeline is a sequence of one or more commands separated by a vertical bar (the | key); the idea is that the bar represents a «pipe» between the commands
  • The standard output of the command before the pipe «flows» into the standard input of the command after the pipe

• All the commands are launched as separate processes, and each can send its output to the next command of the pipeline as soon as it is ready; control returns to the shell once the last command has finished

• Pipes reduce the need for intermediate files in complex processing

command_a [arguments] | command_b [arguments]

Pipe syntax

command_a [arguments] | command_b [arguments]

• This command uses a pipe to generate the same result as the following command lines:
command_a [arguments] > temp
command_b [arguments] < temp
rm temp

• Other examples:
$ ls | lpr
$ ls | more
$ gpg --sign msg.txt | mail [email protected]

Filters

• A filter is a command that processes a stream of incoming data in order to produce a stream of output data

• Filters are ideally suited for pipelines

$ who | sort | lpr

• This example shows the power of the shell combined with the versatility of the Linux utilities

The utility tee

• The tee utility in a pipeline allows us to «snoop» into a file the content that is flowing from pipe to pipe

$ who | tee who.out | grep chas
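A pipeline with a filter and with tee can be tried without a printer or other logged-in users. In this sketch `printf` stands in for who and produces three made-up usernames; the file names `users.out` and `sorted.out` are also assumptions.

```shell
pipe=$(mktemp -d)

# printf produces the stream, tee saves a copy of it, sort filters what flows on.
printf 'chas\nalex\nbarbara\n' | tee "$pipe/users.out" | sort > "$pipe/sorted.out"

snooped=$(head -n 1 "$pipe/users.out")        # the copy keeps the original order
first_sorted=$(head -n 1 "$pipe/sorted.out")  # the filter output is in sorted order

echo "first line saved by tee: $snooped"
echo "first line after sort:   $first_sorted"
rm -r "$pipe"
```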

Launching background commands (1)

• When you launch a command, the shell waits until it finishes before returning another prompt
  • In this case we say that the command runs in foreground
• It is also possible to execute a command in background
  • In this way, you do not need to wait: the shell prompt is returned immediately and you can launch another command

• Typically a command is launched in background when we know that it will take a long time to produce its output and that it does not require interaction

• In other words: the terminal becomes immediately available to perform further work


• To launch a background command, add the ampersand (&) character at the end of the command line (before pressing RETURN)

• The shell prints the process identification number (PID) of the process just launched in background; then the shell immediately returns a new prompt (while the background process continues to run)

$ ls -l | lpr &
31725
$


• If a background process writes something to the standard output and you did not redirect it, that output will be mixed with what you are doing on the same terminal afterwards!

• If a background process requests something from the standard input and you did not redirect it, the shell automatically provides an empty string as standard input for the background command (at least this is the behavior of the Bourne shell; the C shell stops the background process for input, instead)

• You cannot use CTRL-C to stop a background process: you must communicate with it through signals, i.e., through the «kill» command

• If you wish to interrupt all background processes, you can use the following command (note: there is no dash character before the number)

$ kill 0


• As an alternative, the user can selectively use the kill command to stop specific processes
  • To retrieve the PIDs of the processes that are running in background, you can use the ps command

• You can also send to background the process currently in foreground (e.g., if we realize that it is taking a long time to complete) by pressing CTRL-Z, which suspends it, and then typing the command bg

• To bring that process back to the foreground you can type the command fg
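Launching a job with & and collecting it later can be sketched in a script. The one-second `sleep` stands in for a long computation, the file `result.txt` is a made-up name, and the `wait` builtin and the `$!` variable are standard shell features that the slides do not mention.

```shell
bgdir=$(mktemp -d)

# The job runs in background; its output is redirected so it cannot mix with the terminal.
( sleep 1; echo "long job finished" > "$bgdir/result.txt" ) &
job_pid=$!

echo "the shell is free again while PID $job_pid is working"

# wait blocks until the given background process ends and returns its exit code.
wait "$job_pid"
job_status=$?
job_result=$(cat "$bgdir/result.txt")

echo "exit code: $job_status, result: $job_result"
rm -r "$bgdir"
```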

Where to find more information

• http://www.linux.org (Linux)

• http://www.redhat.com (RedHat/Fedora)

• http://www.tldp.org/LDP/abs/html/ (Advanced Bash-Scripting Guide)

• Mark G. Sobell, A Practical Guide to Fedora and Red Hat Enterprise Linux, Sixth Edition, Prentice Hall, 2011


Superscalar Processors (first part)


Superscalar Processors

• Introductory Concepts

• Superscalar Processors
  • Instruction Fetching
  • Dispatch, Issue
  • Register renaming
  • Speculative execution
  • Memory operation

• Case studies
  • MIPS R10000
  • DEC 21164
  • AMD K5
  • Intel P6


Superscalar Processors

• Performance limitations of pipelining:
  • Latch overheads & clock skew
  • “Atomic” instruction issue logic
    - (Flynn bottleneck: CPI ≥ 1)

• Improving performance by increasing the "width" of the processor: Instruction Level Parallelism (ILP)

[Figure: pipeline diagram over clock cycles 1-9 for instructions i … i+7, each going through the stages IF ID EX MEM WB]


Superscalar Processors

• Multiple issue of instructions in each clock cycle → superscalar
  • Often implies dynamic issue
  • But a “static issue” solution exists too (Alpha 21064)

• Superscalar evolution with time
  • One integer + one floating-point
  • Any two instructions
  • Any four instructions
  • Any n instructions? (n appears to have a limit at about 3-6)


The Big Picture

[Figure: static program → instr. fetch & branch predict. → instr. dispatch → window of execution → instr. issue → instr. execution → instr. reorder & commit]


Characteristics of Superscalar Processors

• High performance instruction fetching
  • Fetching several instructions per cycle (multiple fetch)
  • Branch/Jump Prediction
• Dispatch and dynamic dependency resolution
  • Eliminating false dependencies (anti-, output)
  • Correct management of true dependencies, e.g. using Tomasulo-like tags
• Parallel and out-of-order instruction issue
• Speculative execution
• More resources that operate in parallel
  • +functional units, +paths, +register ports
• High performance memory systems
• Methods to manage precise exceptions


Generic Superscalar Processor

[Figure: block diagram with pre-decode, instr. cache, instr. buffer, decode/rename & dispatch, integer/address and floating pt. instruction buffers, functional units, functional units and data cache, memory interface, integer register file, floating pt. register file, re-order and commit]


Multiple Instruction Fetch (1)

• Goal: collect X instructions per cycle from the instruction cache (I$)
• Problem: due to, e.g., a branch, the X instructions can start at any point of the cache block
  - Cannot be solved simply by doubling the block size (the initial instruction can be at the end of the block)
  - Needs to be done by fetching two blocks AND aligning them properly

[Figure: an I$ with sets 0-1023 split into two banks of sets 0-511 each, I$ even blocks (bank 0) and I$ odd blocks (bank 1); the set address (and the set address +1) selects one block of X instructions from each bank, yielding 2X instructions. Example: direct access cache with 1024 sets and 4-instruction blocks]


Multiple Instruction Fetch (2)

• Once the 2X instructions are fetched, align them properly

[Figure: two example cases feeding an 8x4 align switch. The switch inputs are the 8 fetched instructions (0..7); the selection uses the set address (e.g., 2 in the 1st case or 3 in the 2nd case) and the position of the instruction within the set (0, 1, 2 or 3).]

The 8x4 switch sends one of the 8 quartets (***) to the output by exploiting:

- the two least significant address bits (**) of the instruction within the block
- bit 0 (*) of the set address

(***) Simplified scheme: the inputs are 0123 4567 and the possible output quartets are
0123 1234 2345 3456 4567 5670 6701 7012
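The two-bank fetch and the 8x4 alignment can be sketched as follows. This is a minimal illustration, assuming the slide's example (1024 sets, 4-instruction blocks); the function names and the modelling of instructions as integers are hypothetical, and the wrap-around at the last set is an assumption of the sketch.

```python
BLOCK = 4      # instructions per cache block (slide example)
SETS = 1024    # total sets, split into two banks of 512 rows

def make_banks(icache):
    """Split the cache into even blocks (bank 0) and odd blocks (bank 1)."""
    return icache[0::2], icache[1::2]

def fetch_aligned(bank0, bank1, set_addr, offset):
    """Return BLOCK consecutive instructions starting at (set_addr, offset)."""
    rows = SETS // 2
    row1 = (set_addr // 2) % rows          # row read from the odd bank
    row0 = ((set_addr + 1) // 2) % rows    # row read from the even bank (the "+1" path)
    eight = bank0[row0] + bank1[row1]      # the 2X fetched instructions, positions 0..7
    # Quartet selection: two offset bits (**) plus bit 0 of the set address (*)
    start = offset + BLOCK * (set_addr & 1)
    return [eight[(start + i) % (2 * BLOCK)] for i in range(BLOCK)]

# Instruction k of set s is modelled as the integer BLOCK*s + k
icache = [[BLOCK * s + k for k in range(BLOCK)] for s in range(SETS)]
bank0, bank1 = make_banks(icache)
print(fetch_aligned(bank0, bank1, 2, 2))   # starts in an even block: quartet "2345"
print(fetch_aligned(bank0, bank1, 3, 3))   # starts in an odd block: quartet "7012"
```

When the starting block is odd, the following block comes from the even bank, so the selected quartet wraps around the 8 switch inputs; this is why quartets such as 5670, 6701 and 7012 appear in the simplified scheme.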

Branch/Jump Prediction

• Eliminates several control hazards
• Goal: no interruptions in the instruction flow
• Multiple predictions per cycle may be necessary
• Example

Static program:
loop: r3 <- mem(r4+r2)
      r7 <- mem(r5+r2)
      r7 <- r7 * r3
      r1 <- r1 - 1
      mem(r6+r2) <- r7
      r2 <- r2 + 8
      P <- loop; r1!=0

Dynamic Stream:
r3 <- mem(r4+r2)
r7 <- mem(r5+r2)
r7 <- r7 * r3
r1 <- r1 - 1
mem(r6+r2) <- r7
r2 <- r2 + 8
P <- loop; r1!=0
r3 <- mem(r4+r2)
r7 <- mem(r5+r2)
r7 <- r7 * r3
r1 <- r1 - 1
mem(r6+r2) <- r7
r2 <- r2 + 8
P <- loop; r1!=0
...


Instruction Window

• Can be realized:
  A) as a queue (FIFO)
  B) as a small memory
  - Trade-off between complexity and flexibility
• Moreover, it can be:
  C) partitioned according to the various types of functional units
  - In a manner similar to Tomasulo’s reservation stations
  D) organized as a unified set of reservation stations

[Figure: three example organizations (A+D, A+C, B+C), each fed from instruction dispatch and feeding either all units together or, separately, the integer units, the floating point units and the load/store units]

In the following we consider the most general scheme, B+D.

Dispatch and Issue

• Consider a generic implementation scheme
• At any time, the data can be:
  • in the registers
  • in the reservation stations
  • in the functional units
  • in the Re-Order Buffer (ROB)
• Renaming is used to keep track of produced values
  • Rx = logical registers, Px = physical registers (normally #Px > #Rx)

[Figure: generic scheme with Registers, rename registers, reservation stations, functional units and reorder buffer]


Tomasulo implemented in this model

• #Physical registers == #logical registers (P == R)
• Reservation Stations
  • Partitioned according to the functional units
  • The access can be to any of them (random access)
    - Except in the case of load/store instructions
• No reorder buffer, hence imprecise interrupts

[Figure: Registers (#Px == #Rx), reservation stations, functional units]


“Modern” Renaming

• Instruction Window (IW) instead of Tomasulo’s reservation stations
  • Other names: "Instruction Buffer" (IB), "Instruction Queue" (IQ)
  • Only control info is stored in the IW
  • The ROB manages precise exceptions and (safe) branch recovery
• There are at least two renaming methods:
  1) Use locations in the ROB to store renamed values
     - Instead of storing them in the reservation stations
  2) Rename in the Register File
     - In this case we have: #physical registers > #logical registers
• First, we will consider method 2)

MIPS R10000 method

• #Physical registers > #logical registers (P > R)
• Data moves between registers and functional units only
• The ROB is used to manage the control (reservation bits)

[Figure: Registers with register reservation bits, rename registers, reservation stations, functional units, reorder buffer]


Register Renaming (P registers > R registers)

• Key structures: Register Map + Free Pool
• Register Map (RM)
  • Specifies the association between each R register and a P register
  • The physical register number takes the place of the “tag”
  • The mapping changes over time
  • Converts the stream of instructions into a "single assignment" form (a register is written once and then read)
  • Avoids false dependencies
• Free Pool (FP)
  • Holds the list of available P registers
• Operation: for each instruction
  1) For the source Rs: read the P corresponding to each R in the RM
  2) For the destination R: get a new P from the FP and associate it with the R
  3) Update the RM to reflect the new mapping
• Example: 24 P and 8 R (see next slides)
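The three renaming steps can be sketched in a few lines. This is a minimal illustration, assuming the initial Register Map and Free Pool of the example that follows; the tuple encoding of instructions and the function name `rename` are hypothetical, and a stall on an empty Free Pool is not modelled.

```python
from collections import deque

def rename(stream, reg_map, free_pool):
    """Rename (op, dest, sources) tuples using a Register Map and a Free Pool."""
    renamed = []
    for op, dest, srcs in stream:
        # 1) sources: read the current P of each source R from the map
        p_srcs = [reg_map[r] for r in srcs]
        p_dest = None
        if dest is not None:
            # 2) destination: take a new P from the Free Pool
            p_dest = free_pool.popleft()
            # 3) update the map so later readers see the new P
            reg_map[dest] = p_dest
        renamed.append((op, p_dest, p_srcs))
    return renamed

# Initial state of the example (cycle 0)
reg_map = {"r1": "p3", "r2": "p4", "r3": "p6", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p5"}
free_pool = deque(f"p{i}" for i in range(8, 25))

loop_body = [
    ("ld",  "r3", ["r4", "r2"]),        # r3 <- mem(r4+r2)
    ("ld",  "r7", ["r5", "r2"]),        # r7 <- mem(r5+r2)
    ("mul", "r7", ["r7", "r3"]),        # r7 <- r7 * r3
    ("sub", "r1", ["r1"]),              # r1 <- r1 - 1
    ("st",  None, ["r6", "r2", "r7"]),  # mem(r6+r2) <- r7
    ("add", "r2", ["r2"]),              # r2 <- r2 + 8
    ("br",  None, ["r1"]),              # P <- loop; r1!=0
]
renamed = rename(loop_body, reg_map, free_pool)
for instr in renamed:
    print(instr)
```

Reading the sources before updating the destination matters for instructions such as r7 <- r7 * r3, where the same logical register is both read (old mapping) and written (new mapping).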


Example (cycle 0)

Fetched Stream (the loop body, repeated three times):
r3 <- mem(r4+r2)
r7 <- mem(r5+r2)
r7 <- r7 * r3
r1 <- r1 - 1
mem(r6+r2) <- r7
r2 <- r2 + 8
P <- loop; r1!=0

Renamed Stream: (empty)

Free Pool: p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24

Register MAP:
r1  p3
r2  p4
r3  p6
r4  p1
r5  p2
r6  p7
r7  p5


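Example (cycle 0…)

Renaming of the first instruction, step by step:

1) Sources read from the Register Map
   Renamed Stream: r3 <- mem(p1+p4)
2) Destination renamed with a new P from the Free Pool
   Renamed Stream: p8 <- mem(p1+p4)
3) Register Map and Free Pool updated
   Renamed Stream: p8 <- mem(p1+p4)
   Free Pool: p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24

After the second instruction:
   Renamed Stream:
   p8 <- mem(p1+p4)
   p9 <- mem(p2+p4)
   Free Pool: p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24

After the third instruction:
   Renamed Stream:
   p8 <- mem(p1+p4)
   p9 <- mem(p2+p4)
   p10 <- p9 * p8
   Free Pool: p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24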


Example (cycle 1)

Renamed Stream:
p8 <- mem(p1+p4)
p9 <- mem(p2+p4)
p10 <- p9 * p8
p11 <- p3 - 1

Free Pool: p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24


Example (cycle 2)

Renamed Stream (previous instructions, plus):
mem(p7+p4) <- p10
p12 <- p4 + 8
P <- loop; p11!=0

Free Pool: p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24


Example (cycle 3)

Renamed Stream (previous instructions, plus):
p13 <- mem(p1+p12)
p14 <- mem(p2+p12)
p15 <- p14 * p13
p16 <- p11 - 1

Free Pool: p17,p18,p19,p20,p21,p22,p23,p24


Example (cycle 4)

Renamed Stream (previous instructions, plus):
mem(p7+p12) <- p15
p17 <- p12 + 8
P <- loop; p16!=0

Free Pool: p18,p19,p20,p21,p22,p23,p24


Example (cycle 5)

Renamed Stream (previous instructions, plus):
p18 <- mem(p1+p17)
p19 <- mem(p2+p17)
p20 <- p19 * p18
p21 <- p16 - 1

Free Pool: p22,p23,p24


Example (cycle 6)

Renamed Stream (previous instructions, plus):
mem(p7+p17) <- p20
p22 <- p17 + 8
P <- loop; p21!=0

Free Pool: p23,p24



Superscalar Processors (second part)


Superscalar Pipeline

[Figure: superscalar pipeline with I-cache access, DECODE + RENAME, DISPATCH, ISSUE, COMPLETE or WRITE-BACK; blocks: Window, LQ, SQ, Integer units, Mult. 4, Regs]


Superscalar Processing

• 5 main stages
  • Rename
  • Dispatch
  • Issue
  • Complete or write-back
  • Commit or retire

[Figure: superscalar pipeline (I-cache access, Decode, RENAME, DISPATCH, ISSUE, WRITE-BACK; Window, LQ, SQ, Integer units, Mult. 4, Regs)]


Superscalar Processing: RENAME

• This is done while the instructions are still in order
• The registers are renamed using the Register Map
• The ROB slots are reserved at this point (to be discussed later)

[Figure: superscalar pipeline]


Superscalar Processing: DISPATCH

• Check the availability of Instruction Window slots
  - If there are none, stall due to a structural hazard
  - If there are enough free ones, dispatch
• Copy the renamed instructions into the IW
• Copy the register reservation status bits into the IW

[Figure: superscalar pipeline]


Fields of an IW-SLOT

Op       Opcode
Pi       Destination register designator associated with this instruction at renaming time
Pj, Pk   Source register designators associated with this instruction at renaming time
Qj, Qk   Tags for source register reservation (if not zero, the instruction is waiting for the corresponding Pj or Pk)
I        Immediate value (if specified by the instruction)
ROB#     Index of the ROB entry associated with this instruction at renaming time
Busy     The element is not available

IW-SLOT:
Busy (1 bit) | Op (e.g. 8 bits) | Pi (6 bits) | Pj (6 bits) | Pk (6 bits) | Qj (1 bit) | Qk (1 bit) | I (16 bits) | ROB# (e.g. 6 bits)

Note: a window element contains only control information, no data (except 1 immediate value)

Superscalar Processing: ISSUE

• Wake-up:
  • Occurs when the tags (Qj, Qk) are both zero (the instruction has all the necessary data, as in Tomasulo)
• Select:
  • The issue logic maps the requested resources to those available
  • The selected instructions read their values from the physical registers and send them to the functional units for execution
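Wake-up and select over IW slots can be sketched as follows. This is a minimal illustration using the IW-SLOT fields described earlier; the opcode-to-unit mapping `UNIT_OF`, the number of units and the oldest-first selection policy are assumptions of the sketch, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from opcode to functional-unit class
UNIT_OF = {"ld": "mem", "st": "mem", "ml": "mul", "sb": "alu", "ad": "alu", "br": "alu"}

@dataclass
class IWSlot:
    op: str
    pi: Optional[str]
    pj: Optional[str]
    pk: Optional[str]
    qj: int      # 1 = still waiting for Pj
    qk: int      # 1 = still waiting for Pk
    rob: int     # ROB#

def wake_up(window, produced_p):
    """Clear the tags of the slots waiting for the physical register just produced."""
    for slot in window:
        if slot.pj == produced_p:
            slot.qj = 0
        if slot.pk == produced_p:
            slot.qk = 0

def select(window, free_units):
    """Issue ready slots (Qj == Qk == 0) while a matching unit is free (assumed oldest first)."""
    issued = []
    for slot in sorted(window, key=lambda s: s.rob):
        unit = UNIT_OF[slot.op]
        if slot.qj == 0 and slot.qk == 0 and free_units.get(unit, 0) > 0:
            free_units[unit] -= 1
            issued.append(slot)
    for slot in issued:
        window.remove(slot)
    return issued

window = [
    IWSlot("ld", "p8", "p1", "p4", 0, 0, 1),
    IWSlot("ld", "p9", "p2", "p4", 0, 0, 2),
    IWSlot("ml", "p10", "p9", "p8", 1, 1, 3),
    IWSlot("sb", "p11", "p3", None, 0, 0, 4),
]
issued = select(window, {"mem": 1, "alu": 2, "mul": 1})
print([s.rob for s in issued])
wake_up(window, "p8")     # the first load produces p8
print(window[1])
```

With a single memory unit only the first load and the subtraction are selected, and producing p8 clears Qk of the multiply while Qj keeps waiting for p9.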

[Figure: superscalar pipeline]

Note: the tag matching logic is not represented in the figure. Only the data paths are represented. The tag matching is performed by sending tags to the Instruction Window and the Registers.

Superscalar Processing: COMPLETE

• Write-back of the results into the registers
• The Pj / Pk fields of the IW indicate the elements that will receive the results too
• The Qj / Qk tags are zeroed according to the received values
• If there has been an exception, it is annotated in the ROB (using an “Exc flag”)
• Finally, the ROB entry is marked as completed (but not deleted yet)

[Figure: superscalar pipeline]


Fields of a ROB-SLOT

Busy     The element is not available
PC       Program Counter of the instruction
Ri       Logical destination register associated with this slot at renaming time
Pi,old   Previous Pi in the Register Map when Ri was renamed
St       Indicates that this instruction is a STORE
Exc      Indicates that this instruction raised an exception
Cplt     Flag indicating that the instruction has completed

ROB-SLOT:
Busy (1 bit) | PC (e.g. 32 bits) | Ri (6 bits) | Pi,old (6 bits) | St (1 bit) | Exc | Cplt (1 bit)

Note: The ROB is managed as a CIRCULAR QUEUE

Superscalar Processing: COMMIT

• New phase that handles the reordering of instructions
• Any instruction that:
  - i) is at the top of the ROB
  - ii) AND is marked as COMPLETED
  is then declared COMMITTED
• Several instructions can commit in the same cycle (e.g., 4)
• If there was no exception
  • The committed ROB entries can be freed
  • The produced value in the Pi can be copied into the Ri
  • The Ri can be remapped to its “old Pi” in the RM
• However, if there has been an exception, we must step back!
  • The entry marked with the Exc flag has to be eliminated
  • And the same happens to all the following instructions (not yet committed) in the ROB


Managing an exception

• Activate the exception bit in the Exception Register
• Stop issue and commit
• Entry by entry: use the old-Pi value in the ROB to restore the Register Map
  • The original situation, BEFORE the instruction that caused the exception was dispatched, is restored
• The EXCEPTION HANDLER is then launched


Superscalar Pipeline

[Figure: complete superscalar pipeline with I-cache access, DECODE + RENAME, DISPATCH, ISSUE, COMPLETE or WRITE-BACK, COMMIT; blocks: Register Map, Free Pool, Instruction Window, LQ, SQ, Integer units, Mult. 4, Physical regs, (logical) Regs, Re-Order Buffer]

Configuration: R: 8, P: 24, IW: 32, ROB: 36, LQ: 3, SQ: 3

Example (cycle 0) to Example (cycle 20)

Renamed Stream: the 21 renamed instructions, from p8 <- mem(p1+p4) to P <- loop; p21!=0 (three iterations of the loop).

[Figures: one pipeline snapshot per cycle (I-cache access, Decode, DISPATCH, ISSUE, Window, LQ, SQ, Integer units, Mult. 4, Regs, WRITE-BACK). Legend: Issued, eXecution, Completed, Committed; * == in flight. Each snapshot tracks the physical registers associated with R7, as listed below.]

Cycle(s)   track R7
0          P5
1-2        P5, P9, P10
3-4        P5, P9, P10, P14, P15
5-6        P5, P9, P10, P14, P15, P19, P20
7-11       P9, P10, P14, P15, P19, P20
12-13      P10, P14, P15, P19, P20
14         P14, P15, P19, P20
15-16      P15, P19, P20
17         P19, P20
18-19      P20
20         (none)



Example -- Summary

Renamed Stream         dispatch  issue  complete
p8 <- mem(p1+p4)           2       3       6
p9 <- mem(p2+p4)           2       4       7
p10 <- p9 * p8             2       7      12
p11 <- p3 - 1              2       3       4
mem(p7+p4) <- p10          3       5      14
p12 <- p4 + 8              3       4       5
P <- loop; p11!=0          3       4       5
p13 <- mem(p1+p12)         4       6       9
p14 <- mem(p2+p12)         4       7      10
p15 <- p14 * p13           4      10      15
p16 <- p11 - 1             4       5       6
mem(p7+p12) <- p15         5       8      17
p17 <- p12 + 8             5       6       7
P <- loop; p16!=0          5       6       7
p18 <- mem(p1+p17)         6       9      12
p19 <- mem(p2+p17)         6      10      13
p20 <- p19 * p18           6      13      18
p21 <- p16 - 1             6       7       8
mem(p7+p17) <- p20         7      11      20
p22 <- p17 + 8             7       8       9
P <- loop; p21!=0          7       8       9

Performance: 1 iteration every 3 cycles

Focusing on one iteration:

IPC_limit = N_INST / Cycles = 7/3 ≈ 2.3
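The iteration period and the IPC limit can be checked directly against the summary table. The short sketch below uses the completion cycles of the three stores (14, 17, 20) from the table; the variable names are illustrative.

```python
from fractions import Fraction

N_INST = 7                       # instructions per loop iteration
store_complete = [14, 17, 20]    # completion cycle of the store in iterations 1, 2, 3

# Distance between the same instruction in consecutive iterations
periods = {b - a for a, b in zip(store_complete, store_complete[1:])}
cycles_per_iteration = periods.pop()

ipc_limit = Fraction(N_INST, cycles_per_iteration)
print(cycles_per_iteration, round(float(ipc_limit), 1))
```

The 3-cycle period is consistent with the issue column of the table, where the three memory operations of each iteration are issued one per cycle.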


Superscalar Processing: ROB

• ROB operation
  • Manage the instruction termination (in-order commit)
  • Free the physical registers that are no longer necessary
  • Recover the state to manage precise exceptions
  • Safe recovery in case of mispredicted branch
• ROB entry fields
  • PC       Program Counter
  • Ri       Number of the destination (logical) register
  • Pi,old   Number of the previous mapping of Ri in the RM
  • St       Store flag (the instruction is performing a store operation)
  • Exc      Exception flag (the instruction caused an exception)
  • Cplt     Complete flag (the instruction has completed)
• The ROB is managed as a circular buffer


Memory Operations

• We need separate Issue Buffers for Loads and Stores
  • Necessary to calculate the Effective Address (EA) IN ORDER (the EAs must be calculated in order to avoid memory-based hazards)
  • The load/store issue is therefore FIFO
• As “usual”:
  • The store is queued and waits for the data to be written to the EA
• For Stores, the St flag is set to 1 in the ROB
• Advanced solutions allow a load to bypass a store (before the store has completed its EA calculation)
  • In such a case we can use the ROB to step back if there is a conflict


Example (cycle 2) – Dispatch of instr. 0,4,8,C

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ld  p8   p1   p4   0   0   -   1
ld  p9   p2   p4   0   0   -   2
ml  p10  p9   p8   1   1   -   3
sb  p11  p3        0   0   1   4

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0        0
2      4   r7  p5      0        0
3      8   r7  p9      0        0
4      C   r1  p3      0        0

Register MAP (current mapping first, followed by the replaced ones)
r1  p11  p3
r2  p4
r3  p8   p6
r4  p1
r5  p2
r6  p7
r7  p10  p9  p5

Free Pool: p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24

[Timeline figure: columns CYCLE 0 and CYCLE 1 against the PCs 0, 4, 8, C, 10, 14, 18 of the first iteration, 0”..18” of the second and 0”’..18”’ of the third]

Example (cycle 3)

Dispatch 10,14,18 – Issue 0,C

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0        0
2      4   r7  p5      0        0
3      8   r7  p9      0        0
4      C   r1  p3      0        0
5      10  -   -       1        0
6      14  r2  p4      0        0
7      18  -   -       0        0

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ld  p8   p1   p4   0   0   -   1
ld  p9   p2   p4   0   0   -   2
ml  p10  p9   p8   1   1   -   3
sb  p11  p3        0   0   1   4
st  p10  p7   p4   0   0       5
ad  p12  p4        0   0   8   6
br       p11       1   0   0   7

Register MAP (current mapping first, followed by the replaced one)
r1  p11
r2  p12  p4
r3  p8
r4  p1
r5  p2
r6  p7
r7  p10

Free Pool: p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24

[Timeline figure: column CYCLE 2 against the PCs 0, 4, 8, C, 10, 14, 18, 0”..18”, 0”’..18”’. Legend: Issued, eXecution, Completed, Committed]

Example (cycle 4)

Entry PC Ri Pi,old St Exc Cplt1 0 r3 p6 0 02 4 r7 p5 0 03 8 r7 p9 0 04 C r1 p3 0 15 10 - - 1 06 14 r2 p4 0 07 18 - - 0 08 0 r3 p8 0 09 4 r7 p10 0 010 8 r7 p14 0 011 C r1 p11 0 0

Op Pi Pj Pk Qj Qk I ROB#ld p9 p2 p4 0 0 - 2ml p10 p9 p8 1 1 - 3st p10 p7 p4 0 0 5ad p12 p4 0 0 8 6br p11 0 0 0 7ld p13 p1 p12 0 1 8ld p14 p2 p12 0 1 9ml p15 p14 p13 1 1 10sb p16 p11 0 0 1 11

RegisterMAP


Free Poolp17,p18,p19p20,p21,p22p23,p24



CYCLE 3

Issued CompletedeXecution CommittedDispatch 0”,4”,8”,C” – Issue 4,14,18 – Complete C

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 5)

Entry PC Ri Pi,old St Exc Cplt1 0 r3 p6 0 02 4 r7 p5 0 03 8 r7 p9 0 04 C r1 p3 0 15 10 - - 1 06 14 r2 p4 0 17 18 - - 0 18 0 r3 p8 0 09 4 r7 p10 0 010 8 r7 p14 0 011 C r1 p11 0 012 10 - - 1 013 14 r2 p12 0 014 18 - - 0 0

Op Pi Pj Pk Qj Qk I ROB#ml p10 p9 p8 1 1 - 3st p10 p7 p4 0 0 5ld p13 p1 p12 0 0 8ld p14 p2 p12 0 0 9ml p15 p14 p13 1 1 10sb p16 p11 0 0 1 11st p15 p7 p12 0 0 12ad p17 p12 0 0 8 13br p16 0 0 0 14

RegisterMAP

r1 p16r2 p17 p12r3 p13 r4 p1r5 p2r6 p7r7 p15

Free Poolp18,p19,p20p21,p22,p23p24



CYCLE 4

Issued CompletedeXecution CommittedDispatch 10”,14”,18” – Issue 10,C” – Complete 14,18 – Commit C

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Getting back the Physical Registers (into the Free Pool)

• Ignoring exceptions, a physical register can be freed once its last read has completed

• However, taking precise exceptions into account, a physical register can be freed only after the corresponding logical register has been updated

• Therefore, Pi,old identifies the physical register that can be freed once the ROB entry has reached commit
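Freeing ROB entries in order and returning Pi,old to the Free Pool can be sketched as follows. This is a minimal illustration, assuming the ROB is a queue whose head is the oldest entry; the class and function names are hypothetical, and the separate "Committed" and "ROB free" states of the example are merged into one step.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    pc: str
    ri: Optional[str]        # logical destination register (None for store/branch)
    pi_old: Optional[str]    # previous mapping of Ri
    cplt: bool = False
    exc: bool = False

def free_from_head(rob, free_pool, width=4):
    """Free up to `width` completed, exception-free entries from the head, in order."""
    freed = []
    while rob and len(freed) < width and rob[0].cplt and not rob[0].exc:
        entry = rob.popleft()
        if entry.pi_old is not None:
            free_pool.append(entry.pi_old)   # the old mapping is no longer needed
        freed.append(entry.pc)
    return freed

# Head of the ROB as in the example: entry 1 is complete, entry 2 is not
rob = deque([
    RobEntry("0", "r3", "p6", cplt=True),
    RobEntry("4", "r7", "p5"),
    RobEntry("8", "r7", "p9"),
    RobEntry("C", "r1", "p3", cplt=True),
])
free_pool = deque(["p23", "p24"])
first = free_from_head(rob, free_pool)
rob[0].cplt = True                           # the second load completes
second = free_from_head(rob, free_pool)
print(first, second, list(free_pool))
```

The entry for PC C is complete but cannot leave the ROB while older entries are still pending; p6 and then p5 join the Free Pool in the same order as in the example.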


Example (cycle 6)

Entry PC Ri Pi,old St Exc Cplt1 0 r3 p6 0 12 4 r7 p5 0 03 8 r7 p9 0 04 C r1 p3 0 15 10 - - 1 06 14 r2 p4 0 17 18 - - 0 18 0 r3 p8 0 09 4 r7 p10 0 010 8 r7 p14 0 011 C r1 p11 0 112 10 - - 1 013 14 r2 p12 0 014 18 - - 0 015 0 r3 p13 0 016 4 r7 p15 0 017 8 r7 p19 0 018 C r1 p16 0 0

Op Pi Pj Pk Qj Qk I ROB#ml p10 p9 p8 1 0 - 3ld p13 p1 p12 0 0 8ld p14 p2 p12 0 0 9ml p15 p14 p13 1 1 10st p15 p7 p12 0 0 12ad p17 p12 0 0 8 13br p16 0 0 0 14ld p18 p1 p17 0 1 15ld p19 p2 p17 0 1 16ml p20 p19 p18 1 1 17sb p21 p16 0 0 1 18

RegisterMAP


Free Poolp22,p23,p24



CYCLE 5

Issued CompletedeXecution CommittedDispatch 0”’,4”’,8”’,C”’ – Issue 0”,14”,18” – Complete 0,C” – Commit 14,18

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 7)

Entry PC Ri Piold St Exc Cplt2 4 r7 p5 0 13 8 r7 p9 0 04 C r1 p3 0 15 10 - - 1 06 14 r2 p4 0 17 18 - - 0 18 0 r3 p8 0 09 4 r7 p10 0 010 8 r7 p14 0 011 C r1 p11 0 112 10 - - 1 013 14 r2 p12 0 114 18 - - 0 115 0 r3 p13 0 016 4 r7 p15 0 017 8 r7 p19 0 018 C r1 p16 0 019 10 - - 1 020 14 r2 p17 0 021 18 - - 0 0

Op Pi Pj Pk Qj Qk I ROB#ml p10 p9 p8 0 0 - 3ld p14 p2 p12 0 0 9ml p15 p14 p13 1 1 10st p15 p7 p12 0 0 12ld p18 p1 p17 0 0 15ld p19 p2 p17 0 0 16ml p20 p19 p18 1 1 17sb p21 p16 0 0 1 18st p20 p7 p17 0 0 19ad p22 p17 0 0 8 20br p21 1 0 0 21

RegisterMAP

r1 p21r2 p22 p17r3 p18r4 p1r5 p2r6 p7r7 p20

Free Poolp23,p24p6


ROB free


CYCLE 6

Issued CompletedeXecution CommittedDispatch 10”’,14”’,18”’ – Issue 8,4”,C”’ – Complete 4,14”,18” – Commit 0,C” – ROBfree 0

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 8)

Entry PC Ri Pi,old St Exc Cplt3 8 r7 p9 0 04 C r1 p3 0 15 10 - - 1 06 14 r2 p4 0 17 18 - - 0 18 0 r3 p8 0 09 4 r7 p10 0 010 8 r7 p14 0 011 C r1 p11 0 112 10 - - 1 013 14 r2 p12 0 114 18 - - 0 115 0 r3 p13 0 016 4 r7 p15 0 017 8 r7 p19 0 018 C r1 p16 0 119 10 - - 1 020 14 r2 p17 0 021 18 - - 0 0

Op Pi Pj Pk Qj Qk I ROB#ml p15 p14 p13 1 1 10st p15 p7 p12 0 0 12ld p18 p1 p17 0 0 15ld p19 p2 p17 0 0 16ml p20 p19 p18 1 1 17st p20 p7 p17 0 0 19ad p22 p17 0 0 8 20br p21 0 0 0 21

RegisterMAP

r1 p21r2 p22r3 p18r4 p1r5 p2r6 p7r7 p20

Free Poolp23,p24,p6p5



Issued CompletedeXecution Committed ROB freeIssue 10”,14”’,18”’ – Complete C”’ – Commit 4,14”18” – ROBfree 4

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 9)Op Pi Pj Pk Qj Qk I ROB#ml p15 p14 p13 1 0 10ld p18 p1 p17 0 0 15ld p19 p2 p17 0 0 16ml p20 p19 p18 1 1 17st p20 p7 p17 0 0 19




Issued CompletedeXecution Committed ROB freeIssue 0”’ – Complete 0”,14”’,18”’ – Commit C”’

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 10)Op Pi Pj Pk Qj Qk I ROB#ml p15 p14 p13 0 0 10ld p19 p2 p17 0 0 16ml p20 p19 p18 1 1 17st p20 p7 p17 0 0 19




Issued CompletedeXecution Committed ROB freeIssue 8”,4”’ – Complete 4” – Commit 0”,14”’,18”’

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 11)Op Pi Pj Pk Qj Qk I ROB#ml p20 p19 p18 1 1 17st p20 p7 p17 0 0 19




Issued CompletedeXecution Committed ROB freeIssue 10”’ – Commit 4”

PC048C1014180”4”8”C”10”14”18”0”’4”’8”’C”’10”’14”’18”’

Example (cycle 12)

Complete 8,0”’ – OVERFLOW 8

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p20  p19  p18  1   0       17

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   ovf  1
4      C   r1  p3      0        1
5      10  -   -       1        0
6      14  r2  p4      0        1
7      18  -   -       0        1
8      0   r3  p8      0        1
9      4   r7  p10     0        1
10     8   r7  p14     0        0
11     C   r1  p11     0        1
12     10  -   -       1        0
13     14  r2  p12     0        1
14     18  -   -       0        1
15     0   r3  p13     0        1
16     4   r7  p15     0        0
17     8   r7  p19     0        0
18     C   r1  p16     0        1
19     10  -   -       1        0
20     14  r2  p17     0        1
21     18  -   -       0        1

[Timeline figure: PCs 0, 4, 8, C, 10, 14, 18, 0”..18”, 0”’..18”’. Legend: Issued, eXecution, Completed, Committed, ROB free]

Exception management (1)

Recover Register State

All the changes on registers must be UNDONE, starting from the most recent instruction and going back in the ROB.

Before the recovery:
Register MAP: r1 p21, r2 p22, r3 p18, r4 p1, r5 p2, r6 p7, r7 p20
Free Pool: p23,p24,p6,p5

After undoing the most recent register change (r2):
Register MAP: r1 p21, r2 p17, r3 p18, r4 p1, r5 p2, r6 p7, r7 p20
Free Pool: p23,p24,p6,p5,p22

After undoing all the changes:
Register MAP: r1 p3, r2 p4, r3 p8, r4 p1, r5 p2, r6 p7, r7 p9
Free Pool: p23,p24,p6,p5,p22,p21,p20,p19,p18,p17,p16,p15,p14,p13,p12,p11,p10

trap PC = 8
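The recovery walk can be sketched as follows. This is a minimal illustration, assuming the (Ri, Pi,old) pairs of ROB entries 3 to 21 at cycle 12 of the example; the function name `recover` is hypothetical, and returning the undone physical register to the Free Pool follows what the Free Pool contents above show.

```python
def recover(rob_entries, reg_map, free_pool):
    """Undo the renames from the youngest ROB entry back to the faulting one."""
    for ri, pi_old in reversed(rob_entries):
        if ri is None:
            continue                      # stores and branches rename nothing
        free_pool.append(reg_map[ri])     # the P allocated by this entry is free again
        reg_map[ri] = pi_old              # restore the previous mapping

# (Ri, Pi,old) of ROB entries 3..21; entry 3 (PC 8) raised the overflow
rob_entries = [
    ("r7", "p9"), ("r1", "p3"), (None, None), ("r2", "p4"), (None, None),
    ("r3", "p8"), ("r7", "p10"), ("r7", "p14"), ("r1", "p11"), (None, None),
    ("r2", "p12"), (None, None), ("r3", "p13"), ("r7", "p15"), ("r7", "p19"),
    ("r1", "p16"), (None, None), ("r2", "p17"), (None, None),
]
reg_map = {"r1": "p21", "r2": "p22", "r3": "p18", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p20"}
free_pool = ["p23", "p24", "p6", "p5"]

recover(rob_entries, reg_map, free_pool)
print(reg_map)
print(free_pool)
```

Walking from the tail towards the head is what makes the restore correct when the same logical register (here r7) has been renamed several times.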


Speculative Execution

• With branch prediction, we try to execute the instructions in a speculative way
  • Correct prediction: no loss of performance
  • Wrong prediction: elimination of the speculative instructions
• The Reorder Buffer can be used to manage such elimination
  • We add "speculative bits" (Spec field) to each ROB entry
  • An instruction cannot commit if its speculative bits are set
  • If the speculation is correct, we clear the speculative bits
  • If the speculation is wrong, we do not commit the speculative instructions that are in the ROB
    - They are eliminated and we use an old RM to restore the mapping
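The snapshot-based recovery can be sketched as follows. This is a minimal illustration, assuming one Register Map snapshot per predicted branch and a Spec value per ROB entry; the function names and the dictionary encoding are hypothetical. Nested mispredictions and the recovery of the Free Pool are not modelled.

```python
def take_snapshot(snapshots, spec_id, reg_map):
    """Save a copy of the Register Map when a branch is predicted."""
    snapshots[spec_id] = dict(reg_map)

def resolve_branch(spec_id, correct, rob, reg_map, snapshots):
    """Clear the speculative bits, or squash the entries and restore the map."""
    if correct:
        for entry in rob:
            if entry["spec"] == spec_id:
                entry["spec"] = 0             # these entries may now commit
    else:
        rob[:] = [e for e in rob if e["spec"] != spec_id]
        reg_map.clear()
        reg_map.update(snapshots[spec_id])    # restore the old RM
    del snapshots[spec_id]

# Second predicted branch of the example: snapshot, then 7 speculative entries
reg_map = {"r1": "p16", "r2": "p17", "r3": "p13", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p15"}
snapshots = {}
take_snapshot(snapshots, 2, reg_map)
rob = [{"entry": n, "spec": 0} for n in range(3, 15)]
rob += [{"entry": n, "spec": 2} for n in range(15, 22)]
reg_map.update({"r1": "p21", "r2": "p22", "r3": "p18", "r7": "p20"})

resolve_branch(2, False, rob, reg_map, snapshots)   # wrong prediction
print([e["entry"] for e in rob])
print(reg_map)
```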


Example+speculation (cycle 2)Entry PC Ri Pi,old St Exc Cplt Spec1 0 r3 p6 0 0 02 4 r7 p5 0 0 03 8 r7 p9 0 0 04 C r1 p3 0 0 0

RegisterMAP

r1 p11r2 p4r3 p8r4 p1r5 p2r6 p7r7 p10



Example+speculation (cycle 3)Entry PC Ri Pi,old St Exc Cplt Spec1 0 r3 p6 0 0 02 4 r7 p5 0 0 03 8 r7 p9 0 0 04 C r1 p3 0 0 05 10 - - 1 0 06 14 r2 p4 0 0 07 18 - - 0 0 0

RegisterMAP

r1 p11r2 p12r3 p8r4 p1r5 p2r6 p7r7 p10

RegisterMAP 1

r1 p11r2 p12r3 p8r4 p1r5 p2r6 p7r7 p10

snapshot

Predict TAKEN



Issued CompletedeXecution Committed ROB free

Example+speculation (cycle 4)Entry PC Ri Pi,old St Exc Cplt Spec1 0 r3 p6 0 0 02 4 r7 p5 0 0 03 8 r7 p9 0 0 04 C r1 p3 0 0 05 10 - - 1 0 06 14 r2 p4 0 0 07 18 - - 0 0 08 1C r3 p8 0 0 19 20 r7 p10 0 0 110 24 r7 p14 0 0 111 28 r1 p11 0 0 1

RegisterMAP

r1 p16r2 p12r3 p13r4 p1r5 p2r6 p7r7 p15

RegisterMAP 1

r1 p11r2 p12r3 p8r4 p1r5 p2r6 p7r7 p10

Setting the speculative bits




Example+speculation (cycle 5)Entry PC Ri Pi,old St Exc Cplt Spec1 0 r3 p6 0 0 02 4 r7 p5 0 0 03 8 r7 p9 0 0 04 C r1 p3 0 1 05 10 - - 1 0 06 14 r2 p4 0 1 07 18 - - 0 1 08 0 r3 p8 0 0 19 4 r7 p10 0 0 110 8 r7 p14 0 0 111 C r1 p11 0 0 112 10 - - 1 0 113 14 r2 p12 0 0 114 18 - - 0 0 1

RegisterMAP

r1 p16r2 p17r3 p13r4 p1r5 p2r6 p7r7 p15

RegisterMAP 1

r1 p11r2 p12r3 p8r4 p1r5 p2r6 p7r7 p10

Reset the speculative bitsCorrect prediction




10101010101010

Example+speculation (cycle 6)Entry PC Ri Pi,old St Exc Cplt Spec1 0 r3 p6 0 1 02 4 r7 p5 0 0 03 8 r7 p9 0 0 04 C r1 p3 0 1 05 10 - - 1 0 06 14 r2 p4 0 1 07 18 - - 0 1 08 0 r3 p8 0 0 09 4 r7 p10 0 0 010 8 r7 p14 0 0 011 C r1 p11 0 1 012 10 - - 1 0 013 14 r2 p12 0 0 014 18 - - 0 0 015 0 r3 p13 0 0 216 4 r7 p15 0 0 217 8 r7 p19 0 0 218 C r1 p16 0 0 2

RegisterMAP

r1 p21r2 p17r3 p18r4 p1r5 p2r6 p7r7 p20RegisterMAP 1

r1 p11r2 p12r3 p8r4 p1r5 p2r6 p7r7 p10

RegisterMAP 2

r1 p16r2 p17r3 p13r4 p1r5 p2r6 p7r7 p15

Predict TAKEN

snapshotof RM0

at cycle 5



Issued CompletedeXecution Committed

Example+speculation (cycle 7)Entry PC Ri Piold St Exc Cplt Spec2 4 r7 p5 0 1 03 8 r7 p9 0 0 04 C r1 p3 0 1 05 10 - - 1 1 06 14 r2 p4 0 1 07 18 - - 0 1 08 0 r3 p8 0 0 09 4 r7 p10 0 0 010 8 r7 p14 0 0 011 C r1 p11 0 1 012 10 - - 1 0 013 14 r2 p12 0 1 0 14 18 - - 0 1 015 0 r3 p13 0 0 216 4 r7 p15 0 0 217 8 r7 p19 0 0 218 C r1 p16 0 0 219 10 - - 1 0 220 14 r2 p17 0 0 221 18 - - 0 0 2

RegisterMAP

r1 p21r2 p22r3 p18r4 p1r5 p2r6 p7r7 p20

RegisterMAP 2

r1 p16r2 p17r3 p13r4 p1r5 p2r6 p7r7 p15

Wrong prediction(assumption)

Restore themapping

The entries with speculative bits==2 are eliminated from ROB



Issued CompletedeXecution Committed

Example+speculation (cycle 8)Entry PC Ri Piold St Exc Cplt Spec3 8 r7 p9 0 0 04 C r1 p3 0 1 05 10 - - 1 1 06 14 r2 p4 0 1 07 18 - - 0 1 08 0 r3 p8 0 0 09 4 r7 p10 0 0 010 8 r7 p14 0 0 011 C r1 p11 0 1 012 10 - - 1 0 013 14 r2 p12 0 1 0 14 18 - - 0 1 0

RegisterMAP

r1 p16r2 p17r3 p13r4 p1r5 p2r6 p7r7 p15

The speculative instructionsare eliminated also from the pipeline



Renaming in the ROB

• #logical registers == #physical registers
• The ROB also contains the "renamed" values
• The ROB commits in order
• The registers provide the values only at dispatch time
  • The values may also come from the FUs or the ROB itself
• Similar to the method used in the Intel P6 architecture (i.e., Pentium Pro, II, III) and the PowerPC 604

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]

Register Alias Table (RAT)

• Used during the renaming (at dispatch time)

• The RAT contains (for each register):
  Register flag   1 if the most recent value is in the Register File
  ROB flag        1 if the most recent value is in the ROB
  ROB#            if the value is in the ROB, it points to the entry

• If both the Register flag and the ROB flag are ZERO, the value is yet to be produced

RAT-SLOT: Rf (1 bit) | ROBf (1 bit) | ROB# (e.g. 6 bits)

Issue Buffer

• Similar to Tomasulo’s Reservation Stations

• Contains:
  Busy     Currently busy (not available)
  Op       Opcode
  Qj, Qk   Tag fields (indicate the associated ROB slots)
  Vj, Vk   Values of the operands (sources)
  ROB#     Associated ROB element that holds the result

IB-SLOT: Busy (1 bit) | Op (e.g. 8 bits) | Qj (1 bit) | Qk (1 bit) | Vj (16 bits) | Vk (16 bits) | ROB# (e.g. 6 bits)

ROB Renaming

• 4 major steps
  • Dispatch
  • Issue
  • Complete (writeback)
  • Commit (retire)

[Figure: the ROB as a queue. When dispatched: reserve entry at tail. When execution finished: place data in entry. When needed: bypass to other instructions. When complete (commit): remove from head.]


ROB Renaming: Dispatch

• Check for a free slot in the issue window
  • If not available, stall due to structural hazard
  • If available, assign a window slot and a ROB entry
• Copy the Opcode and the ROB# to the issue window slot
• For each source operand, consult the RAT
  • If the value is in the register file, copy the value to the issue window slot
  • If the value is in the ROB, copy the value to the issue window slot
  • Else, copy the ROB# to the Q field of the issue window slot
• For the destination register
  - Place the ROB# into the RAT
  - Clear the register and ROB flags in the RAT
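The dispatch step with a RAT can be sketched as follows. This is a minimal illustration of the rules above; the dictionary encoding of RAT entries (`rf`, `robf`, `rob`), the window size and the use of a growing list instead of a circular ROB are assumptions of the sketch.

```python
def dispatch(instr, rat, regfile, rob, window, window_size=4):
    """Dispatch one (op, dest, sources) instruction; return False on a stall."""
    op, dest, srcs = instr
    if len(window) >= window_size:
        return False                          # no free window slot: structural hazard
    rob_id = len(rob)                         # reserve the entry at the ROB tail
    rob.append({"dest": dest, "value": None, "cplt": False})
    slot = {"op": op, "rob": rob_id, "V": [], "Q": []}
    for r in srcs:
        entry = rat[r]
        if entry["rf"]:                       # most recent value is in the register file
            slot["V"].append(regfile[r])
            slot["Q"].append(None)
        elif entry["robf"]:                   # most recent value is in the ROB
            slot["V"].append(rob[entry["rob"]]["value"])
            slot["Q"].append(None)
        else:                                 # value not produced yet: wait on the ROB#
            slot["V"].append(None)
            slot["Q"].append(entry["rob"])
    if dest is not None:                      # destination: ROB# into the RAT, flags cleared
        rat[dest] = {"rf": False, "robf": False, "rob": rob_id}
    window.append(slot)
    return True

regfile = {"r1": 5, "r2": 7, "r3": 0, "r4": 0}
rat = {r: {"rf": True, "robf": False, "rob": None} for r in regfile}
rob, window = [], []
dispatch(("add", "r3", ["r1", "r2"]), rat, regfile, rob, window)
dispatch(("mul", "r4", ["r3", "r1"]), rat, regfile, rob, window)
print(window[0])
print(window[1])
```

The second instruction reads r3 before it has been produced, so its slot holds the ROB# of the add in the Q field instead of a value.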


[Figure: ROB Renaming, Dispatch step (Registers, RAT, Issue Window, Functional Units, reorder buffer)]


ROB Renaming: Issue

• Wakeup:
  • When all values are ready in the window slot (the instruction has all required data)
• Select:
  • Issue logic arbitrates requested and available resources
  • Selected instructions read register values and issue to functional units for execution


[Figure: ROB Renaming, Issue step (Registers, RAT, Issue Window, Functional Units, reorder buffer)]


ROB Renaming: Completion

• Write the result into the ROB entry
• Update the RAT: set the ROB flag if the ROB entry matches the ROB# in the RAT
  • I.e., this is the most recent version
• Window source entries monitor the ROB#
  • All matching entries capture the result data
• Place any exceptions in the ROB
• Mark the ROB entry complete


[Figure: ROB Renaming, Completion step (Registers, RAT, Issue Window, Functional Units, reorder buffer)]


ROB Renaming: Commit

• When the instruction at the head of the ROB is complete
• If no exception
  • Remove the instruction from the ROB
  • Copy the value to the register file
  • Update the RAT: if the ROB entry matches the ROB# in the RAT
    - set the register flag
• If exception
  • Set the exception bit in the trap register
  • Stop issue and commit
  • The register file holds the correct state
  • Vector to the trap handler
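Completion and commit with ROB renaming can be sketched as follows. This is a minimal illustration, assuming the same dictionary encoding for RAT entries and window slots used for illustration purposes only; clearing the ROB flag at commit and returning the string "trap" on an exception are assumptions of the sketch.

```python
def complete(rob_id, value, rob, rat, window):
    """Write the result into the ROB, update the RAT and wake up waiting slots."""
    entry = rob[rob_id]
    entry["value"], entry["cplt"] = value, True
    dest = entry["dest"]
    if dest is not None and not rat[dest]["rf"] and rat[dest]["rob"] == rob_id:
        rat[dest]["robf"] = True              # most recent version is now in the ROB
    for slot in window:
        for i, tag in enumerate(slot["Q"]):
            if tag == rob_id:                 # matching entries capture the result
                slot["V"][i], slot["Q"][i] = value, None

def commit(head, rob, rat, regfile):
    """Try to commit the ROB head; return (new_head, event)."""
    entry = rob[head]
    if not entry["cplt"]:
        return head, None                     # head not complete yet
    if entry["exc"]:
        return head, "trap"                   # register file still holds correct state
    dest = entry["dest"]
    if dest is not None:
        regfile[dest] = entry["value"]
        if not rat[dest]["rf"] and rat[dest]["rob"] == head:
            # set the register flag; the ROB flag is cleared too (assumption)
            rat[dest] = {"rf": True, "robf": False, "rob": None}
    return head + 1, None

regfile = {"r1": 5, "r2": 7, "r3": 0}
rat = {r: {"rf": True, "robf": False, "rob": None} for r in regfile}
rat["r3"] = {"rf": False, "robf": False, "rob": 0}
rob = [{"dest": "r3", "value": None, "cplt": False, "exc": False}]
window = [{"op": "mul", "rob": 1, "V": [None, 5], "Q": [0, None]}]

early = commit(0, rob, rat, regfile)          # nothing to commit yet
complete(0, 12, rob, rat, window)
robf_after_complete = rat["r3"]["robf"]
head, event = commit(0, rob, rat, regfile)
print(early, head, event, regfile["r3"], window[0])
```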


[Figure: ROB Renaming, Commit step (Registers, RAT, Issue Window, Functional Units, reorder buffer)]


Multi-Ported Caches

• For superscalar memory bandwidth
  • True multi-ported
    - (higher cost, slower access)
  • Interleaved
    - (control complexity)
  • Multi-access per cycle
    - (slower clock)
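The control issue of the interleaved option can be illustrated with a small sketch. It assumes two banks selected by the least significant address bit (even and odd addresses) and that two accesses to the same bank are serialized over two cycles; both the granularity of the interleaving and the conflict policy are assumptions, not taken from the slides.

```python
def bank_of(addr):
    """Bank selected by the least significant bit: 0 = even addresses, 1 = odd."""
    return addr & 1

def cycles_needed(access_pairs):
    """Cycles to serve pairs of (pipe 1, pipe 2) addresses on a two-bank cache."""
    cycles = 0
    for a1, a2 in access_pairs:
        if bank_of(a1) != bank_of(a2):
            cycles += 1      # different banks: both accesses in the same cycle
        else:
            cycles += 2      # bank conflict: the second access waits one cycle
    return cycles

pairs = [(0, 1), (2, 4), (3, 6)]
print(cycles_needed(pairs))
```

Detecting and resolving such conflicts is the control complexity that the interleaved organization pays in exchange for avoiding true multi-porting.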


[Figure: three organizations serving pipe 1 and pipe 2 (address and data paths): a true multi-ported Cache, an interleaved cache with Even Addresses and Odd Addresses banks, and a multi-access-per-cycle cache in which a mux driven by a 1/2-cycle clock shares the RAM between the two pipes]

SUPERSCALAR CASE STUDIES


Case Study: MIPS R10000

[Figure: MIPS R10000 block diagram. Labels: PC, ITLB (8 entry), Instr. Cache (32 KB, 2-way), Pre-decode, Resume Cache, BHT (512x2), Decode, Map Table, Dispatch, Active List, 16-entry Integer, Addr. and FP buffers, Ld/Sto Queue, Integer Registers (32 logical, 64 physical), FP Registers (32 logical, 64 physical), ALU1, ALU2, shift, mul/div, Addr. Adder, FP Add, FP Mult, div/sqrt, DTLB (64 entries), Data Cache (32 KB, 2-way), L2 Cache Interface (128)]


MIPS R10000, cont

• 4-way superscalar (dispatch and commit)
  • 1 memory
  • 2 ALU
  • 2 FP
• Register renaming (32 logical, 64 physical)
• Reorder buffer (active list)
• 4-deep branch speculation
• Resume cache: keeps instructions on the non-predicted path for fast recovery
• 2-way set-associative 32 KB instr. and data caches


Case Study: DEC 21164

[Figure: DEC 21164 block diagram. Labels: PC, ITLB (48 entry), Instr. Cache (8 KB, direct), BHT (2K x 2), Instruction Buffers (2 x 4), Decode, Dispatch, Integer Registers (32 logical), FP Registers (32 logical), ALU1, ALU2, shift, branch, FP Add/Div., FP Mult, DTLB (48 entry), Data Cache (8 KB, direct), L2 Cache (96 KB, 3-way), 128-bit paths]


DEC 21164, cont

• 4-way superscalar
  - 2 memory
  - 2 ALU
  - 2 FP
• Issue in-order
• Interrupts: RSR + bypasses; FP imprecise
• 1-deep branch speculation
• Direct-mapped 8 KB instruction and data caches
• On-chip L2 cache


Case Study: AMD K5

[Figure: AMD K5 block diagram: PC, BHT (1K x 1), 16 KB instruction cache, predecode, ROP decode, dispatch, registers (40 words), reorder buffer, ALU0/shift, ALU1, Ld/Sto 0, Ld/Sto 1, FPU, branch unit, 8 KB data cache, TLB (128 entry)]


AMD K5, cont

• 4-way superscalar
  - 2 memory
  - 2 ALU
  - 1 FP
  - 1 Branch
• Uses ROB renaming
  - Reservation stations hold control only, not data
  - Data read from ROB + register file at issue time
• 16-entry reorder buffer
• 1-deep branch speculation
• 16 KB instruction cache and 8 KB data cache
• CISC instructions converted to RISC Ops (ROPs)

[Figure: registers, reservation stations, functional units, reorder buffer]


CISC to ROP conversion

• ROPs are like microcode
• Predecode helps identify instruction boundaries
• CISC instructions placed into Byte Queue
• Byte Queue instructions broken into 4 ROPs per cycle

[Figure: Instruction Cache, Byte Queue, Parse/Duplicate, four ROP Convert units, MROM]

Example:

add EAX,[EBP+d8]   (2 ROPs):  load temp,[EBP+d8]; add EAX,temp
cmp EAX,imm32      (1 ROP):   cmp EAX,imm32
push ECX           (2 ROPs):  sub ESP,4; str [ESP],ECX
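The example can be replayed with a small sketch that expands each instruction through a lookup table and packs the ROPs four per cycle. The table holds only the three instructions from the slide, and letting one instruction's ROPs straddle two cycles is an assumption of this sketch, not a statement about the real K5 decoder.

```python
# ROP expansions taken from the slide's example; not a general x86 decoder.
ROP_TABLE = {
    "add EAX,[EBP+d8]": ["load temp,[EBP+d8]", "add EAX,temp"],
    "cmp EAX,imm32": ["cmp EAX,imm32"],
    "push ECX": ["sub ESP,4", "str [ESP],ECX"],
}


def convert(byte_queue, rops_per_cycle=4):
    """Expand instructions in program order and group the ROPs into cycles."""
    rops = [rop for insn in byte_queue for rop in ROP_TABLE[insn]]
    return [rops[i:i + rops_per_cycle] for i in range(0, len(rops), rops_per_cycle)]
```

With the three instructions above, five ROPs are produced: four are dispatched in the first cycle and the remaining store in the second.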


Case Study Intel P6

[Figure: Intel P6 block diagram: PC, BTB (512 x 4), 8 KB instruction cache, instruction byte queue, ROP decode, dispatch, reservation stations (20), ALU0, ALU1, FPU, load address, store address, and store data units, memory reorder buffer, 8 KB data cache, TLB (64 entry), registers, reorder buffer]


Intel P6, cont

• Uses ROPs (µops), similar to K5
• 20 unified reservation stations
• ROB renaming similar to K5
• Up to 3 simple instructions decoded per cycle; only 1 complex
• Up to 5 ROPs issue per cycle
• Up to 3 ROPs retire per cycle
• Max 1 load and 1 store per cycle
• Hit-under-miss L1 cache
• CPU and L2 cache on MCM
• 512-entry, 4-way set-associative BTB


Intel P6 Pipeline Timing

• 12-stage pipeline
• Requires good branch prediction

[Figure: P6 pipeline stages: BTB Access, I-Cache Access, Decode/ROP Gen., Rename, Reg., Dispatch, Issue, Execute, Retire; memory operations add Addr. Add and D-Cache Access]


Deep Pipelines

• Another important trend

P5, 5 stages: Pref, Dec 1, Dec 2, Exec, Wrt Bck
P6, 12 stages: F1, F2, D1, D2, D3, Rn, ROB, Rd/Sch, Disp, Ex, Ret1, Ret2
Willamette, 20 stages: IP1, IP2, TC1, TC2, Dr, Al, Rn, Q, S1, S2, S3, Dp1, Dp2, R1, R2, Ex, Flgs, Br, Dr, WB


Source: "A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems," S.W. Keckler, Doug Burger, C.R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M.S. Hrishikesh, N. Ranganathan, and P. Shivakumar. International Solid-State Circuits Conference (ISSCC), pp. 1068-1069, February, 2003.


Intel Core-i7 (Nehalem Architecture)


Intel Core-i7-2 (Sandy-Bridge Architecture)


Intel Core-i7-4 (Haswell Architecture)


Intel Superscalar Comparison


ARM Cortex-M• Why it is important:


ARM Cortex-M7


AMD K10 (years 2007-2012)


By appaloosa - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3535462

AMD 15h “Bulldozer” (years 2011-present)


By Shigeru23 - Made by uploader (ref:[1], [2], [3]), CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=17130257

AMD Ryzen (2017-present)


RISC-V (2011-present) – Open Source Chip


COTSON TUTORIAL

‘o e to Gio giO to e

Roberto Giorgi, University of Siena

COTSon Outline

• COTSon Concepts
• Overview of features
• Comparison with other evaluation approaches
• Simulation configuration
• Virtualizer overview: SimNow
• Running COTSon Examples
• Case study: analyzing CJPEG
• Advanced features: defining a Region Of Interest

COTSON CONCEPTS


COTSon Working Spaces

• COTSon creates two working spaces:
  – HOST SPACE
  – GUEST SPACE
• The HOST SPACE is where all models and the simulation control run
  – E.g., a cache model
• The GUEST SPACE is where a Virtualized Platform runs
  – E.g., a system with Linux/Windows/Android
• The two spaces communicate through a well-defined interface
  – The GUEST is instructed to send functional information to the HOST, transparently to any GUEST functionality and GUEST software
  – The HOST can control the simulation, define events and their timing, and feed back to the guest the appropriate timing info, as if it was interacting with a real component

[Figure: GUEST SPACE and HOST SPACE]


Functional Directed Simulation

• Functional quantums: … of Sim seconds (simulated time)
• About … of Host seconds (simulation time, or wall-clock time)

[Figure: timeline alternating between Functional Mode (GUEST), free running for one quantum and passing its events, and Timing Mode (HOST), calculating the actual timing; the HOST asks for a QUANTUM of events and feeds back a CPI; the axes are wall-clock time and simulated time (with slowdown)]

Classical Evaluation Methodology

• Purposes of the common, shared platform
  – Evaluate the research performed by each partner
  – Transfer the respective knowledge to the other partners

[Figure: the COTSon EVALUATION PLATFORM takes the APPS (e.g. fibx.c, mmx.c) and the APP INPUT and produces the APP OUTPUT and the PERFORMANCE METRICS, e.g. a plot of speedup vs. # cores (1, 2, 3, 4, avg) for mmul and fib]


OVERVIEW OF FEATURES


COTSon Overview


COMPARISON WITH OTHER APPROACHES


Comparison among different approaches for doing research related to multiple nodes evaluation (information revised from data of the RAMP project)

                          SMP    Cluster   FPGA     Emulator   Simulator
Scalability (1K cores)    C      A         A        A          A
Cost (1K cores)           F      C         B        A+         A+
Power/Space (kW, racks)   D      D         A        A+         A+
Observability             D      C         A+       A+         A+
Reproducibility           B      D         A+       A+         A+
Reconfigurability         D      C         A+       A+         A+
Credibility               A+     A+        B+/A-    F/D        C
Development time          B      B         C        A+         A+
Performance (clock)       A      A         C        B          C
x86-64 ISA                A+     A+        F        A+         A+
Modifiable                F      F         B        A          A
GPA                       D      D         B+/A-    B          A


Functional-Directed approach

• A timing simulator can also use different approaches depending on the relationship between the functional model (f) and the timing model (t) [Mauer]:
  i) functional-first, or trace-driven: the f is run first and separately, and the t is run later on in a completely decoupled fashion (all of f is run before the t is run);
  ii) timing-directed, or execution-driven: the f and t are closely coupled (no decoupling);
  iii) timing-first: the t drives the f, both are completely decoupled, but the function has to be checked later on and eventually undone;
  iv) functional-directed: the f drives the t, both are completely decoupled, the function is always the right one, but we need a timing feedback from t to correct the timing [Argollo].


FUNCTIONAL/TIMING SIMULATION

• Timing and Functional Simulator Integrated (SimOS)
  - Complex, no reuse, very slow
• Functional-First (Trace-driven): Functional Simulator (complete function, no timing) feeding a Timing Simulator
  + Fast
  - No timing feedback
• Timing-Directed (Exec-driven): Functional Simulator (no timing, complete function) coupled with a Timing Simulator (complete timing, no? function)
  + Timing feedback
  - Tight coupling
  - Very slow
• Timing-First (Multifacet): Timing Simulator (complete timing, partial function) checked against a Functional Simulator (no timing, complete function)
  + Timing feedback
  + Using existing simulators
  + Software development advantages
  - Slow

Source: Multifacet Project, www.cs.wisc.edu/multifacet [Mauer, SIGMETRICS, "Full-System Timing-First Simulation"]


COTSon: FUNCTIONAL-DIRECTED

• A variant of functional-first
• Adds timing feedback at coarse granularity (…s to …s of instructions)
• Applications see an approximation of time
• May miss some fine-grain timing interaction
• Compatible with fast caching emulators and samplers
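The coarse-grain feedback loop can be sketched as follows. This is an illustration under assumptions, not COTSon code: the function name, the `timing_model` callback, and the use of a single CPI value as the only feedback are all hypothetical simplifications.

```python
def run_functional_directed(total_instructions, quantum, timing_model, initial_cpi=1.0):
    """Advance simulated time one quantum at a time.

    timing_model(executed, n) stands in for the HOST timing models: it is
    called after each quantum and returns the CPI estimated from that
    quantum's events. Returns (simulated_cycles, cpi_history).
    """
    cpi = initial_cpi
    simulated_cycles = 0.0
    executed = 0
    history = []
    while executed < total_instructions:
        n = min(quantum, total_instructions - executed)
        # GUEST, functional mode: run n instructions using the current CPI
        # as the guest's notion of time.
        simulated_cycles += n * cpi
        executed += n
        # HOST, timing mode: process the events of this quantum and feed the
        # corrected CPI back to the guest for the next quantum.
        cpi = timing_model(executed, n)
        history.append(cpi)
    return simulated_cycles, history
```

Because the correction only arrives at quantum boundaries, the guest sees an approximation of time: any interaction shorter than one quantum is accounted for with the previous quantum's CPI.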


OTHER RECENT x86 SIMULATORS

[Figure: map of recent simulators, from multi-node to timing-directed/integrated]

Adapted from: Heirman, ISPASS Tutorial, "The SNIPER multi-core simulator" (120401)


SIMULATION CONFIGURATION


COTSon simulation configuration

The COTSon simulation infrastructure is controlled by setting all the relevant information about the simulation and the target system configuration in an input configuration file.

COTSon uses the Lua scripting language to manage this configuration file:

• The first section of the file describes global options;
• The second section of the file describes the SimNow configuration;
• The third section of the file describes the target system configuration.


COTSon simulation configuration

A simple example of a COTSon configuration file comes with the simulator installation:

1. Let's assume the COTSon installation directory is:

./cotson

2. Let's move to the directory:

cotson@cotson1$ cd ./cotson/trunk/src/examples

3. Open the simple CPU simulation example:

cotson@cotson1$ gedit one_cpu_simple.in


COTSon simulation configuration: global options section

options = {
    max_nanos = "…M",
    sampler = {
        type = "simple", quantum = "…k"
    },
    heartbeat = {
        type = "file_last", logfile = "one_cpu_simple.log"
    },
}

• max_nanos: sets the duration of the entire simulation to … millions of nanoseconds.
• sampler: calculates timing for all events; HOST/GUEST synchronization happens every …k ns (… us), i.e. the quantum.
• heartbeat: sets the simulator for periodically storing simulation statistics. The period between two stores is called the heartbeat.
• logfile: simulation statistics are recorded in a log file called one_cpu_simple.log.




SimNow configuration section

simnow.commands = function()
    use_bsd('./cotson/trunk/data/1p.bsd')
    use_hdd('./cotson/trunk/data/karmic64.img')
    set_journal()
    send_keyboard('gcc -O… -… -… /home/user/test.i')
end

• The set of SimNow commands is grouped in a function.
• use_bsd: specifies the position in the host computer where the SimNow configuration BSD resides. The 1p.bsd sets a single-CPU machine.
• use_hdd: specifies the position in the host computer where the hard-disk image resides. The karmic64.img sets a small hard disk with the Ubuntu Karmic-64 Linux image installed.
• set_journal: enables the journaling of the file system (see the previous action, use_hdd).
• send_keyboard: allows the user to run a command inside the OS of the simulated machine (i.e., Ubuntu Karmic-64 Linux). In this case the hard-disk image contains a test.i file that is compiled with gcc inside the simulated system.



Target system configuration section

function build()
    i = 0
    while i < disks() do
        disk = get_disk(i)
        disk:timer{ name = 'disk'..i, type = "simple_disk" }
        i = i + 1
    end
    i = 0
    while i < nics() do
        nic = get_nic(i)
        nic:timer{ name = 'nic'..i, type = "simple_nic" }
        i = i + 1
    end
    …

• The configuration of the simulated machine is described within a function called build.
• First, the function checks the number of disks attached to the system. For each discovered disk, the function sets the proper timer.
• Similarly to the case of the disks, the function checks the number of network interfaces (NICs) attached to the system. For each discovered NIC, the function sets the proper timer.


COTSon simulation configuration

    n = cpus()
    if n ~= 1 then
        error("This experiment only wants to handle 1 cpu")
    end
    cpu = get_cpu(0)
    cpu:timer{ name = 'cpu0', type = "timer…" }
    mem = Memory{ name = "main", latency = … }
    l2 = Cache{ name = "l2cache", size = "…kB",
        line_size = …, latency = …, num_sets = …, next = mem,
        write_policy = "WB", write_allocate = "true"
    }
    …
    cpu:instruction_cache(ic)
    cpu:data_cache(dc)
    cpu:instruction_tlb(it)
    cpu:data_tlb(dt)
end

• Get the number of CPUs installed in the target system. If the number is different from 1, the simulation is stopped. CPU numbering starts from 0.
• Select the CPU with ID 0, and then set the timer for the CPU.
• Set a memory device representing the external DRAM of the target system. For the memory it is possible to specify the main features, such as the latency.
• In a hierarchical fashion, all the components of the target system are specified. In this case, the second-level cache is instantiated. For each component the main features are configured (e.g., latency, size, number of sets).
• Once all the components are instantiated, they are connected to the CPU by calling a set of functions.



VIRTUALIZER OVERVIEW: SIMNOW


SimNow overview

SimNow is a virtualizer that is used in COTSon as the functional simulator. It has been developed by AMD. It allows the user to configure a full-system target architecture by changing the various components (e.g., CPU type, number of CPUs, main memory size and organization, etc.).

The main features of SimNow are:

• Several CPU models are available: Turion, Athlon, Opteron, …;
• Dynamic translation of instructions: the instruction input stream (target ISA) is translated into a C-like equivalent code that is then compiled for the native machine;
• Deterministic execution.


SimNow overview

The configuration of the functionally simulated machine is stored in a configuration file called BSD (BroadSword Document). The file contains all the devices, the device attributes, the connection points AND a snapshot of the current point of simulation. The SimNow distribution comes with several BSD configurations.

SimNow is controlled by a shell which controls the simulation, discovers devices, instantiates devices, and interfaces with devices.

Here a device is a shared library that interfaces with the shell environment.

From a user viewpoint, the simulation environment is composed of two main windows (both are hidden during a COTSon simulation):

• SimNow Command Window (for the sake of simplicity we can ignore this);
• SimNow User Interface Window.


SimNow User Interface Window

• The Video Output Area: displays the guest screen as in the display of a real system (Linux, Windows, Android)


SimNow Parameters

Which devices are configured?

• Disk images: represent physical drives such as hard disks (.hdd image) or CD drives (.iso image). For hard disks, DiskTool can be used to create an empty device on which to install an OS;
• BIOS: is treated as a memory device, and can be loaded with a BIOS initialization ROM;
• DRAM: the device represents a DIMM module that can be configured in terms of memory type, densities, bank, etc. through the SPD interface;
• CPU: can be configured in terms of CPU family, stepping, etc.

SimNow parameters are visualized in a graphical way, which makes it easier to analyze and/or change the configuration of the devices (view → show devices).

AMD SimNow Virtualizer Setup


SimNow Xtools

The Guest OS in SimNow can communicate with the external world through Xtools. Xtools are run inside the simulated machine:

• xget <source_path> <dest_path> to transfer a file from the host to the guest;
• xput <source_path> <dest_path> to transfer a file from the guest to the host;
• And also (less used): simcontrol to control the simulation; time; echo.


SimNow additional instruments

SimNow allows the user to debug the target system exploiting a set of possible different instruments:

• SimNow internal Debugger;
• SimNow debugging via Serial Port;
• Analyzer Interface, to build callback functions in the host (COTSon AbAeterno, the main COTSon simulation loop, is one of them);
• Monitor Interface, to build callback functions in the host with even more guest state information (more accurate but slower).


SimNow internal Debugger

The COTSon simulator infrastructure allows the user to debug the target system exploiting a set of possible different instruments:

• SimNow internal Debugger;
• SimNow debugging via Serial Port;
• Analyzer Interface;
• Monitor Interface.

The internal debugger is a device that connects to a CPU model with the following capabilities:

• Set breakpoints;
• View/Alter GP registers and MSRs;
• View/Alter linear or physical memory;
• Perform I/O or PCI configuration cycles.


SimNow debugger


SimNow debugging via serial ports

The COTSon simulator infrastructure allows the user to debug the target system exploiting a set of possible different instruments:

• SimNow Debugger;
• SimNow debugging via Serial Port;
• Analyzer Interface;
• Monitor Interface.

Serial I/O ports are interfaced using:

• Named pipes:
  - ~/.simnow/com…/simnow_in for writing
  - ~/.simnow/com…/simnow_out for reading
• A serial host port (requires supervisor privileges)

The pipe is opened by launching the following command in the SimNow Command Window:

Serial:….SetCommPort pipe


SimNow Analyzer interface

The COTSon simulator infrastructure allows the user to debug the target system exploiting a set of possible different instruments:

• SimNow Debugger;
• Serial Port Interfacing;
• Analyzer Interface;
• Monitor Interface.

Analyzers are callback functions written in C code. The analyzers are dynamically linked to the simulator in order to gather statistics.

See the analyzer SDK documentation for more detail.


RUNNING COTSON EXAMPLES


Experimental session

Let's start with a full functional simulation:

• No timing models are used;
• Only the functional behavior is reproduced;
• A simple machine composed of 1 CPU, … GB of main memory and the Karmic-64 Linux distribution is run.

one_node_script='functional'
display=os.getenv("DISPLAY")

simnow.commands=function()
    use_bsd('1p.bsd')
    use_hdd('karmic64.img')
    set_journal()
end


Experimental session

Let's start with a full functional simulation (cont'd):

Assume to be under the cotson installation directory: ./cotson/trunk/

Move to the example directory:

cotson@cotson1$ cd ./src/examples

Run the simulation:

cotson@cotson1$ make run_functional

The command for running an example is make run_<COTSon configuration>, i.e. run_ followed by the name of the Lua script that contains the COTSon configuration.


Experimental session

Exercise: Try to run the one_cpu_simple example.

• What is the difference between the two configurations?
• Is the output different from the previous example?

Functional simulation: there is an open, active SimNow window where we read the MIPS for the simulated machine. No additional information is provided.

Timing simulation: no active SimNow windows appear in the output. The output of the simulation provides the information about the performance of the simulated architecture in terms of Instructions Per Cycle (IPC). This information is available only when the timing simulation is enabled, since the computation of the IPC requires a detailed modeling of the target system.


Experimental session

An execution trace is the result of recording all the executed instructions along with the memory accesses performed by the target system.

The trace is generally stored in a file, for successive analysis. The execution trace is an important instrument to analyze the behavior of the target system and to analyze the behavior of the executed application.

The COTSon simulation infrastructure allows the user to extract an execution trace, along with the main performance parameters (e.g., cache misses, number of load/store instructions, etc.).


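Experimental session

Consider the following simple while-loop.c application:

#include <stdio.h>

int main(void)
{
    /* The slide lists 10 initializers; the values below are placeholders. */
    int vec[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    int sum = 0;
    int i = 0;

    while (i < 10) {
        sum = sum + vec[i];
        i++;
    }
    return 0;
}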


Experimental session

Let's start modifying the one_cpu_simple example in order to extract an execution trace:

1. Open the one_cpu_simple.in Lua file:

cotson@cotson1$ cd COTSON/cotson-6…/trunk/src/examples
cotson@cotson1$ gedit one_cpu_simple.in

2. Modify the SimNow command section in order to load and execute the simple compiled while-loop test program:

send_keyboard('xget /homelocal/cotson/COTSON/while-loop /home/user/while-loop')
send_keyboard('chmod +x while-loop; ./while-loop')

3. Add the trace logger to the CPU configuration:

cpu=get_cpu(0)
cpu:timer{ name='cpu0',
    type="trace_stats", trace_file="/homelocal/cotson/COTSON/one_cpu_trace.log" }


Experimental session

Exercise: Analyze the execution trace

• How many branches are in the trace?
• How many cache misses have happened in the execution?
• How many branches and jumps are there in the trace?
• How many instructions have been executed?

The branch and jump instructions are represented in the trace by instructions whose Op-Code has the form Jxx.

The performance log for the target system is recorded in the file:
node.1.one_cpu_simple.log

• timer.cycles gives you the total cycles of the simulation
• timer.instructions gives you the total number of executed instructions

The execution trace is recorded in the file (see your home directory):
one_cpu_trace.log
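Once the two counters have been read from the performance log, IPC follows directly. The sketch below only shows the arithmetic; the function names are illustrative and the numbers in the usage note are invented, not results of a real run.

```python
def ipc(instructions, cycles):
    """Instructions Per Cycle from the two timer counters."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def cpi(instructions, cycles):
    """Cycles Per Instruction, the reciprocal of IPC."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions
```

For example, with a hypothetical timer.instructions of 1000 and timer.cycles of 2000, `ipc(1000, 2000)` is 0.5 and `cpi(1000, 2000)` is 2.0.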


Experimental session

Exercise: Complex application analysis.

Try to analyze the behavior of the target system architecture (CPU) when a complex application is executed.

As a complex reference application we can use the cjpeg image compressor application, along with the input image (.ppm).

Modify the SimNow command section in order to load and run the cjpeg application with the input image.


PRACTICAL TEST


COTSon installation

• Follow the COTSON USER GUIDE for the general installation procedure:
  http://sourceforge.net/p/cotson/code/HEAD/tree/trunk/doc/COTSON_USER_GUIDE-….pdf
  (here we assume that you have installed COTSon in the ~/cotson directory)
• Download: http://www.dii.unisi.it/~giorgi/teaching/hpca…/metatools/benchmark_cjpeg.tar.gz
• Examples folder: ~/cotson/src/examples/
• Uncompress CJPEG_exercise in the examples folder:
  $ tar xzf benchmark_cjpeg.tar.gz -C ~/cotson/src/examples/


CJPEG program: benchmark

• The CJPEG program belongs to libjpeg-turbo-utils: utilities for manipulating JPEG images
• This benchmark compresses the named image file, or the standard input if no file is named, and produces a JPEG file on the standard output. The currently supported input file formats are: PPM, PGM, and so on
• In our case, this benchmark needs an INPUT ppm image and produces an OUTPUT jpeg image
• The directory cjpeg_benchmark contains input and expected output files


CJPEG

[Figure: cjpeg compresses a PPM image (192 KB) into a JPEG image (9.6 KB)]


CJPEG compile and execution

• Launch the complete benchmark with:
  make
  - Compare the files produced with those in the expected_output directory


How to launch CJPEG manually

• If you want to launch cjpeg manually, you can use:
  $ ./cjpeg < input-large.ppm > output_large.jpeg
  – Are the results different?
  – What does it mean?


Cache Configuration

• Examples of cache configuration are:

        L1 dcache                          Memory latency
        Size     Line size   Num sets
  A)    1KB      16          1             24
  B)    32KB     16          1             100

• You can modify the cache configuration inside the Lua examples file

Linux commands:
• vi <file name>: open the editor
• i: insert mode
• esc: exit insert mode
• :wq: write the file and quit


Set cache parameters

• Example of cache configuration for memory A:

Main memory:
- mem=Memory{ name="main", latency=… }

L2 cache:
- l2=Cache{ name="l2cache", size="…kB", line_size=…, latency=…, num_sets=…, next=mem, write_policy="WB", write_allocate="true" }

L1 instruction cache:
- ic=Cache{ name="icache", size="…kB", line_size=…, latency=…, num_sets=…, next=l2, write_policy="WT", write_allocate="false" }

L1 data cache:
- dc=Cache{ name="dcache", size="…kB", line_size=…, latency=…, num_sets=…, next=l2, write_policy="WT", write_allocate="false" }


What happens?

• Try to launch cjpeg with cache configuration A and cache configuration B on the large input

$ make run_cjpeg_benchmark_large_memoryA
$ make run_cjpeg_benchmark_large_memoryB

– What happens to the miss rate?
– What happens to the IPC?
– Why?


Cache Statistics

                          Memory A           Memory B
Input                     input_large.ppm    input_large.ppm
L1 dcache size            …kB                …kB
CPU cycles                …                  …
L1 dcache write_miss      …                  …
CPU instructions          …                  …
Simulation
L… read_miss_rate         …                  …
L… write_miss_rate        …                  …
Instruction Per Cycle     …                  …

Cache Statistics

                          Small Input        Large Input
L1 dcache size            …kB                …kB
Main Memory Access        …                  …
L… read                   …                  …
L1 dcache read            …                  …
L1 dcache write           …                  …
Simulation
Instruction per Cycle     …                  …
L… write miss rate        …                  …
L… read miss rate         …                  …

Cache Statistics

As we can see, with a small cache memory configuration we need more CPU cycles and we have more read misses. The miss rate of the L1 cache is better with the large cache than with the smaller one, as we expect.

Instead, when we use a large input image, we have more main memory accesses and larger simulation numbers. In fact a large memory input requires large CPU usage and a high number of CPU operations.
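The effect of cache size on the miss rate can be reproduced with a tiny direct-mapped cache model. This is an illustrative sketch, not the COTSon cache model: the access pattern (four sequential passes over a 4 KB buffer, one access every 4 bytes) is invented for the example, and only the 1 KB and 32 KB sizes and the 16-byte line echo configurations A and B.

```python
def miss_rate(addresses, size_bytes, line_bytes):
    """Miss rate of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = size_bytes // line_bytes
    tags = [None] * num_lines          # line number currently held by each slot
    misses = 0
    for address in addresses:
        line = address // line_bytes
        slot = line % num_lines
        if tags[slot] != line:
            misses += 1
            tags[slot] = line
    return misses / len(addresses)


# Hypothetical workload: 4 passes over a 4 KB buffer, one access per 4 bytes.
workload = list(range(0, 4096, 4)) * 4
small = miss_rate(workload, size_bytes=1024, line_bytes=16)
large = miss_rate(workload, size_bytes=32 * 1024, line_bytes=16)
```

With the small cache the 4 KB working set does not fit, so every pass misses once per line (`small` is 0.25); with the large cache only the first pass misses (`large` is 0.0625).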


DEFINING A REGION OF INTEREST


Region Of Interest (ROI)

• There are several methods to define a ROI in COTSon, depending on the accuracy that we need. The ROI selection is also called cotson tracer in COTSon terminology:
  – Exit.trigger cotson tracer
  – External cotson tracer
  – Internal cotson tracer
  – Accurate cotson tracer


exit.trigger tracer

options = { exit_trigger="terminate", ….. }

simnow.commands=function()
    …
    send_keyboard('xget cjpeg.sh cjpeg.sh; chmod +x cjpeg.sh; ./cjpeg.sh small A; xput cjpeg.sh; touch terminate; xput terminate')
end

The simulation stops once the terminate file is written from guest to host. This is a first approximation of the ROI, in which we can neglect a few ms. Simple to apply.

External cotson tracer

cotson_tracer is part of the guest tools, preinstalled in the BSD. Its arguments stand for an internal code that is reserved to activate a method to switch between functional and timing mode (selective sampler).

$ cotson_tracer … … …     ## switch to timing (start ROI)

$ ./cjpeg < input_small.ppm > output-small-A.jpeg

$ cotson_tracer … … …     ## back to functional (end ROI)

This method tightens the ROI; there is an error of some us that can be neglected for long programs.

Internal cotson tracer

int main() {
    int a, b, c;
    …..
    COTSON_INTERNAL(…, …, …);   // start ROI
    <detailed region of interest>
    COTSON_INTERNAL(…, …, …);   // end ROI
    ……
}

This method tightens the ROI; there is a small error of a few instructions that can be neglected for large regions of interest.

Accurate cotson tracer

int main() {
    int a, b, c;
    …..
    ATRACER_START(…);   // start ROI for region …
    <detailed region of interest>
    ATRACER_STOP(…);    // end ROI for region …
    ……
}

This method is more accurate since it interferes less with the code (good for testing smaller samples with higher accuracy).

Thanks!


COTSon simulation infrastructure

A more detailed picture of the simulated system and the related set of COTSon elements:


COTSon: Timer elements

Timer elements are essentially models for both CPUs and other system devices:

• Accept instructions, process them and update the metrics;
• All timers share the memory hierarchy;
• Pluggable architecture;
• Several models are created for: CPUs; profiling; trace generation; SimPoint-oriented analysis;
• Currently available timer types:
  - TraceStats: simple linear model;
  - Timer0: simple linear model with cache hierarchy;
  - Timer1: in-order pipeline model with cache hierarchy;
  - Bandwidth: only limited by the memory bandwidth;
  - PTLSim: out-of-order pipeline model with cache hierarchy.


COTSon: Sampler elements

Sampler elements are in charge of deciding when to call a timing model and for how much time to run the timing simulation:

• Pluggable architecture;
• Many possible implementations;
• Decide when to change the simulation state:
  - Functional (FN): run the fast functional simulation of the target;
  - Warming (WM): put data in stateful structures such as caches;
  - Timing simulation (TS): simulate the target accounting for timing info.

[Figure: simulation time line split into FN, WM, and TS phases]



Software Methods for ILP


Some Compiler Technologies

• SOFTWARE BRANCH PREDICTION (in the compiler)
  - Methods illustrated in the branch prediction lesson
• STATIC SCHEDULING + LOOP UNROLLING
  - Goal: improve the performance of pipelined and multiple-issue processors
  - These are the fundamental techniques for static-issue processors, i.e., for those processors which statically issue a group of instructions that are "packed" in a single long instruction: Very Long Instruction Word (VLIW) processors
  - Such techniques often improve the performance also in the case of dynamic-issue (i.e., superscalar) processors
  - The impact of branches and dependences is reduced
• SOFTWARE PIPELINING


Static Scheduling – the objective

• Find unrelated instructions to be interleaved among dependent instructions in order to hide latencies/stalls
• In order to avoid stalls, given two dependent instructions (the producer instruction P and the consumer instruction C) that need a latency of L cycles when processed in the pipeline in strict sequence:
  - the distance between P and C, in terms of clock cycles, could be filled by other instructions (not dependent on P and not causing dependences for C) with a total latency of at least L cycles

[Figure: timeline t (cycles) in which the P-to-C latency of the dependency, L=3, is filled by the independent instructions I1, I2, I3 placed between P and C]

Hypothesis for static reordering

• The latency between the producer instruction P and the consumer instruction C varies a lot with the type of Functional Unit (FU) involved and with the specific dependency
• The compiler can (statically) reorder instructions if
  - There is enough ILP
  - The FU latencies are known to the compiler
• Such a hypothesis creates a LINK between the software and the microarchitecture that "weakens" the separation layer created by the Instruction Set!


Static Scheduling: pipeline hypothesis

• 5-stage standard pipeline
• Pipelined Functional Units (or replicated as many times as the operation latency)
  - goal: to launch one operation of the given type per cycle
• No structural hazards

Producing Instr.    Consumer Instr.            CNSL
FP ALU op           Another FP ALU op          3
FP ALU op           Store double               2
Load double         FP ALU op / INT ALU op     1
Branch              ---                        1
INT ALU op          Branch                     1
Load double         Store double               0
INT ALU op          INT ALU op                 0

CNSL = No-Stall Latency Cycles


Example

for (k = 1000; k > 0; --k) x[k] = a + x[k];

• "Parallel Loop": iterations are independent
• Assembly translation:

Loop: L.D    F0, 0(R1)     ;F0=array elem.
      ADD.D  F4,F0,F2      ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
      SUBI   R1,R1,#8      ;decrement pointer
      BNEZ   R1, Loop      ;branch

R1 is initially 8000


Version-0: NON-SCHEDULED loop

                            C_issue (clock cycles)
Loop: L.D    F0, 0(R1)      1
      Stall                 2
      ADD.D  F4,F0,F2       3
      Stall                 4
      Stall                 5
      S.D    F4, 0(R1)      6
      SUBI   R1,R1,#8       7
      Stall                 8
      BNEZ   R1, Loop       9
      Stall                 10

10 cycles per iteration (5 stall cycles). Rewrite the code to reduce the stalls.




Version-1: Scheduled loop

                            C_issue (clock cycles)
Loop: L.D    F0, 0(R1)      1
      SUBI   R1,R1,#8       2
      ADD.D  F4,F0,F2       3
      Stall                 4
      BNEZ   R1, Loop       5    ;delayed branch
      S.D    8(R1),F4       6    ;altered effective address

i) Anticipate the SUBI, but need to change the offset of the store: 0(.) becomes 8(.)

ii) Move S.D after BNEZ by exploiting the branch delay slot

6 cycles per iteration (1 stall cycle); observation: only 3 instructions really operate on the vector x[.] (L.D, ADD.D, S.D)

It's possible to unroll the loop 4 times, to expose more ILP for the static scheduling




Version-2: Unrolling 4 times the Loop

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4      ; drop SUBI&BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4     ; drop SUBI&BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

This loop requires 28 cycles per iteration, 14 of which are stall cycles:
- each L.D has 1 stall cycle
- each ADD.D has 2 stall cycles
- the SUBI has 1
- the BNEZ has 1
Adding the 14 cycles of instruction issue gives 28 cycles, or 28/4 = 7 cycles to process each element of the array (slower than the scheduled version-1). Rewrite the loop to reduce the stalls.

Hypothesis: the total number of iterations is a multiple of 4. OK in our case (1000 iterations).
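At source level, the same transformation can be sketched as follows. This is only an illustration in Python of what the unrolled assembly computes; the function names are hypothetical, and the multiple-of-4 hypothesis is made explicit as a check.

```python
def add_scalar(x, a):
    """Reference loop: x[k] = a + x[k] for every element."""
    for k in range(len(x)):
        x[k] = a + x[k]


def add_scalar_unrolled4(x, a):
    """Same computation with the loop body replicated 4 times per iteration."""
    n = len(x)
    if n % 4 != 0:
        # The slide's hypothesis; a general compiler would add a cleanup loop.
        raise ValueError("iteration count must be a multiple of 4")
    for k in range(0, n, 4):
        # One index update and one loop test now serve four elements.
        x[k] = a + x[k]
        x[k + 1] = a + x[k + 1]
        x[k + 2] = a + x[k + 2]
        x[k + 3] = a + x[k + 3]
```

Unrolling removes three out of every four index updates and branches, and the four independent bodies give the scheduler more instructions to interleave.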


Removing the "Name Dependencies"

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4      ; drop SUBI&BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4     ; drop SUBI&BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

How to eliminate them?


Using STATIC register renaming

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8      ; drop SUBI&BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12    ; drop SUBI&BNEZ
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

Register Renaming


Version-3: Unrolled Loop with fewer stalls

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    -8(R1),F8
      SUBI   R1,R1,#32
      S.D    16(R1),F12     ; -16+32 = 16
      BNEZ   R1,Loop
      S.D    8(R1),F16      ; -24+32 = 8, in the branch delay slot

This loop runs in 14 cycles per iteration (no stalls),

or 14/4 = 3.5 cycles for each element!
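The cycle counts quoted for the different versions can be checked with a small issue-time model built from the no-stall latency table. It is a sketch under simplifying assumptions: single in-order issue, only true (RAW) dependences, producer/consumer pairs missing from the table treated as latency 0, and one extra cycle when the iteration ends on a branch with an empty delay slot. The tuple encoding of the instructions is invented for the example.

```python
# No-stall latencies (producer kind, consumer kind) from the CNSL table.
CNSL = {
    ("fp", "fp"): 3,
    ("fp", "store"): 2,
    ("load", "fp"): 1,
    ("load", "int"): 1,
    ("int", "branch"): 1,
}
BRANCH_DELAY = 1  # cycle lost after a branch whose delay slot is not filled


def loop_cycles(code):
    """Cycles per iteration for a list of (kind, dest, sources) tuples."""
    produced = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in code:
        earliest = cycle + 1
        for reg in sources:
            if reg in produced:
                p_cycle, p_kind = produced[reg]
                earliest = max(earliest, p_cycle + 1 + CNSL.get((p_kind, kind), 0))
        cycle = earliest
        if dest is not None:
            produced[dest] = (cycle, kind)
    if code and code[-1][0] == "branch":
        cycle += BRANCH_DELAY
    return cycle


version0 = [
    ("load", "F0", ["R1"]),
    ("fp", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("int", "R1", ["R1"]),
    ("branch", None, ["R1"]),
]

version3 = (
    [("load", f, ["R1"]) for f in ("F0", "F6", "F10", "F14")]
    + [("fp", d, [s, "F2"]) for d, s in
       (("F4", "F0"), ("F8", "F6"), ("F12", "F10"), ("F16", "F14"))]
    + [
        ("store", None, ["F4", "R1"]),
        ("store", None, ["F8", "R1"]),
        ("int", "R1", ["R1"]),
        ("store", None, ["F12", "R1"]),
        ("branch", None, ["R1"]),
        ("store", None, ["F16", "R1"]),   # fills the branch delay slot
    ]
)
```

Under these assumptions `loop_cycles(version0)` gives 10 and `loop_cycles(version3)` gives 14, matching the slides; the model ignores name dependences, which is why it applies to the renamed code.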


Assumptions that make this possible:
- move L.Ds before S.Ds
- move S.D after SUBI and BNEZ
- use different registers

When is it safe for the compiler to make such changes?

Steps Compiler Performed to Unroll
• Determine that it is OK to move the S.D after SUBI and BNEZ, and find the amount by which to adjust the S.D offset
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and adjust the loop termination and iteration code
• Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent
  • requires analyzing memory addresses and finding that they do not refer to the same address
• Schedule the code, preserving any dependences needed to yield the same result as the original code
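The last step (same result as the original code) can be checked functionally. The sketch below is a hand translation of the original loop and of the Version-3 schedule into Python; the list layout (`x[k]` is the double at byte address `8*k`, `x[0]` unused), the function names, and the scalar `s` standing for F2 are assumptions made for illustration.

```python
def original_loop(x, s):
    """One element per iteration: L.D, ADD.D, S.D, SUBI, BNEZ."""
    x = list(x)
    r1 = 8 * (len(x) - 1)              # R1 starts at the highest address
    while r1 != 0:
        x[r1 // 8] = x[r1 // 8] + s    # L.D F0,0(R1); ADD.D F4,F0,F2; S.D 0(R1),F4
        r1 -= 8                        # SUBI R1,R1,#8
    return x

def version3_loop(x, s):
    """Four elements per iteration, scheduled as in Version-3."""
    x = list(x)
    assert (len(x) - 1) % 4 == 0, "iteration count must be a multiple of 4"
    r1 = 8 * (len(x) - 1)
    while r1 != 0:
        f0 = x[r1 // 8]                # the four loads come first
        f6 = x[(r1 - 8) // 8]
        f10 = x[(r1 - 16) // 8]
        f14 = x[(r1 - 24) // 8]
        f4, f8, f12, f16 = f0 + s, f6 + s, f10 + s, f14 + s
        x[r1 // 8] = f4                # S.D 0(R1),F4
        x[(r1 - 8) // 8] = f8          # S.D -8(R1),F8
        r1 -= 32                       # SUBI R1,R1,#32
        x[(r1 + 16) // 8] = f12        # S.D 16(R1),F12 (offset adjusted by +32)
        x[(r1 + 8) // 8] = f16         # S.D 8(R1),F16, in the branch delay slot
    return x

data = [0.0] + [float(k) for k in range(1, 9)]   # 8 elements at addresses 8..64
print(version3_loop(data, 2.5) == original_loop(data, 2.5))
```

The adjusted offsets 16 and 8 are what keep the two stores that follow SUBI pointing at the same addresses as -16 and -24 did before the decrement.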


VLIW PROCESSORS


“Multiple Issue” PROCESSORS
• Motivation: overcome the limitation of standard pipelines, where
  $CPI_{REAL\,Pipeline} = CPI_{IDEAL\,Pipeline} + C_{Structural\,stalls} + C_{RAW\,stalls} + C_{WAR\,stalls} + C_{WAW\,stalls} + C_{Control\,stalls}$
• Action point: (assuming the $C_{stalls}$ are 0) go below the $CPI_{IDEAL\,Pipeline}$ (= 1) by issuing more instructions per cycle
• Multiple Issue processor implementations:
  • Superscalar
    - Exploits static scheduling (using the compiler techniques examined so far)
    - Exploits dynamic scheduling (using, e.g., Tomasulo’s algorithm)
  • VLIW (Very Long Instruction Word)
    - Uses only static scheduling! (saves hardware and power)
    - EVERY INSTRUCTION GROUPS OPERATIONS THAT GO IN PARALLEL (EXPLICIT PARALLELISM, or EPIC: Explicitly Parallel Instruction Computing)

Static Scheduled Superscalar MIPS
• Superscalar MIPS: issues 2 instructions per cycle
  • 1 instruction goes in the Floating Point pipeline
  • 1 instruction goes in the Integer pipeline (or Load/Store and Branch)
• At every clock cycle, 2 instructions are fetched
• The second instruction can be issued only if it goes in a different pipe
  - If an instruction stalls, then the fetch must be blocked

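[Figure: pipeline timing diagram, instructions (Instr.) versus time (Time [clocks], ticks at 5 and 10). One integer (I) and one floating-point (FP) instruction enter the pipeline together in each clock; the stage labels are F, D, X, M, W.]

Note: the FP operations may have a longer X stage!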


Version-4: Superscalar Loop Unrolling

Cycle | Integer Instr.         | FP Instr.
1     | Loop: L.D F0,0(R1)     |
2     | L.D F6,-8(R1)          |
3     | L.D F10,-16(R1)        | ADD.D F4,F0,F2
4     | L.D F14,-24(R1)        | ADD.D F8,F6,F2
5     | L.D F18,-32(R1)        | ADD.D F12,F10,F2
6     | S.D 0(R1),F4           | ADD.D F16,F14,F2
7     | S.D -8(R1),F8          | ADD.D F20,F18,F2
8     | S.D -16(R1),F12        |
9     | SUBI R1,R1,#40         |
10    | S.D 16(R1),F16         |
11    | BNEZ R1,Loop           |
12    | S.D 8(R1),F20          |


Unrolled 5 times to avoid delays.

This loop will run in 12 cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array.

Initial Multiple Issue Processors
• Superscalar
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
    - The number of instructions scheduled per cycle (also called ‘ways’) ranges between 1 and 8
    - Static scheduling (compiler) or dynamic (e.g., Tomasulo)
• (Very) Long Instruction Words (V)LIW
    - Crusoe VLIW processor [www.transmeta.com]
    - Intel Architecture-64 (IA-64) 64-bit address
    - Majority of DSPs (Digital Signal Processors): Texas Instruments C6000, PowerPC ... , ST Microelectronics ...
    - The number of operations in a SINGLE instruction (also called “bundle”) varies from 4 to 16

Note: since we want CPI < 1, it is handier to use the IPC metric (Instructions Per Cycle)
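As a quick worked example of the two metrics, the sketch below applies them to the loop versions seen so far. The instruction counts (14 for the scheduled, 4-times-unrolled loop; 17 for the superscalar, 5-times-unrolled loop) are counted from the listings; the function names are only illustrative.

```python
def cpi(cycles, instructions):
    """Cycles Per Instruction."""
    return cycles / instructions

def ipc(cycles, instructions):
    """Instructions Per Cycle, the reciprocal of CPI."""
    return instructions / cycles

# Version-3: 14 instructions in 14 cycles on a single-issue pipeline.
v3_cpi = cpi(14, 14)

# Version-4: 5 L.D + 5 ADD.D + 5 S.D + SUBI + BNEZ = 17 instructions in 12 cycles.
v4_cpi = cpi(12, 17)
v4_ipc = ipc(12, 17)

print(f"V-3: CPI = {v3_cpi:.2f}")
print(f"V-4: CPI = {v4_cpi:.2f}, IPC = {v4_ipc:.2f}")
```

A single-issue pipeline cannot go below CPI = 1, whereas the 2-way superscalar schedule does; its IPC stays well under the peak of 2 because several FP slots are empty.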


VLIW Implementation

[Figure: VLIW implementation. An I-fetch & issue unit drives several functional units and memory ports (to memory / from memory), all connected to a multi-ported register file.]


VLIW principles

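• VLIWs directly use multiple independent functional units
• VLIWs package the multiple operations into one very long instruction
• The compiler is responsible for choosing the instructions to be issued simultaneously

[Figure: timing diagram, instructions (Instr.) versus time (Time [clocks]). Instructions I_i and I_i+1 each go through IF, ID, an execute stage drawn as three E boxes, and W.]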


Version-5: VLIW Loop Unrolling

Clock | Mem. Ref. 1       | Mem. Ref. 2       | FP1                | FP2                | Int/Branch
1     | L.D F2,0(R1)      | L.D F6,-8(R1)     |                    |                    |
2     | L.D F10,-16(R1)   | L.D F14,-24(R1)   |                    |                    |
3     | L.D F18,-32(R1)   | L.D F22,-40(R1)   | ADD.D F4,F0,F2     | ADD.D F8,F0,F6     |
4     | L.D F26,-48(R1)   |                   | ADD.D F12,F0,F10   | ADD.D F16,F0,F14   |
5     |                   |                   | ADD.D F20,F0,F18   | ADD.D F24,F0,F22   |
6     | S.D 0(R1),F4      | S.D -8(R1),F8     | ADD.D F28,F0,F26   |                    | SUBI R1,R1,#56
7     | S.D 40(R1),F12    | S.D 32(R1),F16    |                    |                    |
8     | S.D 24(R1),F20    | S.D 16(R1),F24    |                    |                    | BNEZ R1,Loop
9     | S.D 8(R1),F28     |                   |                    |                    |


Unrolled 7 times to avoid delays: 7 results in 9 clocks, or 1.3 clocks per element (1.8X faster than the 2.4 clocks of Version-4)

Average: 2.5 ops per clock, 50% efficiency

Note: VLIW needs more registers (15 vs. 11 in the superscalar version)
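The averages quoted for Version-5 follow from counting the filled slots of the schedule. The sketch below encodes only the opcode found in each of the five slots of the nine long instructions (`None` marks an empty slot); the variable names are illustrative.

```python
# Slots: Mem. Ref. 1, Mem. Ref. 2, FP1, FP2, Int/Branch
N = None
schedule = [
    ["L.D", "L.D", N,       N,       N],
    ["L.D", "L.D", N,       N,       N],
    ["L.D", "L.D", "ADD.D", "ADD.D", N],
    ["L.D", N,     "ADD.D", "ADD.D", N],
    [N,     N,     "ADD.D", "ADD.D", N],
    ["S.D", "S.D", "ADD.D", N,       "SUBI"],
    ["S.D", "S.D", N,       N,       N],
    ["S.D", "S.D", N,       N,       "BNEZ"],
    ["S.D", N,     N,       N,       N],
]

clocks = len(schedule)
slots = sum(len(row) for row in schedule)
ops = sum(1 for row in schedule for op in row if op is not None)
elements = sum(row.count("S.D") for row in schedule)  # one store per array element

ops_per_clock = ops / clocks
efficiency = ops / slots
clocks_per_element = clocks / elements

print(f"{ops} ops in {clocks} clocks: {ops_per_clock:.2f} ops/clock, "
      f"{efficiency:.0%} of the slots used, {clocks_per_element:.2f} clocks/element")
```

About half of the issue slots stay empty, which is the price paid in instruction space for the simpler decode and issue logic.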

“Multiple Issue” processor challenges
• While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
• If more instructions issue at the same time, decode and issue become more difficult
  • Even 2-scalar (2 ops per cycle) => examine 2 opcodes, 6 registers, and decide if 1 or 2 instructions can issue
• VLIW: trade off instruction space for simple decoding
  • The long instruction word has room for many operations
  • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  • Need a compiling technique that schedules across several branches


Loop-Carried Dependence

• Let’s consider:

    for (i=0; i<8; i=i+1) {
        A = A + C[i];   /* S1 */
    }

• We can exploit the “natural” parallelism that stems from the associative operator (assuming A is initially 0):

    “Cycle 1”: temp0 = C[0] + C[1];
               temp1 = C[2] + C[3];
               temp2 = C[4] + C[5];
               temp3 = C[6] + C[7];
    “Cycle 2”: temp4 = temp0 + temp1;
               temp5 = temp2 + temp3;
    “Cycle 3”: A = temp4 + temp5;
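The same regrouping can be sketched for any number of elements. The code below compares the sequential accumulation with a pairwise (tree) reduction and counts how many parallel steps the tree needs; the function names are illustrative, and integers are used because floating-point addition is not exactly associative, so a compiler may apply this transformation to FP code only when reordering is allowed.

```python
def sequential_sum(c, a=0):
    """The original loop: every addition depends on the previous one."""
    for value in c:
        a = a + value
    return a

def tree_sum(c):
    """Pairwise reduction: all additions within one step are independent."""
    level = list(c)
    steps = 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element is carried to the next step
            pairs.append(level[-1])
        level = pairs
        steps += 1
    return level[0], steps

c = [3, 1, 4, 1, 5, 9, 2, 6]
total, steps = tree_sum(c)
print(total, steps, sequential_sum(c))
```

For 8 elements the dependence chain shrinks from 8 additions to 3 steps, provided enough functional units are available to execute each step in parallel.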


The iterations may NOT be easily unrollable!
• Let’s consider the following code (A, B, C are distinct and non-overlapped in memory):

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
    }

• We have the following dependencies:
  1) S2 uses the value A[i+1], computed by S1 in the same iteration
  2) S1 uses the value A[i], computed by S1 itself in an earlier iteration (when it was named A[i+1])
  3) The same dependence holds in S2 for B[i], computed in the previous iteration by S2 itself (when it was named B[i+1])
• Even if we unroll, the dependencies create stalls


Method to reduce the loop-carried dependencies

• Let’s consider this other code:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];   /* S2 */
    }

• Let’s transform it by overlapping part of the iterations:

    A[1] = A[1] + B[1];
    for (i=1; i<100; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

• Now it’s possible to unroll without loop-carried dependencies
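A direct way to convince oneself that the transformed loop is equivalent is to run both forms on the same data. The sketch below is a Python transcription of the two C fragments; the array contents are arbitrary test values and the function names are illustrative.

```python
def original(a, b, c, d):
    a, b = list(a), list(b)
    for i in range(1, 101):           # i = 1 .. 100
        a[i] = a[i] + b[i]            # S1: uses B[i] written by S2 one iteration earlier
        b[i + 1] = c[i] + d[i]        # S2
    return a, b

def transformed(a, b, c, d):
    a, b = list(a), list(b)
    a[1] = a[1] + b[1]                # first S1, peeled off before the loop
    for i in range(1, 100):           # i = 1 .. 99
        b[i + 1] = c[i] + d[i]        # S2 of iteration i
        a[i + 1] = a[i + 1] + b[i + 1]  # S1 of iteration i+1, same-iteration dependence only
    b[101] = c[100] + d[100]          # last S2, peeled off after the loop
    return a, b

n = 102                               # indices 0 .. 101
a = [k for k in range(n)]
b = [2 * k for k in range(n)]
c = [3 * k + 1 for k in range(n)]
d = [k * k for k in range(n)]

print(original(a, b, c, d) == transformed(a, b, c, d))
```

After the transformation the only dependence left (S2 to S1 on B[i+1]) lies inside a single iteration, so different iterations can be overlapped freely.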


Another possibility: Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop
  • Some call this a sort of “Software Tomasulo”

[Figure: iterations 0 to 3 of the original loop drawn side by side; one software-pipelined iteration spans across them.]


Software Pipelining benefits
• Increases the distance between the production and the consumption of a result
• Uses more compact code than unrolling
• Only one “transitory” filling and draining of the pipeline (compared to unrolling, where it happens at each iteration)

[Figure: number of overlapped ops versus time, for the SW pipeline and for the unrolled loop.]


Version-6: Software Pipelining

V-2: 4-times unrolling

     1  L.D   F0,0(R1)
     2  ADD.D F4,F0,F2
     3  S.D   0(R1),F4
     4  L.D   F0,-8(R1)
     5  ADD.D F4,F0,F2
     6  S.D   -8(R1),F4
     7  L.D   F0,-16(R1)
     8  ADD.D F4,F0,F2
     9  S.D   -16(R1),F4
    10  L.D   F0,-24(R1)
    11  ADD.D F4,F0,F2
    12  S.D   -24(R1),F4
    13  SUBI  R1,R1,#32
    14  BNEZ  R1,LOOP

Software-pipelined version (the numbers refer to the V-2 instructions the loop body is taken from)

Start-up code:
        L.D   F0,0(R1)
        ADD.D F4,F0,F2
        L.D   F0,-8(R1)
        SUBI  R1,R1,#16
LOOP:
     3  S.D   16(R1),F4    ; Store X[i]
    13  SUBI  R1,R1,#8
     5  ADD.D F4,F0,F2     ; Add X[i-1]
    14  BNEZ  R1,LOOP
     7  L.D   F0,8(R1)     ; Load X[i-2] (branch delay slot)
Wind-down code:
        S.D   16(R1),F4
        ADD.D F4,F0,F2
        S.D   8(R1),F4

5 cycles per iteration and per element
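The software-pipelined loop can be checked against the original one-element-per-iteration loop with a small functional simulation. This is a sketch under stated assumptions: the loop body is taken to execute in the order S.D, SUBI, ADD.D, BNEZ with the L.D in the branch delay slot; the four instructions before the loop and the three after it are treated as start-up and wind-down code; `x[k]` models the double at byte address `8*k` (`x[0]` unused); at least 3 elements are required. All Python names are illustrative.

```python
def original_loop(x, s):
    x = list(x)
    r1 = 8 * (len(x) - 1)
    while True:
        f0 = x[r1 // 8]            # L.D   F0,0(R1)
        f4 = f0 + s                # ADD.D F4,F0,F2
        x[r1 // 8] = f4            # S.D   0(R1),F4
        r1 -= 8                    # SUBI  R1,R1,#8
        if r1 == 0:                # BNEZ  R1,LOOP
            break
    return x

def software_pipelined(x, s):
    x = list(x)
    assert len(x) - 1 >= 3, "needs at least 3 elements"
    r1 = 8 * (len(x) - 1)
    # Start-up code
    f0 = x[r1 // 8]                # L.D   F0,0(R1)
    f4 = f0 + s                    # ADD.D F4,F0,F2
    f0 = x[(r1 - 8) // 8]          # L.D   F0,-8(R1)
    r1 -= 16                       # SUBI  R1,R1,#16
    while True:
        x[(r1 + 16) // 8] = f4     # S.D   16(R1),F4  store X[i]
        r1 -= 8                    # SUBI  R1,R1,#8
        f4 = f0 + s                # ADD.D F4,F0,F2   add X[i-1]
        taken = r1 != 0            # BNEZ  R1,LOOP
        f0 = x[(r1 + 8) // 8]      # L.D   F0,8(R1)   load X[i-2], delay slot
        if not taken:
            break
    # Wind-down code
    x[(r1 + 16) // 8] = f4         # S.D   16(R1),F4
    f4 = f0 + s                    # ADD.D F4,F0,F2
    x[(r1 + 8) // 8] = f4          # S.D   8(R1),F4
    return x

data = [0.0] + [float(k) for k in range(1, 7)]   # 6 elements at addresses 8..48
print(software_pipelined(data, 2.0) == original_loop(data, 2.0))
```

Each trip through the loop stores one element, adds the next, and loads the one after that, so the load, the add and the store of the same element end up in three different iterations.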


Summary Table

VERSION | CYCLES PER ITERATION | CYCLES PER ELEMENT | SPEEDUP | COMMENTS
V-0     | 10                   | 10                 | 1       | Basic Version
V-1     | 6                    | 6                  | 1.67    | Instr. Scheduling
V-2     | 28                   | 7                  | 1.43    | Loop Unrolling (LU) x4
V-2a    | 44                   | 6.29               | 1.59    | Loop Unrolling (LU) x7
V-3     | 14                   | 3.5                | 2.86    | Instr. Sch. + LUx4
V-3a    | 23                   | 3.29               | 3.04    | Instr. Sch. + LUx7
V-4     | 12                   | 2.4                | 4.17    | Superscalarx2 (LUx5)
V-4a    | 16                   | 2.29               | 4.37    | Superscalarx2 (LUx7)
V-5     | 9                    | 1.3                | 7.69    | VLIWx5 (LUx7)
V-6     | 5                    | 5                  | 2       | Software Pipelining
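The last two numeric columns are derived from the first one: cycles per element is cycles per iteration divided by the number of elements each iteration processes (the unrolling factor), and the speedup is measured against the 10 cycles per element of V-0. The sketch below recomputes a few rows; the dictionary and function names are illustrative, and the results may differ from the table in the last digit because of rounding.

```python
BASELINE_CYCLES_PER_ELEMENT = 10   # V-0

def summarize(cycles_per_iteration, elements_per_iteration):
    per_element = cycles_per_iteration / elements_per_iteration
    speedup = BASELINE_CYCLES_PER_ELEMENT / per_element
    return per_element, speedup

# version: (cycles per iteration, elements per iteration)
versions = {
    "V-1": (6, 1),
    "V-2": (28, 4),
    "V-3": (14, 4),
    "V-4": (12, 5),
    "V-6": (5, 1),
}

for name, (cycles, elements) in versions.items():
    per_element, speedup = summarize(cycles, elements)
    print(f"{name}: {per_element:.2f} cycles/element, speedup {speedup:.2f}")
```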

